2.3.3. Streaming stdout & stderr with tail¶
When submitting a job in HTCondor (and any other computing facilities), often your job will be run on another node at a later time. If you eager to look at the output (stdout & stderr) as soon as it is running, HTCondor provided a facility to do that together with some UNIX utilities.
Firstly, HTCondor has a facility to stream the stdout & stderr from the worker nodes back to the submit node you are working on. To use a specific example from Vanilla universe,
...
output = hello_world.out
error = hello_world.err
log = hello_world.log
stream_output = True
stream_error = True
...
The stream_output
& stream_error
instructs the job to stream the stdout & stderr back to your submit node in real time (which would normally be transferred back only after the job terminates).
If we submit the job and run the tail
command at once, like this
condor_submit example.ini; tail -F hello_world.log hello_world.out hello_world.err
Then the UNIX command tail
would follow the files listed (which are the output
, error
and log
specified in your ClassAd) as soon as new contents are available.
2.3.3.1. Detailed explanations¶
As an example, the output would looks something like
$ condor_submit example.ini; tail -F hello_world.log hello_world.out hello_world.err
Submitting job(s).
1 job(s) submitted to cluster 511.
which is the stdout from condor_submit example.ini
. Then
==> hello_world.log <==
is the tail command working immediately to follow contents of hello_world.log
, with the following contents:
000 (511.000.000) 2023-08-29 23:39:35 Job submitted from host: <195.194.109.199:9618?addrs=195.194.109.199-9618+[2001-630-22-d0ff-5054-ff-fe9a-b662]-9618&alias=vm77.tier2.hep.manchester.ac.uk&noUDP&sock=schedd_2377818_f2b3>
...
Then
tail: cannot open ‘hello_world.out’ for reading: No such file or directory
tail: cannot open ‘hello_world.err’ for reading: No such file or directory
is tail
telling us that hello_world.out
& hello_world.err
does not exist yet, as the job hasn’t started. tail
will follow them as soon as they are available. Then hello_world.log
continues to have more content, indicating its progress:
==> hello_world.log <==
040 (511.000.000) 2023-08-29 23:39:35 Started transferring input files
Transferring to host: <195.194.109.209:9618?addrs=195.194.109.209-9618+[2001-630-22-d0ff-5054-ff-fee9-c3d]-9618&alias=vm75.in.tier2.hep.manchester.ac.uk&noUDP&sock=slot1_4_1883_7e66_41406>
...
040 (511.000.000) 2023-08-29 23:39:35 Finished transferring input files
...
Then
tail: ‘hello_world.out’ has appeared; following end of new file
tail: ‘hello_world.err’ has appeared; following end of new file
tells us that these files finally appeared (as the job has started). Then
001 (511.000.000) 2023-08-29 23:39:36 Job executing on host: <195.194.109.209:9618?addrs=195.194.109.209-9618+[2001-630-22-d0ff-5054-ff-fee9-c3d]-9618&alias=vm75.in.tier2.hep.manchester.ac.uk&noUDP&sock=startd_1389_5123>
...
continues to show more log from hello_world.log
. This part
==> hello_world.out <==
hello world
Is the content of hello_world.out
as soon as it appears, where in the end it has the following log:
==> hello_world.log <==
006 (511.000.000) 2023-08-29 23:39:36 Image size of job updated: 35
0 - MemoryUsage of job (MB)
0 - ResidentSetSize of job (KB)
...
005 (511.000.000) 2023-08-29 23:39:36 Job terminated.
(1) Normal termination (return value 0)
Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
0 - Run Bytes Sent By Job
33088 - Run Bytes Received By Job
0 - Total Bytes Sent By Job
33088 - Total Bytes Received By Job
Partitionable Resources : Usage Request Allocated
Cpus : 1 1
Disk (KB) : 44 35 832179
Memory (MB) : 0 1 100
Job terminated of its own accord at 2023-08-29T22:39:36Z.
...
You will notice that the tail
process has never ended, as if it is halting. The reason is that you are not looking at the output of the job itself, but monitoring the streaming output from the job via tail
. As far as tail
is concerned, it will continue to monitor (follow) any new contents from these 3 files and print it on your screen.
From the content itself, you see Job terminated of its own accord...
meaning that your job has ended, and you should now press Ctrl + c
to terminate the tail
command.
You can also checkout Monitor your jobs to see how to monitor the status of your job, and from it you can tell this job has indeed ended.