SNS Condor Project

Condor is software that enables high-throughput computing on collections of distributively owned computing resources. We run Condor on the Linux workstations in user offices. You can use any of these machines to submit jobs to a separate central management machine, which doesn't itself run jobs but coordinates the queuing process so that jobs run on other machines in the pool. Running jobs are easier to manage if you always submit them from the same machine, so I suggest that Linux users submit from their own desktop. Windows users, please contact SNS help and we'll tell you which machine to use.

The Condor manual is available at /home/condor/condor-V6_6_10-Manual.pdf. Section 2 of the manual covers job submission in detail.

Parallel processing across machines is not possible. Such jobs can neither suspend (stop running when the machine becomes busy) nor checkpoint (stop running when the machine becomes busy and migrate to another machine), so running jobs in parallel would require a dedicated resource.

Introduction And "Test Run"

If you'd like to use Condor, please run the following test first to verify that Condor can work for you (you should receive 5 emails within a couple of hours, one as each of the 5 small jobs completes):

First, set up these variables. Under tcsh:

% setenv PATH ${PATH}:/home/condor/condor/bin
% setenv CONDOR_CONFIG /home/condor/condor/etc/condor_config

or, assuming bash or sh:

% export PATH=${PATH}:/home/condor/condor/bin
% export CONDOR_CONFIG=/home/condor/condor/etc/condor_config
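
Before going further, it can help to confirm that the two settings above took effect. The following is a minimal sketch for bash or sh, not part of Condor: check_condor_env is a hypothetical helper name, and it only checks that CONDOR_CONFIG points at a readable file and that the Condor commands used on this page are on your PATH.

```shell
# Sketch: sanity-check the Condor environment before submitting anything.
# check_condor_env is a hypothetical helper, not a Condor command.
check_condor_env() {
    status=0
    # CONDOR_CONFIG must be set and point at a readable file.
    if [ -z "${CONDOR_CONFIG:-}" ] || [ ! -r "$CONDOR_CONFIG" ]; then
        echo "CONDOR_CONFIG is unset or not readable"
        status=1
    fi
    # The commands used on this page must be found on PATH.
    for tool in condor_compile condor_submit condor_q; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool not found on PATH"
            status=1
        fi
    done
    return $status
}
```

Run check_condor_env after setting the variables; it prints nothing and returns success when everything is in place.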

Now execute these commands:

mkdir /home/username/condor
cd /home/username/condor
cp /home/condor/condor/examples/loop.c .
cp /home/condor/condor/examples/loop.cmd .
condor_compile gcc loop.c -o loop.remote
condor_submit loop.cmd

Note: Compilation has to be preceded by condor_compile to link binaries for Condor in the standard universe. (Don't worry about the mkstemp warning.)

condor_submit should then report:

Submitting job(s).....
Logging submit event(s).....
5 job(s) submitted to cluster 1.

If the above test works, feel free to submit your own jobs, but please submit just one job at a time until you get email/output back and know things are working.

Commands

Viewing The Queue

condor_q enables you to view your queue to see what's running:

-- Submitter: pr178.sns.ias.edu : <172.16.31.178:49998> : pr178.sns.ias.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
   1.0   jns             5/9  10:00   0+00:00:00 I  0   3.4  loop.remote 200   
   1.1   jns             5/9  10:00   0+00:00:00 I  0   3.4  loop.remote 200   
   1.2   jns             5/9  10:00   0+00:00:00 I  0   3.4  loop.remote 300   
   1.3   jns             5/9  10:00   0+00:00:00 I  0   3.4  loop.remote 300   
   1.4   jns             5/9  10:00   0+00:00:00 I  0   3.4  loop.remote 500   
5 jobs; 5 idle, 0 running, 0 held

Initially, no jobs are running. A short while later, one job is running on each client machine:

bash-2.05a$ condor_q
-- Submitter: pr184.sns.ias.edu : <172.16.31.184:49989> : pr184.sns.ias.edu
 ID      OWNER            SUBMITTED     RUN_TIME ST PRI SIZE CMD               
  25.0   jns             5/9  13:11   0+00:00:44 R  0   4.1  loop.remote 200   
  25.1   jns             5/9  13:11   0+00:00:42 R  0   4.1  loop.remote 200   
  25.2   jns             5/9  13:11   0+00:00:40 R  0   4.1  loop.remote 300   
  25.3   jns             5/9  13:11   0+00:00:38 R  0   4.1  loop.remote 300   
  25.4   jns             5/9  13:11   0+00:00:00 I  0   3.4  loop.remote 500   

5 jobs; 1 idle, 4 running, 0 held

If you are interested, use condor_q -global to see a list of all submitted jobs.
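
With many jobs queued, a per-state tally can be easier to read than the full listing. The sketch below is an illustration only: count_job_states is a hypothetical helper, and it assumes the column layout shown above, where ST is the sixth whitespace-separated field because SUBMITTED takes up two fields (date and time).

```shell
# Sketch: tally jobs by state from condor_q output read on stdin.
# Assumes the column layout shown above (ST is field 6).
count_job_states() {
    # Only lines whose first field looks like cluster.proc are job lines;
    # the header and the summary line are skipped.
    awk '$1 ~ /^[0-9]+\.[0-9]+$/ { n[$6]++ }
         END { for (s in n) print s, n[s] }' | sort
}
```

Use it as condor_q | count_job_states. For the second listing above it prints "I 1" and "R 4".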

Obtaining Help

To obtain help on a command, use command -h, e.g.:

[root@artemis docs]# condor_q -h
Usage: condor_q [options]
     where [options] are
             -global                 Get global queue
             -submitter (submitter)  Get queue of specific submitter
             -help                   This screen
             -name (name)            Name of schedd
             -pool (host)            Use host as the central manager to query
             -long                   Verbose output
             -format (fmt) (attr)    Print attribute attr using format fmt
             -analyze                Perform schedulability analysis on jobs
             -run                    Get information about running jobs
             -hold                   Get information about jobs placed on hold
             -goodput                Display job goodput statistics
             -cputime                Display CPU_TIME instead of RUN_TIME
             -currentrun             Display times only for current run
             -io                     Show information regarding I/O
             -dag                    Sort DAG jobs under their DAGMan
             -expert                 Display shorter error messages
             restriction list
        where each restriction may be one of
             (cluster)               Get information about specific cluster
             (cluster).(proc)        Get information about specific job
             (owner)                 Information about jobs owned by (owner)
             -constraint (expr)      Add constraint on classads

Removing Jobs

condor_rm removes jobs. To remove a single job:

bash-2.05a$ condor_rm 25.1

To remove all the 25.x jobs:

bash-2.05a$ condor_rm 25

Viewing The Pool

Use condor_status to view the status of the pool (at this stage the pool contained 4 machines; now there are 40):

bash-2.05a$ condor_status

Name          OpSys       Arch   State      Activity   LoadAv Mem   ActvtyTime

pr175.sns.ias LINUX       INTEL  Claimed    Busy       1.220   501  0+00:00:04
pr178.sns.ias LINUX       INTEL  Claimed    Busy       0.010   501  0+00:00:04
pr179.sns.ias LINUX       INTEL  Claimed    Busy       0.410   501  0+00:00:04
pr184.sns.ias LINUX       INTEL  Claimed    Busy       0.020   501  0+00:00:02

                     Machines Owner Claimed Unclaimed Matched Preempting

         INTEL/LINUX        4     0       4         0       0          0

               Total        4     0       4         0       0          0

When you come to run your own jobs, use loop.cmd from the example as a template for your submit script.
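
As a rough guide to what such a submit script contains, here is a minimal sketch for a program linked with condor_compile. It is not a copy of loop.cmd: the file name myjob.cmd, the executable myprog.remote, the argument, and the output file names are all placeholders to replace with your own.

```shell
# Sketch: write a submit description file for a condor_compile'd program.
# myjob.cmd, myprog.remote and the argument 200 are placeholders.
cat > myjob.cmd <<'EOF'
universe     = standard
executable   = myprog.remote
arguments    = 200
output       = myprog.$(Cluster).$(Process).out
error        = myprog.$(Cluster).$(Process).err
log          = myprog.log
notification = Complete
queue
EOF
```

Submit it with condor_submit myjob.cmd. The $(Cluster) and $(Process) macros give each job its own output and error files, and notification = Complete asks for an email when the job finishes.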

Condor's Vanilla Universe

If you'd like to submit jobs that use shell scripts rather than compiled executables as in the example above, you will need to run them in what is known as the vanilla universe. Unfortunately, unlike jobs in the standard universe, such jobs cannot checkpoint (stop running on a busy machine and then resume on another, less busy machine). Read more about each Condor universe on page 15 of the Condor manual.

Add Universe = vanilla to your submit script (loop.cmd in the above example) to run jobs under the vanilla universe.
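
A minimal sketch of a vanilla-universe submit script for a shell script follows. The names myscript.cmd and myscript.sh and the output file names are placeholders, and the sketch assumes the script lives under a directory that every machine in the pool can see.

```shell
# Sketch: write a vanilla-universe submit description file.
# myscript.cmd and myscript.sh are placeholder names.
cat > myscript.cmd <<'EOF'
universe   = vanilla
executable = myscript.sh
output     = myscript.out
error      = myscript.err
log        = myscript.log
queue
EOF
```

Submit it with condor_submit myscript.cmd. No condor_compile step is involved here, since the script is not relinked for checkpointing.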

Things To Watch Out For ...

  1. Because jobs can run on any machine in the pool, they must be able to access their input and output files from any of these machines. This means you must keep these files under /home/username and/or /work/username, and not under, say, /tmp, which is local to each machine.
  2. The condor user needs permission to access any input files.
  3. Always submit jobs from the same machine; it's easier to keep an eye on them with condor_q. If you don't get output/email back from a job, make sure the job isn't still running somewhere, and kill it with condor_rm if it is.
  4. If you don't see job output until completion and wish to check on your job's progress while it is running, adjust Condor's buffering parameters in the submit script, e.g.:
    buffer_size = 1000
    buffer_block_size = 256
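
The first point in the list above is easy to get wrong, so a quick check before submitting may be worthwhile. This is a sketch only: on_shared_fs is a hypothetical helper that tests whether an absolute path starts with /home/ or /work/, the two shared areas named above. It does not follow symlinks or resolve relative paths.

```shell
# Sketch: succeed only if an absolute path is under a shared area.
# on_shared_fs is a hypothetical helper, not a Condor command.
on_shared_fs() {
    case "$1" in
        /home/*|/work/*) return 0 ;;   # visible from every machine in the pool
        *)               return 1 ;;   # e.g. /tmp, local to each machine
    esac
}
```

For example, on_shared_fs /work/username/input.dat succeeds, while on_shared_fs /tmp/input.dat fails.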