1 Einstein Drive
Princeton, NJ 08540, USA
Condor is software that enables high-throughput computing on collections of distributively owned computing resources. We are running Condor on the Linux workstations in user offices. You can use any of these machines to submit jobs to a separate central management machine, which doesn't itself run jobs but coordinates the queuing process to run jobs on other machines in the pool. It's easier to manage running jobs if you always submit them from the same machine, so I suggest that Linux users use their desktop. Windows users, please contact SNS help and we'll let you know which machine to use.
The Condor manual is available at
/home/condor/condor-V6_6_10-Manual.pdf. Section 2 of the
manual covers job submission in detail.
Parallel processing across machines is not possible. Parallel jobs can neither suspend (stop running when the machine becomes busy) nor checkpoint (stop running when the machine becomes busy and migrate to another machine), so running them would require a dedicated resource.
If you'd like to use Condor, please run the following test first to verify that Condor works for you. You should receive 5 emails within a couple of hours, one as each of 5 small jobs completes.
First, set up these environment variables. Under tcsh:
% setenv PATH ${PATH}:/home/condor/condor/bin
% setenv CONDOR_CONFIG /home/condor/condor/etc/condor_config
or, assuming bash or sh:
% export PATH=${PATH}:/home/condor/condor/bin
% export CONDOR_CONFIG=/home/condor/condor/etc/condor_config
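If you use bash or sh and want these settings in every session, the lines below are a minimal sketch of what you might add to a startup file such as ~/.profile (the choice of startup file is an assumption, not a site requirement). The two paths are the ones given above; the variable name CONDOR_BIN is only a helper for this example. The PATH check keeps the directory from being appended again each time the file is sourced.

```shell
# Paths taken from the setup instructions above.
# CONDOR_BIN is a helper name used only in this sketch.
CONDOR_BIN=/home/condor/condor/bin
CONDOR_CONFIG=/home/condor/condor/etc/condor_config
export CONDOR_CONFIG

# Append the Condor bin directory to PATH only if it is not already there,
# so that re-sourcing this file does not keep growing PATH.
case ":${PATH}:" in
  *":${CONDOR_BIN}:"*) ;;
  *) PATH="${PATH}:${CONDOR_BIN}"; export PATH ;;
esac
```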
Now execute these commands:
mkdir /home/username/condor
cd /home/username/condor
cp /home/condor/condor/examples/loop.c .
cp /home/condor/condor/examples/loop.cmd .
condor_compile gcc loop.c -o loop.remote
condor_submit loop.cmd
Note: Compilation has to be preceded by
condor_compile to link binaries for Condor in the standard
universe. (Don't worry about the mstemp warning.)
On success, condor_submit prints:
Submitting job(s).....
Logging submit event(s).....
5 job(s) submitted to cluster 1.
If the above test works, please feel free to submit your own job. Please submit just one job at a time, though, until you get email/output back and know things are working.
condor_q enables you to view your queue to see what's
running:
-- Submitter: pr178.sns.ias.edu : <172.16.31.178:49998> : pr178.sns.ias.edu
 ID    OWNER  SUBMITTED    RUN_TIME    ST  PRI  SIZE  CMD
 1.0   jns    5/9  10:00   0+00:00:00  I   0    3.4   loop.remote 200
 1.1   jns    5/9  10:00   0+00:00:00  I   0    3.4   loop.remote 200
 1.2   jns    5/9  10:00   0+00:00:00  I   0    3.4   loop.remote 300
 1.3   jns    5/9  10:00   0+00:00:00  I   0    3.4   loop.remote 300
 1.4   jns    5/9  10:00   0+00:00:00  I   0    3.4   loop.remote 500
5 jobs; 5 idle, 0 running, 0 held
Initially, no jobs are running. A short while later, one job is running on each client machine:
bash-2.05a$ condor_q
-- Submitter: pr184.sns.ias.edu : <172.16.31.184:49989> : pr184.sns.ias.edu
 ID    OWNER  SUBMITTED    RUN_TIME    ST  PRI  SIZE  CMD
 25.0  jns    5/9  13:11   0+00:00:44  R   0    4.1   loop.remote 200
 25.1  jns    5/9  13:11   0+00:00:42  R   0    4.1   loop.remote 200
 25.2  jns    5/9  13:11   0+00:00:40  R   0    4.1   loop.remote 300
 25.3  jns    5/9  13:11   0+00:00:38  R   0    4.1   loop.remote 300
 25.4  jns    5/9  13:11   0+00:00:00  I   0    3.4   loop.remote 500
5 jobs; 1 idle, 4 running, 0 held
If you are interested, use condor_q -global to see a
list of all submitted jobs.
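When a queue listing gets long (condor_q -global in particular), it can help to count jobs per state. The following is a small sketch rather than a Condor feature: summarize_states is a hypothetical helper that reads condor_q output in the format shown above and assumes the ST column is the sixth whitespace-separated field, as it is in those listings.

```shell
# Hypothetical helper: count jobs per state (ST column) in condor_q output.
# Assumes the column layout shown above, where job lines start with an ID
# of the form cluster.proc and ST is the sixth field.
summarize_states() {
  awk '$1 ~ /^[0-9]+\.[0-9]+$/ { count[$6]++ }
       END { for (s in count) print s, count[s] }' | sort
}
```

Used as condor_q | summarize_states, the second listing above would give "I 1" and "R 4", one per line.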
To obtain help on a command, use command -h, e.g.
[root@artemis docs]# condor_q -h
Usage: condor_q [options]
    where [options] are
        -global                 Get global queue
        -submitter (submitter)  Get queue of specific submitter
        -help                   This screen
        -name                   Name of schedd
        -pool                   Use host as the central manager to query
        -long                   Verbose output
        -format (fmt) (attr)    Print attribute attr using format fmt
        -analyze                Perform schedulability analysis on jobs
        -run                    Get information about running jobs
        -hold                   Get information about jobs placed on hold
        -goodput                Display job goodput statistics
        -cputime                Display CPU_TIME instead of RUN_TIME
        -currentrun             Display times only for current run
        -io                     Show information regarding I/O
        -dag                    Sort DAG jobs under their DAGMan
        -expert                 Display shorter error messages
        restriction list
    where each restriction may be one of
        (cluster)               Get information about specific cluster
        (cluster).(proc)        Get information about specific job
        (owner)                 Information about jobs owned by
        -constraint (expr)      Add constraint on classads
condor_rm is used to remove jobs, e.g.
bash-2.05a$ condor_rm 25.1
or, to remove all the 25.x jobs:
bash-2.05a$ condor_rm 25
Use condor_status to view the status of the pool
(at this stage the pool contained 4 machines; now there are 40):
bash-2.05a$ condor_status
Name           OpSys  Arch   State    Activity  LoadAv  Mem  ActvtyTime
pr175.sns.ias  LINUX  INTEL  Claimed  Busy      1.220   501  0+00:00:04
pr178.sns.ias  LINUX  INTEL  Claimed  Busy      0.010   501  0+00:00:04
pr179.sns.ias  LINUX  INTEL  Claimed  Busy      0.410   501  0+00:00:04
pr184.sns.ias  LINUX  INTEL  Claimed  Busy      0.020   501  0+00:00:02

             Machines  Owner  Claimed  Unclaimed  Matched  Preempting
INTEL/LINUX         4      0        4          0        0           0
      Total         4      0        4          0        0           0
When you are ready to run your own jobs, adapt
loop.cmd from the example to use as your submit script.
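As a rough illustration of what an adapted submit script might contain, here is a sketch that writes one from the shell. It is not a copy of loop.cmd, whose contents are not reproduced on this page. The names myprog.remote, myjob.cmd and the output file names are hypothetical; the keywords are standard Condor submit description commands, and $(Cluster) and $(Process) are Condor macros that expand to the two parts of the job ID.

```shell
# Write a submit description file for a hypothetical program "myprog.remote".
# The quoted 'EOF' stops the shell from expanding $(Cluster) and $(Process),
# which must reach Condor unchanged.
cat > myjob.cmd <<'EOF'
# Standard universe: the executable must be linked with condor_compile.
Universe     = standard
Executable   = myprog.remote
Arguments    = 200
Output       = myprog.$(Cluster).$(Process).out
Error        = myprog.$(Cluster).$(Process).err
Log          = myprog.log
Notification = Complete
Queue
EOF
```

You would then submit it with condor_submit myjob.cmd, in the same way as loop.cmd in the test.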
If you'd like to submit jobs that use shell scripts rather than compiled executables as in the example above, you will need to run them in what is known as the vanilla universe. Unfortunately, such jobs cannot checkpoint (stop running on a busy machine and then resume on another, less busy machine) as standard universe jobs can. Read more about each Condor universe on page 15 of the Condor manual.
Add Universe = vanilla to your submit script
(loop.cmd in the above example) to run jobs under the
vanilla universe:
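A minimal sketch of a vanilla universe job, assuming a trivial shell script as the executable. The file names run_me.sh and vanilla.cmd are hypothetical, and the submit file is an illustration built from standard submit description commands, not the site's recommended template.

```shell
# A trivial script to run as the job (hypothetical name run_me.sh).
cat > run_me.sh <<'EOF'
#!/bin/sh
echo "running on $(uname -n)"
EOF
chmod +x run_me.sh

# Submit description file for the vanilla universe. No condor_compile step
# is needed, since the executable is an ordinary script.
cat > vanilla.cmd <<'EOF'
Universe   = vanilla
Executable = run_me.sh
Output     = run_me.$(Process).out
Error      = run_me.$(Process).err
Log        = run_me.log
Queue
EOF
```

Submit it with condor_submit vanilla.cmd; the script's output should end up in run_me.0.out once the job completes.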
Use /home/username and/or /work/username for these files, and not, say,
/tmp, which is local to each machine.
Keep an eye on your jobs with condor_q. If you don't get output/email back
from a job, make sure the job isn't still running somewhere and kill it
with condor_rm.
buffer_size = 1000
buffer_block_size = 256