Note: these are instructions for the new Sun x86_64-based mate cluster. We encourage you to use the new cluster whenever possible.
Instructions for the old IBM cluster are available here.
Instructions to use Condor
To use condor within AFS, you must first set up a group to allow condor access to your directories.
Edit the file 'condor_group.sh' and change your_login_name to your CS account name.
Then run the script.
Once you have created this group, you can add it to your working directory:
find ./your_working_directory -type d -exec fs sa -dir {} -acl your_login_name:condor rlidwk \;
Note: your working directory must be in a directory tree that <username>:condor has access to.
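Before changing any ACLs, it can help to preview which directories the find command above would touch. The following is a minimal sketch, not part of the provided scripts: the function name preview_condor_acl is hypothetical, and it only prints the fs sa commands instead of running them.

```shell
# Hypothetical helper: print, without running, the fs sa command for
# every directory under the given tree.
preview_condor_acl() {
    dir="$1"      # root of your working directory tree
    login="$2"    # your CS account name
    # echo turns this into a dry run; remove it to actually set the ACLs
    find "$dir" -type d -exec echo fs sa -dir {} -acl "$login:condor" rlidwk \;
}
```

For example, preview_condor_acl ./your_working_directory your_login_name prints one line per directory, each showing the command that would be applied.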
Now you must create a condor job file. The provided example is 'ee.sub'.
You must include at least the first two lines:
The first tells condor to use the 'vanilla' universe, which is suitable for all job types.
The second line, 'executable = ', gives the full path to your program.
The rest of the lines are as follows:
arguments = #these are passed on the command line to your program
input = #this file will become the stdin for your program
output = #this file will become stdout
error = #this file will become stderr
initialdir = #you can set up a separate initial working directory for each instance of your program; can be relative or absolute
Finally, the queue command tells condor to take all the lines preceding it and
submit a job with these parameters. You can then change one or more of the
parameters and issue another queue command. This will be a different instance
of the program. You can have as many queue commands as you want, but if your
program writes to a file or to stdout/stderr, you must be careful to change the
working directory for each instance; otherwise the instances will overwrite
each other's output.
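To make the layout concrete, here is a sketch of a job file with two queue commands, written out by a shell here-document. Every name and value in it (example.sub, /path/to/your_program, in.txt, run1, run2 and so on) is a placeholder and is not taken from the provided ee.sub. It assumes that the directories run1 and run2 already exist and that each contains its own in.txt.

```shell
# Write a hypothetical job file; every path and value below is a placeholder.
cat > example.sub <<'EOF'
universe     = vanilla
executable   = /path/to/your_program
input        = in.txt
output       = out.txt
error        = err.txt

# first instance
arguments    = 1
initialdir   = run1
queue

# second instance: same program, different arguments and directory
arguments    = 2
initialdir   = run2
queue
EOF
```

Only arguments and initialdir change between the two queue commands, so each instance reads and writes its files inside its own directory.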
To run:
Once you have created your job file (ee.sub, in my example), log in to
s1.mate.cs.pitt.edu and issue:
condor_submit ee.sub
It will then start queuing your jobs. You can check the status of the cluster
with this command:
/sbin/service condor status
and you can see what your jobs are doing with this command:
/cluster/condor/x86_64-linux-26/bin/condor_q
The above command will give you an ID number for each task; that ID will be in
the format xxx.yy. You can kill a whole batch of tasks by issuing:
/cluster/condor/x86_64-linux-26/bin/condor_rm xxx
or you can kill a specific task by:
/cluster/condor/x86_64-linux-26/bin/condor_rm xxx.yy
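Because the batch number is the part of the ID before the dot, shell parameter expansion can split an ID of the form xxx.yy. This is only an illustrative sketch; the ID 123.4 and the variable names are made up.

```shell
job_id="123.4"            # made-up ID in the xxx.yy format
batch="${job_id%%.*}"     # part before the dot: the whole batch (xxx)
task="${job_id#*.}"       # part after the dot: the task within the batch (yy)
echo "batch=$batch task=$task"
```

Passing "$batch" to condor_rm would then remove the whole batch, while passing "$job_id" would remove only that one task.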
To start/restart the condor service:
This is needed in case condor does not "see" all the machines in the mate
cluster. If you are prompted for a password while using the following
commands, use your AFS password.
To start the condor service on that machine:
sudo /sbin/service condor start
To stop the condor service on that machine:
sudo /sbin/service condor stop