Note: these are instructions for the new Sun x86_64-based mate cluster. We encourage you to use the new cluster whenever possible.

Instructions for the old IBM cluster are available: here

Instructions to use Condor

 

To use condor within AFS, you must first set up a group to allow condor access to your directories.

  1. Edit the file 'condor_group.sh' and change your_login_name to your CS account name.

  2. Then run the script.

  3. Once you have created this group, you can then add this group to your working
    directory:

find ./your_working_directory -type d -exec fs sa -dir {} -acl your_login_name:condor rlidwk \;

Note: your working directory must be in a directory tree that <username>:condor has access to.
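
Before changing any ACLs, it can help to preview which directories the find command above will touch. The following is a minimal sketch of a dry run: it prints each fs command with echo instead of running it. The names WORKDIR, ACL_GROUP and preview_acl_commands are placeholders made up for this example; substitute your own directory and login name.

```shell
#!/bin/sh
# Dry run: print the fs command for every directory instead of executing it.
# WORKDIR and ACL_GROUP are hypothetical placeholders, not cluster settings.
WORKDIR="${WORKDIR:-./your_working_directory}"
ACL_GROUP="${ACL_GROUP:-your_login_name:condor}"

preview_acl_commands() {
    # $1 = top of the directory tree to walk
    find "$1" -type d -exec echo fs sa -dir {} -acl "$ACL_GROUP" rlidwk \;
}

# Only walk the tree if it exists, so the sketch is safe to run anywhere.
if [ -d "$WORKDIR" ]; then
    preview_acl_commands "$WORKDIR"
fi
```

Once the printed list looks right, run the original find command (without echo) to apply the ACLs.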

Now, you must create a condor job file. The provided example is 'ee.sub'.
You must include at least the first two lines:

The rest of the lines are as follows:

arguments = #these are passed on the command line to your program
input = #this file will become the stdin for your program
output = #this file will become stdout
error = #this file will become stderr
initialdir = #you can set up a separate initial working directory for each instance of your program; can be relative or absolute

Finally, the queue command tells condor to take all the lines preceding it and submit a job with these parameters. You can then change one or more of the parameters and issue another queue command. This will be a different instance of the program. You can have as many queue commands as you want, but you must be careful to change working directories if your program outputs to a file or to stdout/stderr; otherwise the instances will overwrite each other's output.
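
To illustrate how several queue commands fit together, here is a rough sketch that writes a small job file with two instances. It is not a copy of ee.sub: the file name example.sub, the program name my_program, the arguments, and the directories run1 and run2 are all made up for illustration. The universe and executable lines are assumed from standard Condor submit syntax and may differ from the two required lines in the provided example.

```shell
#!/bin/sh
# Sketch: write a job file with two instances. All names are placeholders.
SUBFILE="${SUBFILE:-example.sub}"

cat > "$SUBFILE" <<'EOF'
# Assumed header lines (standard Condor syntax, not copied from ee.sub)
universe   = vanilla
executable = my_program

# First instance: works in run1/
arguments  = --seed 1
input      = in.txt
output     = out.txt
error      = err.txt
initialdir = run1
queue

# Second instance: only the changed parameters are repeated
arguments  = --seed 2
initialdir = run2
queue
EOF

echo "wrote $SUBFILE"
```

Because each instance has its own initialdir, the out.txt and err.txt of the two runs should land in separate directories (assuming relative paths are resolved against initialdir, as in standard Condor), so neither run overwrites the other.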

To run:

Once you have created your job file (ee.sub, in my example), log in to s1.mate.cs.pitt.edu and issue:

condor_submit ee.sub

It will then start queuing your jobs. You can check the status of the cluster with this command:

/sbin/service condor status

and you can see what your jobs are doing with this command:

/cluster/condor/x86_64-linux-26/bin/condor_q
 

The above command will give you an ID number for each task; that ID will be in the format xxx.yy, where xxx identifies the batch and yy the task within it. You can kill a whole batch of tasks by issuing:

/cluster/condor/x86_64-linux-26/bin/condor_rm xxx

or you can kill a specific task by:

/cluster/condor/x86_64-linux-26/bin/condor_rm xxx.yy
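
If you want to script the two condor_rm forms above, the ID can be split at the dot with plain shell parameter expansion. This is only a sketch: the helper names job_batch and job_task and the ID 123.4 are invented for the example, and the script prints the commands rather than running them.

```shell
#!/bin/sh
# Split a job ID of the form xxx.yy into its batch and task parts.
# job_batch and job_task are hypothetical helpers, not Condor commands.
CONDOR_BIN=/cluster/condor/x86_64-linux-26/bin

job_batch() { printf '%s\n' "${1%%.*}"; }   # part before the first dot (xxx)
job_task()  { printf '%s\n' "${1#*.}"; }    # part after the first dot (yy)

id="123.4"   # made-up ID for illustration
echo "kill whole batch: $CONDOR_BIN/condor_rm $(job_batch "$id")"
echo "kill one task:    $CONDOR_BIN/condor_rm $id"
```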
 


To start/restart condor service:

This is needed if condor does not "see" all the machines in the mate cluster. If you are prompted for a password while using the following commands, use your AFS password.

To start the condor service on that machine:

sudo /sbin/service condor start

To stop the condor service on that machine:

sudo /sbin/service condor stop