How to run ATHENA-ATLFAST via globus -- multiple jobs on a job scheduler

You'll need four scripts: requirements, run-atlfast-globus-multi, atlfast-setup, and atlfast-globus-multi.

The requirements file will need to be in the home directory of the (remote) account where your globus job is going to run, so you'll need to copy it there:

$ globus-rcp -p ~/requirements ouhep1.nhn.ou.edu:
(Or just uncomment the corresponding line in run-atlfast-globus-multi. Also, the file is already in place in the grid account here on ouhep1, so you don't have to worry about it if you're running here.)

Then, in run-atlfast-globus-multi, you'll need to set the gatekeeper, the working directory in which you'd like your atlfast job to run (relative to the remote $HOME directory -- be sure that directory actually exists on the gatekeeper!), your BNL AFS userid, the Release, and the number of jobs you'd like to run simultaneously.
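
Purely for illustration, those settings might look something like this inside the script (the variable names and values here are hypothetical -- check the script itself for the real ones):

    GATEKEEPER=ouhep1.nhn.ou.edu   # gatekeeper host
    WORKDIR=atlfast                # relative to remote $HOME; must already exist there
    AFSUSER=yourid                 # your BNL AFS userid
    RELEASE=3.0.1                  # ATLAS software release (hypothetical number)
    NJOBS=5                        # number of simultaneous jobs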

In atlfast-setup, you may have to twiddle with the klog and unlog paths, if those commands aren't found on the machine you're trying to run on. They're usually either in /usr/bin/ or /usr/afsws/bin/, but ...
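
A quick way to find out where they actually live on a given machine (just standard shell commands, nothing specific to these scripts):

$ which klog unlog
$ ls -l /usr/bin/klog /usr/afsws/bin/klog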

In that script you'll also need to customize what kind of job you'd like to run (Pythia or Isajet), how many events, and whether or not you want to run multiple jobs.
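
Again just as an illustration, those choices might be expressed along these lines (hypothetical names -- the actual script may spell them differently):

    GENERATOR=pythia   # or isajet
    NEVENTS=1000       # events per job
    MULTIJOB=yes       # submit multiple jobs, or no for a single one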

Then you're ready to roll. Just run run-atlfast-globus-multi, which will prompt you for your BNL AFS password, and then copy and run first atlfast-setup and then atlfast-globus-multi (which have to be in the same directory, of course) on the gatekeeper. Those scripts will check out, compile, and run an athena-atlfast test job, and then submit the requested number of jobs to the job scheduler. That's currently condor, but it can easily be changed by replacing /jobmanager-condor with /jobmanager-pbs, /jobmanager-lsf, ..., in run-atlfast-globus-multi, depending on what job scheduler you have available.
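
In terms of the Globus resource contact string, that change amounts to something like this (host name taken from the example above):

    ouhep1.nhn.ou.edu/jobmanager-condor   # current default: condor
    ouhep1.nhn.ou.edu/jobmanager-pbs      # same gatekeeper, PBS instead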

Then you can get the status of your job with globus-job-status <jobID>, and the stdout and stderr with globus-job-get-output <jobID> and globus-job-get-output -err <jobID>, respectively.
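
For example (the job ID below is a made-up contact URL of the general kind the globus tools hand back at submission time -- yours will differ):

$ globus-job-status https://ouhep1.nhn.ou.edu:40001/12345/1022706000/
$ globus-job-get-output https://ouhep1.nhn.ou.edu:40001/12345/1022706000/
$ globus-job-get-output -err https://ouhep1.nhn.ou.edu:40001/12345/1022706000/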

At this point the output and error files only seem to return one of the jobs, or even a mixture of them; I haven't figured that out yet. It seems to be a bug in the globus-condor interface somehow. But that doesn't seem to affect the execution of the jobs, as all the ntuples seem to be fine. Every now and then there is a problem with file locking, since apparently NFS is too slow when multiple jobs start up simultaneously, but for the most part it seems to work fine. If you have any suggestions as to how to improve this, or encounter any problems, please let me know.

If you want to run on other testbed sites, you need to be aware of several things. Athena for some reason requires the existence of libXm.so.1, which is an old version (LessTif/Motif 1.2, as opposed to 2.0). You may be able to get away with making a soft link to libXm.so.2; I haven't tried that.
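
If you do want to try it, the link would presumably be something like the following (untested, and the library directory may differ on your system):

# ln -s /usr/X11R6/lib/libXm.so.2 /usr/X11R6/lib/libXm.so.1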

Also, it wants libstdc++-libc6.1-2.so.3, which is a version that's not in the standard RedHat distribution as far as I know. The way to get around that is to make a soft link:

# ln -s libstdc++-2-libc6.1-1-2.9.0.so /usr/lib/libstdc++-libc6.1-2.so.3
That does work; I've done it here.

Also, for the time being, the BNL scripts require access to both the /afs/usatlas.bnl.gov/ and /afs/rhic/ trees, until all instances of rhic have been eliminated.
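
A quick way to check whether a prospective site can see both trees (just standard AFS client access; if either of these fails, the scripts won't run there yet):

$ ls /afs/usatlas.bnl.gov/ /afs/rhic/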


Horst Severini <hs@mail.nhn.ou.edu>
Last modified: Wed May 29 16:57:15 CDT 2002