>>> to add/remove nodes (now done in ~mcfarm/bin/add_node.py):
No need to stop/restart mcfarm - just alter distribute.conf once you
know you can ssh to the new node and see its directory through NFS,
and the node in turn can ssh to 000 and see 000's directory through
NFS.  Then do start_execute 2 3 etc.  To take a node out of the farm,
comment it out of distribute.conf (but be sure no gathering is going
on at the moment).

>>> to launch jobs:
    cd $FARM_JOB_SUBMIT
    get_request requestID    # add --ignore_status if owned by another farm
    launch_request requestID target_farm #events jobtype --priority=0
e.g.
    launch_request 6969 $FARM_BASE 100000 PDSRT --priority=3
launch_request now allows --priority=N as the last argument (unless
there's an input filename, which is still the last arg).  Use -p=0 to
keep mcfarm from doing any nice-leveling of its own.  You will want to
do this for all your jobs on that farm, and you won't be able to put
anything higher than 0 in McFarm's distribution order (but you can
always re-arrange jobs in the gather.conf file yourself).

>>> to kill jobs:
The recommended way is
    check_job -k jobname
I usually do mass kills with a filtered command like
    check_job -k dist_queue/OU* dist_queue/0*/OU*
which kills everything in the farm, undistributed and distributed, and
removes from the condor queue anything that's in process.  Use more
restrictive filtering as the case requires.

>>> to launch replacement jobs (now, run ~mcfarm/bin/replace_job NNNNN):
Assume each request was originally launched like
    launch_request NNNNN /home/mcfarm/mcfarm 10000 PDSRT --priority=3
and you have 250 as the parallel-segment size in request.conf.  Then
any PDSRT job that has to be killed can be replaced with
    launch_request NNNNN /home/mcfarm/mcfarm 250 PDSRT --priority=3
(each killed job corresponds to one 250-event segment).
To replace a specific DSRT job, skip SSS events and then launch NNN
events:
    launch_request NNNNN $FARM_BASE SSS:NNN DSRT
e.g., to replace the third job of a request:
    launch_request 13072 $FARM_BASE 500:250 DSRT
Just launch the replacement within 2 days of killing the old job.
By the way, to avoid the 2-day wait between the last PDSRT job being
done and gathered and the final tmb merge being performed, just go
ahead and move the req_NNNNN subdir from ~/job_submit to
~/job_submit/done.  The merge daemon will find the subdir in the done
section and proceed immediately with the final tmb merge and store.
You then just have to remember to close_request after the files
disappear from the stage dir.

>>> to launch a minbias job:
Use get_request and launch_request on 5257, the magic number for the
minbias request in the current epoch of d0 processing, with job type
PD, which should declare and cache the d0g files.  e.g.
    launch_request 5257 $FARM_BASE 100000 PD --priority=0

>>> to check if request is done (now, run ~mcfarm/bin/nreqjobs NNNNN):
    grep ReqNNNNN gather.conf
    ls $FARM_MERGE_STAGE_DIR | grep ReqNNNNN
# If both commands give no output, then the request is done.
# (A sketch wrapping both checks follows.)
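For convenience, here is a minimal sketch of that check, in the spirit
of ~mcfarm/bin/nreqjobs.  ASSUMPTION: gather.conf lives in $FARM_BASE
(adjust the path to wherever your farm keeps it); $FARM_MERGE_STAGE_DIR
must be set as above.

    #!/bin/sh
    # Sketch only - report whether request $1 still has jobs queued
    # for gathering or files waiting in the merge stage area.
    req=$1
    # count of gather.conf entries still referencing the request
    # (ASSUMED path $FARM_BASE/gather.conf):
    left=`grep -c "Req$req" $FARM_BASE/gather.conf`
    # count of files for the request still in the merge stage dir:
    staged=`ls $FARM_MERGE_STAGE_DIR | grep -c "Req$req"`
    if [ "$left" -eq 0 ] && [ "$staged" -eq 0 ]; then
        echo "request $req appears done; now run close_request $req"
    else
        echo "request $req: $left gather entries, $staged staged files left"
    fi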
# Then close the request with:
    cd $FARM_JOB_SUBMIT
    close_request NNNNN

>>> to get new version of cardfiles not available via upd (now in ~products/bin/add_cardfile.sh):
# as a d0sshcvs-enabled user do:
    cd /tmp
    cvs co -r v00-04-43 cardfiles
# as products do:
    cd /d0dist/dist/packages/cardfiles
    cp -r /tmp/cardfiles v00-04-43
# then launch jobs as usual

>>> to modify request status in sam:
    sam modify request --request-id=12380 --status=running
# (other statuses: approved, finished, terminated)

>>> to check on the status of a request in sam:
    sam get request info --requestId=13433 --format=dict
    sam get request details --requestid=13433    # [--dictfmt]

>>> to get info on a particular request from sam:
    sam translate constraints --dim="global.requestid=11637"
    sam translate constraints --dim="global.requestid=12899 and data_tier reconstructed and availability_status available and content_status good"

>>> to get number of tmbs produced:
    sam translate constraints --dim="global.FacilityName=ouhep0.nhn.ou.edu and data_tier=thumbnail"
# (or data_tier=reconstructed)
    sam translate constraints --dim="global.FacilityName=luhep02.lunet.edu and data_tier=thumbnail and create_date>=8/24/2003"
    sam translate constraints --dim="file_name %SPRACE%JIM_MERGE% and data_tier=thumbnail and appl_name d0reco"

>>> to locate a file in sam:
    sam locate merger-tmb-p14.05.02_OU-Team-NumEv-10000_jes_mcp14_ouhep0.nhn.ou.edu_11637_01076935065
or, with wildcards (use % instead of *):
    sam translate constraints --dim="file_name merger-tmb-p14.05.02_OU-Team-NumEv-%_jes_mcp14_ouhep0.nhn.ou.edu_11637_%"

>>> to check if file is properly declared to sam:
    sam list files --name=d0reco-merger_NumEv-500_OU-Team_higgs_mcp14_ouhep0.nhn.ou.edu_6706_01055802115
    sam list files --name="merger-tmb-p14.05.02_OU-Team-NumEv-%_jes_mcp14_ouhep0.nhn.ou.edu_11637_%"
    sam list files --dim="file_name merger-tmb-p14.05.02_OU-Team-NumEv-%_jes_mcp14_ouhep0.nhn.ou.edu_11637_%"

>>> to get metadata on a file stored in sam:
    sam get metadata --filename=d0reco-merger_NumEv-0_bertram_bphysics_mcp14_lancs_6400_03141140713

>>> to get info on a dataset/project definition:
    sam describe definition --defname="overlapset_mcp14_cteq5l-tuneA_simulated"

>>> to re-submit a file to sam which got stuck in the cache (now, run ~mcfarm/bin/resubmit_file):
    cd scr0xx/gath_queue/$jobid
    sam cancel file store --station=central-analysis
    sam store --resubmit --source=. --descrip=./import_kw_.py

>>> if jobs fail to distribute:
Check the job.phase file in the top level of the job's directory under
dist_queue.

>>> if start_farm never finishes, stalling in start_execute:
Kill all mcfarm processes and delete all files in the lock directory,
then retry (see the recovery sketch at the end of this section).

>>> if a job is stuck in the initiating state and condor is idle:
condor_rm any other idle jobs left on the node.  If there are no other
idle jobs on the node, make sure condor is running there.

>>> when jobs won't distribute:
Check the accuracy of $FARM_BASE/archive/nodeinfo.  Delete it and it
will be recreated.

>>> if no merge_b is happening:
Clean out the orphaned files in $FARM_BASE/metadata.
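A rough combined recovery sketch for the stalled-start_farm and
won't-distribute cases above.  ASSUMPTIONS: the lock directory is
$FARM_BASE/lock and the farm runs under the mcfarm account - verify
both on your installation before running.

    # kill mcfarm's processes (name-pattern match; adjust the pattern
    # to however your mcfarm daemons are actually named):
    pkill -u mcfarm mcfarm
    # clear stale lock files (ASSUMED lock dir $FARM_BASE/lock):
    rm -f $FARM_BASE/lock/*
    # drop stale node info; it is recreated automatically:
    rm -f $FARM_BASE/archive/nodeinfo
    # then retry:
    start_farm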