>>> to add/remove nodes (now done in ~mcfarm/bin/add_node.py):
No need to stop/restart mcfarm - just alter distribute.conf once you
know you can ssh to the new node and see its directory through NFS,
and the node in turn can ssh to 000 and see 000's directory through
NFS.  Then do start_execute 2 3 etc.  To take a node out of the farm,
comment it out of distribute.conf (but be sure no gathering is going
on at the moment).

>>> to launch jobs:
    cd $FARM_JOB_SUBMIT
    get_request requestID    # add --ignore_status if owned by another farm
    launch_request requestID target_farm #events jobtype --priority=0
e.g.
    launch_request 6969 $FARM_BASE 100000 PDSRT --priority=3
launch_request now allows --priority=N as the last argument (unless
there's an input filename, which is still the last arg).  Use -p=0 to
keep mcfarm from doing any nice-leveling of its own.  You will want to
do this for all your jobs on that farm, and you won't be able to put
anything higher than 0 in McFarm's distribution order (but you can
always re-arrange jobs in the gather.conf file yourself).

>>> to kill jobs:
The recommended way is
    check_job -k jobname
I usually do mass kills with a filtered command like
    check_job -k dist_queue/OU* dist_queue/0*/OU*
which kills everything in the farm, undistributed and distributed, and
removes from the condor queue anything that's in process.  Use more
restrictive filtering as the case requires.

>>> to launch replacement jobs (now, run ~mcfarm/bin/replace_job NNNNN):
Assume each request was originally launched like
    launch_request NNNNN /home/mcfarm/mcfarm 10000 PDSRT --priority=3
and you have 250 as the parallel-segment size in request.conf.  Then
any PDSRT job that has to be killed can be replaced with
    launch_request NNNNN /home/mcfarm/mcfarm 250 PDSRT --priority=3
(each killed job corresponds to one 250-event segment).
To replace a specific DSRT job, skip SSS events and then launch NNN
events:
    launch_request NNNNN $FARM_BASE SSS:NNN DSRT
e.g., to replace the third job of a request:
    launch_request 13072 $FARM_BASE 500:250 DSRT
Just launch the replacement within 2 days of killing the old job.
By the way, to avoid the 2-day wait between the last PDSRT job being
done and gathered and the final tmb merge being performed, just go
ahead and move the req_NNNNN subdir from ~/job_submit to
~/job_submit/done.  The merge daemon will find the subdir in the done
section and proceed immediately with the final tmb merge and store.
You then just have to remember to close_request after the files
disappear from the stage dir.

>>> to launch a minbias job:
Use get_request and launch_request on 5257, the magic number for the
minbias request in the current epoch of d0 processing, with job type
PD, which should declare and cache the d0g files.  e.g.
    launch_request 5257 $FARM_BASE 100000 PD --priority=0

>>> to check if request is done (now, run ~mcfarm/bin/nreqjobs NNNNN):
    grep ReqNNNNN gather.conf
    ls $FARM_MERGE_STAGE_DIR | grep ReqNNNNN
# If both commands give no output, then the request is done.
# (A sketch wrapping both checks follows.)
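For convenience, here is a minimal sketch of that check, in the spirit
of ~mcfarm/bin/nreqjobs.  ASSUMPTION: gather.conf lives in $FARM_BASE
(adjust the path to wherever your farm keeps it); $FARM_MERGE_STAGE_DIR
must be set as above.

    #!/bin/sh
    # Sketch only - report whether request $1 still has jobs queued
    # for gathering or files waiting in the merge stage area.
    req=$1
    # count of gather.conf entries still referencing the request
    # (ASSUMED path $FARM_BASE/gather.conf):
    left=`grep -c "Req$req" $FARM_BASE/gather.conf`
    # count of files for the request still in the merge stage dir:
    staged=`ls $FARM_MERGE_STAGE_DIR | grep -c "Req$req"`
    if [ "$left" -eq 0 ] && [ "$staged" -eq 0 ]; then
        echo "request $req appears done; now run close_request $req"
    else
        echo "request $req: $left gather entries, $staged staged files left"
    fi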
# Then close the request with:
    cd $FARM_JOB_SUBMIT
    close_request NNNNN

>>> to get new version of cardfiles not available via upd (now in ~products/bin/add_cardfile.sh):
# as a d0sshcvs-enabled user do:
    cd /tmp
    cvs co -r v00-04-43 cardfiles
# as products do:
    cd /d0dist/dist/packages/cardfiles
    cp -r /tmp/cardfiles v00-04-43
# then launch jobs as usual

>>> to modify request status in sam:
    sam modify request --request-id=12380 --status=running
# (other statuses: approved, finished, terminated)

>>> to check on the status of a request in sam:
    sam get request info --requestId=13433 --format=dict
    sam get request details --requestid=13433    # [--dictfmt]

>>> to get info on a particular request from sam:
    sam translate constraints --dim="global.requestid=11637"
    sam translate constraints --dim="global.requestid=12899 and data_tier reconstructed and availability_status available and content_status good"

>>> to get number of tmbs produced:
    sam translate constraints --dim="global.FacilityName=ouhep0.nhn.ou.edu and data_tier=thumbnail"
# (or data_tier=reconstructed)
    sam translate constraints --dim="global.FacilityName=luhep02.lunet.edu and data_tier=thumbnail and create_date>=8/24/2003"
    sam translate constraints --dim="file_name %SPRACE%JIM_MERGE% and data_tier=thumbnail and appl_name d0reco"

>>> to locate a file in sam:
    sam locate merger-tmb-p14.05.02_OU-Team-NumEv-10000_jes_mcp14_ouhep0.nhn.ou.edu_11637_01076935065
or, with wildcards (use % instead of *):
    sam translate constraints --dim="file_name merger-tmb-p14.05.02_OU-Team-NumEv-%_jes_mcp14_ouhep0.nhn.ou.edu_11637_%"

>>> to check if file is properly declared to sam:
    sam list files --name=d0reco-merger_NumEv-500_OU-Team_higgs_mcp14_ouhep0.nhn.ou.edu_6706_01055802115
    sam list files --name="merger-tmb-p14.05.02_OU-Team-NumEv-%_jes_mcp14_ouhep0.nhn.ou.edu_11637_%"
    sam list files --dim="file_name merger-tmb-p14.05.02_OU-Team-NumEv-%_jes_mcp14_ouhep0.nhn.ou.edu_11637_%"

>>> to get metadata on a file stored in sam:
    sam get metadata --filename=d0reco-merger_NumEv-0_bertram_bphysics_mcp14_lancs_6400_03141140713

>>> to get info on a dataset/project definition:
    sam describe definition --defname="overlapset_mcp14_cteq5l-tuneA_simulated"

>>> to re-submit a file to sam which got stuck in the cache (now, run ~mcfarm/bin/resubmit_file):
    cd scr0xx/gath_queue/$jobid
    sam cancel file store --station=central-analysis
    sam store --resubmit --source=. --descrip=./import_kw_.py

>>> if jobs fail to distribute:
Check the job.phase file in the top level of the job's directory under
dist_queue.

>>> if start_farm never finishes, stalling in start_execute:
Kill all mcfarm processes and delete all files in the lock directory,
then retry (see the recovery sketch at the end of this section).

>>> if a job is stuck in the initiating state and condor is idle:
condor_rm any other idle jobs left on the node.  If there are no other
idle jobs on the node, make sure condor is running there.

>>> when jobs won't distribute:
Check the accuracy of $FARM_BASE/archive/nodeinfo.  Delete it and it
will be recreated.

>>> if no merge_b is happening:
Clean out the orphaned files in $FARM_BASE/metadata.
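A rough combined recovery sketch for the stalled-start_farm and
won't-distribute cases above.  ASSUMPTIONS: the lock directory is
$FARM_BASE/lock and the farm runs under the mcfarm account - verify
both on your installation before running.

    # kill mcfarm's processes (name-pattern match; adjust the pattern
    # to however your mcfarm daemons are actually named):
    pkill -u mcfarm mcfarm
    # clear stale lock files (ASSUMED lock dir $FARM_BASE/lock):
    rm -f $FARM_BASE/lock/*
    # drop stale node info; it is recreated automatically:
    rm -f $FARM_BASE/archive/nodeinfo
    # then retry:
    start_farm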