=== Description of cluster.py ===

All the commands related to cluster submission are handled in the following file:
madgraph/various/cluster.py

This file contains:
 1. The class Cluster, which contains the basic commands for every practical cluster implementation. You do not have to touch this.
 2. A class for each specific cluster: (CondorCluster, GECluster, LSFCluster, MultiCore, ...)
 3. A dictionary from_name, which indicates which class to use for a given name:
{{{
from_name = {'condor':CondorCluster, 'pbs': PBSCluster, 'sge': SGECluster,
             'lsf': LSFCluster, 'ge':GECluster}
}}}
    (you need to keep it up to date; see the registration example further below)
 4. The rest is not important.

If you want to add support for a new cluster, you need to add a new class MYCluster
which has the class Cluster as parent:
| 33 | |
| 34 | {{{ |
| 35 | class MYCluster(Cluster): |
| 36 | """Basic class for dealing with cluster submission""" |
| 37 | |
| 38 | name = 'mycluster' |
| 39 | job_id = 'JOBID' |
| 40 | }}} |

Two class attributes should be defined (as shown above):
 1. '''name''': the name associated to the cluster
 2. '''job_id''': the name of a shell environment variable set by your cluster which gives the job a unique identification (used only in "no central disk" mode).
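For MadGraph to find the class from its name, the new class also has to be registered in the from_name dictionary shown above; the key is the name under which the cluster is selected (matching the '''name''' attribute):
{{{
from_name = {'condor':CondorCluster, 'pbs': PBSCluster, 'sge': SGECluster,
             'lsf': LSFCluster, 'ge':GECluster,
             'mycluster': MYCluster}
}}}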

Then you have to define the following functions:
 1. '''submit'''
{{{
@multiple_try()
def submit(self, prog, argument=[], cwd=None, stdout=None, stderr=None, log=None):
    """Submit a job prog to a GE cluster"""
}}}
  a. '''prog''' is the program to run
  a. '''argument''' is the list of arguments to pass to the program
  a. '''cwd''' is the directory from which the script has to be run [default is the current directory]
  a. '''stdout''' indicates where to write the output
  a. '''stderr''' indicates where to write the errors
  a. '''log''' indicates where to write the cluster log/statistics of the job
 For the last three, you have to define your own default (often /dev/null).
 Note that stderr can be -2: in that case, stderr should be written to the same file as stdout.

 This function should return the identification number associated to the submission.
 Note that the @multiple_try() decorator before the definition catches any error that occurs and retries a couple of seconds later. This prevents the code from crashing if the server is too busy at a given moment.
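 As an illustration only, here is a minimal sketch of what such a method could look like (it goes inside your MYCluster class, with the import at the top of cluster.py). It assumes a hypothetical submission command mysub that prints the job id as the last word of its output; the option names and the output parsing must of course be adapted to your own batch system:
{{{
import subprocess

@multiple_try()
def submit(self, prog, argument=[], cwd=None, stdout=None, stderr=None, log=None):
    """Submit the program prog via a hypothetical 'mysub' batch command."""
    stdout = stdout or '/dev/null'
    if stderr is None:
        stderr = '/dev/null'
    elif stderr == -2:
        # -2 means: write stderr to the same file as stdout
        stderr = stdout
    log = log or '/dev/null'

    command = ['mysub', '-o', stdout, '-e', stderr, '-l', log, prog] + argument
    output = subprocess.check_output(command, cwd=cwd).decode()

    # assume the last word printed by 'mysub' is the job id
    id = output.strip().split()[-1]
    if not id.isdigit():
        raise Exception('fail to submit to the cluster:\n%s' % output)
    self.submitted_ids.append(id)   # see the Tricks section below
    return id
}}}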

 2. '''control_one_job''':
{{{
@multiple_try()
def control_one_job(self, id):
    """ control the status of a single job with its cluster id """
}}}
  a. '''id''': the cluster identification number
  b. the function should return one of:
   1. 'I': the job is not yet running
   1. 'R': the job is running
   1. 'F': the job is finished
   1. anything else: the job will be considered as failed.
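 Again purely as a hypothetical sketch, assuming a status command mystat that prints a one-word state for the job (adapt both the command and the state names to your system):
{{{
import subprocess

@multiple_try()
def control_one_job(self, id):
    """ control the status of a single job with its cluster id """
    proc = subprocess.Popen(['mystat', str(id)],
                            stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    out, _ = proc.communicate()
    status = out.decode().strip()

    if status in ('PENDING', 'HELD'):
        return 'I'
    elif status == 'RUNNING':
        return 'R'
    elif status in ('DONE', ''):
        # an empty answer often means the job already left the queue
        return 'F'
    return status   # anything else will be counted as failed
}}}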

 3. '''control''':
{{{
@multiple_try()
def control(self, me_dir=None):
}}}

  a. '''me_dir''' is the MadEvent directory from which the jobs are run (most of the time not needed)
  a. should return four numbers corresponding to the number of jobs in the Idle, Running, Finished and Failed states
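 If your batch system has no efficient way to query all jobs at once, a simple (if not optimal) sketch can just loop over the stored ids and reuse control_one_job:
{{{
@multiple_try()
def control(self, me_dir=None):
    """ return the number of jobs in the Idle, Running, Finished, Failed states """
    idle, run, fine, fail = 0, 0, 0, 0
    for id in self.submitted_ids:
        status = self.control_one_job(id)
        if status == 'I':
            idle += 1
        elif status == 'R':
            run += 1
        elif status == 'F':
            fine += 1
        else:
            fail += 1
    return idle, run, fine, fail
}}}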

The rest is optional:

 1. '''remove'''
{{{
@multiple_try()
def remove(self, *args):
    """Clean the jobs on the cluster"""
}}}
 Should remove all jobs submitted to your cluster. This will be called if you hit ctrl-c or if some jobs fail.
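 A sketch, assuming a hypothetical mydel command that accepts a list of job ids:
{{{
import os
import subprocess

@multiple_try()
def remove(self, *args):
    """Clean the jobs on the cluster"""
    if not self.submitted_ids:
        return
    devnull = open(os.devnull, 'w')
    subprocess.call(['mydel'] + self.submitted_ids,
                    stdout=devnull, stderr=devnull)
}}}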

 2. '''submit2''': submit2 is the same as submit but is used when the user defines the optional argument (cluster_temp_path = XXX).
 In that case, no central disk is used: all data are written to the path XXX --which is supposed to be a node-local disk-- and finally sent back to the shared disk. [[BR]]By default, we first copy all required resources to XXX, then use the "submit" command to launch the job, and at the end of the job send all data back.
 If your cluster does not have a central disk, this is the method to edit (see the sketch below).
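 The exact signature of submit2 should be taken from the parent Cluster class before overriding it. Purely as an illustration of the wrapper-script idea, a hypothetical helper could look like this, with temp_path standing for the value of cluster_temp_path and self.job_id providing a unique working directory per job:
{{{
import os
import stat

def write_transfer_wrapper(self, prog, argument, cwd, temp_path):
    """Hypothetical helper: build a shell wrapper that runs prog on the
    node-local disk temp_path and copies the results back to cwd.
    The resulting script can then be passed to self.submit()."""
    wrapper = os.path.join(cwd, 'wrapper_%s.sh' % os.path.basename(prog))
    text = '''#!/bin/bash
# unique working directory on the node-local disk, based on the job id
WORKDIR=%(tmp)s/run_${%(job_id)s}
mkdir -p $WORKDIR
cp -r %(cwd)s/* $WORKDIR/
cd $WORKDIR
%(prog)s %(args)s
# send everything back to the shared disk
cp -r $WORKDIR/* %(cwd)s/
rm -rf $WORKDIR
''' % {'tmp': temp_path, 'job_id': self.job_id, 'cwd': cwd,
       'prog': prog, 'args': ' '.join(argument)}
    with open(wrapper, 'w') as fsock:
        fsock.write(text)
    os.chmod(wrapper, os.stat(wrapper).st_mode | stat.S_IEXEC)
    return wrapper
}}}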

=== Tricks ===

 1. One good idea is to store the identification ids in the list self.submitted_ids.
 This helps to know which jobs are running, which ones to delete, ...

 2. Always put @multiple_try() before the function definition, and raise an error as soon as something goes wrong;
 the multiple_try function will then retry later.
 3. On some clusters it can be useful to replace:
{{{
command = ['qsub','-o', stdout,
           '-N', me_dir,
           '-e', stderr,
           '-V']
}}}
 with
{{{
command = ['qsub',
           '-S', '/bin/bash',
           '-o', stdout,
           '-N', me_dir,
           '-e', stderr,
           '-V']
}}}