
Running MadGraph / MadEvent in parallel mode on your own cluster

In MadGraph5, a number of clusters are supported internally. The file input/mg5_configuration.txt lists the available clusters and the options used to configure them.

You can add or edit those clusters to fit your needs; the rest of this page explains how. If you have a generic implementation for a cluster, don't hesitate to send it to us so that we can include it in the list of officially supported clusters. This might be very useful for other users.
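
For example, the cluster-related entries in input/mg5_configuration.txt typically look like the lines below (the exact option names and values should be checked against the comments in your own configuration file):

    # type of cluster to use ('condor', 'pbs', 'sge', 'lsf', ...)
    cluster_type = condor
    # queue/partition on which the jobs are submitted
    cluster_queue = madgraph
    # local path on the nodes; set it only if your cluster has no central disk
    # cluster_temp_path = /scratch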

Description of cluster.py

All the commands related to cluster submission are handled in the following file: madgraph/various/cluster.py

This file contains:

  1. The class Cluster, which contains the basic commands common to all practical cluster implementations. You don't have to touch this.
  2. A class for each specific cluster (CondorCluster, GECluster, LSFCluster, MultiCore, ...).
  3. A dictionary from_name, which indicates which class to use for a given name:
    from_name = {'condor':CondorCluster, 'pbs': PBSCluster, 'sge': SGECluster, 
                 'lsf': LSFCluster, 'ge':GECluster}
    
    

(you need to keep this dictionary up to date when adding a new class; see the example after this list)

  4. The rest is not important for adding a new cluster.
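
For example, assuming your new class is called MYCluster (as in the next paragraph), making it available under the name 'mycluster' only requires adding an entry to that dictionary:

    from_name = {'condor':CondorCluster, 'pbs': PBSCluster, 'sge': SGECluster, 
                 'lsf': LSFCluster, 'ge':GECluster,
                 'mycluster': MYCluster}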

If you want to add support for a new cluster, you need to add a new class MYCluster which has the class Cluster as parent:

class MYCluster(Cluster):
    """Basic class for dealing with cluster submission"""
    
    name = 'mycluster'
    job_id = 'JOBID'

Two class attributes should be defined (as shown above):

  1. name: the name associated to the cluster
  2. job_id: the name of a shell environment variable set by your cluster in order to give the job a unique identification (used only in "no central disk" mode).
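
For instance, with common schedulers the attribute would point to the variable they export for each job (to be checked against your scheduler's documentation):

    # examples of the environment variable holding the job number:
    # SGE exports $JOB_ID, PBS/Torque exports $PBS_JOBID, LSF exports $LSB_JOBID
    job_id = 'JOB_ID'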

Then you have to define the following functions:

  1. submit
        @multiple_try()
        def submit(self, prog, argument=[], cwd=None, stdout=None, stderr=None, log=None):
            """Submit a job prog to a GE cluster"""
    
    1. prog is the program to run
    2. argument is the argument to pass to the program
    3. cwd is the directory from which the script has to be run [default is the current directory]
    4. stdout indicates where to write the output
    5. stderr indicates where to write the error
    6. log indicates where to write the cluster log/statistics of the job. For the last three arguments, you have to define your own default (often /dev/null). Note that stderr can be -2; in that case, stderr should be written to the same file as stdout.

This function should return the identification number associated to the submitted job. Note that the @multiple_try() decorator placed before the definition allows to catch any error that occurs and to retry a couple of seconds later. This prevents the code from crashing if the server is too busy at a given point.
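
As an illustration, a minimal submit for a hypothetical batch system whose submission command is called mysub could look like the sketch below. The mysub command, its options and the format of its answer are assumptions; adapt them to your own scheduler.

    @multiple_try()
    def submit(self, prog, argument=[], cwd=None, stdout=None, stderr=None, log=None):
        """Submit the job prog to a hypothetical 'mysub' batch system."""
        import subprocess
        if stdout is None:
            stdout = '/dev/null'
        if stderr is None:
            stderr = '/dev/null'
        elif stderr == -2:
            # -2 means that stderr should go to the same file as stdout
            stderr = stdout
        if log is None:
            log = '/dev/null'
        # 'mysub' and its options are placeholders for your real submission command
        command = ['mysub', '-o', stdout, '-e', stderr, '-l', log, prog] + argument
        output = subprocess.Popen(command, stdout=subprocess.PIPE, cwd=cwd,
                                  universal_newlines=True).communicate()[0]
        # here we assume the submitter answers something like "Job <1234> submitted"
        id = output.split()[1].strip('<>')
        if not id.isdigit():
            # raising an error lets @multiple_try() wait and try again
            raise Exception('fail to submit to the cluster: %s' % output)
        self.submitted_ids.append(id)
        return id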

  2. control_one_job:
        @multiple_try()
        def control_one_job(self, id):
            """ control the status of a single job with its cluster id """
    
    1. id: the cluster identification number
    2. the function should return one of:
      1. 'I': the job is not yet running
      2. 'R': the job is running
      3. 'F': the job is finished
      4. anything else: the job will be considered as failed.
  3. control:
        @multiple_try()
        def control(self, me_dir=None):
    
    1. me_dir is the MadEvent directory from which the jobs are run (most of the time not needed)
    2. the function should return 4 numbers corresponding to the number of jobs that are Idle, Running, Finished, and Failed. A sketch of both status functions is given after this list.
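
For the same hypothetical batch system, the two status functions could be sketched as follows. The mystat command and its one-letter status codes are assumptions; adapt the parsing to whatever your queue status tool prints.

    @multiple_try()
    def control_one_job(self, id):
        """ control the status of a single job with its cluster id """
        import subprocess
        # 'mystat' is a placeholder for your queue status command
        output = subprocess.Popen(['mystat', '-j', str(id)], stdout=subprocess.PIPE,
                                  universal_newlines=True).communicate()[0]
        status = output.strip()
        if status in ('q', 'w'):        # queued / waiting
            return 'I'
        elif status in ('r', 't'):      # running / transferring
            return 'R'
        elif status in ('', 'd'):       # no longer in the queue: assume finished
            return 'F'
        return status                   # anything else is treated as a failure

    @multiple_try()
    def control(self, me_dir=None):
        """ control the status of all submitted jobs """
        idle, run, finish, fail = 0, 0, 0, 0
        for id in self.submitted_ids:
            status = self.control_one_job(id)
            if status == 'I':
                idle += 1
            elif status == 'R':
                run += 1
            elif status == 'F':
                finish += 1
            else:
                fail += 1
        return idle, run, finish, fail

Looping over control_one_job is the simplest approach; querying the scheduler once for all jobs is usually more efficient on a busy cluster.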

The remaining functions are optional:

  1. remove
        @multiple_try()
        def remove(self, *args):
            """Clean the jobs on the cluster"""
    
    This should remove all the jobs submitted to your cluster. It will be called if you hit ctrl-c or if some jobs failed. (A sketch is given after this list.)
  2. submit2: this is the same as submit but is used when the user defines the optional argument cluster_temp_path = XXX.

In that case, no central disk is going to be used: all data are written to the path XXX (which is supposed to be a local disk on the node) and finally sent back to the shared disk.
By default, we first copy all required resources to XXX, then use the "submit" command to launch the job, and at the end of the job send all data back. If your cluster does not have a central disk, this is the function to edit.
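
A minimal remove for the same hypothetical batch system could look like the following sketch (mydel is a placeholder for your job deletion command):

    @multiple_try()
    def remove(self, *args):
        """Clean the jobs on the cluster"""
        import subprocess
        if not self.submitted_ids:
            return
        # 'mydel' is a placeholder for your deletion command (qdel, bkill, condor_rm, ...)
        cmd = ['mydel'] + [str(id) for id in self.submitted_ids]
        subprocess.call(cmd, stdout=open('/dev/null', 'w'), stderr=open('/dev/null', 'w'))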

TRICK

  1. One good idea is to store the job identification id in the list self.submitted_ids.

This helps to know which jobs are running, which ones to delete, and so on.

  2. Always put @multiple_try() before the function definition, and raise an error as soon as something goes wrong.

The multiple_try decorator will then retry later.

Cluster-specific help

SGE

  1. On some clusters it can be useful to replace:
    command = ['qsub','-o', stdout,
               '-N', me_dir,
               '-e', stderr,
               '-V']
    
    with
    command = ['qsub',
               '-S', '/bin/bash',
               '-o', stdout,
               '-N', me_dir,
               '-e', stderr,
               '-V']
    
