Changes between Version 5 and Version 6 of General-Scripts


Timestamp: Jun 14, 2015, 11:56:45 PM
Author: Olivier Mattelaer
== Running MadGraph / MadEvent in the parallel mode on your own cluster ==
     

You can add/edit those clusters to fit your needs. More information on that at this link:
https://answers.launchpad.net/madgraph5/+faq/2249

If you have a generic implementation for a cluster, don't hesitate to send it to us so that we can include it in the list of official clusters. This might be very useful for other users.

=== Description of cluster.py ===

All the commands related to cluster submission are implemented in the following file:
madgraph/various/cluster.py

This file contains:
   1. The class Cluster, which contains the basic commands common to all practical cluster implementations. You don't have to touch this.
   2. A class for each specific cluster (CondorCluster, GECluster, LSFCluster, MultiCore, ...).
   3. A dictionary from_name, which indicates which class to use for a given name:
{{{
from_name = {'condor':CondorCluster, 'pbs': PBSCluster, 'sge': SGECluster,
             'lsf': LSFCluster, 'ge':GECluster}
}}}

      (you need to keep it updated; see the example after this list)
   4. The rest is not important.
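
For instance, after adding the hypothetical MYCluster class described below, the dictionary would gain one entry (the name 'mycluster' is only an example):
{{{
from_name = {'condor': CondorCluster, 'pbs': PBSCluster, 'sge': SGECluster,
             'lsf': LSFCluster, 'ge': GECluster, 'mycluster': MYCluster}
}}}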

If you want to add support for a new cluster, you need to add a new class MYCluster
which has the class Cluster as parent:

{{{
class MYCluster(Cluster):
    """Basic class for dealing with cluster submission"""

    name = 'mycluster'
    job_id = 'JOBID'
}}}

Two class attributes should be defined (as shown above):
   1. '''name''': the name associated to the cluster
   2. '''job_id''': the name of a shell environment variable set by your cluster in order to give the job a unique identification (used only in "no central disk" mode).

Then you have to define the following functions:
   1. '''submit'''
{{{
    @multiple_try()
    def submit(self, prog, argument=[], cwd=None, stdout=None, stderr=None, log=None):
        """Submit a job prog to the cluster"""
}}}
       a. '''prog''' is the program to run
       a. '''argument''' is the list of arguments to pass to the program
       a. '''cwd''' is the directory from which the script has to be run [default is the current directory]
       a. '''stdout''' indicates where to write the output
       a. '''stderr''' indicates where to write the error
       a. '''log''' indicates where to write the cluster log/statistics of the job
       For the last three, you have to define your own default (often /dev/null).
       Note that stderr can be -2; in that case stderr should be written to the same file as stdout.

     This function should return the identification number associated to the submitted job.
     Note that the @multiple_try() decorator before the definition catches any error that occurs and retries a couple of seconds later. This prevents the code from crashing if the server is too busy at a given point.
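
     As an illustration, here is a minimal sketch of what submit could look like. It is not a real implementation: the submission command mysub, its flags, and the assumption that the job id is the first word printed on stdout are all made up and must be adapted to your scheduler (subprocess is imported at the top of cluster.py):
{{{
    @multiple_try()
    def submit(self, prog, argument=[], cwd=None, stdout=None, stderr=None, log=None):
        """Sketch: submit the job prog to a hypothetical 'mysub' batch system."""
        if stdout is None:
            stdout = '/dev/null'
        if stderr is None:
            stderr = '/dev/null'
        elif stderr == -2:             # -2 means: merge stderr with stdout
            stderr = stdout
        if log is None:
            log = '/dev/null'
        # 'mysub' and its options are placeholders for your real submission command
        command = ['mysub', '-o', stdout, '-e', stderr, '-l', log, prog] + argument
        output = subprocess.Popen(command, stdout=subprocess.PIPE,
                                  cwd=cwd).communicate()[0]
        id = output.split()[0]         # assumption: first word of the output is the job id
        if not id.isdigit():
            raise Exception('fail to submit to the cluster: %s' % output)
        self.submitted_ids.append(id)  # bookkeeping, see the TRICK section below
        return id
}}}
     Note that subprocess.Popen already handles cwd=None by running the command in the current directory, so only the output defaults need explicit treatment.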

   2. '''control_one_job'''
{{{
    @multiple_try()
    def control_one_job(self, id):
        """ control the status of a single job with its cluster id """
}}}
       a. '''id''': cluster identification number
       b. the function should return either:
           1. 'I': the job is not yet running
           1. 'R': the job is running
           1. 'F': the job is finished
           1. anything else will mark the job as failed
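
       As an illustration, a matching sketch for the same hypothetical scheduler, assuming a made-up status command mystat that prints a single status word for a given job id:
{{{
    @multiple_try()
    def control_one_job(self, id):
        """Sketch: ask the hypothetical 'mystat' command for the status of one job."""
        output = subprocess.Popen(['mystat', '-j', str(id)],
                                  stdout=subprocess.PIPE).communicate()[0]
        status = output.strip()
        if status in ('QUEUED', 'HELD'):   # made-up status words: adapt to your scheduler
            return 'I'
        elif status == 'RUNNING':
            return 'R'
        elif status == 'DONE':
            return 'F'
        else:
            return 'X'                     # anything else is treated as failed
}}}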

   3. '''control'''
{{{
    @multiple_try()
    def control(self, me_dir=None):
}}}

       a. '''me_dir''' is the MadEvent directory from which the jobs are run (most of the time useless)
       a. should return 4 numbers associated to the number of jobs in the Idle, Running, Finished, Failed states
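
       If your scheduler has no bulk query command, a simple (if inefficient) sketch is to poll the jobs one by one, relying on the self.submitted_ids list mentioned in the TRICK section below:
{{{
    @multiple_try()
    def control(self, me_dir=None):
        """Sketch: count the jobs in each state by polling them one by one."""
        idle, run, finish, fail = 0, 0, 0, 0
        for id in self.submitted_ids:
            status = self.control_one_job(id)
            if status == 'I':
                idle += 1
            elif status == 'R':
                run += 1
            elif status == 'F':
                finish += 1
            else:
                fail += 1
        return idle, run, finish, fail
}}}
       On a busy cluster, a single bulk query to the scheduler for all jobs at once is much preferable to one call per job.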

The remaining functions are optional:

   1. '''remove'''
{{{
    @multiple_try()
    def remove(self, *args):
        """Clean the jobs on the cluster"""
}}}
     This should remove all jobs submitted on your cluster. It will be called if you hit ctrl-c or if some jobs fail.
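
     A sketch, again with a made-up kill command mydel that accepts a list of job ids:
{{{
    @multiple_try()
    def remove(self, *args):
        """Sketch: kill every job that we have submitted."""
        if not self.submitted_ids:
            return
        subprocess.Popen(['mydel'] + self.submitted_ids).communicate()
}}}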

   2. '''submit2''': submit2 is the same as submit but is used when the user defines the optional argument (cluster_temp_path = XXX).
In that case, no central disk is going to be used: all data are written to the path XXX --which is supposed to be a node-local disk-- and finally sent back to the shared disk. [[BR]]By default, we first copy all required resources to XXX, then use the "submit" command to launch the job, and at the end of the job send back all data.
If your cluster doesn't have a central disk, this is the command to edit.
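A minimal sketch of the idea behind submit2 (the real signature in cluster.py also receives the lists of input/output files; the temporary path value and the wrapper script name used here are assumptions):
{{{
    def submit2(self, prog, argument=[], cwd=None, stdout=None, stderr=None,
                log=None, input_files=[], output_files=[], **opts):
        """Sketch: stage in, run on the node-local disk, stage out."""
        if cwd is None:
            cwd = os.getcwd()
        temp_path = '/scratch'          # assumption: the value of cluster_temp_path
        wrapper = os.path.join(cwd, 'job_wrapper.sh')   # hypothetical helper script
        script = ['#!/bin/bash',
                  # self.job_id is the environment variable giving a unique job id
                  'MYTMP=%s/job_$%s' % (temp_path, self.job_id),
                  'mkdir -p $MYTMP && cd $MYTMP']
        script += ['cp %s .' % os.path.join(cwd, f) for f in input_files]   # stage in
        script.append('%s %s' % (prog, ' '.join(argument)))                 # run locally
        script += ['cp %s %s' % (f, cwd) for f in output_files]             # stage out
        script.append('rm -rf $MYTMP')
        open(wrapper, 'w').write('\n'.join(script))
        os.chmod(wrapper, 0o755)
        # reuse the regular submit to send the wrapper to the cluster
        return self.submit(wrapper, cwd=cwd, stdout=stdout, stderr=stderr, log=log)
}}}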

=== TRICK ===

   1. One good idea is to store the identification id in the list self.submitted_ids.
This helps to know which jobs are running / which to delete / ...

   2. Always put @multiple_try() before the function definition, and raise an error as soon as something goes wrong;
the multiple_try function will then retry later.


== MadGraph4 information ==

=== If you have a ''central'' data disk ===
The current MadGraph / MadEvent version assumes all the scripts are run from a central data disk mounted on all cluster nodes (e.g. a home directory). If you have access to such a central disk, read the following. If not, please refer to the last section. Some scripts may assume the existence of a specific queue called
{{{
madgraph
}}}
. If you have difficulties with these, simply create this queue on your cluster (this can help to limit the number of CPUs, for example), or remove the =-q madgraph= options in the
{{{
run_XXX
}}}
scripts located in
{{{
bin
}}}
.

==== If you have a PBS-compatible batch managing system (PBSPro, !OpenPBS, Torque, ...) ====
This is the easiest case since the default configuration should work out of the box using the
{{{
qsub
}}}
command (and
{{{
qstat
}}}
and/or
{{{
qdel
}}}
if the whole set of commands is present). There is nothing special to do, just run the
{{{
generate_events
}}}
script as usual and select parallel mode.

==== If you use the Condor batch managing system ====
A "translation" script exists (see attachments of this page) to emulate the
{{{
qsub
}}}
command using the Condor syntax. This script should be tuned to fit your Condor installation and put in a directory listed in the =$PATH= variable.
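The attachment is the reference; purely as an illustration, the core of such a wrapper could look like the following Python sketch (the option handling is deliberately minimal and everything about the local setup is an assumption to adapt):
{{{
#!/usr/bin/env python
# Minimal 'qsub'-like wrapper: forwards a job script to Condor.
import os
import sys
import subprocess

job_script = os.path.abspath(sys.argv[-1])   # assume the last argument is the script
submit_file = job_script + '.cmd'
open(submit_file, 'w').write('\n'.join([
    'executable = %s' % job_script,
    'universe   = vanilla',
    'output     = %s.out' % job_script,
    'error      = %s.err' % job_script,
    'log        = %s.log' % job_script,
    'getenv     = True',                     # rough equivalent of qsub -V
    'queue',
    '']))
subprocess.call(['condor_submit', submit_file])
}}}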

==== If you use another batch managing system ====
This is the most complicated case. You can either:
   * Manually modify the
{{{
survey
}}}
,
{{{
refine
}}}
and
{{{
run_XXX
}}}
scripts located in the
{{{
bin
}}}
directory to force them to use your submission system.
   * Write a "translation" script like the one available for Condor (see attachment and the sketch above) to emulate the
{{{
qsub
}}}
command. If you manage to do this and want to share your script to help other MadGraph / MadEvent users, please feel free to edit this page.

=== If you do not have a ''central'' data disk ===
We are aware that the central disk solution may be inefficient or even impossible to set up on large clusters. We are thus working on a permanent solution. In the meantime, an intermediate solution using a temporary directory and the
{{{
scp
}}}
command exists.

-- Main.MichelHerquet - 02 Mar 2009

== Cluster specific help ==

=== SGE ===

   1. On some clusters it can be useful to replace:
{{{
        command = ['qsub','-o', stdout,
                   '-N', me_dir,
                   '-e', stderr,
                   '-V']
}}}
      with
{{{
        command = ['qsub',
                   '-S', '/bin/bash',
                   '-o', stdout,
                   '-N', me_dir,
                   '-e', stderr,
                   '-V']
}}}
      (the =-S /bin/bash= option tells SGE to interpret the job script with bash instead of the user's default shell)