=== Description of cluster.py ===

All the commands related to cluster submission are handled in the following file:
madgraph/various/cluster.py

This file contains:
 1. The class Cluster, which contains the basic commands for every practical cluster implementation. You do not have to touch this.
 2. A class for each specific cluster: (CondorCluster, GECluster, LSFCluster, MultiCore, ...)
 3. A dictionary from_name, which indicates which class to use for a given name:
{{{
from_name = {'condor':CondorCluster, 'pbs': PBSCluster, 'sge': SGECluster,
             'lsf': LSFCluster, 'ge':GECluster}
}}}
    (you need to keep it up to date; see the registration example further below)
 4. The rest is not important.

If you want to add support for a new cluster, you need to add a new class MYCluster
which has the class Cluster as parent:
| 33 | |
| 34 | {{{ |
| 35 | class MYCluster(Cluster): |
| 36 | """Basic class for dealing with cluster submission""" |
| 37 | |
| 38 | name = 'mycluster' |
| 39 | job_id = 'JOBID' |
| 40 | }}} |

Two class attributes should be defined (as shown above):
 1. '''name''': the name associated to the cluster
 2. '''job_id''': the name of a shell environment variable set by your cluster which gives the job a unique identification (used only in "no central disk" mode).
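For MadGraph to find the class from its name, the new class also has to be registered in the from_name dictionary shown above; the key is the name under which the cluster is selected (matching the '''name''' attribute):
{{{
from_name = {'condor':CondorCluster, 'pbs': PBSCluster, 'sge': SGECluster,
             'lsf': LSFCluster, 'ge':GECluster,
             'mycluster': MYCluster}
}}}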

Then you have to define the following functions:
 1. '''submit'''
{{{
@multiple_try()
def submit(self, prog, argument=[], cwd=None, stdout=None, stderr=None, log=None):
    """Submit a job prog to a GE cluster"""
}}}
  a. '''prog''' is the program to run
  a. '''argument''' is the list of arguments to pass to the program
  a. '''cwd''' is the directory from which the script has to be run [default is the current directory]
  a. '''stdout''' indicates where to write the output
  a. '''stderr''' indicates where to write the errors
  a. '''log''' indicates where to write the cluster log/statistics of the job
 For the last three, you have to define your own default (often /dev/null).
 Note that stderr can be -2: in that case, stderr should be written to the same file as stdout.

 This function should return the identification number associated to the submission.
 Note that the @multiple_try() decorator before the definition catches any error that occurs and retries a couple of seconds later. This prevents the code from crashing if the server is too busy at a given moment.
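 As an illustration only, here is a minimal sketch of what such a method could look like (it goes inside your MYCluster class, with the import at the top of cluster.py). It assumes a hypothetical submission command mysub that prints the job id as the last word of its output; the option names and the output parsing must of course be adapted to your own batch system:
{{{
import subprocess

@multiple_try()
def submit(self, prog, argument=[], cwd=None, stdout=None, stderr=None, log=None):
    """Submit the program prog via a hypothetical 'mysub' batch command."""
    stdout = stdout or '/dev/null'
    if stderr is None:
        stderr = '/dev/null'
    elif stderr == -2:
        # -2 means: write stderr to the same file as stdout
        stderr = stdout
    log = log or '/dev/null'

    command = ['mysub', '-o', stdout, '-e', stderr, '-l', log, prog] + argument
    output = subprocess.check_output(command, cwd=cwd).decode()

    # assume the last word printed by 'mysub' is the job id
    id = output.strip().split()[-1]
    if not id.isdigit():
        raise Exception('fail to submit to the cluster:\n%s' % output)
    self.submitted_ids.append(id)   # see the Tricks section below
    return id
}}}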

 2. '''control_one_job''':
{{{
@multiple_try()
def control_one_job(self, id):
    """ control the status of a single job with its cluster id """
}}}
  a. '''id''': the cluster identification number
  b. the function should return one of:
   1. 'I': the job is not yet running
   1. 'R': the job is running
   1. 'F': the job is finished
   1. anything else: the job will be considered as failed.
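 Again purely as a hypothetical sketch, assuming a status command mystat that prints a one-word state for the job (adapt both the command and the state names to your system):
{{{
import subprocess

@multiple_try()
def control_one_job(self, id):
    """ control the status of a single job with its cluster id """
    proc = subprocess.Popen(['mystat', str(id)],
                            stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    out, _ = proc.communicate()
    status = out.decode().strip()

    if status in ('PENDING', 'HELD'):
        return 'I'
    elif status == 'RUNNING':
        return 'R'
    elif status in ('DONE', ''):
        # an empty answer often means the job already left the queue
        return 'F'
    return status   # anything else will be counted as failed
}}}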

 3. '''control''':
{{{
@multiple_try()
def control(self, me_dir=None):
}}}

  a. '''me_dir''' is the MadEvent directory from which the jobs are run (most of the time not needed)
  a. should return four numbers corresponding to the number of jobs in the Idle, Running, Finished and Failed states
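 If your batch system has no efficient way to query all jobs at once, a simple (if not optimal) sketch can just loop over the stored ids and reuse control_one_job:
{{{
@multiple_try()
def control(self, me_dir=None):
    """ return the number of jobs in the Idle, Running, Finished, Failed states """
    idle, run, fine, fail = 0, 0, 0, 0
    for id in self.submitted_ids:
        status = self.control_one_job(id)
        if status == 'I':
            idle += 1
        elif status == 'R':
            run += 1
        elif status == 'F':
            fine += 1
        else:
            fail += 1
    return idle, run, fine, fail
}}}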

The rest is optional:

 1. '''remove'''
{{{
@multiple_try()
def remove(self, *args):
    """Clean the jobs on the cluster"""
}}}
 Should remove all jobs submitted to your cluster. This will be called if you hit ctrl-c or if some jobs fail.
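 A sketch, assuming a hypothetical mydel command that accepts a list of job ids:
{{{
import os
import subprocess

@multiple_try()
def remove(self, *args):
    """Clean the jobs on the cluster"""
    if not self.submitted_ids:
        return
    devnull = open(os.devnull, 'w')
    subprocess.call(['mydel'] + self.submitted_ids,
                    stdout=devnull, stderr=devnull)
}}}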

 2. '''submit2''': submit2 is the same as submit but is used when the user defines the optional argument (cluster_temp_path = XXX).
 In that case, no central disk is used: all data are written to the path XXX --which is supposed to be a node-local disk-- and finally sent back to the shared disk. [[BR]]By default, we first copy all required resources to XXX, then use the "submit" command to launch the job, and at the end of the job send all data back.
 If your cluster does not have a central disk, this is the method to edit (see the sketch below).
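 The exact signature of submit2 should be taken from the parent Cluster class before overriding it. Purely as an illustration of the wrapper-script idea, a hypothetical helper could look like this, with temp_path standing for the value of cluster_temp_path and self.job_id providing a unique working directory per job:
{{{
import os
import stat

def write_transfer_wrapper(self, prog, argument, cwd, temp_path):
    """Hypothetical helper: build a shell wrapper that runs prog on the
    node-local disk temp_path and copies the results back to cwd.
    The resulting script can then be passed to self.submit()."""
    wrapper = os.path.join(cwd, 'wrapper_%s.sh' % os.path.basename(prog))
    text = '''#!/bin/bash
# unique working directory on the node-local disk, based on the job id
WORKDIR=%(tmp)s/run_${%(job_id)s}
mkdir -p $WORKDIR
cp -r %(cwd)s/* $WORKDIR/
cd $WORKDIR
%(prog)s %(args)s
# send everything back to the shared disk
cp -r $WORKDIR/* %(cwd)s/
rm -rf $WORKDIR
''' % {'tmp': temp_path, 'job_id': self.job_id, 'cwd': cwd,
       'prog': prog, 'args': ' '.join(argument)}
    with open(wrapper, 'w') as fsock:
        fsock.write(text)
    os.chmod(wrapper, os.stat(wrapper).st_mode | stat.S_IEXEC)
    return wrapper
}}}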

=== Tricks ===

 1. One good idea is to store the identification ids in the list self.submitted_ids.
 This helps to know which jobs are running, which ones to delete, ...

 2. Always put @multiple_try() before the function definition, and raise an error as soon as something goes wrong;
 the multiple_try function will then retry later.
 3. On some clusters it can be useful to replace:
{{{
command = ['qsub','-o', stdout,
           '-N', me_dir,
           '-e', stderr,
           '-V']
}}}
 with
{{{
command = ['qsub',
           '-S', '/bin/bash',
           '-o', stdout,
           '-N', me_dir,
           '-e', stderr,
           '-V']
}}}