# Changes between Version 4 and Version 5 of IntroGrid

Timestamp: 05/02/12 09:34:17

==== How are the jobs organized and chosen? ====

As we all know, the error in an event sample with N events scales like $\sqrt{N}$. Hence there is no need to go beyond this precision in the generation of events either. So, when generating a small number of events, many channels that contribute only a small amount to the total cross section can be ignored. This is what running the gridpack does: because you are generating only a very small number of events with a given gridpack, the error is large and many subprocesses can be ignored.

However, we cannot simply ignore all the smallest subprocesses and evaluate only the largest ones: we want to combine the events from many gridpack jobs into one big event file. The relative error in this event file should be much smaller, and therefore events from subprocesses that have only a small contribution should also be evaluated. To overcome this problem, the contribution from each subprocess is calculated with high precision when the gridpack is created. The gridpack jobs then ''randomly include subprocesses based on their relative contributions'' to the total cross section.
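
The random inclusion of subprocesses can be illustrated with a small toy model. This is only a sketch and not the actual gridpack code: the subprocess names, their relative contributions and the helper functions {{{run_job}}} and {{{combine}}} are all hypothetical. Each toy job assigns its events to subprocesses with a probability proportional to the relative contribution, and every job uses its own random number seed.

```python
import random
from collections import Counter

# Hypothetical relative contributions of subprocesses to the total cross
# section (illustrative numbers only, not taken from any real process).
contributions = {"P0": 0.60, "P1": 0.25, "P2": 0.10, "P3": 0.04, "P4": 0.01}


def run_job(n_events, seed):
    """Toy gridpack job: assign each event to a subprocess at random,
    with probability proportional to its relative contribution."""
    rng = random.Random(seed)  # a different seed for every job
    names = list(contributions)
    weights = [contributions[name] for name in names]
    return Counter(rng.choices(names, weights=weights, k=n_events))


def combine(jobs):
    """Add up the per-subprocess event counts of many jobs."""
    total = Counter()
    for job in jobs:
        total.update(job)
    return total


if __name__ == "__main__":
    jobs = [run_job(n_events=20, seed=s) for s in range(200)]
    print("single job:", dict(jobs[0]))
    print("combined  :", dict(combine(jobs)))
```

A single small job will typically miss some of the small subprocesses, while the combined sample of many jobs with different seeds is expected to contain events from them in roughly the right proportion.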

Because we know the relative error for a given number of events for a single job, we can put a minimum on the number of events generated from one subprocess. We call this minimum number the "granularity". By default we set the granularity to the square root of the number of events. Therefore the minimum number of events generated from each subprocess is $\sqrt{N}$ and hence the maximum number of subprocesses calculated per gridpack job is $\frac{N}{\sqrt{N}} = \sqrt{N}$. Setting the granularity to the square root of the number of events makes sure that the smallest number of subprocesses needs to be calculated, while keeping the events calculated by a single gridpack job correct, in the sense that they are distributed correctly over all the subprocesses and phase space within the expected uncertainty.

====== An example ======
Suppose you want to generate 1 million events. With the gridpack you could choose to do 200 runs in which each run generates N=5000 events. (Remember to use a different random number seed for each run.) Each of the event samples returned by a single gridpack run has a physically distributed set of events, i.e. with an expected error of $\sqrt{5000}\approx 71$. So the granularity can safely be set to the same number, as giving it a lower value does not improve the error. Because the error is relatively large, there might be many important subprocesses that are not evaluated, but because the channels are chosen randomly, each gridpack run evaluates a different set of channels, such that the total error on the 1 million events is only $\sqrt{10^6}=1000$.

In specific cases the granularity could be increased, ''e.g.'', if you know beforehand that you will produce a lot of events. In this example, where the total number of events will be a million, the uncertainty will be $\sqrt{10^6}=1000$. Hence the granularity could have been set to 1000, as this will generate at least 1000 events for each channel. Hence you'll never be off by more than 1000 events. (The value for the granularity can be set by passing it as the 3rd argument when executing the {{{./run.sh}}} script.) However, this should be used with great care, because if not all the gridpack jobs can be retrieved, or if only a subset of the total number of events is analyzed, you are making an error. It is therefore '''highly recommended not to touch the default value for the granularity''' and to leave the 3rd argument of the {{{./run.sh}}} script empty.

==== Difference between normal cluster running / gridpack running ====

For the generation of events in normal cluster running, all the possible contributions to a given process are chopped into small parts and sent as jobs simultaneously to a computer cluster. Each of these little jobs evaluates a part of the total contribution and generates events for this small part. Only after ''all'' the jobs (and their generated events) are retrieved and combined is the final event sample created. The gridpack, on the other hand, should be executed on a single machine. In principle it will run all the little jobs described above in serial, except that the requested number of events is in general much smaller. Therefore only a small subset of this large number of jobs needs to be executed, chosen in such a way as to give an unbiased sample. The number of subprocesses evaluated is controlled by the granularity setting, see above.
As the number of events per job is small and the error on an event sample scales like $\sqrt{N}$, where N is the number of events, a great optimization can be included here. The major difference between a normal cluster run and gridpack running is therefore:

'''While for a normal MadGraph run events from all the subprocesses are included in the final event sample, for a gridpack run only a subset of the subprocesses is evaluated. This subset is randomly chosen according to the weight of each subprocess in the total cross section, keeping in mind that the error in N produced events scales like $\sqrt{N}$.'''
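
To make the role of the granularity concrete, the following toy sketch shows one possible way of choosing such a subset. It is an assumption made for illustration and not the MadGraph implementation: the functions {{{default_granularity}}} and {{{allocate_events}}} and the promote-or-drop rule are hypothetical. A subprocess whose expected share of events is below the granularity is either dropped or given exactly the granularity, with a probability chosen such that its average over many jobs remains the expected share.

```python
import math
import random


def default_granularity(n_events):
    """Default described in the text: square root of the number of events."""
    return math.sqrt(n_events)


def allocate_events(contributions, n_events, granularity, seed):
    """Toy allocation of events to subprocesses for one job.

    A subprocess whose expected share is at least the granularity keeps
    that share. A smaller one is either dropped or promoted to exactly
    'granularity' events, with probability expected/granularity, so that
    its average over many jobs is still the expected share.
    """
    rng = random.Random(seed)
    total = sum(contributions.values())
    allocation = {}
    for name, weight in contributions.items():
        expected = n_events * weight / total
        if expected >= granularity:
            allocation[name] = expected
        elif rng.random() < expected / granularity:
            allocation[name] = granularity
    return allocation


if __name__ == "__main__":
    # Numbers from the example in the text: 200 runs of 5000 events each.
    n_per_run, n_runs = 5000, 200
    g = default_granularity(n_per_run)
    print(f"granularity per run      : {g:.0f}")
    print(f"max subprocesses per run : {n_per_run / g:.0f}")
    print(f"error on combined sample : {math.sqrt(n_per_run * n_runs):.0f}")
    # Hypothetical contributions, for illustration only.
    toy = {"big": 0.9, "medium": 0.099, "tiny": 0.001}
    print(allocate_events(toy, n_per_run, g, seed=1))
```

In this sketch every subprocess that is evaluated receives at least the granularity in events, which is why a larger granularity means fewer subprocesses per job but a larger error if only part of the jobs is used.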