Recipes for common tasks
Using scalefactors
Scalefactors—CMS jargon for efficiency corrections for MC, typically
binned in lepton or jet kinematic variables—can be generalized to
functions that take some properties of a physics object and return a single
floating-point number.
The bamboo.scalefactors
module provides support for constructing
such callable objects from two different JSON formats, the CMS correctionlib
format, and the one used in the CP3-llbb framework, and the CMS BTV CSV
format.
CMS correctionlib JSON format
The bamboo.scalefactors.get_correction()
method loads a Correction
from the CorrectionSet
stored in a JSON file, and constructs a helper object
to use it in bamboo.
Since corrections are usually parameterised as function of e.g. kineamtic
properties of a reconstructed object, callables can be passed as parameters,
and the helper object called with a reconstructed object, e.g.
from bamboo.scalefactors import get_correction
sf = get_correction(..., params={"pt": lambda obj : obj.pt, ...})
mySel = noSel.refine(..., weight=sf(el))
Many of the arguments to bamboo.scalefactors.get_correction()
are
related to automatic systematic uncertainties: the name of a category axis in
the Correction
, the mapping between its categories and systematic variations
in bamboo, and the name of the nominal category—more details and
a complete example can be found in the reference documentation.
Please note that this needs the correctionlib package to be installed, see
the installation guide for more details.
Helper methods to configure and combine individual corrections for the purpose
of applying b-tagging scale factor and uncertainties are provided, see
bamboo.scalefactors.makeBtagWeightMeth1a()
and
bamboo.scalefactors.makeBtagWeightItFit()
.
CP3-llbb JSON format
Warning
The CP3-llbb json format and associated Bamboo functionalities are soon going to be deprecated in favour of the central JSON format and correctionlib (see above).
The bamboo.scalefactors
module provides support for constructing
such callable objects from the JSON format used in the CP3-llbb framework,
see some examples
here
(these JSON files are produced from the txt or ROOT files provided by the POGs
using simple python
scripts).
Like their inputs, the JSON files contain the nominal scale factor as well as
its up and down systematic variations, so the
ScaleFactor
behaves as a callable that takes
a physics object and an optional variation
keyword argument (technically,
it wraps a C++ object that gets the correct value from the JSON file).
The get_scalefactor()
method constructs such
objects from a nested dictionary:
the first key is a tag (as an example: “electron_2015_76”, for electrons in
2015 data, analysed with a CMSSW_7_6_X
release) and the second key is an
identifier of the selection they correspond to (e.g. id_Loose
).
The value inside this dictionary can be either a single path to a JSON file,
or a list of (periods, path)
pairs, where periods
is a list of run periods, in case scalefactors for different
running periods need to be combined (the periods
keyword argument to
get_scalefactor()
can be used to select only
a certain set of these periods).
The combination is done by either weighting or randomly sampling from the
different periods, according to the fraction of the integrated luminosity in
each (by passing combine="weight"
or combine="sample"
, respectively).
Jet flavour tagging and dilepton (e.g. trigger) scalefactors can also be
specified by passing tuples of the light, c-jet and b-jet scalefactor paths,
and tuples of first-if-leading, first-if-subleading, second-if-leading,
and second-if-subleading (to be reviewed for NanoAOD) scalefactor paths,
respectively, instead of a single path.
Histogram variations representing the shape systematic uncertainty due to an
uncertainty on the scalefactor values can be automatically produced by passing
a name to the systName
keyword argument of the
get_scalefactor()
method.
As an example, some basic lepton ID and jet tagging scalefactors could be included in an analysis on NanoAOD by defining
import bamboo.scalefactors
from itertools import chain
import os.path
# scalefactor JSON files are in ScaleFactors/<era>/ alongside the module
def localize_myanalysis(aPath, era="2016legacy"):
return os.path.join(os.path.dirname(os.path.abspath(__file__)), "ScaleFactors", era, aPath)
# nested dictionary with path names of scalefactor JSON files
# { tag : { selection : absole-json-path } }
myScalefactors = {
"electron_2016_94" : {
"id_Loose" : localize_myanalysis("Electron_EGamma_SF2D_Loose.json")
"id_Medium" : localize_myanalysis("Electron_EGamma_SF2D_Medium.json")
"id_Tight" : localize_myanalysis("Electron_EGamma_SF2D_Tight.json")
},
"btag_2016_94" : dict((k, (tuple(localize_myanalysis(fv) for fv in v))) for k,v in dict(
( "{algo}_{wp}".format(algo=algo, wp=wp),
tuple("BTagging_{wp}_{flav}_{calib}_{algo}.json".format(wp=wp, flav=flav, calib=calib, algo=algo)
for (flav, calib) in (("lightjets", "incl"), ("cjets", "comb"), ("bjets","comb")))
) for wp in ("loose", "medium", "tight") for algo in ("DeepCSV", "DeepJet") ).items())
}
# fill in some defaults: myScalefactors and bamboo.scalefactors.binningVariables_nano
def get_scalefactor(objType, key, periods=None, combine=None, additionalVariables=None, systName=None):
return bamboo.scalefactors.get_scalefactor(objType, key, periods=periods, combine=combine,
additionalVariables=(additionalVariables if additionalVariables else dict()),
sfLib=myScalefactors, paramDefs=bamboo.scalefactors.binningVariables_nano, systName=systName)
and adding the weights to the appropriate Selection
instances with
electrons = op.select(t.Electron, lambda ele : op.AND(ele.cutBased >= 2, ele.p4.Pt() > 20., op.abs(ele.p4.Eta()) < 2.5))
elLooseIDSF = get_scalefactor("lepton", ("electron_2016_94", "id_Loose"), systName="elID")
hasTwoEl = noSel.refine("hasTwoEl", cut=[ op.rng_len(electrons) > 1 ],
weight=[ elLooseIDSF(electrons[0]), elLooseIDSF(electrons[1]) ])
jets = op.select(t.Jet, lambda j : j.p4.Pt() > 30.)
bJets = op.select(jets, lambda j : j.btagDeepFlavB > 0.2217) ## DeepFlavour loose b-tag working point
deepFlavB_discriVar = { "BTagDiscri": lambda j : j.btagDeepFlavB }
deepBLooseSF = get_scalefactor("jet", ("btag_2016_94", "DeepJet_loose"), additionalVariables=deepFlavB_discriVar, systName="bTag")
hasTwoElTwoB = hasTwoEl.refine("hasTwoElTwoB", cut=[ op.rng_len(bJets) > 1 ],
weight=[ deepBLooseSF(bJets[0]), deepBLooseSF(bJets[1]) ])
Note that the user is responsible for making sure that the weights are only applied to simulated events, and not to real data!
CMS BTV CSV format
Warning
The BTV CSV reader and associated Bamboo functionalities are soon going to be deprecated in favour of the central JSON format and correctionlib (see above).
The bamboo.scalefactors.BtagSF
class wraps the
BTagCalibrationReader
provided by the BTV POG to read the custom CSV
format for b-tagging scalefactors (more details usage instructions can be
found in the reference documentation).
Please note that this will only read the scalefactors, which for most
methods for applying b-tagging scalefactors
need to be combined with efficiency and mistag probability maps measured
in simulation in the analysis phase space.
Pileup reweighting
Warning
The pileup weights maker and associated Bamboo functionalities are soon going to be deprecated in favour of the central JSON format and correctionlib (see above).
Pileup reweighting to make the pileup distribution in simulation match the one
in data is very similar to applying a scalefactor, except that the efficiency
correction is for the whole event or per-object—so the same code can be
used.
The makePUReWeightJSON
script included in bamboo can be used to make
a JSON file with weights out of a data pileup profile obtained by running
pileupcalc.py
(inside CMSSW, see the pileupcalc documentation for details), e.g. with
something like
pileupCalc.py -i ~/Cert_271036-284044_13TeV_23Sep2016ReReco_Collisions16_JSON.txt --inputLumiJSON /afs/cern.ch/cms/CAF/CMSCOMM/COMM_DQM/certification/Collisions16/13TeV/PileUp/pileup_latest.txt --calcMode true --minBiasXsec 69200 --maxPileupBin 80 --numPileupBins 80 ./2016PUHist_nominal.root
and a MC pileup profile.
Data pileup distributions corresponding to the golden JSON files for Run 2 are
provided by the luminosity POG, see
this hypernews annoncement.
The MC pileup profiles for used official CMSSW productions are
currently hardcoded inside the makePUReWeightJSON
, and can be specified
by their tag or name in that list; the available tags can be listed by
specifying the --listmcprofiles
option. The full command then becomes
something like
makePUReWeightJSON --mcprofile "Moriond17_25ns" --nominal=2016PUHist.root --up=2016PUHist_up.root --down=2016PUHist_down.root --makePlot
To include the weight when filling plots, it is sufficient to add the weight to
a selection (usually one of the topmost in the analysis, e.g. in the
prepareTree
method of the analysis module).
The bamboo.analysisutils.makePileupWeight()
method can be used to build
an expression for the weight, starting from the path of the JSON file with
weights from above, and an expression for the true number of interactions in the
event (mean of the Poissonian used), e.g. tree.Pileup_nTrueInt
for NanoAOD.
Cleaning collections
The CMS reconstruction sometimes ends up double-counting some objects. This can be because of the different quality criteria used to identify each object or because of the different performance and inner working of the reconstruction algorithms. Tau reconstruction for example operates on clusters that are usually reconstructed as jets, and on top of that it can easily pick up even isolated muons or electrons as taus (i.e. as clusters of energy with one, two, or three tracks).
It is oftentimes necessary therefore to clean a collection of objects by excluding any object that is spatially in the sample place of another object whose reconstruction we trust more.
We trust more muon and electron reconstrution than tau reconstruction, after all the quality cuts (ID efficiencies for muons and electrons are around 99.X%, whereas tau ID efficiencies are of the order of 70%. Misidentification rates are similarly quite different), and therefore we exclude from the tau collection any tau that happens to include within its reconstruction cone a muon or an electron.
Bamboo provides a handy syntax for that, resulting in something like
cleanedTaus = op.select(taus, lambda it : op.AND(
op.NOT(op.rng_any(electrons, lambda ie : op.deltaR(it.p4, ie.p4) < 0.3 )),
op.NOT(op.rng_any(muons, lambda im : op.deltaR(it.p4, im.p4) < 0.3 ))
))
In this example, we assume that the collections taus
, electrons
, and
muons
, have already been defined via
taus = op.select(t.Tau, lambda tau : ...)
, and we move on to use the method
op.rng_any()
to filter all taus that are within a cone of a given size
(0.3, in the example) from any selected electron or muon.
Jet and MET systematics
For propagating uncertainties related to the jet energy calibration, and the difference in jet energy resolution between data and simulation, each jet in the reconstructed jet collection should be modified, the collection sorted, and any derived quantity re-evaluated.
How to do this depends on the input trees: in production NanoAOD the modified
momenta need to be calculated using the jet energy correction parameters; it is
also possible to add them when post-processing with the
jetmetUncertainties module of the NanoAODTools package.
In the latter case the NanoAOD decoration method will pick up the modified
branches if an appropriate
NanoSystematicVarSpec
entry (e.g.
nanoReadJetMETVar
or
nanoReadJetMETVar_METFixEE2017
) is added to
the systVariations
attribute of the
NanoAODDescription
that is passed to the
prepareTree()
(or
decorateNanoAOD()
) method.
To calculate the variations on the fly, two things are needed: when decorating
the tree, some redirections should be set up to pick up the variations from a
calculator module, and then this module needs to be configured with the correct
JEC and resolution parameters.
The first step can be done by adding
nanoJetMETCalc
(or
nanoJetMETCalc_METFixEE2017
) to the
systVariations
attribute of the
NanoAODDescription
that is passed to the
prepareTree()
method (which will
pass this to the decorateNanoAOD()
method);
these will also make sure that all these variations are propagated to the
missing transverse momentum.
Next, a calculator must be added and configured.
This can be done with the bamboo.analysisutils.configureJets()
and
bamboo.analysisutils.configureType1MET()
methods, which provide a
convenient way to correct the jet resolution in MC, apply a different JEC, and
add variations due to different sources of uncertainty in the jet energy scale,
for the jet collection and MET, respectively (the arguments should be the same
in most cases).
Note
The jet and MET calculators were moved to a separate package.
Since these calculators are C++ classes with an RDF-friendly interface and
minimal dependencies, they are not only useful from bamboo, but also from
other (RDF-based or similar) frameworks.
Therefore they were moved to a separate repository
cp3-cms/CMSJMECalculators.
They can be installed with e.g.
pip install git+https://gitlab.cern.ch/cp3-cms/CMSJMECalculators.git
.
As an example, the relevant code of a NanoAOD analysis module could look like this to apply a newer JEC to 2016 data and perform smearing, add uncertainties to 2016 MC, and the same for the MET:
class MyAnalysisModule(NanoAODHistoModule):
def prepareTree(self, tree, sample=None, sampleCfg=None):
tree,noSel,be,lumiArgs = super(MyAnalysisModule, self).prepareTree(tree, sample=sample, sampleCfg=sampleCfg
, NanoAODDescription.get("v5", year="2016", isMC=self.isMC(sample), systVariations=[nanoJetMETCalc]))
from bamboo.analysisutils import configureJets, configureType1MET
isNotWorker = (self.args.distributed != "worker")
era = sampleCfg["era"]
if era == "2016":
if self.isMC(sample): # can be inferred from sample name
configureJets(tree._Jet, "AK4PFchs",
jec="Summer16_07Aug2017_V20_MC",
smear="Summer16_25nsV1_MC",
jesUncertaintySources=["Total"],
mayWriteCache=isNotWorker,
isMC=self.isMC(sample), backend=be)
configureType1MET(tree._MET,
jec="Summer16_07Aug2017_V20_MC",
smear="Summer16_25nsV1_MC",
jesUncertaintySources=["Total"],
mayWriteCache=isNotWorker,
isMC=self.isMC(sample), backend=be)
else:
if "2016G" in sample or "2016H" in sample:
configureJets(tree._Jet, "AK4PFchs",
jec="Summer16_07Aug2017GH_V11_DATA",
mayWriteCache=isNotWorker,
isMC=self.isMC(sample), backend=be)
configureType1MET(tree._MET,
jec="Summer16_07Aug2017GH_V11_DATA",
mayWriteCache=isNotWorker,
isMC=self.isMC(sample), backend=be)
elif ...: ## other 2016 periods
pass
return tree,noSel,be,lumiArgs
Both with variations read from a postprocessed NanoAOD and calculated on the
fly, the different jet collections are available from t._Jet
, e.g.
t._Jet["nom"]
(postprocessed) or t._Jet["nominal"]
(calculated),
t._Jet["jerup"]
, t._Jet["jerdown"]
, t._Jet["jesTotalUp"]
,
t._Jet["jesTotalDown"]
etc. depending on the configured variations
(when accessing these directly, t._Jet[variation][j.idx]
should be used
to retrieve the entry corresponding to a specific jet j
, if the latter is
obtained from a selected and/or sorted version of the original collection—object.idx
is always the index in the collection as found in the tree).
t.Jet
will be changed for one of the above for each systematic variation,
if it affects a plot, in case automatically producing the systematic variations
is enabled (the collections from t._Jet
will not be changed).
The automatic calculation of systematic variations can be disabled globally
or on a per-selection basis (see above), and for on the fly calculation also by
passing enableSystematics=[]
to
bamboo.analysisutils.configureJets()
).
The jet collection as stored on the input file, finally, can be retrieved as
t._Jet.orig
.
Important
Sorting the jets
No sorting is done as part of the above procedure, so if relevant this
should be added by the user (e.g. using
jets = op.sort(t.Jet, lambda j : -j.pt)
for sorting by decreasing
transverse momentum).
In a previous version of the code this was included, but since some selection
is usually applied on the jets anyway, it is simpler (and more efficient) to
perform the sorting then.
Important
Bamboo runs outside CMSSW and has no access to the conditions
database, so it fetches the necessary txt files from the repositories
on github (they are quite large, so this is more efficient than storing
a clone locally). They can be automatically updated if the upstream
repository changes; the mayWriteCache
argument to
bamboo.analysisutils.configureJets()
(see the example above)
helps ensure that only one process write to the cache at a time.
In practice, updating the local cache when the corrections have changed
can be done by running an analysis module once in non-distributed mode
using the –onlyprepare –maxFiles 1 arguments.
In case of doubt one can use the checkCMSJMEDatabaseCaches
script
to update or check the cache interactively and, as a last resort, remove
the cache directories and status files from ~/.bamboo/cache
:
they will be recreated automatically at the next use.
Note
Isn’t it slow to calculate jet corrections on the fly?
It does take a bit of time, but the calculation is done in one C++ module,
which should not be executed more than once per event (see the explanation
of the bamboo.analysisutils.forceDefine()
method in the
section above).
In most realistic cases, the bottleneck is in reading and decompressing the
input files, so the performance hit from the jet corrections should usually
be acceptable.
Rochester correction for muons
The so-called
Rochester correction
removes a bias in the muon momentum, and improves the agreement between data
and simulation in the description of the Z boson peak.
As for the jet correction and variations described in the previous section,
this can either be done during postprocessing, with the
muonScaleResProducer module of the NanoAODTools package, or on the fly.
To adjust the decorators, a suitable
NanoSystematicVarSpec
instance to read the
corrected values, or nanoRochesterCalc
to use
the calculated values, should be added to the systVariations
attribute of the NanoAODDescription
that is
passed to the prepareTree()
(or
decorateNanoAOD()
) method.
The on the fly calculator should be added and configured with the
bamboo.analysisutils.configureRochesterCorrection()
method,
as in the example below.
tree._Muon
keeps track of everything related to the calculator; the
uncorrected muon collection can be found in tree._Muon.orig
, and the
corrected muons are in tree.Muon
.
class MyAnalysisModule(NanoAODHistoModule):
def prepareTree(self, tree, sample=None, sampleCfg=None):
tree,noSel,be,lumiArgs = NanoAODHistoModule.prepareTree(self, tree, sample=sample, sampleCfg=sampleCfg, calcToAdd=["nMuon"])
from bamboo.analysisutils import configureRochesterCorrection
era = sampleCfg["era"]
if era == "2016":
configureRochesterCorrection(tree._Muon, "RoccoR2016.txt", isMC=self.isMC(sample), backend=be)
return tree,noSel,be,lumiArgs
Energy correction for taus
Similar to muons, the energy of taus also requires correction. This can be done
on the fly. To adjust the decorators, nanoTauESCalc
to use
the calculated values, should be added to the systVariations
attribute of the NanoAODDescription
that is
passed to the prepareTree()
(or
decorateNanoAOD()
) method.
The on the fly calculator should be added and configured with the
bamboo.analysisutils.configureTauESCorrection()
method,
as in the example below.
tree._Tau
keeps track of everything related to the calculator; the
uncorrected tau collection can be found in tree._Tau.orig
, and the
corrected taus are in tree.Tau
.
class MyAnalysisModule(NanoAODHistoModule):
def prepareTree(self, tree, sample=None, sampleCfg=None):
tree,noSel,be,lumiArgs = NanoAODHistoModule.prepareTree(self, tree, sample=sample, sampleCfg=sampleCfg, calcToAdd=["nTau"])
from bamboo.analysisutils import configureTauESCorrection
era = sampleCfg["era"]
if era == "2016":
configureTauESCorrection(tree._Tau, "tau2016.json.gz", tauIdAlgo="DeepTau2017v2p1", backend=be)
return tree,noSel,be,lumiArgs
Correlating systematic variations
To understand how systematic variations are implemented in bamboo, and how to
take advantage of that to correlate e.g. a b-tagging scalefactor variation with
a jet and MET kinematic variation, it is useful to remember that your code
creates expressions that are converted to C++ code, and
imagine a variable with a systematic uncertainty as a single nominal value with
a dictionary of alternative values: the keys of this dictionary are the
variation names, e.g. elIDup
or jerdown
.
This is also how they are represented in the expression objects tree.
When creating a plot or selection, the variable(s), weight(s), and cut(s) are
scanned for such nodes with systematic variations, and additional RDataFrame
nodes are created for all the variations.
There are two interesting consequences of the this dictionary with variations:
all variations are equal, i.e. there is no concept of “uncertainty X with
e.g. up and down variations”—although this is very common in practice, and
trivial to reconstruct from the dictionary where needed—and all expression
nodes with the same variation change together.
The latter is necessary in many cases, e.g. when passing the MET and some jet
pt’s to a multivariate classifier, both should pass the jerdown
variation
to get the corresponding variation of the classifier output.
It also provides a very transparent way to correlate variations: if the name is
the same, they will be simultaneously varied—so it is enough that
a b-tagging scalefactor variation is called jesAbsup
to be varied together
with that variation of the jet pt’s; turning that around: to be varied
independently, the names must be made different (this is why up
and down
by themselves as variation names lead to an error message being printed).
Splitting a sample into sub-components
It is frequently necessary to split a single Monte-Carlo sample into different processes, depending on generator-level information, or simply to add some cuts at generator level (e.g. to stitch binned samples together). This can be achieved by duplicating that sample in the analysis configuration file for as many splits as are needed, and putting (any) additional information into that sample’s entry, e.g. as:
ttbb:
db: das:/TTToSemiLeptonic_TuneCP5_13TeV-powheg-pythia8/RunIIAutumn18NanoAODv5-Nano1June2019_102X_upgrade2018_realistic_v19-v1/NANOAODSIM
era: 2018
group: ttbb
subprocess: ttbb
cross-section: 366.
generated-events: genEventSumw
ttjj:
db: das:/TTToSemiLeptonic_TuneCP5_13TeV-powheg-pythia8/RunIIAutumn18NanoAODv5-Nano1June2019_102X_upgrade2018_realistic_v19-v1/NANOAODSIM
era: 2018
group: ttjj
subprocess: ttjj
cross-section: 366.
generated-events: genEventSumw
That information can then be retrieved in the analysis module through the sampleCfg
keyword argument, to add additional cuts to the selection when preparing the tree:
def prepareTree(self, tree, sample=None, sampleCfg=None):
tree,noSel,be,lumiArgs = super(MyAnalysisModule, self).prepareTree(tree, sample=sample, sampleCfg=sampleCfg)
if "subprocess" in sampleCfg:
subProc = sampleCfg["subprocess"]
if subProc == "ttbb":
noSel = noSel.refine(subProc, cut=(tree.genTtbarId % 100) >= 52)
elif subProc == "ttjj":
noSel = noSel.refine(subProc, cut=(tree.genTtbarId % 100) < 41)
return tree,noSel,be,lumiArgs
Adding command-line arguments
The base analysis module,
bamboo.analysismodules.AnalysisModule
, calls the
addArgs()
method (the default
implementation does nothing) when constructing the command-line arguments
parser (using the argparse module).
Analysis modules can reimplement this method to specify more arguments, e.g.
class MyModule(...):
def addArgs(self, parser):
super(MyModule, self).addArgs(parser)
parser.add_argument("--whichPlots", type=str,
default="control",
help="Set of plots to produce")
The parsed arguments are available under the args
member variable, e.g.
self.args.whichPlots
in the example above.
The complete list of command-line options (including those specified in the
analysis module) can be printed with bambooRun -h -m myModule.py.MyModule
.
In fact the parser argument is an
argument group,
so they are listed separately from those in the base class.
This is also used to copy all user-defined arguments to the commands that are
passed to the worker tasks, when running in distributed mode.
Editing the analysis configuration
Similarly to the above, it is possible to modify the analysis configuration (loaded from the YAML file) from a module before the configuration is used to create jobs (in distributed mode), run on any file (in sequential mode), or run plotIt (in the postprocessing step). This allows e.g. to change the samples that are going to be used, change the list of systematics, etc., without having to edit manually the YAML file or maintaining separate files. Below is an example of how this works:
class MyModule(...):
def customizeAnalysisCfg(self, analysisCfg):
for smp in list(analysisCfg["samples"]):
if not analysisCfg["samples"][smp].get("is_signal", False):
del analysisCfg["samples"][smp]
Evaluate an MVA classifier
Several external libraries can be used to evaluate the response of MVA
classifiers inside expressions.
For convenience, a uniform interface is defined that uses a vector of floats
as input and output, with implementations available for PyTorch,
Tensorflow, lwtnn, TMVA, ONNX Runtime, and SOFIE.
That works as follows (see the documentation for the
bamboo.treefunctions.mvaEvaluator()
method for a detailed description,
additional options may be needed, depending on the type):
mu = tree.Muon[0]
nn1 = mvaEvaluator("nn1.pt", mvaType="Torch")
Plot.make1D("mu_nn1", nn1(mu.pt, mu.eta, mu.phi), hasMu)
For Tensorflow, PyTorch, and ONNX Runtime multiple inputs (and inputs with
different types) are also supported.
In that case, no automatic conversion is performed, so the inputs should be
passed in the correct format (most of the time the number of inputs per node
is known, so arrays constructed with bamboo.treefunctions.array()
are a good choice).
Warning
Especially for PyTorch and Tensorflow, setting up an installation where the necessary C(++) libraries are correctly identified, and compatible with the CPU capabilities, is not always trivial. See this section in the installation guide for more information.
Skims for training a classifier can also straightforwardly be produced with bamboo.
Obtaining a classifier in the right format
All MVA inference is done through the C(++) APIs provided by the different machine learning and inference libraries, which means that the model should be stored in the appropriate format (often with some conversion step).
ONNX and lwtnn are formats for the exchange and inference of neural networks, so they need converters from the model building and/or training framework. Converting Keras models to lwtnn is described on the lwtnn wiki. PyTorch comes with ONNX export included. Most Keras models can also easily be exporter to ONNX with keras2onnx.
The PyTorch evaluator uses TorchScript, this tutorial is a very good starting point if your model is trained with PyTorch.
TMVA uses an XML format which probably also just works. The TMVA reader will work with multi-threading, but the reader class uses locking because the internal TMVA classes are not thread-safe, so performance will be degraded if aggressive multi-threading is used and the TMVA evaluation dominates the CPU usage.
For Keras models Tensorflow is the most natural fit. Please note that the frozen graph is needed, see e.g. keras_to_tensorflow, this detailed explanation, and this script for an example of how to do so.
SOFIE allows one to evaluate models in ONNX, PyTorch or Keras format, provided they have been first converted into a header and weight files using the helpers available in ROOT (see the ROOT documentation and tutorials for how to convert models). While a limited set of models are supported (only a few types of layers are implemented in SOFIE), if the conversion is possible, model evaluation in SOFIE has the potential to be significantly faster than using the ONNX, Tensorflow or PyTorch APIs. Note that SOFIE is only supported in ROOT >= 6.26.04 but is not enabled by default, so you’ll need to make sure that your ROOT build has SOFIE enabled.
Testing the evaluation outside RDataFrame
MVA inference with all the libraries described above is done by creating
an instance of an evaluator class, which provides a similar
RDataFrame-friendly interface: the filename of te saved model and additional
options are passed to the constructor, and an evaluate method that takes the
input values and returns the returns the MVA outputs is called from inside the
RDataFrame graph.
It is straightforward to do the same from PyROOT: for each framework there is a
method in the bamboo.root
to load the necessary shared libraries and
evaluator class.
After calling this method, an evaluator can be instantiated and tested with
some simple arguments.
This is done in the bamboo tests,
so these can serve as an example (links for the the relevant fragments:
test_tensorflow,
test_lwtnn,
test_libtorch;
TMVA is directly included in ROOT, so it is sufficient to retrieve the
TMVA::Experimental::RReader
class).
Make combined plots for different selections
It is rather common to define categories with e.g. different lepton flavours
and selections, but then make plots with the entries from these (disjoint)
sets of events combined.
Given the structure of the RDataFrame graph and the
Selection
tree, the most convenient way to achieve
this is by defining the histograms for each category, and make a merged
histogram later on.
The SummedPlot
class does exactly this, and since it
presents the same interface to the analysis module as a regular
Plot
, it can simply be added to the same plot list
(to produce only the combined plot and not those for the individual
contributions, it is sufficient to not add the latter to the plot list), e.g.
from bamboo.plots import Plot, SummedPlot, EquidistantBinning
mjj_mumu = Plot.make1D("Mjj_MuMu", op.invariant_mass(jets[0].p4, jets[1].p4),
sel_mumu, EquidistantBinning(50, 20., 120.))
mjj_elel = Plot.make1D("Mjj_ElEl", op.invariant_mass(jets[0].p4, jets[1].p4),
sel_elel, EquidistantBinning(50, 20., 120.))
mjj_sum = SummedPlot("Mjj", [mjj_mumu, mjj_elel], title="m(jj)")
plots += [ mjj_mumu, mjj_elel, mjj_sum ] # produce all plots
The other plot properties of a SummedPlot
(titles,
labels etc.) can be specified with keyword arguments to the constructor;
otherwise they are taken from the first component plot.
Note
SummedPlot
simply adds up the histograms,
it is up to the user to make sure an event can only enter one of the
categories, if this is what it is used for.
Producing skimmed trees
The bamboo.plots.Skim
class allows to define skimmed trees to save
in the output file.
Since this uses the Snapshot
method from RDataFrame, there will be an entry
for each event that passes the selection, so in some cases (e.g. MVA training)
additional manipulations may need to be done on these outputs.
A second limitation is that, as for plots, a skim is attached to a selection,
which means that if different categories need to be combined, multiple skims
should be defined, and the stored products merged—but multiple skims
can now be produced in the same job, thanks to the lazy Snapshot calls.
The main advantage over the SkimmerModule
(which still exists for backwards compatibility) is that the same module can
produce plots and skims, with the same selections and definitions (in practice
a command-line option would usually
be added to select some products), e.g.
from bamboo.plots import Plot, Skim
twoMuSel = noSel.refine("twoMuons", cut=[ op.rng_len(muons) > 1 ])
mll = op.invariant_mass(muons[0].p4, muons[1].p4)
if self.args.makeSkim:
plots.append(Skim("dimuSkim", {
"run": None, # copy from input file
"luminosityBlock": None,
"event": None,
"dimu_M": mll,
"mu1_pt": muons[0].pt,
"mu2_pt": muons[1].pt,
}, twoMuSel))
else:
plots.append(Plot.make1D("dimu_M", mll, twoMuSel,
EquidistantBinning(100, 20., 120.)))
Postprocessing beyond plotIt
The HistogramsModule
postprocessing method
calls plotIt to make the usual data and simulation stack plots (for the
different eras that are considered), and prints the counter values of cut flow
reports, but since all possible (meta-)information is available there, as well
as the filled histograms, it can be useful to do any further processing there
(e.g. running fits to the distributions, dividing histograms to obtain scale
factors or fake rates, exporting counts and histograms to a different format).
For many simple cases, it should be sufficient to override the
postProcess()
method, first
call the base class method, and then do any additional processing.
If the base class method is not called, the plot list should be constructed
by calling the getPlotList()
method.
Most of the other code, e.g. to generate the plotIt YAML configuration file,
is factored out in helper methods to allow reuse from user-defined additions—see the bamboo.analysisutils.writePlotIt()
and
bamboo.analysisutils.printCutFlowReports()
methods, and their
implementation.
Note
getPlotList()
,
when called without a specified file and sample, will read a so-called
skeleton file for an arbitrary sample (essentially an empty tree with the
same format as the input—typically for the first sample encountered)
from the results directory and calls the
definePlots()
method with
that to obtain the list of defined plots.
This is also done when running with the --onlypost
option, and works as
expected when the same plots are defined for all samples.
If this assumption does not hold, some customisation of the
definePlots()
method will
be necessary.
It is also possible to skip the writing of a plotIt YAML file, and directly load the configuration as it would be parsed by plotIt with its partial python reimplementation pyplotit, which makes it easy to access the scaled grouped and stacked histograms.
As an example, a simple visualisation of 2D histograms could be obtained with
def postProcess(self, taskList, config=None, workdir=None, resultsdir=None):
super(MyModule, self).postProcess(taskList, config=config, workdir=workdir, resultsdir=resultsdir)
from bamboo.plots import Plot, DerivedPlot
plotList_2D = [ ap for ap in self.plotList if ( isinstance(ap, Plot) or isinstance(ap, DerivedPlot) ) and len(ap.binnings) == 2 ]
from bamboo.analysisutils import loadPlotIt
p_config, samples, plots_2D, systematics, legend = loadPlotIt(config, plotList_2D, eras=self.args.eras[1], workdir=workdir, resultsdir=resultsdir, readCounters=self.readCounters, vetoFileAttributes=self.__class__.CustomSampleAttributes, plotDefaults=self.plotDefaults)
from plotit.plotit import Stack
from bamboo.root import gbl
for plot in plots_2D:
obsStack = Stack(smp.getHist(plot) for smp in samples if smp.cfg.type == "DATA")
expStack = Stack(smp.getHist(plot) for smp in samples if smp.cfg.type == "MC")
cv = gbl.TCanvas(f"c{plot.name}")
cv.Divide(2)
cv.cd(1)
expStack.obj.Draw("COLZ")
cv.cd(2)
obsStack.obj.Draw("COLZ")
cv.Update()
cv.SaveAs(os.path.join(resultsdir, f"{plot.name}.png"))
Data-driven backgrounds and subprocesses
In many analyses, some backgrounds are estimated from a data control region,
with some per-event weight that depends on the physics objects found etc.
This can be largely automatised: besides the main
Selection
, one or more instances with alternative
cuts (the control region instead of the signal region) and weights (the
mis-ID, fake, or transfer factors). That is exactly what is done by the
SelectionWithDataDriven
class: its
create()
method is like
bamboo.plots.Selection.refine()
, but with alternative cuts and weights
to construct the correctly reweighted control region besides the signal region.
Since it supports the same interface as Selection
,
further selections can be applied to both regions at the same time, and every
Plot
will declare the histograms for both.
The additional code for configuring which data-driven contributions to use,
and to make sure that histograms for backgrounds end up in a separate file
(such that they can transparently be used e.g. in plotIt), the analysis module
should inherit from
DataDrivenBackgroundHistogramsModule
(or
DataDrivenBackgroundAnalysisModule
if the
histogram-specific functionality is not required).
Data-driven contributions should be declared in the YAML configuration file with the lists of samples or groups from which the background estimate should be obtained, those that are replaced by it, e.g.
datadriven:
chargeMisID:
uses: [ data ]
replaces: [ DY ]
nonprompt:
uses: [ data ]
replaces: [ TTbar ]
The --datadriven
command-line argument can then be used to specify which of
these should be used (all
, none
, or an explicit list).
Several can be specified in the same run: different sets will then be produced.
The parsed versions are available as the datadrivenScenarios
attribute of
the module (and the contributions as datadrivenContributions
).
The third argument passed to the
create()
method should
correspond to one of the contribution names in the YAML file, e.g. (continuing
the example above):
hasSameSignElEl = SelectionWithDataDriven.create(hasElElZ, "hasSSDiElZ", "chargeMisID",
cut=(diel[0].Charge == diel[1].Charge),
ddCut=(diel[0].Charge != diel[1].Charge),
ddWeight=p_chargeMisID(diel[0])+p_chargeMisID(diel[1]),
enable=any("chargeMisID" in self.datadrivenContributions and self.datadrivenContributions["chargeMisID"].usesSample(sample, sampleCfg))
)
The generation of modified sample configuration dictionaries in the plotIt
YAML file can be customised by replacing the corresponding entry in the
datadrivenContributions
dictionary with a variation of a DataDrivenContribution
instance.
A very similar problem is the splitting of a sample into different
contributions based on some generator-level quantities, e.g. the number of
(heavy-flavour) partons in the matrix element.
In this case, splitting the RDF graph early on, such that each event is
processed by a nearly identical branch of it, would not be very efficient.
The bamboo.plots.LateSplittingSelection
class, a variation of
bamboo.plots.SelectionWithDataDriven
, may help in such cases:
it will branch the RDF graph only when attaching plots to a selection, so it
can be constructed earlier, but the RDF graph branching will be minimal.
By default the combined plot is also saved because it helps avoid
duplication in the graph, but this may be disabled by passing
keepInclusive=False
to the
create()
method.
To make sure the resulting histograms are saved, an analysis module that makes
use of SelectionWithDataDriven
should inherit from
bamboo.analysismodules.HistogramsModuleWithSub
; since the use case
is rather specific, no customisation to the postprocessing method is done,
but in most cases it should be straightforward to manipulate the samples
dictionary in the configuration before calling the superclass’ postprocessing
method, see e.g. this recipe.
Dealing with (failed) batch jobs
When splitting the work over a set of batch jobs using the
--distributed=driver
option (see the bambooRun options
reference), some may fail for various reasons: CPU time or memory limits
that are too tight, environment or hardware issues on the worker node,
or bugs in the analysis or bamboo code.
The monitoring loop will check the status of the running jobs every two
minutes, print information when some fail, merge outputs if all jobs for
a sample complete, and finally run the postprocessing when all samples are
processed, or exit when no running jobs remain.
Currently (improvements and additions are being discussed in
issue #87) resubmission
of the failed jobs and monitoring of the recovery jobs, after identifying the
reason why they fail, needs to be done using the tools provided by the batch
system (sbatch --array=X,Y,Z ...
for slurm; for HTCondor a helper script
bambooHTCondorResubmit
is provided that takes a very similar set of options—the commands are also printed by the monitoring loop).
When the outputs for all jobs that initially failed have been produced,
bambooRun
can be used with the --distributed=finalize
option (and
otherwise all the same options as for the original submission) to do any
remaining merging of outputs, and run the postprocessing step.
If some outputs are missing it will suggest a resubmission command and exit.
This only looks at the output files that are found to decide what still needs
to be done, so if a file in the results/
subdirectory of the output is
present, it will assume that is valid—this can be exploited in two ways:
if anything goes wrong in the merging, removing the results/<<sample>>.root
and running with --distributed=finalize
will try that again (similarly,
removing a corrupt job output file will add it to the resubmission command),
and if a sample is processed with a different splitting it is sufficient to put
the merged output file in the results/
directory.
Note
Understanding why batch jobs fail is not always easy,
and the specifics depend on the batch system and the environment
Bamboo collects all possible log files (standard output and error,
submission log) in the batch/logs
directory, and per-job inputs and
output in batch/input
and batch/output
, respectively.
In principle the worker jobs run in the same environment as where they are submitted, and typically take all software is installed from CVMFS, so most problems with batch jobs are related to data access, e.g. overloaded storage or permissions to access some resources. When reading files through XRootD a grid proxy is needed, at CERN the easiest is to create it in an AFS directory and pass that to the job.
Reproducible analysis: keep track of the version that produced some results
While bamboo does not by default force you to adopt a specific workflow, it can help with adopting some best practices for reproducible analysis. The most important thing is to keep the analysis code under version control: git is widely used for this (see the Pro Git book for an introduction). The idea is to keep the analysis code and configurations in a separate directory, which is tracked by git, from the bamboo outputs (plots, results etc.)—this can also be a subdirectory that is ignored by git, if you prefer.
bambooRun
will write a file with the git version of the repository where
the module and configuration file are found to the output directory: the
version.yml
file.
This will also contain the list of command-line arguments that were passed,
and the bamboo version.
In order for this to work, the analysis repository must at least have all local
changes committed, but it is even better to create a tag for versions that are
used to produce results, and push it to GitHub or GitLab (see e.g.
this overview; it is
also worth noting that tags in git can be annotated, which means that they
can have a descriptive message, just like a commit).
If the --git-policy
switch, or the policy
key in the git
section in
the ~/.config/bamboorc
file, gets a different value than the default
(testing
), bambooRun
will check that the analysis code is committed,
tagged, or tagged and pushed, based on the specified value
(committed
, tagged
, and pushed
, respectively).
It is recommended to use at least committed
(which will print warnings
if the commit has not been pushed, or is not tagged).
Tip: use git worktrees
An interesting solution to have several checkouts of the same repository, e.g. to run jobs with one version of the analysis code, and edit it at the same time, are git worktrees (see git-worktree manual page for a reference, or this article for some examples). They may also help with making sure that everything is committed and tracked by git: if you use the main clone to edit the code, and checkout a commit or tag in a worktree to produce plots on the full dataset, committing all necessary files is the best way to keep them synchronized (the “production” directory should not contain any untracked files then).
Git worktrees were introduced in version 2.5, so it will not work with older versions. The LCG distribution includes git since LCG_99, so if you use that method of installing bamboo it will be included automatically.
Tip: make a python package out of your analysis
For small analyses and projects, all that is needed are a YAML configuration
file and a python module, or a few of each.
When code needs to be shared between different modules, a simple solution is to
make it a python package: move the shared modules to a subdirectory, called
e.g. myanalysis
, add an empty __init__.py
to it, and write a
setup.py
file (still required for editable installs) like this one:
from setuptools import setup, find_packages
setup(
name="myexperiment-myanalysis",
description="Hunt for new physics (implemented with bamboo)",
url="https://gitlab.cern.ch/.../...",
author="...",
author_email="...",
packages=find_packages("."),
setup_requires=["setuptools_scm"],
use_scm_version=True
)
It can then be installed in the virtual environment with
pip install -e .
and the shared modules imported as myanalysis.mymodule
.
The -e
flag actually puts only a link in the virtual environment, such that
any changes in the source files are immediately available, without updating the
installed version (then it does not spoil the change tracking above).
More information on packaging python packages can be found in the PyPA packaging tutorial, the setuptools documentation, the PyPA setuptools guide and the Scikit-HEP packaging guidelines. For packages that include C++ components scikit-build allows to combine setuptools and CMake (it is also used by bamboo and correctionlib).
SVfit for the reconstruction of the Higgs mass in \(H\rightarrow \tau\tau\) events
The Higgs mass in events with Higgs bosons decaying into a pair of \(\tau\) leptons can be reconstructed using the SVfit algorithm. The algorithm is based on matrix element techniques and typically achieves a relative resolution on the Higgs boson mass of 15–20%. It utilizes information about the missing transverse energy and the kinematic properties of leptons as input.
The SVfit algorithm is not implemented within bamboo framework itself, but an interface is provided to enable users to use SVfit in bamboo.
Firstly, SVfit needs to be installed (outside of the bamboo package).
git clone git@github.com:cp3-llbb/ClassicSVfit.git ClassicSVfit -b fastMTT_cmake_build
mkdir build-ClassicSVfit-FastMTT/
cd build-ClassicSVfit-FastMTT/
cmake cmake -DCMAKE_INSTALL_PREFIX=$VIRTUAL_ENV ../ClassicSVfit/
make -j2
After installation, the on the fly calculator should be added and configured
using the bamboo.analysisutils.configureSVfitCalculator()
method, as shown in the example below.
Users can then utilize the bamboo.treefunctions.svFitMTT()
and bamboo.treefunctions.svFitFastMTT()
functions for the reconstruction of the Higgs mass.
class MyAnalysisModule(NanoAODHistoModule):
def prepareTree(self, tree, sample=None, sampleCfg=None):
tree,noSel,be,lumiArgs = NanoAODHistoModule.prepareTree(self, tree, sample=sample, sampleCfg=sampleCfg)
from bamboo.analysisutils import configureSVfitCalculator
configureSVfitCalculator(pathToSVfit=" ", backend=be)
return tree,noSel,be,lumiArgs