Installation and setup

Dependencies and environment

Bamboo only depends on python3 (with pip/setuptools to install PyYAML and numpy if needed) and a recent version of ROOT (6.20/00 is the minimum supported version, as it introduces some compatibility features for the new PyROOT in 6.22/00).

On user interface machines (lxplus, ingrid, or any machine with cvmfs), an easy way to get such a recent version of ROOT is through a CMSSW release that depends on it, or from the SPI LCG distribution, e.g.

source /cvmfs/sft.cern.ch/lcg/views/LCG_102/x86_64-centos7-gcc11-opt/setup.sh
python -m venv bamboovenv
source bamboovenv/bin/activate

(the second command creates a virtual environment to install python packages in; once everything is installed, it is enough to run the first and third commands in a new shell, to pick up the correct base software stack and then the installed packages).

Alternatively, a conda environment (e.g. with Miniconda) can be created with

conda config --add channels conda-forge # if not already present
conda create -n test_bamboo root pyyaml numpy cmake boost
conda activate test_bamboo

and bamboo can then be installed directly there with pip, or in a virtual environment inside the conda environment (make sure to pass --system-site-packages to venv in that case); conda-build recipes are planned.

A docker image (built with repo2docker) with an up-to-date version of bamboo and plotIt is also available. It is compatible with binder, which can be used to run some examples without installing anything locally.

Some features bring in additional dependencies. Bamboo should detect when these are needed but missing, and print a clear error message in that case. Currently, they include:

  • the dasgoclient executable (and a valid grid proxy) for retrieving the list of files in samples specified with db: das:/X/Y/Z. Due to some interference with the setup script above, it is best to run the CMS environment scripts first, and also run voms-proxy-init then (alternatively, this can be done from a different shell on the same machine)

  • the slurm command-line tools, and CP3SlurmUtils, which can be installed using pip (or loaded with module load slurm/slurm_utils on the UCLouvain ingrid ui machines)

  • machine learning libraries (libtorch, Tensorflow-C, lwtnn): see the Machine learning packages section below for more information

  • writing out tables in LaTeX format from cutflow reports requires pyplotit (see below)

  • Dask or pySpark for running distributed RDataFrame (see below)

Installation

Bamboo can (and should, in most cases) be installed in a virtual environment or conda environment (see above) with pip:

pip install bamboo-hep

Since Bamboo is still in heavy development, you may want to fetch the latest (unreleased) version using one of:

pip install git+https://gitlab.cern.ch/cp3-cms/bamboo.git
pip install git+ssh://git@gitlab.cern.ch:7999/cp3-cms/bamboo.git

It may even be useful to install from a local clone, so that you can use it to test and propose changes, using

git clone -o upstream https://gitlab.cern.ch/cp3-cms/bamboo.git /path/to/your/bambooclone
pip install /path/to/your/bambooclone ## e.g. ./bamboo (not bamboo - another package with that name exists)

such that you can update later on with (inside /path/to/your/bambooclone)

git pull upstream master
pip install --upgrade .

It is also possible to install bamboo in editable mode for development; to avoid problems, this should be done in a separate virtual environment:

python -m venv devvenv ## deactivate first, or use a fresh shell
source devvenv/bin/activate ## deactivate first, or use a fresh shell
export SETUPTOOLS_ENABLE_FEATURES=legacy-editable
pip install -e ./bamboo

Note that this will store cached build outputs in the _skbuild directory. python setup.py clean --all can be used to clean this up (otherwise they will prevent updating the non-editable install). The additional environment variable is a workaround for a bug in scikit-build, see the corresponding issue in their tracker.

The documentation can be built locally with python setup.py build_sphinx. For running all (or some) tests, the easiest is to call pytest directly, passing the bamboo/tests directory to run all tests, or a specific file to run only the tests defined there.

Note

bamboo is a shared package, so everything that is specific to a single analysis (or a few related analyses) is best stored elsewhere (e.g. in bamboodev/myanalysis in the example below); otherwise you will need to be very careful when updating to a newer version.

The bambooRun command can pick up code in different ways, so it is possible to start from a single python file, and move to a pip-installed analysis package later on when code needs to be shared between modules.
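
As an illustration, such a single python file could look roughly like the sketch below; it assumes the NanoAODHistoModule base class used by the bundled examples, and the file, class and output names are hypothetical:

# myanalysis.py -- hypothetical single-file analysis module (minimal sketch)
from bamboo.analysismodules import NanoAODHistoModule

class MyAnalysis(NanoAODHistoModule):
    def definePlots(self, tree, noSel, sample=None, sampleCfg=None):
        plots = []  # selections and plots are defined here (see the Getting started section)
        return plots

# it can then be run with e.g.
#   bambooRun -m myanalysis.py:MyAnalysis myconfig.yml -o test_out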

For combining the different histograms into stacks and producing pdf or png files, as needed in many analyses, the plotIt tool is used. It can be installed with cmake, e.g.

git clone -o upstream https://github.com/cp3-llbb/plotIt.git /path/to/your/plotitclone
mkdir build-plotit
cmake -DCMAKE_INSTALL_PREFIX=$VIRTUAL_ENV -S /path/to/your/plotitclone -B build-plotit
cmake --build build-plotit -t install -j 4

where -DCMAKE_INSTALL_PREFIX=$VIRTUAL_ENV ensures that the plotIt executable will be installed directly in the bin directory of the virtualenv (if not using a virtualenv, its path can be passed to bambooRun with the --plotIt command-line option).

plotIt is very efficient at what it does, but not so easy to adapt for producing efficiency plots, overlays of differently defined distributions, etc. Therefore a python implementation of its main functionality was started in the pyplotit package, which can be installed with

pip install git+https://gitlab.cern.ch/cp3-cms/pyplotit.git

or editable from a local clone:

git clone -o upstream https://gitlab.cern.ch/cp3-cms/pyplotit.git
pip install -e pyplotit

pyplotit parses plotIt YAML files and implements the same grouping and stack-building logic; an easy way to get started with it is through the iPlotIt script, which parses a plotIt configuration file and launches an IPython shell. Currently this is used in bamboo for producing yields tables from cutflow reports. It is also very useful for writing custom postprocess functions; see the analysis recipes page for an example.

To use scalefactors and weights in the new CMS JSON format, the correctionlib package should be installed with

pip install --no-binary=correctionlib correctionlib
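
For illustration, a correction from such a JSON file can be evaluated directly with the correctionlib python bindings; in the minimal sketch below the file name, correction name and input values are hypothetical:

import correctionlib

# open a correction set in the CMS JSON format (hypothetical file)
cset = correctionlib.CorrectionSet.from_file("muon_SFs.json")
# look up one correction by name and evaluate it for given inputs
# (the inputs and their order depend on how the correction was defined)
sf = cset["NUM_TightID_DEN_TrackerMuons"].evaluate(1.2, 45.0, "nominal")
print(sf)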

The calculator modules for jet and MET corrections and systematic variations were moved to a separate repository and package, such that they can also be used from other frameworks. The repository can be found at cp3-cms/CMSJMECalculators, and it can be installed with

pip install git+https://gitlab.cern.ch/cp3-cms/CMSJMECalculators.git

For the impatient: recipes for installing and updating

Putting the above commands together, the following should give you a virtual environment with bamboo, and a clone of bamboo and plotIt in case you need to modify them, all under bamboodev:

Fresh install

mkdir bamboodev
cd bamboodev
# make a virtualenv
source /cvmfs/sft.cern.ch/lcg/views/LCG_102/x86_64-centos7-gcc11-opt/setup.sh
python -m venv bamboovenv
source bamboovenv/bin/activate
# clone and install bamboo
git clone -o upstream https://gitlab.cern.ch/cp3-cms/bamboo.git
pip install ./bamboo
# clone and install plotIt
git clone -o upstream https://github.com/cp3-llbb/plotIt.git
mkdir build-plotit
cd build-plotit
cmake -DCMAKE_INSTALL_PREFIX=$VIRTUAL_ENV ../plotIt
make -j2 install
cd -

Environment setup

Once bamboo and plotIt have been installed as above, only the following two commands are needed to set up the environment in a new shell:

source /cvmfs/sft.cern.ch/lcg/views/LCG_102/x86_64-centos7-gcc11-opt/setup.sh
source bamboodev/bamboovenv/bin/activate

Update bamboo

Assuming the environment is set up as above; this can also be used to test a pull request or local modifications to the bamboo source code.

cd bamboodev/bamboo
git checkout master
git pull upstream master
pip install --upgrade .

Update plotIt

Assuming the environment is set up as above; this can also be used to test a pull request or local modifications to the plotIt source code. If a plotIt build directory already exists, it should have been created with the same environment; otherwise the safest solution is to remove it.

cd bamboodev
mkdir -p build-plotit
cd build-plotit
cmake -DCMAKE_INSTALL_PREFIX=$VIRTUAL_ENV ../plotIt
make -j2 install
cd -

Move to a new LCG release or install an independent version

Different virtual environments can exist alongside each other, as long as for each of them the corresponding base LCG distribution is set up in a fresh shell. This makes it possible to have e.g. one stable version used for analysis, and another one to test experimental changes or check a new LCG release, without touching a known working version.

cd bamboodev
source /cvmfs/sft.cern.ch/lcg/views/LCG_102/x86_64-centos7-gcc11-opt/setup.sh ## adjust to the desired LCG release
python -m venv bamboovenv_X
source bamboovenv_X/bin/activate
pip install ./bamboo
# install plotIt (as in "Update plotIt" above)
mkdir build-plotit
cd build-plotit
cmake -DCMAKE_INSTALL_PREFIX=$VIRTUAL_ENV ../plotIt
make -j2 install
cd -

Test your setup

Now you can run a few simple tests on a CMS NanoAOD to see if the installation was successful. A minimal example is run by the following command:

bambooRun -m /path/to/your/bambooclone/examples/nanozmumu.py:NanoZMuMu /path/to/your/bambooclone/examples/test1.yml -o test1

which will run over a single sample of ten events and fill some histograms (in fact, only one event passes the selection, so they will not look very interesting). If you have a NanoAOD file with muon triggers around, you can put its path instead of the test file in the yml file and rerun to get a nicer plot (xrootd also works, but only for this kind of test; in any practical case the performance benefit of having the files locally is worth the cost of replicating them).

Getting started

The test command above shows how bamboo is typically run: using the bambooRun command, with a python module that specifies what to run, and an analysis YAML file that specifies which samples to process and how to combine them in plots (there are several options to run a small test, or to submit jobs to the batch system when processing many samples).

A more realistic analysis YAML configuration file is bamboo/examples/analysis_zmm.yml, which runs on a significant fraction of the 2016 and 2017 DoubleMuon data and the corresponding Drell-Yan simulated samples. Since the samples are specified by their DAS path in this case, the dasgoclient executable and a valid grid proxy are needed for resolving those to files, and a configuration file that describes the local computing environment (i.e. the root path of the local CMS grid storage, or the name of the redirector in case of using xrootd); examples are included for the UCLouvain-CP3 and CERN (lxplus/lxbatch) cases.

The corresponding python module shows the typical structure: ever tighter event selections that derive from the base selection (which accepts all the events in the input), and plots that are defined based on these selections and returned in a list from the main method (each of these corresponds to a pdf or png file that will be produced).

The module deals with a decorated version of the tree, which can also be inspected from an IPython shell by using the -i option to bambooRun, e.g.

bambooRun -i -m /path/to/your/bambooclone/examples/nanozmumu.py:NanoZMuMu /path/to/your/bambooclone/examples/test1.yml

together with the bundled helper methods (see the API reference), this allows defining a wide variety of selection requirements and variables.
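
As a rough sketch of what this looks like in practice (assuming the NanoAODHistoModule interface of the bundled examples; the object selection and binning below are illustrative, not copied from the example module):

from bamboo import treefunctions as op
from bamboo.plots import Plot, EquidistantBinning

def definePlots(self, tree, noSel, sample=None, sampleCfg=None):
    plots = []
    # select muon candidates from the decorated NanoAOD tree
    muons = op.select(tree.Muon, lambda mu: mu.pt > 20.)
    # refine the base selection: require at least two selected muons
    twoMuSel = noSel.refine("twoMuons", cut=[op.rng_len(muons) > 1])
    # plot the invariant mass of the leading dimuon pair for that selection
    plots.append(Plot.make1D("dimuM",
        op.invariant_mass(muons[0].p4, muons[1].p4), twoMuSel,
        EquidistantBinning(100, 20., 120.), title="Dimuon invariant mass"))
    return plots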

The user guide contains a much more detailed description of the different files and how they are used, and the analysis recipes page provides more information about the bundled helper methods for common tasks. The API reference describes all available user-facing methods and classes. If the builtin functionality is not sufficient, some hints on extending or modifying bamboo can be found in the advanced topics and the hacking guide.

Machine learning packages

In order to evaluate machine learning classifiers, bamboo needs to find the necessary C(++) libraries, both when the extension libraries are compiled and at runtime (so they need to be installed before (re)installing bamboo).

libtorch is searched for in the torch package with pkg_resources, which unfortunately does not always work due to pip build isolation. This can be bypassed by passing --no-build-isolation when installing, or by installing bamboo-hep[torch], which will install torch as a dependency (it is quite big, so if the former method works it should be preferred). The --no-build-isolation option is a workaround: once it becomes possible to pass CMake options to pip install (see scikit-build#479), that will be a better solution. The minimum required libtorch version is 1.5 (due to changes in the C++ API), which is available from LCG_99 onwards (it contains libtorch 1.7.0).

Tensorflow-C and lwtnn will be searched for (by cmake and the dynamic library loader) in the default locations, supplemented with the currently active virtual environment, if any (scripts to install them there directly are included in the bamboo source code repository, as ext/install_tensorflow-c.sh and ext/install_lwtnn.sh).

ONNX Runtime should be part of recent LCG distributions; if not, it will be searched for in the standard locations. It can be added to the virtual environment by following the instructions to build from source, with the additional option --cmake_extra_defines=CMAKE_INSTALL_PREFIX=$VIRTUAL_ENV, after which make install from its build/Linux/<config> directory (replacing <config> by the CMake build type, e.g. Release or RelWithDebInfo) will install it correctly.
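
Once one of these backends is available, a trained model can be evaluated inside an analysis module; the sketch below assumes the op.mvaEvaluator helper (see the user guide for the exact arguments it accepts), and the model file and input variables are hypothetical:

from bamboo import treefunctions as op

# load a trained classifier (placeholder file name)
mva = op.mvaEvaluator("my_classifier.onnx")
# evaluate it on a few illustrative inputs; the result can be indexed to pick one output
score = mva(muons[0].pt, muons[0].eta, op.rng_len(jets))[0]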

Note

Installing a newer version of libtorch in a virtualenv, if it is also available through the PYTHONPATH (e.g. in the LCG distribution), generally does not work, since virtualenv uses PYTHONHOME, which has lower precedence. For the pure C(++) libraries Tensorflow-C and lwtnn this could be made to work, but currently the virtual environment is only explicitly searched if they are not found otherwise. Therefore it is recommended to stick with the version provided by the LCG distribution, or to set up an isolated environment with conda; see issues #68 (for now) and #65 for more information. When a stable solution is found it will be added here.

Warning

The libtorch and Tensorflow-C builds in LCG_98python3 contain AVX2 instructions, so they require a CPU generation that supports these. See issue #68 for a more detailed discussion, and a possible workaround.

Distributed RDataFrame

Through distributed ROOT::RDataFrame, bamboo can distribute the computations on a cluster managed by Dask or pySpark. While Dask, using Dask-jobqueue, can work on any existing cluster managed by SLURM or HTCondor, Spark requires a Spark scheduler to be running at your computing centre.

To install the required dependencies, run either one of:

pip install bamboo-hep[dask]
pip install bamboo-hep[spark]

EasyBuild-based installation at CP3

On the ingrid/manneback cluster at UCLouvain-CP3, and in other environments that use EasyBuild, it is also possible to install bamboo based on the dependencies provided through this mechanism (potentially with some of them built as user modules). Sourcing the LCG setup script in the instructions above should then be replaced by e.g.

module load ROOT/6.22.08-foss-2019b-Python-3.7.4 CMake/3.15.3-GCCcore-8.3.0 \
   Boost/1.71.0-gompi-2019b matplotlib/3.1.1-foss-2019b-Python-3.7.4 \
   PyYAML/5.1.2-GCCcore-8.3.0 TensorFlow/2.1.0-foss-2019b-Python-3.7.4