Installation and setup
Dependencies and environment
Bamboo only depends on python3 (with pip/setuptools to install PyYAML and numpy if needed) and a recent version of ROOT (6.20/00 is the minimum supported version, as it introduces some compatibility features for the new PyROOT in 6.22/00).
On user interface machines (lxplus, ingrid, or any machine with cvmfs), an easy way to get such a recent version of ROOT is through a CMSSW release that depends on it, or from the SPI LCG distribution, e.g.
source /cvmfs/sft.cern.ch/lcg/views/LCG_102/x86_64-centos7-gcc11-opt/setup.sh
python -m venv bamboovenv
source bamboovenv/bin/activate
(the second command creates a virtual environment to install python packages in; once that has been done, only the first and third commands need to be run in a new shell, to pick up the correct base system and then the installed packages).
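Once the environment is active, the base dependencies listed above can be checked with a few lines of Python. The following is a minimal sketch, not part of bamboo: the helper names (parse_root_version, root_is_recent_enough, missing_modules) are made up for illustration, and it assumes that ROOT reports its version in the usual 6.20/00 form.

```python
import importlib.util
import re
import sys

MIN_ROOT = (6, 20, 0)  # 6.20/00, the minimum version quoted above


def parse_root_version(text):
    """Turn a ROOT version string such as '6.22/08' into (6, 22, 8)."""
    match = re.fullmatch(r"(\d+)\.(\d+)[/.](\d+)", text.strip())
    if match is None:
        raise ValueError(f"unrecognised ROOT version string: {text!r}")
    return tuple(int(part) for part in match.groups())


def root_is_recent_enough(text, minimum=MIN_ROOT):
    """Compare a ROOT version string against the minimum supported version."""
    return parse_root_version(text) >= minimum


def missing_modules(names=("yaml", "numpy", "ROOT")):
    """Return the entries of `names` that cannot be imported in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    print("python", sys.version.split()[0])
    missing = missing_modules()
    print("missing modules:", ", ".join(missing) or "none")
    if "ROOT" not in missing:
        import ROOT  # only imported when it is actually available

        version = ROOT.gROOT.GetVersion()
        status = "ok" if root_is_recent_enough(version) else "too old"
        print("ROOT", version, status)
```

Running this inside the activated virtual environment should list no missing modules before bamboo is installed.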
Alternatively, a conda environment (e.g. with Miniconda) can be created with
conda config --add channels conda-forge # if not already present
conda create -n test_bamboo root pyyaml numpy cmake boost
conda activate test_bamboo
and bamboo installed directly there with pip, or in a virtual environment inside the conda environment (make sure to pass --system-site-packages to venv then); conda-build recipes are in the plans.
A docker image (based on repo2docker, configuration) with an up-to-date version of bamboo and plotIt is also available. It is compatible with binder, which can be used to run some examples without installing anything locally.
Some features bring in additional dependencies. Bamboo should detect if these are relied on and missing, and print a clear error message in that case. Currently, they include:
the dasgoclient executable (and a valid grid proxy) for retrieving the list of files in samples specified with db: das:/X/Y/Z. Due to some interference with the setup script above, it is best to run the CMS environment scripts first, and to run voms-proxy-init at that point as well (this can alternatively be done from a different shell on the same machine)
the slurm command-line tools, and CP3SlurmUtils, which can be installed using pip (or loaded with module load slurm/slurm_utils on the UCLouvain ingrid ui machines)
machine learning libraries (libtorch, Tensorflow-C, lwtnn): see this section for more information
pyplotit (see below), which is needed for writing out tables in LaTeX format from cutflow reports
Dask or pySpark for running distributed RDataFrame (see below)
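Bamboo does its own detection of these optional dependencies, but it can be convenient to see in advance what is available. The sketch below is independent of bamboo's own checks and makes a few assumptions: sbatch stands in for the slurm command-line tools, the import name pyplotit is assumed to match the package name, and the feature labels are purely descriptive.

```python
import importlib.util
import shutil

# Executables named in the list above; "sbatch" stands in for the slurm tools.
EXECUTABLES = {
    "DAS sample lookup": ["dasgoclient", "voms-proxy-init"],
    "slurm submission": ["sbatch"],
}
# Python modules; the import name "pyplotit" is an assumption.
MODULES = {
    "LaTeX yields tables": ["pyplotit"],
    "distributed RDataFrame (Dask)": ["dask"],
    "distributed RDataFrame (Spark)": ["pyspark"],
}


def feature_report(executables=EXECUTABLES, modules=MODULES):
    """Map each feature to the list of things it needs that are missing."""
    report = {}
    for feature, names in executables.items():
        report[feature] = [n for n in names if shutil.which(n) is None]
    for feature, names in modules.items():
        report[feature] = [n for n in names if importlib.util.find_spec(n) is None]
    return report


if __name__ == "__main__":
    for feature, missing in feature_report().items():
        print(f"{feature}: {'ok' if not missing else 'missing ' + ', '.join(missing)}")
```

An empty list for a feature means that everything it needs was found on the PATH or in the active Python environment.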
Installation
Bamboo can (and should, in most cases) be installed in a virtual environment or conda environment (see above) with pip:
pip install bamboo-hep
Since Bamboo is still in heavy development, you may want to fetch the latest (unreleased) version using one of:
pip install git+https://gitlab.cern.ch/cp3-cms/bamboo.git
pip install git+ssh://git@gitlab.cern.ch:7999/cp3-cms/bamboo.git
It may even be useful to install from a local clone, such that you can use it to test and propose changes, using
git clone -o upstream https://gitlab.cern.ch/cp3-cms/bamboo.git /path/to/your/bambooclone
pip install /path/to/your/bambooclone ## e.g. ./bamboo (not bamboo - another package with that name exists)
such that you can update later on with (inside /path/to/your/bambooclone)
git pull upstream master
pip install --upgrade .
It is also possible to install bamboo in editable mode for development; to avoid problems, this should be done in a separate virtual environment:
python -m venv devvenv ## deactivate first, or use a fresh shell
source devvenv/bin/activate ## deactivate first, or use a fresh shell
export SETUPTOOLS_ENABLE_FEATURES=legacy-editable
pip install -e ./bamboo
Note that this will store cached build outputs in the _skbuild directory. The command python setup.py clean --all can be used to clean this up (otherwise the cached outputs will prevent updating the non-editable install).
The additional environment variable is a workaround for a bug in scikit-build, see this issue.
The documentation can be built locally with python setup.py build_sphinx, and for running all (or some) tests the easiest is to call pytest directly, with the bamboo/tests directory to run all tests, or with a specific file to check only the tests defined there.
Note
bamboo is a shared package, so everything that is specific to a single analysis (or a few related analyses) is best stored elsewhere (e.g. in bamboodev/myanalysis in the example below); otherwise you will need to be very careful when updating to a newer version.
The bambooRun command can pick up code in different ways, so it is possible to start from a single python file, and move to a pip-installed analysis package later on when code needs to be shared between modules.
For combining the different histograms in stacks and producing pdf or png files, which is used in many analyses, the plotIt tool is used. It can be installed with cmake, e.g.
git clone -o upstream https://github.com/cp3-llbb/plotIt.git /path/to/your/plotitclone
mkdir build-plotit
cmake -DCMAKE_INSTALL_PREFIX=$VIRTUAL_ENV -S /path/to/your/plotitclone -B build-plotit
cmake --build build-plotit -t install -j 4
where -DCMAKE_INSTALL_PREFIX=$VIRTUAL_ENV ensures that the plotIt executable will be installed directly in the bin directory of the virtualenv (if not using a virtualenv, its path can be passed to bambooRun with the --plotIt command-line option).
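To see which plotIt executable would be picked up in the current shell, a lookup along these lines can help. This is a minimal sketch under the assumption that plotIt was installed either in the bin directory of the active virtual environment or somewhere on the PATH; find_plotit is a hypothetical helper and does not reproduce how bamboo itself locates the executable.

```python
import os
import shutil


def find_plotit(environ=None):
    """Return the path of a plotIt executable, or None if none is found.

    The active virtual environment is checked first, then the PATH.
    """
    env = os.environ if environ is None else environ
    venv = env.get("VIRTUAL_ENV")
    if venv:
        candidate = os.path.join(venv, "bin", "plotIt")
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return shutil.which("plotIt", path=env.get("PATH"))


if __name__ == "__main__":
    path = find_plotit()
    if path is None:
        print("plotIt not found; install it or pass its location with --plotIt")
    else:
        print("plotIt found at", path)
```

If the executable lives outside the virtual environment, the printed path is what would be passed to the --plotIt option.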
plotIt is very efficient at what it does, but not so easy to adapt to producing efficiency plots, overlays of differently defined distributions etc. Therefore a python implementation of its main functionality was started in the pyplotit package, which can be installed with
pip install git+https://gitlab.cern.ch/cp3-cms/pyplotit.git
or editable from a local clone:
git clone -o upstream https://gitlab.cern.ch/cp3-cms/pyplotit.git
pip install -e pyplotit
pyplotit parses plotIt YAML files and implements the same grouping and stack-building logic; an easy way to get started with it is through the iPlotIt script, which parses a plotIt configuration file and launches an IPython shell.
Currently this is used in bamboo for producing yields tables from cutflow reports.
It is also very useful for writing custom postprocess functions, see this recipe for an example.
To use scalefactors and weights in the new CMS JSON format, the correctionlib package should be installed with
pip install --no-binary=correctionlib correctionlib
The calculator modules for jet and MET corrections and systematic variations were moved to a separate repository and package, such that they can also be used from other frameworks. The repository can be found at cp3-cms/CMSJMECalculators, and the package can be installed with
pip install git+https://gitlab.cern.ch/cp3-cms/CMSJMECalculators.git
For the impatient: recipes for installing and updating
Putting the above commands together, the following should give you a virtual environment with bamboo, and a clone of bamboo and plotIt in case you need to modify them, all under bamboodev:
Fresh install
mkdir bamboodev
cd bamboodev
# make a virtualenv
source /cvmfs/sft.cern.ch/lcg/views/LCG_102/x86_64-centos7-gcc11-opt/setup.sh
python -m venv bamboovenv
source bamboovenv/bin/activate
# clone and install bamboo
git clone -o upstream https://gitlab.cern.ch/cp3-cms/bamboo.git
pip install ./bamboo
# clone and install plotIt
git clone -o upstream https://github.com/cp3-llbb/plotIt.git
mkdir build-plotit
cd build-plotit
cmake -DCMAKE_INSTALL_PREFIX=$VIRTUAL_ENV ../plotIt
make -j2 install
cd -
Environment setup
Once bamboo and plotIt have been installed as above, only the following two commands are needed to set up the environment in a new shell:
source /cvmfs/sft.cern.ch/lcg/views/LCG_102/x86_64-centos7-gcc11-opt/setup.sh
source bamboodev/bamboovenv/bin/activate
Update bamboo
Assuming the environment is set up as above; this can also be used to test a pull request or local modifications to the bamboo source code
cd bamboodev/bamboo
git checkout master
git pull upstream master
pip install --upgrade .
Update plotIt
Assuming the environment is set up as above; this can also be used to test a pull request or local modifications to the plotIt source code. If a plotIt build directory already exists it should have been created with the same environment, otherwise the safest solution is to remove it.
cd bamboodev
mkdir build-plotit
cd build-plotit
cmake -DCMAKE_INSTALL_PREFIX=$VIRTUAL_ENV ../plotIt
make -j2 install
cd -
Move to a new LCG release or install an independent version
Different virtual environments can exist alongside each other, as long as the corresponding base LCG distribution is set up in a fresh shell for each of them. This makes it possible to have e.g. one stable version used for analysis, and another one to test experimental changes or check a new LCG release, without touching a known working version.
cd bamboodev
source /cvmfs/sft.cern.ch/lcg/views/LCG_102/x86_64-centos7-gcc11-opt/setup.sh
python -m venv bamboovenv_X
source bamboovenv_X/bin/activate
pip install ./bamboo
# install plotIt (as in "Update plotIt" above)
mkdir build-plotit
cd build-plotit
cmake -DCMAKE_INSTALL_PREFIX=$VIRTUAL_ENV ../plotIt
make -j2 install
cd -
Test your setup
Now you can run a few simple tests on a CMS NanoAOD to see if the installation was successful. A minimal example is run by the following command:
bambooRun -m /path/to/your/bambooclone/examples/nanozmumu.py:NanoZMuMu /path/to/your/bambooclone/examples/test1.yml -o test1
which will run over a single sample of ten events and fill some histograms (in fact, only one event passes the selection, so they will not look very interesting). If you have a NanoAOD file with muon triggers around, you can put its path instead of the test file in the yml file and rerun to get a nicer plot (xrootd also works, but only for this kind of test: in any practical case the performance benefit of having the files locally is worth the cost of replicating them).
Getting started
The test command above shows how bamboo is typically run: using the bambooRun command, with a python module that specifies what to run, and an analysis YAML file that specifies which samples to process, and how to combine them in plots (there are several options to run a small test, or submit jobs to the batch system when processing a lot of samples).
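The structure of such a command line can be made explicit with a small helper that assembles it from its parts. This is only an illustration: bamboorun_command is a hypothetical function, and it uses only the options that appear on this page (-m, -o, -i and --plotIt).

```python
import shlex


def bamboorun_command(module_file, class_name, config,
                      output=None, interactive=False, plotit=None):
    """Assemble a bambooRun command line as a list of arguments."""
    cmd = ["bambooRun"]
    if interactive:
        cmd.append("-i")  # inspect the decorated tree in an IPython shell
    # the module is given as <python file>:<class name>
    cmd += ["-m", f"{module_file}:{class_name}", config]
    if output is not None:
        cmd += ["-o", output]  # output directory
    if plotit is not None:
        cmd += ["--plotIt", plotit]  # explicit path to the plotIt executable
    return cmd


if __name__ == "__main__":
    cmd = bamboorun_command(
        "/path/to/your/bambooclone/examples/nanozmumu.py",
        "NanoZMuMu",
        "/path/to/your/bambooclone/examples/test1.yml",
        output="test1",
    )
    print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run, or printed as above and pasted into a shell.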
A more realistic analysis YAML configuration file is bamboo/examples/analysis_zmm.yml, which runs on a significant fraction of the 2016 and 2017 DoubleMuon data and the corresponding Drell-Yan simulated samples.
Since the samples are specified by their DAS path in this case, the dasgoclient executable and a valid grid proxy are needed for resolving those to files, and a configuration file that describes the local computing environment (i.e. the root path of the local CMS grid storage, or the name of the redirector in case of using xrootd); examples are included for the UCLouvain-CP3 and CERN (lxplus/lxbatch) cases.
The corresponding python module shows the typical structure: ever tighter event selections are derived from the base selection, which accepts all the events in the input, and plots are defined based on these selections and returned in a list from the main method (this list corresponds to the pdf or png files that will be produced).
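The idea of deriving ever tighter selections from a base selection can be illustrated with a toy model in plain Python. Everything in this sketch is hypothetical: ToySelection, define_plots and the event dictionaries are not bamboo classes or data formats, and the events are evaluated directly in Python, whereas bamboo works on a decorated version of the tree.

```python
class ToySelection:
    """Hypothetical stand-in for a selection: a name, cuts, and a parent."""

    def __init__(self, name, cuts=(), parent=None):
        self.name = name
        self.cuts = list(cuts)
        self.parent = parent

    def refine(self, name, cut):
        """Return a tighter selection that adds `cut` on top of this one."""
        return ToySelection(name, [cut], parent=self)

    def passes(self, event):
        """An event passes if it passes the parent and all cuts added here."""
        if self.parent is not None and not self.parent.passes(event):
            return False
        return all(cut(event) for cut in self.cuts)


def define_plots(no_sel):
    """Derive tighter selections from `no_sel` and return the plots in a list."""
    two_mu = no_sel.refine("twoMuons", lambda ev: len(ev["muon_pt"]) >= 2)
    two_hard_mu = two_mu.refine(
        "twoHardMuons", lambda ev: min(ev["muon_pt"][:2]) > 20.0)
    # in this toy model a "plot" is just (name, selection, variable)
    return [
        ("nMuons", no_sel, lambda ev: len(ev["muon_pt"])),
        ("leadingMuonPt", two_mu, lambda ev: ev["muon_pt"][0]),
        ("subleadingMuonPt", two_hard_mu, lambda ev: ev["muon_pt"][1]),
    ]


if __name__ == "__main__":
    events = [{"muon_pt": [45.0, 30.0]}, {"muon_pt": [25.0, 10.0]}, {"muon_pt": [12.0]}]
    for name, sel, var in define_plots(ToySelection("noSel")):
        print(name, sel.name, [var(ev) for ev in events if sel.passes(ev)])
```

Because each selection keeps a reference to its parent, a cut such as the one in twoHardMuons is only evaluated for events that already have two muons, which mirrors why tighter selections are built on top of looser ones.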
The module deals with a decorated version of the tree, which can also be inspected from an IPython shell by using the -i option to bambooRun, e.g.
bambooRun -i -m /path/to/your/bambooclone/examples/nanozmumu.py:NanoZMuMu /path/to/your/bambooclone/examples/test1.yml
Together with the helper methods defined on this page, this makes it possible to define a wide variety of selection requirements and variables.
The user guide contains a much more detailed description of the different files and how they are used, and the analysis recipes page provides more information about the bundled helper methods for common tasks. The API reference describes all available user-facing methods and classes. If the builtin functionality is not sufficient, some hints on extending or modifying bamboo can be found in the advanced topics and the hacking guide.
Machine learning packages
In order to evaluate machine learning classifiers, bamboo needs to find the necessary C(++) libraries, both when the extension libraries are compiled and at runtime (so they need to be installed before (re)installing bamboo).
libtorch is searched for in the torch package with pkg_resources, which unfortunately does not always work due to pip build isolation.
This can be bypassed by passing --no-build-isolation when installing, or by installing bamboo-hep[torch], which will install it as a dependency (it is quite big, so if the former method works it should be preferred).
The --no-build-isolation option is a workaround: once it becomes possible to pass CMake options to pip install (see scikit-build#479), that will be a better solution.
The minimum version required for libtorch is 1.5 (due to changes in the C++ API), which is available from LCG_99 on (contains libtorch 1.7.0).
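Whether the torch package visible in the current environment satisfies this requirement can be checked before (re)installing bamboo. The following is a minimal sketch with hypothetical helper names; it assumes that the version string starts with major.minor and may carry a suffix such as +cpu.

```python
import importlib.util
import re

MIN_TORCH = (1, 5)  # minimum libtorch version quoted above


def torch_version_tuple(text):
    """Extract (major, minor) from a version string such as '1.7.0+cpu'."""
    match = re.match(r"(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"unrecognised torch version string: {text!r}")
    return int(match.group(1)), int(match.group(2))


def torch_is_recent_enough(text, minimum=MIN_TORCH):
    """Compare a torch version string against the minimum required version."""
    return torch_version_tuple(text) >= minimum


if __name__ == "__main__":
    if importlib.util.find_spec("torch") is None:
        print("torch is not installed in this environment")
    else:
        import torch  # only imported when it is actually available

        version = str(torch.__version__)
        status = "ok" if torch_is_recent_enough(version) else "too old"
        print("torch", version, status)
```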
Tensorflow-C and lwtnn will be searched for (by cmake and the dynamic library loader) in the default locations, supplemented with the currently active virtual environment, if any (scripts to install them there directly are included in the bamboo source code repository, as ext/install_tensorflow-c.sh and ext/install_lwtnn.sh).
ONNX Runtime should be part of recent LCG distributions.
If not, it will be searched for in the standard locations.
It can be added to the virtual environment by following the instructions to build from source, with the additional option --cmake_extra_defines=CMAKE_INSTALL_PREFIX=$VIRTUAL_ENV, after which make install from its build/Linux/<config> directory will install it correctly (replacing <config> by the CMake build type, e.g. Release or RelWithDebInfo).
Note
Installing a newer version of libtorch in a virtualenv if it is also available through the PYTHONPATH (e.g. in the LCG distribution) generally does not work, since virtualenv uses PYTHONHOME, which has lower precedence.
For the pure C(++) libraries Tensorflow-C and lwtnn this could be made to work, but currently the virtual environment is only explicitly searched if they are not found otherwise.
Therefore it is recommended to stick with the version provided by the LCG distribution, or to set up an isolated environment with conda; see the issues #68 (for now) and #65 for more information. When a stable solution is found it will be added here.
Warning
The libtorch and Tensorflow-C builds in LCG_98python3 contain AVX2 instructions (so they need a CPU from one of these CPU generations). See issue #68 for a more detailed discussion, and a possible workaround.
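Whether a given machine supports AVX2 can be checked from the CPU flags. The sketch below assumes an x86 Linux host, where /proc/cpuinfo contains a flags line per core; the function names are hypothetical, and on other platforms the check simply reports that it could not tell.

```python
def cpu_supports_avx2(cpuinfo_text):
    """Return True if the first 'flags' line of a cpuinfo dump lists avx2."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return "avx2" in value.split()
    return False


def host_supports_avx2(path="/proc/cpuinfo"):
    """Check the local machine; returns None if the file cannot be read."""
    try:
        with open(path) as handle:
            return cpu_supports_avx2(handle.read())
    except OSError:
        return None


if __name__ == "__main__":
    result = host_supports_avx2()
    if result is None:
        print("could not read /proc/cpuinfo on this platform")
    else:
        print("AVX2 supported" if result else "AVX2 not supported")
```

This is mainly useful on batch worker nodes, which may have older CPUs than the user interface machine where the software was installed.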
Distributed RDataFrame
Through distributed ROOT::RDataFrame, bamboo can distribute the computations on a cluster managed by Dask or pySpark. While Dask, using Dask-jobqueue, can work on any existing cluster managed by SLURM or HTCondor, Spark requires a Spark scheduler to be running at your computing centre.
To install the required dependencies, run either one of:
pip install bamboo-hep[dask]
pip install bamboo-hep[spark]
EasyBuild-based installation at CP3
On the ingrid/manneback cluster at UCLouvain-CP3, and other environments that use EasyBuild, it is also possible to install bamboo based on the dependencies that are provided through this mechanism (potentially with some of them built as user modules). The LCG source script in the instructions above should then be replaced by e.g.
module load ROOT/6.22.08-foss-2019b-Python-3.7.4 CMake/3.15.3-GCCcore-8.3.0 \
Boost/1.71.0-gompi-2019b matplotlib/3.1.1-foss-2019b-Python-3.7.4 \
PyYAML/5.1.2-GCCcore-8.3.0 TensorFlow/2.1.0-foss-2019b-Python-3.7.4