Python is a notoriously slow language, so why is it widely used by scientists and machine learning experts? In a numerically heavy task, an interpreted, dynamically typed environment can be thousands of times slower than a compiled, statically typed one, which can make the difference between minutes and days, or between coarse models on small datasets and fine-grained models on large datasets. The trick is to drive compiled functions from the interpreted command line, as in R, and to frame your problem in array programming primitives, as in Matlab, but in a general-purpose programming language with hundreds of thousands of extensions to glue onto every conceivable interface.
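To make the array-programming point concrete, here is a minimal illustrative sketch (not workshop material) comparing an interpreted Python loop with a single Numpy primitive; the loop pays interpreter overhead per element, while the Numpy call runs once in compiled code:

```python
import numpy as np

# One million samples: a pure-Python loop touches each element
# through the interpreter, while Numpy dispatches the whole
# reduction to compiled code in a single call.
x = np.random.rand(1_000_000)

# Interpreted loop: one bytecode dispatch per element.
total_slow = 0.0
for value in x:
    total_slow += value

# Array primitive: one call into compiled code.
total_fast = x.sum()

# Same answer up to floating-point rounding.
assert abs(total_slow - total_fast) < 1e-6 * total_fast
```

Timing either version (e.g. with `timeit`) typically shows the vectorized call running orders of magnitude faster, which is the gap this workshop is about closing.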
In this workshop, we will examine the numerical processing ecosystem that has grown up around Python. The key libraries in this ecosystem are Numpy, which enables fast array programming, and Pandas, a convenient wrapper for organizing data. We will visualize data in and out of JupyterLab, a notebook front-end for exploratory analysis. We will work through examples of binding C++ to Python with pybind11 and Python to C++ with Cython, which have different strengths and use cases. We will also natively compile Python (C++ speeds without C++) using Numba, and run code on GPUs with Numba (Python-like), CuPy (Numpy-like), and PyCUDA/PyOpenCL (raw CUDA/OpenCL).
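As a small taste of the "convenient wrapper for organizing data" role that Pandas plays, here is a hypothetical sketch (the column names and values are invented for illustration) of labeling columns and computing grouped summaries:

```python
import pandas as pd

# Hypothetical measurements, organized as labeled columns.
df = pd.DataFrame({
    "detector": ["A", "A", "B", "B"],
    "energy":   [1.2, 0.8, 2.5, 3.1],
})

# One line replaces a hand-written loop over groups:
# mean energy per detector (A ≈ 1.0, B ≈ 2.8).
mean_energy = df.groupby("detector")["energy"].mean()
print(mean_energy)
```

The same pattern scales from toy tables like this one to millions of rows, with the heavy lifting done by Numpy arrays underneath.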
Participants will be encouraged to bring a laptop or log into their favorite cluster to install the software we discuss here for later use. We will use conda and pip-in-conda, so superuser ("sudo") permissions are not required. None of these topics requires programming expertise beyond Python, except the short segment on pybind11 (C++), so participants should have a general working knowledge of Python.
Important instructions: Come with conda (Miniconda or Anaconda) and Jupyter Lab or Notebook installed for Python 3. Jupyter Lab/Notebook can be installed through conda. General Python programming skills (e.g. writing page-long scripts without difficulty) will be assumed; try some tutorials online if you need to get up to speed. We'll be using Numpy, Pandas, Dask, Numba, Cython, and others, but no prior knowledge of these libraries will be assumed.
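One possible setup, assuming Miniconda or Anaconda is already installed (the environment name "workshop" is just a suggestion):

```shell
# Create an isolated environment with Python 3 and JupyterLab;
# no sudo needed, everything lives in your home directory.
conda create -n workshop python=3 jupyterlab numpy pandas

# Activate it and launch the notebook interface.
conda activate workshop
jupyter lab
```

Any additional libraries covered in the workshop can be added to the same environment later with `conda install` or pip-in-conda.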
10:00 – 10:30 Intro talk
10:30 – 12:00 Just Numpy
12:00 – 1:00 Lunch
1:00 – 1:30 Ecosystem talk
1:30 – 2:00 Pandas
2:00 – 2:30 Dask & multiprocessing
2:30 – 2:50 Coffee break
2:50 – 3:20 Numba, Cython, pybind11
3:20 – 3:40 CuPy, Numba-GPU, PyCUDA
3:40 – 4:00 ctypes & low-level hackery
Jim Pivarski received his Ph.D. in high-energy particle physics from Cornell in 2006. He helped to commission the CMS experiment at the LHC and later switched to data science as a Big Data consultant. He is now back in physics, integrating computing techniques learned from industry into high-energy physics analysis.