Make your numerical Python code fly at transonic speed!

Overview of the Python HPC landscape and zoom on Transonic 🚀

Python, only a great glue language?
Pierre Augier, Ashwin Vishnu
PySciDataGre (19 March 2019)

Python for High-Performance Computing?

  1. Fast prototyping (Numpy!)

  2. Popular:

    • Well-known

    • Several great libraries

  3. Share ideas between developers / scientists

    • Popularity counts

    • Readability counts

    • Expressivity counts

  4. Anyway, one needs a good and well-known scripting language, so yes!

    (even considering Julia)

Where / when should we stop?

Python & fast prototyping...

The software engineering method for scientists 👩‍🔬 👨‍🔬 and HPC

  1. Fast prototyping

  2. Solidify as needed

Again and again: (1, 2), (1, 2), ...

Python: a programming language, compromises ⚖️

Designed for fast prototyping & to "glue" code together

  • Generalist + easy to learn ⇒ huge and diverse community 🌎 🌍 🌏

  • Expressivity and readability

  • Not oriented towards high performance

    (fast and easy dev, easy debug, correctness)

    • Highly dynamic + introspection (inspect.stack())

    • Automatic memory management 💾

    • All objects encapsulated 🥥 (PyObject, C struct)

    • Objects accessible through "references" ➡️

    • Usually interpreted

Python interpreters

CPython

Interpreted (nearly) instruction by instruction, (nearly) no code optimization

The numerical stack (Numpy, Scipy, Scikits, ...) is based on the CPython C API (CPython implementation details)!


PyPy

Optimized implementation with tracing Just-In-Time compilation

"Abstractions for free"

The CPython C API is an issue! PyPy can't accelerate Numpy code!


MicroPython

For microcontrollers

Python & performance

References and PyObjects

In [2]:
mylist = [1, 3, 5]

A list is an array of references to PyObjects
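A rough illustration (not in the original slides): since a list stores references, copying the list copies references, not the underlying objects.

a = [1, 3, 5]
b = list(a)          # a new list object...
print(a is b)        # False: two distinct lists
print(a[0] is b[0])  # True: both lists reference the same PyObject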

The C / Python border

In [3]:
import numpy as np
arr = 2 * np.arange(10)
print(arr[2])
4
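A hedged sketch of what crossing this border costs: every indexing operation builds a new Python object from the raw C data stored in the array (the exact scalar type is platform dependent).

import numpy as np

arr = 2 * np.arange(10)
item = arr[2]
print(type(item))        # a Numpy scalar, i.e. a full PyObject wrapping a C integer
print(arr[2] is arr[2])  # False: a fresh PyObject is created at every access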

Python & performance

Python interpreters bad at crunching numbers

Pure Python is terrible (except with PyPy)...

In [4]:
from math import sqrt
my_const = 10.
result = [elem * sqrt(my_const * 2 * elem**2) for elem in range(1000)]

but even this is not very efficient (temporary objects; see the sketch below)...

In [5]:
import numpy as np
a = np.arange(1000)
result = a * np.sqrt(my_const * 2 * a**2)

Even slightly worse with PyPy 🙁
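A hedged sketch (not from the original slides) of how the temporaries can be reduced, reusing one buffer with the out parameter of Numpy ufuncs:

import numpy as np

my_const = 10.
a = np.arange(1000, dtype=float)

tmp = a**2                               # one temporary array
np.multiply(tmp, my_const * 2, out=tmp)  # reuse the same buffer
np.sqrt(tmp, out=tmp)
result = np.multiply(a, tmp, out=tmp)    # only one temporary in total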

Is Python efficient enough?

Python is known to be slow... But what does that mean?


Efficiency / inefficiency: depends on the task ⏱


When is it inefficient? Especially for number crunching...


Can we write efficient scientific code in 🐍?

Book

Performance (generalities)

Measure ⏱, don't guess! Profile to find the bottlenecks.

cProfile (pstats, SnakeViz), line_profiler, perf, perf_events
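A minimal profiling sketch using only the standard library (the profiled function and the file name are made up):

import cProfile
import pstats

def compute():
    return sum(i * i for i in range(100_000))

cProfile.run("compute()", "prof.out")          # run and save the statistics
stats = pstats.Stats("prof.out")
stats.sort_stats("cumulative").print_stats(5)  # show the 5 most expensive entries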


Do not optimize everything!

  • "Premature optimization is the root of all evil" (Donald Knuth)

  • 80 / 20 rule: efficiency is important for the expensive parts and NOT for the small ones


CPU-bound or IO-bound problems


Use the right algorithms and the right data structures!

For example, using Numpy arrays instead of Python lists...
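A rough comparison sketch (timings are indicative only, variable names made up):

import numpy as np
from timeit import timeit

values = list(range(100_000))
arr = np.arange(100_000)

print(timeit(lambda: sum(values), number=100))  # pure Python loop over PyObjects
print(timeit(lambda: arr.sum(), number=100))    # a single call into compiled code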


Unit tests before optimizing, to maintain correctness!

unittest, pytest
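For instance, a tiny pytest-style test (the kernel and the test are hypothetical; put them in a file like test_kernel.py and run pytest):

import numpy as np

def kernel(a):
    return a * np.sqrt(20. * a**2)

def test_kernel():
    a = np.arange(10.)
    expected = [x * (20. * x**2) ** 0.5 for x in range(10)]
    assert np.allclose(kernel(a), expected)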

"Crunching numbers" and computers architecturesΒΆ


CPU optimizations

  • pipelining, hyper-threading, vectorization, advanced instructions (SIMD), ...

  • important to get data aligned in memory (arrays)

Proper compilation needed for high efficiency!


Compilation to virtual machine instructions

What CPython does (compilation to "bytecode", nearly no optimization; see the dis module)
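For example, the dis module shows the bytecode that CPython actually runs (the exact output varies between CPython versions):

import dis

def myfunc(x):
    return x**2

dis.dis(myfunc)  # prints the bytecode instructions of myfunc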


Compilation to machine instructions

  • Just-in-time

    Has to be fast (warm up), can be hardware specific

  • Ahead-of-time

    Can be slow; hardware specific, or more generic so that binaries can be distributed

Compilers are usually good for optimizations! Better than most humans...


Transpilation

From one language to another language (for example Python to C++)

Parallelism

Hardware:

  • Multicore CPU

  • Multi-node supercomputers (MPI)

  • GPU (Nvidia: CUDA, CuPy) / Intel Xeon Phi


Different problems

  • CPU-bound (need to use several cores at the same time)

  • IO-bound (waiting for IO)

Different parallel strategies

IO-bound: one process + async/await


Cooperative concurrency


Functions able to pause


asyncio, trio
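A minimal asyncio sketch (the coroutines are made up): two IO-bound waits overlap within a single process and a single thread.

import asyncio

async def fake_request(name, delay):
    await asyncio.sleep(delay)  # stands in for a network or disk wait
    return f"{name} done"

async def main():
    results = await asyncio.gather(
        fake_request("a", 1.0), fake_request("b", 1.0)
    )
    print(results)  # takes ~1 s in total, not 2 s

asyncio.run(main())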

Different parallel strategies

One process split into lightweight execution flows called threads

  • handled by the OS

  • share memory and can use different CPU cores at the same time

How?

  • OpenMP (C / C++ / Fortran)

  • In Python: threading and concurrent.futures

⚠️ In Python, (roughly) one interpreter per process and the Global Interpreter Lock (GIL)...

  • In a Python program, different threads can run at the same time (and take advantage of multicore)

  • But... the Python interpreter runs the Python bytecode sequentially!

    • Terrible 🐌 for CPU-bound tasks if the Python interpreter is used a lot!

    • No problem for IO-bound tasks! (see the sketch below)
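A small sketch (the function is made up): despite the GIL, threads do help for IO-bound work because the interpreter is released while waiting.

import time
from concurrent.futures import ThreadPoolExecutor

def fake_io(i):
    time.sleep(1.0)  # the GIL is released during the wait
    return i

with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(fake_io, range(4)))  # ~1 s in total, not 4 s

print(results)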

Different parallel strategies

One program, $n$ processes

Exchange data (for example with MPI):

Very efficient and no problem with Python!

  • mpi4py
  • h5py parallel
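A hedged mpi4py sketch (to be launched with something like mpirun -np 4 python script.py):

import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

data = np.full(8, rank, dtype="d")
total = np.empty(8, dtype="d")
comm.Allreduce(data, total, op=MPI.SUM)  # sum the arrays of all processes

if rank == 0:
    print(total)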

2 other packages for parallel computing with Python

  • dask
  • joblib
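For example, a small joblib sketch (the function is made up) for an embarrassingly parallel loop over several processes:

from joblib import Parallel, delayed

def slow_square(x):
    return x**2

results = Parallel(n_jobs=4)(delayed(slow_square)(i) for i in range(8))
print(results)  # [0, 1, 4, 9, 16, 25, 36, 49]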

Python for HPC: first a glue language

Many tools to interact with static languages:

     ctypes, cffi, cython, cppyy, pybind11, f2py, pyo3, ...

Glue together pieces of native code (C, Fortran, C++, Rust, ...) with a nice syntax

     ⇒ Numpy, Scipy, ...
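As a tiny taste of this "glue" role, a ctypes sketch calling the C math library directly, without writing any wrapper code (Linux / macOS):

import ctypes
import ctypes.util

libm = ctypes.CDLL(ctypes.util.find_library("m"))  # load the C math library
libm.cos.argtypes = [ctypes.c_double]
libm.cos.restype = ctypes.c_double

print(libm.cos(0.0))  # 1.0, computed by compiled C code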

Remarks:

  • Numpy: great syntax for expressing algorithms, (nearly) as much information as in Fortran

  • Performance of a @ b (Numpy) versus a * b (Julia)?

          Same! The same library is called! (often OpenBLAS or MKL)

General principle for performance with Python (not fully valid for PyPy):

Don't use the Python interpreter (and small Python objects) too often for computationally demanding tasks.

Pure Python

     β†’ Numpy

         β†’ Numpy without too many loops (vectorized)

            → C extensions

But ⚠️ ⚠️ ⚠️ writing a C extension by hand is not a good idea! ⚠️ ⚠️ ⚠️

No need to leave the Python language to avoid overusing the Python interpreter!

Tools to compile Python / write C extensions

  • Cython: a language that is a superset of Python

  • A great mix of Python / C / CPython C API!

    Very powerful, but a tool for experts!

  • Easy to study where the interpreter is used (cython --annotate).

  • Very mature

  • Now able to use Pythran internally...

My experience: large Cython extensions difficult to maintain

Numba: (per-method) JIT for Python-Numpy code

  • Very simple to use (just add a few decorators) 🙂
In [6]:
from numba import jit

@jit
def myfunc(x):
    return x**2
  • "nopython" mode (fast and no GIL) πŸ™‚

  • Also a "python" mode πŸ™‚

  • GPU and Cupy πŸ˜€

  • Methods (of classes) πŸ™‚

Python decorators

In [7]:
def mydecorator(func):
    # do something with the function
    print(func)
    # return a(nother) function
    return func
In [8]:
@mydecorator
def myfunc(x):
    return x**2
<function myfunc at 0x7fc5bd76f378>

This mysterious syntax with @ is just syntactic sugar for:

In [9]:
def myfunc(x):
    return x**2

myfunc = mydecorator(myfunc)
<function myfunc at 0x7fc5bd76f598>

Numba: (per-method) JIT for Python-Numpy code

  • Sometimes not as efficient as it could be 🙁

    (usually slower than Pythran / Julia / C++)


  • Only JIT 🙁


  • Not good at optimizing high-level NumPy code 🙁

Pythran: AOT compiler for modules using Python-Numpy

Transpiles Python to efficient C++

  • Good at optimizing high-level NumPy code 😎

  • Extensions never use the Python interpreter (pure C++ ⇒ no GIL) 🙂

  • Can produce C++ that can be used without Python

  • Usually very efficient (sometimes faster than Julia)

    • High and low level optimizations

      (Python optimizations and C++ compilation)

    • SIMD 🤩 (with xsimd)

    • Understands OpenMP directives 🤗!

  • Can use and make PyCapsules (functions operating in the native world) 🙂
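A hedged sketch of the Pythran workflow (module name and signature are made up): the code stays plain Python-Numpy, the types are given in an export comment, and pythran mymod.py produces a native extension.

# file mymod.py (hypothetical)
# pythran export wsum(float64[], float64[])
import numpy as np

def wsum(x, w):
    return (x * np.sqrt(w)).sum()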

High-level transformations

In [11]:
# computation of value ranges
print_optimized("""
def f(x):
    y = 1 if x else 2
    return y == 3
""")
def f(x):
    return 0

In [12]:
# inlining
print_optimized("""
def foo(a):
    return  a + 1
def bar(b, c):
    return foo(b), foo(2 * c)
""")
def foo(a):
    return a + 1


def bar(b, c):
    return ((b + 1), ((2 * c) + 1))

In [13]:
# unroll loops
print_optimized("""
def foo():
    ret = 0
    for i in range(1, 3):
        for j in range(1, 4):
            ret += i * j
    return ret
""")
def foo():
    ret = 0
    ret += 1
    ret += 2
    ret += 3
    ret += 2
    ret += 4
    ret += 6
    return ret

In [14]:
# constant propagation
print_optimized("""
def fib(n):
    return n if n< 2 else fib(n-1) + fib(n-2)
    
def bar(): 
    return [fib(i) for i in [1, 2, 8, 20]]
""")
import functools as __pythran_import_functools


def fib(n):
    return n if (n < 2) else (fib((n - 1)) + fib((n - 2)))


def bar():
    return [1, 1, 21, 6765]


def bar_lambda0(i):
    return fib(i)

In [15]:
# advanced transformations
print_optimized("""
import numpy as np
def wsum(v, w, x, y, z):
    return sum(np.array([v, w, x, y, z]) * (.1, .2, .3, .2, .1))
""")
import numpy as __pythran_import_numpy


def wsum(v, w, x, y, z):
    return __builtin__.sum(
        ((v * 0.1), (w * 0.2), (x * 0.3), (y * 0.2), (z * 0.1))
    )

Pythran: AOT compiler for modules using Python-Numpy

  • Compiles only full modules (⇒ refactoring needed 🙁)
  • Only "nopython" mode

    • limited to a subset of Python

      • only homogeneous lists / dicts 🤷‍♀️
      • no methods (of classes) 😒 and no user-defined classes
    • limited to a few extension packages (Numpy + bits of Scipy)

    • pythranized functions can't call Python functions

  • No JIT: need types (written manually in comments)

  • Lengthy ⌛️ and memory-intensive compilations

  • Debugging 🐜 Pythran requires C++ skills!

  • No GPU (maybe with OpenMP 4?)

  • Intel compilers unable to compile Pythran C++11 👎

First conclusions

  • Python great language & ecosystem for sciences & data
  • Performance issues, especially for crunching numbers

    ⇒ need to accelerate the "numerical kernels"

  • Many good accelerators and compilers for Python-Numpy code

    • All have pros and cons!

    ⇒ We shouldn't have to write specialized code for one accelerator!

  • Other languages don't replace Python for sciences

    • Modern C++ is great and very complementary 💑 with Python

    • Julia is interesting but not heaven on earth

Make your numerical Python code fly at transonic speed 🚀!

Transonic is landing 🛬!

Pure Python package (>= 3.6) to easily accelerate modern Python-Numpy code with different accelerators

Work in progress! Current state: one backend based on Pythran!

  • Keep your Python-Numpy code clean and "natural" 🧘

  • Clean type annotations (🐍 3)

  • Easily mix Python code and compiled functions

  • JIT based on AOT compilers

  • Methods (of classes) and blocks of code
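A hedged sketch of what Transonic code looks like, based on the project's documented boost decorator and type annotations (exact signatures may differ):

import numpy as np
from transonic import boost

@boost
def wsum(x: "float[]", w: "float[]"):
    # plain Python-Numpy code; compiled through the Pythran backend when available,
    # with a fallback to the pure-Python version otherwise
    return (x * np.sqrt(w)).sum()

print(wsum(np.ones(4), np.ones(4)))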

Transonic: examples from real-life packages

Also works well in simple scripts and IPython / Jupyter.

Transonic: how does it work?

  • AST analyses (using Beniget, no import at compilation time)
In [24]:
# abstract syntax tree
import ast
tree = ast.parse("great_tool = 'Beniget'")
assign = tree.body[0]
print(f"{assign.value.s} is a {assign.targets[0].id}")
Beniget is a great_tool
  • Write the (Pythran) files when needed

  • Compile the (Pythran) files when needed

  • Use the fast solutions when available

Transonic: Perspectives

  • Alternative syntax for blocks of code (with block():)

  • PyCapsules

  • Backends using Cython, Numba, CuPy

Need funding 💰!

Pythran and Transonic are cool projects. But no 💰! A difference compared to Numba.

Conclusions

  • Very nice and efficient scientific software can be easily built with modern Python

My personal choice / hope for HPC for humans and 🐍

  • PyPy ("abstractions for free") + Numpy accelerators used through Transonic

  • Modern C++ for more fundamental tools (with multi-language API)