17 February 2017

Introduction to NumExpr-3 (Alpha)

I've been working on a new branch of the venerable NumExpr module for Python.  It's a fairly major extension, as I discuss below.  The alpha release can be cloned from GitHub as follows:

cd numexpr
git checkout numexpr-3.0
python setup.py install  
(or pip install . but that suppresses error messages)

What's new?


Faster

The operations were re-written in such a way that gcc can auto-vectorize the loops to use SIMD instructions. Each operation now has a strided and aligned branch, which improves performance on aligned arrays by ~ 40 %. The setup time for threads has been reduced, by removing an unnecessary abstraction layer, and various other minor re-factorizations, resulting in improved thread scaling.

The combination of speed-ups means that NumExpr3 often runs 200-500 % faster than NumExpr2.6 on a machine with AVX2 support. The break-even point with NumPy is now roughly arrays with 64k-elements, compared to 256-512k-elements for NE2.

Example 1: Multiply-Add


Here 8 threads are used for both NE2 and NE3 (NumPy is single-threaded), and the cache dictionaries for the NumExprs have been disabled.  
The gap with complex number mathematics is larger because the entirety of the complex operations within NumExpr3 was implemented with vectorized functions. Vectorization hasn't been done yet with cmath functions for real numbers (such as sqrt()exp(), etc.), only for complex functions, but arithmetic operations such as +-*/  have been auto-vectorized by gcc. Here we are cheating, because NumExpr2 doesn't support complex64 as a data type, so it has to up-cast the data to complex-128. Similar advantages are had in image processing for NE3 on uint8 or uint16, whereas NE2 would up-cast to int32 and promptly return wrong values in the case of under- or overflow.


Example 2: Mandelbrot set

Jean-Francois Puget wrote a blog post last year for IBM, "How To Quickly Compute The Mandelbrot Set In Python." NumExpr2 didn't perform especially well compared to Numba.  There are some reasons for this. In Puget's code, the calculation would have been performed with complex128 instead of complex64 which was intended because NE2 upcasts consts to float64 and didn't have a complex64 datatype. Also it's not clear how many threads were used (he did the calculations on this laptop). 

Here's an example of the algorithm in a NumExpr3 format:


def mandelbrot_ne3(c, maxiter):
    output = np.zeros(c.shape, dtype='float32')
    notdone = np.zeros(c.shape, dtype='bool')
    z = np.zeros(c.shape, dtype='complex64' )

    # Almost 30 % of the time in a comparison appears to be in the 
    # cast to npy_bool
    neObj1 = ne3.NumExpr( 'notdone = abs2(z) < 4.0' )
    neObj2 = ne3.NumExpr( 'z = where(notdone,z*z+c,z)' )
    for it in np.arange(maxiter, dtype='float32'):
        # Here 'it' changes, but the AST parser doesn't know that and treats it
        # as a const if we use 'where(notdone, it, output)'
        # What we really need is an iter( ops, range ) function inside 
        # ne3.  This is an interesting case, since really here we see a 
        # major limitation in NumExpr working inside a loop.
        neObj1.run( check_arrays=False )
        output[notdone] = it
        neObj2.run( check_arrays=False )
    
    output[output == maxiter-1] = 0
    return output

def mandelbrot_set_ne3(xmin,xmax,ymin,ymax,width,height,maxiter):
    r1 = np.linspace(xmin, xmax, width, dtype='float32')
    r2 = np.linspace(ymin, ymax, height, dtype='float32')
    c = r1 + r2[:,None]*1j
    n3 = mandelbrot_ne3(c,maxiter)
    return (r1,r2,n3.T) 

Here I benchmarked with 4 threads on a Xeon CPU with a 2.5 GHz clock:

NE2.6:   365 ms for set #1
NE2.6:   9.73 s for set #2
NE3:     138 ms for set #1
NE3:     4.26 s for set #2

So again we see a nice 250 % speedup for NE3 compared to NE2.  There are some limitations in this algorithm for NumExpr in that the variable `it` changes with each iteration, but would be treated as a const by NumExpr.  This forces us in and out of the interpreter more than we would prefer.  There's a lot more that I believe can be done to improve NE3 performance, especially with regards to thread scaling.


More NumPy Datatypes

The program was re-factorized from a ascii-encoded byte code to a struct array, so that the operation space is now 65535 instead of 128.  As such, support for uint8, int8, uint16, int16, uint32, uint64, and complex64 data types was added.

NumExpr3 now uses NumPy 'safe' casting rules. If an operation doesn't return the same result as NumPy, it's a bug.  In the future other casting styles will be added if there is a demand for them.

More complete function set

With the enhanced operation space, almost the entire C++11 cmath function set is supported (if the compiler library has them; only C99 is expected). Also bitwise operations were added for all integer datatypes.  There are now 436 operations/functions in NE3, with more to come, compared to 190 in NE2.

Also a library-enum has been added to the op keys which allows multiple backend libraries to be linked to the interpreter, and changed on a per-expression basis, rather than picking between GNU std and Intel VML at compile time, for example.

More complete Python language support

The Python compiler was re-written from scratch to use the CPython `ast` module and a functional programming approach. As such, NE3 now compiles a wider subset of the Python language. It supports multi-line evaluation, and assignment with named temporaries.  The new compiler spends considerably less time in Python to compile expressions, about 200 us for 'a*b' compared to 550 us for NE2.

Compare for example:

    out_ne2 = ne2.evaluate( 'exp( -sin(2*a**2) - cos(2*b**2) - 2*a**2*b**2' )

to:

    neObj = NumExp( '''a2 = a*a; b2 = b*b
out_magic = exp( -sin(2*a2) - cos(2*b2) - 2*a2*b2''' ) 

This is a contrived example but the multi-line approach will allow for cleaner code and more sophisticated algorithms to be encapsulated in a single NumExpr call. The convention is that intermediate assignment targets are named temporaries if they do not exist in the calling frame, and full assignment targets if they do, which provides a method for multiple returns. Single-level de-referencing (e.g. `self.data`) is also supported for increased convenience and cleaner code. Slicing still needs to be performed above the ne3.evaluate() or ne3.NumExpr() call. 

More maintainable

The code base was generally refactored to increase the prevalence of single-point declarations, such that modifications don't require extensive knowledge of the code. In NE2 a lot of code was generated by the pre-processor using nested #defines.  That has been replaced by a object-oriented Python code generator called by setup.py, which generates about 15k lines of C code with 1k lines of Python. The use of generated code with defined line numbers makes debugging threaded code simpler. 

The generator also builds the autotest portion of the test submodule, for checking equivalence between NumPy and NumExpr3 operations and functions. 

What's TODO compared to NE2?

  • strided complex functions
  • Intel VML support (less necessary now with gcc auto-vectorization)
  • bytes and unicode support
  • eductions (mean, sum, prod, std)

What I'm looking for feedback on

  • String arrays: How do you use them?  How would unicode differ from bytes strings?
  • Interface: We now have a more object-oriented interface underneath the familiar evaluate() interface. How would you like to use this interface?  Francesc suggested generator support, as currently it's more difficult to use NumExpr within a loop than it should be.

Ideas for the future

  • vectorize real functions (such as exp, sqrt, log) similar to the complex_functions.hpp vectorization.
  • Add a keyword (likely 'yield') to indicate that a token is intended to be changed by a generator inside a loop with each call to NumExpr.run()

If you have any thoughts or find any issues please don't hesitate to open an issue at the Github repo. Although unit tests have been run over the operation space there are undoubtedly a number of bugs to squash.