04 November 2016

Polygon Filling in Parallel Computing

I'm more or less turning into a full-time programmer as I age, so I thought it might be interesting to reactivate the blog and post some discussions on topics I find interesting from time to time. I have here this Voronoi tessellation and I want to extract polygons from it:

How can I do this quickly? Pre-written solutions exist in the Python universe. For example, scikit-image has the function draw.polygon which can be used to generate all points inside a polygon. The function itself is written in Cython and is single-threaded. 
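As a quick illustration of how draw.polygon is used (the triangle coordinates here are my own toy example, not from the tessellation above): it takes the row and column coordinates of the vertices and returns the (rr, cc) indices of every pixel inside the polygon, which you can scatter into a boolean mask.

```python
import numpy as np
import skimage.draw

# Illustrative triangle on a 10 x 10 grid (toy values, not from the post)
r = np.array([1, 1, 8])
c = np.array([1, 8, 1])

# polygon() returns the indices of all interior pixels, clipped to `shape`
rr, cc = skimage.draw.polygon(r, c, shape=(10, 10))

mask = np.zeros((10, 10), dtype=bool)
mask[rr, cc] = True  # boolean mask of the filled triangle
```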

Here's an example of a solution that uses the numexpr package instead of skimage's Cython code to calculate the points inside a polygon. The main advantage of numexpr is that it's multi-threaded and uses blocked calculations. You could try ThreadPool from multiprocessing, but skimage.draw.polygon only releases the GIL intermittently (in the Cython sub-function point_in_polygon), so I'm not sure how well that would work. The per-pixel check in scikit-image can also be a little sub-optimal for a dense grid (e.g. filling); here's a description of a row-wise fill algorithm: http://alienryderflex.com/polygon_fill/
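The row-wise algorithm from that link can be sketched in plain Python as follows (a minimal illustration with my own function name and rounding conventions, not a tuned implementation): for each row, collect the x-coordinates where polygon edges cross that scanline, sort them, and fill between alternating pairs.

```python
import math

def scanline_fill(vertices, height, width):
    """Row-wise (scanline) polygon fill, even-odd rule.
    vertices: list of (x, y) tuples describing the polygon."""
    mask = [[False] * width for _ in range(height)]
    n = len(vertices)
    for y in range(height):
        # Find where each edge (j -> i) crosses this scanline
        crossings = []
        j = n - 1
        for i in range(n):
            xi, yi = vertices[i]
            xj, yj = vertices[j]
            if (yi <= y < yj) or (yj <= y < yi):
                crossings.append(xi + (y - yi) * (xj - xi) / (yj - yi))
            j = i
        crossings.sort()
        # Fill between alternating pairs of crossings
        for k in range(0, len(crossings) - 1, 2):
            x0 = max(0, math.ceil(crossings[k]))
            x1 = min(width, math.floor(crossings[k + 1]) + 1)
            for x in range(x0, x1):
                mask[y][x] = True
    return mask
```

Each pixel is visited at most once per row instead of being tested against every edge, which is why the row-wise approach can beat a per-pixel point-in-polygon check on a dense grid.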

Here I'm filling a polygon on a naive pixel-by-pixel basis, but because we use meshgrid it could be performed on any gridded basis that you define:

# Generate our mesh.  Here I assume we might re-use the mesh several times for tiles 
# of a different shape (or slicing in 3D), so the scope is global relative to the 
# polygonInterior() function below
import numpy as np
import numexpr as ne
import skimage.draw
from time import time

polyCorners = 5
boxLen = 2048
ne.set_num_threads(1) # Generally NumExpr is fastest when threads = # of physical cores

vertices = np.array( [[ 0, 375.56], [ 578.70, 0], [ 2048, 1345.36 ],
       [ 1318.43, 2048], [ 0, 1712.97] ], dtype='float32' )

tileShape = np.array( [boxLen, boxLen], dtype='int' )
tileExtent = np.array( [0, boxLen, 0, boxLen], dtype='int' )

tileXmesh, tileYmesh = np.meshgrid( np.arange(tileShape[1]), np.arange(tileShape[0]) )
tileXmesh = tileXmesh.astype('float32'); tileYmesh = tileYmesh.astype('float32')

# Numexpr approach
# extent = (xmin, xmax, ymin, ymax) crops the calculation to the polygon's bounding box
def polygonInterior( vertices, extent ):
    # Slice the pre-generated meshes
    xsub = tileXmesh[ extent[0]:extent[1], extent[2]:extent[3] ]
    ysub = tileYmesh[ extent[0]:extent[1], extent[2]:extent[3] ]
    # Mask shape must match the sliced meshes
    polyMask = np.zeros( [extent[1]-extent[0], extent[3]-extent[2]], dtype='bool' )
    J = vertices.shape[0] - 1

    for I in np.arange( vertices.shape[0] ):
        xpI = vertices[I,1]; ypI = vertices[I,0]
        xpJ = vertices[J,1]; ypJ = vertices[J,0]
        # Could re-use (ysub < ypJ) and (ypJ <= ysub) from the previous iteration,
        # but in testing this led to no speed-up
        polyMask = np.logical_xor( polyMask, ne.evaluate( "( (((ypI <= ysub) & (ysub < ypJ)) | \
((ypJ <= ysub) & (ysub < ypI))) & (xsub < (xpJ - xpI) * (ysub - ypI) / (ypJ - ypI) + xpI) )" ) )

        J = I
    return polyMask

t0 = time()
ne_mask = polygonInterior( vertices, tileExtent )
t1 = time()
print( "Numexpr calculated polygon mask over %d points in %f s" %( np.prod(tileShape), t1-t0 ) )

#### skimage approach ####
t2 = time()
# Column 0 is the row (y) coordinate here, matching the numexpr code above
rr, cc = skimage.draw.polygon( vertices[:,0], vertices[:,1], tileShape )
geoMask = np.zeros( tileShape, dtype='bool' )
geoMask[rr, cc] = True
t3 = time()
print( "skimage calculated polygon mask over %d points in %f s" %( np.prod(tileShape), t3-t2 )  )
print( "numexpr speed-up: %f %%" % ((t3-t2)/(t1-t0)*100) )

For 1 thread, numexpr is about 240 % faster than skimage.draw (probably because numexpr is using single-precision floats where skimage uses doubles, but also due to the blocked code execution); for 4 threads numexpr is 640 % faster, for 8 threads it's 840 % faster, and for 12 threads it's 1130 % faster (Intel Xeon E5-2680 v3 @ ~2.9 GHz). We're hurt a bit by the fact that numexpr doesn't have a boolean xor operator, so each loop iteration does one slow NumPy calculation (although maybe I'll fix that in the future). If I include the mesh generation in the numexpr time it's 185 % faster on 1 thread, but the mesh can be reused. One could use a decorator to cache the meshes on the function handle if you expect to call it repeatedly and don't want to keep the variables in module scope.
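One way to sketch that caching idea (the function name and cache size here are my own illustration, not code from the post) is functools.lru_cache keyed on the tile shape, so repeated calls with the same shape reuse the same mesh arrays:

```python
import functools
import numpy as np

@functools.lru_cache(maxsize=4)
def get_mesh(rows, cols):
    # Build the coordinate meshes once per (rows, cols); subsequent calls
    # with the same shape return the exact same cached arrays
    xmesh, ymesh = np.meshgrid(np.arange(cols), np.arange(rows))
    return xmesh.astype('float32'), ymesh.astype('float32')
```

A caveat with this approach: the cached arrays are mutable, so callers must treat them as read-only (or the cache will hand out modified meshes).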

The huge advantage of using a virtual machine like numexpr here is the implementation time and the ease of redistribution. The above code runs as-is in a Python interpreter and doesn't need to be compiled, so the implementation time was very short, and because numexpr has an efficient virtual machine, we get fast performance without hand-tuning. Numexpr is available in essentially all scientific Python distributions, as is Cython, but numexpr code doesn't need to be compiled by the end-user. This can save a lot of pain on Windows and OSX.

The result is that the texture for each polygon can be extracted in a hurry. This one took 5 ms: