31 December 2016

bloscpickle

Edited 27-01-2017 to add RapidJSON results.

Python has a number of libraries available for serializing data and objects, generally for the purpose of passing them from one process or node to another, or for saving program state to disk. Serialization in a dynamically-typed language such as Python brings with it some challenges, which typically result in modules having limitations in what they can serialize. My interest is mostly in packaging metadata with microscopy binary data, where one might have a few megabytes of metadata alongside gigabytes of image stacks.

The built-in modules are pickle, marshal, and json. I will also look at the third-party modules ujson, python-rapidjson, and msgpack-python. All of them produce either text or binary representations, and all are uncompressed. I thought I would test an implementation wrapping them with the blosc meta-compressor library to compress their outputs before writing to disk, to see what sort of space savings, and potentially performance enhancements, could be gained. The code presented herein is available at:

https://github.com/robbmcleod/bloscpickle

It's not intended as a production-ready tool.
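The basic pattern is simply to compress the serializer's output before it touches disk, and decompress before deserializing on the way back. A minimal sketch of the idea (the function names and defaults here are mine, not the bloscpickle API):

import pickle
import blosc

def dump( obj, filename, cname='zstd', clevel=1 ):
    # Serialize to bytes first, then compress the whole buffer with blosc
    raw = pickle.dumps( obj, protocol=pickle.HIGHEST_PROTOCOL )
    with open( filename, 'wb' ) as fh:
        fh.write( blosc.compress( raw, typesize=1, cname=cname, clevel=clevel ) )

def load( filename ):
    # Decompress, then hand the bytes back to pickle
    with open( filename, 'rb' ) as fh:
        return pickle.loads( blosc.decompress( fh.read() ) )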

Pickle: is Python's most robust serialization tool. It can manage custom classes and objects, as well as circular references, and it does not duplicate objects found to have multiple references. It outputs binary. It is not compatible with other languages. Pickle received a major speed upgrade with Python 3, which also came with a new file protocol. Pickle is used, for example, in the multiprocessing module to exchange data between processes. Pickle is often cited as a security hazard because unpickling can execute arbitrary code; that is the flip side of its versatility in serializing objects.

Marshal: is Python's internal serialization module. It can serialize only Python base types (essentially lists, tuples, and dicts), and it will crash if fed a circular reference. It is not even compatible across different versions of Python. It is supposed to be the fastest of the tested standard Python modules.

JSON: otherwise known as JavaScript Object Notation, is the most ubiquitous format for serializing objects. It essentially deals with only two constructs: lists and dicts. As such, it requires helper functions to be implemented in order to serialize custom objects. It's not binary, so it can be edited by hand (with some difficulty; it's picky about commas and similar formatting errors). As with pickle, json received a major performance upgrade in Python 3, such that many external implementations of JSON were obsoleted, with a couple of exceptions such as...

UltraJSON: developed by an Electronic Arts studio (but released under a BSD license), UltraJSON is just like JSON, but faster. One drawback of ujson versus the default library is that it can fail silently.

RapidJSON: another fast JSON parser, built on top of a C++ library. Here I use Hideo Hattori's wrapper, which is the more complete of the two Python wrappers.

Message Pack: is billed as a binary equivalent of JSON. On first glance it was very intriguing, as it offers a significant encoding-rate and encoded-size advantage over JSON, and it has implementations for basically every programming language in use today. However, I ran into problems in testing. By default it converts Unicode strings to byte strings, which can cause a loss of information. When encoding as Unicode it loses its speed advantage over the faster JSON implementations. Furthermore, while msgpack will serialize objects, it doesn't serialize all their attributes, and so it fails silently.
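Whether the round-trip preserves str versus bytes depends on the flags passed to the packer and unpacker, and the exact keywords have changed between msgpack-python versions, so treat this as a sketch:

import msgpack

data = { 'name': 'façade', 'blob': b'\x00\x01' }
# use_bin_type=True keeps str and bytes distinct in the encoded stream
packed = msgpack.packb( data, use_bin_type=True )
# Older msgpack-python wants encoding='utf-8' here; newer versions use raw=False
unpacked = msgpack.unpackb( packed, raw=False )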

Blosc: isn't a serializer, it's a compressor, or more properly a meta-compressor. Blosc wraps a variety of compression codecs with a thread pool to provide very high performance. The two best compressors in blosc, in my experience, are lz4, which is ultra-fast with middling compression (and the new standard codec for the ZFS file system), and Zstandard, which achieves better compression ratios than zlib/gzip and is still very fast. Zstandard is new as of 2015 and essentially offers something for nothing compared to older compression algorithms. It's usually within spitting distance (a few %) of BZIP2 for compression ratio and far faster, being heavily optimized for parallel computing. In testing on Xeon machines I've achieved compression rates of about 12 GB/s with lz4 and 5 GB/s with zstd. That is gigabytes per second. Blosc also has a filter stage, which at present is byte- or bit-shuffling. I've found bit-shuffling to be effective when compressing floating-point or otherwise dynamic-range-limited data. It would probably be extremely helpful for DNA or protein sequences, for example. Here I did not find shuffling to be effective. Throughout this post I use compression level 1 for zstd and compression level 9 for lz4. Lz4 does not really slow down at all with compression level, and zstd saturates much earlier than zlib (there's rarely an advantage to going past 4).
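As a rough illustration of the settings used throughout this post (zstd at level 1, lz4 at level 9, no shuffle), with bit-shuffle included for comparison on float data; the printed sizes will of course vary with the input:

import numpy as np
import blosc

blosc.set_nthreads( 4 )
floats = np.random.normal( size=2**20 ).astype('float32')
raw = floats.tobytes()

zstd1 = blosc.compress( raw, typesize=4, cname='zstd', clevel=1, shuffle=blosc.NOSHUFFLE )
lz4_9 = blosc.compress( raw, typesize=4, cname='lz4',  clevel=9, shuffle=blosc.NOSHUFFLE )
bits  = blosc.compress( raw, typesize=4, cname='zstd', clevel=1, shuffle=blosc.BITSHUFFLE )
print( len(raw), len(zstd1), len(lz4_9), len(bits) )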

All tests were performed with Python 3.5 on a Core i5-3570K (3.4 GHz), running 4 threads for blosc, and a Western Digital 3 TB 'Red' drive formatted as NTFS. Ideally one would perform this test on a Linux system with the disk cache flushed between each test. I would expect some additional performance from Python 3.6, in particular because we are using dicts here, but I use a lot of libraries so it will be some time before I move to it myself.

High-entropy Benchmark: 256k UUID4 keys

I tested the serialization tools on 256k UUID4 keys with singleton values. This is a fairly challenging data set for compression because there's quite a high degree of randomness inherent in what are supposed to be unique identifiers.
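The test dictionary is built along these lines (a sketch; the actual singleton values used in the benchmark are not important, only the essentially random keys):

import uuid

# 256k unique, high-entropy keys with placeholder values
testDict = { uuid.uuid4().hex: 0 for _ in range(256*1024) }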
Figure 1: Write times for 256k UUID4 keys.
Figure 2: Read times for 256k UUID4 keys.
Figure 3: Size of the data on disk. The difficulty of this data set is evident in that lz4 compression achieved little. However, zstd shines here, cutting the data in half.

Overall, for pickle, zstd compression yields about a 25 % write-time penalty, but this is nearly negated by a corresponding reduction in read time. Since the data is small, I expect 'writing' is just depositing the file in the hard drive's cache.

The increased performance of blosc-JSON compared to pure JSON is somewhat paradoxical, and is due not to the JSON serialization but to the poor performance of Python in reading and writing Unicode streams to disk. If you encode the JSON output as UTF-8 and write it as bytes, it's much faster. I left the benchmarks as-is, because it's something to keep in mind. Similarly, marshal seems to be faster at reading when it is passed a bytes object instead of a stream.
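The difference amounts to whether Python's text layer does the encoding inside the write, or you do it once yourself; a sketch of the two patterns:

import json

record = { 'a': 1, 'b': 'some text' }

# Slower: write through a Unicode text stream
with open( 'slow.json', 'w' ) as fh:
    json.dump( record, fh )

# Faster: encode to UTF-8 once and write raw bytes
with open( 'fast.json', 'wb' ) as fh:
    fh.write( json.dumps( record ).encode( 'utf-8' ) )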

Message Pack looks on the surface to offer impressive performance, but as mentioned above the Python implementation often omits important information from objects. If I worked on an enterprise-level project, I might dig more deeply into why and when it fails, but I don't, so I won't.

Low-entropy Benchmark: JSON Generator

Here I generated 10,000 pseudo-phonebook entries with the handy JSON Generator, which corresponds to about 25 MB of JSON data. This data has many more repeated elements, in particular a lot of non-unique keys, which should improve the relative performance of the compression tools.


Figure 4: Write times for ten-thousand procedurally generated phonebook-like JSON entries.

Figure 5: Read times for ten-thousand procedurally generated phonebook-like JSON entries.
Figure 6: Size of the data on disk. Here the data is significantly less random, so the compression tools, and especially lz4, perform better than with the high-entropy data set. The blocksize was 64 kB.

Overall lz4 reduces the disk usage by about 40-50 % and Zstandard shaves another 10 % off of that. If you are consistently dealing with larger data chunks, the blocksize could be increased. Typically blosc is fastest when the block fits into L2 cache, but compression ratio usually keeps increasing up to about 1 MB blocks before saturating.
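python-blosc exposes the block size directly, so if your chunks are consistently large you can force bigger blocks (a sketch, using the 1 MB figure above):

import blosc

blosc.set_blocksize( 2**20 )  # ~1 MB blocks instead of blosc's automatic choice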

Here both UltraJSON and Message Pack silently failed the read/write assert test. The ujson error appears to be related to limited precision when reading floats, and for Message Pack the problem was its conversion of Unicode to bytes.

Conclusions

Overall, on fairly difficult data blosc reduces file size by about 50 % in return for a 25 % write-speed penalty. However, the read time is accelerated, such that the net time spent on file I/O is more or less a push. On more highly compressible data (e.g. DNA base pairs, protein sequences), and in particular data large enough to swamp the hard disk's cache (typically 0.5-1.0 GB), one would expect blosc + serialization to be faster than pure serialization.

Only pickle offers out-of-the-box functionality for serializing objects. If you want to serialize with JSON, to maintain cross-language compatibility, then you'll need to implement helper methods yourself. UltraJSON looks great on the surface, but I did manage to break it, so I wouldn't consider it an out-of-the-box robust solution; the failure could be as simple as Python's boolean True mapping to 'True' on disk and back to the string 'True' after a read. Still, it beats pickle in speed. RapidJSON (added to the benchmarks above) is another JSON library with Python wrappers to examine; there are two implementations, python-rapidjson and pyrapidjson.
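For reference, the helper methods amount to a default= hook on encode and an object_hook on decode; a hypothetical minimal example (the Micrograph class here is made up for illustration):

import json

class Micrograph:
    def __init__( self, shape, pixelsize ):
        self.shape = shape
        self.pixelsize = pixelsize

def encode( obj ):
    # Called by json.dumps for any object it cannot serialize natively
    if isinstance( obj, Micrograph ):
        return { '__micrograph__': True, 'shape': obj.shape, 'pixelsize': obj.pixelsize }
    raise TypeError( "Cannot serialize %s" % type(obj) )

def decode( dct ):
    # Called for every decoded dict; rebuild the object if it carries our tag
    if dct.get( '__micrograph__', False ):
        return Micrograph( dct['shape'], dct['pixelsize'] )
    return dct

text = json.dumps( Micrograph( [2048, 2048], 0.13 ), default=encode )
obj  = json.loads( text, object_hook=decode )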

One aspect I wanted to look at was trying to monkey-patch the multiprocessing module to use bloscpickle instead of pickle. However, pickle is not exposed there, so one would have to patch the reduction.py file inside the module.

One disadvantage of blosc at present is that it does not have a streaming interface, i.e. it deals in whole bytes objects. This means it will store an extra (compressed) copy of the data in memory, relative to vanilla pickling. It also used to hold onto the GIL, although that has been patched out and should go live with the next release.

04 November 2016

Polygon Filling in Parallel Computing

I'm more or less turning into a full-time programmer as I age, so I thought it might be interesting to reactivate the blog and post some discussions on topics I find interesting from time to time. I have here this Voronoi tessellation and I want to extract polygons from it:



How can I do this quickly? Pre-written solutions exist in the Python universe. For example, scikit-image has the function draw.polygon, which can be used to generate all points inside a polygon. The function itself is written in Cython and is single-threaded.

Here's an example of a solution that uses the package numexpr instead of skimage's Cython code to calculate the points inside a polygon. The main advantage of numexpr is that it's multi-threaded and uses blocked calculations. You could try ThreadPool from multiprocessing, but skimage.draw.polygon only releases the GIL intermittently (in the Cython sub-function point_in_polygon), so I'm not sure how well that would work. The per-pixel check in scikit-image can be a little sub-optimal for a dense grid (e.g. filling); here's an example of a row-wise fill algorithm: http://alienryderflex.com/polygon_fill/

Here I'm filling a polygon on a naive pixel-by-pixel basis, but because we use meshgrid it could be performed on any gridded basis that you define:

# Generate our mesh.  Here I assume we might re-use the mesh several times for tiles 
# of a different shape (or slicing in 3D), so the scope is global relative to the 
# polygonInterior() function below
import numpy as np
import numexpr as ne
import skimage.draw
from time import time

polyCorners = 5
boxLen = 2048
ne.set_num_threads(1) # Benchmark at 1 thread; generally NumExpr is fastest when threads = # of physical cores

vertices = np.array( [[ 0, 375.56], [ 578.70, 0], [ 2048, 1345.36 ],
       [ 1318.43, 2048], [ 0, 1712.97] ], dtype='float32' )

tileShape = np.array( [boxLen, boxLen], dtype='int' )
tileExtent = np.array( [0, boxLen, 0, boxLen], dtype='int' )

tileXmesh, tileYmesh = np.meshgrid( np.arange(tileShape[1]), np.arange(tileShape[0]) )
tileXmesh = tileXmesh.astype('float32'); tileYmesh = tileYmesh.astype('float32')

# Numexpr approach
# extent=(xmin,xmax,ymin,ymax) crops the computation to the polygon's bounding box
def polygonInterior( vertices, extent ):
    # Slice the pre-generated meshes
    xsub = tileXmesh[ extent[0]:extent[1], extent[2]:extent[3] ]
    ysub = tileYmesh[ extent[0]:extent[1], extent[2]:extent[3] ]
    polyMask = np.zeros( [extent[3]-extent[2], extent[1]-extent[0]], dtype='bool' )
    
    J = vertices.shape[0] - 1
    ypJ = vertices[J,0]

    for I in np.arange( vertices.shape[0] ):
        xpI = vertices[I,1]; ypI = vertices[I,0]
        xpJ = vertices[J,1]; ypJ = vertices[J,0]
        # Could re-use from I: ysub < ypJ, ypJ <= ysub but in testing this led to no speed-up
        polyMask = np.logical_xor( polyMask, ne.evaluate( "( (((ypI  <= ysub) & (ysub < ypJ)) | \
((ypJ <= ysub) & (ysub < ypI)  )) & (xsub < (xpJ - xpI) * (ysub - ypI) / (ypJ - ypI) + xpI) )" ) )

        J = I
    return polyMask
    
t0  = time()
ne_mask = polygonInterior( vertices, tileExtent )
t1 = time()
print( "Numexpr calculated polygon mask over %d points in %f s" %( np.prod(tileShape), t1-t0 ) )

#### skimage approach ####
t2 = time()
geoMask = np.zeros( tileShape, dtype='bool' )
# skimage.draw.polygon returns the (row, col) indices of the pixels inside the polygon
rr, cc = skimage.draw.polygon( vertices[:,1], vertices[:,0], tileShape )
geoMask[rr, cc] = True
t3 = time()
print( "skimage calculated polygon mask over %d points in %f s" %( np.prod(tileShape), t3-t2 )  )
print( "numexpr speed-up: %f %%" % ((t3-t2)/(t1-t0)*100) )


For 1 thread, numexpr is about 240 % faster than skimage.draw (probably because NE is using floats while skimage uses doubles, but also due to the blocked code execution); for 4 threads numexpr is 640 % faster, for 8 threads it's 840 % faster, and for 12 cores it's 1130 % faster (Intel Xeon E5-2680 v3 @ ~2.9 GHz). We're hurt a bit by the fact that numexpr doesn't have an xor operator, so each loop iteration does have a slow NumPy calculation (although maybe I'll fix that in the future). If I include the mesh generation in the numexpr time it's 185 % faster on 1 thread, but the mesh can be reused. One could use decorators to save the meshes on the function handle if you expect to call it repeatedly and don't want the mesh variables hanging around in the enclosing scope (see the sketch below).
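A minimal sketch of the "save the meshes on the function handle" idea, so the mesh is only rebuilt when the tile shape changes:

import numpy as np

def _getMesh( shape ):
    # Cache the float32 meshes as attributes on the function itself
    if getattr( _getMesh, 'shape', None ) != tuple(shape):
        xmesh, ymesh = np.meshgrid( np.arange(shape[1]), np.arange(shape[0]) )
        _getMesh.cache = ( xmesh.astype('float32'), ymesh.astype('float32') )
        _getMesh.shape = tuple(shape)
    return _getMesh.cache

tileXmesh, tileYmesh = _getMesh( (2048, 2048) )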

The huge advantage of using a virtual machine like numexpr here is the implementation time and ease of redistribution. The above code functions just fine in a Python interpreter and doesn't need to be compiled, so the implementation time was very short, and because numexpr has an efficient virtual machine we get fast performance without hand-tuning. Numexpr is available in essentially all scientific Python distributions, as is Cython, but numexpr code doesn't need to be compiled by the end-user. This can save a lot of pain on Windows and OSX.

The result is that the texture for each polygon can be extracted in a hurry. This one took 5 ms: