31 December 2016

bloscpickle

Edited 27-01-2017 to add RapidJSON results.

Python has a number of libraries available for serializing data and objects, generally for the purpose of passing them from one process or node to another, or for saving the program state to disk. Serialization for a dynamically-typed language such as Python brings with it some challenges, which typically result in modules having limitations in what they can serialize. My interest is mostly in packaging metadata with microscopy binary data, where one might have a few megabytes of metadata alongside gigabytes of image stacks.

The built-in modules are pickle, marshal, and json. I will also look at two other third-party modules, ujson and msgpack-python. All of them produce either text or binary representations, and all are uncompressed. I thought I would test an implementation wrapping them with the blosc meta-compressor library to compress their outputs before writing to disk, to see what sort of space savings and potential performance enhancements could be had. The code presented herein is available at:

https://github.com/robbmcleod/bloscpickle

It's not intended as a production-ready tool.

Pickle: is Python's most robust serialization tool.  It can manage custom classes and objects, and circular references. It does not duplicate objects found to have multiple references. It outputs binary. It is not compatible with other languages. Pickle received a major speed upgrade with Python 3, which also came with a new file protocol. Pickle is used, for example, in the multiprocessing module to exchange data between processes.  Pickle is often said to be a potential security hazard as it can potentially carry malicious code, which is the disadvantage of its versatility in serializing objects.
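The handling of circular and shared references described above can be checked with a small sketch. The data here is hypothetical and uses only built-in containers; instances of custom classes defined in an importable module should behave the same way.

```python
import pickle

# Two records that point at each other (a circular reference).
a = {"name": "a"}
b = {"name": "b", "peer": a}
a["peer"] = b

# One list referenced from two places (a shared reference).
shared = [1, 2, 3]
payload = {"nodes": [a, b], "first": shared, "second": shared}

blob = pickle.dumps(payload, protocol=pickle.HIGHEST_PROTOCOL)
restored = pickle.loads(blob)

ra, rb = restored["nodes"]
# The cycle survives the round trip.
cycle_ok = ra["peer"] is rb and rb["peer"] is ra
# The shared list is stored once and restored as a single object.
shared_ok = restored["first"] is restored["second"]
print(cycle_ok, shared_ok)
```

Because the shared list is restored as one object, mutating `restored["first"]` also changes `restored["second"]`, just as in the original structure.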

Marshal: is Python's internal file I/O module. It can serialize only Python base types (essentially lists, tuples, and dicts), and will crash if fed a circular reference. It is not even compatible across different versions of Python. It is supposed to be the fastest of the tested standard Python modules.
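A minimal sketch of the base-types restriction: built-in containers round-trip, while an instance of a user-defined class is refused. The `Stage` class and the record contents are made up for illustration.

```python
import marshal

# Built-in types round-trip without trouble.
record = {"id": 7, "tags": ["a", "b"], "shape": (2048, 2048), "scale": 1.5}
blob = marshal.dumps(record)
restored = marshal.loads(blob)

class Stage:
    """Hypothetical user-defined class; marshal has no way to encode it."""
    pass

try:
    marshal.dumps({"stage": Stage()})
    refused = False
except ValueError:
    # marshal raises ValueError for unsupported types.
    refused = True

print(restored == record, refused)
```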

JSON: otherwise known as JavaScript Object Notation, is the most ubiquitous method to serialize objects. It essentially only deals with two constructs: lists and dicts. As such, it requires helper functions to be implemented in order to serialize objects. It's not binary, so it can be human edited (with some difficulty, as it's picky about commas and similar formatting errors). As with pickle, json received a major performance upgrade in Python 3, such that many external implementations of JSON were obsoleted, with a couple of exceptions such as...
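The helper functions mentioned above are usually supplied through the standard library's `default` and `object_hook` arguments. What follows is only a sketch: the `Pixel` class and the `"__pixel__"` tag are assumptions made up for the example, not part of bloscpickle.

```python
import json

class Pixel:
    """Hypothetical metadata object used only for this example."""
    def __init__(self, size, unit):
        self.size = size
        self.unit = unit

def encode_helper(obj):
    # Called by json.dumps for any object it cannot serialize natively.
    if isinstance(obj, Pixel):
        return {"__pixel__": True, "size": obj.size, "unit": obj.unit}
    raise TypeError("Cannot serialize {}".format(type(obj).__name__))

def decode_helper(d):
    # Called by json.loads for every decoded JSON object (dict).
    if d.get("__pixel__"):
        return Pixel(d["size"], d["unit"])
    return d

text = json.dumps({"pixel": Pixel(1.3, "nm")}, default=encode_helper)
restored = json.loads(text, object_hook=decode_helper)
print(text)
print(type(restored["pixel"]).__name__, restored["pixel"].size)
```

Tagging the dict keeps the file readable from other languages, at the cost of writing one pair of helpers per class.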

UltraJSON: developed by an Electronic Arts studio (but released under a BSD license), UltraJSON is just like JSON, but faster. One drawback of ujson versus the default library is that it can fail silently.

RapidJSON: Another fast JSON parser built on top of a C-library.  Here I use Hideo Hattori's wrapper which is the more complete of the two Python wrappers.

Message Pack: is billed as a binary equivalent of JSON. At first glance it was very intriguing, as it offers a significant encoding rate and encoded size advantage over JSON, and it has implementations for basically every programming language in use today. However, I ran into problems in testing. By default, it converts Unicode strings to byte strings, which can cause a loss of information. When encoding as Unicode it loses its speed advantage over the faster JSONs. Furthermore, while msgpack will serialize objects, it doesn't serialize all their attributes, and so it fails silently.

Blosc: isn't a serializer, it's a compressor, or more properly a meta-compressor. Blosc wraps a variety of compression codecs with a thread pool, to provide very high performance. The two best compressors in blosc in my experience are lz4, which is ultra-fast but offers middling compression (and is the new standard codec for the ZFS file system), and Zstandard, which achieves better compression ratios than zlib/gzip and is still very fast. Zstandard is new as of 2015 and essentially offers something for nothing compared to older compression algorithms. It's usually within spitting distance (a few %) of BZIP2 for compression ratio and far faster, being heavily optimized for parallel computing. In testing on Xeon machines I've achieved compression rates of about 12 GB/s with lz4 and 5 GB/s with zstd. That is gigabytes per second. Blosc also has a filter stage, which at present is byte- or bit-shuffling. I've found bit-shuffling to be effective when compressing floating-point or otherwise dynamic-range limited data. It would probably be extremely helpful for DNA or protein sequences, for example. Here I did not find shuffling to be effective. Throughout this post I use compression level 1 for zstd and compression level 9 for lz4. Lz4 does not really slow down at all with compression level, and zstd saturates much earlier than zlib (there's rarely an advantage to going past 4).
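The wrapping idea can be sketched roughly as follows. This is not the code from the bloscpickle repository; the function names `dumps_compressed` and `loads_compressed` and the sample metadata are hypothetical. It assumes the python-blosc package and falls back to zlib as a stand-in when blosc is not installed, so the numbers from the fallback say nothing about blosc itself.

```python
import pickle
import zlib

try:
    import blosc  # third-party python-blosc package (assumed installed)

    def compress(raw, cname="zstd", clevel=1):
        # typesize=1 and no shuffle: serializer output is a plain byte stream.
        return blosc.compress(raw, typesize=1, clevel=clevel,
                              shuffle=blosc.NOSHUFFLE, cname=cname)

    decompress = blosc.decompress
except ImportError:
    # Stand-in so the sketch still runs without blosc.
    def compress(raw, cname=None, clevel=1):
        return zlib.compress(raw, clevel)

    decompress = zlib.decompress

def dumps_compressed(obj):
    """Serialize with pickle, then compress the resulting bytes."""
    return compress(pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL))

def loads_compressed(blob):
    """Decompress, then unpickle."""
    return pickle.loads(decompress(blob))

# Hypothetical, highly repetitive metadata.
metadata = {"frame_{:04d}".format(i): {"dose": 1.5, "defocus": -2.0}
            for i in range(1000)}
blob = dumps_compressed(metadata)
print(len(pickle.dumps(metadata)), "->", len(blob), "bytes")
```

Swapping `pickle.dumps`/`pickle.loads` for another serializer's functions is the only change needed to wrap marshal or a JSON library the same way, provided the serializer's output is encoded to bytes first.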

All tests were performed with Python 3.5 on an Intel Core i5-3570K (3.4 GHz), running 4 threads for blosc, and a Western Digital 3 TB 'Red' drive formatted for NTFS. Ideally one would perform this test on a Linux system with disk cache flushing between each test. I would expect some additional performance from Python 3.6, in particular because we are using dicts here, but I use a lot of libraries so it will be some time before I move to it myself.

High-entropy Benchmark: 256k UUID4 keys

I tested the serialization tools on 256k UUID4 keys with singleton values. This is a fairly challenging data set for compression because there's quite a high degree of randomness inherent in what are supposed to be unique identifiers.
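A data set of this kind can be generated and timed along these lines. This is a sketch rather than the benchmark script itself: it assumes integer values, measures in-memory serialization only (no disk I/O), and uses 4096 keys instead of 256k to keep it quick. The helper names are hypothetical.

```python
import json
import marshal
import pickle
import time
import uuid

def make_uuid_dict(n):
    """Build a dict of n random UUID4 string keys with integer values."""
    return {str(uuid.uuid4()): i for i in range(n)}

def time_dumps(name, dumps, data):
    """Return (name, serialized size in bytes, elapsed seconds)."""
    start = time.perf_counter()
    blob = dumps(data)
    elapsed = time.perf_counter() - start
    return name, len(blob), elapsed

data = make_uuid_dict(2 ** 12)  # the post used 256k keys
results = [
    time_dumps("pickle",
               lambda d: pickle.dumps(d, protocol=pickle.HIGHEST_PROTOCOL),
               data),
    time_dumps("marshal", marshal.dumps, data),
    time_dumps("json", lambda d: json.dumps(d).encode("utf-8"), data),
]
for name, size, elapsed in results:
    print("{:8s} {:8d} bytes {:.4f} s".format(name, size, elapsed))
```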
Figure 1: Write times for 256k UUID4 keys.
Figure 2: Read times for 256k UUID4 keys.
Figure 3: Size of the data on disk. The difficulty of this data set is evident in that lz4 compression achieved little. However, zstd shines here, cutting the data in half.

Overall, for pickle, using zstd compression yields about a 25 % write time penalty, but this is nearly negated by a corresponding reduction in read time. Since the data is small, I expect 'writing' is just depositing the file in the hard drive cache.

The increased performance of blosc-JSON compared to pure JSON is somewhat paradoxical. It is not due to the JSON serialization itself, but to the poor performance of Python in reading and writing Unicode streams to disk. If you encode the JSON output as UTF-8 and write it as bytes, it's much faster. I left the benchmarks as is, because it's something to keep in mind. Similarly, marshal seems to be faster at reading when it is passed a bytes object instead of a stream.
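The two ways of writing JSON compared above look roughly like this. The sketch only shows the two code paths and confirms they produce the same file; it makes no timing claim, and the file names and data are hypothetical.

```python
import json
import os
import tempfile

data = {"key_{}".format(i): i for i in range(1000)}
tmpdir = tempfile.mkdtemp()
text_path = os.path.join(tmpdir, "text.json")
bytes_path = os.path.join(tmpdir, "bytes.json")

# Path 1: let json write into a text (Unicode) stream.
with open(text_path, "w", encoding="utf-8") as fh:
    json.dump(data, fh)

# Path 2: serialize to a str, encode once, and write raw bytes.
with open(bytes_path, "wb") as fh:
    fh.write(json.dumps(data).encode("utf-8"))

# Reading back through the bytes path.
with open(bytes_path, "rb") as fh:
    restored = json.loads(fh.read().decode("utf-8"))

with open(text_path, "rb") as fh:
    text_file_bytes = fh.read()
with open(bytes_path, "rb") as fh:
    bytes_file_bytes = fh.read()
print(restored == data, text_file_bytes == bytes_file_bytes)
```

The bytes path is also the one a compressor needs, since blosc operates on bytes objects rather than text streams.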

Message Pack looks on the surface to offer impressive performance, but as mentioned above the Python implementation often omits important information from objects. If I worked on an enterprise-level project, I might dig more deeply into why and when it fails, but I don't, so I won't. Also as

Low-entropy Benchmark: JSON Generator

Here I generated 10'000 entries of pseudo-phone book entries, with the handy JSON Generator, which corresponds to about 25 MB of JSON data. This data has a lot more repeated elements, in particular a lot of non-unique keys, that should improve relative performance of the compression tools.


Figure 4: Write times for ten-thousand procedurally generated phonebook-like JSON entries.

Figure 5: Read times for ten-thousand procedurally generated phonebook-like JSON entries.
Figure 6: Size of the data on disk. Here the data is significantly less random, so the compression tools, and especially lz4, perform better than with the high-entropy data set. The blocksize was 64 kB.

Overall lz4 reduces the disk usage by about 40-50 % and Zstandard shaves another 10 % off of that. If you are consistently dealing with larger data chunks, the blocksize could be increased. Typically blosc is fastest when the block fits into L2 cache, but compression ratio usually increases up to about 1 MB blocks before saturating.

Here both UltraJSON and Message Pack silently failed the read/write assert test. The ujson error appears to be related to minimum precision in reading floats, and for Message Pack the problem was that it was converting Unicode to bytes.
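A read/write assert test of this sort amounts to checking that what comes back equals what went in. The sketch below uses the standard json module, since it is always available; the `round_trips` helper and both sample dicts are hypothetical. It shows how a serializer can return without error and still change the data.

```python
import json

def round_trips(obj, dumps, loads):
    """Return True if obj survives serialization unchanged."""
    return loads(dumps(obj)) == obj

# Strings, floats, and lists survive the standard json module intact.
clean = {"name": "Ada", "balance": 1234.5678901234567, "tags": ["a", "b"]}

# Integer keys come back as strings and tuples come back as lists,
# with no exception raised along the way.
lossy = {1: "integer key", "point": (1, 2)}

clean_ok = round_trips(clean, json.dumps, json.loads)
lossy_ok = round_trips(lossy, json.dumps, json.loads)
print(clean_ok, lossy_ok)
```

Passing another library's `dumps` and `loads` into the same helper gives a quick check for the kind of silent failure described above.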

Conclusions

Overall, on fairly difficult data blosc reduces file size by about 50 % in return for a 25 % write speed penalty. However, the read time is accelerated, such that the net time spent on file I/O is more or less a push. On more highly compressible data (e.g. DNA base pairs, protein sequences) and in particular data large enough to swamp the hard disks' cache (typically 0.5-1.0 GB), one would expect to see blosc + serialization be faster than just pure serialization.

Only pickle offers out-of-the-box functionality for serializing objects. If you want to serialize with JSON, to maintain cross-language compatibility, then you'll need to implement helper methods yourself. UltraJSON looks great on the surface but I did manage to break it, so I wouldn't consider it an out-of-the-box robust solution. The failure could be as simple as Python's boolean True mapping to 'True' on disk and back to the string 'True' after read. Still, it beats pickle in speed. Another potential JSON library that has a Python wrapper to examine is RapidJSON, which has two implementations, python-rapidjson and pyrapidjson.

One aspect I wanted to look at was monkey-patching the multiprocessing module to use bloscpickle instead of pickle. However, pickle is not exposed, so one would have to patch the reduction.py file in the module.

One disadvantage of blosc at present is that it does not have a streaming interface, i.e. it deals with bytes objects. This means it will store an extra (compressed) copy of the data in memory, relative to vanilla pickling. It also used to hold onto the GIL, although that has been patched out and should go live with the next release.

3 comments:

Francesc said...

Very interesting set of benchmarks. I'd just let you know that Blosc does offer a streaming interface in Python via bloscpack (https://pypi.python.org/pypi/bloscpack). Also, C-Blosc2 (https://github.com/Blosc/c-blosc2) is meant to implement a streaming interface too (the development of Blosc2 has admittedly been slow for a while, but that should change soon).

Robert McLeod said...

Hey Francesc, for me this case was a two-horse race between JSON and Message Pack, as they have C-implementations. I've used bloscpack in the past, but I didn't want to be restricted to Python. It's nice from the point of view that it provides chunking support, and it can serialize with JSON, but it would have complicated things by wrapping numpy arrays in a different style than I want for my C-implementation. I added pickle and marshal into the mix because it was trivially easy to do so.

I looked briefly into the CPython streams implementation, and that looks to be a tricky piece of code to wrap in C99 for blosc2. It's definitely not written in the blosc style of a C-library with a thin Python wrapper, so good luck.

Francesc said...

Well, bloscpack is just implemented in Python, but provided that its format is documented, nothing prevents you from adding a tool to create/read bloscpack files from other languages. It is more work to do, yes.

OTOH, I don't completely get your comment on CPython streams for wrapping Blosc2. Blosc2 has its own 'stream' interface via the `blosc2_append_buffer()` and `blosc2_packed_append_buffer()` which in principle are not designed to play with the Python stream interface.