Wrapping C-Blosc2 in Python (a beginner's view)

An initial release of the Python wrapper for C-Blosc2 is now available at https://github.com/Blosc/python-blosc2. In this blog post I will try to explain some of the most difficult things I had to learn while writing it, and how I solved them.

This work is being done thanks to a grant from the Python Software Foundation.

Python views

At university, the first programming language that I learned was Python. But because programming was new for the majority of the class, the subject only covered the basics: basic statements and classes. And although these were easy to understand, views were unknown to me (until now).

To explain what views are, let's suppose we have the following code in Python:

>>> import sys
>>> a = []
>>> b = a
>>> sys.getrefcount(a)
3

The reference count for the object is 3: a, b and the argument passed to sys.getrefcount().

Basically, to avoid making copies of the same object, Python uses views: several names can refer to one object. Every object has a reference counter, and the object is not deleted until the counter drops to 0. But that means that two threads must not update the counter at the same time. Because having a lock for every object would be inefficient and could produce deadlocks (which means that several threads are waiting for each other), the GIL was created. So the GIL was my next thing to learn.
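As a small illustration of the paragraph above, the following sketch (for CPython) shows that binding a second name to a list does not copy it, it only adds one more reference. The absolute numbers returned by sys.getrefcount() are an implementation detail, so only the difference is looked at here.

```python
import sys

a = []
before = sys.getrefcount(a)

b = a                      # no copy: b is a second name for the same list
after = sys.getrefcount(a)

b.append(1)                # a change made through b ...
same_object = a is b       # ... is visible through a: both names share one object

c = list(a)                # an explicit copy creates a new, independent object
c.append(2)

print(after - before)      # the extra reference held by b
print(same_object, a, c)
```

Only the explicit list(a) call allocates a second list; the assignment b = a never does.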

GIL and Cython

GIL stands for Global Interpreter Lock. With a single lock on the interpreter there are no deadlocks. But the execution of any Python program must acquire the interpreter lock, which prevents some programs from taking advantage of multi-threaded execution.

When writing C extensions, this lock is very useful because it can be released. Thus, the program can be more efficient (i.e. threads can actually run in parallel). To write a function that releases the GIL, I spent a lot of time reading about it. Unfortunately, nothing seemed to explain what I wanted to do until I found this nice blog post from Nicolas Hug, in which he explains the 3 rules you have to follow to make Cython release the GIL.

First of all, Cython needs to know which C functions that were imported are thread-safe. This is done by using the nogil statement in the function declaration. Then, inside the function the with nogil statement lets Cython know that this block is going to be executed with the GIL released. But to make that code block safe, there cannot be any Python interaction inside that block.

To understand it better, an example is shown below:

cdef extern from "math_operation.h":
    # Declared nogil: this C function is safe to call without the GIL
    int add(int a, int b) nogil

cpdef sum(src, dest):
    # Python interaction (len) happens before releasing the GIL
    cdef int len_src = len(src)
    cdef int len_dest = len(dest)
    cdef int result
    with nogil:
        # Code with the GIL released
        result = add(len_src, len_dest)
    # Code with the GIL, any Python interaction can be done here
    return result

The function sum returns the result of adding the lengths of src and dest. As you can see, the function has been defined with the cpdef statement instead of def. The c lets Cython know that this function can be called from C. So this is necessary when writing a function with the GIL released; otherwise you will be trying to execute a Python program without the GIL (which, as explained previously, cannot be done). Notice that len_src and len_dest have also been defined as C integers with the cdef int statement. If not, it would not be possible to work with them with the GIL released (inside the with nogil block).

On the other hand, the p lets Cython know that this function can be called through Python. This does not have to be done always, only when you want to call that function from Python.
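The effect of an extension releasing the GIL can also be observed from pure Python. The sketch below does not use Cython; it relies on the standard-library zlib module, whose C implementation in CPython releases the GIL while compressing, so several threads can compress at the same time. The buffer sizes and the number of threads are arbitrary choices for the illustration.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

# Four independent buffers (sizes chosen arbitrarily for the illustration)
buffers = [bytes([i]) * 1_000_000 for i in range(4)]

def compress(buf):
    # zlib.compress() runs in C and releases the GIL while it works
    return zlib.compress(buf)

# Sequential reference result
expected = [compress(buf) for buf in buffers]

# Same work spread over threads; map() preserves the input order
with ThreadPoolExecutor(max_workers=4) as pool:
    threaded = list(pool.map(compress, buffers))

print(threaded == expected)
```

The results are identical either way; what changes is that the threaded version is free to use several cores, because no Python objects are touched while the C code is running.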

Cython typed memoryviews

One of the main differences between the python-blosc and python-blosc2 APIs is that the functions compress_ptr and decompress_ptr are no longer supported. We decided to drop them because Pickle protocol 5 already optimizes the copies. That way, we could get performance similar to compress_ptr and decompress_ptr, but with the functions pack and unpack.

However, when timing the functions I realised that in the majority of the cases, although the compress function from python-blosc2 was faster than the compress_ptr, the decompress function was slower than the decompress_ptr. Thus I checked the code to see if the speed could somehow be increased.

Originally, the code used the Python Buffer Protocol, which is part of the Python/C API. The Python Buffer Protocol lets you (among other things) obtain a pointer to the raw data of an object. But because it wasn't clear to me whether it needed to make a copy or not, we decided to work with Cython typed memoryviews.

Cython typed memoryviews are very similar to Python memory views, but with the main difference that the first ones are a C-level type and therefore they do not have much Python overhead. Because it is a C-level type you have to know the dimension of the buffer from which you want to obtain the typed memoryview as well as its data type.

The number of dimensions of the buffer is expressed by writing as many : between brackets as dimensions it has, separated by commas. If the memory is allocated contiguously, you can write ::1 instead in the corresponding dimension. On the other hand, the type is expressed as you would do it in Cython. In the following code, you can see an example for a one-dimensional numpy array:

import numpy as np
arr = np.ones((10**6,), dtype=np.double)
cdef double [:] typed_view = arr

However, if you want to define a function that receives an object whose type may be unknown, you will have to create a Python memoryview and then cast it into the type you wish as in the next example:

# Get a Python memoryview from an object
mem_view = memoryview(object)
# Cast that memory view into an unsigned char memoryview
cdef unsigned char[:] typed_view = mem_view.cast('B')

The 'B' indicates to cast the memoryview type into an unsigned char.

But if I run the latter code for a binary Python string, it produces a runtime error. It took me 10 minutes to fix the error by adding the const qualifier to the definition of the Cython typed memoryview (as shown below), but I spent two days trying to understand the error and its solution.

# Get a Python memoryview from an object
mem_view = memoryview(object)
# Cast that memory view into an unsigned char memoryview
cdef const unsigned char[:] typed_view = mem_view.cast('B')

The reason why the const qualifier fixed it is that a binary Python string is a read-only buffer. By declaring the typed memoryview as const, Cython is being told that the object behind the memory view is a read-only buffer, so it must not modify it.
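The read-only behaviour can be checked from plain Python, without Cython, using the built-in memoryview. This is only a sketch of the buffer semantics involved; it does not reproduce Cython's own error message.

```python
# bytes exposes a read-only buffer, bytearray a writable one
ro_view = memoryview(b"blosc").cast("B")
rw_view = memoryview(bytearray(b"blosc")).cast("B")

print(ro_view.readonly, rw_view.readonly)

# 'B' means unsigned char: one byte per item
print(ro_view.format, ro_view.itemsize)

rw_view[0] = ord("B")          # allowed: the underlying bytearray is modified
try:
    ro_view[0] = ord("B")      # rejected: the underlying bytes object is immutable
    write_failed = False
except TypeError:
    write_failed = True

print(write_failed, bytes(rw_view))
```

A non-const typed memoryview asks for a writable buffer, which a bytes object cannot provide; the const declaration asks only for read access.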

Conclusions

So far, my experience wrapping C-Blosc2 has had some ups and downs.

One method that I use whenever I learn something new is to write down a summary of what I read. Sometimes it is almost a copy (therefore some people may find it useless), but it always works really well for me. It helps me connect the ideas better or build a global idea of what I have done or want to do.

Another thing I realized when writing this wrapper is that, because I am a stubborn person, I tend to force myself to try to understand something and get frustrated if I do not. However, I have to recognize that sometimes it is better to forget about it until the next day. Your brain will organize your ideas at night so that you can invest your time better the next morning.

But maybe the most difficult part for me was the beginning, and therefore I have to thank Francesc Alted and Aleix Alcacer for giving me a push into the not always easy world of Python extensions.

C-Blosc2 Ready for General Review

On behalf of the Blosc team, we are happy to announce the first C-Blosc2 release (Release Candidate 1) that is meant to be reviewed by users. As of now we are declaring both the API and the format frozen, and we are seeking feedback from the community so as to better check the library and declare it fit for use in production.

Some history

The next generation Blosc (aka Blosc2) started back in 2015 as a way to overcome some limitations of the Blosc compressor, mainly the limitation of 2 GB for the size of data to be compressed. But it turned out that I wanted to make things a bit more complete, and provide a native serialization too. During that process Google awarded my contributions to Blosc with the Open Source Peer Bonus Program in 2017. This award represented a big emotional push for me to persist in the efforts towards producing a stable release.

Back in 2018, Zeeman Wang from Huawei invited me to go to their central headquarters in Shenzhen to meet a series of developers who were trying to use compression in a number of scenarios. Over two weeks we had a series of productive meetings, and I became aware of the many possibilities that compression is opening up in industry: from making phones with limited hardware work faster to accelerating computations on high-end computers. That was also a great opportunity for me to get to know a millennia-old culture better; I was genuinely interested to see how people live, eat and socialize in China.

In 2020, Huawei graciously offered a grant to the Blosc project to complete the project. Since then, we have received donations from several other sources (NumFOCUS, Python Software Foundation and ESRF among them). Lately ironArray is sponsoring two of us (Aleix Alcacer and myself) to work part-time on Blosc related projects.

Thanks to all this support, the Blosc development team has been able to grow quite a lot (we are currently 5 people in the core team) and we have been able to work hard at producing a series of improvements in different projects under the Blosc umbrella, in particular C-Blosc2, Python-Blosc2, Caterva and cat4py.

As you see, there is a lot of development going on around C-Blosc2 other than C-Blosc2 itself. In this installment I am going to focus just on the main features that C-Blosc2 is bringing, but hopefully all the other projects in the ecosystem will also complement its existing functionality. When all these projects are ready, we hope that users will be able to use them to store big amounts of data in a way that is efficient, easy to use and, most importantly, adapted to their needs.

New features of C-Blosc2

Here is the list of the main features that we are releasing today:

  • 64-bit containers: the first-class container in C-Blosc2 is the super-chunk or, for brevity, schunk, that is made of smaller chunks, which are essentially C-Blosc1 32-bit containers. The super-chunk may or may not be backed by another container, called a frame (see later).

  • More filters: besides shuffle and bitshuffle already present in C-Blosc1, C-Blosc2 already implements:

    • delta: the stored blocks inside a chunk are diff'ed with respect to the first block in the chunk. The idea is that, in some situations, the diff will have more zeros than the original data, leading to better compression.

    • trunc_prec: it zeroes the least significant bits of the mantissa of float32 and float64 types. When combined with the shuffle or bitshuffle filter, this leads to more contiguous zeros, which are compressed better.

  • A filter pipeline: the different filters can be pipelined so that the output of one can be the input for the next. A possible example is a delta followed by shuffle, or as described above, trunc_prec followed by bitshuffle.

  • Prefilters: allow applying user-defined C callbacks prior to the filter pipeline during compression. See test_prefilter.c for an example of use.

  • Postfilters: allow applying user-defined C callbacks after the filter pipeline during decompression. The combination of prefilters and postfilters could be interesting for supporting e.g. encryption (via prefilters) and decryption (via postfilters). Also, a postfilter alone can be used to produce on-the-fly computations based on existing data (or other metadata, like e.g. coordinates). See test_postfilter.c for an example of use.

  • SIMD support for ARM (NEON): this allows for faster operation on ARM architectures. Only shuffle is supported right now, but the idea is to implement bitshuffle for NEON too. Thanks to Lucian Marc.

  • SIMD support for PowerPC (ALTIVEC): this allows for faster operation on PowerPC architectures. Both shuffle and bitshuffle are supported; however, this has been done via a transparent mapping from SSE2 into ALTIVEC emulation in GCC 8, so performance could be better (but still, it is already a nice improvement over native C code; see PR https://github.com/Blosc/c-blosc2/pull/59 for details). Thanks to Jerome Kieffer and ESRF for sponsoring the Blosc team in helping him in this task.

  • Dictionaries: when a block is going to be compressed, C-Blosc2 can use a previously made dictionary (stored in the header of the super-chunk) for compressing all the blocks that are part of the chunks. This usually improves the compression ratio, as well as the decompression speed, at the expense of a (small) overhead in compression speed. Currently, it is only supported in the zstd codec, but it would be nice to extend it to lz4 and blosclz at least.

  • Contiguous frames: allow storing super-chunks contiguously, either on-disk or in-memory. When a super-chunk is backed by a frame, instead of storing all the chunks sparsely in-memory, they are serialized inside the frame container. The frame can be stored on-disk too, meaning that persistence of super-chunks is supported.

  • Sparse frames (on-disk): each chunk in a super-chunk is stored in a separate file, as well as the metadata. This is the counterpart of the in-memory super-chunk, and allows for more efficient updates than in frames (i.e. avoiding 'holes' in monolithic files).

  • Partial chunk reads: there is support for reading just part of a chunk, thus avoiding reading the whole chunk and then discarding the unnecessary data.

  • Parallel chunk reads: when several blocks of a chunk are to be read, this is done in parallel by the decompressing machinery. That means that every thread is responsible for reading, post-filtering and decompressing a block by itself, leading to an efficient overlap of I/O and CPU usage that optimizes reads to a maximum.

  • Meta-layers: optionally, the user can add meta-data for different uses and in different layers. For example, one may think of providing a meta-layer for NumPy so that most of the meta-data for it is stored in a meta-layer; then, one can place another meta-layer on top of the latter for adding more high-level info if desired (e.g. geo-spatial, meteorological...).

  • Variable length meta-layers: the user may want to add variable-length meta information that can be potentially very large (up to 2 GB). The regular meta-layer described above is very quick to read, but meant to store fixed-length and relatively small meta information. Variable length metalayers are stored in the trailer of a frame, whereas regular meta-layers are in the header.

  • Efficient support for special values: large sequences of repeated values can be represented with an efficient, simple and fast run-length representation, without the need to use regular codecs. With that, chunks or super-chunks with values that are the same (zeros, NaNs or any value in general) can be built in constant time, regardless of the size. This can be useful in situations where a lot of zeros (or NaNs) need to be stored (e.g. sparse matrices).

  • Nice markup for documentation: we are currently using a combination of Sphinx + Doxygen + Breathe for documenting the C-API. See https://c-blosc2.readthedocs.io. Thanks to Alberto Sabater and Aleix Alcacer for contributing the support for this.

  • Plugin capabilities for filters and codecs: we have a plugin register capability in place so that the info about the new filters and codecs can be persisted and transmitted to different machines. Thanks to the NumFOCUS foundation for providing a grant for doing this.

  • Pluggable tuning capabilities: this will allow users with different needs to define an interface so as to better tune different parameters like the codec, the compression level, the filters to use, the blocksize or the shuffle size. Thanks to ironArray for sponsoring us in doing this.

  • Support for I/O plugins: so that users can extend the I/O capabilities beyond the current filesystem support. Things like using databases or S3 interfaces should be possible by implementing these interfaces. Thanks to ironArray for sponsoring us in doing this.

  • Python wrapper: we have a preliminary wrapper in the works. You can have a look at our ongoing efforts in the python-blosc2 repo. Thanks to the Python Software Foundation for providing a grant for doing this.

  • Security: we are actively using OSS-Fuzz and ClusterFuzz for uncovering programming errors in C-Blosc2. Thanks to Google for sponsoring us in doing this.
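To make two of the filter ideas from the list above more concrete, here is a pure-Python sketch of a byte shuffle and of a trunc_prec-like truncation. It only illustrates the principle and is not C-Blosc2's implementation: the helper names (shuffle, unshuffle, trunc_prec), the use of zlib as the codec and the test data are all assumptions made for this example.

```python
import struct
import zlib

def shuffle(data: bytes, typesize: int) -> bytes:
    """Group the 1st bytes of all items, then the 2nd bytes, and so on."""
    return b"".join(data[i::typesize] for i in range(typesize))

def unshuffle(data: bytes, typesize: int) -> bytes:
    """Inverse of shuffle(): interleave the byte planes again."""
    nitems = len(data) // typesize
    out = bytearray(len(data))
    for i in range(typesize):
        out[i::typesize] = data[i * nitems:(i + 1) * nitems]
    return bytes(out)

def trunc_prec(value: float, keep_bits: int) -> float:
    """Zero all but the keep_bits most significant mantissa bits of a float64."""
    (bits,) = struct.unpack("<Q", struct.pack("<d", value))
    drop = 52 - keep_bits                  # a float64 has a 52-bit mantissa
    bits &= ~((1 << drop) - 1)             # clear the least significant bits
    (result,) = struct.unpack("<d", struct.pack("<Q", bits))
    return result

# 10000 consecutive 32-bit little-endian integers
raw = struct.pack("<10000i", *range(10000))
shuffled = shuffle(raw, 4)

# Compare how well a general-purpose codec does with and without the filter
plain_size = len(zlib.compress(raw))
shuffled_size = len(zlib.compress(shuffled))
print(plain_size, shuffled_size)

# Truncation is lossy: the value changes slightly, but gains trailing zero bits
print(trunc_prec(3.141592653589793, 10))
```

After shuffling, the mostly-zero high bytes of the integers sit next to each other, which is what lets the codec find long runs; truncating the mantissa produces the same kind of zero runs for floating point data.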

As you see, the list is long and hopefully you will find compelling enough features for your own needs. Blosc2 is not only about speed, but also about providing

Tasks to be done

Even if the list of features above is long, we still have things to do in Blosc2; and the plan is to continue the development, although always respecting the existing API and format. Here are some of the things in our TODO list:

  • Centralized plugin repository: we have got a grant from NumFOCUS for implementing a centralized repository so that people can send their plugins (using the existing machinery) to the Blosc2 team. If the plugins fulfill a series of requirements, they will be officially accepted, and distributed within the library.

  • Improve the safety of the library: although this is always a work in progress, we have come a long way in improving our safety, mainly thanks to the efforts of Nathan Moinvaziri.

  • Support for lossy compression codecs: although we already support the trunc_prec filter, this is only valid for floating point data; we should come up with lossy codecs that are meant for any data type.

  • Checksums: the frame can benefit from having a checksum for every chunk/index/metalayer. This will provide more safety against frames that are damaged for whatever reason. Also, this would provide better feedback when trying to determine the parts of the frame that are corrupted. Candidates for checksums can be xxhash32 or xxhash64, depending on the goals (to be decided).

  • Documentation: utterly important for attracting new users and making life easier for existing ones. Important points to have in mind here:

    • Quality of API docstrings: is the mission of the functions or data structures clearly and succinctly explained? Are all the parameters explained? Is the return value explained? What are the possible errors that can be returned?

    • Tutorials/book: besides the API docstrings, more documentation materials should be provided, like tutorials or a book about Blosc (or at least, the beginnings of it). Due to its adoption in GitHub and Jupyter notebooks, one of the most widespread and useful markup systems is Markdown, so this should also be the first candidate to use here.

  • Lock support for super-chunks: when different processes access super-chunks concurrently, make them sync properly by using locks, either on-disk (frame-backed super-chunks) or in-memory. Such lock support would be configured at build time, so it could be disabled with a cmake flag.

It would be nice if, in case some of these features (or a new one) sound useful to you, you could help us by providing either code or sponsorship.

Summary

Since 2015, it has been a long journey to get C-Blosc2 this full-featured and tested. But hopefully the journey will continue because, as Kavafis said:

As you set out for Ithaka
hope your road is a long one,
full of adventure, full of discovery.

Let me thank again all the people and sponsors that we have had during the life of the Blosc project; without them we would not be where we are now. We do hope that C-Blosc2 will have a long life, and we as a team will put our soul into making that trip last as long as possible.

Now it is your turn. We expect you to start testing the library as much as possible and report back. With your help we can get C-Blosc2 to production stage, hopefully very soon. Thanks in advance!

Blosc metalayers, where the user metainformation is stored

The C-Blosc2 library has two different spaces to store user-defined information. In this post, we are going to describe what these spaces are and where they are stored inside a Blosc2 frame (a persistent super-chunk).

As its name suggests, a metalayer is a space that allows users to store custom information. For example, Caterva, a project based on C-Blosc2 that handles compressed and chunked arrays, uses these metalayers to store the dimensions and the shape, chunkshape and blockshape of the arrays.

Fixed-length metalayers

The first kind of metalayers in Blosc2 are the fixed-length metalayers. These metalayers are stored in the header of the frame. This decision allows adding chunks to the frame without the need to rewrite the whole meta information and data coming after it.

But this implementation has some drawbacks. The most important one is that fixed-length metalayers cannot be resized. Furthermore, once the first chunk of data is added to the super-chunk, no more fixed-length metalayers can be added either.

Let's see the reason for these restrictions with an example. Suppose that we have a frame that stores 10 GB of data with a metalayer containing a "cat". If we update the meta information with a "dog" we can do that because they have exactly the same size.

However, if we were to update the meta information with a "giraffe", the metalayer would need to be resized and therefore we would have to rewrite the 10GB of data plus the trailer. This would obviously be very inefficient and hence, not allowed:

/images/metalayers/metalayers.png

Data that would need to be rewritten are plotted in red.

Variable-length metalayers

To fix the above issue, we have introduced variable-length metalayers. Unlike fixed-length metalayers, these are stored in the trailer section of the frame.

As their name suggests, these metalayers can be resized. Blosc can do that because, whenever the content of a metalayer is modified, Blosc rewrites the trailer completely, using more space if necessary. Furthermore, since these metalayers are stored in the trailer, they will also be rewritten each time a chunk is added.

Another feature of variable-length metalayers is that their content is compressed by default (in contrast to fixed-length metalayers). This minimizes the size of the trailer, which is very important: since the trailer is rewritten every time new data is added, we want to keep it as small as possible so as to reduce the amount of data written.

Let's continue with the previous example, but storing the meta information in a variable-length metalayer now:

/images/metalayers/metalayers-vl.png

In this case the trailer is rewritten each time that we update the metalayer, but it is a much more efficient operation than rewriting all the data (as a fixed-length metalayer would require). So the variable-length metalayers complement the fixed-length metalayers by bringing different capabilities to the table. Depending on her needs, it is up to the user to choose one or another metalayer storage.
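The "cat", "dog" and "giraffe" example can be mimicked with a toy model. The class below is purely illustrative: ToyFrame and its methods are hypothetical names, not part of the C-Blosc2 API, and the real frame format is more involved. It only captures the two rules discussed so far: header metalayers are updated in place and cannot change size, while trailer metalayers can, at the price of rewriting the trailer.

```python
class ToyFrame:
    """Toy model of a frame: header (fixed-length meta) + data + trailer (vl meta)."""

    def __init__(self, meta: bytes, vlmeta: bytes):
        self.header = bytearray(meta)
        self.data = bytearray()
        self.trailer = bytearray(vlmeta)
        self.bytes_rewritten = 0          # bookkeeping for the illustration

    def append_chunk(self, chunk: bytes):
        # New data goes after the existing data, so the trailer is written again
        self.data += chunk
        self.bytes_rewritten += len(self.trailer)

    def update_meta(self, content: bytes):
        # In-place update: only possible if the size does not change
        if len(content) != len(self.header):
            raise ValueError("fixed-length metalayers cannot be resized")
        self.header[:] = content
        self.bytes_rewritten += len(content)

    def update_vlmeta(self, content: bytes):
        # The trailer is rewritten completely; the data section is untouched
        self.trailer = bytearray(content)
        self.bytes_rewritten += len(content)


frame = ToyFrame(meta=b"cat", vlmeta=b"cat")
frame.append_chunk(b"\x00" * 1000)

frame.update_meta(b"dog")                 # same size: fine
try:
    frame.update_meta(b"giraffe")         # different size: refused
except ValueError as exc:
    print(exc)

frame.update_vlmeta(b"giraffe")           # allowed, costs only a trailer rewrite
print(bytes(frame.header), bytes(frame.trailer), frame.bytes_rewritten)
```

In this model the 1000 bytes of data are never rewritten, no matter how often either kind of metalayer is updated.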

Fixed-length vs variable-length metalayers comparison

To summarize, and to better see what kind of metalayer is more suitable for each situation, the following table contains a comparison between fixed-length metalayers and variable-length metalayers:

                                          Fixed-length metalayers    Variable-length metalayers
Where are they stored?                    Header                     Trailer
Can be resized?                           No                         Yes
Can be added after adding chunks?         No                         Yes
Are they rewritten when adding chunks?    No                         Yes

Metalayers API

Currently, C-Blosc2 has the following functions implemented:

  • blosc2_meta_add() / blosc2_vlmeta_add(): Add a new metalayer.

  • blosc2_meta_get() / blosc2_vlmeta_get(): Get the metalayer content.

  • blosc2_meta_exists() / blosc2_vlmeta_exists(): Check if a metalayer exists or not.

  • blosc2_meta_update() / blosc2_vlmeta_update(): Update the metalayer content.

Conclusions

As we have seen, Blosc2 supports two different spaces where users can store their meta information. The user can choose one or another depending on her needs.

On the one hand, the fixed-length metalayers are meant to store user meta information that does not change size over time. They are stored in the header and can be updated without having to rewrite any other part of the frame, but they can no longer be added once the first chunk of data is added.

On the other hand, users whose meta information is going to change in size over time can store it in variable-length metalayers. These are stored in the trailer section of a frame and are more flexible than their fixed-length counterparts. However, each time that the content of a metalayer is updated, the whole trailer has to be rewritten.

Introducing Sparse Frames

Overview

The new sparse frame implementation allows the storage of Blosc2 super-chunk data chunks sparsely on-disk, using the filesystem as a key/value storage. This mimics existing formats like bcolz or Zarr.

For the sparse implementation we are making use of the existing contiguous frame, in order to store the metadata and the index for accessing the different chunks. Here you can see the new sparse format compared with the existing contiguous frame:

/images/sparse-frames/cframe-vs-sframe.png

As can be seen in the image above, the contiguous frame file is made of a header, a chunks section and a trailer. The header contains information needed to decompress the chunks and the trailer contains a user meta data chunk. The chunks section for a contiguous frame is made of all the data chunks plus the index chunk. The latter contains the offset where each chunk begins inside the contiguous frame. All these pieces are stored sequentially, without any empty spaces between them.

However, in a sparse frame the chunks are stored as independent binary files. But there is still the need to store the information to decompress the chunks as well as a place to store the user meta data. All this goes to the chunks.b2frame, which is actually a contiguous frame file with the difference that its chunks section contains only the index chunk. This index chunk stores the ID of each chunk (an integer from 0 to 2^32-1). The name of the chunk file is built by expressing the chunk ID in hexadecimal, padded with zeros (up to 8 characters) and adding the .chunk extension. For example, if the chunk ID is 46 (2E in hexadecimal) the chunk file name would be 0000002E.chunk.
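The naming rule just described fits in a couple of lines. The helper name chunk_filename is hypothetical and only illustrates the rule; it is not a function of the C-Blosc2 API.

```python
def chunk_filename(chunk_id: int) -> str:
    """Build the file name for a chunk ID, following the rule described above."""
    if not 0 <= chunk_id <= 2**32 - 1:
        raise ValueError("chunk ID must fit in 32 bits")
    # 8 hexadecimal digits, upper case, zero-padded, plus the .chunk extension
    return f"{chunk_id:08X}.chunk"

print(chunk_filename(46))        # 0000002E.chunk
```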

Advantages

The big advantage of the sparse frame compared with the contiguous one is that it avoids the empty spaces that can result from updating a chunk.

To better illustrate this, let's imagine that the set of data chunks in a contiguous frame is stored like the tower in the Jenga board game, a tower built with wood blocks. But in contrast to the genuine Jenga board game, not all the blocks have the same size (the uncompressed size of the chunks is the same, but not the compressed one):

/images/sparse-frames/jenga3.png

The figure above shows the initial structure of such a tower. If the yellow piece is updated (replaced by another piece) there are two possibilities. The first one is that the new piece fits into the empty space left where the old piece was. In that case, the new piece is put in the previous space without any problem and we have no empty spaces left. However, if the new piece does not fit into the empty space, the new piece has to be placed at the top of the tower (like in the game), leaving an empty space where the old piece was.

On the other hand, the chunks of a sparse frame can be seen as books on a shelf, where each book is a different chunk:

/images/sparse-frames/bookshelf.png

If one needs to update one book with the new, taller edition, one only has to grab the old edition and replace it with the new one. As there is no limit on the height of the books, the yellow book can be replaced with a larger book without creating empty spaces, making better use of space.
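The Jenga and bookshelf pictures can be turned into a toy simulation. This is a simplified model written for this post, not how C-Blosc2 manages frames internally: the function names and the chunk sizes are made up, and only sizes are tracked, not actual data.

```python
def update_contiguous(layout, total_size, index, new_size):
    """Update one chunk in a toy contiguous frame.

    layout is a list of [offset, size] pairs; returns the new total size.
    """
    offset, old_size = layout[index]
    if new_size <= old_size:
        layout[index] = [offset, new_size]      # the new piece fits in the old slot
        return total_size
    layout[index] = [total_size, new_size]      # placed on top, the old slot becomes a hole
    return total_size + new_size

def update_sparse(chunks, index, new_size):
    """Update one chunk in a toy sparse frame: each chunk is its own file."""
    chunks[index] = new_size

sizes = [30, 50, 40]                            # compressed chunk sizes (arbitrary)

# Contiguous frame: chunks laid out back to back
layout, offset = [], 0
for size in sizes:
    layout.append([offset, size])
    offset += size
total = update_contiguous(layout, offset, 1, 70)    # chunk 1 grows from 50 to 70
used = sum(size for _, size in layout)
print("contiguous:", total, "bytes on disk,", total - used, "wasted")

# Sparse frame: one file per chunk
chunks = dict(enumerate(sizes))
update_sparse(chunks, 1, 70)
print("sparse:", sum(chunks.values()), "bytes on disk")
```

In the contiguous case the old 50-byte slot stays behind as a hole, whereas the sparse layout only ever holds the current version of each chunk.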

Example of use

Creating a sparse frame in C-Blosc2 is easy; just specify the name of the directory where you want to store your chunks and you are done:

blosc2_storage storage = {.urlpath="dir1.b2frame"};
schunk = blosc2_schunk_new(storage);
for (nchunk = 0; nchunk < NCHUNKS; nchunk++) {
    blosc2_schunk_append_buffer(schunk, data, isize);
}

The above will create NCHUNKS chunks in the "dir1.b2frame" directory. After that, you can open and read the frame with:

schunk = blosc2_schunk_open("dir1.b2frame");
for (nchunk = 0; nchunk < NCHUNKS; nchunk++) {
    blosc2_schunk_decompress_chunk(schunk, nchunk, data_dest, isize);
}

Simple and effective.

You can have a look at a more complete example here.

Future work

We think that this implementation opens the door to several interesting possibilities.

For example, by introducing networking code in Blosc2, the chunks could be stored in another machine and accessed remotely. That way, with just the metadata (the contiguous frame) we could access all the data chunks in the sparse frame.

For example, let's suppose that we have a sparse frame with 1 million chunks. The total size of the data chunks from this sparse frame is 10 TB, but the contiguous frame size can be as small as 10 KB. So, by just sending a small object of 10 KB, any worker could access the whole 10 TB of data.

The remote stores could be typical networked key/value databases. The key is the identifier for each element of the database, whereas the value is the information that is associated to each key (similar to a set of unique keys and a set of doors). In this case, the key would be built from the metadata (e.g. a URL) plus the index of the chunk, and the value would be the data chunk itself.
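A minimal sketch of this idea, with a plain dict standing in for the networked key/value database. The key scheme (base URL plus chunk index), the function name chunk_key and the URL itself are assumptions made for the illustration; nothing like this exists in C-Blosc2 yet.

```python
def chunk_key(base_url: str, nchunk: int) -> str:
    """Hypothetical key scheme: metadata (here a URL) plus the chunk index."""
    return f"{base_url}/{nchunk}"

# A plain dict stands in for a networked key/value database
store = {}
base_url = "kv://example/dataset"                # made-up URL for the illustration

for nchunk, chunk in enumerate([b"chunk-0", b"chunk-1", b"chunk-2"]):
    store[chunk_key(base_url, nchunk)] = chunk   # value: the (compressed) data chunk

# A worker that only knows the metadata can fetch any chunk by its index
print(store[chunk_key(base_url, 1)])
```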

This can lead to a whole new range of applications, where data can be spread in the cloud and workers can access it by receiving small serialized buffers (the contiguous frame). This way, arbitrarily large data silos could be created and accessed via the C-Blosc2 library (plus a key/value network store).

Note by Francesc: The implementation of sparse frames has been done by Marta Iborra, who is the main author of this blog too. Marta joined the Blosc team a few months ago as a student, and the whole team is very pleased with the quality of her contribution; we would be thrilled to continue having her among us for the next months (but this requires some budget indeed). If you like where we are headed, please consider making a donation to the Blosc project via the NumFOCUS Foundation: https://blosc.org/pages/donate. Thank you!

Announcing Blosc Wheels

We are happy to announce that wheels for Intel (32 and 64 bits) and all major OSes (Win, Linux, Mac) are being produced on a regular basis for python-blosc. Such wheels also contain development files for the C-Blosc library. If you are interested in knowing more about how to use them, keep reading.

A Python wheel (.whl file) is a ZIP archive used to make the installation of packages easier. The new wheels make installing the Blosc library faster by avoiding compilation, and they are now available at PyPI. See: https://pypi.org/project/blosc/.

Moreover, wheels for Blosc have support for AVX2 runtime detection, so AVX2 will be automatically leveraged if the local host has it. On the other hand, if the host does not have AVX2, SSE2 is used instead, which, even if slower than AVX2, is still faster than regular x86 instructions.

Small intro to wheels

Wheels are an advantageous alternative for distributing Python (but also pure C) packages that contain C (or Cython) source code and hence need a compiler. For those who are not familiar with wheels, here comes a small tutorial on how to create and use them.

First, let's recall the traditional way to build a source distribution:

$ python setup.py sdist

To build a wheel, the process is quite similar:

$ python setup.py bdist_wheel

To install a package via pip (pip decides whether to install from a wheel or to compile from the source package; wheels take priority when available):

$ python -m pip install {package}

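To install a package while forcing the use of the source distribution (--no-binary takes the package name as its argument, so the name appears twice):

$ python -m pip install --no-binary {package} {package}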

To install a package while forcing the use of wheels (--only-binary also takes the package name as its argument):

$ python -m pip install --only-binary {package} {package}

Different types of wheels

There are different kinds of wheels, depending on the goals and the build process:

  • Universal Wheels are wheels that are pure Python (i.e. contain no compiled extensions) and support Python 2 and 3.

  • Pure Python Wheels that are not “universal” are wheels that are pure Python (i.e. contain no compiled extensions), but don’t natively support both Python 2 and 3.

  • Platform Wheels are wheels that are specific to a certain platform like Linux, macOS, or Windows, usually due to containing compiled extensions.

Platform wheels are built on one Linux variant and have no guarantee of working on another. However, the manylinux wheels are accepted by most Linux variants:

  • manylinux1: based on CentOS 5.

  • manylinux2010: based on CentOS 6.

  • manylinux2014: based on CentOS 7.

Specifically, Blosc wheels are platform wheels that support Python 3 (3.7 and up) on Windows, Linux and Mac, for both 32-bit and 64-bit systems.
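The kind of wheel can be read off its file name, which (following PEP 427) has the form {name}-{version}(-{build})?-{python tag}-{abi tag}-{platform tag}.whl. The sketch below classifies a few file names along the lines of the list above. The file names are hypothetical examples, not a listing of actual releases, and the classification rule is a simplification.

```python
def parse_wheel_filename(filename):
    """Split a wheel file name into its dash-separated components."""
    if not filename.endswith(".whl"):
        raise ValueError("not a wheel file name: %r" % filename)
    parts = filename[:-len(".whl")].split("-")
    if len(parts) == 5:
        name, version, python_tag, abi_tag, platform_tag = parts
        build = None
    elif len(parts) == 6:
        name, version, build, python_tag, abi_tag, platform_tag = parts
    else:
        raise ValueError("unexpected wheel file name: %r" % filename)
    return {"name": name, "version": version, "build": build,
            "python": python_tag, "abi": abi_tag, "platform": platform_tag}


def wheel_kind(filename):
    """Classify a wheel as 'platform', 'universal' or 'pure Python'."""
    info = parse_wheel_filename(filename)
    if info["abi"] != "none" or info["platform"] != "any":
        return "platform"
    # Compressed tag sets such as "py2.py3" list several supported Pythons.
    if {"py2", "py3"} <= set(info["python"].split(".")):
        return "universal"
    return "pure Python"


# Hypothetical file names, for illustration only.
examples = [
    "examplepkg-1.0-cp38-cp38-manylinux2010_x86_64.whl",
    "somepkg-1.0-py2.py3-none-any.whl",
    "otherpkg-2.1-py3-none-any.whl",
]
kinds = [wheel_kind(name) for name in examples]
for name, kind in zip(examples, kinds):
    print(kind, "<-", name)
```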

Binaries for C-Blosc libraries are included

Although wheels were meant for Python packages, nothing prevents adding more stuff to them. In particular, we are not only distributing python-blosc binary extensions in our wheels, but also binaries for the C-Blosc library. This way, people willing to use the C-Blosc library can make use of these wheels to install the necessary development files.

First, install the binary wheel from PyPI, with no need to compile anything manually:

$ pip install --only-binary blosc blosc

Now, let's suppose that we want to compile the c-blosc/examples/many_compressors.c on Linux:

First, you have to find out where the wheel was installed. In our case:

$ WHEEL_DIR=/home/soscar/miniconda3
$ export LD_LIBRARY_PATH=$WHEEL_DIR/lib   # note that you need the LD_LIBRARY_PATH env variable

For the actual compilation, you need to pass the include and lib directories:

$ gcc many_compressors.c -I$WHEEL_DIR/include -o many_compressors -L$WHEEL_DIR/lib -lblosc

Finally, run the resulting binary and hopefully you will see something like:

$ ./many_compressors
Blosc version info: 1.20.1 ($Date:: 2020-09-08 #$)
Using 4 threads (previously using 1)
Using blosclz compressor
Compression: 4000000 -> 37816 (105.8x)
Succesful roundtrip!
Using lz4 compressor
Compression: 4000000 -> 37938 (105.4x)
Succesful roundtrip!
Using lz4hc compressor
Compression: 4000000 -> 27165 (147.2x)
Succesful roundtrip!
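
As a quick sanity check, the ratios printed in the output above are simply the original size divided by the compressed size. A minimal sketch in Python, using only the byte counts shown in that output:

```python
# Byte counts copied from the sample output above.
original_size = 4000000
compressed_sizes = {"blosclz": 37816, "lz4": 37938, "lz4hc": 27165}

# Compression ratio = uncompressed bytes / compressed bytes.
ratios = {name: original_size / csize for name, csize in compressed_sizes.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f}x")
```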

For more details, including compiling with binary wheels on platforms other than Linux, see: https://github.com/Blosc/c-blosc/blob/master/COMPILING_WITH_WHEELS.rst.

Final remarks

Producing Python wheels for a project can be somewhat involved for regular users. However, the advantages of binary wheels really make them worth the effort, since they make the installation process easier and faster for users. This is why we are so happy to finally provide wheels that can benefit, not only python-blosc users, but users of the C-Blosc library as well.

Last but not least, a big thank you to the Zarr team, especially to Jeff Hammerbacher, who provided a grant to the Blosc team for making the wheel support official. Hopefully this new development will make life easier for Zarr developers and users (by the way, we are really glad to see Zarr quickly spreading as a data container for big multidimensional data, and Blosc helping on the compression part).

Mid 2020 Progress Report

2020 has been a year in which the Blosc projects have received important donations, totalling $55,000 USD so far. In the present report we list the most important tasks that have been carried out from January 2020 to August 2020. Most of these tasks are related to the fastest-paced projects under development: C-Blosc2 and Caterva (including its cat4py wrapper). Having said that, the Blosc development team has been active in other projects too (C-Blosc, python-blosc), although mainly for maintenance purposes.

Besides, we also list the roadmap for the C-Blosc2, Caterva and cat4py projects that we plan to tackle during the next few months.

C-Blosc2

C-Blosc2 adds new data containers, called superchunks, that are essentially a set of compressed chunks in memory that can be accessed randomly and enlarged during their lifetime. Also, a new frame serialization layer has been added, so that superchunks can be persisted on disk while keeping the same properties as superchunks in memory. Finally, a metalayer capability allows higher-level containers to be created on top of superchunks/frames.

Highlights

  • Maskout functionality. This allows selectively choosing the blocks of a chunk that are going to be decompressed. This paves the way for faster multidimensional slicing in Caterva (see the Caterva section below).

  • Prefilters introduced and declared stable. Prefilters allow the user to pass C functions for performing arbitrary computations on a chunk prior to the filter/codec pipeline. In addition, the C function can even have access to more chunks than just the one that is being compressed. This opens the door to operating on different super-chunks and producing a new one very efficiently. See https://github.com/Blosc/c-blosc2/blob/master/tests/test_prefilter.c for some examples of use.

  • Support for PowerPC/Altivec. We added support for PowerPC SIMD (Altivec/VSX) instructions for faster operation of shuffle and bitshuffle filters. For details, see https://github.com/Blosc/c-blosc2/pull/98.

  • Improvements in compression ratio for LZ4/BloscLZ. New processors are continually increasing the amount of memory in their caches. In recent C-Blosc and C-Blosc2 releases we increased the size of the internal blocks so that the LZ4/BloscLZ codecs have better opportunities for finding duplicates and hence for increasing their compression ratios. Thanks to the larger caches, performance has stayed close to the original, fast speeds. For some benchmarks, see https://blosc.org/posts/beast-release/.

  • New entropy probing method for BloscLZ. BloscLZ is a native codec for Blosc whose mission is to compress synthetic data efficiently. Synthetic data can appear in multiple situations, and having a codec that can compress/decompress it quickly and with high compression ratios is important. The new entropy probing method included in the recent BloscLZ 2.3 (introduced in both C-Blosc and C-Blosc2) allows for even better compression ratios for highly compressible data, while giving up early when blocks are going to be difficult to compress at all. For details, see also https://blosc.org/posts/beast-release/.
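
The maskout idea from the list above can be illustrated with a few lines of Python. This is only a conceptual sketch, not the C-Blosc2 API: zlib stands in for the Blosc codecs, the function names are made up, and the convention that a True entry means "skip this block" is an assumption of the sketch.

```python
import zlib


def compress_blocks(data, blocksize):
    """Compress each block independently so blocks can be decoded selectively."""
    return [zlib.compress(data[i:i + blocksize])
            for i in range(0, len(data), blocksize)]


def decompress_with_maskout(cblocks, maskout):
    """Decompress only the blocks that are not masked out (masked ones give None)."""
    if len(maskout) != len(cblocks):
        raise ValueError("maskout needs one entry per block")
    return [None if masked else zlib.decompress(cblock)
            for cblock, masked in zip(cblocks, maskout)]


data = bytes(range(256)) * 16                      # 4096 bytes of sample data
cblocks = compress_blocks(data, blocksize=1024)    # 4 independent blocks
maskout = [True, False, True, False]               # skip blocks 0 and 2
blocks = decompress_with_maskout(cblocks, maskout)
print([None if b is None else len(b) for b in blocks])
```

When a multidimensional slice only touches some of the blocks in a chunk, skipping the rest saves the corresponding decompression work.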

Roadmap for C-Blosc2

During the next few months, we plan to tackle the following tasks:

  • Postfilters. In the same way that prefilters allow user-defined computations prior to the compression pipeline, postfilters would allow the same after the decompression pipeline. This could be useful, e.g., for creating superchunks out of functions taking simple data as input (for example, a [min, max] range of values).

  • Finalize the frame implementation. Although the frame specification is almost complete (bar small modifications/additions), we still miss some features that are included in the specification, but not implemented yet. An example of this is the fingerprint support at the end of the frames.

  • Chunk insertion. Right now only chunk appends are supported. It should be possible to support chunk insertion in any position, and not only at the end of a superchunk.

  • Security. Although we have already started to improve the safety of the package using tools like OSS-Fuzz, this is always a work in progress, and we plan to keep improving it in the future.

  • Wheels. We would like to deliver wheels on every release soon.

Caterva/cat4py

Caterva is a multidimensional container on top of C-Blosc2 containers. It uses the metalayer capabilities present in superchunks/frames in order to store the multidimensionality information necessary to define arrays of up to 8 dimensions and up to 2^63 elements. Besides being able to create such arrays, Caterva provides functionality to get (multidimensional) slices of the arrays easily and efficiently. cat4py is the Python wrapper for Caterva.

Highlights

  • Multidimensional blocks. Chunks inside superchunk containers are endowed with a multidimensional structure so as to enable efficient slicing. However, in many cases there is a tension between defining large chunks, so as to reduce the amount of indexing needed to find chunks, and defining smaller ones, so as to avoid reading data that falls outside of a slice. In order to reduce this tension, we endowed the blocks inside chunks with a multidimensional structure too, so that the user has two parameters (chunkshape and blockshape) to play with in order to optimize I/O for their use case. For an example of the kind of performance enhancements you can expect, see https://htmlpreview.github.io/?https://github.com/Blosc/cat4py/blob/269270695d7f6e27e6796541709e98e2f67434fd/notebooks/slicing-performance.html.

  • API refactoring. Caterva is a relatively young project, and its API grew organically and hence in a rather disorganized manner. We recognized that and proceeded with a big API refactoring, trying to make the naming scheme of the functions more consistent, as well as providing a minimal set of C structs that allows for a simpler and better API.

  • Improved documentation. A nice API is useless if it is not well documented, so we decided to put a significant amount of effort in creating high-quality documentation and examples so that the user can quickly figure out how to create and access Caterva containers with their own data. Although this is still a work in progress, we are pretty happy with how docs are shaping up. See https://caterva.readthedocs.io/ and https://cat4py.readthedocs.io/.

  • Better Python integration (cat4py). Python, especially thanks to the NumPy project, is a major player in handling multidimensional datasets, so we have greatly improved the integration of cat4py, our Python wrapper for Caterva, with NumPy. In particular, we implemented support for the NumPy array protocol in cat4py containers, as well as an improved NumPy-esque API in the cat4py package.
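
The trade-off between large and small chunks mentioned in the list above can be made concrete by counting how many chunks a slice touches. The following is a minimal sketch, not Caterva's actual algorithm; the function names and the example shapes are assumptions chosen for illustration, and every chunk is assumed to be stored at its full chunkshape. The same reasoning applies one level down, to blocks inside a chunk.

```python
from itertools import product
from math import prod


def chunks_for_slice(shape, chunkshape, start, stop):
    """Return the grid coordinates of every chunk overlapping [start, stop)."""
    ranges = []
    for dim, csize, lo, hi in zip(shape, chunkshape, start, stop):
        if not (0 <= lo < hi <= dim):
            raise ValueError("slice out of bounds")
        first = lo // csize          # first chunk touched along this axis
        last = (hi - 1) // csize     # last chunk touched along this axis
        ranges.append(range(first, last + 1))
    return list(product(*ranges))


def elements_read(shape, chunkshape, start, stop):
    """Elements that must be decompressed to serve the slice."""
    return len(chunks_for_slice(shape, chunkshape, start, stop)) * prod(chunkshape)


shape = (100, 100)
start, stop = (10, 0), (20, 100)     # rows 10..19, all columns: 1000 elements

for chunkshape in [(50, 50), (10, 100), (10, 10)]:
    nchunks = len(chunks_for_slice(shape, chunkshape, start, stop))
    nread = elements_read(shape, chunkshape, start, stop)
    print(chunkshape, "chunks touched:", nchunks, "elements read:", nread)
```

With (50, 50) chunks only two lookups are needed but five times more data than requested is read; with (10, 10) chunks nothing extra is read but ten lookups are needed. A chunk shape aligned with the slice, (10, 100) here, gets both right for this particular access pattern.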

Roadmap for Caterva / cat4py

During the next months, we plan to tackle the following tasks:

  • Append chunks in any order. This will make it easier for the user to create arrays, since they will not be forced to use a row-wise order.

  • Update array elements. With this, users will be able to update their arrays without having to make a copy.

  • Resize array dimensions. This feature will allow Caterva to increase or decrease in size any dimension of the arrays.

  • Wheels. Once Caterva/cat4py reaches beta stage, we plan to deliver wheels on every release.

Final thoughts

We are very grateful to our sponsors in 2020; they allowed us to implement what we think are nice features for the whole Blosc ecosystem. However, although we have made a lot of progress towards making C-Blosc2 and Caterva as featured and stable as possible, we still need to finalize our efforts so that both projects are stable enough to be used in production. Our expectation is to publish a 2.0.0 (final) release of C-Blosc2 by the end of the year, whereas Caterva (and cat4py) should be declared stable during 2021.

Also, we are happy to have enrolled new members in the Blosc crew: Óscar Griñón, who proved to be instrumental in implementing the multidimensional blocks in Caterva, and Nathan Moinvaziri, who is making great strides in making C-Blosc and C-Blosc2 more secure. Thanks guys!

Hopefully 2021 will also be a good year for seeing the Blosc ecosystem evolve. If you are interested in what we are building and want to help, we are open to any kind of contribution, including donations. Thank you for your interest!

C-Blosc Beast Release

TL;DR: The improvements in new CPUs allow for more cores and (much) larger caches. The latest C-Blosc release leverages these facts to allow better compression ratios, while keeping the speed on par with previous releases.

During the past two months we have been working hard at increasing the efficiency of Blosc for the new processors that are coming with more cores than ever before (8 can be considered quite normal, even for laptops, and 16 is not that unusual for rigs). Furthermore, their caches are increasing beyond limits that we thought unthinkable just a few years ago (for example, AMD is putting 64 MB in L3 for their mid-range Ryzen2 39x0 processors). This is mainly a consequence of the recent introduction of the 7nm process for both ARM and AMD64 architectures. It turns out that compression ratios are quite dependent on the sizes of the streams to compress, so with access to more cores and significantly larger caches, it was clear that Blosc was in pressing need of catching up and fine-tuning its performance for such new 'beasts'.

So, the version released today (C-Blosc 1.20.0) has been carefully fine-tuned to make the most of recent CPUs, especially for fast codecs, where even if speed is more important than compression ratio, the latter is still a very important parameter. With that, we decided to increase the size of every compressed stream in a block from 64 KB to 256 KB (most CPUs nowadays have this amount of private L2 cache or even more). Also, it is important to allow a minimum of shared L3 cache to every thread so that they do not have to compete for resources, so a new restriction has been added so that no thread has to deal with streams larger than 1 MB (both old and modern CPUs seem to guarantee that they provide at least this amount of L3 per thread).

Below you will find the net effects of this new fine-tuning of fast codecs like LZ4 and BloscLZ on our AMD 3900X box (12 physical cores, 64 MB L3). Here we will be comparing results from C-Blosc 1.18.1 and C-Blosc 1.20.0 (we will skip the comparison against 1.19.x because this can be considered an intermediate release in our pursuit). Spoiler: you will be seeing an important boost of compression ratios, while the high speed of LZ4 and BloscLZ codecs is largely kept.

On the plots below, on the left is the performance of the 1.18.1 release, whereas on the right is the performance of the new 1.20.0 release.

Effects in LZ4

Let's start by looking at how the new fine tuning affected compression performance:

lz4-c-before

lz4-c-after

Look at how much the compression ratio has improved. This is mainly a consequence of using compression streams of up to 256 KB, instead of the previous 64 KB. Incidentally, this is just for this synthetic data, but it is clear that real data is going to benefit as well; besides, synthetic data frequently appears in data science (e.g. a uniformly spaced array of values). One can also see that compression speed has not dropped in general, which is great considering that we allow for much better compression ratios now.

Regarding decompression we can see a similar pattern:

lz4-d-before

lz4-d-after

So the decompression speed is generally the same, even for data that can be compressed with high compression ratios.

Effects in BloscLZ

Now it is the turn of BloscLZ. Similarly to LZ4, this codec is also meant for speed, but another reason for its existence is that it usually provides better compression ratios than LZ4 when using synthetic data. In that sense, BloscLZ complements LZ4 well, because the latter can be used for real data, whereas BloscLZ is usually a better bet for highly repetitive synthetic data. In the new C-Blosc we have introduced BloscLZ 2.3.0, which brings a brand new entropy detector that disables compression early when entropy is high, so that CPU cycles are spent selectively where there are more low-hanging data compression opportunities.
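
To give a feeling for what "giving up early when entropy is high" means, here is a crude sketch in Python. It is not the BloscLZ entropy detector: it uses the Shannon entropy of the byte histogram as a stand-in, zlib as the codec, and a made-up threshold and one-byte flag. A byte histogram also ignores repeated patterns, so a real probe needs to be smarter than this.

```python
import math
import os
import zlib
from collections import Counter


def byte_entropy(block):
    """Shannon entropy of the byte histogram, in bits per byte (0.0 to 8.0)."""
    total = len(block)
    counts = Counter(block)
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def compress_block(block, threshold=7.5):
    """Store the block as-is when it looks incompressible, else compress it."""
    if byte_entropy(block) > threshold:
        return b"\x00" + block               # flag 0: stored without compression
    return b"\x01" + zlib.compress(block)    # flag 1: compressed payload


noisy = os.urandom(65536)    # random bytes: entropy close to 8 bits per byte
flat = bytes(65536)          # all zeros: entropy 0

for name, block in [("noisy", noisy), ("flat", flat)]:
    out = compress_block(block)
    print(name, "entropy: %.2f" % byte_entropy(block), "flag:", out[0], "size:", len(out))
```

The point of probing first is that the cheap estimate avoids spending codec time on blocks that would not shrink anyway.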

Here is how performance changes for compression:

blosclz-c-before

blosclz-c-after

In this case, the compression ratio has improved a lot too, and even if compression speed suffers a bit for small compression levels, it is still on par with the original speed for higher compression levels (compressing at more than 30 GB/s while reaching large compression ratios is a big achievement indeed).

Regarding decompression we have this:

blosclz-d-before

blosclz-d-after

As usual for the new release, the decompression speed is generally the same, and performance can still exceed 80 GB/s for the whole range of compression levels. Also noticeable is the fact that single-thread speed is pretty competitive with a regular memcpy(). Again, the Ryzen2 architecture is showing its muscle here.

Final Thoughts

Due to technological reasons, CPUs are evolving towards having more cores and larger caches. Hence, compressors, and especially Blosc, have to adapt to the new status quo. With the new parametrization and new algorithms (early entropy detector) introduced today, we can achieve much better results. In the new Blosc you can expect a good bump in compression ratios with fast codecs (LZ4, BloscLZ) while keeping speed as good as always.

Appendix: Hardware and Software Used

For reference, here is the hardware and software that has been used for this blog entry:

  • Hardware: AMD Ryzen2 3900X, 12 physical cores, 64 MB L3, 32 GB RAM.

  • OS: Ubuntu 20.04

  • Compiler: Clang 10.0.0

  • C-Blosc: 1.18.1 (2020-03-29) and 1.20.0 (2020-07-25)

Enjoy Data!

Blosc Received a $50,000 USD donation

I am happy to announce that the Blosc project recently received a donation of $50,000 USD from Huawei via NumFOCUS. Now that we have such an important amount available, our plan is to use it in order to continue making Blosc and its ecosystem more useful for the community. In order to do so, it is important to stress that our priorities are going to be on the fundamentals of the stack: getting C-Blosc2 out of beta and pushing for making Caterva (the multi-dimensional container on top of C-Blosc2) actually usable.

Critical Tasks: Pushing C-Blosc2 and Caterva

C-Blosc2 has been a kind of laboratory that we have used for the past 5 years to test new ideas, like new 64-bit containers, new filters, a new serialization system, the concept of pre-filters and others. Although the fork from C-Blosc happened such a long time ago, we tried hard to keep the API backwards compatible so that C-Blosc2 can be used as a drop-in replacement for C-Blosc1. But beware: the C-Blosc2 format will not be forward-compatible with C-Blosc1, but it will be backward-compatible, that is, C-Blosc2 will be able to read C-Blosc1 compressed chunks.

For its part, Caterva is our attempt to build a multidimensional container tightly built on top of C-Blosc2, thus leveraging its unique features. Caterva is a C99 library (the same as C-Blosc2), which will allow easy adoption by many different libraries that deal with matrix manipulation. The fact that it supports on-the-fly compression and persistence will open new possibilities, in that the size of matrices will not be limited to the available memory anymore: data may span available memory or disk in compressed state.

Given how fundamental the C-Blosc2 and Caterva packages are meant to be, we think that the usefulness of the Blosc project as a whole will benefit greatly from putting most of our efforts here for the next months/years. For this, we have already established a series of priorities for working on these projects, as specified in the roadmaps below.

Roadmap for C-Blosc2

C-Blosc2 is already in beta stage, and in the next few months we should see it in production stage. Here are some of the more important things that we want to tackle in order to make this happen:

  • Plugin capabilities for allowing users to add more filters and codecs. There should also be a plugin register capability so that the info about the new filters and codecs can be persistent and propagated to different machines.

  • Checksums: the frame can benefit from having a checksum for every chunk/index/metalayer. This will provide more protection against frames that are damaged for whatever reason. Also, this would provide better feedback when trying to determine the parts of the frame that are corrupted.

  • Documentation: extremely important for attracting new users and making life easier for existing ones. Important points to keep in mind here:

    • Quality of API docstrings: is the mission of the functions or data structures clearly and succinctly explained? Are all the parameters explained? Is the return value explained? What are the possible errors that can be returned?

    • Tutorials/book: besides the API docstrings, more documentation materials should be provided, like tutorials or a book about Blosc (or at least, the beginnings of it). Due to its adoption in GitHub and Jupyter notebooks, one of the most widespread and useful markup systems is Markdown, so this should also be the first candidate to use here.

  • Wrappers for other languages: Python and Java are the most obvious candidates, but others like R or Julia would be nice to have. We are still not sure whether these should be produced and maintained by the Blosc development team, or left to interested third parties.

For a more detailed discussion see: https://github.com/Blosc/c-blosc2/blob/master/ROADMAP.md
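
As an illustration of the per-chunk checksum idea in the roadmap above, here is a minimal sketch in Python. It is not the planned frame format: CRC-32 from the standard zlib module is just one possible checksum, and the function names and record layout are assumptions made for this example.

```python
import zlib


def with_checksums(chunks):
    """Pair every chunk with the CRC-32 of its bytes."""
    return [(zlib.crc32(chunk), chunk) for chunk in chunks]


def corrupted_chunks(records):
    """Return the indices of the chunks whose stored checksum no longer matches."""
    return [i for i, (crc, chunk) in enumerate(records)
            if zlib.crc32(chunk) != crc]


records = with_checksums([b"chunk-0" * 100, b"chunk-1" * 100, b"chunk-2" * 100])

# Simulate damage: flip one byte in the second chunk, keeping its old checksum.
crc, chunk = records[1]
damaged = bytes([chunk[0] ^ 0xFF]) + chunk[1:]
records[1] = (crc, damaged)

bad = corrupted_chunks(records)
print(bad)
```

Having one checksum per chunk, rather than one for the whole frame, is what makes it possible to report which part is damaged and still read the rest.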

Roadmap for Caterva

Caterva is a much younger project and, as such, one may say that it is still in alpha stage, although the basic functionality, like creating multidimensional containers, getting items or multidimensional slices, or accessing persistent data without a previous load, is already there. However, we still miss important things like:

  • A complete refactoring of the Caterva C code to facilitate its usability.

  • Adapt the Python interface to the refactoring done in the C code.

  • Add examples to the Python wrapper documentation and create some Jupyter notebooks.

  • Build wheels to make the Python wrapper easier to install.

  • Implement a new level of multidimensionality in Caterva. After that, we will support three layers of multidimensionality in a Caterva container: the shape, the chunk shape and the block shape.

For a more detailed discussion see: https://github.com/Blosc/Caterva/blob/master/ROADMAP.md

How we are spending resources

Money is important, but it is not everything: you need people to work on a project. We are slowly starting to put consistent human resources into the Blosc project. To start with, I (Francesc Alted) and Aleix Alcacer will be putting 25% of our time into the project for the next months, and hopefully others will join too. We will also be using funds to invest in our main tools, that is, laptops and desktop computers, but also in some furniture like proper seats and tables; the office space is important for creating a happy team. Finally, our plan is to use a part of the donation to facilitate meetings among the Blosc development team.

Your input is important for us

Although we plan to organize some meetings of the board of directors and the Blosc development team during the next year or so, we think that our ideas cannot grow in isolation from the community of users. So if you want to convey ideas or, better, contribute implementations of them, we will be happy to listen and discuss. You can get in touch with us via the Blosc mailing list (https://groups.google.com/forum/#!forum/blosc) and the @Blosc2 Twitter account. We think that other tools like Discourse may help in keeping discussions more to the point, but so far we have little experience with them; if you have other suggestions, please tell us.

All in all, the Blosc development team is very excited about this new development, and we are putting all our enthusiasm into delivering a new set of tools that we sincerely hope will be of help to the data community out there.

Finally, let me thank our main sponsor for their generous donation, NumFOCUS for accepting our project under its umbrella, and all the users and contributors who have made Blosc and its ecosystem helpful to people over the past years (a bit more than 10 since the first C-Blosc 1.0 release).

Enjoy Data!

Blosc2-Meets-Rome

On August 7, 2019, AMD released a new generation of its series of EPYC processors, the EPYC 7002, also known as Rome, which is based on the new Zen 2 micro-architecture. Zen 2 is a significant departure from the physical design paradigm of AMD's previous Zen architectures, mainly in that the I/O components of the CPU are laid out on a separate die, different from the computing dies; this is quite different from Naples (aka EPYC 7001), its predecessor in the EPYC series:

/images/blosc2-meets-rome/amd-rome-arch-multi-die.png

Such a separation of dies for I/O and computing has quite large consequences in terms of scalability when accessing memory, which is critical for Blosc operation, and here we want to check how the Blosc and AMD Rome couple behaves. As there is no replacement for experimentation, we are going to use the same benchmark that was introduced in our previous post, Breaking Down Memory Walls. This essentially boils down to computing an aggregation with a simple loop like:

#pragma omp parallel for reduction (+:sum)
for (i = 0; i < N; i++) {
  sum += udata[i];
}

As described in the original blog post, the different udata arrays are just chunks of the original dataset that are decompressed just in time for performing the partial aggregation operation; the final result is the sum of all the partial aggregations. We have also seen that the time to execute the aggregation depends quite a lot on the kind of data that is decompressed: carefully chosen synthetic data can be decompressed much more quickly than real data. But synthetic data is nevertheless interesting, as it allows for an analysis of the upper bound that performance can reach.
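
The structure of that benchmark (decompress a chunk just in time, aggregate it, then add up the partial results) can be mimicked in a few lines of Python. This is only a structural sketch, not the benchmark itself: zlib stands in for Blosc2, a thread pool stands in for the OpenMP parallel loop, and the function names, chunk length and number of threads are arbitrary choices made here.

```python
import zlib
from array import array
from concurrent.futures import ThreadPoolExecutor


def compress_chunks(values, chunk_len):
    """Split the values into chunks and compress each chunk separately."""
    return [
        zlib.compress(array("q", values[i:i + chunk_len]).tobytes())
        for i in range(0, len(values), chunk_len)
    ]


def partial_sum(cchunk):
    """Decompress one chunk just in time and return its partial aggregation."""
    udata = array("q")
    udata.frombytes(zlib.decompress(cchunk))
    return sum(udata)


def aggregate(cchunks, nthreads=4):
    """Add up the partial aggregations computed by a pool of worker threads."""
    with ThreadPoolExecutor(max_workers=nthreads) as pool:
        return sum(pool.map(partial_sum, cchunks))


values = list(range(100_000))
cchunks = compress_chunks(values, chunk_len=10_000)
total = aggregate(cchunks)
print(total == sum(values))
```

Only one decompressed chunk per thread is alive at any time, which is what keeps the working set small enough to stay in the CPU caches in the real benchmark.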

In this blog, we are going to see how the AMD EPYC 7402 (Rome), a 24-core processor, performs on both synthetic and real data.

Aggregating the Synthetic Dataset on AMD EPYC 7402 24-Core

The synthetic data chosen for this benchmark can be compressed/decompressed very easily by applying the shuffle filter before the actual compression codec. Interestingly, and as a good example of how filters can benefit the compression process, if we did not apply the shuffle filter first, the synthetic data would take much more time to compress/decompress (test it yourself if you don't believe this).
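
For readers who want to test this themselves without installing anything, here is a small sketch of the shuffle idea in Python. It is not Blosc's optimized implementation: the shuffle is a plain byte transposition, zlib stands in for the codec, and the dataset (consecutive 32-bit integers) is a made-up example rather than the benchmark data. It compares compressed sizes only, not speed.

```python
import struct
import zlib


def shuffle(data, typesize):
    """Group the i-th byte of every element together (byte transposition)."""
    return b"".join(data[i::typesize] for i in range(typesize))


def unshuffle(data, typesize):
    """Undo shuffle(); len(data) must be a multiple of typesize."""
    nelems = len(data) // typesize
    out = bytearray(len(data))
    for i in range(typesize):
        out[i::typesize] = data[i * nelems:(i + 1) * nelems]
    return bytes(out)


# 100,000 consecutive little-endian 32-bit integers.
data = struct.pack("<100000i", *range(100000))

plain = zlib.compress(data)
shuffled = zlib.compress(shuffle(data, typesize=4))
print("no shuffle:", len(plain), "bytes; with shuffle:", len(shuffled), "bytes")
```

After shuffling, the slowly changing high bytes of all the integers sit next to each other as long runs, which is far easier for an LZ-style codec to exploit than the original interleaved layout.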

After some experiments, and as usual for synthetic datasets, the codec inside Blosc2 that has shown the best speed while keeping a decent compression ratio (54.6x) has been BloscLZ with compression level 3. Here are the results:

/images/blosc2-meets-rome/sum_openmp_synthetic-blosclz-3.png

As we can see, the uncompressed dataset scales pretty well until 8 threads, where it hits the memory wall for this machine (around 74 GB/s). For its part, even if data compressed with Blosc2 (in combination with the BloscLZ codec) shows lower performance initially, it scales quite smoothly up to 12 threads, where it reaches a higher performance than its uncompressed counterpart (reaching the 90 GB/s mark).

After that, the compressed dataset can perform aggregations at speeds that are typically faster than the uncompressed ones, reaching important peaks at some magical numbers of threads (up to 210 GB/s at 48 threads). Why these peaks exist at all is probably related to the architecture of the AMD Rome processor, but given that we are using a 24-core CPU, it is little wonder that numbers like 12, 24 (28 is an exception here) and 48 reach the highest figures.

Aggregating the Precipitation Dataset on AMD EPYC 7402 24-Core

Now it is time to check the performance of the aggregation with the 100 million values dataset coming from a precipitation dataset from Central Europe. Computing the aggregation of this data is representative of a catchment average of precipitation over a drainage area. This time, the best codec inside Blosc2 was determined to be LZ4 with compression level 9:

/images/blosc2-meets-rome/sum_openmp_rainfall-lz4-9-lz4-9-ipp.png

As expected, the uncompressed aggregation scales pretty much the same as for the synthetic dataset (in the end, the Arithmetic Logic Unit in the CPU is completely agnostic about what kind of data it operates on). For its part, the compressed dataset scales more slowly, but more steadily, towards a maximum at 48 threads, where it reaches almost the same speed as the uncompressed dataset, which is quite a feat given the high memory bandwidth of this machine (~74 GB/s).

Also, as Blosc2 recently gained support for the accelerated LZ4 codec inside Intel IPP, figures for it have been added to the plot above. There one can see that Intel's accelerated LZ4 can get a boost of up to 10% in speed compared with regular LZ4; this additional 10% actually allows Blosc2/LZ4 to be clearly faster than the uncompressed dataset at 48 threads.

Final Thoughts

AMD EPYC Rome represents a significant leap forward in adding a high number of cores to CPUs in a way that scales really well, allowing us to throw more computational resources at the problems at hand. Here we have shown how nicely a 24-core AMD Rome CPU performs when working with in-memory compressed datasets: first, by allowing competitive speed when using compression with real data, and second, by allowing speeds of more than 200 GB/s (with synthetic datasets).

Finally, the 24-core CPU that we have exercised here is just for whetting your appetite, as CPUs with 32 or even 64 cores are going to appear more and more often in the near future. Although I should rather have said in the present, as AMD announced today the availability of 32-core CPUs for the workstation market, with 64-core ones coming next year. Definitely, compression is going to play an increasingly important role in getting the most out of these beasts.

Appendix: Software used

For reference, here is the software that has been used for this blog entry:

  • OS: Ubuntu 19.10

  • Compiler: Clang 8.0.0

  • C-Blosc2: 2.0.0b5.dev (2019-09-13)

Acknowledgments

Thanks to packet.com for kindly providing the hardware for the purposes of this benchmark. The Packet guys have been really collaborative over time in allowing me to test new, bare-metal hardware, and I must say that I am quite impressed by how easy it is to start using their services, with almost no effort on the user's side.

C-Blosc2 Enters Beta Stage

The first beta version of C-Blosc2 has been released today. C-Blosc2 is the new iteration of the C-Blosc 1.x series, adding more features and better documentation, and is the outcome of more than 4 years of slow but steady development. This blog entry describes the main features that you may see in the next generation of C-Blosc, as well as an overview of what is in our roadmap.

Note 1: C-Blosc2 is currently in beta stage, so it is not ready to be used in production yet. Having said this, being in beta means that the API has been declared frozen, so there is a guarantee that your programs will continue to work with future versions of the library. If you want to collaborate in this development, you are welcome: have a look at our roadmap below and contribute PRs, or just go to the open issues and help us with them.

Note 2: the term C-Blosc1 will be used instead of the official C-Blosc name for referring to the 1.x series of the library. This is to make the distinction between the C-Blosc 2.x series and C-Blosc 1.x series more explicit.

Main features in C-Blosc2

New 64-bit containers

The main container in C-Blosc2 is the super-chunk or, for brevity, schunk, which is made up of smaller containers (chunks) that are essentially C-Blosc1 32-bit containers. The super-chunk can be backed (or not) by another container called a frame. If a schunk is not backed by a frame (the default), its chunks are stored sparsely in memory.

The frame object allows super-chunks to be stored contiguously, either on disk or in memory. When a super-chunk is backed by a frame, instead of storing all the chunks sparsely in memory, they are serialized inside the frame container. The frame can be stored on disk too, meaning that persistence of super-chunks is supported and that data can be accessed using the same API regardless of where it is stored, memory or disk.

Finally, the user can add meta-data to frames for different uses and in different layers. For example, one may think of providing a meta-layer for NumPy so that most of the meta-data it needs is stored there; then, one can place another meta-layer on top of it to add more high-level info (e.g. geo-spatial, meteorological...), if desired.

Taken together, these features represent a pretty powerful way to store and retrieve compressed data that goes well beyond the contiguous, 32-bit limited compressed buffer of C-Blosc1.
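
To make the super-chunk/frame distinction a bit more concrete, here is a toy model in Python. It is only a sketch under my own assumptions: the class name, the method names and the frame layout (a chunk count, a table of 64-bit lengths, then the chunks) are invented for illustration, and zlib stands in for the Blosc codecs. It is not the C-Blosc2 API nor its actual frame format.

```python
import struct
import zlib


class ToySuperChunk:
    """Toy model of a super-chunk: a list of independently compressed chunks.

    Hypothetical illustration only; this is not the C-Blosc2 API.
    """

    def __init__(self):
        # Without a frame, every chunk lives in its own bytes object,
        # i.e. the chunks are scattered ("sparse") in memory.
        self.chunks = []

    def append(self, data: bytes) -> None:
        self.chunks.append(zlib.compress(data))

    def get(self, i: int) -> bytes:
        return zlib.decompress(self.chunks[i])

    def to_frame(self) -> bytes:
        """Serialize contiguously as [nchunks][len_0..len_n-1][chunk_0..chunk_n-1]."""
        header = struct.pack("<Q", len(self.chunks))
        index = b"".join(struct.pack("<Q", len(c)) for c in self.chunks)
        return header + index + b"".join(self.chunks)

    @classmethod
    def from_frame(cls, frame: bytes) -> "ToySuperChunk":
        (nchunks,) = struct.unpack_from("<Q", frame, 0)
        lengths = struct.unpack_from(f"<{nchunks}Q", frame, 8)
        schunk = cls()
        pos = 8 + 8 * nchunks
        for n in lengths:
            schunk.chunks.append(frame[pos:pos + n])
            pos += n
        return schunk


schunk = ToySuperChunk()
for i in range(3):
    schunk.append(bytes([i]) * 1000)

frame = schunk.to_frame()                  # one contiguous buffer
restored = ToySuperChunk.from_frame(frame)
```

Because the frame is a single bytes object, writing it to a file and reading it back later is all that persistence needs in this toy version, and the chunks are retrieved with the same `get()` call in both cases.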

New filters and filters pipeline

Besides shuffle and bitshuffle, already present in C-Blosc1, C-Blosc2 also implements:

  • delta: the stored blocks inside a chunk are diff'ed with respect to the first block in the chunk. The basic idea here is that, in some situations, the diff will have more zeros than the original data, leading to better compression.

  • trunc_prec: it zeroes the least significant bits of the mantissa of float32 and float64 types. When combined with the shuffle or bitshuffle filter, this leads to more contiguous zeros, which are compressed better and faster.
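
To make the two filters more tangible, here is a minimal Python sketch of the ideas behind them. It is not the C-Blosc2 implementation: the function names are made up for illustration, using XOR as the diff operation is an assumption of this sketch, and the precision truncation works on a single float32 value via plain struct packing.

```python
import struct


def delta_encode(chunk: bytes, blocksize: int) -> bytes:
    """Keep the first block as-is and XOR every later block against it.

    XOR as the "diff" is an assumption of this sketch.
    """
    ref = chunk[:blocksize]
    out = bytearray(ref)
    for start in range(blocksize, len(chunk), blocksize):
        block = chunk[start:start + blocksize]
        out += bytes(b ^ r for b, r in zip(block, ref))
    return bytes(out)


def delta_decode(encoded: bytes, blocksize: int) -> bytes:
    # The first block is stored untouched and XOR is its own inverse,
    # so decoding is the very same operation as encoding.
    return delta_encode(encoded, blocksize)


def trunc_prec32(value: float, keep_bits: int) -> float:
    """Zero the (23 - keep_bits) least significant mantissa bits of a float32."""
    if not 0 <= keep_bits <= 23:
        raise ValueError("a float32 mantissa has 23 explicit bits")
    (bits,) = struct.unpack("<I", struct.pack("<f", value))
    mask = (0xFFFFFFFF << (23 - keep_bits)) & 0xFFFFFFFF
    (truncated,) = struct.unpack("<f", struct.pack("<I", bits & mask))
    return truncated


# Four identical 16-byte blocks: after delta, only the first one is non-zero.
chunk = bytes(range(16)) * 4
encoded = delta_encode(chunk, blocksize=16)
```

Note that delta is lossless (the original chunk can always be recovered), whereas truncating the precision is lossy by design: the discarded mantissa bits are gone for good.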

Also, a new filter pipeline has been implemented. With it, the different filters can be pipelined so that the output of one filter becomes the input of the next; this happens at the block level, which minimizes the size of temporary buffers and hence accelerates the process. Possible examples of pipelines are a delta filter followed by shuffle, or a trunc_prec followed by bitshuffle. Up to 6 filters can be pipelined, so there is plenty of room for upcoming new filters to work together.
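
The pipelining idea can be sketched in a few lines of Python. Everything here is an assumption made for illustration: a filter is modelled as a (forward, backward) pair of functions on bytes, zlib plays the role of the codec, and `byte_diff` is a hypothetical stand-in filter of my own (a running difference of consecutive bytes), not Blosc's delta. Only the limit of 6 filters comes from the text above.

```python
import struct
import zlib
from functools import partial

MAX_FILTERS = 6  # limit mentioned in the text above


def shuffle(block: bytes, typesize: int) -> bytes:
    """Group byte 0 of every element, then byte 1, and so on."""
    return b"".join(block[i::typesize] for i in range(typesize))


def unshuffle(block: bytes, typesize: int) -> bytes:
    nelems = len(block) // typesize
    return b"".join(block[e::nelems] for e in range(nelems))


def byte_diff(block: bytes) -> bytes:
    """Hypothetical stand-in filter: difference between consecutive bytes."""
    return bytes((block[i] - (block[i - 1] if i else 0)) & 0xFF
                 for i in range(len(block)))


def byte_undiff(block: bytes) -> bytes:
    out, prev = bytearray(), 0
    for d in block:
        prev = (prev + d) & 0xFF
        out.append(prev)
    return bytes(out)


def run_pipeline(block: bytes, filters) -> bytes:
    if len(filters) > MAX_FILTERS:
        raise ValueError("too many filters in the pipeline")
    for forward, _ in filters:          # output of one filter feeds the next
        block = forward(block)
    return block


def undo_pipeline(block: bytes, filters) -> bytes:
    for _, backward in reversed(filters):   # undo in the opposite order
        block = backward(block)
    return block


def compress_chunk(chunk: bytes, blocksize: int, filters) -> bytes:
    blocks = (chunk[i:i + blocksize] for i in range(0, len(chunk), blocksize))
    return zlib.compress(b"".join(run_pipeline(b, filters) for b in blocks))


def decompress_chunk(payload: bytes, blocksize: int, filters) -> bytes:
    data = zlib.decompress(payload)
    blocks = (data[i:i + blocksize] for i in range(0, len(data), blocksize))
    return b"".join(undo_pipeline(b, filters) for b in blocks)


filters = [
    (partial(shuffle, typesize=4), partial(unshuffle, typesize=4)),
    (byte_diff, byte_undiff),
]
data = struct.pack("<1024i", *range(1024))     # 1024 little-endian int32 values
payload = compress_chunk(data, blocksize=256, filters=filters)
```

Since every filter only ever sees one block at a time, the temporary buffers never need to be larger than a block, which is the point made above about working at the block level.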

More SIMD support for ARM and PowerPC

There is new SIMD support for ARM (NEON), allowing for faster operation on ARM architectures. Only shuffle is supported right now, but the idea is to implement bitshuffle for NEON too.

Also, SIMD support for PowerPC (ALTIVEC) is here, and both shuffle and bitshuffle are supported. However, this has been done via a transparent mapping from SSE2 into ALTIVEC emulation in GCC 8, so performance could be better (but still, it is already a nice improvement over native C code; see PR https://github.com/Blosc/c-blosc2/pull/59 for details). Thanks to Jerome Kieffer.

New codecs

There is a new Lizard codec, which is an efficient compressor with very fast decompression. It achieves a compression ratio comparable to zip/zlib and zstd/brotli (at low and medium compression levels) while attaining decompression speeds of 1 GB/s or more.

New dictionary support for better compression ratio

Dictionaries allow for better discovery of data duplicates among different blocks: when a block is going to be compressed, C-Blosc2 can use a previously made dictionary (stored in the header of the super-chunk) for compressing all the blocks that are part of the chunks. This usually improves the compression ratio, as well as the decompression speed, at the expense of a (small) overhead in compression speed. Currently, this is only supported in the zstd codec, but it would be nice to extend it to lz4 and blosclz at least.
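
The effect of a shared dictionary on small blocks can be reproduced with Python's standard zlib module, which accepts a preset dictionary through the `zdict` argument. This is just an analogy, not how C-Blosc2 or zstd do it: the helper names are mine, and the dictionary is written by hand here, whereas a real one would be built from the data.

```python
import zlib


def compress_block(block: bytes, zdict: bytes = b"") -> bytes:
    comp = zlib.compressobj(zdict=zdict) if zdict else zlib.compressobj()
    return comp.compress(block) + comp.flush()


def decompress_block(payload: bytes, zdict: bytes = b"") -> bytes:
    decomp = zlib.decompressobj(zdict=zdict) if zdict else zlib.decompressobj()
    return decomp.decompress(payload) + decomp.flush()


# Hand-made dictionary holding what all the blocks have in common
# (an assumption of this sketch; real dictionaries are trained on the data).
shared = b'{"sensor": "temperature", "unit": "celsius", "value": '
blocks = [shared + str(i).encode() + b"}" for i in range(4)]

plain = [compress_block(b) for b in blocks]
with_dict = [compress_block(b, shared) for b in blocks]

plain_size = sum(len(p) for p in plain)
dict_size = sum(len(p) for p in with_dict)
```

Each block is too small to contain much repetition on its own, so the dictionary is what lets the compressor see the duplicated part; the same dictionary must of course be available when decompressing.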

Much improved documentation mark-up

We are currently using a combination of Sphinx + Doxygen + Breathe for documenting the C API for C-Blosc2. This is a huge step forward compared with the documentation of C-Blosc1, where the developer needed to go to the blosc.h header to read the docstrings there. Thanks to Alberto Sabater for contributing the support for this.

Support for Intel IPP (Integrated Performance Primitives)

Intel is producing a series of optimizations in their IPP library, among them an accelerated version of the LZ4 codec. Due to its excellent compression capabilities and speed, LZ4 is probably the most used codec in Blosc, so enabling even a bit more optimization for LZ4 is always good news. And judging by the plots below, the Intel guys seem to have done an excellent job:

[Plot: lz4-no-ipp]

[Plot: lz4-ipp]

In the plots above we see a couple of things: 1) the IPP/LZ4 functions can compress more than regular LZ4, and 2) they are quite a bit faster than regular LZ4. As always, take these plots with a grain of salt, as with actual datasets the compression ratios and speeds of the two will be closer to each other (but still, the difference can be significant). Of course, IPP/LZ4 should generate LZ4 chunks that are completely compatible with the original LZ4 library (but in case you detect any incompatibility, please shout!).

C-Blosc2 beta.1 comes with support for LZ4/IPP out-of-the-box, that is, if IPP is detected in the system, its optimized LZ4 functions are automatically linked and used with the Blosc2 library. If, for portability or other reasons, you don't want to create a Blosc2 library that is linked with Intel IPP, you can disable support for it by passing -DDEACTIVATE_IPP=ON to cmake. In the future, we may well add support for other optimized codecs in IPP too (Zstd would be an excellent candidate).

Roadmap

Of course, C-Blosc2 is not done yet, and there are many interesting enhancements that we would like to tackle sooner or later. Here is a more or less comprehensive list of what is on our roadmap:

  • Lock support for super-chunks: when different processes access super-chunks concurrently, make them synchronize properly by using locks, either on-disk (frame-backed super-chunks) or in-memory.

  • Checksums: the frame can benefit from having a checksum for every chunk/index/metalayer. This will provide more protection against frames that are damaged for whatever reason. Also, this would provide better feedback when trying to determine which parts of the frame are corrupted. Candidates for checksums are xxhash32 or xxhash64, depending on the goals (to be decided).

  • Documentation: utterly important for attracting new users and making life easier for existing ones. Important points to keep in mind here:

    • Quality of API docstrings: is the mission of the functions or data structures clearly and succinctly explained? Are all the parameters explained? Is the return value explained? What are the possible errors that can be returned?

    • Tutorials/book: besides the API docstrings, more documentation materials should be provided, like tutorials or a book about Blosc (or at least, the beginnings of it). Due to its adoption in GitHub and Jupyter notebooks, one of the most widespread and useful markup systems is Markdown, so this should also be the first candidate to use here.

  • Wrappers for other languages: Python and Java are the most obvious candidates, but others like R or Julia would be nice to have. We are still not sure whether these should be produced and maintained by the Blosc development team, or left to interested third parties.

  • It would be nice to use LGTM, a CI-friendly analyzer for security.

  • Adding support for Buildkite as another CI would be handy because it allows using on-premise machines, potentially speeding up the builds, and also setting up pipelines with more complex dependencies and analyzers.
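
As an illustration of the checksum item in the list above, here is a minimal sketch of how a per-chunk checksum makes it possible to tell which part of a container is damaged. It is not a design for the frame format: the function names are hypothetical, and CRC32 from the standard zlib module is used only because it ships with Python, as a stand-in for the xxhash candidates mentioned above.

```python
import zlib


def seal(chunks):
    """Pair every chunk with its checksum (CRC32 as a stand-in for xxhash)."""
    return [(zlib.crc32(c), c) for c in chunks]


def find_corrupted(sealed):
    """Return the indexes of the chunks whose contents no longer match."""
    return [i for i, (crc, c) in enumerate(sealed) if zlib.crc32(c) != crc]


sealed = seal([b"a" * 100, b"b" * 100, b"c" * 100])

# Simulate damage in the second chunk only.
crc, chunk = sealed[1]
sealed[1] = (crc, chunk[:-1] + b"x")

damaged = find_corrupted(sealed)
```

With one checksum per chunk, the damaged chunk can be singled out while the others remain usable; a single checksum for the whole frame could only say that something, somewhere, is wrong.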

The implementation of these features will require the help of people, either by contributing code (see our developing guidelines) or, since Blosc is a project sponsored by NumFOCUS, by making a donation to the project. If you plan to contribute in any way, thanks so much in the name of the community!

Addendum: Special thanks to developers

C-Blosc2 is the outcome of the work of many developers who worked not only on C-Blosc2 itself, but also on C-Blosc1, from which C-Blosc2 inherits a lot of features. I am very grateful to Jack Pappas, who contributed important portability enhancements, especially runtime and cross-platform detection of SSE2/AVX2 (with the help of Julian Taylor) as well as high precision timers (HPET), which are essential for benchmarking purposes. Lucian Marc contributed the support for ARM/NEON for the shuffle filter. Jerome Kieffer contributed support for PowerPC/ALTIVEC. Thanks also to Alberto Sabater for his great efforts in producing really nice Blosc2 docs, among other aspects. And last but not least, to Valentin Haenel for general support, bug fixes and other enhancements through the years.

**Enjoy Data!**