Dynamic plugins in C-Blosc2

Updated: 2023-08-03 Added a new example of a dynamic filter for Python. Also, we have improved the content so that it can work more as a tutorial on how to make dynamic plugins for Blosc2. Finally, there is support now for dynamic plugins on Windows and MacOS/ARM64. Enjoy!

The Blosc Development Team is excited to announce that the latest version of C-Blosc2 and Python-Blosc2 include a great new feature: the ability to dynamically load plugins, such as codecs and filters. This means that these codecs or filters will only be loaded at runtime when they are needed. These C libraries will be easily distributed inside Python wheels and be used from both C and Python code without problems. Keep reading for a gentle introduction to this new feature.

Creating a dynamically loaded filter

To learn how to create dynamic plugins, we'll use an already created example. Suppose you have a filter that you want Blosc2 to load dynamically only when it is used. In this case, you need to create a Python package to build a wheel and install it as a separate library. You can follow the structure used in blosc2_plugin_example to do this:

├── CMakeLists.txt
├── README.md
├── blosc2_plugin_name
│   └── __init__.py
├── examples
│   ├── array_roundtrip.py
│   ├── schunk_roundtrip.py
│   └── test_plugin.c
├── pyproject.toml
├── requirements-build.txt
├── setup.py
└── src
    ├── CMakeLists.txt
    ├── urfilters.c
    └── urfilters.h

Note that the project name has to be blosc2_ followed by the plugin name (plugin_example in this case). The corresponding functions will be defined in the src folder, in our case in urfilters.c, following the same format as functions for user-defined filters (see https://github.com/Blosc/c-blosc2/blob/main/plugins/README.md for more information). Here it is the sample code:

int blosc2_plugin_example_forward(const uint8_t* src, uint8_t* dest,
                                  int32_t size, uint8_t meta,
                                  blosc2_cparams *cparams, uint8_t id) {
  blosc2_schunk *schunk = cparams->schunk;

  for (int i = 0; i < size / schunk->typesize; ++i) {
    switch (schunk->typesize) {
      case 8:
        ((int64_t *) dest)[i] = ((int64_t *) src)[i] + 1;
        break;
      default:
        BLOSC_TRACE_ERROR("Item size %d not supported", schunk->typesize);
        return BLOSC2_ERROR_FAILURE;
    }
  }
  return BLOSC2_ERROR_SUCCESS;
}


int blosc2_plugin_example_backward(const uint8_t* src, uint8_t* dest, int32_t size,
                                   uint8_t meta, blosc2_dparams *dparams, uint8_t id) {
  blosc2_schunk *schunk = dparams->schunk;

  for (int i = 0; i < size / schunk->typesize; ++i) {
    switch (schunk->typesize) {
      case 8:
        ((int64_t *) dest)[i] = ((int64_t *) src)[i] - 1;
        break;
      default:
        BLOSC_TRACE_ERROR("Item size %d not supported", schunk->typesize);
        return BLOSC2_ERROR_FAILURE;
    }
  }
  return BLOSC2_ERROR_SUCCESS;
}

In addition to these functions, we need to create a filter_info (or codec_info or tune_info in each case) named info. This variable will contain the names of the forward and backward functions. In our case, we will have:

filter_info info  = {"blosc2_plugin_example_forward", "blosc2_plugin_example_backward"};

To find the functions, the variable must always be named info. Furthermore, the symbols info and the functions forward and backward must be exported in order for Windows to find them. You can see all the details for doing that in the blosc2_plugin_example repository.

Creating and installing the wheel

Once the project is done, you can create a wheel and install it locally:

python setup.py bdist_wheel
pip install dist/*.whl

This wheel can be uploaded to PyPI so that anybody can use it. Once tested and stable enough, you can request the Blosc Team to register it globally. This way, an ID for the filter or codec will be booked so that the data will always be able to be encoded/decoded by the same code, ensuring portability.

Registering the plugin in C-Blosc2

After installation, and prior to use it, you must register it in C-Blosc2. This step is necessary only if the filter is not already registered globally by C-Blosc2, which is likely if you are testing it or you are not ready to share it with other users. To register it, follow the same process as registering a user-defined plugin, but leave the function pointers as NULL:

blosc2_filter plugin_example;
plugin_example.id = 250;
plugin_example.name = "plugin_example";
plugin_example.version = 1;
plugin_example.forward = NULL;
plugin_example.backward = NULL;
blosc2_register_filter(&plugin_example);

When the filter is used for the first time, C-Blosc2 will automatically fill in the function pointers.

Registering the plugin in Python-Blosc2

The same applies for Python-Blosc2. You can register the filter as follows:

import blosc2
blosc2.register_filter(250, None, None, "plugin_example")

Using the plugin in C-Blosc2

To use the plugin, simply set the filter ID in the filters pipeline, as you would do with user-defined filters:

blosc2_cparams cparams = BLOSC2_CPARAMS_DEFAULTS;
cparams.filters[4] = 250;
cparams.filters_meta[4] = 0;

blosc2_dparams dparams = BLOSC2_DPARAMS_DEFAULTS;

blosc2_schunk* schunk;

/* Create a super-chunk container */
cparams.typesize = sizeof(int32_t);
blosc2_storage storage = {.cparams=&cparams, .dparams=&dparams};
schunk = blosc2_schunk_new(&storage);

To see a full usage example, refer to https://github.com/Blosc/blosc2_plugin_example/blob/main/examples/test_plugin.c. Keep in mind that the executable using the plugin must be launched from the same virtual environment where the plugin wheel was installed. When compressing or decompressing, C-Blosc2 will dynamically load the library and call its functions automatically (as depicted below).

Dynamically loading filter

Once you are satisfied with your plugin, you may choose to request the Blosc Development Team to register it as a global plugin. The only difference (aside from its ID number) is that users won't need to register it locally anymore. Also, a dynamic plugin will not be loaded until it is explicitly requested by any compression or decompression function, saving resources.

Using the plugin in Python-Blosc2

As in C-Blosc2, just set the filter ID in the filters pipeline, as you would do with user-defined filters:

shape = [100, 100]
size = int(np.prod(shape))
nparray = np.arange(size, dtype=np.int32).reshape(shape)
blosc2_array = blosc2.asarray(nparray, cparams={"filters": [250]})

To see a full usage example, refer to https://github.com/Blosc/blosc2_plugin_example/blob/main/examples/array_roundtrip.py.

Conclusions

C-Blosc2's ability to support dynamically loaded plugins allows the library to grow in features without increasing the size and complexity of the library itself. For more information about user-defined plugins, refer to this blog entry.

We appreciate your interest in our project! If you find our work useful and valuable, we would be grateful if you could support us by making a donation. Your contribution will help us continue to develop and improve Blosc packages, making them more accessible and useful for everyone. Our team is committed to creating high-quality and efficient software, and your support will help us to achieve this goal.

Bytedelta: Enhance Your Compression Toolset

Bytedelta is a new filter that calculates the difference between bytes in a data stream. Combined with the shuffle filter, it can improve compression for some datasets. Bytedelta is based on initial work by Aras Pranckevičius.

TL;DR: We have a brief introduction to bytedelta in the 3rd section of this presentation.

The basic concept is simple: after applying the shuffle filter,

/images/bytedelta-enhance-compression-toolset/shuffle-filter.png

then compute the difference for each byte in the byte streams (also called splits in Blosc terminology):

/images/bytedelta-enhance-compression-toolset/bytedelta-filter.png

The key insight enabling the bytedelta algorithm lies in its implementation, especially the use of SIMD on Intel/AMD and ARM NEON CPUs, making the filter overhead minimal.

Although Aras's original code implemented shuffle and bytedelta together, it was limited to a specific item size (4 bytes). Making it more general would require significant effort. Instead, for Blosc2 we built on the existing shuffle filter and created a new one that just does bytedelta. When we insert both in the Blosc2 filter pipeline (it supports up to 6 chained filters), it leads to a completely general filter that works for any type size supported by existing shuffle filter.

With that said, the implementation of the bytedelta filter has been a breeze thanks to the plugin support in C-Blosc2. You can also implement your own filters and codecs on your own, or if you are too busy, we will be happy to assist you.

Compressing ERA5 datasets

The best approach to evaluate a new filter is to apply it to real data. For this, we will use some of the ERA5 datasets, representing different measurements and labeled as "wind", "snow", "flux", "pressure" and "precip". They all contain floating point data (float32) and we will use a full month of each one, accounting for 2.8 GB for each dataset.

The diverse datasets exhibit rather dissimilar complexity, which proves advantageous for testing diverse compression scenarios. For instance, the wind dataset appears as follows:

/images/bytedelta-enhance-compression-toolset/wind-colormap.png

The image shows the intricate network of winds across the globe on October 1, 1987. The South American continent is visible on the right side of the map.

Another example is the snow dataset:

/images/bytedelta-enhance-compression-toolset/snow-colormap.png

This time the image is quite flat. Here one can spot Antarctica, Greenland, North America and of course, Siberia, which was pretty full of snow by 1987-10-01 23:00:00 already.

Let's see how the new bytedelta filter performs when compressing these datasets. All the plots below have been made using a box with an Intel i13900k processor, 32 GB of RAM and using Clear Linux.

/images/bytedelta-enhance-compression-toolset/cratio-vs-filter.png

In the box plot above, we summarized the compression ratios for all datasets using different codecs (BLOSCLZ, LZ4, LZ4HC and ZSTD). The main takeaway is that using bytedelta yields the best median compression ratio: bytedelta achieves a median of 5.86x, compared to 5.62x for bitshuffle, 5.1x for shuffle, and 3.86x for codecs without filters. Overall, bytedelta seems to improve compression ratios here, which is good news.

While the compression ratio is a useful metric for evaluating the new bytedelta filter, there is more to consider. For instance, does the filter work better on some data sets than others? How does it impact the performance of different codecs? If you're interested in learning more, read on.

Effects on various datasets

Let's see how different filters behave on various datasets:

/images/bytedelta-enhance-compression-toolset/cratio-vs-dset.png

Here we see that, for datasets that compress easily (precip, snow), the behavior is quite different from those that are less compressible. For precip, bytedelta actually worsens results, whereas for snow, it slightly improves them. For less compressible datasets, the trend is more apparent, as can be seen in this zoomed in image:

/images/bytedelta-enhance-compression-toolset/cratio-vs-dset-zoom.png

In these cases, bytedelta clearly provides a better compression ratio, most specifically with the pressure dataset, where compression ratio by using bytedelta has increased by 25% compared to the second best, bitshuffle (5.0x vs 4.0x, using ZSTD clevel 9). Overall, only one dataset (precip) shows an actual decrease. This is good news for bytedelta indeed.

Furthermore, Blosc2 supports another compression parameter for splitting the compressed streams into bytes with the same significance. Normally, this leads to better speed but less compression ratio, so this is automatically activated for faster codecs, whereas it is disabled for slower ones. However, it turns out that, when we activate splitting for all the codecs, we find a welcome surprise: bytedelta enables ZSTD to find significantly better compression paths, resulting in higher compression ratios.

/images/bytedelta-enhance-compression-toolset/cratio-vs-dset-always-split-zoom.png

As can be seen, in general ZSTD + bytedelta can compress these datasets better. For the pressure dataset in particular, it goes up to 5.7x, 37% more than the second best, bitshuffle (5.7x vs 4.1x, using ZSTD clevel 9). Note also that this new highest is 14% more than without splitting (the default).

This shows that when compressing, you cannot just trust your intuition for setting compression parameters - there is no substitute for experimentation.

Effects on different codecs

Now, let's see how bytedelta affects performance for different codecs and compression levels.

/images/bytedelta-enhance-compression-toolset/cratio-vs-codec.png

Interestingly, on average bytedelta proves most useful for ZSTD and higher compression levels of ZLIB (Blosc2 comes with ZLIB-NG). On the other hand, the fastest codecs (LZ4, BLOSCLZ) seem to benefit more from bitshuffle instead.

Regarding compression speed, in general we can see that bytedelta has little effect on performance:

/images/bytedelta-enhance-compression-toolset/cspeed-vs-codec.png

As we can see, compression algorithms like BLOSCLZ, LZ4 and ZSTD can achieve extremely high speeds. LZ4 reaches and surpasses speeds of 30 GB/s, even when using bytedelta. BLOSCLZ and ZSTD can also exceed 20 GB/s, which is quite impressive.

Let’s see the compression speed grouped by compression levels:

/images/bytedelta-enhance-compression-toolset/cspeed-vs-codec-clevel.png

Here one can see that, to achieve the highest compression rates when combined with shuffle and bytedelta, the codecs require significant CPU resources; this is especially noticeable in the zoomed-in view:

/images/bytedelta-enhance-compression-toolset/cspeed-vs-codec-clevel-zoom.png

where capable compressors like ZSTD do require up to 2x more time to compress when using bytedelta, especially for high compression levels (6 and 9).

Now, let us examine decompression speeds:

/images/bytedelta-enhance-compression-toolset/dspeed-vs-codec.png

In general, decompression is faster than compression. BLOSCLZ, LZ4 and LZ4HC can achieve over 100 GB/s. BLOSCLZ reaches nearly 180 GB/s using no filters on the snow dataset (lowest complexity).

Let’s see the decompression speed grouped by compression levels:

/images/bytedelta-enhance-compression-toolset/dspeed-vs-codec-clevel.png

The bytedelta filter noticeably reduces speed for most codecs, up to 20% or more. ZSTD performance is less impacted.

Achieving a balance between compression ratio and speed

Often, you want to achieve a good balance of compression and speed, rather than extreme values of either. We will conclude by showing plots depicting a combination of both metrics and how bytedelta influences them.

Let's first represent the compression ratio versus compression speed:

/images/bytedelta-enhance-compression-toolset/cratio-vs-cspeed.png

As we can see, the shuffle filter is typically found on the Pareto frontier (in this case, the point furthest to the right and top). Bytedelta comes next. In contrast, not using a filter at all is on the opposite side. This is typically the case for most real-world numerical datasets.

Let's now group filters and datasets and calculate the mean values of combining (in this case, multiplying) the compression ratio and compression speed for all codecs.

/images/bytedelta-enhance-compression-toolset/cspeed-vs-filter.png

As can be seen, bytedelta works best with the wind dataset (which is quite complex), while bitshuffle does a good job in general for the others. The shuffle filter wins on the snow dataset (low complexity).

If we group by compression level:

/images/bytedelta-enhance-compression-toolset/cratio_x_cspeed-vs-codec-clevel.png

We see that bytedelta works well with LZ4 here, and also with ZSTD at the lowest compression level (1).

Let's revise the compression ratio versus decompression speed comparison:

/images/bytedelta-enhance-compression-toolset/cratio-vs-dspeed.png

Let's group together the datasets and calculate the mean for all codecs:

/images/bytedelta-enhance-compression-toolset/cratio_x_dspeed-vs-filter-dset.png

In this case, shuffle generally prevails, with bitshuffle also doing reasonably well, winning on precip and pressure datasets.

Also, let’s group the data by compression level:

/images/bytedelta-enhance-compression-toolset/cratio_x_dspeed-vs-codec-clevel.png

We find that bytedelta compression does not outperform shuffle compression in any scenario. This is unsurprising since decompression is typically fast, and bytedelta's extra processing can decrease performance more easily. We also see that LZ4HC (clevel 6 and 9) + shuffle strikes the best balance in this scenario.

Finally, let's consider the balance between compression ratio, compression speed, and decompression speed:

/images/bytedelta-enhance-compression-toolset/cratio_x_cspeed_dspeed-vs-dset.png

Here the winners are shuffle and bitshuffle, depending on the data set, but bytedelta never wins.

If we group by compression levels:

/images/bytedelta-enhance-compression-toolset/cratio_x_cspeed_dspeed-vs-codec-clevel.png

Overall, we see LZ4 as the clear winner at any level, especially when combined with shuffle. On the other hand, bytedelta did not win in any scenario here.

Benchmarks for other computers

We have run the benchmarks presented here in an assortment of different boxes:

Also, find here a couple of runs using the i9-13900K box above, but with the always split and never split settings:

Reproducing the benchmarks is straightforward. First, download the data; the downloaded files will be in the new era5_pds/ directory. Then perform the series of benchmarks; this is takes time, so grab coffee and wait 30 min (fast workstations) to 6 hours (slow laptops). Finally, run the plotting Jupyter notebook to explore your results. If you wish to share your results with the Blosc development team, we will appreciate hearing from you!

Conclusion

Bytedelta can achieve higher compression ratios in most datasets, specially in combination with capable codecs like ZSTD, with a maximum gain of 37% (pressure) over other codecs; only in one case (precip) compression ratio decreases. By compressing data more efficiently, bytedelta can reduce file sizes even more, accelerating transfer and storage.

On the other hand, while bytedelta excels at achieving high compression ratios, this requires more computing power. We have found that for striking a good balance between high compression and fast compression/decompression, other filters, particularly shuffle, are superior overall.

We've learned that no single codec/filter combination is best for all datasets:

  • ZSTD (clevel 9) + bytedelta can get better absolute compression ratio for most of the datasets.

  • LZ4 + shuffle is well-balanced for all metrics (compression ratio, speed, decompression speed).

  • LZ4 (clevel 6) and ZSTD (clevel 1) + shuffle strike a good balance of compression ratio and speed.

  • LZ4HC (clevel 6 and 9) + shuffle balances well compression ratio and decompression speed.

  • BLOSCLZ without filters achieves best decompression speed (at least in one instance).

In summary, the optimal choice depends on your priorities.

As a final note, the Blosc development team is working on BTune, a new deep learning tuner for Blosc2. BTune can be trained to automatically recognize different kinds of datasets and choose the optimal codec and filters to achieve the best balance, based on the user's needs. This would create a much more intelligent compressor that can adapt itself to your data faster, without requiring time-consuming manual tuning. If interested, contact us; we are looking for beta testers!

Introducing Blosc2 NDim

One of the latest and more exciting additions in recently released C-Blosc2 2.7.0 is the Blosc2 NDim layer (or b2nd for short). It allows to create and read n-dimensional datasets in an extremely efficient way thanks to a completely general n-dim 2-level partitioning, allowing to slice and dice arbitrary large (and compressed!) data in a more fine-grained way.

Remember that having a second partition means that we have better flexibility to fit the different partitions at the different CPU cache levels; typically the first partition (aka chunks) should be made to fit in L3 cache, whereas the second partition (aka blocks) should rather fit in L2/L1 caches (depending on whether compression ratio or speed is desired).

This capability was formerly part of Caterva, and now we are including it in C-Blosc2 for convenience. As a consequence, the Caterva and Python-Caterva projects are now officially deprecated and all the action will happen in the C-Blosc2 / Python-Blosc2 side of the things.

Last but not least, Blosc NDim is gaining support for a full-fledged data type system like NumPy. Keep reading.

Going multidimensional in the first and the second partition

Blosc (both Blosc1 and Blosc2) has always had a two-level partition schema to leverage the different cache levels in modern CPUs and make compression happen as quickly as possible. This allows, among other things, to create and query tables with 100 trillion of rows when properly cooperating with existing libraries like HDF5.

With Blosc2 NDim we are taking this feature a step further and both partitions, known as chunks and blocks, are gaining multidimensional capabilities meaning that one can split some dataset (super-chunk in Blosc2 parlance) in such a n-dim cubes and sub-cubes:

/images/blosc2-ndim-intro/b2nd-2level-parts.png

With these more fine-grained cubes (aka partitions), it is possible to retrieve arbitrary n-dim slices more rapidly because you don't have to decompress all the data that is necessary for the more coarse-grained partitions typical in other libraries.

For example, for a 4-d array with a shape of (50, 100, 300, 250) with float64 items, we can choose a chunk with shape (10, 25, 50, 50) and a block with shape (3, 5, 10, 20) which makes for about 5 MB and 23 KB respectively. This way, a chunk fits comfortably on a L3 cache in most of modern CPUs, and a block in a L1 cache (we are tuning for speed here). With that configuration, the NDArray object in the Python-Blosc2 package can slice the array as fast as it is shown below:

/images/blosc2-ndim-intro/Read-Partial-Slices-B2ND.png

Of course, the double partition comes with some overhead during the creation of the partitions: more data moves and computations are required in order to place the data in the correct positions. However, we have done our best in order to minimize the data movement as much as possible. Below we can see how the speed of creation (write) of an array from anew is still quite competitive:

/images/blosc2-ndim-intro/Complete-Write-Read-B2ND.png

On the other hand, we can also see that, when reading the complete array, the double partitioning overhead is not really a big issue, and actually, it benefits Blosc2 NDArray somewhat.

All the plots above have been generated using the compare_getslice.py script, where we have been using the Zstd codec with compression level 1 (the fastest inside Blosc2) + the Shuffle filter for all the packages. The box used was an Intel 13900K CPU with 32 GB of RAM and using an up-to-date Clear Linux distro.

Data types are in!

Another important thing that we are adding to Blosc2 NDim is the support for data types. This was not previously supported in either C-Blosc2 or Caterva, where only a typesize was available to characterize the type. Now, the data type becomes a first class citizen for the b2nd metalayer. Metalayers in Blosc2 are stored in msgpack format, so it is pretty easy to introspect into them by using external msgpack tools. For example, the b2nd file created in the section above contains this meta info:

$ dd bs=1 skip=112 count=1000 <  compare_getslice.b2nd | msgpack2json -b
<snip>
[0,4,[50,100,300,250],[10,25,50,50],[3,5,10,20],0,"<f8"]

Here we can see the version of the metalayer (0), the number of dimensions of the array (4), followed by the shape, chunk shape and block shape. Then it comes the version of the dtype representation (it support up to 127; the default is 0, meaning NumPy). Finally, we can spot the "<f8" string, so a little-endian double precision data type. Note that the all data types in NumPy are supported by the Python wrapper of Blosc2; that means that with the NDArray object you can store e.g. datetimes (including units), or arbitrarily nested heterogeneous types, which allows to create multidimensional tables.

Conclusion

We have seen how, when sensibly chosen, the double partition provides a formidable boost in retrieving arbitrary slices in potentially large multidimensional arrays. In addition, the new support for arbitrary data types represents a powerful addition as well. Combine that with the excellent compression capabilities of Blosc2, and you will get a first class data container for many types of (numerical, but also textual) data.

Finally, we will be releasing the new NDArray object in the forthcoming release of Python-Blosc2 very soon. This will enable full access to these shiny new features of Blosc2 from the convenience of Python. Stay tuned!

If you regularly store and process large datasets and need advice to partition your data, or choosing the best combination of codec, filters, chunk and block sizes, or many other aspects of compression, do not hesitate to contact the Blosc team at contact (at) blosc.org. We have more than 30 years of cumulative experience in data handling systems like HDF5, Blosc and efficient I/O in general; but most importantly, we have the ability to integrate these innovative technologies quickly into your products, enabling a faster access to these innovations.

Update (2023-02-24)

Slicing a dataset in pineapple-style

Enjoy the meal!

100 Trillion Rows Baby

In recently released PyTables 3.8.0 we gave support for an optimized path for writing and reading Table instances with Blosc2 cooperating with the HDF5 machinery. On the blog describing its implementation we have shown how it collaborates with the HDF5 library so as to get top-class I/O performance.

Since then, we have been aware (thanks to Mark Kittisopikul) of the introduction of the H5Dchunk_iter function in HDF5 1.14 series. This predates the functionality of H5Dget_chunk_info, and makes retrieving the offsets of the chunks in the HDF5 file way more efficiently, specially on files with a large number of chunks - H5Dchunk_iter cost is O(n), whereas H5Dget_chunk_info is O(n^2).

As we decided to implement support for H5Dchunk_iter in PyTables, we were curious on the sort of boost this could provide reading tables created from real data. Keep reading for the experiments we've conducted about this.

Effect on (relatively small) datasets

We start by reading a table with real data coming from our usual ERA5 database. We fetched one year (2000 to be specific) of data with five different ERA5 datasets with the same shape and the same coordinates (latitude, longitude and time). This data has been stored on a table with 8 columns with 32 bytes per row and with 9 millions rows (for a grand total of 270 GB); the number of chunks is about 8K.

When using compression, the size is typically reduced between a factor of 6x (LZ4 + shuffle) and 9x (Zstd + bitshuffle); in any case, the resulting file size is larger than the RAM available in our box (32 GB), so we can safely exclude OS filesystem caching effects here. Let's have a look at the results on reading this dataset inside PyTables (using shuffle only; for bitshuffle results are just a bit slower):

/images/100-trillion-baby/real-data-9Grow-seq.png/images/100-trillion-baby/real-data-9Grow-rand.png

We see how the improvement when using HDF5 1.14 (and hence H5Dchunk_iter) for reading data sequentially (via a PyTables query) is not that noticeable, but for random queries, the speedup is way more apparent. For comparison purposes, we added the figures for Blosc1+LZ4; one can notice the great job of Blosc2, specially in terms of random reads due to the double partitioning and HDF5 pipeline replacement.

A trillion rows table

But 8K chunks is not such a large figure, and we are interested in using datasets with a larger amount. As it is very time consuming to download large amounts of real data for our benchmarks purposes, we have decided to use synthetic data (basically, a bunch of zeros) just to explore how the new H5Dchunk_iter function scales when handling extremely large datasets in HDF5.

Now we will be creating a large table with 1 trillion rows, with the same 8 fields than in the previous section, but whose values are zeros (remember, we are trying to push HDF5 / Blosc2 to their limits, so data content is not important here). With that, we are getting a table with 845K chunks, which is about 100x more than in the previous section.

With this, lets' have a look at the plots for the read speed:

/images/100-trillion-baby/synth-data-9Grow-seq.png/images/100-trillion-baby/synth-data-9Grow-rand.png

As expected, we are getting significantly better results when using HDF5 1.14 (with H5Dchunk_iter) in both sequential and random cases. For comparison purposes, we have added Blosc1-Zstd which does not make use of the new functionality. In particular, note how Blosc1 gets better results for random reads than Blosc2 with HDF5 1.12; as this is somehow unexpected, if you have an explanation, please chime in.

It is worth noting that even though the data are made of zeros, Blosc2 still needs to compress/decompress the full 32 TB thing. And the same goes for numexpr, which is used internally to perform the computations for the query in the sequential read case. This is testimonial of the optimization efforts in the data flow (i.e. avoiding as much memory copies as possible) inside PyTables.

100 trillion rows baby

As a final exercise, we took the previous experiment to the limit, and made a table with 100 trillion (that’s a 1 followed with 14 zeros!) rows and measured different interesting aspects. It is worth noting that the total size for this case is 2.8 PB (petabyte), and the number of chunks is around 85 millions (finally, large enough to fully demonstrate the scalability of the new H5Dchunk_iter functionality).

Here it is the speed of random and sequential reads:

/images/100-trillion-baby/synth-data-100Trow-seq.png/images/100-trillion-baby/synth-data-100Trow-rand.png

As we can see, despite the large amount of chunks, the sequential read speed actually improved up to more than 75 GB/s. Regarding the random read latency, it increased to 60 µs; this is not too bad actually, as in real life the latencies during random reads in such a large files are determined by the storage media, which is no less than 100 µs for the fastest SSDs nowadays.

The script that creates the table and reads it can be found at bench/100-trillion-rows-baby.py. For the curious, it took about 24 hours to run on a Linux box wearing an Intel 13900K CPU with 32 GB of RAM. The memory consumption during writing was about 110 MB, whereas for reading was 1.7 GB steadily (pretty good for a multi-petabyte table). The final size for the file has been 17 GB, for a compression ratio of more than 175000x.

Conclusion

As we have seen, the H5Dchunk_iter function recently introduced in HDF5 1.14 is confirmed to be of a big help in performing reads more efficiently. We have also demonstrated that scalability is excellent, reaching phenomenal sequential speeds (exceeding 75 GB/s with synthetic data) that cannot be easily achieved by the most modern I/O subsystems, and hence avoiding unnecessary bottlenecks.

Indeed, the combo HDF5 / Blosc2 is able to handle monster sized tables (on the petabyte ballpark) without becoming a significant bottleneck in performance. Not that you need to handle such a sheer amount of data anytime soon, but it is always reassuring to use a tool that is not going to take a step back in daunting scenarios like this.

If you regularly store and process large datasets and need advice to partition your data, or choosing the best combination of codec, filters, chunk and block sizes, or many other aspects of compression, do not hesitate to contact the Blosc team at contact (at) blosc.org. We have more than 30 years of cumulated experience in storage systems like HDF5, Blosc and efficient I/O in general; but most importantly, we have the ability to integrate these innovative technologies quickly into your products, enabling a faster access to these innovations.

20 years of PyTables

Back in October 2002 the first version of PyTables was released. It was an attempt to store a large amount of tabular data while being able to provide a hierarchical structure around it. Here it is the first public announcement by me:

Hi!,

PyTables is a Python package which allows dealing with HDF5 tables.
Such a table is defined as a collection of records whose values are
stored in fixed-length fields.  PyTables is intended to be easy-to-use,
and tried to be a high-performance interface to HDF5.  To achieve this,
the newest improvements in Python 2.2 (like generators or slots and
metaclasses in brand-new classes) has been used.  Python creation
extension tool has been chosen to access the HDF5 library.

This package should be platform independent, but until now I’ve tested
it only with Linux.  It’s the first public release (v 0.1), and it is
in alpha state.

As noted, PyTables was an early adopter of generators and metaclasses that were introduced in the new (by that time) Python 2.2. It turned out that generators demonstrated to be an excellent tool in many libraries related with data science. Also, Pyrex adoption (which was released just a few months ago) greatly simplified the wrapping of native C libraries like HDF5.

By that time there were not that much Python libraries for persisting tabular data with a format that allowed on-the-flight compression, and that gave PyTables a chance to be considered as a good option. Some months later, PyCon 2003 accepted our first talk about PyTables. Since then, we (mainly me, with the support from Scott Prater on the documentation part) gave several presentations in different international conferences, like SciPy or EuroSciPy and its popularity skyrocketed somehow.

Cárabos Coop. V.

In 2005, and after receiving some good inputs on PyTables by some customers (including The HDF Group), we decided to try to make a life out of PyTables development and together with Vicent Mas and Ivan Vilata, we set out to create a cooperative called Cárabos Coop V. Unfortunately, and after 3 years of enthusiastic (and hard) work, we did not succeed in making the project profitable, and we had to close by 2008.

During this period we managed to make a professional version of PyTables that was using out-of core indexes (aka OPSI) as well as a GUI called ViTables. After closing Cárabos we open sourced both technologies, and we are happy to say that they are still in good use, most specially OPSI indexes, that are meant to perform fast queries in very large datasets; OPSI can still be used straight from pandas.

Crew renewal

After Cárabos closure, I (Francesc Alted) continued to maintain PyTables for a while, but in 2010 I expressed my desire to handover the project, and shortly after, a new gang of people, including Anthony Scopatz and Antonio Valentino, with Andrea Bedini joining shortly after, stepped ahead and took the challenge. This is where open source is strong: whenever a project faces difficulties, there are always people eager to jump up to the wagon and continue providing traction for it.

Attempt to merge with h5py

Meanwhile, the h5py package was receiving a great adoption, specially from the community that valued more the multidimensional arrays than the tabular side of the things. There was a feeling that we were duplicating efforts and by 2016, Andrea Bedini, with the help of Anthony Scopatz, organized a HackFest in Perth, Australia where developers of the h5py and PyTables gathered to attempt a merge of the two projects. After the initial work there, we continued this effort with a grant from NumFOCUS.

Unfortunately, the effort demonstrated to be rather complex, and we could not finish it properly (for the sake of curiosity, the attempt is still available). At any rate, we are actively encouraging people using both packages depending on the need; see for example, the tutorial on h5py/PyTables that Tom Kooij taught at SciPy 2017.

Satellite Projects: Blosc and numexpr

As many other open sources libraries, PyTables stands in the shoulders of giants, and makes use of amazing libraries like HDF5 or NumPy for doing its magic. In addition to that, and in order to allow PyTables push against the hardware I/O and computational limits, it leverages two high-performance packages: Blosc and numexpr. Blosc is in charge of compressing data efficiently and at very high speeds to overcome the limits imposed by the I/O subsystem, while numexpr allows to get maximum performance from computations in CPU when querying large tables. Both projects have been substantially improved by the PyTables crew, and actually, they are quite popular by themselves.

Specifically, the Blosc compressor, although born out of the needs of PyTables, it spun off as a standalone compressor (or meta-compressor, as it can use several codecs internally) meant to accelerate not just disk I/O, but also memory access in general. In an unexpected twist, Blosc2, has developed its own multi-level data partitioning system, which goes beyond the single-level partitions in HDF5, and is currently helping PyTables to reach new performance heights. By teaming with the HDF5 library (and hence PyTables), Blosc2 is allowing PyTables to query 100 trillion rows in human timeframes.

Thank you!

It has been a long way since PyTables started 20 years ago. We are happy to have helped in providing a useful framework for data storage and querying needs for many people during the journey.

Many thanks to all maintainers and contributors (either with code or donations) to the project; they are too numerous to mention them all here, but if you are reading this and are among them, you should be proud to have contributed to PyTables. In hindsight, the road may have been certainly bumpy, but it somehow worked and many difficulties have been surpassed; such is the magic and grace of Open Source!

Blosc2 Meets PyTables: Making HDF5 I/O Performance Awesome

PyTables lets users to easily handle data tables and array objects in a hierarchical structure. It also supports a variety of different data compression libraries through HDF5 filters. With the release of PyTables 3.8.0, the Blosc Development Team is pleased to announce the availability of Blosc2, acting not only as another HDF5 filter, but also as an additional partition tool (aka 'second partition') that complements the existing HDF5 chunking schema.

By providing support for a second partition in HDF5, the chunks (aka the 'first partition') can be made larger, ideally fitting in cache level 3 in modern CPUs (see below for advantages of this). Meanwhile, Blosc2 will use its internal blocks (aka the second partition) as the minimum amount of data that should be read and decompressed during data retrieval, no matter how small the hyperslice to be read is.

When Blosc2 is used to implement a second partition for data (referred ahead as 'optimized Blosc2' too), it can bypass the HDF5 pipeline for writing and for reading. This brings another degree of freedom when choosing the different internal I/O buffers, which can be of extraordinary importance in terms of performance and/or resource saving.

How second partition allows for Big Chunking

Blosc2 in PyTables is meant for compressing data in big chunks (typically in the range of level 3 caches in modern CPUs, that is, 10 ~ 1000 MB). This has some interesting advantages:

  • It allows to reduce the number of entries in the HDF5 hash table. This means less resource consumption in the HDF5 side, so PyTables can handle larger tables using less resources.

  • It speeds-up compression and decompression because multithreading works better with more blocks. Remember that you can specify the number of threads to use by using the MAX_BLOSC_THREADS parameter, or by using the BLOSC_NTHREADS environment variable.

However, the traditional drawback of having large chunks is that getting small slices would take long time because the whole chunk has to be read completely and decompressed. Blosc2 surmounts that difficulty like this: it asks HDF5 where chunks start on-disk (via H5Dget_chunk_info()), and then it access to the internal blocks (aka the second partition) independently instead of having to decompress the entire chunk. This effectively avoids penalizing access to small data slices.

In the graphic below you can see the second partition in action where, in order to retrieve the green slice, only blocks 2 and 3 needs to be addressed and decompressed, instead of the (potentially much) larger chunk 0 and 1, which would be the case for the traditional 1 single partition in HDF5:

/images/blosc2_pytables/block-slice.png

In the benchmarks below we are comparing the performance of existing filters inside PyTables (like Zlib or Blosc(1)) against Blosc2, both working as a filter or in optimized mode, that is, bypassing the HDF5 filter pipeline completely.

Benchmarks

The data used in this section have been fetched from ERA5 database (see downloading script), which provides hourly estimates of a large number of atmospheric, land and oceanic climate variables. To build the tables used for reading and writing operations, we have used five different ERA5 datasets with the same shape (100 x 720 x 1440) and the same variables (latitude, longitude and time). Then, we have built a table with a column for each variable and each dataset and added the latitude, longitude and time as columns (for a total of 8 cols). Finally, there have been written 100 x 720 x 1440 rows (more than 100 million) to this table, which makes for a total data size of 3.1 GB.

We present different scenarios when comparing resource usage for writing and reading between the Blosc and Blosc2 filters, including the Blosc2 optimized versions. First one is when PyTables is choosing the chunkshape automatically (the default); as Blosc2 is meant towards large chunks, PyTables has been tuned to produce far larger chunks for Blosc2 in this case (Blosc and other filters will remain using the same chunk sizes as usual). Second, we will visit the case where the chunkshape is equal for both Blosc and Blosc2. Spoiler alert: we will see how Blosc2 behaves well (and sometimes much beter) in both scenarios.

Automatic chunkshape

Inkernel searches

We start by performing inkernel queries where the chunkshape for the table is chosen automatically by the PyTables machinery (see query script here). This size is the same for Blosc, Zlib and uncompressed cases which are all using 16384 rows (about 512 KB), whereas for Blosc2 the computed chunkshape is quite larger: 1179648 rows (about 36 MB; this actually depends on the size of the L3 cache, which is automatically queried in real-time by PyTables and it turns out to be exactly 36 MB for our CPU, an Intel i9-13900K).

Now, we are going to analyze the memory and time usage of performing six inkernel searches, which means scanning the full table six times, in different cases:

  • With no compression; size is 3,1 GB.

  • Using HDF5 with ZLIB + Shuffle; size is 407 MB.

  • Using Blosc filter with BloscLZ codec + Bitshuffle; size is 468 MB.

  • Using Blosc2 filter with BloscLZ codec + Bitshuffle; size is 421 MB.

  • Using Blosc2 filter with Zstd codec + Bitshuffle; size is 341 MB.

/images/blosc2_pytables/inkernel-nocomp-zlib-blosc-zstd.png

As we can see, the queries with no compression enable do not take much time or memory consumption, but it requires storing the full 3.1 GB of data. When using ZLIB, which is the HDF5 default, it does not require much memory either, but it takes a much more time (about 10x more), although the stored data is more than 6x smaller. When using Blosc, the time spent in (de-)compression is much less, but the queries still takes more time (1.7x more) than the no compression case; in addition, the compression ratio is quite close to the ZLIB case.

However, the big jump comes when using Blosc2 with BloscLZ and BitShuffle, since although it uses just a little bit more memory than Blosc (a consequence of using larger chunks), in exchange it is quite faster than the previous methods while achieving a noticeably better compression ratio. Actually, this combination is 1.3x faster than using no compression; this is one of the main features of Blosc (and even more with Blosc2): it can accelerate operation by using compression.

Finally, in case we want to improve compression further, Blosc2 can be used with the ZSTD codec, which achieves the best compression ratio here, in exchange for a slightly slower time (but still 1.15x faster than not using compression).

PyTables inkernel vs pandas queries

Now that we have seen how Blosc2 can help PyTables in getting great query performance, we are going to compare it against pandas queries; to make things more interesting, we will be using the same NumExpr engine in both PyTables (where it is used in inkernel queries) and pandas.

For this benchmark, we have been exploring the best configuration for speed, so we will be using 16 threads (for both Blosc2 and NumExpr) and the Shuffle filter instead of Bitshuffle; this leads to slightly less compression ratios (see below), but now the goal is getting full speed, not reducing storage (keep in mind that Pandas stores data in-memory without compression).

Here it is how PyTables and pandas behave when doing the same 6 queries than in the previous section.

/images/blosc2_pytables/inkernel-pandas.png

And here it is another plot for the same queries, but comparing raw I/O performance for a variety of codecs and filters:

/images/blosc2_pytables/inkernel-vs-pandas2.png

As we can see, the queries using Blosc2 are generally faster (up to 2x faster) than not using compression. Furthermore, Blosc2 + LZ4 get nearly as good times as pandas, while the memory consumption is much smaller with Blosc2 (as much as 20x less in this case; more for larger tables indeed). This is remarkable, as this means that Blosc2 compression results in acceleration that almost compensates for all the additional layers in PyTables (the disk subsystem and the HDF5 library itself).

And in case you wonder how much compression ratio we have lost by switching from Bitshuffle to Shuffle, not much actually:

/images/blosc2_pytables/shuffle-bitshuffle-ratios.png

The take away message here is that, when used correctly, compression can make out-of-core queries go as fast as pure in-memory ones (even when using a high performance tool-set like pandas + NumExpr).

Writing and reading speed with automatic chunkshape

Now, let's have a look at the raw write and read performance. In this case we are going to compare Blosc, Blosc2 as an HDF5 filter, and the optimized Blosc2 (acting as a de facto second partition). Remember that in this section the chunkshape determination is still automatic and different for Blosc (16384 rows, about 512 KB) and Blosc2 (1179648 rows, about 36 MB).

/images/blosc2_pytables/append-expectedrows.png

For writing, optimized Blosc2 is able to do the job faster and get better compression ratios than others, mainly because it uses the HDF5 direct chunking mechanism, bypassing the overhead of the HDF5 pipeline.

Note: the standard Blosc2 filter cannot make of use HDF5 direct chunking, but it still has an advantage when using bigger chunks because it allows for more threads working in parallel and hence, allowing improved parallel (de-)compression.

The plot below shows how optimized Blosc2 is able to read the table faster and how the performance advantage grows as we use more threads.

/images/blosc2_pytables/read-expectedrows.png

And now, let's compare the mean times of Blosc and Blosc2 optimized to read a small slice. In this case, Blosc chunkshape is much smaller, but optimized Blosc2 still can reach similar speed since it uses blocks that are similar in size to Blosc chunks.

/images/blosc2_pytables/slice-read-expectedrows.png

Writing and reading speed when using the same chunkshape

In this scenario, we are choosing the same chunkshape (720 x 1440 rows, about 32 MB) for both Blosc and Blosc2. Let's see how this affects performance:

/images/blosc2_pytables/append-chunklen.png

The plot above shows how optimized Blosc2 manages to write the table faster (mainly because it can bypass the HDF5 pipeline); with the advantage being larger as more threads are used.

/images/blosc2_pytables/read-chunklen.png

Regarding reading, the optimized Blosc2 is able to perform faster too, and we continue to see the same trend of getting more speed when more threads are thrown at the task, with optimized Blosc2 scaling better.

Finally, let's compare the mean times of Blosc and Blosc2 when reading a small slice in this same chunkshape scenario. In this case, since chunkshapes are equal and large, optimized Blosc2 is much faster than the others because it has the ability to decompresses just the necessary internal blocks, instead of the whole chunks. However, the Blosc and the Blosc2 filter still need to decompress the whole chunk, so getting much worse times. See this effect below:

/images/blosc2_pytables/slice-read-chunklen.png

Final words

By allowing a second partition on top of the HDF5 layer, Blosc2 provides a great boost in PyTables I/O speed, specially when using big chunks (mainly when they fit in L3 CPU cache). That means that you can read, write and query large compressed datasets in less time. Interestingly, Blosc2 compression can make these operations faster than when using no compression at all, and even being competitive against a pure in-memory solution like pandas (but consuming vastly less memory).

On the other hand, there are situations where using big chunks would not be acceptable. For example, when using other HDF5 apps that do not support the optimized paths for Blosc2 second partition, and one is forced to use the plain Blosc2 filter. In this case having large chunks would penalize the retrieval of small data slices too much. By the way, you can find a nice assortment of generic filters (including Blosc2) for HDF5 in the hdf5plugin library.

Also note that, in the current implementation we have just provided optimized Blosc2 paths for the Table object in PyTables. That makes sense because Table is probably the most used entity in PyTables. Other chunked objects in PyTables (like EArray or CArray) could be optimized with Blosc2 in the future too (although that would require providing a *multidimensional* second partition for Blosc2).

Last but not least, we would like to thank NumFOCUS and other PyTables donors for providing the funds required to implement Blosc2 support in PyTables. If you like what we are doing, and would like our effort to continue developing well, you can support our work by donating to PyTables project or Blosc project teams. Thank you!

User Defined Pipeline for Python-Blosc2

The Blosc Development Team is happy to announce that the latest version of Python-Blosc2 allows user-defined Python functions all throughout its compression pipeline: you can use Python for building prefilters, postfilters, filters, codecs for Blosc2 and explore all its capabilities. And if done correctly (by using e.g. NumPy, numba, numexpr...), most of the time you won't even need to translate those into C for speed.

The Blosc2 pipeline

The Blosc2 pipeline includes all the functions that are applied to the data until it is finally compressed (and decompressed back). As can be seen in the image below, during compression the first function that can be applied to the data is the prefilter (if any), then the filters pipeline (with a maximum of six filters) and, last but not least, the codec itself. For decompressing, the order will be the other way around: first the codec, then the filters pipeline and finally the postfilter (if any).

blosc2-pipeline

Defining prefilters and postfilters

A prefilter is a function that is applied to the SChunk each time you add data into it (e.g. when appending or updating). It is executed for each data block and receives three parameters: input, output and offset. For convenience, the input and output are presented as NumPy arrays; the former is a view of the input data and the later is an empty NumPy array that must be filled (this is actually what the first filter in the filters pipeline will receive). Regarding the offset, it is an integer which indicates where the corresponding block begins inside the SChunk container.

You can easily set a prefilter to a specific SChunk through a decorator. For example:

schunk = blosc2.SChunk()
@schunk.prefilter(np.int64, np.float64)
def pref(input, output, offset):
    output[:] = input - np.pi + offset

This decorator requires the data types for the input (original data) and output NumPy arrays, which must be of same itemsize.

If you do not want the prefilter to be applied anymore, you can always remove it:

schunk.remove_prefilter("pref")

As for the postfilters, they are applied at the end of the pipeline during decompression. A postfilter receives the same parameters as the prefilter and can be set in the same way:

@schunk.postfilter(np.float64, np.int64)
def postf(input, output, offset):
    output[:] = input + np.pi - offset

In this case, the input data is the one from the buffer returned by the filter pipeline, and the output data type should be the same as the original data (for a good round-trip).

You can also remove postfilters whenever you want:

schunk.remove_postfilter("postf")

Fillers

Before we move onto the next step in the pipeline, we need to introduce the fillers. A filler is similar to a prefilter but with a twist. It is used to fill an empty SChunk and you can pass to it any extra parameter you want, as long as it is a NumPy array, SChunk or Python Scalar. All these extra parameters will arrive to the filler function as a tuple containing just the corresponding block slice for each parameter (except for the scalars, that are passed untouched). To declare a filter, you will also need to pass the inputs along with its data type. For example:

@schunk.filler(((schunk0, dtype0), (ndarray1, dtype1), (py_scalar3, dtype2)), schunk_dtype)
def filler(inputs_tuple, output, offset):
    output[:] = inputs_tuple[0] - inputs_tuple[1] * inputs_tuple[2]

This will automatically append the data to the schunk, but applying the filler function first. After that the schunk would be completely filled, the filler will be de-registered, so the next time you update some data the it would not be executed; a filler is meant to build new SChunk objects from other containers.

User-defined filters and codecs

The main difference between prefilters/postfilters and their filters/codecs counterparts is that the former ones are meant to run for an specific SChunk instance, whereas the later can be locally registered and hence, used in any SChunk.

Let's start describing the user-defined filters. A filter is composed by two functions: one for the compression process (forward), and another one for the decompression process (backward). Such functions receive the input and output as NumPy arrays of type uint8 (bytes), plus the filter meta and the SChunk instance to which the data belongs to. The forward function will fill the output with the modified data from input. The backward will be responsible of reversing the changes done by forward so that the data returned at the end of the decompression should be the same as the one received at the beginning of the compression. Check the drawing below:

forwardbackward

So, once we have the pair of forward and backward functions defined, they can be registered locally associating to a filter ID between 160 and 255:

blosc2.register_filter(id, forward, backward)

Now, we can use the user-defined filter in any SChunk instance by choosing the new local ID in the filters pipeline:

schunk.cparams = {"filters": [id], "filters_meta": [meta]}

Regarding the user-defined codecs, they do not differ too much from its filter counterparts. The main difference is that, because their goal is to actually compress data, the corresponding functions (in this case encoder and decoder) will need to return the size in bytes of the compressed/decompressed data respectively. This time the scheme would be:

encoderdecoder

To register a codec, you name it, assign an ID to it and pass the user-defined functions:

blosc2.register_codec(codec_name, id, encoder, decoder)

And to use it you just use its ID in the cparams:

schunk.cparams = {"codec": id, "codec_meta": meta}

Final words

We have seen how easily you can define your own filters and codecs for the Blosc2 compression pipeline. They are very easy to use because they conveniently wrap input and output data as NumPy arrays. Now, you can start experimenting with different filter/compression algorithms straight from Python. You can even come with a library of such filters/codecs that can be used in all your data pipeline processing. Welcome to compression made easy!

See more examples in the repository.

Find the complete documentation at: https://www.blosc.org/python-blosc2/python-blosc2.html.

This work has been made thanks to a Small Development Grant from NumFOCUS. NumFOCUS is a non-profit organization supporting open code for better science. If you like this, consider giving a donation to them (and if you like our work, you can nominate it to our project too!). Thanks!

New features in Python-Blosc2

The Blosc Development Team is happy to announce that the latest version of Python-Blosc2 0.4. It comes with new and exciting features, like a handier way of setting, expanding and getting the data and metadata from a super-chunk (SChunk) instance. Contrarily to chunks, a super-chunk can update and resize the data that it containes, supports user metadata, and it does not have the 2 GB storage limitation.

Additionally, you can now convert now a SChunk into a contiguous, serialized buffer (aka cframe) and vice-versa; as a bonus, this serialization process also works with a NumPy array at a blazing speed.

Continue reading for knowing the new features a bit more in depth.

Retrieve data with __getitem__ and get_slice

The most general way to store data in Python-Blosc2 is through a SChunk (super-chunk) object. Here the data is split into chunks of the same size. So until now, the only way of working with it was chunk by chunk (see the basics tutorial).

With the new version, you can get general data slices with the handy __getitem__() method without having to mess with chunks manually. The only inconvenience is that this returns a bytes object, which is difficult to read by humans. To overcome this, we have also implemented the get_slice() method; it comes with two optional params: start and stop for selecting the slice you are interested in. Also, you can pass to out any Python object supporting the Buffer Protocol and it will be filled with the data slice. One common example is to pass a NumPy array in the out argument:

out_slice = numpy.empty(chunksize * nchunks, dtype=numpy.int32)
schunk.get_slice(out=out_slice)

We have now the out_slice NumPy array filled with the schunk data. Easy and effective.

Set data with __setitem__

Similarly, if we would like to set data, we had different ways of doing it, e.g. with the update_chunk() or the update_data() methods. But those work, again, chunk by chunk, which was a bummer. That's why we also implemented the convenient __setitem__() method. In a similar way to the get_slice() method, the value to be set can be any Python object supporting the Buffer Protocol. In addition, this method is very flexible because it not only can set the data of an already existing slice of the SChunk, but it also can expand (and update at the same time) it.

To do so, the stop param will set the new number of items in SChunk:

start = schunk_nelems - 123
stop = start + new_value.size   # new number of items
schunk[start:stop] = new_value

In the code above, the data between start and the SChunk current size will be updated and then, the data between the previous SChunk size and the new stop will be appended automatically for you. This is very handy indeed (note that the step parameter is not yet supported though).

Serialize SChunk from/to a contiguous compressed buffer

Super-chunks can be serialized in two slightly different format frames: contiguous and sparse. A contiguous frame (aka cframe) serializes the whole super-chunk data and metadata into a sequential buffer, whereas the sparse frame (aka sframe) uses a contiguous frame for metadata and the data is stored separately in so-called chunks. Here it is how they look like:

Contiguous and sparse frames

The contiguous and sparse formats come with its own pros and cons. A contiguous frame is ideal for transmitting / storing data as a whole buffer / file, while the sparse one is better suited to act as a store while a super-chunk is being built.

In this new version of Python-Blosc2 we have added a method to convert from a SChunk to a contiguous, serialized buffer:

buf = schunk.to_cframe()

as well as a function to build back a SChunk instance from that buffer:

schunk = schunk_from_cframe(buf)

This allows for a nice way to serialize / deserialize super-chunks for transmission / storage purposes. Also, and for performance reasons, and for reducing memory usage to a maximum, these functions avoid copies as much as possible. For example, the schunk_from_cframe function can build a SChunk instance without copying the data in cframe. Such a capability makes the use of cframes very desirable whenever you have to transmit and re-create data from one machine to another in a very efficient way.

Serializing NumPy arrays

Last but not least, you can also serialize NumPy arrays with the new pair of functions pack_array2() / unpack_array2(). Although you could already do this with the existing pack_array() / unpack_array() functions, the new ones are much faster and do not have the 2 GB size limitation. To prove this, let's see its performance by looking at some benchmark results obtained with an Intel box (i9-10940X CPU @ 3.30GHz, 14 cores) running Ubuntu 22.04.

In this benchmark we are comparing a plain NumPy array copy against compression/decompression through different compressors and functions (compress() / decompress(), pack_array() / unpack_array() and pack_array2() / unpack_array2()). The data distribution for the plots below is for 3 different data distributions: arange, linspace and random:

Compression ratio for different codecs

As can be seen, different codecs offer different compression ratios for the different distributions. Note in particular how linear distributions (arange for int64 and linspace for float64) can reach really high compression ratios (very low entropy).

Let's see the speed for compression / decompression; in order to not show too many info in this blog, we will show just the plots for the linspace linear distribution:

Compression speed for different codecsDecompression speed for different codecs

Here we can see that the pair pack_array2() / unpack_array2() is consistently (much) faster than their previous version pack_array() / unpack_array(). Despite that, the fastest is the compress() / decompress() pair; however this is not serializing all the properties of a NumPy array, and has the limitation of not being able to compress data larger than 2 GB.

You can test the speed in your box by running the pack_compress bench.

Also, if you would like to store the contiguous buffer on-disk, you can directly use the pair of functions save_array(), load_array().

Native performance on Apple M1 processors

Contrariliy to Blosc1, Blosc2 comes with native support for ARM processors (it leverages the NEON SIMD instruction set there), and that means that it runs very fast in this architecture. As an example, let's see how the new pack_array2() / unpack_array2() works in an Apple M1 laptop (Macbook Air).

Compression speed for different codecsDecompression speed for different codecs

As can be seen, running Blosc2 in native arm64 mode on M1 offers quite a bit more performance (specially during compression) than using the i386 emulation. If speed is important to you, and you have a M1/M2 processor, make sure that you are running Blosc2 in native mode (arm64).

Conclusions

The new features added to python-blosc2 offer an easy way of creating, getting, setting and expanding data by using a SChunk instance. Furthermore, you can get a contiguous compressed representation (aka cframe) of it and re-create it again latter. And you can do the same with NumPy arrays (either in-memory or on-disk) faster than with the former functions, and even faster than a plain memcpy().

For more info on how to use these useful new features, see this Jupyter notebook tutorial.

Finally, the complete documentation is at: https://www.blosc.org/python-blosc2/python-blosc2.html. Thanks to Marc Garcia (@datapythonista) for his fine work and enthusiasm in helping us in providing a better structure to the Blosc documentation!

This work has been possible thanks to a Small Development Grant from NumFOCUS. NumFOCUS is a non-profit organization supporting open code for better science. If you like the goal, consider giving a donation to NumFOCUS (you can optionally make it go to our project too, to which we would be very grateful indeed :-).

Announcing Support for Lossy ZFP Codec as a Plugin for C-Blosc2

Announcing Support for Lossy ZFP Codec as a Plugin for C-Blosc2

Blosc supports different filters and codecs for compressing data, like e.g. the lossless NDLZ codec and the NDCELL filter. These have been developed explicitly to be used in multidimensional datasets (via Caterva or ironArray Community Edition).

However, a lossy codec like ZFP allows for much better compression ratios at the expense of loosing some precision in floating point data. Moreover, while NDLZ is only available for 2-dim datasets, ZFP can be used up to 4-dim datasets.

How ZFP works?

ZFP partitions datasets into cells of 4^(number of dimensions) values, i.e., 4, 16, 64, or 256 values for 1D, 2D, 3D, and 4D arrays, respectively. Each cell is then (de)compressed independently, and the resulting bit strings are concatenated into a single stream of bits.

Furthermore, ZFP usually truncates each input value either to a fixed number of bits to meet a storage budget or to some variable length needed to meet a chosen error tolerance. For more info on how this works, see zfp overview docs.

ZFP implementation

Similarly to other registered Blosc2 official plugins, this codec is now available at the blosc2/plugins directory of the C-Blosc2 repository. However, as there are different modes for working with ZFP, there are several associated codec IDs (see later).

So, in order to use ZFP, users just have to choose the ID for the desired ZFP mode between the ones listed in blosc2/codecs-registry.h. For more info on how the plugin selection mechanism works, see https://www.blosc.org/posts/registering-plugins/.

ZFP modes

ZFP is a lossy codec, but it still lets the user to choose the degree of the data loss. There are different compression modes:

  • BLOSC_CODEC_ZFP_FIXED_ACCURACY: The user can choose the absolute error in truncation. For example, if the desired absolute error is 0.01, each value loss must be less than or equal to 0.01. With that, if 23.0567 is a value of the original input, after compressing and decompressing this input with error=0.01, the new value must be between 23.0467 and 23.0667.

  • BLOSC_CODEC_ZFP_FIXED_PRECISION: The user specifies the maximum number of bit planes encoded during compression (relative error). This is, for each input value, the number of most significant bits that will be encoded.

  • BLOSC_CODEC_ZFP_FIXED_RATE: The user chooses the size that the compressed cells must have based on the input cell size. For example, if the cell size is 2000 bytes and user chooses ratio=50, the output cell size will be 50% of 2000 = 1000 bytes.

For more info, see: https://github.com/Blosc/c-blosc2/blob/main/plugins/codecs/zfp/README.md

Benchmark: ZFP FIXED-ACCURACY VS FIXED_PRECISION VS FIXED-RATE modes

The dataset used in this benchmark is called precipitation_amount_1hour_Accumulation.zarr and has been fetched from ERA5 database, which provides hourly estimates of a large number of atmospheric, land and oceanic climate variables.

Specifically, the downloaded dataset in Caterva format has this parameters:

  • ndim = 3

  • type = float32

  • shape = [720, 721, 1440]

  • chunkshape = [128, 128, 256]

  • blockshape = [16, 32, 64]

The next plots represent the compression results obtained by using the different ZFP modes to compress the already mentioned dataset.

Note: It is important to remark that this is a specific dataset and the codec may perform differently for other ones.

/images/zfp-plugin/ratio_zfp.png/images/zfp-plugin/times_zfp.png

Below the bars it is annotated what parameter is used for each test. For example, for the first column, the different compression modes are setup like this:

  • FIXED-ACCURACY: for each input value, the absolute error is 10^(-6) = 0.000001.

  • FIXED-PRECISION: for each input value, only the 20 most significant bits for the mantissa will be encoded.

  • FIXED-RATE: the size of the output cells is 30% of the input cell size.

Although the FIXED-PRECISION mode does not obtain great results, we see that with the FIXED-ACCURACY mode we do get better performance as the absolute error increases. Similarly, we can see how the FIXED-RATE mode gets the requested ratios, which is cool but, in exchange, the amount of data loss is unknown.

Also, while FIXED-ACCURACY and FIXED-RATE modes consume similar times, the FIXED-PRECISION mode, which seems to have less data loss, also takes longer to compress. Generally speaking we can see how, the more data loss (more data truncation) achieved by a mode, the faster it operates.

"Third partition"

One of the most appealing features of Caterva besides supporting multi-dimensionality is its implementation of a second partition, making slicing more efficient. As one of the distinctive characteristics of ZFP is that it compresses data in independent (and small) cells, we have been toying with the idea of implementing a third partition so that slicing of thin selections or just single-point selection can be made faster.

So, as part of the current ZFP implementation, we have combined the Caterva/Blosc2 partitioning (chunking and blocking) with the independent cell handling of ZFP, allowing to extract single cells within the ZFP streams (blocks in Blosc jargon). Due to the properties and limitations of the different ZFP compression modes, we have been able to implement a sort of "third partition" just for the FIXED-RATE mode when used together with the blosc2_getitem_ctx() function.

Such a combination of the existing partitioning and single cell extraction is useful for selecting more narrowly the data to extract, saving time and memory. As an example, below you can see a comparison of the mean times that it takes to retrieve a bunch of single elements out of different multidimensional arrays from the ERA5 dataset (see above). Here we have used Blosc2 with a regular LZ4 codec compared against the FIXED-RATE mode of the new ZFP codec:

/images/zfp-plugin/zfp_fixed_rate.png

As you can see, using the ZFP codec in FIXED-RATE mode allows for a good improvement in speed (up to more than 2x) for retrieving single elements (or, in general an amount not exceeding the cell size) in comparison with the existing codecs (even the fastest ones, like LZ4) inside Blosc2. As the performance improvement is of the same order than random access time of modern SSDs, we anticipate that this could be a major win in scenarios where random access is important.

If you are curious on how this new functionality performs for your own datasets and computer, you can use/adapt our benchmark code.

Conclusions

The integration of ZFP as a codec plugin will greatly enhance the capabilities of lossy compression inside C-Blosc2. The current ZFP plugin supports different modes; if users want to specify data loss during compression, it is recommended to use the FIXED-ACCURACY or FIXED-PRECISION modes (and most specially the former because of its better compression performance).

However, if the priority is to get specific compression ratios without paying too much attention to the amount of data loss, one should use the FIXED-RATE mode, which let choose the desired compression ratio. This mode also has the advantage that a "third partition" can be used for improving random elements access speed.

This work has been done thanks to a Small Development Grant from the NumFOCUS Foundation, to whom we are very grateful indeed. NumFOCUS is doing a excellent job in sponsoring scientific projects and you can donate to the Blosc project (or many others under the NumFOCUS umbrella) via its donation page.

Caterva Slicing Performance: A Study

/images/cat_slicing/caterva.png

Caterva is a C library for handling multi-dimensional, chunked, compressed datasets in an easy and fast way. It is build on top of the C-Blosc2 library, leveraging all its avantages on modern CPUs.

Caterva can be used in a lot of different situations; however, where it really stands out is for extracting multidimensional slices of compressed datasets because, thanks to the double partitioning schema that it implements, the amount of data that has to be decompressed so as to get the slice is minimized, making data extraction faster (usually). In this installment, you will be offered a rational on how double partitioning works, together with some examples where it shines, and others where it is not that good.

Double partitioning

/images/cat_slicing/cat_vs_zarr,hdf5.png

Some libraries like HDF5 or Zarr store data into multidimensional chunks. This makes slice extraction from compressed datasets more efficient than using monolithic compression, since only the chunks containing the interesting slice are decompressed instead of the entire array.

In addition, Caterva introduces a new level of partitioning. Within each chunk, the data is re-partitioned into smaller multidimensional sets called blocks. This generally improves the slice extraction, since this allows to decompress only the blocks containing the data in desired slice instead of the whole chunks.

Slice extraction with Caterva, HDF5 and Zarr

So as to see how the double partitioning performs with respect to a traditional single partition schema, we are going to compare the ability to extract multidimensional slices from compressed data of Caterva, HDF5 and Zarr. The examples below consist on extracting some hyper-planes from chunked arrays with different properties and seeing how Caterva performs compared with traditional libraries.

Note: So as to better compare apples with apples, all the benchmarks below have been run using Blosc (with LZ4 as the internal codec) as the compressor by default, with the shuffle filter. Even if Caterva uses the newest C-Blosc2 compressor, and HDF5 and Zarr uses its C-Blosc(1) antecessor, the performance of both libraries are very similar. Also, for easier interactivity, we have used the libraries via Python wrappers (python-caterva, h5py, Zarr).

2-dimensional array

This is a 2-dimensional array and has the following properties, designed to optimize slice extraction from the second dimension:

shape = (8_000, 8_000)
chunkshape = (4_000, 100)
blockshape = (500, 25)

Here we can see that the ratio between chunkshape and blockshape is 8x in dimension 0 and 4x in dimension 1.

/images/cat_slicing/dim0.png/images/cat_slicing/dim1.png

Now we are going to extract some planes from the chunked arrays and will plot the performance. For dimension 0 we extract a hyperplane [i, :], and for dimension 1, [:, i], where i is a random integer.

/images/cat_slicing/2dim.png

Here we see that the slicing times are similar in the dimension 1. However, Caterva performs better in the dimension 0. This is because with double partitioning you only have to decompress the blocks containing the slice instead of the whole chunk.

In fact, Caterva is around 12x faster than HDF5 and 9x faster than Zarr for slicing the dimension 0, which makes sense since Caterva decompresses 8x less data. For the dimension 1, Caterva is approximately 3x faster than HDF5 and Zarr; in this case Caterva has to decompress 4x less data.

That is, the difference in slice extraction speed depends largely on the ratio between the chunk size and the block size. Therefore, for slices where the chunks that contain the slice also have many items that do not belong to it, the existence of blocks (i.e. the second partition) allows to significantly reduce the amount of data to decompress.

Overhead of the second partition

So as to better assess the possible performance cost of the second partition, let's analyze a new case of a 3-dimensional array with the following parameters:

shape = (800, 600, 300)
chunkshape = (200, 100, 80)
blockshape = (20, 100, 10)

So, in the dimensions 0 and 2 the difference between shape and chunkshape is not too big whereas the difference between chunkshape and blockshape is remarkable.

However, for the dimension 1, there is not a difference at all between chunkshape and blockshape. This means that in dim 1 the Caterva machinery will make extra work because of the double partitioning, but it will not get any advantage of it since the block size is going to be equal to the chunk size. This a perfect scenario for measuring the overhead of the second partition.

The slices to extract will be [i, :, :], [:, i, :] or [:, :, i]. Let's see the execution times for slicing these planes:

/images/cat_slicing/3dim.png

As we can see, the performance in dim 1 is around the same order than HDF5 and Zarr (Zarr being a bit faster actually), but difference is not large, so that means that the overhead introduced purely by the second partition is not that important. However, in the other dimensions Caterva still outperforms (by far) Zarr and HDF5. This is because the two level partitioning works as intended here.

A last hyper-slicing example

Let's see a final example showing the double partitioning working on a wide range of dimensions. In this case we choose a 4-dimensional array with the following parameters:

shape = (400, 80, 100, 50)
chunkshape = (100, 40, 10, 50)
blockshape = (30, 5, 2, 10)

Here the last dimension (3) is not optimized for getting hyper-slices, specially in containers with just single partitioning (Zarr and HDF5). However, Caterva should still perform well in this situation because of the double partitioning.

The slices we are going to extract will be [i, :, :, :], [:, i, :, :], [:, :, i, :] or [:, :, :, i]. Let's see the execution times for slicing these hyperplanes:

/images/cat_slicing/4dim.png

As we can see, in this case Caterva outperforms Zarr and HDF5 in all dimensions. However, the advantage is not that important for the last dimension. The reason is that in this last dimension Caterva has a noticeably lower ratio between its shape and blockshape than in the other dimensions.

Final thoughts

We have seen that adding a second partition is beneficial for improving slicing performance in general. Of course, there are some situations where the overhead of the second partition can be noticeable, but the good news is that such an overhead does not get too large when compared with containers with only one level of partitioning.

Finally, we can conclude that Caterva usually obtains better results due to its second partitioning, but when it shines the most is when the two levels of partitioning are well balanced among them and also with respect to the shape of the container.

As always, there is no replacement for experimentation so, in case you want to try Caterva by yourself (and you should if you really care about this problem), you can use our Caterva poster; it is based on a Jupyter notebook that you can adapt to your own scenarios.