Compressing data with the SChunk class

Although the NDArray class is the most widely used container for data in Blosc2, it (and many other containers like C2Array, ProxySource, etc.) is built on top of the SChunk class. The machinery of SChunk (from “super-chunk”) is what makes it possible to easily and quickly create, append, insert, update and delete data and metadata in the containers that inherit from it. Hence, it is worthwhile to learn how to use it directly. See this quick overview of the SChunk class in the Python-Blosc2 documentation.

[149]:
import numpy as np

import blosc2

Create a new SChunk instance

One can initialize an SChunk instance with default parameters. If no data is provided, the space assigned to the chunked data will also be empty (since one can always extend and resize a super-chunk, this is not a problem). However, let’s specify the parameters so they differ from the defaults: we’ll set chunksize (the size of each chunk in bytes), the cparams (compression parameters), the dparams (decompression parameters) and pass a Storage instance, which is used to persist the data on-disk.

[150]:
cparams = blosc2.CParams(
    codec=blosc2.Codec.BLOSCLZ,
    typesize=4,
    nthreads=8,
)

dparams = blosc2.DParams(
    nthreads=16,
)

storage = blosc2.Storage(
    contiguous=True,
    urlpath="myfile.b2frame",
    mode="w",  # create a new file
)

schunk = blosc2.SChunk(chunksize=10_000_000, cparams=cparams, dparams=dparams, storage=storage)
schunk
[150]:
<blosc2.schunk.SChunk at 0x7384047ebb60>

Great! So you have created your first super-chunk, persistent on-disk, with the desired compression codec and chunksize. We can now fill it with data, read it, update it, insert new chunks, etc.

Append and read data

We are going to add some data. First, let’s create the dataset, composed of 100 chunks of 2.5 million 4-byte (int32) integers each (i.e. an uncompressed size of 10 MB, the chunksize we specified above):

[151]:
buffer = [i * np.arange(2_500_000, dtype="int32") for i in range(100)]

Now we append the data for each chunk to the super-chunk, which automatically extends the container to accommodate the new data, as we can verify by checking the number of chunks in the super-chunk after each append operation:

[152]:
for i in range(100):
    nchunks = schunk.append_data(buffer[i])
    assert nchunks == (i + 1)
!ls -lh myfile.b2frame
-rw-r--r-- 1 lshaw lshaw 82M Jul 25 16:38 myfile.b2frame

So, while we have added 100 chunks of 10 MB (uncompressed) each, the data size of the frame on-disk is quite a bit less. This is how compression helps you use fewer resources.
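
You can also check the achieved compression programmatically instead of inspecting the file size. A minimal sketch, assuming the SChunk instance exposes the nbytes, cbytes and cratio attributes (as in recent python-blosc2 releases):

# Uncompressed vs. compressed sizes of the super-chunk
# (attribute names assumed from recent python-blosc2 releases)
print(f"Uncompressed size: {schunk.nbytes / 2**20:.1f} MB")
print(f"Compressed size:   {schunk.cbytes / 2**20:.1f} MB")
print(f"Compression ratio: {schunk.cratio:.2f}x")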

In order to read the chunks from the on-disk SChunk we need to initialize a buffer and then use the decompress_chunk method, which will decompress the data into the provided buffer. The first argument is the chunk number to decompress, and the second one is the destination buffer where the decompressed data will be stored. After the loop, dest should contain the final chunk we added, which was 99 * np.arange(2_500_000, dtype="int32"):

[153]:
dest = np.empty(2_500_000, dtype="int32")
for i in range(100):
    chunk = schunk.decompress_chunk(i, dest)
# The final decompressed chunk should equal checker
checker = 99 * np.arange(2_500_000, dtype="int32")
np.testing.assert_equal(dest, checker)
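
Instead of decompressing whole chunks, you can also fetch an arbitrary range of items, even across chunk boundaries. A minimal sketch, assuming the get_slice method of SChunk, which takes item (not byte) start/stop positions and an optional output buffer:

# Read the last item of chunk 0 and the first item of chunk 1
# (get_slice is assumed to take item positions and fill the out buffer)
out = np.empty(2, dtype="int32")
schunk.get_slice(start=2_499_999, stop=2_500_001, out=out)
print(out)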

Updating and inserting

We can update the first chunk with some new data. Unlike for the append operation, we must first compress the data into a Blosc2-compatible form and then update the desired chunk in-place:

[154]:
data_up = np.arange(2_500_000, dtype="int32")
chunk = blosc2.compress2(data_up)
schunk.update_chunk(nchunk=0, chunk=chunk)
[154]:
100

The function then returns the number of chunks in the SChunk, which is the same as before, since we have overwritten the old chunk data at chunk position 0. On the other hand, if we insert a chunk at position 4 we increase the indices of the following chunks, so the number of chunks in the SChunk will increase by one:

[155]:
%%time
schunk.insert_chunk(nchunk=4, chunk=chunk)
CPU times: user 506 μs, sys: 194 μs, total: 700 μs
Wall time: 705 μs
[155]:
101

In this case the return value is the new number of chunks in the super-chunk. This is a rapid operation since the chunks are not stored contiguously and so incrementing their index is just a matter of updating the metadata, not moving any data around.
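
Chunks can also be removed from a super-chunk. A minimal sketch, assuming the delete_chunk method (the counterpart of insert_chunk), which drops the chunk at the given position and shifts the indices of the following chunks down:

# Remove the chunk we just inserted at position 4; the return value is
# assumed to be the new number of chunks (back to 100)
nchunks = schunk.delete_chunk(4)
print(nchunks)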

Metalayers and variable length metalayers

Upon creation of the SChunk, one may pass compression/decompression and storage parameters to the constructor as we have seen, which may be accessed (although not in general modified) as attributes of the instance. In addition, one may add metalayers which contain custom metadata summarising the container-stored data. There are two kinds of metalayers, both of which use a dictionary-like interface. The first one, meta, must be added at construction time; it cannot be deleted and can only be updated with values that have the same byte size as the old value. Its entries are easy to access and edit by users:

[156]:
schunk = blosc2.SChunk(meta={"meta1": 234})
print(f"Meta keys: {schunk.meta.keys()}")
print(f"meta1 before modification: {schunk.meta['meta1']}")
schunk.meta["meta1"] = 235
print(f"meta1 after modification: {schunk.meta['meta1']}")
Meta keys: ['meta1']
meta1 before modification: 234
meta1 after modification: 235

A second type of metalayer, vlmeta, offers more flexibility. vlmeta stands for “variable length metadata” and, as the name suggests, is designed to store general, variable length data. You can add arbitrary entries to vlmeta after the creation of the SChunk, update entries with values of a different byte size, or indeed delete them. vlmeta follows the dictionary interface, so one may add entries to it like this:

[157]:
schunk.vlmeta["info1"] = "This is an example"
schunk.vlmeta["info2"] = "of user meta handling"
schunk.vlmeta.getall()
[157]:
{b'info1': 'This is an example', b'info2': 'of user meta handling'}

The entries may also be modified with larger values than the original ones:

[158]:
schunk.vlmeta["info1"] = "This is a larger example"
schunk.vlmeta.getall()
[158]:
{b'info1': 'This is a larger example', b'info2': 'of user meta handling'}

Finally, one may delete some of the entries:

[159]:
del schunk.vlmeta["info1"]
schunk.vlmeta.getall()
[159]:
{b'info2': 'of user meta handling'}

Using metalayers with NDArray

Naturally, any object which inherits from SChunk also supports both flavours of metalayer. Consequently, one may add such metalayers to NDArray objects, which are the most commonly used containers in Blosc2. Hence we may add meta at construction time, in the following way:

[160]:
meta = {"dtype": "i8", "coords": [5.14, 23.0]}
array = blosc2.zeros((1000, 1000), dtype=np.int16, chunks=(100, 100), blocks=(50, 50), meta=meta)
print(array.meta)
print(array.meta.keys())
{'b2nd': [0, 2, [1000, 1000], [100, 100], [50, 50], 0, '<i2'], 'dtype': 'i8', 'coords': [5.14, 23.0]}
['b2nd', 'dtype', 'coords']

As you can see, Blosc2 internally adds a 'b2nd' entry to meta (which by default is empty for a vanilla SChunk) to store shapes, ndim, dtype, etc., and retrieves this data when needed. We can access the user meta that we added like so:

[162]:
array.meta["coords"]
[162]:
[5.14, 23.0]

If adding a metalayer after creation, one must use the vlmeta attribute of the underlying SChunk, which also works like a dictionary:

[163]:
print(array.vlmeta[:])
array.vlmeta["info1"] = "This is an example"
array.vlmeta["info2"] = "of user meta handling"
array.vlmeta[:]  # this returns all the metadata as a dictionary
{}
[163]:
{b'info1': 'This is an example', b'info2': 'of user meta handling'}

You can update them with a value larger than the original one:

[164]:
array.vlmeta["info1"] = "This is a larger example"
array.vlmeta
[164]:
{b'info1': 'This is a larger example', b'info2': 'of user meta handling'}

Indeed you can store any kind of data in the vlmeta metalayer, as long as it is serializable with msgpack. This is a very flexible way to store metadata in a Blosc2 container.

[165]:
array.vlmeta["info3"] = {"a": 1, "b": 2}
array.vlmeta
[165]:
{b'info1': 'This is a larger example', b'info2': 'of user meta handling', b'info3': {'a': 1, 'b': 2}}

Variable length metadata can be deleted:

[166]:
del array.vlmeta["info1"]
array.vlmeta
[166]:
{b'info2': 'of user meta handling', b'info3': {'a': 1, 'b': 2}}

This is very useful to store metadata that is not known at the time of creation of the container, or that can be updated or deleted at any time.
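
Note that, for on-disk containers, both meta and vlmeta travel with the data. A minimal sketch, assuming an NDArray persisted via urlpath and re-opened with blosc2.open (the file name and entry name are made up for illustration):

# Persist an array on disk, attach variable-length metadata, and re-open it
arr = blosc2.zeros((100, 100), dtype=np.int16, urlpath="meta_example.b2nd", mode="w")
arr.vlmeta["description"] = "toy array with persistent metadata"

arr2 = blosc2.open("meta_example.b2nd")
print(arr2.vlmeta["description"])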

Conclusion

That’s all for now. There are more examples in the examples directory of the git repository for you to explore. Enjoy!