Compressing data with the SChunk class

Although the NDArray class is the most widely used container for data in Blosc2, it (and many other containers like C2Array, ProxySource, etc.) is built on top of the SChunk class. The machinery of SChunk (from “super-chunk”) is what makes it possible to easily and quickly create, append, insert, update and delete data and metadata in the containers that inherit from it. Hence, it is worthwhile to learn how to use it directly. See this quick overview of the SChunk class in the Python-Blosc2 documentation.

[149]:
import numpy as np

import blosc2

Create a new SChunk instance

One can initialize an SChunk instance with default parameters. If no data is provided, the space assigned to the chunked data will also be empty (since one can always extend and resize a super-chunk, this is not a problem). However, let’s specify the parameters so they differ from the defaults: we’ll set chunksize (the size of each chunk in bytes), the cparams (compression parameters), the dparams (decompression parameters) and pass a Storage instance, which is used to persist the data on-disk.

[150]:
cparams = blosc2.CParams(
    codec=blosc2.Codec.BLOSCLZ,
    typesize=4,
    nthreads=8,
)

dparams = blosc2.DParams(
    nthreads=16,
)

storage = blosc2.Storage(
    contiguous=True,
    urlpath="myfile.b2frame",
    mode="w",  # create a new file
)

schunk = blosc2.SChunk(chunksize=10_000_000, cparams=cparams, dparams=dparams, storage=storage)
schunk
[150]:
<blosc2.schunk.SChunk at 0x7384047ebb60>

Great! So you have created your first super-chunk, persistent on-disk, with the desired compression codec and chunksize. We can now fill it with data, read it, update it, insert new chunks, etc.

Append and read data

We are going to add some data. First, let’s create the dataset, composed of 100 chunks of 2.5 million 4-byte (int32) integers each (i.e. an uncompressed size of 10 MB, the chunksize we specified above):

[151]:
buffer = [i * np.arange(2_500_000, dtype="int32") for i in range(100)]

Now we append the data for each chunk to the super-chunk, which automatically extends the container to accommodate the new data, as we can verify by checking the number of chunks in the super-chunk after each append operation:

[152]:
for i in range(100):
    nchunks = schunk.append_data(buffer[i])
    assert nchunks == (i + 1)
!ls -lh myfile.b2frame
-rw-r--r-- 1 lshaw lshaw 82M Jul 25 16:38 myfile.b2frame

So, while we have added 100 chunks of 10 MB (uncompressed) each, the data size of the frame on-disk is quite a bit less. This is how compression helps you use fewer resources.
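
You can also check the achieved compression programmatically instead of inspecting the file size. A minimal sketch, assuming the SChunk instance exposes the nbytes, cbytes and cratio attributes (as in recent python-blosc2 releases):

# Uncompressed vs. compressed sizes of the super-chunk
# (attribute names assumed from recent python-blosc2 releases)
print(f"Uncompressed size: {schunk.nbytes / 2**20:.1f} MB")
print(f"Compressed size:   {schunk.cbytes / 2**20:.1f} MB")
print(f"Compression ratio: {schunk.cratio:.2f}x")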

In order to read the chunks from the on-disk SChunk we need to initialize a buffer and then use the decompress_chunk method, which will decompress the data into the provided buffer. The first argument is the chunk number to decompress, and the second one is the destination buffer where the decompressed data will be stored. After the loop, dest should contain the final chunk we added, which was 99 * np.arange(2_500_000, dtype="int32"):

[153]:
dest = np.empty(2_500_000, dtype="int32")
for i in range(100):
    chunk = schunk.decompress_chunk(i, dest)
# The final decompressed chunk should equal checker
checker = 99 * np.arange(2_500_000, dtype="int32")
np.testing.assert_equal(dest, checker)
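
Instead of decompressing whole chunks, you can also fetch an arbitrary range of items, even across chunk boundaries. A minimal sketch, assuming the get_slice method of SChunk, which takes item (not byte) start/stop positions and an optional output buffer:

# Read the last item of chunk 0 and the first item of chunk 1
# (get_slice is assumed to take item positions and fill the out buffer)
out = np.empty(2, dtype="int32")
schunk.get_slice(start=2_499_999, stop=2_500_001, out=out)
print(out)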

Updating and inserting

We can update the first chunk with some new data. Unlike for the append operation, we must first compress the data into a Blosc2-compatible form and then update the desired chunk in-place:

[154]:
data_up = np.arange(2_500_000, dtype="int32")
chunk = blosc2.compress2(data_up)
schunk.update_chunk(nchunk=0, chunk=chunk)
[154]:
100

The function then returns the number of chunks in the SChunk, which is the same as before, since we have overwritten the old chunk data at chunk position 0. On the other hand, if we insert a chunk at position 4 we increase the indices of the following chunks, so the number of chunks in the SChunk will increase by one:

[155]:
%%time
schunk.insert_chunk(nchunk=4, chunk=chunk)
CPU times: user 506 μs, sys: 194 μs, total: 700 μs
Wall time: 705 μs
[155]:
101

In this case the return value is the new number of chunks in the super-chunk. This is a rapid operation since the chunks are not stored contiguously and so incrementing their index is just a matter of updating the metadata, not moving any data around.
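
Chunks can also be removed from a super-chunk. A minimal sketch, assuming the delete_chunk method (the counterpart of insert_chunk), which drops the chunk at the given position and shifts the indices of the following chunks down:

# Remove the chunk we just inserted at position 4; the return value is
# assumed to be the new number of chunks (back to 100)
nchunks = schunk.delete_chunk(4)
print(nchunks)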

Metalayers and variable length metalayers

Upon creation of the SChunk, one may pass compression/decompression and storage parameters to the constructor as we have seen, which may be accessed (although not in general modified) as attributes of the instance. In addition, one may add metalayers which contain custom metadata summarising the container-stored data. There are two kinds of metalayers, both of which use a dictionary-like interface. The first one, meta, must be added at construction time; it cannot be deleted and can only be updated with values that have the same byte size as the old value. Its entries are easy to access and edit by users:

[156]:
schunk = blosc2.SChunk(meta={"meta1": 234})
print(f"Meta keys: {schunk.meta.keys()}")
print(f"meta1 before modification: {schunk.meta['meta1']}")
schunk.meta["meta1"] = 235
print(f"meta1 after modification: {schunk.meta['meta1']}")
Meta keys: ['meta1']
meta1 before modification: 234
meta1 after modification: 235

A second type of metalayer, vlmeta, offers more flexibility. vlmeta stands for “variable length metadata” and, as the name suggests, is designed to store general, variable length data. You can add arbitrary entries to vlmeta after the creation of the SChunk, update entries with values of a different byte size, or indeed delete them. vlmeta follows the dictionary interface, so one may add entries to it like this:

[157]:
schunk.vlmeta["info1"] = "This is an example"
schunk.vlmeta["info2"] = "of user meta handling"
schunk.vlmeta.getall()
[157]:
{b'info1': 'This is an example', b'info2': 'of user meta handling'}

The entries may also be modified with larger values than the original ones:

[158]:
schunk.vlmeta["info1"] = "This is a larger example"
schunk.vlmeta.getall()
[158]:
{b'info1': 'This is a larger example', b'info2': 'of user meta handling'}

Finally, one may delete some of the entries:

[159]:
del schunk.vlmeta["info1"]
schunk.vlmeta.getall()
[159]:
{b'info2': 'of user meta handling'}

Using metalayers with NDArray

Naturally, any object which inherits from SChunk also supports both flavours of metalayer. Consequently, one may add such metalayers to NDArray objects, which are the most commonly used containers in Blosc2. Hence we may add meta at construction time, in the following way:

[160]:
meta = {"dtype": "i8", "coords": [5.14, 23.0]}
array = blosc2.zeros((1000, 1000), dtype=np.int16, chunks=(100, 100), blocks=(50, 50), meta=meta)
print(array.meta)
print(array.meta.keys())
{'b2nd': [0, 2, [1000, 1000], [100, 100], [50, 50], 0, '<i2'], 'dtype': 'i8', 'coords': [5.14, 23.0]}
['b2nd', 'dtype', 'coords']

As you can see, Blosc2 internally adds a 'b2nd' entry to meta (which by default is empty for a vanilla SChunk) to store shapes, ndim, dtype, etc., and retrieves this data when needed. We can access the user meta that we added like so:

[162]:
array.meta["coords"]
[162]:
[5.14, 23.0]

If adding a metalayer after creation, one must use the vlmeta attribute of the underlying SChunk, which also works like a dictionary:

[163]:
print(array.vlmeta[:])
array.vlmeta["info1"] = "This is an example"
array.vlmeta["info2"] = "of user meta handling"
array.vlmeta[:]  # this returns all the metadata as a dictionary
{}
[163]:
{b'info1': 'This is an example', b'info2': 'of user meta handling'}

You can update them with a value larger than the original one:

[164]:
array.vlmeta["info1"] = "This is a larger example"
array.vlmeta
[164]:
{b'info1': 'This is a larger example', b'info2': 'of user meta handling'}

Indeed you can store any kind of data in the vlmeta metalayer, as long as it is serializable with msgpack. This is a very flexible way to store metadata in a Blosc2 container.

[165]:
array.vlmeta["info3"] = {"a": 1, "b": 2}
array.vlmeta
[165]:
{b'info1': 'This is a larger example', b'info2': 'of user meta handling', b'info3': {'a': 1, 'b': 2}}

Variable length metadata can be deleted:

[166]:
del array.vlmeta["info1"]
array.vlmeta
[166]:
{b'info2': 'of user meta handling', b'info3': {'a': 1, 'b': 2}}

This is very useful to store metadata that is not known at the time of creation of the container, or that can be updated or deleted at any time.
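
Note that, for on-disk containers, both meta and vlmeta travel with the data. A minimal sketch, assuming an NDArray persisted via urlpath and re-opened with blosc2.open (the file name and entry name are made up for illustration):

# Persist an array on disk, attach variable-length metadata, and re-open it
arr = blosc2.zeros((100, 100), dtype=np.int16, urlpath="meta_example.b2nd", mode="w")
arr.vlmeta["description"] = "toy array with persistent metadata"

arr2 = blosc2.open("meta_example.b2nd")
print(arr2.vlmeta["description"])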

Conclusion

That’s all for now. There are more examples in the examples directory of the git repository for you to explore. Enjoy!