Slicing, extending and serializing with SChunks¶
The usual way to store generic binary data in python-blosc2 is through a SChunk (super-chunk) object, where the data is split into chunks of the same size, as we studied in the last tutorial. There we saw how to retrieve, update or append data in the form of whole chunks. In fact, one can also work with the individual multi-byte items composing the data (and not with the bytes directly), using native SChunk methods; such operations are the subject of this tutorial. We will use NumPy arrays as data sources, but everything we’re going to do would work equally well with any Python object supporting the Buffer Protocol.
First, we create our own SChunk instance; this time, let’s fill it with data upon creation.
[1]:
import numpy as np
import blosc2

nchunks = 10
data = np.arange(200 * 1000 * nchunks, dtype=np.int32)  # 2,000,000 int32 items
cparams = blosc2.CParams(typesize=4)  # 4 bytes per item, matching int32
schunk = blosc2.SChunk(chunksize=200 * 1000 * 4, data=data, cparams=cparams)  # chunksize is in bytes
It is important to set the typesize correctly, as the methods we are going to use work with items (of size typesize) and not with individual bytes.
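To make this concrete, each chunk of our SChunk holds chunksize bytes, i.e. chunksize // typesize items. A quick check (assuming, as in current python-blosc2, that the SChunk exposes typesize and chunksize attributes):

print(schunk.typesize)  # 4 bytes per item (int32)
print(schunk.chunksize // schunk.typesize)  # 200000 items per chunk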
Getting data from a SChunk¶
Let’s begin by retrieving the data from the whole SChunk. We could use the decompress_chunk method, decompressing chunk by chunk into a buffer, as we did in the previous tutorial:
[2]:
out = np.empty(200 * 1000 * nchunks, dtype=np.int32)
for i in range(nchunks):
    # Decompress chunk i directly into its slot of the output buffer
    schunk.decompress_chunk(i, out[200 * 1000 * i : 200 * 1000 * (i + 1)])
However, instead of the code above, we can simply use the __getitem__ or get_slice methods, without even needing to initialise an empty buffer. Let’s begin with __getitem__:
[3]:
out_slice = schunk[:]
type(out_slice)
[3]:
bytes
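As you can see, the data is returned as a bytes object, holding the raw items back to back. One way to give it shape again is to reinterpret it with NumPy (a small sketch, using the standard np.frombuffer):

# Reinterpret the raw bytes as int32 items, without copying them by hand
arr_view = np.frombuffer(out_slice, dtype=np.int32)
print(arr_view[:4])  # expected: [0 1 2 3]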
If we want to get a more meaningful container directly, we can use get_slice instead. This method takes an initialised buffer into which to load the bytes: one may pass any Python object (supporting the Buffer Protocol) as the out param, and it will be filled with the data. In this case we will use a NumPy array container.
[4]:
out_slice = np.empty(200 * 1000 * nchunks, dtype=np.int32)
schunk.get_slice(out=out_slice)
np.array_equal(out, out_slice)
print(out_slice[:4])
[0 1 2 3]
That’s the expected data indeed!
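get_slice also accepts start and stop arguments, so we do not have to fetch the whole SChunk. A minimal sketch, assuming the documented get_slice(start, stop, out) signature, with indices again expressed in items:

# Fetch items 100..103 only
partial = np.empty(4, dtype=np.int32)
schunk.get_slice(100, 104, out=partial)
print(partial)  # expected: [100 101 102 103]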
Setting (and enlarging) data in a SChunk¶
We can also directly set an arbitrary slice of data of a SChunk (without having to define a whole chunk and use update_chunk as we saw previously). For this we use the __setitem__ method of the SChunk, assigning to it from some source, which may be any Python object supporting the Buffer Protocol. Let’s see a quick example:
[5]:
start = 34
stop = 1000 * 200 * 4
new_value = np.ones(stop - start, dtype=np.int32)
schunk[start:stop] = new_value
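We can double-check that the slice was written by reading a few items back (reusing the np.frombuffer trick from above):

# The first items of the written range should now be ones
print(np.frombuffer(schunk[start : start + 4], dtype=np.int32))  # expected: [1 1 1 1]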
In fact, __setitem__ allows you to set a slice of the SChunk which extends past the existing data boundaries, using essentially the same syntax:
[6]:
schunk_nelems = 1000 * 200 * nchunks
new_value = np.zeros(1000 * 200 * 2 + 53, dtype=np.int32)
start = schunk_nelems - 123
new_nitems = start + new_value.size
print(f"Original nchunks: {schunk.nchunks}")
schunk[start:new_nitems] = new_value
print(f"New nchunks: {schunk.nchunks}")
Original nchunks: 10
New nchunks: 12
Here, start is less than the number of elements in the SChunk and new_nitems is larger; __setitem__ can update and append data at the same time, and you don’t have to worry about whether you are exceeding the limits of the SChunk. Internally, the necessary chunks are added to accommodate the new data, and we can check that the number of chunks has indeed increased.
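As a further check, the total number of items should now match new_nitems. A quick sketch, assuming the SChunk reports its uncompressed size via the nbytes attribute (as current python-blosc2 does):

# Uncompressed bytes divided by the item size gives the item count
print(schunk.nbytes // schunk.typesize == new_nitems)  # expected: True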
Building a SChunk from/as a contiguous buffer¶
Recall that SChunks generally store data in a non-contiguous (sparse) manner. Certain operations (e.g. data transfer) are faster if the data is stored contiguously, so one may want to convert the SChunk to a contiguous, serialized buffer (aka cframe). The specification of a cframe (a contiguous compressed representation) can be seen here. Converting to a cframe is as simple as calling the to_cframe method of the SChunk:
[7]:
buf = schunk.to_cframe()
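The result is a plain bytes object holding the whole serialized super-chunk, so it can be written to a file or sent over a socket as-is:

# The cframe is just bytes: easy to persist or transmit
print(type(buf), len(buf))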
Likewise, since the SChunk format is useful for e.g. extending the data in an efficient way, one may wish to convert a contiguous buffer (e.g. a cframe) back into a SChunk. This is also very easy, using the schunk_from_cframe function of the blosc2 module:
[8]:
schunk2 = blosc2.schunk_from_cframe(cframe=buf, copy=True)
In this case we set the copy param to True. If you do not want to copy the buffer, be mindful that you will have to keep a reference to it for as long as you use the SChunk, since the SChunk will keep using the buffer’s memory rather than owning a copy of it.
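We can verify that the deserialized SChunk holds the same data as the original one:

# Compare the full contents of both super-chunks item by item
orig = np.frombuffer(schunk[:], dtype=np.int32)
roundtrip = np.frombuffer(schunk2[:], dtype=np.int32)
print(np.array_equal(orig, roundtrip))  # expected: True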
Serializing NumPy arrays¶
If what you want is to create a serialized, compressed version of a NumPy array, you can bypass creating a SChunk manually by using some bespoke Blosc2 functions, which are more efficient and allow one to store the array in-memory or on-disk.
In-memory: To compress and store the array serialized in-memory, you can use pack_tensor:
[9]:
np_array = np.arange(2**30, dtype=np.int32) # 4 GB array
packed_arr2 = blosc2.pack_tensor(np_array)
unpacked_arr2 = blosc2.unpack_tensor(packed_arr2)
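As usual, we can check that nothing was lost in the round trip:

# The unpacked array should be identical to the original
print(np.array_equal(np_array, unpacked_arr2))  # expected: True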
Note: pack_tensor is way faster than the deprecated pack_array, which also suffers from a 2 GB size limitation.
On-disk: To compress, store and serialize a buffer on-disk, you may use save_tensor (and then load_tensor to load it back into memory):
[10]:
blosc2.save_tensor(np_array, urlpath="ondisk_array.b2frame", mode="w")
np_array2 = blosc2.load_tensor("ondisk_array.b2frame")
np.array_equal(np_array, np_array2)
blosc2.remove_urlpath("ondisk_array.b2frame") # remove the files
Conclusions¶
The SChunk class is one of the stars of Python-Blosc2, as should be clear from the last two tutorials: its sparse format allows one to add, insert and update compressed data easily and rapidly. Moreover, for the situations in which one needs it, one can always get a contiguous compressed representation (aka cframe) using the to_cframe method, and convert back to a SChunk using schunk_from_cframe. Finally, we saw how to serialize NumPy arrays in-memory or on-disk using the pack_tensor and save_tensor functions, respectively.