Slicing, extending and serializing

The newest way to store data in python-blosc2 is through a SChunk (super-chunk) object, in which the data is split into chunks of the same size. In the past, the only way of working with a SChunk was chunk by chunk (see the SChunk basics tutorial), but python-blosc2 can now retrieve, update or append data at the item level, without handling chunks explicitly. To see how this works, let’s first create our SChunk.

[11]:
import blosc2
import numpy as np

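nchunks = 10
# 200,000 int32 items per chunk
data = np.arange(200 * 1000 * nchunks, dtype=np.int32)
cparams = {"typesize": 4}  # size of one int32 item, in bytes
# chunksize is given in bytes: 200,000 items * 4 bytes each
schunk = blosc2.SChunk(chunksize=200 * 1000 * 4, data=data, cparams=cparams)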

It is important to set the typesize correctly, because the methods shown below work with items and not with bytes.
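To make the item-versus-byte distinction concrete, here is a minimal sketch of the arithmetic implied by item-level indexing. The two helpers are hypothetical and written only for illustration; they are not part of python-blosc2, and the library's internals may differ.

```python
def item_slice_to_byte_range(start, stop, typesize):
    """Byte offsets covered by the item slice [start, stop)."""
    return start * typesize, stop * typesize


def chunks_touched(start, stop, chunksize, typesize):
    """Indices of the chunks that hold the item slice [start, stop)."""
    items_per_chunk = chunksize // typesize
    first = start // items_per_chunk
    last = (stop - 1) // items_per_chunk
    return list(range(first, last + 1))


chunksize = 200 * 1000 * 4  # bytes per chunk, as in the SChunk above
start, stop = 34, 800_000   # an arbitrary item slice

# With the right typesize (4 bytes per int32 item):
byte_range = item_slice_to_byte_range(start, stop, typesize=4)
touched = chunks_touched(start, stop, chunksize, typesize=4)

# With a wrong typesize of 1, the same indices map to different bytes and chunks:
wrong_byte_range = item_slice_to_byte_range(start, stop, typesize=1)
wrong_touched = chunks_touched(start, stop, chunksize, typesize=1)

print(byte_range, touched)
print(wrong_byte_range, wrong_touched)
```

With a typesize of 4 the slice covers bytes 136 to 3,200,000 and spans chunks 0 to 3, whereas a typesize of 1 would confine the same indices to the first chunk.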

Getting data from a SChunk

Let’s begin by retrieving the data from the whole SChunk. We could use the decompress_chunk method:

[12]:
out = np.empty(200 * 1000 * nchunks, dtype=np.int32)
for i in range(nchunks):
    schunk.decompress_chunk(i, out[200 * 1000 * i : 200 * 1000 * (i + 1)])

But instead of the code above, we can simply use the __getitem__ or the get_slice methods. Let’s begin with __getitem__:

[13]:
out_slice = schunk[:]
type(out_slice)
[13]:
bytes

As you can see, the data is returned as a bytes object. If we want a more meaningful container instead, we can use get_slice, which accepts any Python object supporting the Buffer Protocol as the out param and fills it with the data. In this case we will use a NumPy array as the container.

[14]:
out_slice = np.empty(200 * 1000 * nchunks, dtype=np.int32)
schunk.get_slice(out=out_slice)
assert np.array_equal(out, out_slice)  # same data as the chunk-by-chunk loop
print(out_slice[:4])
[0 1 2 3]

That’s the expected data indeed!

Setting data in a SChunk

We can also set the data of a region of a SChunk from any Python object supporting the Buffer Protocol. Let’s see a quick example:

[15]:
# start and stop are item indices (not bytes); this range spans the first four chunks
start = 34
stop = 1000 * 200 * 4
new_value = np.ones(stop - start, dtype=np.int32)
schunk[start:stop] = new_value

We have seen how to get or set data. But what if we would like to append data? Well, you can still do that with __setitem__.

[16]:
schunk_nelems = 1000 * 200 * nchunks

new_value = np.zeros(1000 * 200 * 2 + 53, dtype=np.int32)
start = schunk_nelems - 123
new_nitems = start + new_value.size
schunk[start:new_nitems] = new_value

Here, start is less than the number of elements in the SChunk and new_nitems is larger than it; that means that __setitem__ can update and append data at the same time, and you don’t have to worry about whether you are exceeding the limits of the SChunk.

Building a SChunk from/as a contiguous buffer

Furthermore, you can convert a SChunk to a contiguous, serialized buffer and vice-versa. Let’s get that buffer (aka cframe) first:

[17]:
buf = schunk.to_cframe()

And now the other way around:

[18]:
schunk2 = blosc2.schunk_from_cframe(cframe=buf, copy=True)

In this case we set the copy param to True. If you do not want to copy the buffer, be mindful that you will have to keep a reference to it for as long as you use the SChunk.

Serializing NumPy arrays

If what you want is to create a serialized, compressed version of a NumPy array, you can use the newer (and faster) functions to store it either in-memory or on-disk. The contiguous compressed representation they produce is a cframe, whose format specification is part of the Blosc2 documentation.

In-memory

To obtain an in-memory representation, you can use pack_tensor. Compared with its predecessor (pack_array), it is much faster and does not have the 2 GB size limitation:

[19]:
np_array = np.arange(2**30, dtype=np.int32)  # 4 GB array

packed_arr2 = blosc2.pack_tensor(np_array)
unpacked_arr2 = blosc2.unpack_tensor(packed_arr2)

On-disk

To store the serialized buffer on disk, use save_tensor, and read it back with load_tensor:

[20]:
blosc2.save_tensor(np_array, urlpath="ondisk_array.b2frame", mode="w")
np_array2 = blosc2.load_tensor("ondisk_array.b2frame")
np.array_equal(np_array, np_array2)
[20]:
True

Conclusions

python-blosc2 now offers an easy yet fast way of creating, getting, setting and expanding data via the SChunk class. Moreover, you can get a contiguous compressed representation (aka cframe) of a SChunk and re-create it later with no sweat.