Slicing, extending and serializing with SChunks¶
The usual way to store generic binary data in python-blosc2 is through a SChunk (super-chunk) object, where the data is split into chunks of the same size, as we studied in the last tutorial. There we saw how to retrieve, update or append data in the form of whole chunks. In fact, one can also work with the individual multi-byte items composing the data (and not with the bytes directly), using native SChunk methods; such operations are the subject of this tutorial. We will use NumPy arrays as data sources, but everything we’re going to do would work equally well with any Python object supporting the Buffer Protocol.
First, we create our own SChunk instance; this time, let’s fill it with data upon creation.
[1]:
import numpy as np
import blosc2

nchunks = 10
data = np.arange(200 * 1000 * nchunks, dtype=np.int32)  # 2,000,000 int32 items
cparams = blosc2.CParams(typesize=4)  # 4 bytes per item, matching int32
schunk = blosc2.SChunk(chunksize=200 * 1000 * 4, data=data, cparams=cparams)  # chunksize is in bytes
It is important to set the typesize correctly, as the methods we are going to use work with items (of size typesize) and not with individual bytes.
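To make this concrete, each chunk of our SChunk holds chunksize bytes, i.e. chunksize // typesize items. A quick check (assuming, as in current python-blosc2, that the SChunk exposes typesize and chunksize attributes):

print(schunk.typesize)  # 4 bytes per item (int32)
print(schunk.chunksize // schunk.typesize)  # 200000 items per chunk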
Getting data from a SChunk¶
Let’s begin by retrieving the data from the whole SChunk. We could use the decompress_chunk method, decompressing chunk by chunk into a buffer, as we did in the previous tutorial:
[2]:
out = np.empty(200 * 1000 * nchunks, dtype=np.int32)
for i in range(nchunks):
    # Decompress chunk i directly into its slot of the output buffer
    schunk.decompress_chunk(i, out[200 * 1000 * i : 200 * 1000 * (i + 1)])
However, instead of the code above, we can simply use the __getitem__ or get_slice methods, without even needing to initialise an empty buffer. Let’s begin with __getitem__:
[3]:
out_slice = schunk[:]
type(out_slice)
[3]:
bytes
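As you can see, the data is returned as a bytes object, holding the raw items back to back. One way to give it shape again is to reinterpret it with NumPy (a small sketch, using the standard np.frombuffer):

# Reinterpret the raw bytes as int32 items, without copying them by hand
arr_view = np.frombuffer(out_slice, dtype=np.int32)
print(arr_view[:4])  # expected: [0 1 2 3]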
If we want to get a more meaningful container directly, we can use get_slice instead. This method takes an initialised buffer into which to load the bytes: one may pass any Python object (supporting the Buffer Protocol) as the out param, and it will be filled with the data. In this case we will use a NumPy array container.
[4]:
out_slice = np.empty(200 * 1000 * nchunks, dtype=np.int32)
schunk.get_slice(out=out_slice)
np.array_equal(out, out_slice)
print(out_slice[:4])
[0 1 2 3]
That’s the expected data indeed!
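get_slice also accepts start and stop arguments, so we do not have to fetch the whole SChunk. A minimal sketch, assuming the documented get_slice(start, stop, out) signature, with indices again expressed in items:

# Fetch items 100..103 only
partial = np.empty(4, dtype=np.int32)
schunk.get_slice(100, 104, out=partial)
print(partial)  # expected: [100 101 102 103]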
Setting (and enlarging) data in a SChunk¶
We can also directly set an arbitrary slice of data of a SChunk (without having to define a whole chunk and use update_chunk as we saw previously). For this we use the __setitem__ method of the SChunk, assigning to it from some source, which may be any Python object supporting the Buffer Protocol. Let’s see a quick example:
[5]:
start = 34
stop = 1000 * 200 * 4
new_value = np.ones(stop - start, dtype=np.int32)
schunk[start:stop] = new_value
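We can double-check that the slice was written by reading a few items back (reusing the np.frombuffer trick from above):

# The first items of the written range should now be ones
print(np.frombuffer(schunk[start : start + 4], dtype=np.int32))  # expected: [1 1 1 1]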
In fact, __setitem__ allows you to set a slice of the SChunk which extends past the existing data boundaries, using essentially the same syntax:
[6]:
schunk_nelems = 1000 * 200 * nchunks
new_value = np.zeros(1000 * 200 * 2 + 53, dtype=np.int32)
start = schunk_nelems - 123
new_nitems = start + new_value.size
print(f"Original nchunks: {schunk.nchunks}")
schunk[start:new_nitems] = new_value
print(f"New nchunks: {schunk.nchunks}")
Original nchunks: 10
New nchunks: 12
Here, start is less than the number of elements in the SChunk and new_nitems is larger; __setitem__ can update and append data at the same time, and you don’t have to worry about whether you are exceeding the limits of the SChunk. Internally, the necessary chunks are added to accommodate the new data, and we can check that the number of chunks has indeed increased.
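As a further check, the total number of items should now match new_nitems. A quick sketch, assuming the SChunk reports its uncompressed size via the nbytes attribute (as current python-blosc2 does):

# Uncompressed bytes divided by the item size gives the item count
print(schunk.nbytes // schunk.typesize == new_nitems)  # expected: True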
Building a SChunk from/as a contiguous buffer¶
Recall that SChunks generally store data in a non-contiguous (sparse) manner. Certain operations (e.g. data transfer) are faster if the data is stored contiguously, so one may want to convert the SChunk to a contiguous, serialized buffer (aka cframe). The specification of a cframe (a contiguous compressed representation) can be seen here. Converting to a cframe is as simple as calling the to_cframe method of the SChunk:
[7]:
buf = schunk.to_cframe()
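The result is a plain bytes object holding the whole serialized super-chunk, so it can be written to a file or sent over a socket as-is:

# The cframe is just bytes: easy to persist or transmit
print(type(buf), len(buf))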
Likewise, since the SChunk format is useful for e.g. extending the data in an efficient way, one may wish to convert a contiguous buffer (e.g. a cframe) back into a SChunk. This is also very easy, using the schunk_from_cframe function of the blosc2 module:
[8]:
schunk2 = blosc2.schunk_from_cframe(cframe=buf, copy=True)
In this case we set the copy param to True. If you do not want to copy the buffer, be mindful that you will have to keep a reference to it for as long as you use the SChunk, since the SChunk will keep using the buffer’s memory rather than owning a copy of it.
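We can verify that the deserialized SChunk holds the same data as the original one:

# Compare the full contents of both super-chunks item by item
orig = np.frombuffer(schunk[:], dtype=np.int32)
roundtrip = np.frombuffer(schunk2[:], dtype=np.int32)
print(np.array_equal(orig, roundtrip))  # expected: True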
Serializing NumPy arrays¶
If what you want is to create a serialized, compressed version of a NumPy array, you can bypass creating a SChunk manually by using some bespoke Blosc2 functions, which are more efficient and allow one to store the array in-memory or on-disk.
In-memory: To compress and store the array serialized in-memory, you can use pack_tensor:
[9]:
np_array = np.arange(2**30, dtype=np.int32) # 4 GB array
packed_arr2 = blosc2.pack_tensor(np_array)
unpacked_arr2 = blosc2.unpack_tensor(packed_arr2)
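As usual, we can check that nothing was lost in the round trip:

# The unpacked array should be identical to the original
print(np.array_equal(np_array, unpacked_arr2))  # expected: True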
Note: pack_tensor is way faster than the deprecated pack_array, which also suffers from a 2 GB size limitation.
On-disk: To compress, store and serialize a buffer on-disk, you may use save_tensor (and then load_tensor to load it back into memory):
[10]:
blosc2.save_tensor(np_array, urlpath="ondisk_array.b2frame", mode="w")
np_array2 = blosc2.load_tensor("ondisk_array.b2frame")
np.array_equal(np_array, np_array2)
blosc2.remove_urlpath("ondisk_array.b2frame") # remove the files
Conclusions¶
The SChunk class is one of the stars of Python-Blosc2, as should be clear from the last two tutorials: its sparse format allows one to add, insert and update compressed data easily and rapidly. Moreover, for the situations in which one needs it, one can always get a contiguous compressed representation (aka cframe) using the to_cframe method, and convert back to a SChunk using schunk_from_cframe. Finally, we saw how to serialize NumPy arrays in-memory or on-disk using the pack_tensor and save_tensor functions, respectively.