NDArray: An NDim, Compressed Data Container¶
NDArray objects support the usual array operations, such as setting, copying and slicing. In this section, we are going to see how to create and manipulate these NDArray arrays, which consist of data and metadata. The data is chunked and compressed; the metadata describes the data itself, as well as how it is chunked and compressed. Chunking and compression are the features that make NDArray arrays very efficient for working with large data.
[1]:
import numpy as np
import blosc2
Creating an array¶
Let’s start by creating a 2D array with 100M elements filled with arange. We can then print out the metadata, which contains information about the array data (such as shape and dtype) and about how the data is compressed and stored, such as the chunk and block shapes (chunks and blocks) and the compression parameters (cparams). See here for an explanation of chunking and blocking.
[2]:
shape = (10_000, 10_000)
array = blosc2.arange(np.prod(shape), shape=shape)
print(array.info)
type : NDArray
shape : (10000, 10000)
chunks : (100, 10000)
blocks : (2, 10000)
dtype : int64
cratio : 319.84
cparams : CParams(codec=<Codec.ZSTD: 5>, codec_meta=0, clevel=1, use_dict=False, typesize=8,
: nthreads=28, blocksize=160000, splitmode=<SplitMode.AUTO_SPLIT: 3>,
: filters=[<Filter.NOFILTER: 0>, <Filter.NOFILTER: 0>, <Filter.NOFILTER: 0>,
: <Filter.NOFILTER: 0>, <Filter.NOFILTER: 0>, <Filter.SHUFFLE: 1>], filters_meta=[0, 0,
: 0, 0, 0, 0], tuner=<Tuner.STUNE: 0>)
dparams : DParams(nthreads=28)
The cratio field tells us how effective the compression is, since it gives the ratio between the number of bytes required to store the array in uncompressed and in compressed form. Here the compressed array takes about 320x less space than the uncompressed one! Note that all the compression and decompression parameters are set to their defaults, and chunks and blocks have been selected automatically; playing around with them will affect the cratio (as well as compression and decompression speed).
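For instance, a minimal sketch of such an experiment could look like the following (the chunk and block shapes here are illustrative picks, not recommendations, and the compression ratio is read from the underlying schunk attribute):

# Sketch: build the same array with hand-picked chunk/block shapes
# (illustrative values) and compare compression ratios.
manual = blosc2.arange(
    np.prod(shape),
    shape=shape,
    chunks=(1_000, 10_000),  # bigger chunks than the automatic choice
    blocks=(100, 10_000),
)
print(f"automatic chunks/blocks -> cratio: {array.schunk.cratio:.2f}")
print(f"manual chunks/blocks    -> cratio: {manual.schunk.cratio:.2f}")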
We can also create an NDArray by compressing a NumPy array:
[3]:
nparray = np.linspace(0, 100, np.prod(shape), dtype=np.float64).reshape(shape)
b2array = blosc2.asarray(nparray)
print(b2array.info)
type : NDArray
shape : (10000, 10000)
chunks : (100, 10000)
blocks : (2, 10000)
dtype : float64
cratio : 20.41
cparams : CParams(codec=<Codec.ZSTD: 5>, codec_meta=0, clevel=1, use_dict=False, typesize=8,
: nthreads=28, blocksize=160000, splitmode=<SplitMode.AUTO_SPLIT: 3>,
: filters=[<Filter.NOFILTER: 0>, <Filter.NOFILTER: 0>, <Filter.NOFILTER: 0>,
: <Filter.NOFILTER: 0>, <Filter.NOFILTER: 0>, <Filter.SHUFFLE: 1>], filters_meta=[0, 0,
: 0, 0, 0, 0], tuner=<Tuner.STUNE: 0>)
dparams : DParams(nthreads=28)
or an iterator:
[4]:
N = 1_000_000
rng = np.random.default_rng()
it = ((-x + 1, x - 2, rng.normal()) for x in range(N))
sa = blosc2.fromiter(it, dtype="i4,f4,f8", shape=(N,))
print(sa.info)
type : NDArray
shape : (1000000,)
chunks : (500000,)
blocks : (31250,)
dtype : [('f0', '<i4'), ('f1', '<f4'), ('f2', '<f8')]
cratio : 2.24
cparams : CParams(codec=<Codec.ZSTD: 5>, codec_meta=0, clevel=1, use_dict=False, typesize=16,
: nthreads=28, blocksize=500000, splitmode=<SplitMode.AUTO_SPLIT: 3>,
: filters=[<Filter.NOFILTER: 0>, <Filter.NOFILTER: 0>, <Filter.NOFILTER: 0>,
: <Filter.NOFILTER: 0>, <Filter.NOFILTER: 0>, <Filter.SHUFFLE: 1>], filters_meta=[0, 0,
: 0, 0, 0, 0], tuner=<Tuner.STUNE: 0>)
dparams : DParams(nthreads=28)
Reading and modifying data¶
NDArray arrays cannot be read directly, since they are compressed, and so must be decompressed first (to NumPy arrays, which are stored in memory). This can be done for the full array using the [:] operator, which returns a NumPy array.
[5]:
temp = array[:] # This will decompress the full array
type(temp)
[5]:
numpy.ndarray
However, it is often not necessary (or desirable) to load the whole array into memory. We can quickly read just small parts of an NDArray array into a NumPy array via standard indexing routines.
[6]:
res1 = array[0] # get first element
res2 = array[6:10] # get slice
print(f"Got one element (of shape {res1.shape}) and slice of shape {res2.shape}.")
Got one element (of shape (10000,)) and slice of shape (4, 10000).
We can also modify the data in the array using standard NumPy indexing, with either NumPy or NDArray arrays as the data source. For example, we can set the first row to zeros (using an NDArray array) and the first column to ones (using a NumPy array).
[7]:
array[0, :] = blosc2.zeros(10000, dtype=array.dtype)
array[:, 0] = np.ones(10000, dtype=array.dtype)
print(array)
<blosc2.ndarray.NDArray object at 0x7f5104194b10>
Note that array is still an NDArray array. Let’s check that the entries were correctly modified.
[8]:
print(array[0, 0])
print(array[0, :])
print(array[:, 0])
1
[1 0 0 ... 0 0 0]
[1 1 1 ... 1 1 1]
Enlarging the array¶
Existing arrays can be enlarged. This is one operation that is greatly enhanced by the chunking procedure implemented in NDArray arrays.
[9]:
array.resize((10_001, 10_000))
print(array.shape)
array[10_000, :] = 1
array[10_000, :]
(10001, 10000)
[9]:
array([1, 1, 1, ..., 1, 1, 1], shape=(10000,))
Enlarging a NumPy array is very costly, since the underlying data are stored contiguously in memory and a full copy is required: new memory to hold the extended array is allocated, the old data is copied into part of it, and the new data is written to the rest. Enlarging an NDArray array is much faster because the data is chunked, and chunks need not be stored contiguously in memory: the necessary new chunks are simply written somewhere in memory and the old chunks are left untouched. References to the new chunks are then added to the NDArray container, which is a very quick operation.
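As a rough, hedged illustration of the difference, one could time appending a single row to a plain NumPy array versus resizing an NDArray of the same size (the array size is an arbitrary pick, and timings will vary with machine and compression settings):

import time

# Rough sketch: cost of appending one row to a NumPy array (full copy)
# versus resizing a compressed NDArray (only new chunks are written).
n = 5_000
np_a = np.zeros((n, n))
b2_a = blosc2.zeros((n, n))

t0 = time.perf_counter()
np_a = np.vstack([np_a, np.zeros((1, n))])  # allocates and copies everything
t_np = time.perf_counter() - t0

t0 = time.perf_counter()
b2_a.resize((n + 1, n))  # existing chunks stay where they are
b2_a[n, :] = 0
t_b2 = time.perf_counter() - t0

print(f"NumPy append: {t_np:.4f} s  |  NDArray resize: {t_b2:.4f} s")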
You can also shrink the array.
[10]:
array.resize((9_000, 10_000))
print(array.shape)
print(array[8_999]) # This works
# array[9_000] # This will raise an exception
(9000, 10000)
[ 1 89990001 89990002 ... 89999997 89999998 89999999]
Persistent data¶
We can use the save() method to store the array on disk. This is very useful when you are working with a large array but do not need to access it often.
[11]:
array.save("array_tutorial.b2nd", mode="w") # , contiguous=True)
!ls -lh array_tutorial.b2nd
-rw-r--r-- 1 faltet blosc 2,4M ago 6 11:31 array_tutorial.b2nd
For arrays, it is usual to use the .b2nd extension. Now let’s open the saved array and check that the data was saved correctly (decompressing first to be able to compare):
[12]:
array2 = blosc2.open("array_tutorial.b2nd")
np.all(array2[:] == array[:]) # Make sure saved array matches original
[12]:
np.True_
In fact, it is possible to create an NDArray array directly on disk, specifying where it will be stored, without first creating it in memory. We may also specify the compression/decompression and other storage parameters (e.g. chunks and blocks). For example, a 1000x1000 array filled with the string "pepe" can be created like this:
[13]:
array1 = blosc2.full(
(1000, 1000),
fill_value=b"pepe",
chunks=(100, 100),
blocks=(50, 50),
urlpath="array1_tutorial.b2nd",
mode="w",
)
!ls -lh array1_tutorial.b2nd
-rw-r--r-- 1 faltet blosc 4,0K ago 6 11:32 array1_tutorial.b2nd
We can also write directly to disk using the other constructors we saw previously.
[14]:
it = ((-x + 1, x - 2, rng.normal()) for x in range(N))
sa = blosc2.fromiter(it, dtype="i4,f4,f8", shape=(N,), urlpath="sa-1M.b2nd", mode="w")
print("3 first rows of sa:", sa[:3])
b2array = blosc2.asarray(nparray, urlpath="linspace_array.b2nd", mode="w")
print("3 first rows of b2array:", b2array[:3])
3 first rows of sa: [( 1, -2., 0.45272938) ( 0, -1., -0.13468548) (-1, 0., -0.07419887)]
3 first rows of b2array: [[0.00000000e+00 1.00000001e-06 2.00000002e-06 ... 9.99700010e-03
9.99800010e-03 9.99900010e-03]
[1.00000001e-02 1.00010001e-02 1.00020001e-02 ... 1.99970002e-02
1.99980002e-02 1.99990002e-02]
[2.00000002e-02 2.00010002e-02 2.00020002e-02 ... 2.99970003e-02
2.99980003e-02 2.99990003e-02]]
To delete saved data, one may use the blosc2.remove_urlpath() function.
[15]:
blosc2.remove_urlpath("array_tutorial.b2nd")
blosc2.remove_urlpath("array1_tutorial.b2nd")
blosc2.remove_urlpath("sa-1M.b2nd")
blosc2.remove_urlpath("linspace_array.b2nd")
Compression params¶
Let’s see how to copy the NDArray data whilst altering the compression parameters. This may be useful in many contexts, for example testing how changing the codec of an existing array affects the compression ratio.
[16]:
cparams = blosc2.CParams(
codec=blosc2.Codec.LZ4,
clevel=9,
filters=[blosc2.Filter.BITSHUFFLE],
filters_meta=[0],
)
array2 = array.copy(chunks=(500, 10_000), blocks=(50, 10_000), cparams=cparams)
print(array2.info)
type : NDArray
shape : (9000, 10000)
chunks : (500, 10000)
blocks : (50, 10000)
dtype : int64
cratio : 70.63
cparams : CParams(codec=<Codec.LZ4: 1>, codec_meta=0, clevel=9, use_dict=False, typesize=8,
: nthreads=28, blocksize=4000000, splitmode=<SplitMode.AUTO_SPLIT: 3>,
: filters=[<Filter.BITSHUFFLE: 2>, <Filter.NOFILTER: 0>, <Filter.NOFILTER: 0>,
: <Filter.NOFILTER: 0>, <Filter.NOFILTER: 0>, <Filter.NOFILTER: 0>], filters_meta=[0, 0,
: 0, 0, 0, 0], tuner=<Tuner.STUNE: 0>)
dparams : DParams(nthreads=28)
[17]:
print(array.info)
type : NDArray
shape : (9000, 10000)
chunks : (100, 10000)
blocks : (2, 10000)
dtype : int64
cratio : 289.53
cparams : CParams(codec=<Codec.ZSTD: 5>, codec_meta=0, clevel=1, use_dict=False, typesize=8,
: nthreads=28, blocksize=160000, splitmode=<SplitMode.AUTO_SPLIT: 3>,
: filters=[<Filter.NOFILTER: 0>, <Filter.NOFILTER: 0>, <Filter.NOFILTER: 0>,
: <Filter.NOFILTER: 0>, <Filter.NOFILTER: 0>, <Filter.SHUFFLE: 1>], filters_meta=[0, 0,
: 0, 0, 0, 0], tuner=<Tuner.STUNE: 0>)
dparams : DParams(nthreads=28)
In this case the compression ratio is much higher for the original array, since the copy uses a codec (LZ4) that is optimised for compression speed rather than compression ratio. In general there is a tradeoff between the two.
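To explore this tradeoff a bit further, here is a minimal sketch that copies the same array under several codecs and compares the resulting compression ratios and times (the codec list and clevel are illustrative choices, and the compression ratio is read from the underlying schunk):

import time

# Sketch: trade-off between compression ratio and compression speed.
# Codec list and clevel are illustrative choices.
for codec in (blosc2.Codec.BLOSCLZ, blosc2.Codec.LZ4, blosc2.Codec.ZSTD):
    t0 = time.perf_counter()
    copy_ = array.copy(cparams=blosc2.CParams(codec=codec, clevel=5))
    elapsed = time.perf_counter() - t0
    print(f"{codec.name:8s} cratio={copy_.schunk.cratio:7.2f} time={elapsed:.3f} s")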
Native Blosc2 Codecs¶
Blosc2 supports many standard codecs, since there is no one-size-fits-all compression solution - one codec may be perfect for one context, but quite suboptimal in another.
ZLIB codec: uses the DEFLATE algorithm, is standard, and works well for images.
ZSTD codec: similar compression ratio to ZLIB but faster compression/decompression.
LZ4 codec: even faster compression/decompression than ZSTD, but a lower compression ratio.
BloscLZ codec: the Blosc implementation of the popular LZ algorithms (good for repeated data, e.g. text), with a tradeoff similar to LZ4.
Finally, via package extensions to Blosc2, one may access the JPEG2000 family of compression algorithms, which aim for a compromise between compression ratio and image quality; Blosc2 implements GROK (blosc2-grok) and OPENHTJ2K (blosc2-openhtj2k).
That’s all for now. There are more examples in the examples directory of the git repository for you to explore. Enjoy!