NDArray: An NDim, Compressed Data Container¶
NDArray objects support the usual array operations, such as setting, copying and slicing. In this section, we are going to see how to create and manipulate these NDArray arrays, which consist of data and metadata. The data is chunked and compressed; the metadata describes the data itself, as well as how it is chunked and compressed. Chunking and compression are the features that make NDArray arrays very efficient for working with large data.
[1]:
import numpy as np
import blosc2
Creating an array¶
Let’s start by creating a 2D array with 100M elements filled with arange. We can then print out the metadata, which contains information about the array data (such as shape and dtype) and about how the data is compressed and stored, such as the chunk and block shapes (chunks and blocks) and the compression parameters (cparams). See here for an explanation of chunking and blocking.
[2]:
shape = (10_000, 10_000)
array = blosc2.arange(np.prod(shape), shape=shape)
print(array.info)
type : NDArray
shape : (10000, 10000)
chunks : (100, 10000)
blocks : (2, 10000)
dtype : int64
cratio : 319.84
cparams : CParams(codec=<Codec.ZSTD: 5>, codec_meta=0, clevel=1, use_dict=False, typesize=8,
: nthreads=28, blocksize=160000, splitmode=<SplitMode.AUTO_SPLIT: 3>,
: filters=[<Filter.NOFILTER: 0>, <Filter.NOFILTER: 0>, <Filter.NOFILTER: 0>,
: <Filter.NOFILTER: 0>, <Filter.NOFILTER: 0>, <Filter.SHUFFLE: 1>], filters_meta=[0, 0,
: 0, 0, 0, 0], tuner=<Tuner.STUNE: 0>)
dparams : DParams(nthreads=28)
The cratio field tells us how effective the compression is, since it gives the ratio between the number of bytes required to store the array in uncompressed and in compressed form. Here the compressed array takes about 320x less space than the uncompressed one! Note that all the compression and decompression parameters are set to their defaults, and chunks and blocks have been selected automatically; playing around with them will affect the cratio (as well as compression and decompression speed).
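For instance, a minimal sketch of such an experiment could look like the following (the chunk and block shapes here are illustrative picks, not recommendations, and the compression ratio is read from the underlying schunk attribute):

# Sketch: build the same array with hand-picked chunk/block shapes
# (illustrative values) and compare compression ratios.
manual = blosc2.arange(
    np.prod(shape),
    shape=shape,
    chunks=(1_000, 10_000),  # bigger chunks than the automatic choice
    blocks=(100, 10_000),
)
print(f"automatic chunks/blocks -> cratio: {array.schunk.cratio:.2f}")
print(f"manual chunks/blocks    -> cratio: {manual.schunk.cratio:.2f}")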
We can also create an NDArray by compressing a NumPy array:
[3]:
nparray = np.linspace(0, 100, np.prod(shape), dtype=np.float64).reshape(shape)
b2array = blosc2.asarray(nparray)
print(b2array.info)
type : NDArray
shape : (10000, 10000)
chunks : (100, 10000)
blocks : (2, 10000)
dtype : float64
cratio : 20.41
cparams : CParams(codec=<Codec.ZSTD: 5>, codec_meta=0, clevel=1, use_dict=False, typesize=8,
: nthreads=28, blocksize=160000, splitmode=<SplitMode.AUTO_SPLIT: 3>,
: filters=[<Filter.NOFILTER: 0>, <Filter.NOFILTER: 0>, <Filter.NOFILTER: 0>,
: <Filter.NOFILTER: 0>, <Filter.NOFILTER: 0>, <Filter.SHUFFLE: 1>], filters_meta=[0, 0,
: 0, 0, 0, 0], tuner=<Tuner.STUNE: 0>)
dparams : DParams(nthreads=28)
or an iterator:
[4]:
N = 1_000_000
rng = np.random.default_rng()
it = ((-x + 1, x - 2, rng.normal()) for x in range(N))
sa = blosc2.fromiter(it, dtype="i4,f4,f8", shape=(N,))
print(sa.info)
type : NDArray
shape : (1000000,)
chunks : (500000,)
blocks : (31250,)
dtype : [('f0', '<i4'), ('f1', '<f4'), ('f2', '<f8')]
cratio : 2.24
cparams : CParams(codec=<Codec.ZSTD: 5>, codec_meta=0, clevel=1, use_dict=False, typesize=16,
: nthreads=28, blocksize=500000, splitmode=<SplitMode.AUTO_SPLIT: 3>,
: filters=[<Filter.NOFILTER: 0>, <Filter.NOFILTER: 0>, <Filter.NOFILTER: 0>,
: <Filter.NOFILTER: 0>, <Filter.NOFILTER: 0>, <Filter.SHUFFLE: 1>], filters_meta=[0, 0,
: 0, 0, 0, 0], tuner=<Tuner.STUNE: 0>)
dparams : DParams(nthreads=28)
Reading and modifying data¶
NDArray arrays cannot be read directly, since they are compressed, and so must be decompressed first (to NumPy arrays, which are stored in memory). This can be done for the full array using the [:] operator, which returns a NumPy array.
[5]:
temp = array[:] # This will decompress the full array
type(temp)
[5]:
numpy.ndarray
However, it is often not necessary (or desirable) to load the whole array into memory. We can quickly read just small parts of an NDArray array into a NumPy array via standard indexing routines.
[6]:
res1 = array[0] # get first element
res2 = array[6:10] # get slice
print(f"Got one element (of shape {res1.shape}) and slice of shape {res2.shape}.")
Got one element (of shape (10000,)) and slice of shape (4, 10000).
We can also modify the data in the array using standard NumPy indexing, with either NumPy or NDArray arrays as the data source. For example, we can set the first row to zeros (using an NDArray array) and the first column to ones (using a NumPy array).
[7]:
array[0, :] = blosc2.zeros(10000, dtype=array.dtype)
array[:, 0] = np.ones(10000, dtype=array.dtype)
print(array)
<blosc2.ndarray.NDArray object at 0x7f5104194b10>
Note that array is still an NDArray array. Let’s check that the entries were correctly modified.
[8]:
print(array[0, 0])
print(array[0, :])
print(array[:, 0])
1
[1 0 0 ... 0 0 0]
[1 1 1 ... 1 1 1]
Enlarging the array¶
Existing arrays can be enlarged. This is one operation that is greatly enhanced by the chunking procedure implemented in NDArray arrays.
[9]:
array.resize((10_001, 10_000))
print(array.shape)
array[10_000, :] = 1
array[10_000, :]
(10001, 10000)
[9]:
array([1, 1, 1, ..., 1, 1, 1], shape=(10000,))
Enlarging a NumPy array is very costly, since the underlying data are stored contiguously in memory and a full copy is required: new memory to hold the extended array is allocated, the old data is copied into part of it, and the new data is written to the rest. Enlarging an NDArray array is much faster because the data is chunked, and chunks need not be stored contiguously in memory: the necessary new chunks are simply written somewhere in memory and the old chunks are left untouched. References to the new chunks are then added to the NDArray container, which is a very quick operation.
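As a rough, hedged illustration of the difference, one could time appending a single row to a plain NumPy array versus resizing an NDArray of the same size (the array size is an arbitrary pick, and timings will vary with machine and compression settings):

import time

# Rough sketch: cost of appending one row to a NumPy array (full copy)
# versus resizing a compressed NDArray (only new chunks are written).
n = 5_000
np_a = np.zeros((n, n))
b2_a = blosc2.zeros((n, n))

t0 = time.perf_counter()
np_a = np.vstack([np_a, np.zeros((1, n))])  # allocates and copies everything
t_np = time.perf_counter() - t0

t0 = time.perf_counter()
b2_a.resize((n + 1, n))  # existing chunks stay where they are
b2_a[n, :] = 0
t_b2 = time.perf_counter() - t0

print(f"NumPy append: {t_np:.4f} s  |  NDArray resize: {t_b2:.4f} s")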
You can also shrink the array.
[10]:
array.resize((9_000, 10_000))
print(array.shape)
print(array[8_999]) # This works
# array[9_000] # This will raise an exception
(9000, 10000)
[ 1 89990001 89990002 ... 89999997 89999998 89999999]
Persistent data¶
We can use the save() method to store the array on disk. This is very useful when you are working with a large array but do not need to access it often.
[11]:
array.save("array_tutorial.b2nd", mode="w") # , contiguous=True)
!ls -lh array_tutorial.b2nd
-rw-r--r-- 1 faltet blosc 2,4M ago 6 11:31 array_tutorial.b2nd
For arrays, it is usual to use the .b2nd extension. Now let’s open the saved array and check that the data was saved correctly (decompressing first to be able to compare):
[12]:
array2 = blosc2.open("array_tutorial.b2nd")
np.all(array2[:] == array[:]) # Make sure saved array matches original
[12]:
np.True_
In fact, it is possible to create an NDArray array directly on disk, specifying where it will be stored, without first creating it in memory. We may also specify the compression/decompression and other storage parameters (e.g. chunks and blocks). For example, a 1000x1000 array filled with the string "pepe" can be created like this:
[13]:
array1 = blosc2.full(
(1000, 1000),
fill_value=b"pepe",
chunks=(100, 100),
blocks=(50, 50),
urlpath="array1_tutorial.b2nd",
mode="w",
)
!ls -lh array1_tutorial.b2nd
-rw-r--r-- 1 faltet blosc 4,0K ago 6 11:32 array1_tutorial.b2nd
We can also write directly to disk using the other constructors we saw previously.
[14]:
it = ((-x + 1, x - 2, rng.normal()) for x in range(N))
sa = blosc2.fromiter(it, dtype="i4,f4,f8", shape=(N,), urlpath="sa-1M.b2nd", mode="w")
print("3 first rows of sa:", sa[:3])
b2array = blosc2.asarray(nparray, urlpath="linspace_array.b2nd", mode="w")
print("3 first rows of b2array:", b2array[:3])
3 first rows of sa: [( 1, -2., 0.45272938) ( 0, -1., -0.13468548) (-1, 0., -0.07419887)]
3 first rows of b2array: [[0.00000000e+00 1.00000001e-06 2.00000002e-06 ... 9.99700010e-03
9.99800010e-03 9.99900010e-03]
[1.00000001e-02 1.00010001e-02 1.00020001e-02 ... 1.99970002e-02
1.99980002e-02 1.99990002e-02]
[2.00000002e-02 2.00010002e-02 2.00020002e-02 ... 2.99970003e-02
2.99980003e-02 2.99990003e-02]]
To delete saved data, one may use the blosc2.remove_urlpath() function.
[15]:
blosc2.remove_urlpath("array_tutorial.b2nd")
blosc2.remove_urlpath("array1_tutorial.b2nd")
blosc2.remove_urlpath("sa-1M.b2nd")
blosc2.remove_urlpath("linspace_array.b2nd")
Compression params¶
Let’s see how to copy the NDArray data whilst altering the compression parameters. This may be useful in many contexts, for example testing how changing the codec of an existing array affects the compression ratio.
[16]:
cparams = blosc2.CParams(
codec=blosc2.Codec.LZ4,
clevel=9,
filters=[blosc2.Filter.BITSHUFFLE],
filters_meta=[0],
)
array2 = array.copy(chunks=(500, 10_000), blocks=(50, 10_000), cparams=cparams)
print(array2.info)
type : NDArray
shape : (9000, 10000)
chunks : (500, 10000)
blocks : (50, 10000)
dtype : int64
cratio : 70.63
cparams : CParams(codec=<Codec.LZ4: 1>, codec_meta=0, clevel=9, use_dict=False, typesize=8,
: nthreads=28, blocksize=4000000, splitmode=<SplitMode.AUTO_SPLIT: 3>,
: filters=[<Filter.BITSHUFFLE: 2>, <Filter.NOFILTER: 0>, <Filter.NOFILTER: 0>,
: <Filter.NOFILTER: 0>, <Filter.NOFILTER: 0>, <Filter.NOFILTER: 0>], filters_meta=[0, 0,
: 0, 0, 0, 0], tuner=<Tuner.STUNE: 0>)
dparams : DParams(nthreads=28)
[17]:
print(array.info)
type : NDArray
shape : (9000, 10000)
chunks : (100, 10000)
blocks : (2, 10000)
dtype : int64
cratio : 289.53
cparams : CParams(codec=<Codec.ZSTD: 5>, codec_meta=0, clevel=1, use_dict=False, typesize=8,
: nthreads=28, blocksize=160000, splitmode=<SplitMode.AUTO_SPLIT: 3>,
: filters=[<Filter.NOFILTER: 0>, <Filter.NOFILTER: 0>, <Filter.NOFILTER: 0>,
: <Filter.NOFILTER: 0>, <Filter.NOFILTER: 0>, <Filter.SHUFFLE: 1>], filters_meta=[0, 0,
: 0, 0, 0, 0], tuner=<Tuner.STUNE: 0>)
dparams : DParams(nthreads=28)
In this case the compression ratio is much higher for the original array, since the copy uses a codec (LZ4) that is optimised for compression speed rather than compression ratio. In general there is a tradeoff between the two.
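To explore this tradeoff a bit further, here is a minimal sketch that copies the same array under several codecs and compares the resulting compression ratios and times (the codec list and clevel are illustrative choices, and the compression ratio is read from the underlying schunk):

import time

# Sketch: trade-off between compression ratio and compression speed.
# Codec list and clevel are illustrative choices.
for codec in (blosc2.Codec.BLOSCLZ, blosc2.Codec.LZ4, blosc2.Codec.ZSTD):
    t0 = time.perf_counter()
    copy_ = array.copy(cparams=blosc2.CParams(codec=codec, clevel=5))
    elapsed = time.perf_counter() - t0
    print(f"{codec.name:8s} cratio={copy_.schunk.cratio:7.2f} time={elapsed:.3f} s")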
Native Blosc2 Codecs¶
Blosc2 supports many standard codecs, since there is no one-size-fits-all compression solution - one codec may be perfect for one context, but quite suboptimal in another.
ZLIB codec: uses the DEFLATE algorithm, is standard, and works well for images.
ZSTD codec: similar compression ratio to ZLIB but faster compression/decompression.
LZ4 codec: even faster compression/decompression than ZSTD, but a lower compression ratio.
BloscLZ codec: the Blosc implementation of the popular LZ algorithms (good for repeated data, e.g. text), with a tradeoff similar to LZ4.
Finally, via package extensions to Blosc2, one may access the JPEG2000 family of compression algorithms, which aim for a compromise between compression ratio and image quality; Blosc2 implements GROK (blosc2-grok) and OPENHTJ2K (blosc2-openhtj2k).
That’s all for now. There are more examples in the examples directory of the git repository for you to explore. Enjoy!