Expressions containing NDArray objects¶

Python-Blosc2 implements a powerful way to operate with NDArray arrays and other objects, called “lazy expressions”. A lazy expression is a lightweight object which stores a desired computation symbolically, with references to its operands (stored on disk or in memory), but does not execute until data is explicitly requested, e.g. if a slice of the computation result is requested. The lazy expression will then execute, but only on the necessary portion of the data, making it especially efficient, and avoiding large in-memory computations.

In this tutorial, we will see how to do such lazy computations, which are especially useful when working with large arrays, owing to this avoidance of costly in-memory temporaries.

[1]:

import numpy as np

import blosc2

A simple example¶

First, let’s create a couple of NDArray arrays. We’re going to write them to disk since in principle we are interested in large arrays (so big that they can’t fit in memory).

[2]:

shape = (500, 1000)
a = blosc2.linspace(0, 1, np.prod(shape), dtype=np.float32, shape=shape, urlpath="a.b2nd", mode="w")
b = blosc2.linspace(1, 2, np.prod(shape), dtype=np.float64, shape=shape, urlpath="b.b2nd", mode="w")

Now, let’s create an expression that involves a and b, called c.

[3]:

c = a**2 + b**2 + 2 * a * b + 1
print(c.info)  # at this stage, the expression has not been computed yet

type       : LazyExpr
expression : ((((o0 ** 2) + (o1 ** 2)) + ((2 * o0) * o1)) + 1)
operands   : {'o0': 'a.b2nd', 'o1': 'b.b2nd'}
shape      : (500, 1000)
dtype      : float64

We see that the type of c is a LazyExpr object. This object is a placeholder for the actual computation that will be done when we compute the expression. This is a very powerful feature because it allows us to build complex expressions without actually computing anything until we really need the result (or a portion of the result).

Now, let’s compute it. LazyExpr objects follow the LazyArray interface, and this provides several ways for performing the computation, depending on the object we want as the desired output.

1. Returning a NDArray array¶

First, let’s use the compute method. The result will be another NDArray array:

[4]:

d = c.compute()  # compute the expression
print(f"Class: {type(d)}")
print(f"Compression ratio: {d.schunk.cratio:.2f}x")

Class: <class 'blosc2.ndarray.NDArray'>
Compression ratio: 1.89x

We can specify different compression parameters for the result. For example, we can change the codec to ZLIB, use the bitshuffle filter, and set the compression level to 9:

[5]:

cparams = blosc2.CParams(codec=blosc2.Codec.ZLIB, filters=[blosc2.Filter.BITSHUFFLE], clevel=9)
d = c.compute(cparams=cparams)
print(f"Compression ratio: {d.schunk.cratio:.2f}x")

Compression ratio: 2.14x

Or, we can write the result to disk:

[6]:

d = c.compute(urlpath="result.b2nd", mode="w")
!ls -lh result.b2nd

/bin/bash: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8)
-rw-r--r-- 1 lshaw lshaw 2.1M Aug  4 13:50 result.b2nd

or compute just a slice of the result, which will only compute the necessary chunks of the result which intersect with the desired slice:

[7]:

d_slice = c.compute(item=slice(100, 200, 1))  # compute a slice of the expression

What is happening when we call the compute method? The operands are all NDArray arrays, chunked and stored on disk. When the compute method is called, the expression is executed, chunk-by-chunk, and the result stored, chunk-by-chunk. Hence at any given time, only a small amount of data (a chunk for each operand and the result) must be operated on in memory; and secondly, the computation is only performed on the necessary chunks required to give the result slice. Both operands and results are stored on disk here, so in fact you can operate with very large arrays in a very small memory footprint.

2. Returning a NumPy array¶

Now, let’s compute the expression and store the result in a NumPy array. For this, we will use the __getitem__ method. We may execute the expression with a slice, or without it, in which case the whole result will be computed:

[8]:

npd = c[100:200]  # compute a slice of the expression
print(f"Class: {type(npd)}, shape: {npd.shape}")
npd = c[:]  # compute the whole expression
print(f"Class: {type(npd)}, shape: {npd.shape}")

Class: <class 'numpy.ndarray'>, shape: (100, 1000)
Class: <class 'numpy.ndarray'>, shape: (500, 1000)

As you can see, in either case the result is a NumPy array now.

Depending on your needs, you can choose to get the result as a NDArray array or as a NumPy array. The former is more storage efficient, but the latter is more flexible when interacting with other libraries that do not support NDArray arrays, or for reading out data.

Saving expressions to disk¶

Lazy expressions may be saved to disk if all operands they refer to are also stored on disk. For this, use the save method of LazyArray objects. For example, let’s save the expression c to disk:

[9]:

c = a**2 + b**2 + 2 * a * b + 1
c.save(urlpath="expr.b2nd")

We may then load the expression with the open function, and check to see that the addresses of the operands are correct, and proceed to computation:

[10]:

c2 = blosc2.open("expr.b2nd")
print(c2.info)
d2 = c2.compute()
print(f"Result shape: {d2.shape}")

type       : LazyExpr
expression : (o0 ** 2 + o1 ** 2 + 2 * o0 * o1 + 1)
operands   : {'o0': 'a.b2nd', 'o1': 'b.b2nd'}
shape      : (500, 1000)
dtype      : float64

Result shape: (500, 1000)

Functions and Reductions¶

Functions¶

Lazy expressions also support many standard functions (essentially those available in NumPy), such as sin, cos, exp, log, etc. Let’s see an example:

[11]:

new_expr = blosc2.sin(a) + blosc2.cos(b) + blosc2.exp(a * b)
print(f"Class: {type(new_expr)}")
new_expr[1, :4]

Class: <class 'blosc2.lazyexpr.LazyExpr'>

[11]:

array([1.5426243 , 1.54262662, 1.54262895, 1.54263128])

Reductions¶

Reductions (mean, sum, variance etc.) are useful in many applications, such as data science, for summarising or reducing data. Reductions may also be incorporated as part of expressions, although their behaviour is somewhat different to that of other functions. Let’s see an example of a reduction:

[12]:

c = (a + b).sum()
c

[12]:

np.float64(999999.9999999473)

As we can see, the result is a scalar (not a LazyExpr). This is because reductions in expressions are always executed “eagerly” (i.e. on creation of the lazy expression). We can also specify the axis for the reduction:

[13]:

c = (a + b).sum(axis=1)
print(f"Shape of c: {c.shape}")
# Show the first 4 elements of the result
c[:4]

Shape of c: (500,)

[13]:

array([1001.998004  , 1005.998012  , 1009.99802   , 1013.99802799])

Reductions can also be part of more complex expressions, but in this case the final result may be a lazy expression (only the reduction is executed eagerly and its result stored as an operand in the full expression):

[14]:

c = (a + b).sum(axis=0) + 2 * a + 1
print(f"Type of c: {type(c)}. Shape of c: {c.shape}")
# Show the first 4 elements of the result
c[0, 0:4]

Type of c: <class 'blosc2.lazyexpr.LazyExpr'>. Shape of c: (500, 1000)

[14]:

array([1000.0010009 , 1000.00300336, 1000.00500598, 1000.00700854])

The __getitem__ method causes the remainder of the expression to execute (only using the relevant slices of the operands, including the result of the reduction (a + b).sum(axis=0)).

Note that the result of the reduction above has a different shape (500,) to the operand a, but the expression is still computed correctly. This is because the shape of the reduction is compatible with the shape of the operands according to the broadcasting convention.

Broadcasting¶

NumPy arrays support broadcasting, and so do NDArray arrays. Let’s see an example:

[15]:

b2 = b[0]  # take the first row of b
print(f"Shape of a: {a.shape}, shape of b2: {b2.shape}")

Shape of a: (500, 1000), shape of b2: (1000,)

We see that the shapes of a and b2 are different. However, as the shapes are compatible, we can still operate with them and the broadcasting will be done automatically (à la NumPy) and efficiently:

[16]:

c2 = a + b2
d2 = c2.compute()
print(f"Shape: {d2.shape}")

Shape: (500, 1000)

BONUS: Querying and Filtering NDArray arrays¶

The Blosc2 compute engine enables one to perform lazy queries on NDArray arrays with structured types. Let’s see an example.

[17]:

N = 1000_000
rng = np.random.default_rng(seed=1)
it = ((-x + 1, x - 2, rng.normal()) for x in range(N))
sa = blosc2.fromiter(
    it, dtype=[("A", "i4"), ("B", "f4"), ("C", "f8")], shape=(N,), urlpath="sa-1M.b2nd", mode="w"
)
print("First 3 rows:\n", sa[:3])

First 3 rows:
 [( 1, -2., 0.34558419) ( 0, -1., 0.82161814) (-1,  0., 0.33043708)]

One could select rows depending on the value of the different fields (A, B, C) in the following way, using a lazy boolean index

[18]:

A = sa["A"]
B = sa["B"]
C = sa["C"]
expr = sa[A > B]
print(type(expr))
expr[:]

<class 'blosc2.lazyexpr.LazyExpr'>

[18]:

array([(1, -2., 0.34558419), (0, -1., 0.82161814)],
      dtype=[('A', '<i4'), ('B', '<f4'), ('C', '<f8')])

In fact we can do the same in a more compact way by using an expression in string form inside the brackets. In both cases the result is a LazyExpr object, on which we then need to call the __getitem__ or compute method to get an actual array-like result:

[19]:

expr = sa["A > B"]
print(type(expr))
expr[:]

<class 'blosc2.lazyexpr.LazyExpr'>

[19]:

array([(1, -2., 0.34558419), (0, -1., 0.82161814)],
      dtype=[('A', '<i4'), ('B', '<f4'), ('C', '<f8')])

The expression can be arbitrarily complex:

[20]:

expr = sa["(A > B) & (sin(C) > .5)"]
expr[:]

[20]:

array([(0, -1., 0.82161814)],
      dtype=[('A', '<i4'), ('B', '<f4'), ('C', '<f8')])

Queries will also work on the individual fields (of type NDField), as they still possess references to the other fields of the parent array:

[21]:

print(type(C))
C["A > B"][:]

<class 'blosc2.ndarray.NDField'>

[21]:

array([0.34558419, 0.82161814])

Reductions are also supported, although since the array dtype is bespoke, the sum method fails on the full array

[22]:

C["((C > 0) & (B < 0))"].sum()  # succeeds
# sa["((C > 0) & (B < 0))"].sum() # fails

[22]:

np.float64(1.1672023355659444)

Finally, more complex queries can be done using the where() function. For example, let’s sum all the rows with the maximum of field A or field B:

[23]:

blosc2.where(A > B, A, B).sum()

[23]:

np.float32(499997670000.0)

Combining all this weaponry allows you to query your data quickly. As the computation is lazy, all the operations are grouped and executed together for maximum performance. The only exception is that, when a reduction is found, it is computed eagerly, but it can still be part of more general expressions, and can be saved to and loaded from disk.

Now that we’re finished, let’s delete the files we wrote to disk to clean up our directory.

[24]:

blosc2.remove_urlpath("a.b2nd")
blosc2.remove_urlpath("b.b2nd")
blosc2.remove_urlpath("expr.b2nd")
blosc2.remove_urlpath("sa-1M.b2nd")
blosc2.remove_urlpath("result.b2nd")

Summary¶

In this section, we have seen how to perform computations with NDArray arrays: how to create lazy expressions, compute them, and save them to disk. Also, we have looked at performing reductions, broadcasting, queries and combinations of all three. Lazy expressions allow you to build and compute complex computations from operands that can be in-memory, on-disk or remote (see `C2Array <reference/c2array.html>`__) in a simple and effective way.