Can a library designed for computing with compressed data ever hope to outperform highly optimized numerical engines like NumPy and Numexpr? The answer is complex, and it hinges on the "memory wall": the point at which limited memory bandwidth, rather than CPU speed, becomes the bottleneck. This post uses Roofline analysis to explore this very question, dissecting the performance of Blosc2 and revealing the surprising scenarios where it can gain a competitive edge.

TL;DR

Before we dive in, here's what we discovered:

For in-memory tasks, Blosc2's overhead can make it slower than Numexpr, especially on x86 CPUs.
This changes on Apple Silicon, where Blosc2's performance is much more competitive.
For on-disk tasks, Blosc2 consistently outperforms NumPy/Numexpr on both platforms.
The "memory wall" is real, and disk I/O is an even bigger one, which is where compression shines.

A Trip Down Memory Lane

Let's rewind to 2008. NumPy 1.0 was just a toddler, and the computing world was buzzing with the arrival of multi-core CPUs and their shiny new SIMD instructions. On the NumPy mailing list, a group of us were brainstorming how to harness this new power to make Python's number-crunching faster.

The idea seemed simple: trust newer compilers to use SIMD (and, possibly, data alignment) to perform operations on multiple data points at once. To test this, a simple benchmark was shared: multiply two large vectors element-wise. Developers from around the community ran the code and shared their results. What came back was a revelation.

For small arrays that fit snugly into the CPU's high-speed cache, SIMD was quite good at accelerating computations. But as soon as the arrays grew larger, the performance boost vanished. Some of us were already suspicious about the new "memory wall" that had been growing lately, seemingly due to the widening gap between CPU speeds and memory bandwidth. However, a conclusive answer (and solution) was still lacking.

But amidst the confusion, a curious anomaly emerged. One machine, belonging to NumPy legend Charles Harris, was consistently outperforming the rest—even those with faster processors. It made no sense. We checked our code, our compilers, everything. Yet, his machine remained inexplicably faster. The answer, when it finally came, wasn't in the software at all. Charles, a hardware wizard, had tinkered with his BIOS to overclock his RAM from 667 MHz to a whopping 800 MHz.

That was my lightbulb moment: for data-intensive tasks, raw CPU clock speed was not the limiting factor; memory bandwidth was what truly mattered.

This led me to a wild idea: what if we could make memory effectively faster? What if we could compress data in memory and decompress it on-the-fly, just in time for the CPU? This would slash the amount of data being moved, boosting our effective memory bandwidth. That idea became the seed for Blosc, a project I started in 2010 that has been my passion ever since. Now, 15 years later, it is time to revisit that idea and see how well it holds up in today's computing landscape.

Roofline Model: Understanding the Memory Wall

Not all computations are equally affected by the memory wall: in general, performance can be either CPU-bound or memory-bound. To diagnose which resource is the limiting factor, the Roofline model provides an insightful analytical framework. This model plots computational performance (floating-point operations per second) against arithmetic intensity (floating-point operations per byte of memory traffic) to visually determine whether a task is constrained by CPU speed or memory bandwidth.
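The model itself boils down to one line: attainable performance is the smaller of the CPU's peak throughput and the memory bandwidth multiplied by the arithmetic intensity. A minimal sketch follows; the function name and the machine figures (1000 GFLOP/s peak, 60 GB/s bandwidth) are made up for illustration and are not measurements from this post.

```python
def attainable_gflops(intensity, peak_gflops, bandwidth_gbs):
    """Roofline bound: min(compute roof, memory roof).

    intensity     -- arithmetic intensity, in FLOP per byte moved
    peak_gflops   -- peak compute throughput, in GFLOP/s
    bandwidth_gbs -- sustained memory bandwidth, in GB/s
    """
    return min(peak_gflops, bandwidth_gbs * intensity)


# Hypothetical machine, for illustration only
PEAK, BANDWIDTH = 1000.0, 60.0

# The "ridge point" is the intensity where both roofs meet
ridge = PEAK / BANDWIDTH

for intensity in (0.0625, 1.0, ridge, 100.0):
    bound = attainable_gflops(intensity, PEAK, BANDWIDTH)
    regime = "memory-bound" if intensity < ridge else "CPU-bound"
    print(f"{intensity:8.4f} FLOP/byte -> {bound:7.2f} GFLOP/s ({regime})")
```

Below the ridge point, only more bandwidth (or fewer bytes moved) raises the bound; above it, only a faster CPU does.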

/images/roofline-surprising-story/roofline-intro.avif

We will use Roofline plots to analyze Blosc2's performance, compared to that of NumPy and Numexpr. NumPy, with its highly optimized linear algebra backends, and Numexpr, with its efficient evaluation of element-wise expressions, together form a strong performance baseline for the full range of arithmetic intensities tested.

To highlight the role of memory bandwidth, we will conduct our benchmarks on an AMD Ryzen 7800X3D CPU at two different memory speeds: the standard 4800 MT/s and an overclocked 6000 MT/s. This allows us to directly observe how memory frequency impacts computational performance.
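For a rough sense of what those two settings mean, theoretical peak bandwidth is the transfer rate times the bus width times the number of channels. The sketch below assumes a dual-channel configuration with 64 data bits per channel, which is typical for desktop DDR5 but is not stated in this post; sustained bandwidth is always lower than this ceiling.

```python
def peak_bandwidth_gbs(mega_transfers_per_s, bus_width_bits=64, channels=2):
    """Theoretical peak memory bandwidth in GB/s (1 GB = 1e9 bytes)."""
    bytes_per_transfer = bus_width_bits // 8
    return mega_transfers_per_s * 1e6 * bytes_per_transfer * channels / 1e9


for speed in (4800, 6000):
    print(f"{speed} memory speed -> {peak_bandwidth_gbs(speed):.1f} GB/s peak")
```

Under these assumptions the overclock raises the ceiling from 76.8 GB/s to 96.0 GB/s, a 25% increase.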

To cover a range of computational scenarios, our benchmarks include five operations with varying arithmetic intensities:

Very Low: A simple element-wise addition (a + b + c).
Low: A moderately complex element-wise expression (sqrt(a + 2 * b + (c / 2)) ^ 1.2).
Medium: A highly complex element-wise calculation involving trigonometric and exponential functions.
High: Matrix multiplication on small matrices (labeled matmul0).
Very High: Matrix multiplication on large matrices (labeled matmul1 and matmul2).
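To see why these operations span such a wide range, one can estimate the arithmetic intensity of the two extremes by hand. The sketch below assumes float64 operands and ideal memory traffic (each operand read once, the result written once); the helper names are made up for illustration, and the matrix size is just an example.

```python
ITEMSIZE = 8  # bytes per float64 element


def elementwise_intensity(flops_per_element, n_inputs, n_outputs=1):
    """FLOP per byte for an element-wise expression."""
    bytes_per_element = (n_inputs + n_outputs) * ITEMSIZE
    return flops_per_element / bytes_per_element


def matmul_intensity(n):
    """FLOP per byte for an (n, n) @ (n, n) matrix product."""
    flops = 2 * n**3                    # one multiply and one add per inner step
    bytes_moved = 3 * n * n * ITEMSIZE  # read A and B, write C
    return flops / bytes_moved


print(elementwise_intensity(2, n_inputs=3))  # a + b + c: two additions -> 0.0625
print(matmul_intensity(1200))                # -> 100.0
```

The element-wise sum stays at a fixed, tiny intensity no matter how large the arrays are, whereas the intensity of a matrix product grows linearly with the matrix side (n / 12 in this idealized count).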

/images/roofline-surprising-story/roofline-mem-speed-AMD-7800X3D.png

The Roofline plot confirms that increasing memory speed only benefits memory-bound operations (low arithmetic intensity), while CPU-bound tasks (high arithmetic intensity) are unaffected, as expected. Although this might suggest the "memory wall" is not a major obstacle, low-intensity operations like element-wise calculations, reductions, and selections are extremely common and often create performance bottlenecks. Therefore, optimizing for memory performance remains crucial.

The In-Memory Surprise: Why Wasn't Compression Faster?

We benchmarked Blosc2 (both compressed and uncompressed) against NumPy and Numexpr. For this test, Blosc2 was configured with the LZ4 codec and shuffle filter, a setup known for its balance of speed and compression ratio. The benchmarks were executed on an AMD Ryzen 7800X3D CPU with memory speed set to 6000 MT/s, ensuring optimal memory bandwidth for the tests.
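The shuffle filter is what makes a fast codec effective on numerical data: it regroups the bytes of each element by significance, so that similar bytes end up next to each other. Here is a toy, pure-Python version of the idea, not Blosc2's optimized implementation; zlib from the standard library stands in for LZ4, and the helper names are made up for illustration.

```python
import zlib
from array import array


def shuffle(raw, itemsize):
    """Gather byte 0 of every item, then byte 1, and so on."""
    return b"".join(raw[i::itemsize] for i in range(itemsize))


def unshuffle(shuffled, itemsize):
    """Inverse of shuffle(): interleave the byte planes again."""
    n_items = len(shuffled) // itemsize
    out = bytearray(len(shuffled))
    for i in range(itemsize):
        out[i::itemsize] = shuffled[i * n_items:(i + 1) * n_items]
    return bytes(out)


data = array("q", range(100_000))  # 64-bit signed integers
raw = data.tobytes()
itemsize = data.itemsize

plain = zlib.compress(raw)
shuffled = zlib.compress(shuffle(raw, itemsize))

print(f"raw:                        {len(raw)} bytes")
print(f"compressed without shuffle: {len(plain)} bytes")
print(f"compressed with shuffle:    {len(shuffled)} bytes")

# The filter is lossless: undoing it restores the original bytes
roundtrip_ok = unshuffle(shuffle(raw, itemsize), itemsize) == raw
print("round trip ok:", roundtrip_ok)
```

For slowly varying data like this, most high-order bytes are identical, so grouping them produces long runs that any codec handles well.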

/images/roofline-surprising-story/roofline-7800X3D-mem-def.png

The analysis reveals a surprising outcome: for memory-bound operations, Blosc2 is up to five times slower than Numexpr. Although operating on compressed data provides a marginal improvement over uncompressed Blosc2, it is not enough to overcome this performance gap. This result is unexpected because Blosc2 leverages Numexpr internally, and the reduced memory traffic from compression should theoretically lead to better performance in these scenarios.

To understand this counter-intuitive result, we must examine Blosc2's core architecture. The key lies in its double partitioning scheme, which, while powerful, introduces an overhead that can negate the benefits of compression in memory-bound contexts.

Unpacking the Overhead: A Look Inside Blosc2's Architecture

The performance characteristics of Blosc2 are rooted in its double partitioning architecture, which organizes data into chunks and blocks.
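In one dimension, the idea can be sketched in a few lines: an element is found by first locating its chunk, then its block inside that chunk, and finally its offset inside that block. This is a simplified illustration, not Blosc2's actual indexing code; the real layout is n-dimensional, and the chunk and block lengths below are arbitrary.

```python
def locate(index, chunk_len, block_len):
    """Map a flat element index to (chunk, block, offset).

    Assumes a 1-D array split into chunks of `chunk_len` elements,
    each chunk split into blocks of `block_len` elements.
    """
    chunk, within_chunk = divmod(index, chunk_len)
    block, offset = divmod(within_chunk, block_len)
    return chunk, block, offset


print(locate(1234, chunk_len=1000, block_len=100))  # (1, 2, 34)
```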

/images/roofline-surprising-story/double-partition-b2nd.avif

This design is crucial for both aligning with the CPU's memory hierarchy and enabling efficient multidimensional array representation (important for operations such as n-dimensional slicing). However, this structure introduces an inherent overhead from additional indexing logic. In memory-bound scenarios, this latency counteracts the performance gains from reduced memory traffic, explaining why Blosc2 does not surpass Numexpr.

Conversely, as arithmetic intensity increases, the computational demands begin to dominate the total execution time. In these CPU-bound regimes, the partitioning overhead is effectively amortized, allowing Blosc2 to close the performance gap and eventually match NumPy's performance in tasks like large matrix multiplications.
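A toy cost model makes this trade-off concrete. It assumes that memory transfers and computation overlap perfectly (so the slower of the two sets the pace), that compression divides the bytes moved by the compression ratio, and that partitioning adds a fixed cost per block. All figures are invented for illustration; none are measurements from these benchmarks.

```python
def run_time(nbytes, flops, cratio=1.0, n_blocks=0,
             bandwidth=80e9, peak_flops=1e12, block_overhead=2e-6):
    """Estimated seconds for one pass over the data (toy model)."""
    memory_time = nbytes / cratio / bandwidth
    compute_time = flops / peak_flops
    return max(memory_time, compute_time) + n_blocks * block_overhead


NBYTES = 8e9
N_BLOCKS = 100_000

for label, flops in (("memory-bound", 1e9), ("CPU-bound", 1e13)):
    plain = run_time(NBYTES, flops)
    packed = run_time(NBYTES, flops, cratio=4.0, n_blocks=N_BLOCKS)
    print(f"{label}: plain {plain:.3f} s, compressed {packed:.3f} s")
```

With these made-up numbers, the per-block overhead outweighs the traffic saved in the memory-bound case, while in the CPU-bound case the same overhead is only about 2% of the total.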

Modern ARM Architectures

CPU architecture is a rapidly evolving field. To investigate how these changes impact performance, we extended our analysis to the Apple Silicon M4 Pro, a modern ARM-based processor.

/images/roofline-surprising-story/roofline-m4pro-mem-def.png

The results show that Blosc2 performs significantly better on this platform, narrowing the performance gap with NumPy/NumExpr, especially for operations on compressed data. While compute engines optimized for uncompressed data still hold an edge, these findings suggest that compression will play an increasingly important role in improving computational performance in the future.

However, while the in-memory results are revealing, they don't tell the whole story. Blosc2 was designed not just to fight the memory wall, but to conquer an even greater bottleneck: disk I/O. In-memory compression already has the benefit of fitting more data into RAM (which is valuable in its own right now that RAM prices have skyrocketed), but its true power is unleashed when computations move off-motherboard. Now, let's shift the battlefield to the disk and see how Blosc2 performs in its native territory.

A Different Battlefield: Blosc2 Shines with On-Disk Data

Blosc2's architecture extends its computational engine to operate seamlessly on data stored on disk, a significant advantage for large-scale analysis. This is particularly relevant in scenarios where datasets exceed available memory, necessitating out-of-core processing, as commonly encountered in data science, machine learning workflows or cloud computing environments.

Our on-disk benchmarks were designed to use datasets larger than the system's available memory to prevent filesystem caching from influencing the results. To establish a baseline, we implemented an out-of-core solution for NumPy/NumExpr, leveraging memory-mapped files. Here Blosc2 has a performance edge, particularly for memory-bound operations on compressed data, because it can move data to and from disk faster than memory-mapped NumPy arrays can.
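The memory-mapped approach follows a simple pattern: map the file into the address space and walk over it in fixed-size pieces, letting the operating system page data in on demand. NumPy provides `np.memmap` for this; the sketch below shows the same pattern with only the standard library, on a small temporary file. The chunk size and helper name are arbitrary, and this is not the benchmark code.

```python
import mmap
import os
import tempfile
from array import array


def chunked_sum(path, chunk_items=1024):
    """Sum the float64 values in a file, one chunk at a time."""
    total = 0.0
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # View the mapped bytes as doubles without copying them
            with memoryview(mm).cast("d") as view:
                for start in range(0, len(view), chunk_items):
                    total += sum(view[start:start + chunk_items])
    return total


# Write 10,000 doubles (0.0, 1.0, 2.0, ...) to a temporary file
fd, path = tempfile.mkstemp(suffix=".bin")
with os.fdopen(fd, "wb") as f:
    array("d", range(10_000)).tofile(f)

result = chunked_sum(path)
os.remove(path)
print(result)  # 49995000.0
```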

In this case, we've used high-performance NVMe SSDs (NVMe 4.0) to minimize the impact of disk speed on the results. We also switched to the ZSTD codec for Blosc2, as its superior compression ratio over LZ4 further minimizes data transfer to and from the disk.

First, let's see the results for the AMD Ryzen 7800X3D system:

/images/roofline-surprising-story/roofline-7800X3D-disk-def.png

The plots above show that Blosc2 outperforms both NumPy and Numexpr for all low-to-medium intensity operations. This is because the high latency of disk I/O amortizes the overhead of Blosc2's double partitioning scheme. Furthermore, the reduced bandwidth required for compressed data gives Blosc2 an additional performance advantage in this scenario.

Now, let's see the results for the Apple Silicon M4 Pro system:

/images/roofline-surprising-story/roofline-m4pro-disk-def.png

On the Apple Silicon M4 Pro system, Blosc2 again outperforms both NumPy and Numexpr for all on-disk operations, mirroring the results from the AMD system. However, the performance advantage is even more significant here, especially for memory-bound tasks. This is mainly because memory-mapped arrays are less efficient on Apple Silicon than on x86_64 systems, increasing the overhead for the NumPy/Numexpr baseline.

Roofline Plot: In-Memory vs On-Disk

To better understand the trade-offs between in-memory and on-disk processing with Blosc2, the following plot contrasts their performance characteristics for compressed data:

/images/roofline-surprising-story/roofline-mem-disk-def.png

A notable finding for the AMD system is that Blosc2's on-disk operations are noticeably faster than its in-memory operations, especially for memory-bound tasks (low arithmetic intensity). This is likely due to two factors: first, the larger datasets used for on-disk tests allow Blosc2 to use more efficient internal partitions (chunks and blocks), and second, parallel data reads from disk further reduce bandwidth requirements.

In contrast, for CPU-bound tasks (high arithmetic intensity), on-disk performance is comparable to, albeit slightly slower than, in-memory performance. The analysis also reveals a specific weakness: small matrix multiplications (matmul0) are significantly slower on-disk, identifying a clear target for future optimization.

In contrast to the AMD system, the Apple Silicon M4 Pro shows that Blosc2's on-disk operations are slower than in-memory, a difference that is most significant for memory-bound tasks. This performance disparity suggests that current on-disk optimizations may favor x86_64 architectures over ARM.

As with the AMD platform, CPU-bound operations exhibit similar performance for both on-disk and in-memory contexts. The notable exception remains the small matrix multiplication (matmul0), which performs significantly worse on-disk. This recurring pattern pinpoints a clear opportunity for future optimization efforts.

Finally, and in addition to its on-disk performance, Blosc2 offers a significant cost advantage. With the recent rise in SSD prices, compressing data on disk becomes an economically attractive strategy, allowing you to store more data in less space and thereby reduce hardware expenses.

Reproducibility

All the benchmarks and plots presented in this blog post can be reproduced. You are invited to run the scripts on your own hardware to explore the performance characteristics of Blosc2 in different environments. In case you get interesting results, please consider sharing them with the community!

Conclusions

In this blog post, we explored the Roofline model to analyze the performance of Blosc2, NumPy, and Numexpr. We've confirmed that memory-bound operations are significantly affected by the "memory wall", making data compression of interest when maximizing performance. However, for in-memory operations, the overhead of Blosc2's double partitioning scheme can be a limiting factor, especially on x86_64 architectures. Encouragingly, this performance gap narrows considerably on modern ARM platforms like Apple Silicon, suggesting a promising future.

The situation changes dramatically for on-disk operations. Here, Blosc2 consistently outperforms NumPy and Numexpr, as the high latency of disk I/O (even if we used SSDs here) amortizes its internal overhead. This makes Blosc2 a compelling choice for out-of-core computations, one of its primary use cases.

Overall, this analysis has provided valuable insights, highlighting the importance of the memory hierarchy. It has also exposed specific areas for improvement, such as the performance of small matrix multiplications. As Blosc2 continues to evolve, I am confident we can address these points and further enhance its performance, making it an even more powerful tool for numerical computations in Python.

Read more about ironArray SLU — the company behind Blosc2, Caterva2, Numexpr and other high-performance data processing libraries.

Compress Better, Compute Bigger!

Blosc2: A Universal Lazy Engine for Array Operations

Francesc Alted, Luke Shaw

2025-10-15 10:32


While compression is often seen merely as a way to save storage, the Blosc development team has long viewed it as a foundational element for high-performance computing. This philosophy is at the heart of Blosc2, which is not just a compression library but a powerful framework for handling large datasets. This post will highlight one of Python-Blosc2's most exciting capabilities: its lazy evaluation engine for array operations.

Libraries optimised for computation on large datasets that don't fit in memory - such as Dask or Spark - often use lazy evaluation of computation expressions. This typically speeds up evaluation since one can build the full chain of computations and only execute them when the final result is needed. Consequently, Python-Blosc2's compute engine also uses the lazy imperative paradigm, which proves to be both powerful and efficient.
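The core idea of lazy evaluation fits in a few lines of plain Python: building an expression only records what to do, and nothing runs until the result is requested. This toy class is purely illustrative and is unrelated to how Blosc2 implements its lazy expressions internally.

```python
class Lazy:
    """A deferred call: stores a function and its arguments."""

    def __init__(self, func, *args):
        self.func = func
        self.args = args

    def compute(self):
        # Evaluate nested lazy arguments first, then this node
        values = [a.compute() if isinstance(a, Lazy) else a for a in self.args]
        return self.func(*values)


calls = []  # records which operations have actually run


def traced_add(x, y):
    calls.append("add")
    return x + y


def traced_mul(x, y):
    calls.append("mul")
    return x * y


# Build (2 + 3) * 4 without running anything
expr = Lazy(traced_mul, Lazy(traced_add, 2, 3), 4)
calls_before = list(calls)

result = expr.compute()
print(calls_before, result, calls)  # [] 20 ['add', 'mul']
```

Because the whole expression is known before anything runs, an engine is free to evaluate it piece by piece instead of materializing every intermediate result.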

An additional benefit of the engine is its ability to act as a universal backend. Python-Blosc2 has a native blosc2.NDArray format, but it can also easily execute lazy operations on arrays from other popular libraries like NumPy, HDF5, Zarr, Xarray or TileDB - basically any array object which complies with a minimal protocol.

In the recent Python-Blosc2 3.10.x series, we added support for lazy evaluation of eager functions, expanding the capabilities of the compute engine, and making interaction with other formats easier. Let's explore how this works using an out-of-core tensordot operation as an example.

From Eager to Lazy with `blosc2.lazyexpr`

Functions which return a result with a different shape to the input operands - such as reductions or linear algebra operations - must be evaluated eagerly (computed and the result returned immediately). For example, blosc2.tensordot() executes eagerly.
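To see why tensordot falls in this category: the contracted axes disappear from the result, and the remaining axes of both operands are concatenated. The helper below computes the output shape following NumPy's tensordot semantics; its name is made up for illustration and it is not part of Blosc2.

```python
def tensordot_shape(shape_a, shape_b, axes_a, axes_b):
    """Shape of tensordot(a, b, axes=(axes_a, axes_b))."""
    for i, j in zip(axes_a, axes_b):
        if shape_a[i] != shape_b[j]:
            raise ValueError(f"axis {i} of a and axis {j} of b differ in length")
    free_a = [n for i, n in enumerate(shape_a) if i not in axes_a]
    free_b = [n for j, n in enumerate(shape_b) if j not in axes_b]
    return tuple(free_a + free_b)


axis = (0, 1)
print(tensordot_shape((600, 600, 600), (600, 600, 600), axis, axis))  # (600, 600)
```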

Nevertheless, we can defer this computation by wrapping the call in a string and passing it to blosc2.lazyexpr. This creates a LazyExpr object that represents the operation without executing it.

import blosc2

# Assume a and b are large, on-disk blosc2 arrays
axis = (0, 1)

# Create a lazy expression object
lexpr = blosc2.lazyexpr("tensordot(a, b, axes=(axis, axis))")

# The computation has not run yet.
# To execute it and save the result to a new persistent array:
out_blosc2 = lexpr.compute(urlpath="out.b2nd", mode="w")

This is useful, and highly efficient both in terms of computation time and memory usage, as we'll see later. But the real magic happens when we use this computation engine with other array formats.

One Engine, Many Backends

The blosc2.evaluate() function takes the same string expression but can operate on any array-like objects that follow the blosc2.Array protocol. This protocol simply requires the object to have shape, dtype, __getitem__, and __setitem__ attributes, which are standard in h5py, zarr, tiledb, xarray and numpy arrays.
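In duck-typing terms, the requirement amounts to something like the check below. Both the helper and the list-backed class are hypothetical stand-ins written for this illustration; Blosc2 performs its own checks internally.

```python
REQUIRED = ("shape", "dtype", "__getitem__", "__setitem__")


def follows_array_protocol(obj):
    """True if obj exposes the four attributes named above."""
    return all(hasattr(obj, name) for name in REQUIRED)


class ListArray:
    """Minimal 1-D container backed by a Python list."""

    def __init__(self, values, dtype="float64"):
        self._values = list(values)
        self.shape = (len(self._values),)
        self.dtype = dtype

    def __getitem__(self, key):
        return self._values[key]

    def __setitem__(self, key, value):
        self._values[key] = value


print(follows_array_protocol(ListArray([1.0, 2.0])))  # True
print(follows_array_protocol([1.0, 2.0]))             # False: no shape or dtype
```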

This means you can use Blosc2's efficient evaluation engine to perform out-of-core computations directly on your existing (HDF5, Zarr, etc.) datasets.

Example with HDF5

Here, we instruct blosc2.evaluate to run the tensordot operation on two h5py datasets and store the result in a third one.

import blosc2
import h5py

# Open HDF5 datasets
f = h5py.File("a_b_out.h5", "a")
a = f["a"]
b = f["b"]
out_hdf5 = f["out"]

# Use blosc2.evaluate() with HDF5 arrays
blosc2.evaluate("tensordot(a, b, axes=(axis, axis))", out=out_hdf5)

Notice that the expression string is identical to the one we used before. blosc2 inspects the objects in the expression's namespace and computes with them, regardless of their underlying format.

Example with Zarr

The same principle applies to Zarr arrays.

import blosc2
import zarr

# Open Zarr arrays
a = zarr.open("a.zarr", mode="r")
b = zarr.open("b.zarr", mode="r")
zout = zarr.open_array("out.zarr", mode="w", ...)

# Use blosc2.evaluate() with Zarr arrays
blosc2.evaluate("tensordot(a, b, axes=(axis, axis))", out=zout)

This makes blosc2.evaluate a powerful, backend-agnostic tool for out-of-core array computations.

Performance Comparison

As well as offering smooth integration, blosc2.evaluate is highly performant. Python-Blosc2 uses a lazy evaluation engine that integrates tightly with the Blosc2 format. This means that the computation is performed on-the-fly, without any intermediate copies. This is a huge advantage for large datasets, as it allows us to perform computations on arrays that don't fit in memory. In addition, it actively tries to leverage the hierarchical memory layout in modern CPUs, so that it can use both private and shared caches in the best way possible.
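The out-of-core pattern described above can be sketched as a loop that reads one slice of every operand, computes on it, and writes the result slice straight into the output, so that only one chunk per operand is in memory at a time. This is a simplified 1-D illustration with made-up names, not the Blosc2 engine, which also exploits multiple cores and the CPU cache hierarchy.

```python
def evaluate_chunked(func, operands, out, n_items, chunk_len):
    """Apply func element-wise, one chunk at a time, writing into out.

    operands and out only need to support slicing through
    __getitem__ / __setitem__ (plain lists are used here for simplicity).
    """
    for start in range(0, n_items, chunk_len):
        stop = min(start + chunk_len, n_items)
        pieces = [op[start:stop] for op in operands]  # load one chunk each
        out[start:stop] = [func(*vals) for vals in zip(*pieces)]


a = [float(i) for i in range(10)]
b = [2.0] * 10
out = [0.0] * 10

evaluate_chunked(lambda x, y: x + 2 * y, (a, b), out, n_items=10, chunk_len=4)
print(out)
```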

We ran a benchmark performing a tensordot operation (run over three different axis combinations) on two 3D arrays stored on disk; we then write the output to disk as well. We consider five approaches:

Blosc2 Native: Using blosc2.lazyexpr with blosc2.NDArray containers.
Blosc2+HDF5: Using blosc2.evaluate with HDF5 for storage.
Blosc2+Zarr: Using blosc2.evaluate with Zarr for storage.
Dask+HDF5: The combination of Dask for computation and HDF5 for storage.
Dask+Zarr: The combination of Dask for computation and Zarr for storage.

For each approach we plot the memory consumption vs. time for arrays of increasing size.

Results on two (600, 600, 600) float64 arrays (3 GB working set):

/images/tensordot_pure_persistent/tensordot-600c-amd.png

Results on two (1200, 1200, 1200) float64 arrays (26 GB working set):

/images/tensordot_pure_persistent/tensordot-1200c-amd.png

Results on two (1500, 1500, 1500) float64 arrays (50 GB working set):

/images/tensordot_pure_persistent/tensordot-1500c-amd.png

As can be seen, the amount of memory required by the different approaches is very different, although none requires more than a small fraction of the total working set (which is 3, 26 and 50 GB, respectively). This is because all approaches are out-of-core, and only load small chunks of data into memory at any given time.
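The quoted working-set sizes can be checked with a little arithmetic. The sketch assumes that the figure counts the two input operands only and is expressed in binary gigabytes (GiB); this is an inference from the numbers, not something stated above.

```python
ITEMSIZE = 8  # bytes per float64 element


def inputs_size_gib(side, n_arrays=2):
    """Size in GiB of n_arrays cubic float64 arrays of shape (side, side, side)."""
    return n_arrays * side**3 * ITEMSIZE / 2**30


for side in (600, 1200, 1500):
    print(f"({side}, {side}, {side}): {inputs_size_gib(side):.1f} GiB")
```

Under that assumption the three cases come out at roughly 3.2, 25.7 and 50.3 GiB, in line with the rounded figures quoted for the plots.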

The benchmarks were executed on an AMD Ryzen 9800X3D CPU, with 16 logical cores and 64 GB of RAM, using Ubuntu Linux 25.04. We have used the following versions of the libraries: python-blosc2 3.10.1, h5py 3.14.0, zarr 3.1.3, dask 2025.9.1, and numpy 2.3.3. All backends are using Blosc or Blosc2 as the compression backend, with the same codecs and filters, and using the same number of threads for compression and decompression.

Analysis

The results are revealing:

Blosc2 native is fastest: The tight integration between the Blosc2 compute engine and its native array format yields the best performance, making it the fastest solution by a significant margin.
Rapid computation time: blosc2.evaluate delivers impressive speed when operating directly on HDF5 and Zarr files, outperforming the more complex Dask+HDF5 and Dask+Zarr stacks. This is great news for anyone with existing HDF5/Zarr datasets.
Low memory usage: While the memory consumption for the Blosc2+HDF5 combination is a bit high (we are still analyzing why), the memory usage for the Blosc2 native approach is pretty low, making it suitable for systems with limited RAM and/or operands not fitting in memory.

This is not to say that Dask (or Spark) is an inferior choice for out-of-core computations. It's a great tool for large-scale data processing, especially when using clusters, is very flexible, and offers a wide range of functions; it's certainly a first-class citizen in the PyData ecosystem. However, if your needs are more modest and you want a simple, efficient way to run computations on existing datasets, using a core of common functions, and leveraging the full capabilities of modern multi-core systems, all without the overhead of a full Dask setup, blosc2.evaluate() is a fantastic alternative.

Conclusion

Python-Blosc2 is more than just a compression library for storing data in blosc2.NDArray objects; it's a high-performance computing tool as well. Its lazy evaluation engine provides a simple yet powerful way to handle out-of-core operations. The computation engine is completely decoupled from the compression backend, and thus can easily work with many different array formats; however, the compute engine meshes most tightly with the Blosc2 native array format, achieving maximal performance (in terms of both computation time and memory usage).

By adhering to the Array API standard, it acts as a universal engine that can work with different storage backends; we already implement more than 100 functions that are required by that standard, and the number will only grow in the future. If you have existing datasets in HDF5, Zarr or TileDB (and we are always looking forward to supporting even more formats), and need a lightweight, efficient way to run computations on them, blosc2.evaluate() is a fantastic tool to have in your arsenal. Of course, for maximum performance, the native Blosc2 format is a clear winner.

Our work continues. We are committed to enhancing Python-Blosc2 by expanding its supported operations, improving performance across backends, and adding new ones. Stay tuned for more updates! If you found this post useful, please share it. For questions or comments, reach out to us on GitHub.

TreeStore: Endowing Your Data With Hierarchical Structure

Francesc Alted

2025-08-17 10:33


When working with large and complex datasets, having a way to organize your data efficiently is crucial. blosc2.TreeStore is a powerful feature in the blosc2 library that allows you to store and manage your compressed arrays in a hierarchical, tree-like structure, much like a filesystem. This container, typically saved with a .b2z extension, can hold not only blosc2.NDArray or blosc2.SChunk objects but also metadata, making it a versatile tool for data organization.

What is a TreeStore?

A TreeStore lets you arrange your data into groups (like directories) and datasets (like files). Each dataset is a blosc2.NDArray or blosc2.SChunk instance, benefiting from Blosc2's high-performance compression. This structure is ideal for scenarios where data has a natural hierarchy, such as in scientific experiments, simulations, or any project with multiple related datasets.
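Conceptually, such a hierarchy can be represented as a flat mapping from path-like keys to datasets, with the groups implied by the paths. The sketch below illustrates only that idea; it says nothing about how TreeStore is implemented, and the helper name is made up.

```python
import posixpath


def groups_of(dataset_paths):
    """Return every group (directory-like node) implied by the paths."""
    groups = {"/"}
    for path in dataset_paths:
        parent = posixpath.dirname(path)
        # Walk up towards the root, registering each intermediate group
        while parent not in ("/", ""):
            groups.add(parent)
            parent = posixpath.dirname(parent)
    return groups


paths = ["/dataset0", "/group1/dataset1", "/group1/dataset2"]
print(sorted(groups_of(paths)))  # ['/', '/group1']
```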

Basic Usage: Creating and Populating a TreeStore

Creating a TreeStore is straightforward. You can use a with statement to ensure the store is properly managed. Inside the with block, you can create groups and datasets using a path-like syntax.

import blosc2
import numpy as np

# Create a new TreeStore
with blosc2.TreeStore("my_experiment.b2z", mode="w") as ts:
    # You can store numpy arrays, which are converted to blosc2.NDArray
    ts["/dataset0"] = np.arange(100)

    # Create a group with a dataset that can be a blosc2 NDArray
    ts["/group1/dataset1"] = blosc2.zeros((10,))

    # You can also store blosc2 arrays directly (vlmeta included)
    ext = blosc2.linspace(0, 1, 10_000, dtype=np.float32)
    ext.vlmeta["desc"] = "dataset2 metadata"
    ts["/group1/dataset2"] = ext

In this example, we created a TreeStore in a file named my_experiment.b2z.

/images/new-treestore-blosc2/tree-store-blog.png

It contains two groups, root and group1, each holding datasets.

Reading from a TreeStore

To access the data, you open the TreeStore in read mode ('r') and use the same path-like keys to retrieve your arrays.

# Open the TreeStore in read-only mode ('r')
with blosc2.TreeStore("my_experiment.b2z", mode="r") as ts:
    # Access a dataset
    dataset1 = ts["/group1/dataset1"]
    print("Dataset 1:", dataset1[:])  # Use [:] to decompress and get a NumPy array

    # Access the external array that has been stored internally
    dataset2 = ts["/group1/dataset2"]
    print("Dataset 2", dataset2[:])
    print("Dataset 2 metadata:", dataset2.vlmeta[:])

    # List all paths in the store
    print("Paths in TreeStore:", list(ts))

Dataset 1: [0 1 2 3 4 5 6 7 8 9]
Dataset 2 [0.0000000e+00 1.0001000e-04 2.0002000e-04 ... 9.9979997e-01 9.9989998e-01
 1.0000000e+00]
Dataset 2 metadata: {b'desc': 'dataset2 metadata'}
Paths in TreeStore: ['/group1/dataset2', '/group1', '/group1/dataset1', '/dataset0']

Advanced Usage: Metadata and Subtrees

TreeStore becomes even more powerful when you use metadata and interact with subtrees (groups).

Storing Metadata with `vlmeta`

You can attach variable-length metadata (vlmeta) to any group or to the root of the tree. This is useful for storing information like author names, dates, or experiment parameters. vlmeta is essentially a dictionary where you can store your metadata.

# Appending metadata to the TreeStore
with blosc2.TreeStore("my_experiment.b2z", mode="a") as ts:  # 'a' for append/modify
    # Add metadata to the root
    ts.vlmeta["author"] = "The Blosc Team"
    ts.vlmeta["date"] = "2025-08-17"

    # Add metadata to a group
    ts["/group1"].vlmeta["description"] = "Data from the first run"

# Reading metadata
with blosc2.TreeStore("my_experiment.b2z", mode="r") as ts:
    print("Root metadata:", ts.vlmeta[:])
    print("Group 1 metadata:", ts["/group1"].vlmeta[:])

Root metadata: {'author': 'The Blosc Team', 'date': '2025-08-17'}
Group 1 metadata: {'description': 'Data from the first run'}

Working with Subtrees (Groups)

A group object can be retrieved from the TreeStore and treated as a smaller, independent TreeStore. This capability is useful for better organizing your data access code.

with blosc2.TreeStore("my_experiment.b2z", mode="r") as ts:
    # Get the group as a subtree
    group1 = ts["/group1"]

    # Now you can access datasets relative to this group
    dataset2 = group1["dataset2"]
    print("Dataset 2 from group object:", dataset2[:])

    # You can also list contents relative to the group
    print("Contents of group1:", list(group1))

Dataset 2 from group object: [0.0000000e+00 1.0001000e-04 2.0002000e-04 ... 9.9979997e-01 9.9989998e-01
 1.0000000e+00]
Contents of group1: ['/dataset2', '/dataset1']
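The relative view that a group offers can be thought of as filtering the store's keys by a prefix and then stripping that prefix. A small sketch of that idea, with a made-up helper name and no relation to the actual TreeStore internals:

```python
def subtree_keys(paths, group):
    """Keys under `group`, expressed relative to it."""
    prefix = group.rstrip("/") + "/"
    return ["/" + p[len(prefix):] for p in paths if p.startswith(prefix)]


paths = ["/dataset0", "/group1/dataset1", "/group1/dataset2"]
print(sorted(subtree_keys(paths, "/group1")))  # ['/dataset1', '/dataset2']
```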

Iterating Through a TreeStore

You can easily iterate through all the nodes in a TreeStore to inspect its contents.

with blosc2.TreeStore("my_experiment.b2z", mode="r") as ts:
    for path, node in ts.items():
        if isinstance(node, blosc2.NDArray):
            print(f"Found dataset at '{path}' with shape {node.shape}")
        else:  # It's a group
            print(f"Found group at '{path}' with metadata: {node.vlmeta[:]}")

Found dataset at '/group1/dataset2' with shape (10000,)
Found group at '/group1' with metadata: {'description': 'Data from the first run'}
Found dataset at '/group1/dataset1' with shape (10,)
Found dataset at '/dataset0' with shape (100,)

That's it for this introduction to blosc2.TreeStore! You now know how to create, read, and manipulate a hierarchical data structure that can hold compressed datasets and metadata. You can find the source code for this example in the blosc2 repository.

Some Benchmarks

TreeStore is based on powerful abstractions from the blosc2 library, so it is very fast. Here are some benchmarks comparing TreeStore to other data storage formats, like HDF5 and Zarr. We have used two different configurations: one with small arrays, where sizes follow a normal distribution centered at 10 MB each, and the other with larger arrays, where sizes follow a normal distribution centered at 1 GB each. We have compared the performance of TreeStore against HDF5 and Zarr for both small and large arrays, measuring the time taken to create and read datasets. For comparing apples with apples, we have used the same compression codec (zstd) and filter (shuffle) for all three formats.

For assessing different platforms, we have used a desktop with an Intel i9-13900K CPU and 32 GB of RAM, running Ubuntu 25.04, and also a Mac mini with an Apple M4 Pro processor and 24 GB of RAM. The benchmarks were run using this script.

Results for the Intel i9-13900K desktop

100 small arrays (around 10 MB each) scenario:

/images/new-treestore-blosc2/benchmark_comparison_b2z-i13900K-10M.png

For the small arrays scenario, we can see that TreeStore is the fastest to create datasets (due to its use of multi-threading), but it is slower than HDF5 and Zarr when reading datasets. The reason for this is two-fold: first, TreeStore is designed to work using multi-threading, so it must set up the necessary threads at the beginning of the read operation, which takes some time; second, TreeStore uses NDArray objects internally, which store the data with a double partitioning scheme (chunks and blocks) that adds some overhead when reading small slices of data. Regarding the space used, TreeStore is the most efficient, very close to HDF5, and significantly more efficient than Zarr.

100 large arrays (around 1 GB each) scenario:

/images/new-treestore-blosc2/benchmark_comparison_b2z-i13900K-1G.png

When handling larger arrays, TreeStore maintains its lead in creation and full-read performance. Although HDF5 and Zarr offer faster access to small data slices, TreeStore compensates by being the most storage-efficient format, followed by HDF5, with Zarr being the most space-intensive.

Results for the Apple M4 Pro Mac mini

100 small arrays (around 10 MB each) scenario:

/images/new-treestore-blosc2/benchmark_comparison_b2z-MacM4-10M.png

100 large arrays (around 1 GB each) scenario:

/images/new-treestore-blosc2/benchmark_comparison_b2z-MacM4-1G.png

Consistent with the previous results, TreeStore is the most space-efficient format and the fastest for creating and reading datasets, particularly for larger arrays. Its performance is slower than HDF5 and Zarr only when reading small data slices (access time). This can be improved by reducing the number of threads from the default of eight, which lessens the thread setup overhead. For more details on this, see these slides comparing 8-thread vs 1-thread performance.

Notably, the Apple M4 Pro processor shows competitive performance against the Intel i9-13900K CPU, a high-end desktop processor that consumes up to 8x more power. This result underscores the efficiency of the ARM architecture in general and Apple silicon in particular.

Conclusion

In summary, blosc2.TreeStore offers a straightforward yet potent solution for hierarchically organizing compressed datasets. By merging the high-performance compression of blosc2.NDArray and blosc2.SChunk with a flexible, filesystem-like structure and metadata support, it stands out as an excellent choice for managing complex data projects.

As TreeStore is currently in beta, we welcome feedback and suggestions for its improvement. For further details, please consult the official documentation for blosc2.TreeStore.

Blosc2 Gets Fancy (Indexing)

Luke Shaw

2025-07-16 13:33

Comments

Update (2025-08-26): After some further effort, the 1D fast path mentioned below has been extended to the multidimensional case, with consequent speedups in Blosc2 3.7.3! See the plot below comparing maximum and minimum indexing times for the Blosc2-supported fancy indexing cases discussed in this post.

/images/blosc2-fancy-indexing/newfancybench.png

---

In response to requests from our users, the Blosc2 team has introduced a fancy indexing capability into the flagship Blosc2 NDArray object. In the future, this could be extended to other classes within the Blosc2 library, such as C2Array and LazyArray.

What is Fancy Indexing?

In many array libraries, most famously NumPy, fancy indexing refers to a vectorized indexing format which allows for simultaneous selection and reshaping of arrays (see this excerpt). For example, one may wish to select three entries from a 1D array:

arr = array([10, 11, 12])

which can be done like so:

arr[[1,2,1]]
>> array([11, 12, 11])

Note that the order of the indices is arbitrary (i.e. the elements of the output may occur in a different order to the original array) and indices may be repeated. Moreover, if the array is multidimensional, for example:

arr = array([[10, 11],
             [12, 13],
             [14, 15]])

then the output consists of the relevant rows:

arr[[1,2,0]]
>> array([[12, 13],
          [14, 15],
          [10, 11]])

and so on for arbitrary numbers of dimensions.

Indeed one can output arbitrary shapes, for example via:

arr[[[1,2],[0,1]]]
>> array([[[12, 13],
          [14, 15]],

         [[10, 11],
          [12, 13]]])

NumPy supports many different kinds of fancy indexing, a flavour of which can be seen from the following examples, where row and col are integer array objects. If they are not of the same shape then broadcasting conventions will be applied to try to massage the index into an understandable format.

arr[row]
arr[[row, col]]
arr[row, col]
arr[row[:, None], col]
arr[1, col] or arr[1:9, col]

In addition, one may use a boolean mask, in combination with integer indices, slices, or integer arrays via

arr[row[:, None], mask]

where the mask must have the same length as the indexed dimension(s).
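To make the cases above concrete, here is a small NumPy-only sketch. The array contents and the names `paired`, `grid` and `masked` are purely illustrative; the point is to show how the shapes of `row` and `col` decide whether the indices are paired element-wise or broadcast against each other.

```python
import numpy as np

arr = np.arange(12).reshape(3, 4)   # 3 rows, 4 columns
row = np.array([0, 2])
col = np.array([1, 3])

# Index arrays of equal shape are paired element-wise: (0, 1) and (2, 3).
paired = arr[row, col]

# Adding an axis to `row` gives shapes (2, 1) and (2,), which broadcast
# to (2, 2): every selected row meets every selected column.
grid = arr[row[:, None], col]

# A boolean mask along the column axis keeps the columns flagged True.
mask = np.array([False, True, False, True])
masked = arr[row[:, None], mask]

print(paired.tolist())  # [1, 11]
print(grid.tolist())    # [[1, 3], [9, 11]]
print(masked.tolist())  # [[1, 3], [9, 11]]
```

Here the mask selects the same columns as `col`, so `masked` and `grid` coincide.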

Support for Fancy Indexing and `ndindex`

Other libraries for management of large arrays such as zarr and h5py offer fancy indexing support, but neither is as comprehensive as NumPy. h5py, which uses the HDF5 format, is quite limited in that one may only use one integer array, no repeated indices are allowed, and the array must be sorted in increasing order, although mixed slice and integer array indexing is possible. zarr, via its vindex (for vectorized index), offers more support, but is rather limited when it comes to mixed indexing, as slices may not be used with integer arrays, and an integer array must be provided for every dimension of the array (i.e. arr[row] fails on any non-1D arr).

This makes it difficult (in the case of zarr) or impossible (in the case of h5py) to do the kind of reshaping we saw in the introduction (i.e. case 2 above arr[[[1,2],[0,1]]]). This lack of support is due to a combination of: 1) the computational difficulty of many of these operations; and 2) the at times counter-intuitive behaviour of fancy indexing (see the end of this blog post for more details).

When implementing fancy indexing for Blosc2 we strove to match the functionality of NumPy as closely as possible, and we have almost been able to do so — all the 6 cases mentioned above are perfectly feasible with this new Blosc2 release! There are only some minor edge cases which are not supported (see Example 2 in the Addendum). This would not have been possible without the excellent ndindex library, which offers many very useful, efficient functions for index conversion between different shapes and chunks. We can then call NumPy behind-the-scenes, chunk-by-chunk, and exploit its native support for fancy indexing, without having to load the entire array into memory.
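The chunk-by-chunk idea can be sketched with NumPy alone. This is not the Blosc2 implementation (which relies on ndindex and handles many more index types); it only covers a 1D integer index on the first axis, and `chunked_take` and the `read_chunk` callback are hypothetical names standing in for "decompress one chunk".

```python
import numpy as np

def chunked_take(read_chunk, nrows, chunk_rows, idx):
    """Gather rows `idx` (any order, repeats allowed) one chunk at a time.

    `read_chunk(start, stop)` is a hypothetical callback returning rows
    [start, stop) as a NumPy array, standing in for decompressing a chunk.
    """
    idx = np.asarray(idx)
    idx = np.where(idx < 0, idx + nrows, idx)   # normalise negative indices
    out = None
    for start in range(0, nrows, chunk_rows):
        stop = min(start + chunk_rows, nrows)
        # Positions in the output whose requested row lives in this chunk.
        hits = np.nonzero((idx >= start) & (idx < stop))[0]
        if hits.size == 0:
            continue                            # chunk not needed: skip it
        chunk = read_chunk(start, stop)
        if out is None:
            out = np.empty((idx.size,) + chunk.shape[1:], dtype=chunk.dtype)
        # NumPy fancy indexing inside the chunk, scattered to output slots.
        out[hits] = chunk[idx[hits] - start]
    return out

data = np.arange(40).reshape(10, 4)
idx = [7, 0, 7, -1, 3]
result = chunked_take(lambda a, b: data[a:b], nrows=10, chunk_rows=3, idx=idx)
print(np.array_equal(result, data[idx]))  # True
```

Only the chunks that contain requested rows are ever read, which is what keeps the full array out of memory.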

Results: Blosc2, Zarr, H5Py and NumPy

Hence, when averaging over the indexing cases above on 2D arrays of varying sizes, we observe only a minor slowdown for Blosc2 compared to NumPy when the array size is small compared to total memory (24GB), suggesting a small chunking-and-indexing overhead. As expected, when the array grows to an appreciable fraction of memory (16GB), loading the full NumPy array into memory starts to impact performance. The black error bars in the plots indicate the maximum and minimum times observed over the indexing cases (for which there is clearly a large variation).

Note that for cases 4 and 6 with large row or col index arrays, broadcasting causes the resulting index (stored in memory) to be very large, and even for array sizes of 2GB computation is too slow. In the future, we would like to see if this can be improved.

/images/blosc2-fancy-indexing/fancyIdxNumpyBlosc22D.png

Blosc2 is also as fast as or faster than Zarr and HDF5, even for the limited use cases that the latter two libraries both support. HDF5 in particular is especially slow when the indexing array is very large.

/images/blosc2-fancy-indexing/fancyIdxNumpyBlosc2ZarrHDF52D.png

These plots have been generated using a Mac mini with the Apple M4 Pro processor. The benchmark is available on the Blosc2 GitHub repo here.

Conclusion

Blosc2 offers a powerful and flexible fancy indexing functionality that is more extensive than that of Zarr and H5Py, while also being able to handle large arrays on-disk without loading them into memory. This makes it a great choice for applications that require complex indexing operations on large datasets. Give it a try in your own projects! If you have questions, the Blosc2 community is here to help.

If you appreciate what we're doing with Blosc2, please think about supporting us. Your help lets us keep making these tools better.

Addendum: Oindex, Vindex and FancyIndex via Two Examples

Zarr's implementation of fancy indexing is packaged as vindex (vectorized indexing). It also offers another indexing functionality, called orthogonal indexing, via oindex.

The reason for this dual support becomes clear when one considers a simple example.

Example 1

For a 2D array, we have seen that the fancy-indexing rules will cause the two index arrays below to be broadcast together:

arr[[0, 1], [2, 3]] -> [arr[0,2], arr[1,3]]

giving an output with two elements of shape (2,). This is vindexing.

However, one could understand this indexing as selecting rows 0 and 1 in the array, and then their intersection with columns 2 and 3. This gives an output with four elements of shape (2, 2), with elements:

[[arr[0,2], arr[0,3]],
 [arr[1,2], arr[1,3]]]

This is oindexing. Clearly, given the same index, the output is in general different; it is for this reason that the debate about fancy indexing can be quite polemical, and why there is a movement to introduce the vindex/oindex duality in NumPy.
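Both readings can be reproduced in plain NumPy, which may help to see the difference. In this sketch the default indexing plays the role of vindex, and `np.ix_` is used to emulate oindex (NumPy itself has no `oindex` attribute; the 4x4 array is just an illustration).

```python
import numpy as np

arr = np.arange(16).reshape(4, 4)
rows, cols = [0, 1], [2, 3]

# Vectorized ("vindex"-style) selection: index arrays are paired up.
v = arr[rows, cols]

# Orthogonal ("oindex"-style) selection: np.ix_ builds an open mesh so
# that every selected row is combined with every selected column.
o = arr[np.ix_(rows, cols)]

print(v.tolist())  # [2, 7]
print(o.tolist())  # [[2, 3], [6, 7]]
```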

Example 2

I have glossed over this until now, but vindex is not the same as fancy indexing. For this reason Zarr does not support all the functionality of fancy indexing, since it only supports vindex. The most important distinction between the two is that vindex seeks to avoid certain unexpected fancy indexing behaviour, as can be seen by considering a 3D NumPy array of shape (X, Y, Z) as in the example here. Consider the unexpected behaviour of:

arr[:10, :, [0,1]] has shape (10, Y, 2).

arr[0, :, [0, 1]] has shape (2, Y), not (Y, 2)!!

NumPy indexing treats non-slice indices differently, and will always put the axes introduced by the index array first, unless the non-slice indexes are consecutive, in which case it will try to massage the result to something intuitive (which normally coincides with the result of an oindex) — hence arr[:, 0, [0, 1]] has shape (X, 2), not (2, X).
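These shape rules are easy to check directly in NumPy. The sizes X, Y, Z below are arbitrary values picked for the sketch.

```python
import numpy as np

X, Y, Z = 12, 5, 4
arr = np.zeros((X, Y, Z))

# A single index array stays in place.
print(arr[:10, :, [0, 1]].shape)  # (10, 5, 2)

# Non-slice indices separated by a slice: the indexed axis moves to the front.
print(arr[0, :, [0, 1]].shape)    # (2, 5)

# Consecutive non-slice indices: the result keeps the intuitive position.
print(arr[:, 0, [0, 1]].shape)    # (12, 2)
```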

The hypothesised NumPy vindex would eliminate this transposition behaviour, and be internally consistent, always putting the axes introduced by the index array first. Unfortunately, this is difficult and costly, and so the alternative is to simply not allow such indexing and throw an error, or force the user to be very specific.

Blosc2 will throw an error when one inserts a slice between array indices:

arr[:, 0, [0, 1]] -> shape (X, 2)
arr.vindex[0, :, [0,1]] -> ERROR

Zarr's vindex (called by __getitem__), by requiring integer array indices for all dimensions, throws an error for all mixed indices of this type:

arr[:, 0, [0, 1]] -> ERROR
arr[0, :, [0,1]] -> ERROR

Thus to reproduce the result of Blosc2 for the first case, one must use an explicit index array:

idx = np.array([0,1]).reshape(1,-1)
arr[np.arange(X).reshape(-1,1), 0 , idx] -> shape (X, 2)

For both Blosc2 and Zarr, one must use an explicit index array like so for the second case:

arr[0, np.arange(Y).reshape(-1,1), idx] -> shape (Y, 2)
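As a sanity check, the explicit index arrays can be tried on a plain NumPy array, where the mixed forms are also available for comparison. The sizes are arbitrary; the sketch only illustrates the indexing semantics, not Blosc2 or Zarr themselves.

```python
import numpy as np

X, Y, Z = 6, 5, 4
arr = np.arange(X * Y * Z).reshape(X, Y, Z)
idx = np.array([0, 1]).reshape(1, -1)

# First case: same result as the mixed index arr[:, 0, [0, 1]].
first = arr[np.arange(X).reshape(-1, 1), 0, idx]

# Second case: the (Y, 2) layout, i.e. the transpose of what NumPy
# returns for arr[0, :, [0, 1]].
second = arr[0, np.arange(Y).reshape(-1, 1), idx]

print(first.shape, second.shape)  # (6, 2) (5, 2)
print(np.array_equal(second, arr[0, :, [0, 1]].T))  # True
```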

Hopefully you now understand why fancy indexing can be so tricky, and why few libraries seek to support it to the same extent as NumPy - some would say it is perhaps not even desirable to do so!

Efficient array concatenation launched in Blosc2

Francesc Alted

2025-06-16 13:33

Comments

Update (2025-06-23): Recently, Luke Shaw added a stack() function in Blosc2, using the concatenate feature described here. The new function allows you to stack arrays along a new axis, which is particularly useful for creating higher-dimensional arrays from lower-dimensional ones. We have added a section at the end of this post to show the usage and performance of this new function.

---

Blosc2 just got a cool new trick: super-efficient array concatenation! If you've ever needed to combine several arrays into one, especially when dealing with lots of data, this new feature is for you. It's built to be fast and use as little memory as possible. This is especially true if your array sizes line up nicely with Blosc2's internal "chunks" (think of these as the building blocks of your compressed data). When this alignment happens, concatenation is lightning-fast, making it perfect for demanding tasks.

You can use this new concatenate feature whether you're coding in C or Python, and it works with any Blosc2 NDArray (Blosc2's way of handling multi-dimensional arrays).

Let's see how easy it is to use in Python. If you're familiar with NumPy, the blosc2.concat function will feel very similar:

import blosc2
# Create some sample arrays
a = blosc2.full((10, 20), 1, urlpath="arrayA.b2nd", mode="w")
b = blosc2.full((10, 20), 2, urlpath="arrayB.b2nd", mode="w")
c = blosc2.full((10, 20), 3, urlpath="arrayC.b2nd", mode="w")
# Concatenate the arrays along the first axis
result = blosc2.concat([a, b, c], axis=0, urlpath="destination.b2nd", mode="w")
# The result is a new Blosc2 NDArray containing the concatenated data
print(result.shape)  # Output: (30, 20)
# You can also concatenate along other axes
result_axis1 = blosc2.concat([a, b, c], axis=1, urlpath="destination_axis1.b2nd", mode="w")
print(result_axis1.shape)  # Output: (10, 60)

The blosc2.concat function is pretty straightforward. You give it a list of the arrays you want to join together. You can also tell it which way to join them using the axis parameter (like joining them end-to-end or side-by-side).

A really handy feature is that you can use urlpath and mode to save the combined array directly to a file. This is great when you're working with huge datasets because you don't have to load everything into memory at once. What you get back is a brand new, persistent Blosc2 NDArray with all your data combined.

Aligned versus Non-Aligned Concatenation

Blosc2's concatenate function is smart. It processes your data in small pieces of compressed data (chunks). This has two consequences. The first is that you can join very large arrays, stored on your disk, chunk-by-chunk without using up all your computer's memory. Secondly, if the chunks fit neatly into the arrays to be concatenated, the process is much faster. Why? Because Blosc2 can avoid a lot of extra work, chiefly decompressing and re-compressing the chunks.

Let's look at some pictures to see what "aligned" and "unaligned" concatenation means. "Aligned" means that chunk boundaries of the arrays to be concatenated line up with each other. "Unaligned" means that this is not the case.

/images/blosc2-new-concatenate/concat-unaligned.png

/images/blosc2-new-concatenate/concat-aligned.png

The pictures show why "aligned" concatenation is faster. In Blosc2, all data pieces (chunks) inside an array must be the same size. So, if the chunks in the arrays you're joining match up ("aligned"), Blosc2 can combine them very quickly. It doesn't have to rearrange the data into new, same-sized chunks for the final array. This is a big deal for large arrays.

If the arrays are "unaligned," Blosc2 has more work to do. It has to decompress and then re-compress the data to make the new chunks fit, which takes longer. There's one more small detail for this fast method to work: the first array's size needs to be a neat multiple of its chunk size along the direction you're joining.
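The conditions described above can be written down as a tiny check. This is a simplified reading of the text, not the test that Blosc2 performs internally: `can_use_fast_path` is a hypothetical helper, and it assumes that "aligned" means both arrays share the same chunk shape and the first array ends exactly on a chunk boundary along the join axis.

```python
def can_use_fast_path(shape_a, chunks_a, chunks_b, axis):
    """Return True if the two conditions described in the text hold.

    1. Both arrays use the same chunk shape, so chunk boundaries line up.
    2. The first array's extent along `axis` is an exact multiple of its
       chunk size along that axis, so it ends on a chunk boundary.
    """
    same_chunks = tuple(chunks_a) == tuple(chunks_b)
    ends_on_boundary = shape_a[axis] % chunks_a[axis] == 0
    return same_chunks and ends_on_boundary

# 1000 rows in chunks of 100 rows: the first array ends on a boundary.
print(can_use_fast_path((1000, 200), (100, 200), (100, 200), axis=0))  # True
# 1050 rows: the last chunk is partial, so chunks must be rebuilt.
print(can_use_fast_path((1050, 200), (100, 200), (100, 200), axis=0))  # False
```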

A big plus with Blosc2 is that it always processes data in these small chunks. This means it can combine enormous arrays without ever needing to load everything into your computer's memory at once.

Performance

To show you how much faster this new concatenate feature is, we did a speed test using LZ4 as the internal compressor in Blosc2. We compared it to the usual way of joining arrays with numpy.concatenate.

/images/blosc2-new-concatenate/benchmark-lz4-20k-i13900K.png

The speed tests show that Blosc2's new concatenate is rather slow for small arrays (like 1,000 x 1,000). This is because it has to do a lot of work to set up the concatenation. But when you use larger arrays (like 20,000 x 20,000) that start to exceed the memory limits of our test machine (32 GB of RAM), Blosc2's new concatenate performance is much better, nearing the performance of NumPy's concatenate function.

However, if your array sizes line up well with Blosc2's internal chunks ("aligned" arrays), Blosc2 becomes much faster, typically more than 10x faster than NumPy for large arrays. This is because it can skip a lot of the work of decompressing and re-compressing data, and the cost of copying compressed data is also lower (by a factor equal to the achieved compression ratio, which in this case is around 10x).

Using the Zstd compressor with Blosc2 can make joining "aligned" arrays even quicker, since Zstd is good at making data smaller.

/images/blosc2-new-concatenate/benchmark-zstd-20k-i13900K.png

So, when arrays are aligned, there's less data to copy (compression ratios here are around 20x), which speeds things up. If arrays aren't aligned, Zstd is a bit slower than the previous compressor (LZ4) because its decompression and re-compression algorithm is slower. Conclusion? Pick the compressor that works best for what you're doing!

Stacking Arrays

We've also added a new stack() function in Blosc2 that uses the concatenate feature. This function lets you stack arrays along a new axis, which is super useful for creating higher-dimensional arrays from lower-dimensional ones. Here's how it works:

import blosc2
# Create some sample arrays
a = blosc2.full((10, 20), 1, urlpath="arrayA.b2nd", mode="w")
b = blosc2.full((10, 20), 2, urlpath="arrayB.b2nd", mode="w")
c = blosc2.full((10, 20), 3, urlpath="arrayC.b2nd", mode="w")
# Stack the arrays along a new axis
stacked_result = blosc2.stack([a, b, c], axis=0, urlpath="stacked_destination.b2nd", mode="w")
print(stacked_result.shape)  # Output: (3, 10, 20)
# You can also stack along other axes
stacked_result_axis1 = blosc2.stack([a, b, c], axis=1, urlpath="stacked_destination_axis1.b2nd", mode="w")
print(stacked_result_axis1.shape)  # Output: (10, 3, 20)
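How stacking can be built on top of concatenation is easy to illustrate with plain NumPy: give each input a new axis of length 1, then join along that axis. This is only a sketch of the general idea, with the hypothetical helper `stack_via_concatenate`; it does not show how blosc2.stack is written internally.

```python
import numpy as np

a = np.full((10, 20), 1)
b = np.full((10, 20), 2)
c = np.full((10, 20), 3)

def stack_via_concatenate(arrays, axis=0):
    # Give every input a new length-1 axis, then join along that axis.
    expanded = [np.expand_dims(arr, axis) for arr in arrays]
    return np.concatenate(expanded, axis=axis)

s0 = stack_via_concatenate([a, b, c], axis=0)
s1 = stack_via_concatenate([a, b, c], axis=1)
print(s0.shape)  # (3, 10, 20)
print(s1.shape)  # (10, 3, 20)
```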

Benchmarks for the stack() function show that it performs similarly to the concat() function, especially when the input arrays are aligned. Here are the results for the same data sizes and machine used in the previous benchmarks, and using the LZ4 compressor.

/images/blosc2-new-concatenate/stack-lz4-20k-i13900K.png

And here are the results for the Zstd compressor.

/images/blosc2-new-concatenate/stack-zstd-20k-i13900K.png

As can be seen, the stack() function is also very fast when the input arrays are aligned, and it performs well even for large arrays that don't fit into memory. Incidentally, when stacking along the last dimension, blosc2.stack() is slightly faster than numpy.stack() even when the arrays are not aligned; we are not sure why this is the case, but the fact that we can reproduce this behaviour is probably a sign that NumPy could optimize this use case better.

Conclusion

Blosc2's new concatenate and stack features are a great way to combine arrays quickly and without using too much memory. They are especially fast when your array sizes are an exact multiple of Blosc2's "chunks" (aligned arrays), making them perfect for big data jobs. They also work well for large arrays that don't fit into memory, as they process data in small chunks. Finally, they are supported in both C and Python, so you can use them in your favorite programming language.

Give it a try in your own projects! If you have questions, the Blosc2 community is here to help.

If you appreciate what we're doing with Blosc2, please think about supporting us. Your help lets us keep making these tools better.

Make NDArray Transposition Fast (and Compressed!) within Blosc 2

Ricardo Sales Piquer

2025-04-08 09:00

Comments

Update (2025-04-30): The transpose function is now officially deprecated and replaced by the new permute_dims. This transition follows the Python array API standard v2022.12, aiming to make Blosc2 even more compatible with modern Python libraries and workflows.

In contrast with the previous transpose, the new permute_dims offers:

Support for arrays of any number of dimensions.
Full handling of arbitrary axis permutations, including support for negative indices.

Moreover, I have found a new way to transpose matrices more efficiently for Blosc2. This blog contains updated plots and discussions.

---

Matrix transposition is more than a textbook exercise: it plays a key role in memory-bound operations where layout and access patterns can make or break performance.

When working with large datasets, efficient data transformation can significantly improve both performance and compression ratios. In Blosc2, we recently implemented a matrix transposition function, a fundamental operation that rearranges data by swapping rows and columns. In this post, I'll share the design insights, implementation details, performance considerations that went into this feature, and an unexpected NumPy behaviour.

What was the old behavior?

Previously, calling blosc2.transpose(A) would transpose the data within each chunk, and a new chunk shape would be chosen for the output array. However, this new chunk shape was not necessarily aligned with the new memory access patterns induced by the transpose. As a result, even though the output looked correct, accessing data along the new axes still incurred a significant overhead due to an increased number of I/O operations. This led to performance bottlenecks, particularly in workloads that rely on efficient memory access patterns.

Transposition explanation for old operation

What's new?

The permute_dims function in Blosc2 has been redesigned to greatly improve performance when working with compressed, multidimensional arrays. The main improvement lies in transposing the chunk layout alongside the array data, which eliminates the overhead of cross-chunk access patterns.

The new implementation transposes the chunk layout along with the data. For example, an array with chunks=(2, 5) that is transposed with axes=(1, 0) will result in an array with chunks=(5, 2). This ensures that the output layout matches the new data order, making block access contiguous and efficient.

This logic generalizes to N-dimensional arrays and applies regardless of their shape or chunk configuration.
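The rule for the output layout is short enough to write out. The helper below, `permuted_layout`, is a hypothetical name for this sketch; it only computes the shape and chunk shape one would expect after the permutation, following the chunks=(2, 5) to chunks=(5, 2) example, and does not touch any data.

```python
def permuted_layout(shape, chunks, axes):
    """Return the (shape, chunks) expected after permuting dimensions."""
    ndim = len(shape)
    axes = tuple(ax % ndim for ax in axes)   # accept negative axis indices
    if sorted(axes) != list(range(ndim)):
        raise ValueError("axes must be a permutation of the dimensions")
    new_shape = tuple(shape[ax] for ax in axes)
    new_chunks = tuple(chunks[ax] for ax in axes)
    return new_shape, new_chunks

new_shape, new_chunks = permuted_layout((4, 10), (2, 5), axes=(1, 0))
print(new_shape, new_chunks)  # (10, 4) (5, 2)
```

The same function works for any number of dimensions, since the chunk shape is simply permuted with the same axes as the array shape.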

Transposition explanation for new operation

Performance benchmark: Transposing matrices with Blosc2 vs NumPy

To evaluate the performance of the new matrix transposition implementation in Blosc2, I conducted a series of benchmarks comparing it to NumPy, which serves as the baseline due to its widespread use and high optimization level. The goal was to observe how both approaches perform when handling matrices of increasing size and to understand the impact of different chunk configurations in Blosc2.

Benchmark setup

All tests were conducted using matrices filled with float64 values, starting from small 100×100 matrices and scaling up to very large matrices of size 17000×17000, covering data sizes from just a few megabytes to over 2 GB. Each matrix was transposed using the Blosc2 API under different chunking strategies.

In the case of NumPy, I used the .transpose() function followed by a .copy() to ensure that the operation was comparable to that of Blosc2. This is because, by default, NumPy's transposition is a view operation that only modifies the array's metadata, without actually rearranging the data in memory. Adding .copy() forces NumPy to perform a real memory reordering, making the comparison with Blosc2 fair and accurate.
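The difference between the view and the copy can be observed directly. A small sketch (the 2x3 array is arbitrary):

```python
import numpy as np

a = np.arange(6, dtype=np.float64).reshape(2, 3)
view = a.transpose()            # metadata-only: new strides, same buffer
copied = a.transpose().copy()   # data physically rearranged in memory

print(np.shares_memory(a, view))     # True
print(np.shares_memory(a, copied))   # False
print(view.flags["C_CONTIGUOUS"])    # False
print(copied.flags["C_CONTIGUOUS"])  # True
```

Timing only `a.transpose()` would therefore measure almost nothing, which is why the copy is needed for a meaningful comparison.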

For Blosc2, I tested the transposition function across several chunk configurations. Specifically, I included:

Automatic chunking, where Blosc2 decides the optimal chunk size internally.
Fixed chunk sizes: (150, 300), (1000, 1000) and (5000, 5000).

These chunk sizes were chosen to represent a mix of square and rectangular blocks, allowing me to study how chunk geometry impacts performance, especially for very large matrices.

Each combination of library and configuration was tested across all matrix sizes, and the time taken to perform the transposition was recorded in seconds. This comprehensive setup makes it possible to compare not just raw performance, but also how well each method scales with data size and structure.

Results and discussion

The chart below summarizes the benchmark results for matrix transposition using NumPy and Blosc2, across various chunk shapes and matrix sizes.

Transposition performance for new method

While NumPy sets a strong performance baseline, the behaviour of Blosc2 becomes particularly interesting when we dive into how different chunk configurations affect transposition speed. The following observations highlight how crucial the choice of chunk shape is to achieving optimal performance.

Large square chunks (e.g., (4000, 4000)) showed the worst performance, especially with large matrices. Despite having fewer chunks, their size seems to hinder cache performance and introduces memory pressure that degrades throughput. Execution times were consistently higher than other configurations.
Small rectangular chunks such as (150, 300) also underperformed. As matrix size grew, execution times increased significantly, reaching nearly 3 seconds at around 2200 MB, likely due to poor cache utilization and the overhead of managing many tiny chunks.
Mid-sized square chunks like (1000, 1000) delivered consistently solid results across all tested sizes. Their timings stay below ~1.2 s with minimal variance, making them a reliable manual choice.
Automatically selected chunks consistently achieved the best performance. By adapting chunk layout to the data shape and size, the internal heuristics outpaced all fixed configurations, even rivaling plain NumPy transpose times.

The second plot provides a direct comparison between the standard NumPy transpose and the newly optimized Blosc2 version. It shows that Blosc2’s optimized implementation closely matches NumPy's performance, even for larger matrices. The results confirm that with good chunking strategies and proper memory handling, Blosc2 can achieve performance on par with NumPy for transposition operations.

Conclusion

The benchmarks highlight one key insight: Blosc2 is highly sensitive to chunk shape, and its performance can range from excellent to poor depending on how it is configured. With the right chunk size, Blosc2 can offer both high-speed transpositions and advanced features like compression and out-of-core processing. However, misconfigured chunks, especially those that are too big or too small, can drastically reduce its effectiveness. This makes chunk tuning an essential step for anyone seeking to get the most out of Blosc2 for large-scale matrix operations.

Appendix A: Unexpected NumPy behaviour

While running the benchmarks, three unusual spikes were consistently observed in the performance of NumPy around matrices of approximately 500 MB, 1100 MB and 2000 MB in size. This can be clearly seen in the plot below:

This sudden increase in transposition time is consistently reproducible and does not seem to correlate with the gradual increase expected from larger memory sizes. We have also observed this behaviour on other machines, although at different sizes.

This observation reinforces the importance of testing under realistic and varied conditions, as performance is not always linear or intuitive.

Optimizing chunks for matrix multiplication in Blosc2

Ricardo Sales Piquer

2025-03-12 09:00

Comments

As data volumes continue to grow in fields like machine learning and scientific computing, optimizing fundamental operations like matrix multiplication becomes increasingly critical. Blosc2's chunk-based approach offers a new path to efficiency in these scenarios.

Matrix Multiplication

Matrix multiplication is a fundamental operation in many scientific and engineering applications. With the introduction of matrix multiplication into Blosc2, users can now perform this operation on compressed arrays efficiently. The key advantages of having matrix multiplication in Blosc2 include:

Compressed matrices in memory: Blosc2 enables matrices to be stored in a compressed format without sacrificing the ability to perform operations directly on them.
Efficiency with chunks: In computation-intensive applications, matrix multiplication can be executed without fully decompressing the data, operating on small blocks of data independently, saving both time and memory.
Out-of-core computation: When matrices are too large to fit in main memory, Blosc2 facilitates out-of-core processing. Data stored on disk is read and processed in optimized chunks, allowing matrix multiplication operations without loading the entire dataset into memory.

These features are especially valuable in big data environments and in scientific or engineering applications where matrix sizes can be overwhelming, enabling complex calculations efficiently.

Implementation

The matrix multiplication functionality is implemented in the matmul function. It supports Blosc2 NDArray objects and leverages chunked operations to perform the multiplication efficiently.

The image illustrates a blocked matrix multiplication approach. The key idea is to divide matrices into smaller blocks (or chunks) to optimize memory access and computational efficiency.

In the image, matrix A (M x K) and matrix B (K x N) are partitioned into chunks, and these are partitioned into blocks. The resulting matrix C (M x N) is computed as a sum of block-wise multiplication.

This method significantly improves cache utilization by ensuring that only the necessary parts of the matrices are loaded into memory at any given time. In Blosc2, storing matrix blocks as compressed chunks reduces memory footprint and enhances performance by enabling on-the-fly decompression.
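For readers who have not seen blocked multiplication before, here is a minimal NumPy sketch of the technique. It is not the Blosc2 matmul implementation: there is no compression and only one level of partitioning, and `blocked_matmul` and the `block` parameter are names chosen for this illustration.

```python
import numpy as np

def blocked_matmul(A, B, block=64):
    """Compute A @ B by accumulating products of (block x block) tiles."""
    M, K = A.shape
    K2, N = B.shape
    if K != K2:
        raise ValueError("inner dimensions must match")
    C = np.zeros((M, N), dtype=np.result_type(A, B))
    for i in range(0, M, block):          # tile rows of C
        for j in range(0, N, block):      # tile columns of C
            for k in range(0, K, block):  # walk along the shared dimension
                # Slices past the edge are clipped, so ragged tiles work too.
                C[i:i + block, j:j + block] += (
                    A[i:i + block, k:k + block] @ B[k:k + block, j:j + block]
                )
    return C

rng = np.random.default_rng(0)
A = rng.random((70, 50))
B = rng.random((50, 90))
print(np.allclose(blocked_matmul(A, B, block=32), A @ B))  # True
```

Each tile of C only needs one tile of A and one tile of B at a time, which is the property that makes the approach a natural fit for chunked storage.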

Also, Blosc2 supports a wide range of data types. In addition to standard Python types such as int, float, and complex, it also fully supports various NumPy types. The currently supported types include:

np.int8
np.int16
np.int32
np.int64
np.float32
np.float64
np.complex64
np.complex128

This versatility allows compression and subsequent processing to be applied across diverse scenarios, tailored to the specific needs of each application.

Together, these features make Blosc2 a flexible and adaptable tool for various scenarios, but especially suited for the handling of large datasets.

Benchmarks

The benchmarks have been designed to evaluate the performance of the matmul function under various conditions. Here are the key aspects of our experimental setup and findings:

Different matrix sizes were tested using both float32 and float64 data types. All the matrices used for multiplication are square. The variation in matrix sizes helps observe how the function scales and how the overhead of chunk management impacts performance.

The x-axis represents the size of the resulting matrix in megabytes (MB). We used GFLOPS (Giga Floating-Point Operations per Second) to gauge the computational throughput, allowing us to compare the efficiency of the matmul function relative to highly optimized libraries like NumPy.
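As a rough guide to how such numbers can be obtained, the sketch below times a square NumPy multiplication and converts it to GFLOPS. It assumes the common convention of 2·M·K·N floating-point operations per multiplication and reports the result size in MiB; the exact counting and units used in the benchmark plots may differ, and `matmul_flops` and `measure_gflops` are hypothetical helper names.

```python
import time
import numpy as np

def matmul_flops(m, k, n):
    # One multiply and one add per inner-product term (common convention).
    return 2 * m * k * n

def measure_gflops(n, dtype=np.float64):
    a = np.ones((n, n), dtype=dtype)
    b = np.ones((n, n), dtype=dtype)
    start = time.perf_counter()
    c = a @ b
    elapsed = time.perf_counter() - start
    result_mb = c.nbytes / 2**20   # size of the result, used for the x-axis
    return result_mb, matmul_flops(n, n, n) / elapsed / 1e9

result_mb, gflops = measure_gflops(512)
print(f"{result_mb:.1f} MB result, {gflops:.1f} GFLOPS")
```

A single timing like this is noisy; a real benchmark would repeat the measurement and keep the best or the median time.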

Blosc2 also incorporates a functionality to automatically select chunks, and it is represented in the benchmark by "Auto".

For smaller matrices, the overhead of managing chunks in Blosc2 can result in lower GFLOPS compared to NumPy. As the matrix size increases, Blosc2 scales well, with its performance approaching that of NumPy.

Each chunk shape exhibits a peak performance when the matrix size matches the chunk size, or is a multiple of the chunk shape.

Conclusion

The new matrix multiplication feature in Blosc2 introduces efficient, chunked computation for compressed arrays. This allows users to handle large datasets both in memory and on disk without sacrificing performance. The implementation supports a wide range of data types, making it versatile for various numerical applications.

Real-world applications, such as neural network training, demonstrate the potential benefits in scenarios where memory constraints and large data sizes are common. While there are some limitations (such as support only for 2D arrays and the overhead of blocking), the applicability looks promising, for example through potential integration with deep learning frameworks.

Overall, Blosc2 offers a compelling alternative for applications where the advantages of compression and out-of-core computation are critical, paving the way for more efficient processing of massive datasets.

Getting my feet wet with Blosc2

In the initial phase of the project, my biggest challenge was understanding how Blosc2 manages data internally. For matrix multiplication, it was critical to grasp how to choose the right chunks, since the operation requires that the ranges of both matrices coincide. After some considerations and a few insightful conversations with Francesc, I finally understood the underlying mechanics. This breakthrough allowed me to begin implementing the first versions of my solution, adjusting the data fragmentation so that each block was properly aligned for precise computation.

Another important aspect was adapting to the professional workflow of using Git for version control. Embracing Git, with its branch creation, regular commits, and conflict resolution, represented a significant shift in my development approach. This experience not only improved the organization of my code and facilitated collaboration but also instilled a structured and disciplined mindset in managing my projects. This tool has proven to be both valuable and extremely helpful.

Finally, the moment when the function returned the correct result was really exciting. After multiple iterations, the rigorous debugging process paid off as everything fell into place. This breakthrough validated the robustness of the implementation and boosted my confidence to further optimize and tackle new challenges in data processing.

Mastering Persistent, Dynamic Reductions and Lazy Expressions in Blosc2

Oumaima Ech Chdig, Francesc Alted

2024-11-05 12:58

Working with large volumes of data is challenging, but Blosc2 offers unique tools to facilitate processing.

Blosc2 is a powerful data compression library designed to handle and process large datasets effectively. One standout feature is its support for lazy expressions and persistent and dynamic reductions. These tools make it possible to define complex calculations that execute only when necessary, reducing memory usage and optimizing processing time, which can be a game-changer when dealing with massive arrays.

In this guide, we’ll break down how to use these features to streamline data manipulation and get better performance out of your workflows. We’ll also see how resizing operand arrays is automatically reflected in the results, highlighting the flexibility of lazy expressions.

Getting Started with Arrays and Broadcasting

Blosc2 works smoothly with arrays of various shapes and dimensions, enabling users to perform calculations such as addition or multiplication across arrays of different sizes. This is where broadcasting comes in. With broadcasting, Blosc2 automatically aligns the shapes of arrays for easy operations. This means you don’t need to manually adjust array dimensions to match, a huge time-saver when working with multidimensional data.

For example, suppose we have an array representing a large dataset, a, and another representing a smaller dimension, c.

import blosc2
import numpy as np

a = blosc2.full((1, 3, 2), fill_value=3)
c = blosc2.full(2, fill_value=9, dtype=np.int8)
expr = a + c - 1

As seen above, broadcasting works automatically (and efficiently) with arrays of compressed data, and the data type of the result is inferred from the operands and the expression. Blosc2 reconciles the dimensions and data types of the arrays involved, so calculations can be performed without manual adjustments.
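To make the broadcasting and type inference rules more concrete, here is a minimal sketch that reproduces the same expression with plain NumPy arrays. It assumes that Blosc2 follows NumPy's broadcasting and type-promotion conventions for this case; the NumPy arrays are only stand-ins for the compressed operands.

```python
import numpy as np

# NumPy stand-ins for the Blosc2 operands used above
a = np.full((1, 3, 2), 3)           # default integer dtype
c = np.full(2, 9, dtype=np.int8)    # 1-D operand, broadcast along the last axis

# Shape and dtype can be predicted without computing anything
shape = np.broadcast_shapes(a.shape, c.shape)
dtype = np.result_type(a.dtype, c.dtype)

result = a + c - 1
print(shape, dtype, result.shape, result.dtype)
```

The smaller operand is never copied or reshaped by hand; its shape is matched against the trailing dimensions of the larger one.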

This approach is ideal for quick and simple data analysis, especially when working with large volumes of information that require frequent operations across different dimensions.

Setting Up and Saving Lazy Expressions

Imagine you need to perform a calculation like sum(a, axis=0) + b * sin(c). Rather than immediately calculating this, Blosc2’s lazy expression feature lets you store the expression for later. By using blosc2.lazyexpr, you define complex mathematical formulas and only trigger their execution when required, and only for the part of the resulting array that you are interested in. This is highly advantageous for large computations that might not be needed right away or that may depend on evolving data.

Let's see how that works with a little more complex expression:

# Create arrays with specific dimensions and values
# (shapes chosen so that sum(a, axis=0) and b * sin(c) are both (20, 40))
a = blosc2.full((20, 20, 40), 1, urlpath="a.b2nd", mode="w")
b = blosc2.full((20, 40), 2, urlpath="b.b2nd", mode="w")
c = blosc2.full(40, 3, dtype=np.uint8, urlpath="c.b2nd", mode="w")
# Define a lazy expression and the operands for later execution
# Note that we are using a string version of the expression here
# so that it can be re-opened as-is later on
expression = "sum(a, axis=0) + b * sin(c)"
lazy_expression = blosc2.lazyexpr(expression)
lazy_expression.save("arrayResult.b2nd", mode="w")

In this code, sum(a, axis=0) + b * sin(c) is defined but not executed immediately. When you’re ready to use the result, you can call lazy_expression.compute() (returns a Blosc2 array that is compressed by default) to run the calculation. Alternatively, you can specify the part of the result that you are interested in with lazy_expression[0, :] (returns a NumPy array). This way, you save CPU and memory and only perform the computation when necessary.
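The deferred-evaluation idea can be illustrated with a toy class written in plain NumPy. This is only a sketch of the concept, not how Blosc2 implements lazy expressions: the class name ToyLazyExpr and the operand shapes are made up for illustration, and the expression is hard-coded.

```python
import numpy as np


class ToyLazyExpr:
    """Hard-coded, deferred version of sum(a, axis=0) + b * sin(c)."""

    def __init__(self, a, b, c):
        # Nothing is computed here; we only keep references to the operands
        self.a, self.b, self.c = a, b, c

    def compute(self):
        # Evaluate the whole expression
        return np.sum(self.a, axis=0) + self.b * np.sin(self.c)

    def __getitem__(self, rows):
        # Evaluate only the requested rows (an int or a slice) of the result
        return np.sum(self.a[:, rows], axis=0) + self.b[rows] * np.sin(self.c)


a = np.full((20, 20, 40), 1.0)
b = np.full((20, 40), 2.0)
c = np.full(40, 3.0)

expr = ToyLazyExpr(a, b, c)   # no work done yet
first_row = expr[0]           # reads one row of b and one slab of a
full = expr.compute()         # reads everything
```

Asking for a single row touches only a fraction of the operands, which is the saving that slicing a lazy expression is meant to provide.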

Dynamic Computation: Reusing and Updating Results

Another big advantage of Blosc2 is its ability to compute persistent expressions that are dynamic: when an operand is enlarged, Blosc2 re-adapts the expression to account for its new shape. This approach significantly speeds up processing time, especially when working with frequently updated or real-time data.

For instance, if you have an expression stored, and only part of your dataset changes, Blosc2 can apply reductions dynamically to efficiently update the sum:

# Resizing arrays and updating values
a.resize((30, 30, 40))
a[20:30] = 5
b.resize((30, 40))
b[20:30] = 7
# Open the saved file
lazy_expression = blosc2.open(urlpath="arrayResult.b2nd")
result = lazy_expression.compute()

In this case, the final result will have a shape of (30, 40) (instead of the previous (20, 40)). This allows for quick adaptability, which is crucial in data environments where values evolve constantly.
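The change in the result shape can be checked from the operand shapes alone. The sketch below uses NumPy's broadcasting rules and assumes the operands start out as (20, 20, 40), (20, 40) and (40,), which is what the (20, 40) result quoted above implies; the helper name result_shape is hypothetical.

```python
import numpy as np


def result_shape(a_shape, b_shape, c_shape):
    """Shape of sum(a, axis=0) + b * sin(c), computed from shapes alone."""
    reduced = tuple(a_shape[1:])   # summing along axis 0 drops that axis
    return np.broadcast_shapes(reduced, b_shape, c_shape)


before = result_shape((20, 20, 40), (20, 40), (40,))
after = result_shape((30, 30, 40), (30, 40), (40,))
print(before, after)
```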

Why Persistent Reductions and Lazy Expressions Matter

These features make Blosc2 a top choice for working with large datasets, as they allow for:

Broadcasting of in-memory, on-disk or network operands.
Efficient use of CPU and memory by only executing calculations when needed.
Dynamic expressions that adapt to changing data in operands.
Enhanced performance due to streamlined, multi-threaded and pre-fetched calculations.

Together, lazy expressions and persistent reductions provide a flexible, resource-efficient way to manage complex data processes. They’re perfect for real-time analysis, evolving datasets, or any high-performance computing tasks requiring dynamic data handling.

Conclusion

Blosc2’s features offer a way to make data processing smarter and faster. If you work with large arrays or require adaptable workflows, Blosc2 can help you make the most of your data processing resources.

For more in-depth guidance, visit the full tutorial on Blosc2.

N-dimensional reductions with Blosc2

Oumaima Ech Chdig, Francesc Alted

2024-08-28 10:32

NumPy is widely recognized for its ability to perform efficient computations and manipulations on multidimensional arrays. This library is fundamental for many aspects of data analysis and science due to its speed and flexibility in handling numerical data. However, when datasets reach considerable sizes, working with uncompressed data can result in prolonged access times and intensive memory usage, which can negatively impact overall performance.

Python-Blosc2 leverages the power of NumPy to perform reductions on compressed multidimensional arrays. By compressing data with Blosc2, it is possible to reduce the memory and storage space required for large datasets while maintaining fast reduction times. This is especially beneficial for systems with memory constraints, as it allows for faster data access and operation.

In this blog, we will explore how Python-Blosc2 can perform data reductions with in-memory NDArray objects (or any other object fulfilling the LazyArray interface) and how the speed of these operations can be optimized by using different chunk shapes, compression levels and codecs. We will then compare the performance of Python-Blosc2 with NumPy.

Note: The code snippets shown in this blog are part of a Jupyter notebook that you can run on your own machine. For that, you will need to install a recent version of Python-Blosc2: pip install 'blosc2>=3.0.0b3'; feel free to experiment with different parameters and share your results with us!

The 3D array

We will use a 3D array of type float64 with shape (1000, 1000, 1000). This array will be filled with values from 0 to 1000, and the goal will be to compute the sum of values in stripes of 100 elements in one axis, and including all the values in the other axis. We will perform reductions along the X, Y, and Z axes, comparing Blosc2 performance (with and without compression) against NumPy.
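As a rough sketch of how such an array could be built (the notebook may do it differently), the snippet below fills a cube with evenly spaced values from 0 to 1000. The helper name make_cube and the use of np.linspace are assumptions; note that the full-size array needs about 8 GB of RAM, so a small instance is built here.

```python
import numpy as np


def make_cube(n=1000):
    """Return an (n, n, n) float64 array with values from 0 to 1000."""
    return np.linspace(0, 1000, num=n**3, dtype=np.float64).reshape(n, n, n)


# Memory needed for the full-size array, without allocating it
nbytes = 1000**3 * np.dtype(np.float64).itemsize
print(f"{nbytes / 1e9:.0f} GB")

small = make_cube(10)   # a small instance, cheap to build
```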

Reducing with NumPy

We will start by performing different sum reductions using NumPy. First, summing along the X, Y, and Z axes (and getting 2D arrays as a result) and then summing along all axes (and getting a scalar as a result).

from time import time

axes = ("X", "Y", "Z", "all")
meas_np = {"sum": {}, "time": {}}
for n, axis in enumerate(axes):
    # axis=None reduces over the whole array
    n = n if axis != "all" else None
    t0 = time()
    meas_np["sum"][axis] = np.sum(a, axis=n)
    t = time() - t0
    meas_np["time"][axis] = t

Reducing with Blosc2

Now let's create the Blosc2 array from the NumPy array. First, let's define the parameters for Blosc2: number of threads, compression levels, codecs, and chunk sizes. We will exercise different combinations of these parameters (including no compression) to evaluate the performance of Python-Blosc2 in reducing data in 3D arrays.

# Params for Blosc2
clevels = (0, 5)
codecs = (blosc2.Codec.LZ4, blosc2.Codec.ZSTD)

The function shown below is responsible for creating the different arrays and performing the reductions for each combination of parameters.

# Build a Blosc2 array for every codec/clevel combination and time the reductions
def measure_blosc2(chunks):
    meas = {}
    for codec in codecs:
        meas[codec] = {}
        for clevel in clevels:
            meas[codec][clevel] = {"sum": {}, "time": {}}
            cparams = {"clevel": clevel, "codec": codec}
            a1 = blosc2.asarray(a, chunks=chunks, cparams=cparams)
            if clevel > 0:
                print(f"cratio for {codec.name} + SHUFFLE: {a1.schunk.cratio:.1f}x")
            # Time the reduction along each axis (and over the whole array)
            for n, axis in enumerate(axes):
                n = n if axis != "all" else None
                t0 = time()
                # Perform the sum along the selected axis
                meas[codec][clevel]["sum"][axis] = a1.sum(axis=n)
                t = time() - t0
                meas[codec][clevel]["time"][axis] = t
                # If interested, you can uncomment the following line to check the results
                #np.testing.assert_allclose(meas[codec][clevel]["sum"][axis],
                #                           meas_np["sum"][axis])
    return meas

Automatic chunking

Let's plot the results for the X, Y, and Z axes, comparing the performance of Python-Blosc2 with different configurations against NumPy.

We can see that reduction along the X axis is much slower than along the Y and Z axes for the Blosc2 case. This is because the automatically computed chunk shape is (1, 1000, 1000), making the overhead of partial sums larger. In addition, we see that, with the exception of the X axis, Blosc2+LZ4+SHUFFLE actually achieves far better performance than NumPy. Finally, when not using compression inside Blosc2, we never see an advantage. See later for a discussion on these results.

Manual chunking

Let's try to improve the performance by manually setting the chunk size. In the next case, we want to make performance similar along the three axes, so we will set the chunk size to (100, 100, 100) (8 MB).

In this case, performance along the X axis is faster than along the Y and Z axes for Blosc2. Interestingly, performance is also faster than NumPy along the X axis, while being very similar along the Y and Z axes.

We could proceed further and try to fine tune the chunk size to get even better performance, but this is out of the scope of this blog (and more a task for Btune). Instead, we will try to make some sense of the results above; see below.

Why Blosc2 can be faster than NumPy?

As Blosc2 is using the NumPy machinery for computing reductions behind the scenes, why is Blosc2 faster than NumPy in several cases above? The answer lies in the way Blosc2 and NumPy access data in memory.

Blosc2 splits data into chunks and blocks to compress and decompress data efficiently. When accessing data, a full chunk is fetched from memory and decompressed by the CPU (as seen in the image below, left side). If the chunk size is small enough to fit in the CPU cache, the CPU can write the decompressed chunk faster, as it does not need to travel back to the main memory. Later, when NumPy is called to perform the reduction on the decompressed chunk, it can access the data faster, as it is already in the CPU cache (image below, right side).

/images/ndim-reductions/Blosc2-decompress.png

/images/ndim-reductions/Blosc2-NumPy.png

However, to let NumPy go faster, Blosc2 needs to decompress several chunks before NumPy performs the reduction. The decompressed chunks are stored in a queue, waiting for further processing; this is why Blosc2 needs to handle several (3 or 4) chunks simultaneously. In our case, the L3 cache size of our CPU (Intel 13900K) is 36 MB, and Blosc2 has chosen 8 MB for the chunk size, allowing up to 4 chunks to be stored in L3, which is close to optimal. Also, when we set the chunk shape to (100, 100, 100), the chunk size is still 8 MB, so the same reasoning applies.
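The arithmetic behind these figures is easy to reproduce. The following sketch computes the uncompressed size of both chunk shapes and how many of them fit in a 36 MB L3 cache; the helper name chunk_nbytes is hypothetical, and 36 MB is taken here as 36 * 2**20 bytes.

```python
import math


def chunk_nbytes(chunk_shape, itemsize=8):
    """Uncompressed size in bytes of a chunk holding float64 items."""
    return math.prod(chunk_shape) * itemsize


L3_BYTES = 36 * 2**20   # 36 MB of L3, as quoted above for the Intel 13900K

for chunks in [(1, 1000, 1000), (100, 100, 100)]:
    size = chunk_nbytes(chunks)
    print(chunks, size, "bytes ->", L3_BYTES // size, "chunks fit in L3")
```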

All in all, it is not that Blosc2 is faster than NumPy, but rather that it is allowing NumPy to leverage the CPU cache more efficiently. Having said this, we still need some explanation of why the performance can be so different along the X, Y, and Z axes, especially for the first chunk shape (automatic) above. Let's address this in the next section.

Performing reductions on 3D arrays

/images/ndim-reductions/3D-cube-plane.png

In a three-dimensional environment, like the one shown in the image, data is organized in a cubic space with three axes: X, Y, and Z. By default, Blosc2 chooses the chunk size so that it fits in the CPU cache comfortably. On the other hand, it tries to follow the NumPy convention of storing data row-wise; this is why the default chunk shape has been chosen as (1, 1000, 1000). In this case, it is clear that reduction times along different axes are not going to be the same, as the sizes of the chunk along different axes are not uniform (actually, there is a large asymmetry).

The difference in cost while traversing data values can be visualized more easily on a 2D array:

/images/ndim-reductions/memory-access-2D-x.png

Reduction along the X axis: When accessing a row (red line), the CPU can access these values (red points) from memory sequentially, but they need to be stored in an accumulator. The next rows need to be fetched from memory and added to the accumulator. If the accumulator is large (in this case 1000 * 1000 * 8 bytes = 8 MB), it does not fit in the lower-level CPU caches, and the accumulation has to be performed in the relatively slow L3.

/images/ndim-reductions/memory-access-2D-y.png

Reducing along the Y axis: When accessing a row (green line), the CPU can access these values (green points) from memory sequentially but, contrary to the case above, they don't even need an accumulator, and the sum of the row (marked as an *) is final. So, although the number of sum operations is the same as above, the required time is smaller because there is no need to update all the values of the accumulator per row, but only one at a time, which is faster on modern CPUs.

Tweaking the chunk size

However, when Blosc2 is instructed to create chunks that are the same size along all the axes (chunks=(100, 100, 100)), the situation changes. In this case, an accumulator is needed for each chunk (sub-cube in the figure above), but it is relatively small (100 * 100 * 8 bytes = 80 KB) and fits in L2, so accumulation along the X axis is faster than in the previous scenario (where, remember, the accumulation had to be done in L3).
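The per-chunk accumulation described here can be sketched in plain NumPy. This is an illustration of the idea only, not the actual Blosc2 code path (there is no compression, threading or prefetching); the function names chunked_sum_axis0 and accumulator_nbytes are hypothetical.

```python
import math

import numpy as np


def chunked_sum_axis0(arr, chunks):
    """Sum a 3D array along axis 0, visiting one chunk at a time."""
    out = np.zeros(arr.shape[1:], dtype=arr.dtype)
    cx, cy, cz = chunks
    for x in range(0, arr.shape[0], cx):
        for y in range(0, arr.shape[1], cy):
            for z in range(0, arr.shape[2], cz):
                chunk = arr[x:x + cx, y:y + cy, z:z + cz]
                # The partial sum is the small, per-chunk accumulator
                partial = chunk.sum(axis=0)
                out[y:y + cy, z:z + cz] += partial
    return out


def accumulator_nbytes(chunks, axis, itemsize=8):
    """Size in bytes of the partial result when reducing a chunk along axis."""
    dims = [n for i, n in enumerate(chunks) if i != axis]
    return math.prod(dims) * itemsize


arr = np.arange(20 * 20 * 20, dtype=np.float64).reshape(20, 20, 20)
total = chunked_sum_axis0(arr, chunks=(5, 5, 5))

print(accumulator_nbytes((1, 1000, 1000), axis=0))   # automatic chunk shape
print(accumulator_nbytes((100, 100, 100), axis=0))   # manual chunk shape
```

The result is identical to a plain arr.sum(axis=0); what changes with the chunk shape is the size of the partial result that has to stay hot in cache while each chunk is processed.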

Incidentally, now Blosc2 performance along X axis is even better than in the Y and Z axes, as the CPU can access data in a more efficient way. Furthermore, Blosc2 performance is up to 1.5x better than NumPy in the X axis (while being similar, or even a bit better along Y and Z axes), which is a quite remarkable feat.

Effect of using different codecs in Python-Blosc2

Compression and decompression consume CPU and memory resources. Differentiating between various codecs and configurations allows for evaluating how each option impacts the use of these resources, helping to choose the most efficient option for the operating environment. Finding the right balance between compression ratio and speed is crucial for optimizing performance.

In the plots above, we can see how the LZ4 codec strikes such a balance, as it achieves the best performance in general, even above a non-compressed scenario. This is because LZ4 is tuned towards speed, and the time to compress and decompress the data is very low. On the other hand, ZSTD is a codec that is optimized for compression ratio (although not shown, in this case it typically compresses between 2x and 3x more than LZ4), and hence it is a bit slower. However, it is still faster than the non-compressed case, as compressed data requires less memory traffic, and this compensates for the additional CPU time required for compression and decompression.

We have just scratched the surface of the compression parameters that can be tuned in Blosc2. You can use the cparams dict with the different parameters in blosc2.compress2() to set the compression level, codec, filters and other parameters.

Conclusion

Understanding the balance between space savings and the additional time required to process the data is important. Testing different compression settings can help find the method that offers the best trade-off between reduced size and processing time. The fact that Blosc2 automatically chooses the chunk shape makes it easy for the user to get reasonably good performance without having to worry about the details of the CPU cache. In addition, as we have shown, we can fine tune the chunk shape in case the default one does not fit our needs (e.g. we need more uniform performance along all axes).

Besides the sum() reduction exercised here, Blosc2 supports a fair range of reduction operators (mean, std, min, max, all, any, etc.), and you are invited to explore them. Moreover, it is also possible to use reductions even for very large arrays that are stored on disk. This opens the door to a wide range of possibilities for data analysis and science, allowing for efficient reductions on large datasets that are compressed on-disk and with minimal memory usage. We will explore this in a forthcoming blog.

We would like to thank ironArray for supporting the development of the computing capabilities of Blosc2. We are also grateful to NumFOCUS for recently providing a small grant that is helping us to improve the documentation for the project. Last but not least, we would like to thank the Blosc community for providing so many valuable insights and feedback that have helped us to improve the performance and usability of Blosc2.

Peaking compression performance in PyTables with direct chunking

Ivan Vilata-i-Balaguer

2024-08-26 09:20

It took a while to put things together, but after many months of hard work by maintainers, developers and contributors, PyTables 3.10 finally saw the light, full of enhancements and fixes. Thanks to a NumFOCUS Small Development Grant, we were able to include a new feature that can help you squeeze considerable performance improvements when using compression: the direct chunking API.

In a previous post about optimized slicing we saw the advantages of avoiding the overhead introduced by the HDF5 filter pipeline, in particular when working with multi-dimensional arrays compressed with Blosc2. This is achieved by specialized, low-level code in PyTables which understands the structure of the compressed data in each chunk and accesses it directly, with the least possible intervention of the HDF5 library.

However, there are many reasons to exploit direct chunk access in your own code, from customizing compression with parameters not allowed by the PyTables Filters class, to using yet-unsupported compressors or even helping you develop new plugins for HDF5 to support them (you may write compressed chunks in Python while decompressing transparently in a C filter plugin, or vice versa). And of course, as we will see, skipping the HDF5 filter pipeline with direct chunking may be instrumental to reach the extreme I/O performance required in scenarios like continuous collection or extraction of data.

PyTables' new direct chunking API is the machinery that gives you access to these possibilities. Keep in mind though that this is a low-level functionality that may help you largely customize and accelerate access to your datasets, but may also break them. In this post we'll try to show how to use it to get the best results.

Using the API

The direct chunking API consists of three operations: get information about a chunk (chunk_info()), write a raw chunk (write_chunk()), and read a raw chunk (read_chunk()). They are supported by chunked datasets (CArray, EArray and Table), i.e. those whose data is split into fixed-size chunks of the same dimensionality as the dataset (maybe padded at its boundaries), with HDF5 pipeline filters like compressors optionally processing them on read/write.

chunk_info() returns an object with useful information about the chunk containing the item at the given coordinates. Let's create a simple 100x100 array with 10x100 chunks compressed with Blosc2+LZ4 and get info about a chunk:

>>> import tables, numpy
>>> h5f = tables.open_file('direct-example.h5', mode='w')
>>> filters = tables.Filters(complib='blosc2:lz4', complevel=2)
>>> data = numpy.arange(100 * 100).reshape((100, 100))
>>> carray = h5f.create_carray('/', 'carray', chunkshape=(10, 100),
                               obj=data, filters=filters)
>>> coords = (42, 23)
>>> cinfo = carray.chunk_info(coords)
>>> cinfo
ChunkInfo(start=(40, 0), filter_mask=0, offset=6779, size=608)

So the item at coordinates (42, 23) is stored in a chunk of 608 bytes (compressed) which starts at coordinates (40, 0) in the array and byte 6779 in the file. The latter offset may be used to let other code access the chunk directly on storage. For instance, since Blosc2 was the only HDF5 filter used to process the chunk, let's open it directly:

>>> import blosc2
>>> h5f.flush()
>>> b2chunk = blosc2.open(h5f.filename, mode='r', offset=cinfo.offset)
>>> b2chunk.shape, b2chunk.dtype, data.itemsize
((10, 100), dtype('V8'), 8)

Since Blosc2 does understand the structure of data (thanks to b2nd), we can even see that the chunk shape and the data item size are correct. The data type is opaque to the HDF5 filter which wrote the chunk, hence the V8 dtype. Let's check that the item at (42, 23) is indeed in that chunk:

>>> chunk = numpy.ndarray(b2chunk.shape, buffer=b2chunk[:],
                          dtype=data.dtype)  # Use the right type.
>>> ccoords = tuple(numpy.subtract(coords, cinfo.start))
>>> bool(data[coords] == chunk[ccoords])
True

This offset-based access is actually what b2nd optimized slicing performs internally. Please note that neither PyTables nor HDF5 were involved at all in the actual reading of the chunk (Blosc2 just got a file name and an offset). It's difficult to cut more overhead than that!

It won't always be the case that you can (or want to) read a chunk in that way. The read_chunk() method allows you to read a raw chunk as a new byte string or into an existing buffer, given the chunk's start coordinates (which you may compute yourself or get via chunk_info()). Let's use read_chunk() to redo the reading that we just did above:

>>> rchunk = carray.read_chunk(coords)
Traceback (most recent call last):
    ...
tables.exceptions.NotChunkAlignedError: Coordinates are not multiples
    of chunk shape: (42, 23) !* (np.int64(10), np.int64(100))
>>> rchunk = carray.read_chunk(cinfo.start)  # Always use chunk start!
>>> b2chunk = blosc2.ndarray_from_cframe(rchunk)
>>> chunk = numpy.ndarray(b2chunk.shape, buffer=b2chunk[:],
                          dtype=data.dtype)  # Use the right type.
>>> bool(data[coords] == chunk[ccoords])
True
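If you prefer to compute the chunk start yourself instead of calling chunk_info(), the calculation only needs the item coordinates and the chunk shape. The helpers below are a minimal sketch in plain Python; their names are hypothetical and they are not part of the PyTables API.

```python
def chunk_start(coords, chunkshape):
    """Start coordinates of the chunk that contains the item at coords."""
    return tuple(c - c % s for c, s in zip(coords, chunkshape))


def coords_in_chunk(coords, chunkshape):
    """Position of the item relative to the start of its chunk."""
    return tuple(c % s for c, s in zip(coords, chunkshape))


start = chunk_start((42, 23), (10, 100))
inner = coords_in_chunk((42, 23), (10, 100))
print(start, inner)
```

For the example array, this gives the same start as the one reported in ChunkInfo above, and the same in-chunk coordinates as those obtained with numpy.subtract().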

The write_chunk() method allows you to write a byte string into a raw chunk. Please note that you must first apply any filters manually, and that you can't write chunks beyond the dataset's current shape. However, remember that enlargeable datasets may be grown or shrunk in an efficient manner using the truncate() method, which doesn't write new chunk data. Let's use that to create an EArray with the same data as the previous CArray, chunk by chunk:

>>> earray = h5f.create_earray('/', 'earray', chunkshape=carray.chunkshape,
                               atom=carray.atom, shape=(0, 100),  # Empty.
                               filters=filters)  # Just to hint readers.
>>> earray.write_chunk((0, 0), b'whatever')
Traceback (most recent call last):
    ...
IndexError: Chunk coordinates not within dataset shape:
    (0, 0) <> (np.int64(0), np.int64(100))
>>> earray.truncate(len(carray))  # Grow the array (cheaply) first!
>>> for cstart in range(0, len(carray), carray.chunkshape[0]):
...     chunk = carray[cstart:cstart + carray.chunkshape[0]]
...     b2chunk = blosc2.asarray(chunk)  # May be customized.
...     wchunk = b2chunk.to_cframe()  # Serialize.
...     earray.write_chunk((cstart, 0), wchunk)

You can see that such low-level writing is more involved than usual. Though we used default Blosc2 parameters here, the explicit compression step allows you to fine-tune it in ways not available through PyTables like setting internal chunk and block sizes or even using Blosc2 compression plugins like Grok/JPEG2000. In fact, the filters given on dataset creation are only used as a hint, since each Blosc2 container holding a chunk includes enough metadata to process it independently. In the example, the default chunk compression parameters don't even match dataset filters (using Zstd instead of LZ4):

>>> carray.filters
Filters(complevel=2, complib='blosc2:lz4', ...)
>>> earray.filters
Filters(complevel=2, complib='blosc2:lz4', ...)
>>> b2chunk.schunk.cparams['codec']
<Codec.ZSTD: 5>

Still, the Blosc2 HDF5 filter plugin included with PyTables is able to read the data just fine:

>>> bool((carray[:] == earray[:]).all())
True
>>> h5f.close()

You may find a more elaborate example of using direct chunking in PyTables' examples.

Benchmarks

b2nd optimized slicing shows us that removing the HDF5 filter pipeline from the I/O path can result in sizable performance increases, if the right chunking and compression parameters are chosen. To check the impact of using the new direct chunking API, we ran some benchmarks that compare regular and direct read/write speeds. On an AMD Ryzen 7 7800X3D CPU with 8 cores, 96 MB L3 cache and 8 MB L2 cache, clocked at 4.2 GHz, we got the following results:

/images/pytables-direct-chunking/AMD-7800X3D.png

We can see that direct chunking yields 3.75x write and 4.4x read speedups, reaching write/read speeds of 1.7 GB/s and 5.2 GB/s. These are quite impressive numbers, though the base equipment is already quite powerful. Thus we also tried the same benchmark on a consumer-level MacBook Air laptop with an Apple M1 CPU with 4+4 cores and 12 MB L2 cache, clocked at 3.2 GHz, with the following results:

/images/pytables-direct-chunking/MacAir-M1.png

In this case direct chunking yields 4.5x write and 1.9x read speedups, with write/read speeds of 0.8 GB/s and 1.6 GB/s. The absolute numbers are of course not as impressive, but the performance is still much better than that of the regular mechanism, especially when writing. Please note that the M1 CPU has a hybrid efficiency+performance core configuration; as an aside, running the benchmark on a high-end Intel Core i9-13900K CPU, also with a hybrid 8+16 core configuration (32 MB L2, 5.7 GHz), raised the write speedup to 4.6x, reaching an awesome write speed of 2.6 GB/s.

All in all, it's clear that bypassing the HDF5 filter pipeline results in immediate I/O speedups. You may find a Jupyter notebook with the benchmark code and AMD CPU data in PyTables' benchmarks.

Conclusions

First of all, we (Ivan Vilata and Francesc Alted) want to thank everyone who made this new 3.10 release of PyTables possible, especially Antonio Valentino for his role of project maintainer, and the many code and issue contributors. Trying the new direct chunking API is much easier because of them. And of course, a big thank you to the NumFOCUS Foundation for making this whole new feature possible by funding its development!

In this post we saw how PyTables' direct chunking API allows one to squeeze the extra drop of performance that the most demanding scenarios require, when adjusting chunking and compression parameters in PyTables itself can't go any further. Of course, its low-level nature makes its use less convenient and safe than higher-level mechanisms, so you should evaluate whether the extra effort pays off. If used carefully with robust filters like Blosc2, the direct chunking API should shine the most in the case of large datasets with sustained I/O throughput demands, while retaining compatibility with other HDF5-based tools.
