Prefilters, postfilters and fillers¶
Via decorators, one may attach functions to an SChunk instance that are applied when compressing data as it is appended (prefilters), when filling in data on creation (fillers), or when decompressing data on access (postfilters). Note that prefilters and fillers modify the data stored in the SChunk, whereas postfilters act only on the decompressed data returned by access operations on the SChunk.
These procedures are implemented via user-defined Python functions. In this tutorial we will see how they work, so let’s start by creating our SChunk! Because we will be using Python functions, we will not be able to use parallelism, so nthreads has to be 1 when compressing:
[1]:
import numpy as np
import blosc2
typesize = 4
cparams = {
    "nthreads": 1,
    "typesize": typesize,
}
storage = {
    "cparams": cparams,
}
chunk_len = 10_000
my_schunk = blosc2.SChunk(chunksize=chunk_len * typesize, **storage)
my_schunk
[1]:
<blosc2.schunk.SChunk at 0x740d9ea49310>
Now that we have the schunk, we can create the different prefilter, postfilter and filler functions.
Prefilters¶
To set a prefilter, first write a function that takes three parameters: input, output and the offset in the schunk where the block starts. Then apply the prefilter decorator to it, passing the data type of the input that the prefilter will receive and the data type of the output that it will fill and append to the schunk:
[2]:
input_dtype = np.int32
output_dtype = np.int32
@my_schunk.prefilter(input_dtype, output_dtype)
def prefilter(input, output, offset):
    output[:] = input - 3 + offset
Awesome! Now each time we add data to the schunk, the prefilter will modify it before storing it. Let’s append an array and check that the data actually stored has been modified:
[3]:
buffer = np.arange(chunk_len * 100, dtype=input_dtype)
my_schunk[: buffer.size] = buffer
out = np.empty(10, dtype=output_dtype)
my_schunk.get_slice(stop=10, out=out)
print(buffer[:10])
print(out)
[0 1 2 3 4 5 6 7 8 9]
[-3 -2 -1 0 1 2 3 4 5 6]
As you can see, the data was modified according to the prefilter function.
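The output above only covers the first ten elements, where the offset is 0. To see what the offset term does further along, here is a NumPy-only sketch that mimics the prefilter one chunk at a time. It is not how blosc2 runs prefilters internally: `apply_per_chunk` is a hypothetical helper, and it assumes that each call receives one whole chunk and that `offset` is the element index where that chunk starts (an assumption that agrees with the outputs shown in this tutorial).

```python
import numpy as np

chunk_len = 10_000


def prefilter(input, output, offset):
    # Same body as the prefilter defined in the tutorial
    output[:] = input - 3 + offset


def apply_per_chunk(func, data, chunk_len):
    # Hypothetical helper: call func once per chunk,
    # passing the chunk's starting element index as the offset (assumption)
    result = np.empty_like(data)
    for start in range(0, data.size, chunk_len):
        stop = start + chunk_len
        func(data[start:stop], result[start:stop], start)
    return result


buffer = np.arange(chunk_len * 100, dtype=np.int32)
stored = apply_per_chunk(prefilter, buffer, chunk_len)

print(stored[:3])  # first chunk, offset 0: [-3 -2 -1]
print(stored[chunk_len : chunk_len + 3])  # second chunk, offset 10_000: [19997 19998 19999]
```

Under this assumption the stored values jump at every chunk boundary, because each chunk adds its own starting index on top of `input - 3`.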
Removing a prefilter¶
What if we don’t want the prefilter to be executed anymore? Then you can remove the prefilter from the schunk just like so:
[4]:
my_schunk.remove_prefilter("prefilter")
Since we no longer use a user-defined Python function, we might want to enable multi-threading again, via my_schunk.cparams = blosc2.CParams(**{"nthreads": 8}).
Fillers¶
So far, we have seen a way to set a function that will be executed each time we append some data. Now, we may instead want to fill an empty schunk once with some more complex operation, and then update the data later without the new values being modified. This is where fillers come into play.
A filler is a function that receives a tuple of inputs, an output and the offset where the block begins. First let’s create another empty schunk (with parallelism disabled of course):
[5]:
schunk_fill = blosc2.SChunk(chunksize=chunk_len * typesize, **storage)
Next, we will create our filler function and associate it with the schunk_fill that we want to fill via the filler decorator. The decorator takes three arguments: a tuple with one (input, data type) pair per input; the output data type; and the number of elements you want the filled schunk to have. Here we use the my_schunk that we created above as the only input:
[6]:
nelem = my_schunk.nbytes // my_schunk.typesize
@schunk_fill.filler(((my_schunk, output_dtype),), output_dtype, nelem)
def filler(inputs_tuple, output, offset):
    output[:] = inputs_tuple[0] + offset
Let’s see how the filled data looks:
[7]:
out = np.empty(nelem, dtype=output_dtype)
schunk_fill.get_slice(out=out)
out
[7]:
array([ -3, -2, -1, ..., 2979994, 2979995, 2979996],
shape=(1000000,), dtype=int32)
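The values at both ends of this array can be checked by hand. The following NumPy-only sketch recomputes them without blosc2, assuming that both the prefilter and the filler received, as `offset`, the element index at which each 10,000-element chunk starts. That assumption is inferred from the numbers above rather than taken from the library documentation, and `chunk_start` is a name introduced here for illustration.

```python
import numpy as np

chunk_len = 10_000
nelem = chunk_len * 100

# Starting element index of the chunk each element belongs to (assumed offset):
# 0 for the first 10_000 elements, 10_000 for the next 10_000, and so on
chunk_start = (np.arange(nelem, dtype=np.int32) // chunk_len) * chunk_len

buffer = np.arange(nelem, dtype=np.int32)
stored = buffer - 3 + chunk_start  # what the prefilter left in my_schunk
filled = stored + chunk_start      # what the filler wrote into schunk_fill

print(filled[:3], filled[-3:])
```

For the last element this gives 999999 - 3 + 990000 + 990000 = 2979996, which matches the final value printed above.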
That looks right. If we now update schunk_fill with some data, the filler function will not be applied to it:
[8]:
new_data = np.ones(chunk_len, dtype=np.int32)
schunk_fill[: new_data.size] = new_data
schunk_fill.get_slice(out=out)
out
[8]:
array([ 1, 1, 1, ..., 2979994, 2979995, 2979996],
shape=(1000000,), dtype=int32)
As you can see, the filler function has not been applied to the new data. That makes sense because the filler, unlike a regular prefilter, is only active during the schunk creation. Since the filler will not be called again, there is no need to remove it, although we may want to enable parallelism again, via schunk_fill.cparams = blosc2.CParams(**{"nthreads": 8}).
Postfilters¶
Unlike prefilters, a postfilter is executed every time SChunk data is decompressed during access operations. We’ll use the my_schunk we created above to show how to set a postfilter. It already has parallelism disabled for compression but not for decompression, which is also necessary when using postfilter functions, so we first set nthreads to 1 in its dparams. The postfilter function has the same three arguments as the prefilter function: input, output and offset. However, the decorator used to associate the function with my_schunk only requires the input data type:
[9]:
my_schunk.dparams = blosc2.DParams(**{"nthreads": 1}) # Disable parallelism for decompression
@my_schunk.postfilter(input_dtype)
def postfilter(input, output, offset):
    output[:] = input + 3 + np.arange(input.size) + offset
Let’s try decompressing some data from the schunk and see how the postfilter is applied:
[10]:
out = np.empty(10, dtype=input_dtype)
my_schunk.get_slice(stop=10, out=out)
out
[10]:
array([ 0, 2, 4, 6, 8, 10, 12, 14, 16, 18], dtype=int32)
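The even numbers follow directly from the postfilter formula: the stored value at index i is i - 3, so the postfilter returns (i - 3) + 3 + i + 0 = 2i. A minimal NumPy check of this arithmetic, assuming the offset is 0 for the block that contains these first ten elements:

```python
import numpy as np

# First ten values stored in my_schunk after the prefilter: -3, -2, ..., 6
stored = np.arange(10, dtype=np.int32) - 3
offset = 0  # assumed: the block holding these elements starts at element 0

result = np.empty(10, dtype=np.int32)
# Same expression as the postfilter body
result[:] = stored + 3 + np.arange(stored.size) + offset

print(result)  # [ 0  2  4  6  8 10 12 14 16 18]
```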
If we do not want the postfilter to be executed anymore, we can remove it from the SChunk easily. We can then check that it is no longer applied when decompressing data:
[11]:
my_schunk.remove_postfilter("postfilter")
my_schunk.get_slice(stop=10, out=out)
out
[11]:
array([-3, -2, -1, 0, 1, 2, 3, 4, 5, 6], dtype=int32)
Conclusions¶
If you want a function to be applied each time before compressing some data, you will use a prefilter. But if you just want to use it once to fill an empty schunk, you may want to use a filler. Finally, if you want to modify data upon access, but leave the internal data of the SChunk untouched, you would use a postfilter. And of course, you can remove any of these functions at any time, and re-enable parallelism if you decide to stop using user-defined functions.
Prefilters, postfilters and fillers can also be applied to an NDArray via its SChunk attribute (NDArray.schunk).
That’s all for now. There are more examples in the examples directory for you to explore. Enjoy!