How it all started
I think I first became truly interested in compression back in
1992, when I was installing C-News, a news server package meant
to handle Usenet News
articles, at our university. For younger
audiences: Usenet News was a very popular way to discuss all
kinds of topics, but at the same time it was pretty difficult to cope
with the huge amount of articles, especially because spam practices
started to appear around that time. As Gene Spafford put it in 1992:
"Usenet is like a herd of performing elephants with
diarrhea. Massive, difficult to redirect, awe-inspiring,
entertaining, and a source of mind-boggling amounts of excrement
when you least expect it."
But one thing was clear: Usenet brought massive amounts of data
that had to be transmitted through the typical low-bandwidth data
lines of that time: 64 Kbps, shared by everyone at our university.
My mission, then, was to bring in the Usenet News feed using as
few resources as possible. Of course, one of the first things that
I did was to schedule news transmission during the night, when everyone
was warm in bed and nobody was going to complain about others stealing
the precious and scarce Internet bandwidth. Another measure was to
subscribe to just a selection of groups, so that the transmission would
finish before the new day started. And of course, I started
experimenting with compression to maximize the number of groups
that we could bring to our community.
Compressing Usenet News
The most widely used compressor in 1992 was compress, a Unix program based on the
LZW compression algorithm. But
LZW had patent issues, so around that time Jean-Loup Gailly and Mark
Adler started work on gzip. At the beginning of 1993, gzip
1.0 was ready for consumption, and I found it exciting not only because
it was not patent-encumbered, but also because it compressed far
better than the older compress program, offered different
compression levels, and was pretty fast too (although compress
still had an edge in speed, IIRC).
So I talked with the university that was
providing us with the News feed, and we managed to start compressing it,
first with compress and then with gzip. Shortly after that,
while measuring the new gzip improvements, I discovered
that the bottleneck was now our News workstation (an HP 9000-730 with a
speedy PA-7000 RISC microprocessor @ 66 MHz), which was unable to
decompress the whole gzipped stream of subscribed news in time. The
bottleneck had suddenly moved from the communication line to the CPU!
I remember spending long hours playing with different combinations of
data chunk sizes and gzip compression levels, plotting the results
(with the fine gnuplot),
before finally arriving at a combination that struck a fair balance
between the available bandwidth and CPU speed, maximizing the number of
news articles reaching our university. I think this was my first
realization of how compression could help bring data into a
system faster, making some processes more effective. In fact, it actually
blew my mind and made me passionate about compression technologies
for the years to come.
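A modern equivalent of that tuning exercise can be sketched with Python's zlib: for several chunk sizes and compression levels, measure compression time and compressed size, then estimate which of CPU and link would be the bottleneck. The 64 Kbps link and the synthetic "feed" below are assumptions for illustration only, not a reconstruction of the original setup.

```python
import time
import zlib

LINK_BPS = 64_000 // 8          # assumed 64 Kbps link, in bytes/second

# Synthetic stand-in for a Usenet feed: repetitive header-like text.
data = b"Newsgroups: comp.compression\nSubject: test article\n" * 20000

for chunk_size in (16 * 1024, 64 * 1024, 256 * 1024):
    for level in (1, 6, 9):
        start = time.perf_counter()
        compressed = 0
        for i in range(0, len(data), chunk_size):
            compressed += len(zlib.compress(data[i:i + chunk_size], level))
        cpu_time = time.perf_counter() - start
        wire_time = compressed / LINK_BPS
        # The slower of CPU and link dictates end-to-end throughput.
        bottleneck = "CPU" if cpu_time > wire_time else "link"
        print(f"chunk={chunk_size // 1024:3d} KiB level={level} "
              f"ratio={len(data) / compressed:5.1f} "
              f"cpu={cpu_time:.3f}s wire={wire_time:.3f}s -> {bottleneck}")
```

On a 1992-era link the wire time dominates, so higher levels pay off; on a fast CPU-bound setup the sweet spot moves toward lower levels and larger chunks.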
LZO and the explosion of compression technology
During 1996, Markus F.X.J. Oberhumer started to announce the
availability of his own set of LZO compressors. These consisted of
many different compressors, all of them variations of his own
compression algorithm (LZO), but tweaked to achieve either better
compression ratios or better compression speed. The suite was claimed
to reach speeds of up to 1/3 of the memory speed of
the typical Pentium-class computers available at that time. An entire
set of compressors able to approach memory speed? Boy, that was
very exciting news for me.
LZO was in the back of my mind when I started my work on PyTables in August 2002, and shortly after, in May,
PyTables gained support for LZO. My goal was indeed to accelerate
data transmission from disk to the CPU (and back), and these plots
are testimony to how beneficial LZO was in achieving that goal.
Again, compression was demonstrating that it could effectively
increase disk bandwidth, and not just that of slow Internet lines.
However, although LZO was free of patent issues and fast as hell,
it had a big limitation for a project like PyTables: the licensing.
LZO used the GPL license, which prevented the inclusion of its
sources in distributions without re-licensing PyTables itself as GPL,
something that I was not willing to do (PyTables has a BSD license, as
is usual in the NumPy ecosystem). Because of that, LZO was a nice
compressor to include in GPL projects like the Linux kernel
itself, but not a good fit for PyTables (although support for LZO still
exists, as long as it is downloaded and installed separately).
Around that time (the mid 2000's), a plethora of fast
compressors in the same spirit as LZO started to appear, but with more
permissive licenses (typically BSD/MIT), many of them a nice fit for PyTables.
A new compressor for PyTables/HDF5
By 2008 it was clear that PyTables needed a compressor whose sources
could be included in the PyTables tarball, thus minimizing the
installation requirements. I started considering a series of
libraries and immediately identified FastLZ as a
nice candidate because of its simplicity and performance. Also,
FastLZ had a permissive MIT license, which was what I was looking for.
But pure FastLZ was not completely satisfactory because it was not
simple enough: it had 2 compression levels that
complicated the implementation quite a bit, so I decided to keep just the
highest level, and then optimize certain parts of it so that its speed
would be acceptable. These modifications gave birth to BloscLZ, which
is still the default compressor in Blosc.
But I had more ideas about what other features the new Blosc compressor
should have, namely multi-threading and an integrated shuffle filter.
Multi-threading made a lot of sense by 2008 because both Intel and AMD
already had a wide range of multi-core processors by then, and it was
clear that the race to throw more and more cores into systems was
going to intensify. A fast compressor had to be able to use all these
cores dancing around, period.
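The idea parallelizes naturally because the data is split into independent blocks. Here is a toy Python sketch of that scheme; it is not Blosc's actual code (which is C), and zlib merely stands in for BloscLZ. Conveniently, CPython's zlib releases the GIL while compressing, so the threads genuinely overlap.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

def compress_blocks(data, block_size=256 * 1024, level=1, workers=4):
    """Split data into independent blocks and compress them in parallel."""
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda b: zlib.compress(b, level), blocks))

def decompress_blocks(chunks):
    # Each block is self-contained, so decompression parallelizes too;
    # a serial join is enough to show round-tripping here.
    return b"".join(zlib.decompress(c) for c in chunks)

data = b"0123456789abcdef" * 100_000          # 1.6 MB of sample data
chunks = compress_blocks(data)
assert decompress_blocks(chunks) == data
```

Independent blocks trade a little compression ratio (no cross-block matches) for near-linear scaling with core count, which is exactly the bargain a throughput-oriented compressor wants.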
Shuffle (see slide 71 of this presentation) was the other important
component of the new compressor. This algorithm relies on
neighboring elements of a dataset being highly correlated in order to improve
data compression. A shuffle filter already came as part of the HDF5
library, but it was implemented in
pure C and had a significant computational overhead, so I
decided to write a SIMD version using
the powerful SSE2 instructions
present in all Intel and AMD processors since 2003. The result is
that this new shuffle implementation adds almost zero overhead
compared with the compression/decompression stages.
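In essence, shuffle byte-transposes an array of fixed-size items: all first bytes come first, then all second bytes, and so on. For correlated data this groups the nearly constant high-order bytes into long runs that any LZ-style compressor eats up. A pure Python illustration (orders of magnitude slower than the SSE2 version, of course), with zlib standing in for the compressor:

```python
import struct
import zlib

def shuffle(data, typesize):
    """Byte-transpose: gather byte 0 of every item, then byte 1, etc."""
    n = len(data) // typesize
    return bytes(data[i * typesize + b]
                 for b in range(typesize)
                 for i in range(n))

# 10,000 consecutive 64-bit integers: the high-order bytes barely change.
raw = struct.pack("<10000q", *range(10000))

plain = zlib.compress(raw, 9)
shuffled = zlib.compress(shuffle(raw, 8), 9)
print(len(plain), len(shuffled))   # the shuffled stream compresses better
```

After shuffling, the six all-zero byte planes of each integer become one long run of zeros, which is why the shuffled stream is so much smaller.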
Once all of these features were implemented, I designed a pretty
comprehensive suite of tests and asked the PyTables
community to help me test the new compressor on as many systems as
possible. After some iterations, we were happy to see the new
compressor work flawlessly while compressing and decompressing hundreds
of terabytes on many different Windows and Unix boxes, both
32-bit and 64-bit. The new beast was ready to ship.
Blosc was born
I then took BloscLZ, the multi-threading support and the
SSE2-powered shuffle and put them all in the same package. That also became a
standalone, pure C library, with no attachments to PyTables or HDF5,
so that any application could make
use of it. The first stable version (1.0) of Blosc was
released in July 2010.
Before this, I had already introduced Blosc publicly in my EuroSciPy 2009 keynote and also made a small
reference to it in an article about Starving CPUs, where I wrote:
"As the gap between CPU and memory speed continues to widen, I
expect Blosc to improve memory-to-CPU data transmission rates over
an increasing range of datasets."
And that is the thing: as CPUs get faster, compression can be
used to advantage in more and more
scenarios, to the point that improving the bandwidth of main memory
(RAM) is becoming possible now. And surprisingly enough, the methodology
for achieving that is the same as back in the C-News days: strike a good
balance between data block sizes and compression speed, and let
compression make your applications handle data faster, not only
more compact.
Seen in perspective, it has been a long quest over the last few
decades. During the 90's, compression was useful for improving the
bandwidth of slow Internet connections. In the 2000's, it made it
possible to accelerate disk I/O operations. In the 2010's, Blosc's goal is
to make the memory subsystem faster; whether it can
achieve this or not will be the subject of future blogs (hint: data
arrangement is critical too). But one
thing is clear: achieving this (by Blosc or any other compressor out
there) is just a matter of time. Such is the fate of the ever-increasing
gap between CPU and memory speeds.