That is quite interesting. One challenge in general with Parquet and Arrow in the otel / observability ecosystem is that the shape of span data is not fully known up front. Spans carry arbitrary attributes, and those attributes can change. To the best of my knowledge no particularly great solution exists today for encoding this. I wonder to what degree this system could be "abused" for that.
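For what it's worth, the usual workaround today (a sketch in plain pyarrow, with attribute names made up for illustration) is to keep the well-known span fields as typed columns and push the arbitrary attributes into a map<string, string> column. The schema stays stable, but you lose per-attribute typing and column pruning, which is exactly the pain point being described:

    import pyarrow as pa

    # Fixed span fields as typed columns; open-ended attributes as a map column.
    # Stable schema, but no per-attribute typing or pruning.
    spans = pa.table({
        "trace_id": pa.array(["t1", "t2"], type=pa.string()),
        "duration_ns": pa.array([1_200, 3_400], type=pa.int64()),
        "attributes": pa.array(
            [
                [("http.method", "GET")],
                [("db.system", "postgres"), ("retry.count", "1")],
            ],
            type=pa.map_(pa.string(), pa.string()),
        ),
    })
    print(spans.schema)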
Basically, since I ended up building a custom library for this, I wanted to solve the portability problem by making it stupidly simple to reverse engineer, so I cooked up a convention where each column (and supporting column) is a file, with a file name that describes its format and role.
So a real-world production table looks like this if you ls in the directory (omitting a few columns for brevity):
combinedId.0.dat.s64le.bin
documentMeta.0.dat.s64le.bin
features.0.dat.s32le.bin
size.0.dat.s32le.bin
termIds.0.dat-len.varint.bin
termIds.0.dat.s64le[].zstd
termMetadata.0.dat-len.varint.bin
termMetadata.0.dat.s8[].zstd
The design goal is that, just based on an ls output, someone who has never seen the code of the library producing the files should be able to trivially write code that reads them. It's also a very old format, so not without its warts :)
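To make that concrete, here is a minimal sketch of a reader written only from the ls output above, with my assumed reading of the suffixes (s64le/s32le = signed 64/32-bit little-endian, one value per row; -len.varint = per-row element counts for the []/zstd columns). The helper names are mine, not the library's:

    import numpy as np

    def read_fixed(path: str, dtype: str) -> np.ndarray:
        """Read a fixed-width column, e.g. combinedId.0.dat.s64le.bin."""
        return np.fromfile(path, dtype=np.dtype(dtype))

    def read_varint_lengths(path: str) -> list[int]:
        """Decode a termIds.0.dat-len.varint.bin style length file.
        Assumes LEB128-style varints (low 7 bits per byte, high bit = continue)."""
        out, value, shift = [], 0, 0
        with open(path, "rb") as f:
            for byte in f.read():
                value |= (byte & 0x7F) << shift
                if byte & 0x80:
                    shift += 7
                else:
                    out.append(value)
                    value, shift = 0, 0
        return out

    combined_id = read_fixed("combinedId.0.dat.s64le.bin", "<i8")
    features = read_fixed("features.0.dat.s32le.bin", "<i4")
    term_lengths = read_varint_lengths("termIds.0.dat-len.varint.bin")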
- https://indico.fnal.gov/event/23628/contributions/240607/
- https://indico.cern.ch/event/1338689/contributions/6077632/
I have no idea since I've never had access to Snowflake...
Vortex is a file format. In their canonicalized, uncompressed form, vortex files are simply Apache Arrow IPC files with some of the bits and bobs moved around a bit (enabling transformation to/from Arrow), plus some extra metadata about types, summary statistics, data layout, etc.
The Vortex spec supports fancy strategies for compressing columns, the ability to store summary statistics alongside data, and the ability to specify special compute operations for particular data columns. Vortex also specifies the schema of the data as metadata, separately from the physical layout of the data on disk. All Arrow arrays can be converted zero-copy into Vortex arrays, but not vice-versa.
Vortex also supports extensions in the form of new encodings and compression strategies. The idea here is that, as new ways of encoding data appear, they can be supported by Vortex without creating a whole new file format.
Vortex-serde is a serde library for Vortex files. In addition to classic serialization/deserialization, it supports giving applications access to all those fancy compute and summary statistics features I mentioned above.
You say "Vortex is a toolkit for working with compressed Apache Arrow arrays in-memory, on-disk, and over-the-wire," but that's kind of like saying "MKV is a toolkit for working with compressed AVI and WAV files." It sounds like Vortex is a flexible file spec that lets you:
1. Work with Arrow arrays on disk with options for compression.
2. Create files that model data that can't be modeled in Arrow due to Arrow's hard coupling between encoding and logical typing (see the sketch after this list).
3. Utilize a bunch of funky and innovative new features not available in existing data file formats and probably only really interesting to people who are nerds about this (laypeople will be interested in the performance improvements, though).
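To make point 2 concrete with plain pyarrow (nothing Vortex-specific here): in Arrow, dictionary-encoding a column changes its logical type, so an encoding decision leaks into the schema. Vortex's pitch, per the description above, is that the logical type stays fixed while the physical encoding varies underneath.

    import pyarrow as pa

    # In Arrow the encoding is part of the type: dictionary-encoding a string
    # column produces a different logical type, visible to every consumer.
    plain = pa.array(["a", "b", "a", "a"], type=pa.string())
    encoded = plain.dictionary_encode()

    print(plain.type)    # string
    print(encoded.type)  # dictionary<values=string, indices=int32, ordered=0>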
The previous generation of analytics/"Big Data" projects (think Hadoop, Spark, Kafka, Elastic) were all built on the JVM. They were monolithic distributed-systems clusters hosted on VMs or on-premise. They were servers with clients implemented in Java. It is effectively impossible to embed a Java library into anything non-Java; the best you can do is fork a JVM with a carefully maintained classpath and hit it over the network (cf. PySpark). Kafka has externally maintained bindings that lag the official JVM client.
Parquet was built during this era, so naturally its reference implementation was written in Java. For many years, the only implementation of Parquet was in Java. Even when parquet-cpp and subsequent implementations began to pop up, the Parquet Java implementation was still the best maintained. Over time as the spec got updated and new features made their way into Parquet, different implementations had different support. Files written by parquet-cpp or parquet-rs could not be opened via Spark or Presto.
The newer generation of data analytics tooling is meant to be easily embedded, so that generally means a native language that can export shared objects with a C ABI consumable by the FFI layer of different languages. That leaves you with a few options, and of those Rust is arguably the best for reasons of tooling and ecosystem, though different projects make different choices. DuckDB, for example, is an extremely popular library with bindings in several languages, and it was built in C++ long after Rust came into vogue.
While Vortex doesn't (yet) have a C API, we do have Python bindings that we expect to be the main way people use it.
Thank god everyone wised up. The tool maketh not the craftsman. These days the "written in Rust" tag is met with knee-jerk skepticism, as if the hive mind overcorrected.
How does this compare to Lance (https://lancedb.github.io/lance/)?
What do you think the key applied use case for Vortex is?
Two things I would hope to see before I'd start using Vortex are geospatial data support (there's already GeoParquet [1]) and WASM for in-browser Arrow processing. Things like Lonboard [2] and Observable Framework [3] rely on Parquet, Arrow and DuckDB files for their powerful data analytics and visualisation.
So it’s a toolkit written in Rust. It is not a file format.
That said, the immediate next line in the README perhaps clarifies a bit?
"Vortex is designed to be to columnar file formats what Apache DataFusion is to query engines (or, analogously, what LLVM + Clang are to compilers): a highly extensible & extremely fast framework for building a modern columnar file format, with a state-of-the-art, "batteries included" reference implementation."
It’s a framework for building file formats. This does not indicate that Vortex is, itself, a file format.
Perhaps we should clean up the wording in the intro, but yes there is in fact a file format!
We actually built the toolkit first, before building the file format. The interesting thing here is that we have a consistent in-memory and on-disk representation of compressed, typed arrays.
This is nice for a couple of reasons:
(a) It makes it really easy to test out new compression algorithms and compute functions. We just implement a new codec and it's automatically available for the file format.
(b) We spend a lot of energy on efficient push down. Many compute functions such as slicing and cloning are zero-cost, and all compute operations can execute directly over compressed data.
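As a toy illustration of point (b), and not the Vortex API: with run-length encoding, an aggregate can be computed from the (value, run length) pairs without ever materializing the decompressed array.

    # Run-length encoded column: (value, run_length) pairs.
    runs = [(7, 1_000_000), (0, 250_000), (42, 10)]

    # Sum and count fall straight out of the compressed representation;
    # slicing is similarly just arithmetic on run boundaries.
    total = sum(value * length for value, length in runs)
    count = sum(length for _, length in runs)
    print(total, count)  # identical to summing the 1,250,010 decoded values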
Highly encourage you to check out the vortex-serde crate in the repo for file format things, and the vortex-datafusion crate for some examples of integrating the format into a query engine!
I’m working on the Python API now. I think we probably want the user to specify, on write, whether they want row groups or not and then we can enforce that as we write.
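For reference, Parquet's Python writer already exposes that choice as a knob at write time; a Vortex writer flag would presumably look similar (the Parquet call below is real API, the Vortex analogue is speculation on my part):

    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.table({"x": list(range(100_000))})

    # Parquet: the writer decides row-group boundaries, here every 16k rows.
    pq.write_table(table, "data.parquet", row_group_size=16_000)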
In Vortex, we've specifically invested in high throughput compression techniques that admit O(1) random access. These kinds of techniques are also sometimes called "lightweight compression". The DuckDB folks have a good writeup [1] on the common ones.
[1] https://duckdb.org/2022/10/28/lightweight-compression.html
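For a flavor of what "lightweight with O(1) random access" means, here is a minimal frame-of-reference sketch (one of the schemes covered in the DuckDB post): values are stored as small deltas from a reference, and decoding element i is one array index plus an add, with no block decompression.

    import numpy as np

    # Frame-of-reference in miniature: int64 values stored as a reference plus
    # narrow deltas. 2 bytes per value instead of 8, and access stays O(1).
    values = np.array([1_000_017, 1_000_003, 1_000_250, 1_000_111], dtype=np.int64)
    reference = int(values.min())
    deltas = (values - reference).astype(np.uint16)

    def get(i: int) -> int:
        """Decode a single element without touching the rest of the column."""
        return reference + int(deltas[i])

    assert get(2) == 1_000_250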
https://blog.acolyer.org/2018/09/26/the-design-and-implement...
The compression and decompression throughputs of Vortex (and other lightweight compression schemes) are similar to or better than Parquet's for many common datasets. Unlike Zstd or Blosc, the lightweight encodings are generally both computationally simple and SIMD-friendly. We’re seeing multiple gibibytes per second on an M2 MacBook Pro on various datasets in the PBI benchmark [1].
The key insight is that most of the data we all work with has common patterns that don’t require sophisticated, heavyweight compression algorithms. Let’s take advantage of that fact to free up more cycles for compute kernels!
That’s the main thing currently irritating me about parquet
It's then generally up to a higher-level component called a table format to handle the idea of edits. See for example how Apache Iceberg handles deletes https://iceberg.apache.org/spec/#row-level-deletes
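Roughly what that looks like, as a conceptual sketch rather than Iceberg's actual implementation: a positional delete file records which row positions of an immutable data file are dead, and readers apply it as a filter at scan time instead of rewriting the file.

    import pyarrow as pa
    import pyarrow.compute as pc

    # The immutable data file (conceptually a Parquet/Vortex file in object storage).
    data = pa.table({"id": [1, 2, 3, 4, 5], "value": ["a", "b", "c", "d", "e"]})

    # A positional delete file: "rows 1 and 3 of that data file are deleted".
    deleted_positions = pa.array([1, 3], type=pa.int64())

    # Readers merge the two at scan time rather than editing the file in place.
    positions = pa.array(list(range(data.num_rows)), type=pa.int64())
    keep = pc.invert(pc.is_in(positions, value_set=deleted_positions))
    print(data.filter(keep))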
Because S3 and its ilk are structured as globally replicated KV stores, you're not likely to get in-place edits anytime soon, and until the cost structure incentivizes otherwise you're going to continue to see data systems that favor immutable cloud storage.
Of course, if your Arrow file is in some object store, it's unclear how you'd go about deleting arbitrary bytes from it.