The paper [1] is helpful. Compression may also help, but it may be
difficult to standardize.
[1] https://vldb.org/pvldb/vol12/p2022-chattopadhyay.pdf
On 12/25/21 5:37 AM, Micah Kornfield wrote:
What exactly are you looking for? To my knowledge neither Capacitor nor
Artus has been described in enough detail external to Google to allow for
external benchmarking, so the details would probably only be relevant to
Google.
Both formats have more complicated encodings and embedded data structures,
making them closer to Parquet (which is loosely based on a precursor to
Capacitor) and ORC than to Arrow. There are interesting ideas from the
Procella paper which covers Artus that might be worth thinking about in the
context of these formats (or a new one).
Arrow has not put much focus on optimizing storage size.
Cheers,
Micah
On Wednesday, December 22, 2021, Benson Muite <benson_mu...@emailplus.org>
wrote:
On 12/23/21 7:14 AM, Hayden Livingston wrote:
Has anyone been able to benchmark the Artus file format vs Arrow?
It seems that the Artus file format is gaining traction inside Google,
replacing their current columnar format Capacitor.
Hayden,
Do you have a link to a specification or implementation of Artus?
Performance may also depend on disk type, network, etc.