> 1) both do the same thing.

The start of this thread is the exact opposite - trying to suggest ORC is better for storage & wanting to use it.
> As it relates the columnar formats, it is silly arms race.

I'm not sure "silly" is the operative word - we've shed a lot of the fragmentation in the community and are down to two good choices, neither of them wrong. Impala's original format was Trevni, which lives on in the Avro docs. There was also RCFile - a SequenceFile-based format that stored columnar data inside a <K,V> pair - and then LazySimple SequenceFile, LazyBinary SequenceFile, Avro, and Text with many SerDes.

Purely speculatively, we're headed into more fragmentation again, with people rediscovering that they need updates. Uber's Hoodie is the Parquet fork, but for Spark, not Impala, while ORC ACID is getting much easier to update, with MERGE statements and a deadlock-aware txn manager (sketched below).

> Parquet had C/C++ right off the bat of course because impala has to work in
> C/C++.

I think that is the primary reason the Java Parquet readers are still well behind in performance: nobody sane wants to performance-tune a data reader library in Java when it is so much easier to do in C++. In hindsight, doing C++ after tuning the format for optimal performance on Java 8 makes a lot of sense - the marshmallow test is easier if you can't have a marshmallow now.

> 1) uses text file anyway because it is the ONLY format all tools support

I see this often: folks who just throw plain text into S3 and query it. The Hive 3.x branch has text vectorization and LLAP cache support for it (the relevant settings are sketched below), so hopefully the only remaining concern about Text will be the storage cost of its poor compression (& the lack of updates).
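For concreteness, the update path that MERGE enables on an ORC ACID table looks roughly like the sketch below. Table and column names (customers, customer_updates, id, email, updated) are hypothetical, and the bucketing clause reflects the pre-3.x ACID requirement - check the exact requirements against your Hive version.

  -- Requires the ACID transaction manager (the deadlock-aware DbTxnManager):
  SET hive.support.concurrency=true;
  SET hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;

  -- Hypothetical target table: ORC + transactional.
  CREATE TABLE customers (
    id      BIGINT,
    email   STRING,
    updated TIMESTAMP)
  CLUSTERED BY (id) INTO 8 BUCKETS
  STORED AS ORC
  TBLPROPERTIES ('transactional'='true');

  -- Upsert a staging feed in a single statement, instead of rewriting files:
  MERGE INTO customers AS t
  USING customer_updates AS s
  ON t.id = s.id
  WHEN MATCHED THEN
    UPDATE SET email = s.email, updated = s.updated
  WHEN NOT MATCHED THEN
    INSERT VALUES (s.id, s.email, s.updated);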
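And for the text path, the knobs involved look roughly like this - the property names are the ones in the Hive 3.x HiveConf, but treat the exact set and the defaults as something to verify against your build:

  -- Vectorized execution in general:
  SET hive.vectorized.execution.enabled=true;
  -- Vectorize row-serde formats such as text via vector deserialization:
  SET hive.vectorized.use.vector.serde.deserialize=true;
  -- LLAP IO, plus re-encoding text into the LLAP column cache on first read:
  SET hive.llap.io.enabled=true;
  SET hive.llap.io.encode.enabled=true;

Cheers,
Gopal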