Re: Format dillema

Edward Capriolo Fri, 23 Jun 2017 08:15:44 -0700

"Yes, it's a tautology - if you cared about performance, you'd use ORC,
because ORC is the fastest format."

It is not that simple. The average Hadoop user has 6-7 years of data. They
do not have a "magic" convert-everything button. They also have legacy
processes that don't/can't be converted. They do not want the "fastest
format"; they want "the fastest Hive for their data". They get data dumps
from potentially unsophisticated partners, maybe as CSV in S3, maybe
because their partner uses Vertica or Redshift. I think you understand
this.

Suppose you have 100 GB of text data in an S3 bucket, and say querying it
takes, let's just say, "50 seconds for a group-by type query".

It takes a "70 second CTAS query" and maybe 40 GB more storage to create a
second copy in ORC. Now, with that second copy, maybe I can do the same
group by in 30 seconds. But in reality, you
1) are IO bound,
2) have 10 seconds of startup time anyway, and
3) now have two copies of the data: 2x the metastore entries and 2 things
   to clean up.
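
To put rough numbers on that trade-off, here is a back-of-the-envelope
sketch using only the illustrative timings above (50 s per query on text,
30 s on ORC, 70 s for the one-time CTAS). The function name is made up for
this example, and it assumes every query costs the same fixed time, which
ignores the IO and startup effects just listed.

```python
import math


def break_even_queries(text_query_s, orc_query_s, ctas_s):
    """Smallest number of queries for which converting to ORC first
    is faster overall than querying the text copy directly.

    Hypothetical helper: assumes a fixed cost per query and a one-time
    conversion cost, all in seconds.
    """
    saving = text_query_s - orc_query_s
    if saving <= 0:
        # ORC is not faster per query, so the conversion never pays off.
        return None
    # Need n * saving > ctas_s, i.e. the smallest integer n above ctas_s / saving.
    return math.floor(ctas_s / saving) + 1


# Illustrative timings from the example above, not measurements.
print(break_even_queries(text_query_s=50, orc_query_s=30, ctas_s=70))
```

With those numbers the conversion only comes out ahead once the same data
has been queried a handful of times, and that is before counting the extra
storage and the cleanup.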

So it's great that ORC is great, but the reality is I cannot make my
webserver spit out a log in ORC format :)

On Thu, Jun 22, 2017 at 7:30 PM, Gopal Vijayaraghavan <gop...@apache.org>
wrote:

>
> > I kept hearing about vectorization, but later found out it was going to
> work if i used ORC.
>
> Yes, it's a tautology - if you cared about performance, you'd use ORC,
> because ORC is the fastest format.
>
> And doing performance work to support folks who don't quite care about it,
> is not exactly "see a need, fill a need".
>
> > Litterally years have come and gone and we are talking like 3.x is going
> to vectorize text.
>
> Literally years have gone by since the feature came into Hive. Though it
> might have crept up on you - if Vectorization had been enabled by default,
> it would've been immediately obvious.
>
> HIVE-9937 is so old, that I'd say the first line towards Text
> vectorization came in in Q1 2015.
>
> In the current master, you can get a huge boost out of it - if you want
> you can run BI over 100Tb of text.
>
> https://www.slideshare.net/Hadoop_Summit/llap-building-cloudfirst-bi/27
>
> > … where some not negligible part of the features ONLY work with ORC.
>
> You've got it backwards - ORC was designed to support those features.
>
> Parquet could be following ORC closely, but at least the Java
> implementation hasn't.
>
> Cheers,
> Gopal
>
>
>
>
>
