Re: Format dillema

2017-06-23 Thread Gopal Vijayaraghavan
> I guess I see different things. Having used all the tech. In particular for large hive queries I see OOM simply SCANNING THE INPUT of a data directory, after 20 seconds!
If you've got an LLAP deployment you're not happy with - this list is the right place to air your grievances. I usual

Re: Format dillema

2017-06-23 Thread Edward Capriolo
"You're off by a couple of orders of magnitude - in fact, that was my last year's Hadoop Summit demo, 10 terabytes of Text on S3, converted to ORC + LLAP." "We've got sub-second SQL execution, sub-second compiles, sub-second submissions … with all of it adding up to a single or double digit second

Re: Format dillema

2017-06-23 Thread Gopal Vijayaraghavan
> It is not that simple. The average Hadoop user has 6-7 years of data. They do not have a "magic" convert everything button. They also have legacy processes that don't/can't be converted.
> …
> They do not want the "fastest format" they want "the fastest hive for their data".
I've yet to

Re: Format dillema

2017-06-23 Thread Edward Capriolo
"Yes, it's a tautology - if you cared about performance, you'd use ORC, because ORC is the fastest format." It is not that simple. The average Hadoop user has years 6-7 of data. They do not have a "magic" convert everything button. They also have legacy processes that don't/can't be converted. The

Re: Format dillema

2017-06-22 Thread Gopal Vijayaraghavan
> I kept hearing about vectorization, but later found out it was going to work if I used ORC.
Yes, it's a tautology - if you cared about performance, you'd use ORC, because ORC is the fastest format. And doing performance work to support folks who don't quite care about it, is not exactly

Re: Format dillema

2017-06-20 Thread Edward Capriolo
"Hive 3.x branch has text vectorization and LLAP cache support for it, so hopefully the only relevant concern about Text will be the storage costs due to poor compression (& the lack of updates)." I kept hearing about vectorization, but later found out it was going to work if i used ORC. Litterall

Re: Format dillema

2017-06-20 Thread Gopal Vijayaraghavan
> 1) both do the same thing.
The start of this thread is the exact opposite - trying to suggest ORC is better for storage & wanting to use it.
> As it relates the columnar formats, it is silly arms race.
I'm not sure "silly" is the operative word - we've lost a lot of fragmentation of the c

Re: Format dillema

2017-06-20 Thread Edward Capriolo
"Hive and LLAP do support Parquet precisely because the developers want to be able to process everyone's data." Yes. But there are a number of optimizations on the Hive ORC side that we know are not implemented on the Parquet support. Which is why I made my statement. Impala( Parq=yes, orc=no) Hiv

Re: Format dillema

2017-06-20 Thread Owen O'Malley
On Tue, Jun 20, 2017 at 10:12 AM, Edward Capriolo wrote:
> It is whack that two optimized row columnar formats exist and each respective project (hive/impala) has good support for one and lame/no support for the other.
We have two similar formats because they were designed at roughly the

Re: Format dillema

2017-06-20 Thread Edward Capriolo
It is whack that two optimized row columnar formats exist and each respective project (hive/impala) has good support for one and lame/no support for the other. Impala is now an Apache project. Also 'whack' and 'lame' are technical terms often used by the people in the real world that have to use

Re: Format dillema

2017-06-20 Thread Owen O'Malley
You should also try LLAP. With ORC or text, it will cache the hot columns and partitions in memory. I can't seem to find the slides yet, but the Comcast team had good results with LLAP: https://dataworkssummit.com/san-jose-2017/sessions/hadoop-query-performance-smackdown/ https://twitter.com/thej

Re: Format dillema

2017-06-20 Thread Furcy Pin
Another option would be to try Facebook's Presto https://prestodb.io/ Like Impala, Presto is designed for fast interactive querying over Hive tables, but it is also capable of querying data from many other sources (MySQL, PostgreSQL, Kafka, Cassandra, ... https://prestodb.io/docs/current/conne

Re: Format dillema

2017-06-19 Thread Sruthi Kumar Annamneedu
Try using Parquet with Snappy compression; Impala works with this combination.
On Sun, Jun 18, 2017 at 3:35 AM, rakesh sharma wrote:
> We are facing an issue of format. We would like to do bi style queries from hive using impala and that supports parquet but we would like the data to