> I guess I see different things, having used all the tech. In particular, for
> large Hive queries I see an OOM simply SCANNING THE INPUT of a data directory,
> after 20 seconds!
If you've got an LLAP deployment you're not happy with, this list is the right
place to air your grievances. I usually …
"You're off by a couple of orders of magnitude - in fact, that was my last
year's Hadoop Summit demo, 10 terabytes of Text on S3, converted to ORC +
LLAP."
"We've got sub-second SQL execution, sub-second compiles, sub-second
submissions … with all of it adding up to a single or double digit second
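(For reference, the conversion mentioned above is a plain per-table CTAS in
Hive - a minimal sketch, with hypothetical table names:)

    -- convert an existing text-format table to ORC; names are hypothetical
    CREATE TABLE logs_orc STORED AS ORC
    AS SELECT * FROM logs_text;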
> It is not that simple. The average Hadoop user has 6-7 years of data. They do
> not have a "magic" convert everything button. They also have legacy processes
> that don't/can't be converted.
…
> They do not want the "fastest format"; they want "the fastest Hive for their
> data".
I've yet to …
"Yes, it's a tautology - if you cared about performance, you'd use ORC,
because ORC is the fastest format."
It is not that simple. The average Hadoop user has 6-7 years of data. They
do not have a "magic" convert everything button. They also have legacy
processes that don't/can't be converted. The …
> I kept hearing about vectorization, but later found out it was only going to
> work if I used ORC.
Yes, it's a tautology - if you cared about performance, you'd use ORC, because
ORC is the fastest format.
And doing performance work to support folks who don't quite care about it is
not exactly …
"Hive 3.x branch has text vectorization and LLAP cache support for it, so
hopefully the only relevant concern about Text will be the storage costs
due to poor compression (& the lack of updates)."
I kept hearing about vectorization, but later found out it was only going to
work if I used ORC. Literally …
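(Side note for anyone following along: since Hive 2.3 you can check whether a
given query actually vectorizes with EXPLAIN VECTORIZATION - a minimal sketch,
table name hypothetical:)

    SET hive.vectorized.execution.enabled=true;
    -- the summary reports, per operator, whether vectorization kicked in and why not
    EXPLAIN VECTORIZATION
    SELECT count(*) FROM logs_orc;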
> 1) both do the same thing.
The start of this thread is the exact opposite - trying to suggest ORC is
better for storage & wanting to use it.
> As it relates to the columnar formats, it is a silly arms race.
I'm not sure "silly" is the operative word - we've lost a lot to fragmentation
of the community …
"Hive and LLAP do support Parquet precisely because the developers want to
be able to process everyone's data."
Yes. But there are a number of optimizations on the Hive ORC side that we
know are not implemented in the Parquet support, which is why I made my
statement. Impala (Parquet=yes, ORC=no); Hive (ORC=yes, Parquet=partial).
On Tue, Jun 20, 2017 at 10:12 AM, Edward Capriolo wrote:
> It is whack that two optimized row columnar formats exist and each
> respective project (hive/impala) has good support for one and lame/no
> support for the other.
>
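(A concrete example of the ORC-side optimizations being discussed: predicate
pushdown plus per-column bloom filters, enabled with standard Hive settings
and ORC table properties - a sketch with hypothetical names:)

    SET hive.optimize.index.filter=true;  -- use ORC indexes/bloom filters for predicate pushdown
    CREATE TABLE clicks_orc STORED AS ORC
      TBLPROPERTIES ('orc.bloom.filter.columns'='user_id')
    AS SELECT * FROM clicks_text;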
We have two similar formats because they were designed at roughly the same
time …
> It is whack that two optimized row columnar formats exist and each
> respective project (hive/impala) has good support for one and lame/no
> support for the other.
Impala is now an Apache project. Also, 'whack' and 'lame' are technical
terms often used by the people in the real world who have to use these
systems.
You should also try LLAP. With ORC or text, it will cache the hot columns
and partitions in memory. I can't seem to find the slides yet, but the
Comcast team had good results with LLAP:
https://dataworkssummit.com/san-jose-2017/sessions/hadoop-query-performance-smackdown/
https://twitter.com/thej
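(If you want to try it, LLAP can be toggled per session once the daemons are
running - a minimal sketch; the caching of hot columns is automatic:)

    SET hive.execution.engine=tez;      -- LLAP runs on the Tez engine
    SET hive.llap.execution.mode=all;   -- none | map | all | only | auto
    SELECT count(*) FROM clicks_orc;    -- hypothetical table; hot data gets cached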
Another option would be to try Facebook's Presto: https://prestodb.io/
Like Impala, Presto is designed for fast interactive querying over Hive
tables, but it is also capable of querying data from many other sources
(MySQL, PostgreSQL, Kafka, Cassandra, …):
https://prestodb.io/docs/current/conne
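(A sketch of what that looks like - Presto addresses tables as
catalog.schema.table, so a single query can join across connectors; all
names here are hypothetical:)

    SELECT c.user_id, a.plan
    FROM hive.web.clicks c
    JOIN mysql.crm.accounts a
      ON c.user_id = a.user_id;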
Try using Parquet with Snappy compression; Impala works well with that
combination.
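For example, in Hive that combination can be written as a CTAS - a minimal
sketch, table names hypothetical; 'parquet.compression' is the standard
Parquet table property:

    CREATE TABLE events_parquet
      STORED AS PARQUET
      TBLPROPERTIES ('parquet.compression'='SNAPPY')
    AS SELECT * FROM events_text;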
On Sun, Jun 18, 2017 at 3:35 AM, rakesh sharma wrote:
> We are facing an issue of format. We would like to do BI-style queries
> from Hive using Impala, which supports Parquet, but we would like the data
> to …