Re: Why does ORC use Deflater instead of native ZlibCompressor?

2016-06-23 Thread Owen O'Malley
For compression, I'm also interested in investigating the pure-Java compression codecs that were done by the Presto project: https://github.com/airlift/aircompressor They've implemented LZ4, Snappy, and LZO in pure Java. On Thu, Jun 23, 2016 at 8:04 PM, Gopal Vijayaraghavan wrote: > Though,

Re: Why does ORC use Deflater instead of native ZlibCompressor?

2016-06-23 Thread Gopal Vijayaraghavan
> Though, I'm also wondering about the performance difference between the two. Since they both use native implementations, theoretically they can be close in performance. ZlibCompressor block compression was extremely slow due to the non-JNI bits in Hadoop -

Re: Why does ORC use Deflater instead of native ZlibCompressor?

2016-06-23 Thread Aleksei Statkevich
It might be a good idea. Though, I'm also wondering about the performance difference between the two. Since they both use native implementations, theoretically they can be close in performance. Are there any benchmarks for them? Aleksei Statkevich | Engineering Manager

Re: Why does ORC use Deflater instead of native ZlibCompressor?

2016-06-23 Thread Owen O'Malley
Actually, that should work. I'm a little concerned about the memory copy that the Hadoop ZlibCompressor does, but it should be a win. If you want to work on it, why don't you create a jira on the orc project? Don't forget that you'll need to handle the other options in CompressionCodec.modify. ..

Re: Hive/Tez ORC tables -- rawDataSize value

2016-06-23 Thread Lalitha MV
Thanks for the responses Prasanth and Mich. They were helpful. @Mich: Output of desc formatted: 1. Textfile: Table Parameters: COLUMN_STATS_ACCURATE {"BASIC_STATS":"true"} numFiles 1 numRows 100 rawDataSize 471

Re: Hive/Tez ORC tables -- rawDataSize value

2016-06-23 Thread Prasanth Jayachandran
Please find answers inline. On Jun 23, 2016, at 3:49 PM, Lalitha MV <lalitham...@gmail.com> wrote: Hi, I am using Hadoop 2.7.2, Tez 0.8.3 and Hive 2.0.1. I created a Hive table with text file size = ~141 MB. show tblproperties of this table (textfile): numFiles 1 numRows 100

Re: Hive/Tez ORC tables -- rawDataSize value

2016-06-23 Thread Mich Talebzadeh
Hi, Can you please send the output of DESC FORMATTED after running (if you have not done so already) ANALYZE TABLE COMPUTE STATISTICS FOR COLUMNS for both tables? HTH, Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

Hive/Tez ORC tables -- rawDataSize value

2016-06-23 Thread Lalitha MV
Hi, I am using Hadoop 2.7.2, Tez 0.8.3 and Hive 2.0.1. I created a Hive table with text file size = ~141 MB. show tblproperties of this table (textfile): numFiles 1 numRows 100 rawDataSize 141869803 totalSize 142869803 I then created a Hive table, with ORC compression, from t

RE: Optimize Hive Query

2016-06-23 Thread Markovitz, Dudu
Thanks, I wanted to rule out skewness over m_d_key, sb_gu_key. Dudu From: @Sanjiv Singh [mailto:sanjiv.is...@gmail.com] Sent: Thursday, June 23, 2016 11:55 PM To: user@hive.apache.org; Markovitz, Dudu; sanjiv singh (ME) Subject: Re: Optimize Hive Query Hi Dudu, find below query response. Qu

Hive on Spark issues with Hive-XML-Serde

2016-06-23 Thread yeshwanth kumar
Hi, we are using Cloudera 5.7.0. There's a use case to process XML data; we are using the https://github.com/dvasilen/Hive-XML-SerDe XML SerDe. It works with Hive's execution engine set to MapReduce, but after we enabled Hive on Spark to test the performance, we are facing the following issue: 16/06/23 12:4

Re: Why does ORC use Deflater instead of native ZlibCompressor?

2016-06-23 Thread Aleksei Statkevich
Hi Owen, Thanks for the response. I saw that DirectDecompressor will be used if available and the difference was only in compression. Keeping in mind what you said, I looked at the code again. I see that the only specific piece that ORC uses is "nowrap" = true in Deflater. As far as I understand f
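To make the "nowrap" point concrete, here is a small self-contained sketch (illustrative code, not taken from ORC): with nowrap = true, java.util.zip.Deflater emits a raw deflate stream, omitting the 2-byte zlib header and 4-byte Adler-32 trailer that the default mode adds, so the wrapped form of the same input is exactly 6 bytes longer.

```java
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class NowrapDemo {
    public static void main(String[] args) throws Exception {
        byte[] input = "hello hello hello hello".getBytes("UTF-8");

        // nowrap = true: raw deflate stream, no zlib header/trailer (the mode ORC picks)
        Deflater raw = new Deflater(Deflater.DEFAULT_COMPRESSION, true);
        raw.setInput(input);
        raw.finish();
        byte[] rawBuf = new byte[256];
        int rawLen = raw.deflate(rawBuf);
        raw.end();

        // nowrap = false: zlib-wrapped stream (2-byte header + 4-byte Adler-32 trailer)
        Deflater wrapped = new Deflater(Deflater.DEFAULT_COMPRESSION, false);
        wrapped.setInput(input);
        wrapped.finish();
        byte[] wrapBuf = new byte[256];
        int wrapLen = wrapped.deflate(wrapBuf);
        wrapped.end();

        System.out.println("raw=" + rawLen + " wrapped=" + wrapLen); // wrapped is 6 bytes longer

        // A raw stream must be read back with a matching nowrap Inflater.
        // Per the Inflater javadoc, nowrap mode may need one extra dummy byte of
        // input; rawBuf is zero-filled past rawLen, so passing rawLen + 1 covers that.
        Inflater inf = new Inflater(true);
        inf.setInput(rawBuf, 0, rawLen + 1);
        byte[] back = new byte[256];
        int n = inf.inflate(back);
        inf.end();
        System.out.println(new String(back, 0, n, "UTF-8"));
    }
}
```

The practical consequence is that a raw (nowrap) stream and a zlib-wrapped stream are not interchangeable: each side must agree on the framing, which is why the ORC reader and writer both use the nowrap setting.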

Re: Optimize Hive Query

2016-06-23 Thread @Sanjiv Singh
Thanks Mich for your inputs. Let me try that as well. Will post the response.

Aggregated table larger than expected

2016-06-23 Thread Matt Olson
Hi, I am working with an hourly table and a daily table in Hive 1.0.1. Both tables have the same schema except that the hourly table is partitioned by dt and hour, but the daily table is partitioned only by dt. At the end of each day, the records from the hourly table are aggregated into the daily
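The end-of-day roll-up described above could be sketched roughly like this (a hedged sketch only; the table, column, and partition names are invented for illustration, since the thread does not show the actual statements):

```sql
-- Hypothetical: roll all hourly partitions of one day into the single daily partition.
INSERT OVERWRITE TABLE daily_table PARTITION (dt = '2016-06-22')
SELECT col_a, col_b, col_c          -- same schema, minus the hour partition column
FROM   hourly_table
WHERE  dt = '2016-06-22';
```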

Re: Why does ORC use Deflater instead of native ZlibCompressor?

2016-06-23 Thread Owen O'Malley
On Fri, Jun 17, 2016 at 11:31 PM, Aleksei Statkevich <astatkev...@rocketfuel.com> wrote: > Hello, I recently looked at ORC encoding and noticed that hive.ql.io.orc.ZlibCodec uses Java's java.util.zip.Deflater and not Hadoop's native ZlibCompressor. Can someone please tell me what is t

Re: Optimize Hive Query

2016-06-23 Thread @Sanjiv Singh
Hi Dudu, find below the query response.

Query:
> select m_d_key, sb_gu_key, count(*) as cnt
> from tuning_dd_key
> group by m_d_key, sb_gu_key
> order by cnt desc
> limit 100;

Output:
> 16 9042668 1361
> 16 8063808 1361
> 16 8569864 1361
> 16 8909889 1361
> 16 9864785 1

Re: Optimize Hive Query

2016-06-23 Thread Mich Talebzadeh
Funnily enough, it is pretty close to similar ORC transactional tables I have: standard with 256 buckets, with two columns as below. Number of distinct values in column m_d_key: 29; number of distinct values in column sb_gu_key: 15434343. You have also vectorised data, taking 1024 rows at once. Still

Re: Optimize Hive Query

2016-06-23 Thread @Sanjiv Singh
Hi Mich, Please find below the output of the command. desc formatted tuning_dd_key ; [truncated output table: col_name | data_type | ...]

RE: Optimized Hive query

2016-06-23 Thread Markovitz, Dudu
Any progress on this one? Dudu From: Aviral Agarwal [mailto:aviral12...@gmail.com] Sent: Wednesday, June 15, 2016 1:04 PM To: user@hive.apache.org Subject: Re: Optimized Hive query I'm OK with digging down to the AST Builder class. Can you guys point me to the right class? Meanwhile "explain (rew

RE: RegexSerDe with Filters

2016-06-23 Thread Markovitz, Dudu
My pleasure. Please feel free to reach me if needed. Dudu From: Arun Patel [mailto:arunp.bigd...@gmail.com] Sent: Wednesday, June 22, 2016 2:57 AM To: user@hive.apache.org Subject: Re: RegexSerDe with Filters Thank you very much, Dudu. This really helps. On Tue, Jun 21, 2016 at 7:48 PM, Markov

Re: Why does ORC use Deflater instead of native ZlibCompressor?

2016-06-23 Thread Aleksei Statkevich
Does anyone know? Aleksei Statkevich | Engineering Manager

Re: Optimize Hive Query

2016-06-23 Thread Jörn Franke
The query looks a little bit too complex for what it is supposed to do. Can you reformulate it and restrict the data in a where clause (highest restriction first)? Another hint would be to use the ORC format (with indexes and optionally bloom filters) with snappy compression, as well as sorting the
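As a rough sketch of that suggestion (the table and column names are illustrative, reusing keys mentioned elsewhere in the thread; the TBLPROPERTIES keys are standard ORC options), such a table might be declared as:

```sql
-- Hypothetical DDL: ORC with snappy compression, row-group indexes,
-- and a bloom filter on a frequently filtered high-cardinality column.
CREATE TABLE events_orc (
  m_d_key   INT,
  sb_gu_key BIGINT,
  payload   STRING
)
STORED AS ORC
TBLPROPERTIES (
  'orc.compress'             = 'SNAPPY',
  'orc.create.index'         = 'true',
  'orc.bloom.filter.columns' = 'sb_gu_key',
  'orc.bloom.filter.fpp'     = '0.05'
);
```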

Re: Optimize Hive Query

2016-06-23 Thread Mich Talebzadeh
Could you also run desc formatted tuning_dd_key and send the output please? Dr Mich Talebzadeh

Re: Optimize Hive Query

2016-06-23 Thread @Sanjiv Singh
Hi Gopal, I am using Tez as the execution engine. DAG: [truncated EXPLAIN output table]

RE: Optimize Hive Query

2016-06-23 Thread Markovitz, Dudu
Could you also add the results of the following query? Thanks, Dudu

select m_d_key, sb_gu_key, count(*) as cnt
from tuning_dd_key
group by m_d_key, sb_gu_key
order by cnt desc
limit 100;