Re: Maintaining big and complex Hive queries

2016-12-21 Thread Saumitra Shahapure
red here, although there is no definitive guidance >> as far as I know: >> >> https://cwiki.apache.org/confluence/display/Hive/Unit+Testing+Hive+SQL#UnitTestingHiveSQL-Modularisation >> >> On 15 December 2016 at 17:08, Saumitra Shahapure < >> s

Maintaining big and complex Hive queries

2016-12-15 Thread Saumitra Shahapure
Hello, We are running and maintaining a quite big and complex Hive SELECT query right now. It's basically a single SELECT query which performs a JOIN of about ten other SELECT query outputs. The simplest way to refactor that we can think of is to break this query down into multiple views and then join

java.lang.ArrayIndexOutOfBoundsException in getSplitHosts

2016-04-25 Thread Saumitra Shahapure
Hello, I am using Hive 0.13.1 in EMR and trying to create a Hive table on top of our custom file system (which is a thin wrapper on top of S3), and I am getting an error while accessing the data in the table. Stack trace and command history below. I suspected that CombineFileInputFormat is tryi

Re: Spark performance for small queries

2015-01-23 Thread Saumitra Shahapure (Vizury)
would generate quite similar execution plans for this query, what exactly is making the difference? My question is from the point of view of understanding both systems. Answering your questions inline, -- Regards, Saumitra Shahapure On Fri, Jan 23, 2015 at 5:01 AM, Gopal V wrote: > On 1/22/15, 3:03

Re: Spark performance for small queries

2015-01-22 Thread Saumitra Shahapure (Vizury)
Hello, We were comparing the performance of some of our production Hive queries between Hive and Spark. We compared Hive (0.13) + Hadoop (1.2.1) against both Spark 0.9 and 1.1. We could see that the performance gains have been good in Spark. We tried a very simple query, select count(*) from T where col

Simplest way to create partition hierarchies

2014-10-05 Thread Saumitra Shahapure (Vizury)
ce job to create data hierarchies. In our case, the hierarchy is already created. -- Regards, Saumitra Shahapure

Re: HDFS file system size issue

2014-04-15 Thread Saumitra Shahapure
Hi Rahman, These are a few lines from hadoop fsck / -blocks -files -locations /mnt/hadoop/hive/warehouse/user.db/table1/000255_0 44323326 bytes, 1 block(s): OK 0. blk_-7919979022650423857_446500 len=44323326 repl=3 [ip1:50010, ip2:50010, ip3:50010] /mnt/hadoop/hive/warehouse/user.db/table1/000256

Re: Handling hierarchical data in Hive

2014-03-25 Thread Saumitra Shahapure (Vizury)
creating a partition on the dt field and creating a Hive index/view on the *generated_by* field. If anyone has insights around these, they would be really helpful. Meanwhile we will try to solve our problem with buckets/indices. -- Regards, Saumitra Shahapure On Tue, Mar 25, 2014 at 7:44 PM, Prasan Samtani

Re: Handling hierarchical data in Hive

2014-03-25 Thread Saumitra Shahapure (Vizury)
over-partitioning our table. Over-partitioning gives us the benefit that a query on 1-2 partitions is very fast. Its side effect is that if we try to query a large number of partitions, the query is too slow. Is there a way to get good performance in both scenarios? -- Regards, Saumitra Shahapure

Handling hierarchical data in Hive

2014-03-25 Thread Saumitra Shahapure (Vizury)
1000s of partitions to Hive. So analysis queries over one month are slowed down. Is there any way to get rid of partitions, and at the same time maintain good performance of queries which are fired on a specific day and *generated_by*? -- Regards, Saumitra Shahapure