Re: Lookup objects in distributed cache

2013-04-04 Thread vivek thakre
Thanks, Jan, for your reply. This is helpful. Vivek On Thu, Apr 4, 2013 at 12:11 AM, Jan Dolinár wrote: > Hello Vivek, > > GenericUDTF has method initialize() which is only called once per task. So > if you read your files in this method and store the structures in memory > then the overhead is r

Re: Correct syntax for EXPLAIN DEPENDENCY

2013-04-04 Thread Sanjay Subramanian
Thanks. I did upgrade, but got stumped by this issue, so I reverted: https://issues.cloudera.org/browse/DISTRO-461 Regards sanjay On 4/4/13 7:37 PM, "Jarek Jarcec Cecho" wrote: >Hi Sanjay, >you can upgrade to CDH4.2.0 that contains Hive 0.10. > >Jarcec > >On Fri, Apr 05, 2013 at 01:48:39AM +, Sa

Re: Correct syntax for EXPLAIN DEPENDENCY

2013-04-04 Thread Jarek Jarcec Cecho
Hi Sanjay, you can upgrade to CDH4.2.0 that contains Hive 0.10. Jarcec On Fri, Apr 05, 2013 at 01:48:39AM +, Sanjay Subramanian wrote: > Ah its available only in 0.10.0 :-( > And I am still using 0.9.x from the CDH4.1.2 distribution > > > From: Sanjay Subramanian > mailto:sanjay.subraman..

Re: Correct syntax for EXPLAIN DEPENDENCY

2013-04-04 Thread Sanjay Subramanian
Ah, it's available only in 0.10.0 :-( And I am still using 0.9.x from the CDH4.1.2 distribution From: Sanjay Subramanian <sanjay.subraman...@wizecommerce.com> Reply-To: "user@hive.apache.org" <user@hive.apache.org> Date: Thursday, April 4, 2013 6:40 PM T

Correct syntax for EXPLAIN DEPENDENCY

2013-04-04 Thread Sanjay Subramanian
Hi, what's the correct syntax for EXPLAIN DEPENDENCY? Query == /usr/lib/hive/bin/hive -e "explain dependency select * from channel_market_lang where channelid > 29000" org.apache.hadoop.hive.ql.parse.ParseException: line 1:8 cannot recognize input near 'plan' 'dependency' 'select' in stateme

Re: Partition performance

2013-04-04 Thread Dean Wampler
Also, how big are the files in each directory? Are they roughly the size of one HDFS block, or a multiple of it? Lots of small files will mean lots of mapper tasks with little to do. You can also compare the job tracker console output for each job. I bet the slow one has a lot of very short map and reduc
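To illustrate the small-files point, here is a rough sketch that estimates how many map tasks a set of files would produce. It assumes one task per HDFS block and at least one task per file, with no split combining; the 128 MB block size, the function name, and the file sizes are all assumptions made for the example, not values from the thread.

```python
import math

MB = 1024 * 1024


def estimate_map_tasks(file_sizes_bytes, block_size_bytes=128 * MB):
    """Rough estimate: one map task per block, at least one per non-empty file.

    Assumes no split combining; the 128 MB default block size is an assumption.
    """
    return sum(
        max(1, math.ceil(size / block_size_bytes))
        for size in file_sizes_bytes
        if size > 0
    )


# The same 10,000 MB of data, laid out two ways (hypothetical sizes).
many_small_files = [1 * MB] * 10_000   # 10,000 files of 1 MB each
few_large_files = [125 * MB] * 80      # 80 files of 125 MB each

print(estimate_map_tasks(many_small_files))
print(estimate_map_tasks(few_large_files))
```

Under these assumptions the small-file layout needs far more tasks for the same volume of data, and each task has very little to read.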

Re: Partition performance

2013-04-04 Thread Owen O'Malley
See slide #9 from my Optimizing Hive Queries talk http://www.slideshare.net/oom65/optimize-hivequeriespptx . Certainly, we will improve it, but for now you are much better off with 1,000 partitions than 10,000. -- Owen On Thu, Apr 4, 2013 at 4:21 PM, Ramki Palle wrote: > Is it possible for you

Re: Partition performance

2013-04-04 Thread Ramki Palle
Is it possible for you to send the explain plan of these two queries? Regards, Ramki. On Thu, Apr 4, 2013 at 4:06 PM, Sanjay Subramanian < sanjay.subraman...@wizecommerce.com> wrote: > The slow down is most possibly due to large number of partitions. > I believe the Hive book authors tell us t

Re: Partition performance

2013-04-04 Thread Sanjay Subramanian
The slowdown is most likely due to the large number of partitions. I believe the Hive book authors tell us to be cautious with large numbers of partitions :-) and I abide by that. Users, please add your points of view and experiences. Thanks sanjay From: Ian <liu...@yahoo.com> Reply-To: "us

Partition performance

2013-04-04 Thread Ian
Hi, I created 3 years of hourly log files (26280 files in total) and query them through an external table with partitions. I tried two partitioning methods. 1) Log files are stored as /test1/2013/04/02/16/00_0 (a directory per hour). Use date and hour as partition keys. Add 3 years of directories to
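The file count quoted in this post can be checked with a few lines of arithmetic. The sketch below assumes 365-day years (which is what reproduces the 26280 figure); the day-level count is shown only as a hypothetical comparison and is not necessarily the poster's second method, which is cut off in this snippet.

```python
YEARS = 3
DAYS_PER_YEAR = 365   # assumption: leap days ignored
HOURS_PER_DAY = 24

# One directory (and one partition) per hour, as in method 1.
hourly_partitions = YEARS * DAYS_PER_YEAR * HOURS_PER_DAY

# Hypothetical comparison: one partition per day instead.
daily_partitions = YEARS * DAYS_PER_YEAR

print(hourly_partitions, daily_partitions)
```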

builtins submodule - is it still needed?

2013-04-04 Thread Travis Crawford
Hey hive gurus - Is the "builtins" hive submodule in use? The submodule was added in HIVE-2523 as a location for builtin-UDFs, but it appears to not have taken off. Any objections to removing it? DETAILS For HIVE-4278 I'm making some build changes for the HCatalog integration. The "builtins" sub

Re: Huge join performance issue

2013-04-04 Thread Nitin Pawar
You don't really need subqueries to join tables that have common columns; they are an additional overhead. The best way to filter your data and speed up your data processing is how you lay out your data. When you have a larger table, I would use partitioning and bucketing to trim down the data and improve the

Huge join performance issue

2013-04-04 Thread Gabi D
Hi all, I have two tables I need to join and then summarize. They are both huge (about 1B rows each, in the relevant partitions) and the query runs for over 2 hours, creating 5T of intermediate data. The current query looks like this: select t1.b,t1.c,t2.d,t2.e, count(*) from (select a,b,c from ta

Re: Lookup objects in distributed cache

2013-04-04 Thread Jan Dolinár
Hello Vivek, GenericUDTF has an initialize() method, which is called only once per task. So if you read your files in this method and store the structures in memory, then the overhead is relatively small (reading 15MB per mapper is negligible compared to several GB of processed data). Best regards, Ja
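The load-once pattern described in this reply can be sketched as follows. This is a plain Python stand-in, not the Hive API: the class name, the loader callback, and the sample data are hypothetical, and only the initialize/process split mirrors the methods on GenericUDTF.

```python
class LookupProcessor:
    """Illustrative stand-in for a UDTF: load lookup data once, reuse it per row."""

    def __init__(self, loader):
        self._loader = loader    # hypothetical callback that reads the cached files
        self._table = None
        self.load_count = 0      # tracks how often the expensive load happens

    def initialize(self):
        """Called once per task: read the lookup data into memory."""
        self._table = self._loader()
        self.load_count += 1

    def process(self, key):
        """Called once per row: only an in-memory dictionary lookup."""
        return self._table.get(key)


# Hypothetical lookup data standing in for files from the distributed cache.
processor = LookupProcessor(lambda: {"a": 1, "b": 2})
processor.initialize()
results = [processor.process(k) for k in ["a", "x", "b", "a"]]
print(results, processor.load_count)
```

The expensive read happens once, however many rows are processed afterwards, which is why the per-task overhead stays small.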