RE: When/how to use partitions and buckets usefully?

2012-04-23 Thread Ruben de Vries
Wow, thanks everyone for the nice feedback! I can force a map-side join by doing /*+ STREAMTABLE(members_map) */ right? Cheers, Ruben de Vries
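
For reference, the hint usually used to request a map-side join is MAPJOIN on the small table; STREAMTABLE instead tells Hive which table to stream through the join. A minimal sketch of the MAPJOIN form, using the tables from this thread (the gender column and aggregate are hypothetical):

  -- members_map is the small table loaded into memory on each mapper
  SELECT /*+ MAPJOIN(members_map) */ v.date, m.gender, COUNT(*)
  FROM visit_stats v
  JOIN members_map m ON (v.member_id = m.member_id)
  GROUP BY v.date, m.gender;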

Re: Doubts related to Amazon EMR

2012-04-23 Thread Bhavesh Shah
Thanks, all, for your answers. But I want to ask one more thing: 1) I have written a program (my task) which contains Hive JDBC code and code (Sqoop commands) for importing the tables and exporting too. If I create a JAR of my program and put it on EMR, do I need to do some extra t

Re: Doubts related to Amazon EMR

2012-04-23 Thread Kyle Mulka
Just wrote up an article on how to install Sqoop on Amazon EMR: http://blog.kylemulka.com/2012/04/how-to-install-sqoop-on-amazon-elastic-map-reduce-emr/ -- Kyle Mulka mu...@umich.edu 206 883 5352 http://www.kylemulka.com

Re: Lifecycle and Configuration of a hive UDF

2012-04-23 Thread Mark Grover
Added a tiny blurb here: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-UDFinternals Comments/suggestions welcome! Thanks for bringing it up, Justin. Mark Mark Grover, Business Intelligence Analyst OANDA Corporation www: oanda.com www: fxtrade.com e: mg

Re: removing hdfs table data directory does not throw error in hive

2012-04-23 Thread Sukhendu Chakraborty
Thanks Nitin. I am aware of what Hive is doing. The question is: is it okay not to return an error/warning when no data is found, since the metadata for the table also contains the data location when you create the table (which creates the HDFS directory as well)? So, if somebody erroneously removes t

Re: removing hdfs table data directory does not throw error in hive

2012-04-23 Thread Nitin Pawar
Hive table metadata is stored in a metastore, which retains the table structure and other meta info even if you delete the HDFS table directory, since it is kept in the metastore DB. When you do a select * from table; 1) Hive checks that the table exists in the metastore 2) if the table exists t
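
A minimal sketch of the behaviour being discussed, assuming the default warehouse path /user/hive/warehouse (adjust for your install); all commands run from the Hive CLI:

  CREATE TABLE demo_tbl (id INT);
  -- the schema and data location are now recorded in the metastore
  dfs -rmr /user/hive/warehouse/demo_tbl;
  -- the HDFS directory is gone, but the metastore entry remains, so the
  -- query below simply finds no data files and returns zero rows, no error
  SELECT * FROM demo_tbl;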

Re: Doubts related to Amazon EMR

2012-04-23 Thread Mark Grover
Hi Bhavesh, To answer your questions: 1) S3 terminology uses the word "object" and I am sure they have good reasons as to why but for us Hive'ers, an S3 object is the same as a file stored on S3. The complete path to the file would be what Amazon calls the S3 "key" and the corresponding value
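
As an illustration of the object/file correspondence, a Hive external table can point straight at an S3 path; a minimal sketch, with a hypothetical bucket, path and column list (the scheme may be s3:// or s3n:// depending on the setup):

  CREATE EXTERNAL TABLE s3_demo (
    id INT,
    payload STRING
  )
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
  LOCATION 's3://my-bucket/path/to/data/';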

Re: When/how to use partitions and buckets usefully?

2012-04-23 Thread Mark Grover
Hi Ruben, Like Bejoy pointed out, members_map is small enough to fit in memory, so your joins with visit_stats would be much faster with a map-side join. However, there is still some virtue in bucketing visit_stats. Bucketing can optimize joins, group bys and potentially other queries in certain
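
A minimal sketch of what a bucketed visit_stats definition could look like; the column list and the bucket count of 64 are illustrative choices, not from the thread:

  CREATE TABLE visit_stats (
    member_id INT,
    visit_date STRING,
    visits INT
  )
  CLUSTERED BY (member_id) INTO 64 BUCKETS;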

Re: When/how to use partitions and buckets usefully?

2012-04-23 Thread Bejoy KS
If the data is in HDFS, then you can bucket it only by loading it into a temp/staging table first and then into the final bucketed table. Bucketing needs a MapReduce job. Regards Bejoy KS Sent from handheld, please excuse typos.
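
A sketch of the staging pattern Bejoy describes; the input path, column list and bucket count are hypothetical:

  -- plain staging table that simply wraps the files already in HDFS
  CREATE TABLE visit_stats_staging (member_id INT, visit_date STRING, visits INT);
  LOAD DATA INPATH '/path/to/raw/files' INTO TABLE visit_stats_staging;

  -- bucketed target table; the INSERT below runs as a MapReduce job
  CREATE TABLE visit_stats_bucketed (member_id INT, visit_date STRING, visits INT)
    CLUSTERED BY (member_id) INTO 64 BUCKETS;

  SET hive.enforce.bucketing = true;
  INSERT OVERWRITE TABLE visit_stats_bucketed
  SELECT member_id, visit_date, visits FROM visit_stats_staging;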

RE: When/how to use partitions and buckets usefully?

2012-04-23 Thread Ruben de Vries
Thanks for the help so far, guys. I bucketed the members_map; it's 330 MB in size (11 million records). Can you manually bucket stuff? Since my initial MapReduce job is still outside of Hive, I'm doing a LOAD DATA to import stuff into the visit_stats tables, replacing that with INSERT OVERWRITE SELECT

Re: When/how to use partitions and buckets usefully?

2012-04-23 Thread Edward Capriolo
There are many good reasons to use bucketing. I have these rules for when to dump partitioning in favor of bucketing: 1) too many partitions a day (500+ partitions a day, number-of-files issues) 2) an unpredictable number of partitions per day. You can weigh these factors with other benefits buckets

Re: When/how to use partitions and buckets usefully?

2012-04-23 Thread Bejoy KS
For a bucketed map join, both tables should be bucketed and the number of buckets of one should be a multiple of the other. Regards Bejoy KS Sent from handheld, please excuse typos.
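
A minimal sketch of that setup; the bucket counts and the name column are hypothetical, chosen only to show the multiple-of rule:

  -- assuming visit_stats and members_map are both CLUSTERED BY (member_id),
  -- e.g. visit_stats into 64 buckets and members_map into 32 (64 is a multiple of 32)
  SET hive.optimize.bucketmapjoin = true;
  SELECT /*+ MAPJOIN(members_map) */ v.member_id, m.name, v.visits
  FROM visit_stats v JOIN members_map m ON (v.member_id = m.member_id);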

Re: When/how to use partitions and buckets usefully?

2012-04-23 Thread Bejoy KS
A bucketed map join would be good, I guess. What is the total size of the smaller table and what is its expected size in the next few years? If it is small enough to be put in the Distributed Cache, then map-side joins would offer you a big performance improvement. Regards Bejoy KS Sent from

RE: When/how to use partitions and buckets usefully?

2012-04-23 Thread Ruben de Vries
OK, very clear on the partitions: try to make them match the WHERE clauses, not so much the GROUP BY clauses then ;) The member_map contains 11,636,619 records atm. I think bucketing those would be good? What's a good number to bucket them by then? And is there any point in bucketing the visit_s

RE: When/how to use partitions and buckets usefully?

2012-04-23 Thread Tucker, Matt
If you're only interested in a certain window of dates for analysis, a date-based partition scheme will be helpful, as it will trim partitions that aren't needed by the query before execution. If the member_map table is small, you might consider testing the feasibility of map-side joins, as it
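
A sketch of the partition pruning Matt describes; the partition column name and date range are hypothetical:

  -- assuming visit_stats was created with PARTITIONED BY (stat_date STRING):
  SELECT SUM(visits)
  FROM visit_stats
  WHERE stat_date BETWEEN '2012-04-01' AND '2012-04-23';
  -- only the partitions in that date range are read; the rest are pruned before execution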

Re: When/how to use partitions and buckets usefully?

2012-04-23 Thread Bejoy KS
Partitions are good when you want to run your queries on a subset of the whole data, so the partition column depends on your queries. But a good point to take care of is that every partition should have enough data. Partitioning comes into effect when you use filters in the WHERE clause. Buckets are good for sa
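
On the sampling point, a bucketed table can be sampled bucket by bucket; a minimal sketch, assuming visit_stats is clustered by member_id into 64 buckets:

  -- reads only one bucket's worth of data instead of the whole table
  SELECT * FROM visit_stats TABLESAMPLE(BUCKET 1 OUT OF 64 ON member_id) s;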

When/how to use partitions and buckets usefully?

2012-04-23 Thread Ruben de Vries
It seems there's enough information to be found on how to set up and use partitions and buckets, but I'm more interested in how to figure out when and on what columns you should be partitioning and bucketing to increase performance?! In my case I've got 2 tables: 1) visit_stats (member_id, date and some

Re: Doubts related to Amazon EMR

2012-04-23 Thread Kyle Mulka
It is possible to install Sqoop on AWS EMR. I've got some scripts I can publish later. You are not required to use S3 to store files and can use the local (temporary) HDFS instead. After you have Sqoop installed, you can import your data with it into HDFS, run your calculations in HDFS, then exp

RE: [Marketing Mail] Doubts related to Amazon EMR

2012-04-23 Thread Ladda, Anand
Once you have a Hive job flow running on Amazon EMR, you'll have access to the file system on the underlying EC2 machines (you'll get the machine name, etc. once the cluster is running). You can then move your data files onto the EC2 machines' file system and load them into HDFS/Hive. I am not sure abo
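
A minimal sketch of that last step, loading a file from the EC2 node's local file system into a Hive table (the path and table name are hypothetical):

  LOAD DATA LOCAL INPATH '/home/hadoop/visit_stats.tsv' INTO TABLE visit_stats;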

Possible to use regex column specification with WHERE clause?

2012-04-23 Thread Ryabin, Thomas
Hi, I know that it is possible to use regex column specification with the SELECT clause like so: SELECT `employee.*` FROM employees; I was wondering if it is possible to use it with the WHERE clause also. For example I want to create the query: SELECT `employee.*` FROM employees WHERE $1
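
For reference, the backtick-regex form is documented for the SELECT list; as far as I know, the WHERE clause still has to name concrete columns, e.g. (salary is a hypothetical column):

  -- regex expansion in the SELECT list, a plain column reference in the WHERE clause
  SELECT `employee.*` FROM employees WHERE salary > 50000;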

subquery + lateral view fails without count

2012-04-23 Thread Ruben de Vries
It's a bit of a weird case but I thought I might share it and hopefully find someone who can confirm this to be a bug or tell me I should do things differently! Here you can find a pastie with the full create and select queries: http://pastie.org/3838924 I've got two tables: `visit_stats` with
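
The pastie itself isn't reproduced in this digest; as a rough illustration only of the pattern under discussion (all names hypothetical, not necessarily the failing query), a lateral view wrapped in a subquery looks something like:

  SELECT t.member_id, t.stat_name, t.stat_value
  FROM (
    SELECT member_id, stat_name, stat_value
    FROM visit_stats
    -- explode a map column into key/value rows
    LATERAL VIEW explode(stats_map) x AS stat_name, stat_value
  ) t;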

Doubts related to Amazon EMR

2012-04-23 Thread Bhavesh Shah
Hello all, I want to deploy my task on Amazon EMR, but as I am new to Amazon Web Services I am having trouble understanding the concepts. My use case: I want to import a large amount of data from EC2 into Hive through Sqoop. The imported data will get processed in Hive by applying some algorithm and

Re: Lifecycle and Configuration of a hive UDF

2012-04-23 Thread Justin Coffey
Hello all, Thank you very much for the responses. I can confirm that the lag function implementation works in my case: create temporary function lag as 'com.example.hive.udf.Lag'; select session_id,hit_datetime_gmt,lag(hit_datetime_gmt,session_id) from (select session_id,hit_datetime_gmt from omni2