Re: Storage requirements for intermediate (map-side-output) data during Hive joins

2012-05-07 Thread Ali Safdar Kureishy
Hi Bejoy, Thanks. I see... I was asking because I wanted to know how much total storage space I would need on the cluster for the given data in the tables. Are you saying that for 2 tables of 500 GB each (spread across the cluster), there would be a need for intermediate storage of 25 GB? Or

Re: Want to improve the performance for execution of Hive Jobs.

2012-05-07 Thread Bhavesh Shah
Thanks to both of you for your replies. If I decide to deploy my JAR on Amazon Elastic MapReduce, then: 1) The default block size is 64 MB, so in such a case I have to set it to 128 MB. Is that right? 2) Amazon EMR already has values for mapred.min.split.size and mapred.max.split.size, and mapper and r

Re: Want to improve the performance for execution of Hive Jobs.

2012-05-07 Thread Mapred Learn
Try setting this value to your block size. For a 128 MB block size: > set mapred.min.split.size=134217728 On May 7, 2012, at 10:11 PM, Bhavesh Shah wrote: > Thanks Nitin for your reply. > > In short my Task is > 1) Initially I want to import the data from MS SQL Server into HD
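A quick note on units, as a minimal sketch: mapred.min.split.size is given in bytes, so a block size stated in MB has to be converted first. The helper name `mb_to_bytes` is hypothetical and only there for illustration.

```python
# Hypothetical helper: convert a size in MB to bytes.
# mapred.min.split.size expects a byte count, not MB or KB.
def mb_to_bytes(megabytes: int) -> int:
    return megabytes * 1024 * 1024

# 128 MB block size, as discussed in this thread.
min_split_bytes = mb_to_bytes(128)

# Build the statement one would type at the Hive prompt.
hive_setting = f"set mapred.min.split.size={min_split_bytes};"
print(hive_setting)  # set mapred.min.split.size=134217728;
```

Whether matching the split size to the block size actually helps depends on the data and the cluster, so treat the value as a starting point rather than a recommendation.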

Re: Want to improve the performance for execution of Hive Jobs.

2012-05-07 Thread Nitin Pawar
I am no expert on Sqoop, so I may be wrong, but importing 30*0.5M records (table by table) is a huge operation. I would rather just dump and import using the Hive CLI (Sqoop is a good choice too, but I don't know the benchmarks). If you are doing so many joins, then it's good to be on a Hadoop cluster ins

Re: Want to improve the performance for execution of Hive Jobs.

2012-05-07 Thread Bhavesh Shah
Thanks Nitin for your reply. In short, my task is: 1) Initially, I want to import the data from MS SQL Server into HDFS using Sqoop. 2) Through Hive, I process the data and generate the result in one table. 3) That result table from Hive is then exported back to MS SQL Server. Actu

Re: Want to improve the performance for execution of Hive Jobs.

2012-05-07 Thread Nitin Pawar
1) Check the JobTracker URL to see how many maps/reducers have been launched. 2) If you have a large dataset and want it to execute fast, set mapred.min.split.size and mapred.max.split.size to an optimal value so that more mappers will be launched and will finish. 3) If you are doing joins, ther

Want to improve the performance for execution of Hive Jobs.

2012-05-07 Thread Bhavesh Shah
Hello all, I have written Hive JDBC code and created a JAR of it. I am running that JAR on a 10-node cluster. The problem is that even with the 10-node cluster, the performance is the same as on a single-node cluster. What can I do to improve the performance of Hive jobs? Is there any configuration setting

FAILED: Execution Error, return code -101 from org.apache.hadoop.hive.ql.exec.MapRedTask

2012-05-07 Thread Mark Grover
Hi all, I wanted to see if anyone has seen this error before: Query returned non-zero code: 9, cause: FAILED: Execution Error, return code -101 from org.apache.hadoop.hive.ql.exec.MapRedTask; nested exception is java.sql.SQLException: Query returned non-zero code: 9, cause: FAILED: Execution Err

Re: Exception while running simple hive query

2012-05-07 Thread kulkarni.swar...@gmail.com
Thanks Shashwat. That did work. However, I find it very weird that it is able to find all the other libs at their proper location on the local filesystem but searches for this particular one on HDFS. I'll try to dig deeper into the code to see if I can find a cause for this. On Mon

Re: Exception while running simple hive query

2012-05-07 Thread shashwat shriparv
Do one thing: create the same structure /Users/testuser/hive-0.9.0/lib/hive-builtins-0.9.0.jar on the Hadoop file system and then try; it will work. Shashwat Shriparv On Mon, May 7, 2012 at 11:57 PM, kulkarni.swar...@gmail.com < kulkarni.swar...@gmail.com> wrote: > Thanks for the reply. > > Assumi

Re: Exception while running simple hive query

2012-05-07 Thread kulkarni.swar...@gmail.com
Thanks for the reply. Assuming that you mean the permissions within HIVE_HOME, they all look OK to me. Is there anywhere else you want me to check? On Mon, May 7, 2012 at 11:16 AM, hadoop hive wrote: > check for the permission.. > > > On Mon, May 7, 2012 at 7:30 PM, kulkarni.swar...@gma

Re: Exception while running simple hive query

2012-05-07 Thread hadoop hive
Check the permissions. On Mon, May 7, 2012 at 7:30 PM, kulkarni.swar...@gmail.com < kulkarni.swar...@gmail.com> wrote: > I created a very simple hive table and then ran the following query that > should run a M/R job to return the results. > > hive> SELECT COUNT(*) FROM invites; > > But I am

Re: Data are not displayed correctly on hive tables

2012-05-07 Thread Mark Grover
Hi Roshan, The following snippet summarizes the delimiters for your Hive table: colelction.delim \u0002, field.delim \u0001, mapkey.delim \u0003, serialization.format \u0001. Your fields are
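To make the delimiter roles concrete, here is a minimal sketch of how one text row would break apart under the three delimiters listed above. The sample row and the helper names are made up for illustration; this is not Hive's own parsing code.

```python
# Delimiters as listed for the table above.
FIELD_DELIM = "\u0001"       # separates columns
COLLECTION_DELIM = "\u0002"  # separates items inside an array or map
MAPKEY_DELIM = "\u0003"      # separates a map key from its value

def split_row(line):
    """Split one stored line into its column values."""
    return line.split(FIELD_DELIM)

def split_collection(field):
    """Split an array-typed column into its items."""
    return field.split(COLLECTION_DELIM)

def split_map(field):
    """Split a map-typed column into a dict of key -> value."""
    return dict(item.split(MAPKEY_DELIM, 1) for item in field.split(COLLECTION_DELIM))

# Hypothetical row: an int column, an array column, and a map column.
row = "42\u0001a\u0002b\u0001k1\u0003v1\u0002k2\u0003v2"
fields = split_row(row)
print(fields[0])                    # 42
print(split_collection(fields[1]))  # ['a', 'b']
print(split_map(fields[2]))         # {'k1': 'v1', 'k2': 'v2'}
```

If the data file uses a different separator (a comma or tab, say), every row lands in the first column and the rest show up as NULL, which is a common reason for data not displaying correctly.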

Re: Storage requirements for intermediate (map-side-output) data during Hive joins

2012-05-07 Thread Bejoy Ks
Hi Safdar, A map-side join uses memory on the Hive client to build hash tables. Such joins don't go through the key-value juggling part, as there is no reduce phase involved for such jobs. Regards Bejoy KS From: Ali Safdar Kureishy To: user@hive.apache.org Sent: Monday

Re: Storage requirements for intermediate (map-side-output) data during Hive joins

2012-05-07 Thread Bejoy Ks
Hi Ali, The 500*500 gigs of data is actually processed by multiple tasks across multiple nodes. With default settings, each task processes 64 MB of data. So you don't need 25 GB of temp space on a node at all. A few gigs of free space is more than enough for any MR task. Regards
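As a rough back-of-the-envelope sketch of the point above: if each map task handles one 64 MB split, two 500 GB tables are spread over many small tasks rather than one large one. The function name `map_task_count` is hypothetical, and the estimate assumes uncompressed, splittable input and ignores replication and file boundaries.

```python
import math

def map_task_count(table_sizes_gb, split_mb=64):
    """Estimate the number of map tasks, assuming one task per split."""
    total_mb = sum(size_gb * 1024 for size_gb in table_sizes_gb)
    return math.ceil(total_mb / split_mb)

# Two tables of 500 GB each, default 64 MB per task.
tasks = map_task_count([500, 500])
print(tasks)  # 16000
```

Each of those tasks only needs local scratch space on the order of its own input, which is why a few GB free per node is usually plenty.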

Re: Storage requirements for intermediate (map-side-output) data during Hive joins

2012-05-07 Thread Ali Safdar Kureishy
Please ignore my question below. I made a mistake with my calculation. The map-side joins do not perform a cross-product of the data. They just emit the data using the join-key as the row key. Thanks, Safdar On Mon, May 7, 2012 at 12:31 AM, Ali Safdar Kureishy < safdar.kurei...@gmail.com> wrote:

Storage requirements for intermediate (map-side-output) data during Hive joins

2012-05-07 Thread Ali Safdar Kureishy
Hi, I'm setting up a Hadoop cluster and would like to understand how much disk space I should expect to need with joins. Let's assume that I have 2 tables, each of about 500 GB. Since the tables are large, these will all be reduce-side joins. As far as I know about such joins, the data generated