Re: Want to improve the performance for execution of Hive Jobs.

2012-05-08 Thread Bejoy Ks
Hi Bhavesh, For the two properties you mentioned: mapred.map.tasks: the number of map tasks is determined from the input split and input format. mapred.reduce.tasks: your Hive job may not require a reduce task, hence Hive sets the number of reducers to zero. Other parameters, I'm not sure why it is not e
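Bejoy's point, that the mapper count follows from the input splits rather than from mapred.map.tasks, can be illustrated with a rough back-of-the-envelope sketch. All sizes below are made up for illustration; they are not from this thread.

```python
import math

# Hypothetical sizes for illustration only.
input_size_mb = 1280          # total input data size
hdfs_block_size_mb = 128      # dfs.block.size, expressed in MB
min_split_mb = 128            # mapred.min.split.size, expressed in MB

# With the common file-based input formats, one map task is launched
# per input split; the split size is derived from the block size and
# the min/max split settings, NOT from mapred.map.tasks (a hint only).
split_size_mb = max(min_split_mb, hdfs_block_size_mb)
num_map_tasks = math.ceil(input_size_mb / split_size_mb)
print(num_map_tasks)  # 1280 MB / 128 MB splits -> 10 map tasks
```

Setting mapred.map.tasks=10 on a job whose input yields a different split count will simply be ignored, which matches what Bhavesh observes later in the thread.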

Re: Data are not displayed correctly on hive tables

2012-05-08 Thread Mark Grover
Hi Roshan, You are right: '\n' in your XML content is going to give you problems. The table you created in Hive assumes one record = one '\n'-terminated row from your file. I would recommend sanitizing your data before you load it in, to get rid of the '\n's. Mark Mark Grover, Business Intelligence A
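A minimal sketch of the sanitizing step Mark suggests: strip embedded newlines from each logical record before loading, so that one record really is one '\n'-terminated line. The sample record below is invented; adapt the record-splitting logic to your actual file layout.

```python
def sanitize_record(record: str) -> str:
    """Replace embedded newlines with spaces inside one logical record,
    so Hive's default '\n' row delimiter sees exactly one row per record."""
    return record.replace("\n", " ")

# Invented XML fragment with embedded newlines, for illustration.
raw = "<order>\n  <id>1</id>\n</order>"
clean = sanitize_record(raw)
print(clean)
```

After this pass, every record occupies a single line and the Hive table defined over the file will see one row per record.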

Re: Data are not displayed correctly on hive tables

2012-05-08 Thread mperformer
Hi Mark, Thanks for the reply. In HDFS, the row looks like: *1~Order Conf Req~ 0 0 0 2006-06-01T17:59:09.413+10:00 0 Order Conf Prod Qty Amt Dlvry Date Price Prod Qty Amt Dlvry Date Price

Hive error when running a count query.

2012-05-08 Thread Kayla Schultz
The query I'm trying to run is: select count(*) from customers; The table exists and there is data in it. However, when I run this command I get the following: http://imgur.com/sOXXB and the error log shows: http://imgur.com/5EWrS Any idea what I'm doing wrong? Sorry about the logs being pictures,

Re: Suggestions on implementing connect by in hive

2012-05-08 Thread Edward Capriolo
Iterative and recursive problems are not well suited for MapReduce because tasks do not share state or coordinate with each other. Most of the syntax shown at http://psoug.org/reference/connectby.html does not look like a good fit for a Hadoop or Hive problem. On Tue, May 8, 2012 at 2:14 PM, Thulas

Suggestions on implementing connect by in hive

2012-05-08 Thread Thulasi Ram Naidu Peddineni
Hi, Is there a way to implement CONNECT BY (similar to the one in Oracle) in Hive? (A join works if you know the number of levels, but my use-case is such that I don't know the number of levels.) - Regards, Thulasi Ram P
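Since a plain self-join needs a known depth, one common workaround (not from this thread; sketched in Python rather than HiveQL, with invented data) is to drive the iteration from outside Hive: run one self-join per pass and stop when a pass adds no new rows.

```python
# Driver-side iteration that a CONNECT BY emulation needs when the
# hierarchy depth is unknown: repeatedly "join" the current frontier
# against the parent->child edge table until a fixpoint is reached.
# In practice each loop body would be one Hive self-join; here the
# edge data is an in-memory dict, invented for illustration.
edges = {"ceo": ["vp1", "vp2"], "vp1": ["eng1"], "vp2": [], "eng1": []}

def connect_by(root: str) -> list[str]:
    reachable, frontier = [], [root]
    while frontier:                         # one "self-join" per pass
        nxt = []
        for node in frontier:
            nxt.extend(edges.get(node, []))
        reachable.extend(nxt)
        frontier = nxt                      # empty pass -> fixpoint, stop
    return reachable

print(connect_by("ceo"))
```

The termination condition (an empty frontier) is what replaces Oracle's built-in knowledge of the hierarchy depth; in Hive this loop would live in a shell or workflow script checking the row count of each pass.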

Re: Storage requirements for intermediate (map-side-output) data during Hive joins

2012-05-08 Thread Mark Grover
Hi Safdar and Bejoy, I decided to give my 2 cents to this dialogue :-) All the points that Bejoy made are valid. However, I don't see why multiplication of data sizes is involved. You have two tables A and B, 500 GB each. They occupy 1000 GB in HDFS (that's not entirely true, since your dfs replicatio
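Mark's objection can be put as simple arithmetic, using the 500 GB figures from the thread: the data shuffled to the reducers in a common reduce-side join is roughly the sum of the two table sizes (plus serialization overhead), not their product.

```python
# Back-of-the-envelope for the shuffle volume of a reduce-side join.
# Table sizes are from the thread's example; the overhead factor is a
# made-up illustrative assumption.
size_a_gb = 500
size_b_gb = 500

shuffle_gb = size_a_gb + size_b_gb   # every row crosses the shuffle once
not_this = size_a_gb * size_b_gb     # the product is never materialized

print(shuffle_gb)  # prints 1000
```

Only the joined output can blow up multiplicatively, and then only per key with many matches on both sides, which is a data-skew issue rather than a baseline storage requirement.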

Re: Want to improve the performance for execution of Hive Jobs.

2012-05-08 Thread Bhavesh Shah
Thanks Bejoy for your reply. Yes, I saw that for every job a new XML is created, and in it I saw that whatever variables I set are different from that. For example, I have set mapred.map.tasks=10 and mapred.reduce.tasks=2, and in all the job XMLs it is showing a value of 1 for map and 0 for reduce. Same thing

Re: Want to improve the performance for execution of Hive Jobs.

2012-05-08 Thread Bejoy KS
Hi Bhavesh, On a job level, if you set/override some properties, they won't go into mapred-site.xml. Check your corresponding job.xml to get the values. Also confirm from the task logs that there are no warnings with respect to overriding those properties. If these two are good then you can confirm

Re: Want to improve the performance for execution of Hive Jobs.

2012-05-08 Thread Bhavesh Shah
Hello Bejoy KS, I did it in the same way, by executing "hive -f " on Amazon EMR, and when I observed mapred-site.xml, all the variables that I have set in the above file are set by default with their values; I didn't see my set values. And the performance is slow too. I have tried this on my local cluste

Re: Want to improve the performance for execution of Hive Jobs.

2012-05-08 Thread Bejoy Ks
Hi Bhavesh, I'm not sure of AWS, but from a quick reading, cluster-wide settings like the HDFS block size can be set in hdfs-site.xml through bootstrap actions. Since you are changing the HDFS block size, set the min and max split size across the cluster using bootstrap actions as well. The rest of the p

Re: Want to improve the performance for execution of Hive Jobs.

2012-05-08 Thread Bhavesh Shah
Thanks Bejoy KS for your reply. I want to ask one thing: if I want to set these parameters on Amazon Elastic MapReduce, then how can I set these variables? e.g.: SET mapred.min.split.size=m; SET mapred.max.split.size=m+n; set dfs.block.size=128 set mapred.compress.map.output=t
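One way to package per-job settings for a script run with "hive -f" is to prepend the SET statements to the script itself. The sketch below generates such a script; the property names are the ones discussed in the thread, but the values and the trailing query are placeholders.

```python
# Sketch: build a Hive script whose first lines are the per-job SET
# statements from the thread. Values are illustrative placeholders.
props = {
    "mapred.min.split.size": "134217728",     # 128 MB, placeholder
    "mapred.max.split.size": "268435456",     # 256 MB, placeholder
    "mapred.compress.map.output": "true",
}

set_lines = "\n".join(f"SET {k}={v};" for k, v in props.items())
script = set_lines + "\nSELECT COUNT(*) FROM customers;"  # placeholder query
print(script)
```

Properties set this way apply only to the jobs launched by that script; cluster-wide settings such as dfs.block.size still need to go through the cluster configuration (on EMR, bootstrap actions, as Bejoy notes in his reply).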

Re: Want to improve the performance for execution of Hive Jobs.

2012-05-08 Thread Bejoy Ks
Hi Bhavesh, In Sqoop you can optimize the performance by using --direct mode for import and by increasing the number of mappers used for the import. When you increase the number of mappers, you need to ensure that the RDBMS connection pool will handle that number of connections gracefully. Also us
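The two knobs Bejoy mentions can be sketched as a Sqoop command line. The connection string and table name below are placeholders, and the mapper count of 8 is an arbitrary example; check that your RDBMS can actually serve that many parallel connections before raising it.

```python
# Sketch assembling a Sqoop import command with the flags from the
# thread: --direct (native dump path, where the connector supports it)
# and -m (parallel import mappers). All identifiers are placeholders.
num_mappers = 8
cmd = [
    "sqoop", "import",
    "--connect", "jdbc:mysql://db.example.com/sales",  # placeholder
    "--table", "orders",                               # placeholder
    "--direct",              # use the database's native export tool
    "-m", str(num_mappers),  # one DB connection per mapper
]
print(" ".join(cmd))
```

Each mapper opens its own database connection, which is why the connection-pool caveat matters: -m 8 means eight concurrent reads against the source table.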

Re: Storage requirements for intermediate (map-side-output) data during Hive joins

2012-05-08 Thread Bejoy Ks
Hi Ali, Sorry, my short response may have got you confused. Let us assume you are doing a LeftOuterJoin on two tables 'A' and 'B' on a column 'id' (the tables are large, so that only reduce-side joins are possible); then, from my understanding, this is how it should happen (explanation based on
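The reduce-side LEFT OUTER JOIN mechanics Bejoy is describing can be modeled in a few lines: the map phase tags each record with its source table and keys it by 'id', the shuffle groups records by key, and the reducer pairs each A-row with its matching B-rows, emitting NULL for the B side when there is no match. The rows below are invented for illustration.

```python
# Toy model of a reduce-side LEFT OUTER JOIN of A and B on 'id'.
from collections import defaultdict

a_rows = [(1, "alice"), (2, "bob")]   # table A: (id, name); invented
b_rows = [(1, "2012-05-08")]          # table B: (id, last_seen); invented

# "Map" + "shuffle": tag each record with its source, group by join key.
grouped = defaultdict(lambda: {"A": [], "B": []})
for rid, val in a_rows:
    grouped[rid]["A"].append(val)
for rid, val in b_rows:
    grouped[rid]["B"].append(val)

# "Reduce": per key, cross A values with B values; the outer (left)
# side keeps unmatched rows by pairing them with None (NULL).
out = []
for rid in sorted(grouped):
    for a in grouped[rid]["A"]:
        for b in grouped[rid]["B"] or [None]:
            out.append((rid, a, b))
print(out)
```

Note that each input row crosses the shuffle exactly once, which ties back to the earlier point in this thread that the intermediate data for the join is on the order of the sum of the table sizes.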