Hi Bhavesh
For the two properties you mentioned:
mapred.map.tasks
The number of map tasks is determined by the input splits and the InputFormat,
so the value you set acts only as a hint.
mapred.reduce.tasks
Your Hive job may not require a reduce task, hence Hive sets the number of
reducers to zero.
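As a quick illustration (the table name below is just an example), you can see
the difference from the Hive CLI: a query with no aggregation runs as a
map-only job, while an aggregation needs a reduce phase:

  -- Map-only job: no aggregation, so Hive launches 0 reducers regardless of
  -- what mapred.reduce.tasks is set to.
  SELECT * FROM customers WHERE id = 10;

  -- Needs a reduce phase: Hive picks a reducer count (or honours an explicit
  -- SET mapred.reduce.tasks=N).
  SELECT COUNT(*) FROM customers;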
Other parameters, I'm not sure why it is not e
Hi Roshan,
You are right: '\n' in your XML content is going to give you problems. The
table you created in Hive assumes one record per '\n'-terminated line in your
file. I would recommend sanitizing your data before you load it, to get rid of
the embedded '\n' characters.
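Something along these lines might work as a pre-load cleanup (an untested
sketch; it assumes every real record starts with a numeric id followed by '~',
as in your "1~Order Conf Req~..." sample, and the file names are made up):

  # Glue continuation lines back onto the record they belong to, replacing
  # the embedded newlines with spaces so Hive sees one record per line.
  awk '
    /^[0-9]+~/ { if (rec != "") print rec; rec = $0; next }
               { rec = rec " " $0 }
    END        { if (rec != "") print rec }
  ' raw_orders.xml > clean_orders.xml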
Mark
Mark Grover, Business Intelligence A
Hi Mark
Thanks for the reply.
In HDFS, the row looks like:
*1~Order Conf Req~
0
0
0
2006-06-01T17:59:09.413+10:00
0
Order Conf
Prod
Qty Amt
Dlvry Date
Price
Prod
Qty Amt
Dlvry Date
Price
The query I'm trying to run is: select count(*) from customers;
The table exists and there is data in it. However, when I run this command I
get the following: http://imgur.com/sOXXB and the error log shows:
http://imgur.com/5EWrS
Any idea what I'm doing wrong?
Sorry about the logs being pictures,
Iterative and recursive problems are not well suited to MapReduce, because
tasks do not share state or coordinate with each other. Most of the syntax
shown here http://psoug.org/reference/connectby.html does not look like a good
fit for a Hadoop or Hive problem.
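If the hierarchy is shallow enough, one workaround (only a sketch; the table
and column names are invented) is to materialize the paths one level per pass
and stop when a pass adds no rows, driving the loop from a script:

  -- Seed with the direct parent/child edges (level 1).
  CREATE TABLE paths AS
  SELECT child_id, parent_id, 1 AS level FROM edges;

  -- One pass: extend only the paths produced by the previous pass.
  -- Re-run with lvl=1, 2, 3, ... (e.g. hive -hiveconf lvl=N -f extend.hql)
  -- until the pass inserts zero rows.
  INSERT INTO TABLE paths
  SELECT e.child_id, p.parent_id, p.level + 1
  FROM edges e
  JOIN paths p ON (e.parent_id = p.child_id)
  WHERE p.level = ${hiveconf:lvl};

It is still a sequence of MapReduce joins under the hood, so it only makes
sense when the number of levels stays small.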
On Tue, May 8, 2012 at 2:14 PM, Thulas
Hi,
Is there a way to implement CONNECT BY (similar to the one in Oracle) in
Hive? (A join works if you know the number of levels, but my use-case is such
that I don't know the number of levels.)
-
Regards,
Thulasi Ram P
Hi Safdar and Bejoy,
I decided to add my two cents to this dialogue :-)
All points that Bejoy made are valid. However, I don't see why multiplication
of data sizes is involved.
You have two tables A and B, 500 GB each. They occupy 1000GB in HDFS (that's
not entirely true, since your dfs replicatio
Thanks Bejoy for your reply.
Yes, I saw that a new XML is created for every job. In it, I saw that the
values differ from whatever variables I set.
For example, I have set mapred.map.tasks=10 and mapred.reduce.tasks=2,
and in every job XML the value shown for maps is 1 and for reducers is 0.
Same thing
Hi Bhavesh
On a job level, if you set/override some properties, they won't go into
mapred-site.xml. Check the corresponding job.xml to get the values. Also
confirm from the task logs that there are no warnings with respect to
overriding those properties. If these two are good then you can confirm
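For instance, inside the Hive CLI you can echo what a property will actually
be for the next job (property names taken from your mail; also note that if
mapred-site.xml marks a property as final, job-level overrides are ignored and
the task logs usually carry a warning about it):

  SET mapred.map.tasks=10;
  SET mapred.reduce.tasks=2;

  -- SET with no value prints the current effective value, so you can verify
  -- the override took before submitting the query.
  SET mapred.reduce.tasks;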
Hello Bejoy KS,
I did it the same way, by executing "hive -f " on Amazon EMR,
and when I observed mapred-site.xml, all the variables that I have set in the
above file still have their default values; I didn't see my set values.
And the performance is slow too.
I have tried this on my local cluste
Hi Bhavesh
I'm not sure about AWS, but from a quick reading, cluster-wide settings like
the HDFS block size can be set in hdfs-site.xml through bootstrap actions.
Since you are changing the HDFS block size, set the min and max split sizes
across the cluster using bootstrap actions as well. The rest of the p
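If I remember the EMR tooling correctly, the configure-hadoop bootstrap action
is the usual way to do this at cluster-launch time; the script path, flags and
values below are from memory, so please verify them against the EMR docs for
your AMI version:

  # Sketch only: -h entries go to hdfs-site.xml, -m entries to mapred-site.xml.
  elastic-mapreduce --create --alive --name "hive-cluster" \
    --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
    --args "-h,dfs.block.size=134217728,-m,mapred.min.split.size=134217728,-m,mapred.max.split.size=268435456"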
Thanks Bejoy KS for your reply,
I want to ask one thing: if I want to set these parameters on Amazon
Elastic MapReduce, then how can I set these variables, like:
e.g. SET mapred.min.split.size=m;
SET mapred.max.split.size=m+n;
set dfs.block.size=128
set mapred.compress.map.output=t
Hi Bhavesh
In Sqoop you can optimize the performance by using --direct mode for the
import and by increasing the number of mappers used for the import. When you
increase the number of mappers, you need to ensure that the RDBMS connection
pool will handle that number of connections gracefully. Also us
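For example (the connection string, credentials and table name are
placeholders, and --direct needs the database's native client, e.g. mysqldump
for MySQL, installed on the cluster nodes):

  # Import one table with the direct connector and 8 parallel mappers.
  sqoop import \
    --connect jdbc:mysql://dbhost/sales \
    --username etl_user -P \
    --table orders \
    --split-by order_id \
    --direct \
    --num-mappers 8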
Hi Ali
Sorry, my short response may have got you confused. Let us assume you are
doing a left outer join of two tables 'A' and 'B' on a column 'id' (the tables
are large, so only reduce-side joins are possible); then, from my
understanding, this is how it should happen (explanation based on
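In Hive terms the query under discussion would be something like this (the
column list is only an example):

  -- With no map-join hint, large tables force a reduce-side (common) join
  -- keyed on id.
  SELECT a.id, a.val, b.val
  FROM A a
  LEFT OUTER JOIN B b ON (a.id = b.id);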