I am no expert on Sqoop, so I may be wrong, but importing 30 tables of
roughly 500,000 records each (table by table) is a huge operation. I would
prefer to dump the data and import it using the Hive CLI (Sqoop is a good
choice too, but I don't know the benchmarks).
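
For reference, a single-table Sqoop import into Hive might look roughly
like this (the hostname, database, table, and credentials are placeholders,
and the exact JDBC driver options depend on your Sqoop version):

    sqoop import \
      --connect "jdbc:sqlserver://dbhost:1433;databaseName=mydb" \
      --username myuser --password mypass \
      --table mytable \
      --hive-import --hive-table mytable \
      -m 8

The -m flag controls how many parallel map tasks perform the import.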

If you are doing so many joins, it is better to run on a Hadoop cluster
than on a single machine. A 10-node cluster should certainly improve your
query performance.

You may also want to look at the different kinds of joins available to you
(map joins, bucketed map joins, skew joins, etc.), because each one is
optimized for a different situation; see the sketch below.
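
A quick sketch of what those look like on the Hive CLI (the table names
here are made up; check each setting against your Hive version):

    -- let Hive convert suitable joins into map joins automatically
    set hive.auto.convert.join=true;

    -- or force a map join with a hint when one side is small
    SELECT /*+ MAPJOIN(small_t) */ big_t.id, small_t.name
    FROM big_t JOIN small_t ON (big_t.id = small_t.id);

    -- bucketed map join (both tables must be bucketed on the join key)
    set hive.optimize.bucketmapjoin=true;

    -- skew join, for join keys with a heavily skewed distribution
    set hive.optimize.skewjoin=true;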


The options I mentioned in my previous mail are part of the job
configuration submitted to Hadoop. On the Hive CLI we just set them on the
command line or through the .hiverc file, for example:
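
(The sizes below are only illustrative; tune them for your data and
cluster.)

    -- per session, on the Hive CLI:
    set mapred.min.split.size=67108864;    -- 64 MB
    set mapred.max.split.size=268435456;   -- 256 MB

The same "set ...;" lines can go into the .hiverc file (typically
$HIVE_HOME/bin/.hiverc or ~/.hiverc, depending on the Hive version) so
they are applied to every session.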



On Tue, May 8, 2012 at 10:41 AM, Bhavesh Shah <bhavesh25s...@gmail.com> wrote:

> Thanks Nitin for your reply.
>
> In short, my task is:
> 1) Import the data from MS SQL Server into HDFS using Sqoop.
> 2) Process the data through Hive and generate the result in one table.
> 3) Export that result table from Hive back to MS SQL Server.
>
> The data I am importing from MS SQL Server is very large (about 500,000
> entries per table, and I have 30 such tables). For this I have written a
> task in Hive which contains only queries (and each query uses a lot of
> joins). Because of this, the performance on my single local machine is
> very poor (it takes about 3 hours to execute completely). I have observed
> that a single query submitted to the Hive CLI took 10-11 jobs to execute
> completely.
>
> set mapred.min.split.size
> set mapred.max.split.size
> Should these values be set in a bootstrap action while submitting jobs to
> Amazon EMR? What values should I set? I don't know.
>
>
> --
> Regards,
> Bhavesh Shah
>
>
> On Tue, May 8, 2012 at 10:31 AM, Nitin Pawar <nitinpawar...@gmail.com> wrote:
>
>> 1) Check the JobTracker URL to see how many maps/reducers have been
>> launched.
>> 2) If you have a large dataset and want it to execute fast, set
>> mapred.min.split.size and mapred.max.split.size to optimal values so
>> that more mappers are launched and finish sooner.
>> 3) If you are doing joins, there are different approaches depending on
>> the kind and size of data you have.
>>
>> It would be helpful if you could let us know your data sizes and query
>> details.
>>
>>
>> On Tue, May 8, 2012 at 10:07 AM, Bhavesh Shah <bhavesh25s...@gmail.com> wrote:
>>
>>> Hello all,
>>> I have written Hive JDBC code and created a JAR from it. I am running
>>> that JAR on a 10-node cluster, but the performance is the same as on a
>>> single node.
>>>
>>> What can I do to improve the performance of Hive jobs? Is there any
>>> configuration setting to apply before submitting Hive jobs to the
>>> cluster? One more thing I want to know: how can I tell whether a job is
>>> running on all the nodes of the cluster?
>>>
>>> Please let me know if anyone knows about it.
>>>
>>> --
>>> Regards,
>>> Bhavesh Shah
>>>
>>>
>>
>>
>> --
>> Nitin Pawar
>>
>>
>
>


-- 
Nitin Pawar
