Hi Paul,
I am having the same problem. Do you know of an efficient way to merge the
files?

-Mohit

On Tue, Dec 6, 2011 at 8:14 PM, Paul Mackles <pmack...@adobe.com> wrote:

> How much time is it spending in the map and reduce phases, respectively? The
> large number of files could be creating a lot of mappers, which creates a lot
> of overhead. What happens if you merge the 2624 files into a smaller number,
> like 24 or 48? That should speed up the mapper phase significantly.
>
> *From:* Savant, Keshav [mailto:keshav.c.sav...@fisglobal.com]
> *Sent:* Tuesday, December 06, 2011 6:01 AM
> *To:* user@hive.apache.org
> *Subject:* Hive query taking too much time
>
> Hi All,
>
> My setup is:
>
> hadoop-0.20.203.0
>
> hive-0.7.1
>
> I have a 5-node cluster: 4 data nodes and 1 namenode (which also acts as
> the secondary namenode). On the namenode I have set up Hive in
> HiveDerbyServerMode to support multiple Hive server connections.
>
> I have loaded plain-text CSV files into HDFS using Hive ‘LOAD DATA’
> statements. The total number of files is 2624 and their combined size is only
> 713 MB, which is very small from the perspective of Hadoop, which can handle
> TBs of data easily.
>
> The problem is that when I run a simple count query (i.e. *select count(*)
> from a_table*), it takes too long to execute.
>
> For instance, it takes almost 17 minutes to execute this query when the
> table has 950,000 rows. I understand that this is far too long for a query
> over such a small amount of data.
>
> This is only a dev environment; in production the number of files and
> their combined size will run into the millions and GBs respectively.
>
> On analyzing the logs on all the datanodes and the namenode/secondary
> namenode, I do not find any errors in them.
>
> I have also tried setting mapred.reduce.tasks to a fixed number, but the
> number of reducers always remains 1, while the number of mappers is
> determined by Hive alone.
>
> Any suggestions as to what I am doing wrong, or how I can improve the
> performance of Hive queries? Any suggestion or pointer is highly appreciated.
>
> Keshav
>
> _____________
> The information contained in this message is proprietary and/or
> confidential. If you are not the intended recipient, please: (i) delete the
> message and all copies; (ii) do not disclose, distribute or use the message
> in any manner; and (iii) notify the sender immediately. In addition, please
> be aware that any message addressed to our domain is subject to archiving
> and review by persons other than the intended recipient. Thank you.
>
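
One way the merge suggested above could be done before loading is sketched
below. This is only a minimal sketch under stated assumptions: the CSV files
are still available on a local disk, none of them has a header row, and the
function name merge_csv_files and the merged_NNN.csv output names are
hypothetical, not anything from Hive or Hadoop. It spreads the source files
round-robin over a fixed number of output files and concatenates them.

```python
from pathlib import Path


def merge_csv_files(src_dir, dst_dir, num_outputs=24):
    """Concatenate many small CSV files into at most num_outputs larger files.

    Assumes header-less CSV files and that dst_dir is a different directory
    from src_dir, so merged output is never picked up as input.
    """
    sources = sorted(Path(src_dir).glob("*.csv"))
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)

    outputs = []
    # Never create more output files than there are inputs.
    for i in range(min(num_outputs, len(sources))):
        out_path = dst / f"merged_{i:03d}.csv"
        with open(out_path, "wb") as out:
            # Round-robin: source file j goes to output j % num_outputs.
            for src in sources[i::num_outputs]:
                data = src.read_bytes()
                out.write(data)
                # Keep rows from adjacent files on separate lines.
                if data and not data.endswith(b"\n"):
                    out.write(b"\n")
        outputs.append(out_path)
    return outputs
```

For the numbers in this thread, 713 MB over 2624 files is about 278 KB per
file on average, so reading each file whole is cheap; merged into 24 outputs,
each would be roughly 30 MB. The merged files could then be loaded with
‘LOAD DATA’ in the same way as the originals.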



-- 
Best Regards,

Mohit Gupta
Software Engineer at Vdopia Inc.
