Using CombineFileInputFormat might help, but it still creates overhead when you hold many small files in HDFS.
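For reference, turning on that combining behavior in Hive looks roughly like this (property names are from the Hadoop 0.20 / Hive 0.7 era; treat this as a best-effort sketch, not something verified against this exact setup):

```sql
-- Ask Hive to pack many small files into fewer input splits
-- (CombineHiveInputFormat wraps Hadoop's CombineFileInputFormat):
SET hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
-- Cap the combined split size so each split stays around one HDFS block:
SET mapred.max.split.size=67108864;  -- 64 MB
```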

I don't know the details of your requirements, but option 2 seems better; make sure that X is at least the size of a few HDFS blocks.

You could also merge files incrementally: first every hour, then merge those results again after 12 hours, and so on.
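A minimal local sketch of that incremental scheme; the directory names, timestamp, and demo data are all made up for illustration, and the real merges would of course target HDFS paths rather than local ones:

```shell
#!/bin/sh
set -e
# Demo data: three small "hourly" CSVs standing in for files as they arrive.
mkdir -p incoming hourly
printf 'a,1\n' > incoming/h01.csv
printf 'b,2\n' > incoming/h02.csv
printf 'c,3\n' > incoming/h03.csv

# Stage 1 (run hourly from cron): fold small files into one per-hour file.
stamp=2011120708   # would normally be $(date +%Y%m%d%H)
for f in incoming/*.csv; do
  cat "$f" >> "hourly/$stamp.csv"
  rm "$f"
done

# Stage 2 (run every 12h): fold the hourly files into one file,
# which is then a good candidate for a single LOAD DATA.
cat hourly/*.csv > merged_12h.csv
rm -f hourly/*.csv
wc -l merged_12h.csv
```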

You can use the -getmerge option, or this class (I have not used it myself):
http://hadoop.apache.org/hdfs/docs/r0.21.0/api/org/apache/hadoop/hdfs/tools/HDFSConcat.html


On 08.12.2011 09:03, Aniket Mokashi wrote:
You can also take a look at--
https://issues.apache.org/jira/browse/HIVE-74

On Wed, Dec 7, 2011 at 9:05 PM, Savant, Keshav <keshav.c.sav...@fisglobal.com> wrote:

You are right, Wojciech Langiewicz; we did the same thing and posted our
results yesterday. Now we are planning to do this with a shell script,
because our environment is dynamic and files keep arriving. We will
schedule the shell script with a cron job.

A query on this: we are planning to merge files based on one of the
following approaches.
1. Based on file count: if the file count reaches X files, then merge
and insert into HDFS.
2. Based on merged file size: if the merged file size crosses X bytes,
then insert into HDFS.

I think option 2 is better, because that way all the merged files will
be almost the same size. What do you suggest?
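A minimal local sketch of the size-based approach (option 2); the threshold, directory names, and demo data are made up so the behavior is observable — in production X would be a few HDFS block sizes (e.g. a multiple of 64 MB):

```shell
#!/bin/sh
set -e
# Hypothetical threshold: 100 bytes here just for the demo;
# in production this would be a few HDFS blocks (e.g. 128 or 256 MB).
THRESHOLD=100
mkdir -p spool out
# Demo data standing in for incoming CSVs (each line is 26 bytes).
for i in 1 2 3 4 5 6; do
  printf 'row%s,xxxxxxxxxxxxxxxxxxxx\n' "$i" > "spool/f$i.csv"
done

batch=1
size=0
: > "out/merged_$batch.csv"
for f in spool/*.csv; do
  cat "$f" >> "out/merged_$batch.csv"
  rm "$f"
  size=$(wc -c < "out/merged_$batch.csv")
  if [ "$size" -ge "$THRESHOLD" ]; then
    # Merged file crossed X bytes: this is where it would be loaded,
    # e.g. hive -e "load data local inpath 'out/merged_$batch.csv' ..."
    batch=$((batch + 1))
    : > "out/merged_$batch.csv"
    size=0
  fi
done
ls out
```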

Kind Regards,
Keshav C Savant


-----Original Message-----
From: Wojciech Langiewicz [mailto:wlangiew...@gmail.com]
Sent: Wednesday, December 07, 2011 8:15 PM
To: user@hive.apache.org
Subject: Re: Hive query taking too much time

Hi,
In this case it's much easier and faster to merge all files using this
command:

cat *.csv > output.csv
hive -e "load data local inpath 'output.csv' into table $table"

On 07.12.2011 07:00, Vikas Srivastava wrote:
hey, if all the files have the same columns then you can easily
merge them with a shell script:

table=yourtable
for file in *.csv
do
  cat "$file" >> new_file.csv
done
hive -e "load data local inpath 'new_file.csv' into table $table"

it will merge all the files into a single file, which you can then
upload with the same query

On Tue, Dec 6, 2011 at 8:16 PM, Mohit Gupta <success.mohit.gu...@gmail.com> wrote:

Hi Paul,
I am having the same problem. Do you know any efficient way of
merging the files?

-Mohit


On Tue, Dec 6, 2011 at 8:14 PM, Paul Mackles <pmack...@adobe.com> wrote:

How much time is it spending in the map/reduce phases, respectively?

The large number of files could be creating a lot of mappers, which
adds a lot of overhead. What happens if you merge the 2624 files into
a smaller number, like 24 or 48? That should speed up the map phase
significantly.


*From:* Savant, Keshav [mailto:keshav.c.sav...@fisglobal.com]
*Sent:* Tuesday, December 06, 2011 6:01 AM
*To:* user@hive.apache.org
*Subject:* Hive query taking too much time

Hi All,

My setup is:

hadoop-0.20.203.0
hive-0.7.1

I have a 5-node cluster in total: 4 datanodes and 1 namenode (which
also acts as the secondary namenode). On the namenode I have set up
Hive with HiveDerbyServerMode to support multiple Hive server
connections.

I have inserted plain-text CSV files into HDFS using 'LOAD DATA' Hive
query statements. The total number of files is 2624 and their combined
size is only 713 MB, which is very small from a Hadoop perspective,
since Hadoop handles TBs of data easily.

The problem is that when I run a simple count query (i.e. *select
count(*) from a_table*), it takes far too long to execute.

For instance, it takes almost 17 minutes to execute that query when
the table has 950,000 rows; I understand that is too long for such a
small amount of data.

This is only a dev environment; in production the number of files and
their combined size will grow into the millions and into GBs,
respectively.

On analyzing the logs on all the datanodes and on the
namenode/secondary namenode, I do not find any errors.

I have tried setting mapred.reduce.tasks to a fixed number, but the
number of reducers always remains 1, while the number of maps is
determined by Hive alone.

Any suggestions on what I am doing wrong, or on how I can improve the
performance of Hive queries? Any suggestion or pointer is highly
appreciated.

Keshav

_____________
The information contained in this message is proprietary and/or
confidential. If you are not the intended recipient, please: (i)
delete the message and all copies; (ii) do not disclose, distribute
or use the message in any manner; and (iii) notify the sender
immediately. In addition, please be aware that any message addressed
to our domain is subject to archiving and review by persons other
than the intended recipient. Thank you.




--
Best Regards,

Mohit Gupta
Software Engineer at Vdopia Inc.










