Re: Does block size determine the number of map tasks

2011-06-01 Thread Junxian Yan
Thanks. So that means Hadoop will treat CombineHiveInputFormat input as one block if the split parameters are not set, is that right? On Wed, Jun 1, 2011 at 6:44 PM, Steven Wong wrote: > When using CombineHiveInputFormat, parameters such as mapred.max.split.size > (and others) help determine how the input is split across mappers

RE: Does block size determine the number of map tasks

2011-06-01 Thread Steven Wong
When using CombineHiveInputFormat, parameters such as mapred.max.split.size (and others) help determine how the input is split across mappers. Other factors include whether your input files' format is a splittable format or not. Hope this helps. From: Junxian Yan [mailto:junxian@gmail.com]
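As a concrete illustration of the settings Steven mentions (a sketch, not from the thread itself; the byte values are placeholders you would tune for your cluster), these session settings control how CombineHiveInputFormat packs input blocks into splits:

```sql
-- Switch the job to the combining input format.
SET hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;

-- Upper bound on the bytes packed into one split (placeholder: 256 MB).
SET mapred.max.split.size=268435456;

-- Per-node and per-rack lower bounds also influence how blocks are combined.
SET mapred.min.split.size.per.node=134217728;
SET mapred.min.split.size.per.rack=134217728;
```

Roughly, more small files get combined until a split reaches mapred.max.split.size, so lowering that value yields more map tasks and raising it yields fewer.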

Hive logging concurrency

2011-06-01 Thread Steven Wong
By default, all Hive clients log to the same file called hive.log via DRFA. What I'm seeing is that many log lines are "lost" after hive.log is rolled over to hive.log.yyyy-MM-dd. Is this an issue with DRFA? What do folks do to avoid this problem when using concurrent Hive clients? Thanks. Steven
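One common workaround (a sketch, not an answer from this thread; the `-hiveconf` trick of passing a per-process log file name is an assumption on my part) is to give each client its own log file so concurrent processes never contend for the same DRFA file:

```properties
# hive-log4j.properties fragment: one log file per client instead of a shared hive.log.
# Launch each client with a unique name, e.g.:  hive -hiveconf hive.log.file=hive-$$.log
hive.log.dir=/tmp/${user.name}
hive.log.file=hive.log
log4j.appender.DRFA=org.apache.log4j.DailyRollingFileAppender
log4j.appender.DRFA.File=${hive.log.dir}/${hive.log.file}
log4j.appender.DRFA.DatePattern='.'yyyy-MM-dd
```

With distinct file names, each process rolls its own file over and no lines are dropped by a concurrent rename.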

Re: question about number of map tasks for small file

2011-06-01 Thread Edward Capriolo
On Wed, Jun 1, 2011 at 1:12 PM, Igor Tatarinov wrote: > Can you pre-aggregate your historical data to reduce the number of files? > > We used to partition our data by date but that created too many output > files so now we partition by month. > > I do find it odd that Hive (0.6) can't merge compressed output files.

Re: question about number of map tasks for small file

2011-06-01 Thread Igor Tatarinov
Can you pre-aggregate your historical data to reduce the number of files? We used to partition our data by date, but that created too many output files, so now we partition by month. I do find it odd that Hive (0.6) can't merge compressed output files. We could have gotten away with daily partition
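For reference (a sketch, not from the thread), the small-file merge behavior Igor is referring to is controlled by settings like these; the sizes shown are the usual defaults, not values from the discussion:

```sql
-- Merge small output files of map-only jobs.
SET hive.merge.mapfiles=true;
-- Also merge the outputs of full map-reduce jobs.
SET hive.merge.mapredfiles=true;
-- Target size for merged files, and the average-size threshold that triggers a merge.
SET hive.merge.size.per.task=256000000;
SET hive.merge.smallfiles.avgsize=16000000;
```

When a job's average output file size falls below hive.merge.smallfiles.avgsize, Hive launches an extra job to concatenate the files up to hive.merge.size.per.task each.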

Does block size determine the number of map tasks

2011-06-01 Thread Junxian Yan
I saw this in the Hadoop wiki: http://wiki.apache.org/hadoop/HowManyMapsAndReduces But in my experiment I see a different result. When I set CombineHiveInputFormat in Hive, and per the doc the default block size should be 64M, but my input files are more than 64M, Hadoop still created one map task to

Re: question about number of map tasks for small file

2011-06-01 Thread Junxian Yan
Today I tried CombineHiveInputFormat and set the max split size for the Hadoop input. It seems I can get the expected number of map tasks. But another problem is that CPU usage by the map tasks is very high, almost 100%. I just ran a query with a simple WHERE condition over testing files, whose total size is about

Re: Hive basic questions

2011-06-01 Thread jinhang du
As far as I know: 1. An external table does not need to copy data from HDFS into your warehouse when loading data. 2. "Location" points to the data in HDFS and links that data to the table; when you drop the table, the data is not deleted. 3. The tables' metadata is stored in your metastore, i.e. Derby, MySQL
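A minimal illustration of points 1 and 2 above (a sketch; the table name, schema, and HDFS path are hypothetical, not from the thread):

```sql
-- Point the table at data that already lives in HDFS; nothing is copied
-- into the Hive warehouse directory.
CREATE EXTERNAL TABLE page_views (   -- hypothetical name and schema
  view_time STRING,
  user_id   STRING,
  url       STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/logs/page_views';    -- hypothetical HDFS path

-- Dropping an EXTERNAL table removes only the metastore entry;
-- the files under /data/logs/page_views remain in HDFS.
DROP TABLE page_views;
```

Had the table been created without EXTERNAL, the same DROP would also delete the underlying files.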