how to let one map task read multiple files?

2011-08-27 Thread Daniel,Wu
I have a 7 GB file, which I loaded with the command load data local inpath '/home/oracle/store_sales.csv' into table store_sales; The file is not compressed, so I want to compress the table to make queries run faster (I don't know how to let Hive work on a compressed file directly). So I use t

Re:Re:Re: Re: RE: Why a sql only use one map task?

2011-08-25 Thread Daniel,Wu
After I ran set mapred.min.split.size=2; it kicks off 3 map tasks (the file I have is 500 MB). So it looks like we need to set mapred.min.split.size instead of mapred.map.tasks to control how many map tasks are kicked off. At 2011-08-25 19:38:30, "Daniel,Wu" wrote: It works, a

Re:Re: Re: RE: Why a sql only use one map task?

2011-08-25 Thread Daniel,Wu
35:38, "Ashutosh Chauhan" wrote: This may be because CombineHiveInputFormat is combining your splits into one map task. If you don't want that to happen, do: hive> set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat; 2011/8/24 Daniel,Wu: I pasted the inform

Re:Re: RE: Why a sql only use one map task?

2011-08-24 Thread Daniel,Wu
3 | NA | NA job_201108242119_0006 | NORMAL | oracle | select period_key,count(*) from...period_key(Stage-1) | 100.00% | 1 | 1 | 100.00% | 3 | 3 | NA | NA At 2011-08-24 18:19:38, wd wrote: > What about your total Map Task Capacity? > You may check it from http://your_jobtracker:50030/jobtracker.jsp > > 2011/8/24 Daniel,Wu: >

Re:RE: Why a sql only use one map task?

2011-08-23 Thread Daniel,Wu
t from one node. But in any case it starts only one map task. At 2011-08-24 02:28:18, "Aggarwal, Vaibhav" wrote: If you actually have splittable files, you can set mapred.max.split.size appropriately to create more splits. Thanks, Vaibhav From: Daniel,Wu
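The two suggestions made in this thread (switching the input format and capping the split size) can be tried in a Hive session roughly as follows. This is only a sketch: the 64 MB value is an assumed example, not a figure from the thread, and the resulting number of map tasks depends on the file format, block size, and Hadoop version.

```sql
-- Suggestion from Ashutosh Chauhan's reply: stop CombineHiveInputFormat
-- from combining the splits into one map task.
set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;

-- Suggestion from Vaibhav Aggarwal's reply: cap the split size so that a
-- splittable file yields more splits.
-- 67108864 bytes (64 MB) is an assumed value chosen for illustration.
set mapred.max.split.size=67108864;

-- Re-run the query from the thread and check the number of map tasks
-- in the job information.
select count(*) from sales;
```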

Re:Re: Why a sql only use one map task?

2011-08-23 Thread Daniel,Wu
ed format? If yes, because of that it is only split into a single map. 2011/8/23 Daniel,Wu: I ran the following simple SQL: select count(*) from sales; The job information shows it uses only one map task. The underlying Hadoop cluster has 3 data/data nodes, so I expected Hive to kick off 3 map tasks, one on

Why a sql only use one map task?

2011-08-23 Thread Daniel,Wu
I ran the following simple SQL: select count(*) from sales; The job information shows it uses only one map task. The underlying Hadoop cluster has 3 data/data nodes, so I expected Hive to kick off 3 map tasks, one on each task node. What can make Hive run only one map task? Do I need to set some

wants to create a JIRA (request): multiple tables join with only one huge table.

2011-08-14 Thread Daniel,Wu
Hi everyone, I'd like to create a change request (or JIRA, not sure). Do you think it's feasible? I searched the documentation on how to contribute but couldn't find how to create a request. Could anyone point me to the right document? At 2011-08-14 17:08:26, "Daniel,W

failed when create an index with partitioned by clause

2011-08-14 Thread Daniel,Wu
create table part (a int,b int) PARTITIONED by (c int); create index part_idx on table part(b,c) AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler' WITH DEFERRED REBUILD partitioned by (a) ; hive> create index part_idx on table part(b,c) AS 'org.apache.hadoop.hive.ql.index.comp

Re:Re: Re: multiple tables join with only one huge table.

2011-08-14 Thread Daniel,Wu
tables at once (meaning in a single map-reduce) if they all join on the same key. 2011/8/13 Daniel,Wu: Thanks, it works, but not as efficiently as possible: suppose we join 10 small tables (s1,s2...s10) with one huge table (big) in a data warehouse system (the join is between big table and small t

Re:Re: multiple tables join with only one huge table.

2011-08-13 Thread Daniel,Wu
ler tables specified in the MAPJOIN hint into memory. Then every small table is in the memory of each mapper. -Ayon From: "Daniel,Wu" To: hive Sent: Thursday, August 11, 2011 7:01 PM Subject: multi

how to load data to partitioned table

2011-08-11 Thread Daniel,Wu
Suppose the table is partitioned by period_key, and the CSV file also has a column named period_key. The CSV file contains multiple days of data. How can we load it into the table? I can think of a workaround: first load the data into a non-partitioned table, and then insert the data from n
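The workaround described in this post (stage the CSV in a non-partitioned table, then insert into the partitioned one) could look roughly like the sketch below, assuming a Hive version that supports dynamic partition inserts. The table names sales_staging and sales_part, the columns item_key and amount, and the file path are placeholders invented for illustration; only period_key comes from the post.

```sql
-- Hypothetical staging table: plain, non-partitioned, matching the CSV layout.
CREATE TABLE sales_staging (item_key INT, amount FLOAT, period_key INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- Hypothetical target table, partitioned by period_key.
CREATE TABLE sales_part (item_key INT, amount FLOAT)
PARTITIONED BY (period_key INT);

-- Load the multi-day CSV as-is into the staging table (placeholder path).
LOAD DATA LOCAL INPATH '/path/to/sales.csv' INTO TABLE sales_staging;

-- Allow Hive to derive the partition value from the data itself.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- The partition column must come last in the SELECT list.
INSERT OVERWRITE TABLE sales_part PARTITION (period_key)
SELECT item_key, amount, period_key FROM sales_staging;
```

Each distinct period_key value in the staging table ends up in its own partition of the target table.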

create table for hive

2011-08-11 Thread Daniel,Wu
drop table store_sales; CREATE TABLE store_sales( SUBVENDOR_ID_KEY int , VENDOR_KEY int , RETAILER_KEY int , ITEM_KEY int , STORE_KEY int , SubvendorId string, OOS_REASON_KEY int , Total_Sales_Amount float , Total_Sales_Volume_Units float , Store_On_Hand_Volume_Units float , Promoted_Sal

how to make the data in one table available to multiple tables?

2011-08-11 Thread Daniel,Wu
We have a table named sales, which is partitioned by period (MMDD), and we also need a table ly_sales (last year's sales). For performance reasons, we don't use a view that joins sales with the last-year mapping table (e.g. 20110603 mapped to 20100603). However we used the

how to distribute a small table to all nodes?

2011-08-11 Thread Daniel,Wu
If we have a very small table to be joined, we can use a map-side join, which needs the small table to be available to each map task. Is it possible to replicate the small table to ALL nodes when creating it, to cut the time spent distributing it?
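For reference, a map-side join in Hive is usually requested with the MAPJOIN hint, which loads the hinted table into the memory of each mapper. The sketch below uses hypothetical tables big and small joined on a placeholder column id. It does not answer the question of replicating the table at creation time; it only shows the join itself.

```sql
-- Hypothetical tables: big (large table) and small (very small table).
-- The MAPJOIN hint names the alias of the table to hold in memory.
SELECT /*+ MAPJOIN(s) */ b.id, s.name
FROM big b
JOIN small s ON (b.id = s.id);
```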

multiple tables join with only one huge table.

2011-08-11 Thread Daniel,Wu
Suppose the retailer fact table is sale_fact with 10B rows, and it joins with 3 small tables: stores (10K), products (10K), period (1K). What's the best join solution? Oracle can first build a hash table for stores, one for products, and one for period, then probe them using the fact table; if the row mat

why need to copy when run a sql with a single map

2011-08-10 Thread Daniel,Wu
I ran a single query like select retailer_key,count(*) from records group by retailer_key; It uses a single map as shown below. Since the file is already on HDFS, I think Hadoop/Hive doesn't need to copy anything. Kind | % Complete | Num Tasks | Pending | Running | Complete | Killed | Failed/Killed Task Attempt

Fw:why hive has such a high latency?

2011-08-10 Thread Daniel,Wu
Does anyone know why Hive has such high latency? Scanning a table with 16,522,439 rows takes more than 85 seconds. Reading this data off disk takes only about 10 seconds (not even considering caching, which reads data from memory). So where do the other 75 seconds go? Will Deserialize & Serialize t

what's the benefit of integrating hbase with hive? For low latency?

2011-08-08 Thread Daniel,Wu
The Hive documentation says Hive has high latency: querying a table of about 100M might take 1 minute. HBase is a high-performance database, so does that mean that after integrating Hive with HBase, Hive will get better performance with lower latency?