I have a 7 GB file, which I loaded using the command
load data local inpath '/home/oracle/store_sales.csv' into table store_sales;
That file is not compressed, so I want to compress the table to make it work
faster (I don't know how to let Hive work on a compressed file directly). So I
use t
after I set
set mapred.min.split.size=2;
Then it kicks off 3 map tasks (the file I have is 500 MB). So it looks like we
need to set mapred.min.split.size rather than mapred.map.tasks to control how
many map tasks are started.
At 2011-08-25 19:38:30,"Daniel,Wu" wrote:
It works, a
35:38,"Ashutosh Chauhan" wrote:
This may be because CombineHiveInputFormat is combining your splits into one map
task. If you don't want that to happen, do:
hive> set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
2011/8/24 Daniel,Wu
I pasted the inform I pasted
3 NA NA
job_201108242119_0006 | NORMAL | oracle | select period_key,count(*) from...period_key(Stage-1) | 100.00% | 1 | 1 | 100.00% | 3 | 3 | NA | NA
At 2011-08-24 18:19:38,wd wrote:
>What about your total Map Task Capacity?
>you may check it from http://your_jobtracker:50030/jobtracker.jsp
>
>2011/8/24 Daniel,Wu :
>&g
t from one node. But anyhow it starts
only one map task.
At 2011-08-24 02:28:18,"Aggarwal, Vaibhav" wrote:
If your files are actually splittable, you can create more splits by setting
mapred.max.split.size appropriately.
Thanks
Vaibhav
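A minimal sketch of the suggestion above, set per session before re-running the query from the original question. The 64 MB value is only an illustrative assumption, not a figure from this thread, and how the setting is honored depends on the input format in use.

```sql
-- Assumed value: cap each input split at 64 MB (67108864 bytes), so a
-- splittable file larger than this can be divided among several map tasks.
set mapred.max.split.size=67108864;

-- Re-run the query, then check the number of map tasks in the job tracker.
select count(*) from sales;
```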
From: Daniel,Wu
ed format
If yes, that is why it is only split into a single map.
2011/8/23 Daniel,Wu
I ran the following simple SQL:
select count(*) from sales;
The job information shows it uses only one map task.
The underlying Hadoop cluster has 3 data nodes, so I expected Hive to kick off 3
map tasks, one on
I ran the following simple SQL:
select count(*) from sales;
The job information shows it uses only one map task.
The underlying Hadoop cluster has 3 data nodes, so I expected Hive to kick off 3
map tasks, one on each task node. What could make Hive run only one map task? Do
I need to set some
Hi everyone,
I'd like to create a change request (or a JIRA, I'm not sure which). Do you
think it's feasible? I searched the documentation on how to contribute, but
couldn't find how to create a request. Could anyone point me to the right
document?
At 2011-08-14 17:08:26,"Daniel,W
create table part (a int,b int) PARTITIONED by (c int);
create index part_idx on table part(b,c) AS
'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler' WITH DEFERRED
REBUILD
partitioned by (a) ;
hive> create index part_idx on table part(b,c) AS
'org.apache.hadoop.hive.ql.index.comp
tables at
once (meaning in a single map-reduce) if they all join on the same key.
2011/8/13 Daniel,Wu
Thanks, it works, but not as efficiently as it could:
suppose we join 10 small tables (s1,s2...s10) with one huge table (big) in a
data warehouse system (the join is between big table and small t
ler tables specified
in the MAPJOIN hint into memory. Then every small table is held in the memory
of each mapper.
-Ayon
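For illustration, a map-side join with a hint might look like the sketch below. The table and column names (big, s1, s2, key1, key2) are hypothetical and only stand in for one large table and two small ones; the small tables named in the hint are the ones loaded into each mapper's memory.

```sql
-- Hypothetical schema: big is the large fact table,
-- s1 and s2 are small tables that fit in memory.
select /*+ MAPJOIN(s1, s2) */ count(*)
from big
join s1 on (big.key1 = s1.key1)
join s2 on (big.key2 = s2.key2);
```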
From: "Daniel,Wu"
To: hive
Sent: Thursday, August 11, 2011 7:01 PM
Subject: multi
Suppose the table is partitioned by period_key, and the csv file also has a
column named period_key. The csv file contains multiple days of data. How
can we load it into the table?
I thought of a workaround: first load the data into a non-partitioned table, and
then insert the data from n
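One way such a workaround is commonly written uses Hive's dynamic partition insert. This is only a sketch under assumptions: the table names sales_staging and sales, the columns col_a and col_b, and the file path are all hypothetical, and the Hive version in use must support dynamic partitions.

```sql
-- Hypothetical staging table with the same columns as the csv file,
-- where period_key is an ordinary column.
load data local inpath '/path/to/data.csv' into table sales_staging;

-- Allow Hive to derive the partition values from the data.
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

-- The dynamic partition column must come last in the select list.
insert overwrite table sales partition (period_key)
select col_a, col_b, period_key
from sales_staging;
```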
drop table store_sales;
CREATE TABLE store_sales(
SUBVENDOR_ID_KEY int ,
VENDOR_KEY int ,
RETAILER_KEY int ,
ITEM_KEY int ,
STORE_KEY int ,
SubvendorId string,
OOS_REASON_KEY int ,
Total_Sales_Amount float ,
Total_Sales_Volume_Units float ,
Store_On_Hand_Volume_Units float ,
Promoted_Sal
We have a table named sales, which is partitioned by period (MMDD),
and we also need a table ly_sales (last year's sales). For performance reasons,
we don't use a view that joins sales with a last-year mapping table (e.g.
20110603 mapped to 20100603). However we used the
If we have a very small table to be joined, we can use a map-side join, which
needs the small table to be located on the map task. Is it possible to replicate
the small table to ALL nodes when the small table is created, to cut the time
needed to distribute it?
If the retailer fact table is sale_fact with 10B rows, and it is joined with 3
small tables: stores (10K), products (10K), period (1K), what's the best join
solution? In Oracle, it can first build a hash for stores, a hash for products,
and a hash for period, then probe using the fact table; if the row mat
I run a single query like
select retailer_key,count(*) from records group by retailer_key;
It uses a single map task, as shown below. Since the file is already on HDFS, I
think Hadoop/Hive doesn't need to copy anything.
Kind | % Complete | Num Tasks | Pending | Running | Complete | Killed | Failed/Killed Task Attempt
Does anyone know why Hive has such high latency? Scanning a table with
16,522,439 rows takes more than 85 seconds. To read this data off disk, we only
need about 10 seconds (even without considering caching, which reads data from
memory). So where do the other 75 seconds go? will Deserialize & Serialize t
The Hive documentation says Hive is high latency: querying a table with about
100M might take 1 minute. HBase is a high-performance database, so does that
mean that after integrating Hive with HBase, Hive will get better performance
with lower latency?