RE: Hive Dynamic Partitions - How to avoid overwrite

2011-10-04 Thread Aggarwal, Vaibhav
You can choose to partition by (country, date). In this case you move the data into a date partition within your country partition and avoid overwriting old data. If you choose to go this way, one thing to check is that this should not result in too many partitions. Large number of partitions have
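The layout described above can be sketched as follows; the table and column names are hypothetical, only the (country, date) partition scheme comes from the answer:

```sql
-- Hypothetical table, partitioned by country and then date.
CREATE TABLE page_views (user_id STRING, url STRING)
PARTITIONED BY (country STRING, dt STRING);

-- Loading a new day's data overwrites only that (country, dt) partition;
-- earlier dates under the same country are left untouched.
INSERT OVERWRITE TABLE page_views PARTITION (country='us', dt='2011-10-04')
SELECT user_id, url FROM staging_page_views;
```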

RE: Map joins in hive

2011-09-27 Thread Aggarwal, Vaibhav
Does it get stuck before creating the Hadoop job or after creating it? In case it is stuck before creating the Hadoop job, you can look at hive.log (wherever you are directing it) to see what is taking a long time to set up the job. In case the Hadoop job has already started you can look at

RE: Benchmarking problems

2011-09-27 Thread Aggarwal, Vaibhav
You can turn speculative execution on, which might help you with a few slow-progressing tasks. mapred.map.tasks.speculative.execution and mapred.reduce.tasks.speculative.execution are the job conf options. -Original Message- From: bharath vissapragada [mailto:bharathvissaprag
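The two job conf options named above can be set per session from the Hive CLI; a minimal sketch:

```sql
SET mapred.map.tasks.speculative.execution=true;
SET mapred.reduce.tasks.speculative.execution=true;
-- Hadoop may now launch backup attempts for unusually slow tasks
-- and keep whichever attempt finishes first.
```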

RE: Best practices for storing data on Hive

2011-09-06 Thread Aggarwal, Vaibhav
>However, given the amount of users that visit our website (hundreds of >thousands of unique users every day), this would lead to a large number of >partitions (and rather small file sizes, ranging from a >couple of bytes to a >couple of KB). From the documentation I've read online, it seems th

RE: Best practices for storing data on Hive

2011-09-06 Thread Aggarwal, Vaibhav
Hi You could choose to have the second table (for user ids) partitioned by date also. table_root/userid=ab/date=2010-12-31/ That way you can split your data set by both a userid and a date. You can use dynamic partitions to transform existing date partitioned table into userid/date partition
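A sketch of the dynamic-partition transformation described above; all table and column names are hypothetical:

```sql
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Repartition a date-partitioned table into userid/date in one pass.
-- In a dynamic-partition insert the partition columns (userid, dt)
-- must come last in the SELECT list.
INSERT OVERWRITE TABLE events_by_user PARTITION (userid, dt)
SELECT url, referrer, userid, dt FROM events_by_date;
```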

Keywords in Hive

2011-08-31 Thread Aggarwal, Vaibhav
Hi Is there a wiki page which contains a list of keywords in Hive? Can we use 'time' or 'date' as column names? Thanks Vaibhav

RE: Hive in EC2

2011-08-30 Thread Aggarwal, Vaibhav
You could also choose to look at Amazon ElasticMapReduce. It allows you to provision an EC2 cluster of your choice preinstalled with Hive and Hadoop. https://cwiki.apache.org/confluence/display/Hive/HiveAmazonElasticMapReduce Thanks Vaibhav -Original Message- From: MIS [mailto:misapa...

RE: HIVE_AUX_JARS_PATH

2011-08-29 Thread Aggarwal, Vaibhav
You need to point to the exact jar file location and not just the directory location. Vaibhav -Original Message- From: Sam William [mailto:sa...@stumbleupon.com] Sent: Monday, August 29, 2011 3:56 PM To: user@hive.apache.org Subject: HIVE_AUX_JARS_PATH I assume you need to set HIVE_AU
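For example (the jar path here is hypothetical):

```shell
# HIVE_AUX_JARS_PATH must name the jar file itself,
# not the directory that contains it.
export HIVE_AUX_JARS_PATH=/usr/lib/hive/aux/my-udfs.jar
echo "$HIVE_AUX_JARS_PATH"
```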

RE: how to let one map task read multiple files?

2011-08-27 Thread Aggarwal, Vaibhav
CombineFileInputFormat can be used to combine multiple files into one map task. But CombineFileInputFormat does not attempt to combine compressed files. Hive defaults to HiveFileInputFormat, which creates at least one map task per file. 7G of data is not a lot for a 3-node cluster to process and y
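A sketch of switching to the combining input format, assuming the input files are uncompressed (and therefore combinable); the split size value is an arbitrary example:

```sql
SET hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
-- Upper bound on a combined split, in bytes (here 256 MB).
SET mapred.max.split.size=268435456;
```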

RE: Why a sql only use one map task?

2011-08-23 Thread Aggarwal, Vaibhav
If you actually have splittable files you can set the following setting to create more splits: mapred.max.split.size appropriately. Thanks Vaibhav From: Daniel,Wu [mailto:hadoop...@163.com] Sent: Tuesday, August 23, 2011 6:51 AM To: hive Subject: Why a sql only use one map task? I run the fo
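For instance, to force more splits (and therefore more map tasks) on a splittable file, lower the cap; the value below is an arbitrary example in bytes:

```sql
-- 64 MB per split: a 1 GB splittable file then yields ~16 map tasks.
SET mapred.max.split.size=67108864;
```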

RE: Hive Custom UDF - "hive.aux.jars.path" not working

2011-08-22 Thread Aggarwal, Vaibhav
Did you restart the hive server after modifying the hive-site.xml settings? I think you need to restart the server to pick up the latest settings in the config file. Thanks Vaibhav From: Amit Sharma [mailto:amitsharma1...@gmail.com] Sent: Monday, August 22, 2011 2:42 PM To: user@hive.apache.org
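For reference, the setting in question lives in hive-site.xml; a minimal fragment with a hypothetical jar path (remember to restart the Hive server after editing it):

```xml
<property>
  <name>hive.aux.jars.path</name>
  <value>file:///usr/lib/hive/aux/my-udfs.jar</value>
</property>
```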

RE: org.apache.hadoop.fs.ChecksumException: Checksum error:

2011-08-19 Thread Aggarwal, Vaibhav
This is a really curious case. How many replicas of each block do you have? Are you able to copy the data directly using HDFS client? You could try the hadoop fs -copyToLocal command and see if it can copy the data from hdfs correctly. That would help you verify that the issue really is at HDFS
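The suggested check could look like this; the HDFS path is hypothetical:

```shell
# Copy one file of the table out of HDFS, bypassing Hive entirely.
hadoop fs -copyToLocal /user/hive/warehouse/mytable/part-00000 /tmp/part-00000

# If this also raises a ChecksumException, the corruption is at the
# HDFS layer; fsck can report which blocks and replicas are affected.
hadoop fsck /user/hive/warehouse/mytable -files -blocks -locations
```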

RE: Alter table Set Locations for all partitions

2011-08-19 Thread Aggarwal, Vaibhav
You could also specify fully qualified hdfs path in the create table command. It could look like create external table test(key string ) row format delimited fields terminated by '\000' collection items terminated by ' ' location 'hdfs://new_master_host:port/table_path'; Then you can use the 'ins
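Reformatted for readability, the DDL in the snippet reads roughly as follows ('port' is a placeholder carried over from the original):

```sql
CREATE EXTERNAL TABLE test (key STRING)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\000'
  COLLECTION ITEMS TERMINATED BY ' '
LOCATION 'hdfs://new_master_host:port/table_path';
```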

RE: how to load data to partitioned table

2011-08-12 Thread Aggarwal, Vaibhav
If you want to insert data into a partitioned table without specifying the partition value, you need to enable dynamic partitioning. You can use the following switches: SET hive.exec.dynamic.partition=true; SET hive.exec.dynamic.partition.mode=nonstrict; Thanks Vaibhav From: Daniel,Wu [mailto:h
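The two switches, followed by the kind of insert they enable; table and column names are hypothetical:

```sql
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- dt is taken from the last SELECT column for each row, so the
-- partition value need not be spelled out in the query.
INSERT OVERWRITE TABLE sales PARTITION (dt)
SELECT item, price, dt FROM staging_sales;
```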

RE: Reducer Issue in New Setup

2011-08-11 Thread Aggarwal, Vaibhav
Are you using a custom scheduler? I have seen issues with jobs having 0 mappers and 1 reducer with the Fair Scheduler. From: hadoop n00b [mailto:new2h...@gmail.com] Sent: Thursday, August 11, 2011 9:32 AM To: user@hive.apache.org Subject: Reducer Issue in New Setup Hello, We have just setup Hive on

RE: CDH3 U1 Hive Job-commit very slow

2011-08-10 Thread Aggarwal, Vaibhav
there is only 10186 partitions in the metadata store (select count(1) from PARTITIONS; in mysql), I think it is not the problem. 2011/8/10 Aggarwal, Vaibhav mailto:vagg...@amazon.com

RE: CDH3 U1 Hive Job-commit very slow

2011-08-09 Thread Aggarwal, Vaibhav
Do you have a lot of partitions in your table? Time taken to process the partitions before submitting the job is proportional to number of partitions. There is a patch I submitted recently as an attempt to alleviate this problem: https://issues.apache.org/jira/browse/HIVE-2299 If that is not th

RE: what's the benifit of integrate hbase with hive? For low latency?

2011-08-08 Thread Aggarwal, Vaibhav
There are many potential benefits of using the Hive HBase handler.
1. The most obvious is the ability to run SQL-like queries on your data instead of using the HBase client.
2. Ability to join data with other data sources like HDFS or S3.
3. Ability to move data from your Hive tables int

RE: Hive 0.7 using only one mapper

2011-07-28 Thread Aggarwal, Vaibhav
If you are using CombineHiveInputFormat it might be the case that all files are being combined into one large split and hence 1 mapper gets created. If that is the case you can set the max split size in hive-default.xml config file to create more splits and hence more map tasks: mapred.max.s
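As a config-file equivalent of the per-session setting (value in bytes, chosen arbitrarily for illustration):

```xml
<!-- hive-default.xml (or hive-site.xml): cap combined splits at 128 MB -->
<property>
  <name>mapred.max.split.size</name>
  <value>134217728</value>
</property>
```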

RE: Hive session locking up after 4 queries using S3

2011-07-06 Thread Aggarwal, Vaibhav
On Wednesday, July 6, 2011 at 7:39 PM, Aggarwal, Vaibhav wrote: Could you please tell us which Hadoop and Hive versions you are using? Looks like you might be using an older version of Hadoop (more specifically one

RE: Hive session locking up after 4 queries using S3

2011-07-06 Thread Aggarwal, Vaibhav
Could you please tell us which Hadoop and Hive versions you are using? Looks like you might be using an older version of Hadoop (more specifically, one which ships with an old version of jets3t). Thanks Vaibhav From: Wouter de Bie [mailto:wou...@spotify.com] Sent: Wednesday, July 06, 2011 9:07 AM To: