Fwd: Wikipedia Dump Analysis

2013-10-02 Thread Ajeet S Raina
> Hello,
>
> I have Hadoop running on HDFS with Hive installed. I am able to import the Wikipedia dump into HDFS through the below command:
>
> http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
>
> $ hadoop jar out.jar edu.umd.cloud9.collection.wikipedia.DumpWiki

Hive User Group Meetup NYC Edition 2013

2013-10-02 Thread Prasad Mujumdar
The next NYC edition of the Hive User Group Meetup is happening on October 28th, 6:30pm ET at Hilton New York (East Suite). The format will be a series of short 15-minute talks followed by un-conference style sessions and networking. Space for the meetup along with food and refreshments are b

Re: inflight jobs and updating a partition location

2013-10-02 Thread Timothy Potter
I think there was an issue with reading views from Pig using HCatLoader (0.4.x). Views definitely seem cleaner though.

On Wed, Oct 2, 2013 at 2:06 PM, Stephen Sprague wrote:
> Hi Tim,
> I guess there are several ways to do it and your method seems to be one of them. I have a need for the sa

Re: inflight jobs and updating a partition location

2013-10-02 Thread Stephen Sprague
Hi Tim,

I guess there are several ways to do it and your method seems to be one of them. I have a need for the same thing and I create a view instead. It points to the date that is the latest partition. e.g.

create view foo_latest_vw as select * from foo_table where date= #and date is the partiti
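Sprague's snippet is cut off mid-statement; below is a minimal sketch of the view approach he describes. The table name `foo_table` comes from his message, while the partition value is a hypothetical placeholder that a scheduled job would substitute in each time it recreates the view:

```sql
-- Recreate this view whenever a new partition lands; readers always
-- query foo_latest_vw and never need to know the current date value.
CREATE OR REPLACE VIEW foo_latest_vw AS
SELECT *
FROM foo_table
WHERE `date` = '2013-10-02';  -- placeholder: the latest partition value
```

The upside of this design is that downstream queries never change; the cost is that something has to rerun the `CREATE OR REPLACE VIEW` whenever the latest partition advances.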

Re: inflight jobs and updating a partition location

2013-10-02 Thread Timothy Potter
btw ... this appears to work in my env - hive 0.9.0 (cdh 4.1.1) ... let me know if there are any drawbacks to this approach. Thanks. Tim

On Wed, Oct 2, 2013 at 1:12 PM, Timothy Potter wrote:
> Hi,
>
> I'd like to implement a "latest" partition concept for one of my tables and believe I can sim

inflight jobs and updating a partition location

2013-10-02 Thread Timothy Potter
Hi,

I'd like to implement a "latest" partition concept for one of my tables and believe I can simply update the location using:

alter table X partition (date='latest') set location 'foo';

This assumes two partitions can point at the same location? My other question is what happens to current runn
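A sketch of the two statements Potter's approach implies, assuming (hypothetically) that table `X` is partitioned by a string `date` column and that the load job writes each day's data to its own HDFS directory. Both the real dated partition and the standing `latest` alias partition end up pointing at the same directory, which Hive permits since partition locations are just paths in the metastore:

```sql
-- Real partition registered by the load job (path is a placeholder)
ALTER TABLE X ADD IF NOT EXISTS PARTITION (`date` = '2013-10-02')
  LOCATION '/data/X/2013-10-02';

-- Repoint the standing 'latest' partition at that same directory
ALTER TABLE X PARTITION (`date` = 'latest')
  SET LOCATION '/data/X/2013-10-02';
```

As the thread notes, the open question is what happens to jobs already running against the old location when the `SET LOCATION` lands; since the metastore change does not move or delete the old files, in-flight map tasks that already resolved the previous path would typically keep reading it.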

Use distribute to spread across reducers

2013-10-02 Thread Keith Wiley
I'm trying to create a subset of a large table for testing. The following approach works:

create table subset_table as select * from large_table limit 1000

...but it only uses one reducer. I would like to speed up the process of creating a subset by distributing across multiple reducers. I
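The message is truncated before any answer appears; one common pattern for this (an assumption, not something stated in the thread) is to distribute and sort by `rand()` so the shuffle work spreads across reducers. Table names are taken from the question; the reducer count is a placeholder:

```sql
-- Ask Hive for several reducers rather than one.
SET mapred.reduce.tasks = 8;

-- rand() as the distribute/sort key spreads rows evenly across reducers.
CREATE TABLE subset_table AS
SELECT *
FROM large_table
DISTRIBUTE BY rand()
SORT BY rand()
LIMIT 1000;
```

Note that the final `LIMIT 1000` still funnels through a single last stage, but the expensive scan-and-shuffle work is parallelized, and the `rand()` key gives a roughly uniform sample rather than the first 1000 rows of one split.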

Re: Dealing with duplicate rows in Hive

2013-10-02 Thread Nitin Pawar
Yes, doing group by or distinct on 50 columns is ugly. One option (ugly as well) is to first select only these 9 columns and then do a select * with a join against the first, something like:

(select distinct cols from table) a join (select * from table) b on (a.col = b.col)

I am really not sure this works
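Nitin's sketch filled out with hypothetical column names (`col1` through `col9` standing in for the nine key columns). His own doubt is justified: because every row of the full table matches its own key combination, this join returns all rows including the duplicates, so on its own it does not deduplicate:

```sql
SELECT b.*
FROM (SELECT DISTINCT col1, col2, col3 /* ... through col9 */
      FROM t) a
JOIN t b
  ON  a.col1 = b.col1
  AND a.col2 = b.col2
  AND a.col3 = b.col3;  -- and so on for the remaining key columns
```

To actually collapse duplicates, the inner query would need to pick one representative row per key, which is what the group-by variant later in the thread does.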

[ANN] Hivemall: Hive scalable machine learning library

2013-10-02 Thread Makoto YUI
Hello all, My employer, AIST, has given the thumbs up to open source our machine learning library, named Hivemall. Hivemall is a scalable machine learning library running on Hive/Hadoop, licensed under the LGPL 2.1. https://github.com/myui/hivemall Hivemall provides machine learning functiona

Re: Dealing with duplicate rows in Hive

2013-10-02 Thread Philo Wang
Thanks for the suggestion! Unfortunately, if you use group by in a query, all columns in the select statement must also appear in the group by. I can always select distinct on all 50 columns (or group by all 50 columns), but that sounds very extreme and I feel there has to be a better solution

Re: Dealing with duplicate rows in Hive

2013-10-02 Thread Nitin Pawar
Maybe you want to try group by in Hive. select distinct col1, col2, col3 works, but if you want to select all 50 columns it's tricky. The other option would be to group by all those 9 keys; that should ensure the combination of those 9 columns is unique.

On Wed, Oct 2, 2013 at 12:34 PM, Ph
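A sketch of the group-by variant Nitin suggests, with hypothetical column names: group on the nine key columns and take an arbitrary representative (here `MIN`) for each remaining column. This sidesteps the rule that every selected column must appear in the group by, since the non-key columns are aggregated:

```sql
SELECT col1, col2, /* ... */ col9,
       MIN(col10) AS col10,   -- arbitrary representative value
       MIN(col11) AS col11    -- repeat for each remaining non-key column
FROM t
GROUP BY col1, col2, /* ... */ col9;
```

This is only safe if the duplicates are exact copies on the non-key columns, or if any representative value is acceptable; otherwise the `MIN` calls can mix values from different duplicate rows.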

Re: Dealing with duplicate rows in Hive

2013-10-02 Thread Philo Wang
Yes, that is correct.

On Tue, Oct 1, 2013 at 11:21 PM, Nitin Pawar wrote:
> So you have 50 columns and out of them you want to use 9 columns for finding unique rows?
>
> am i correct in assuming that you want to make a key of combination of these 9 columns so that you have just one row for a