> Hello,
>
> I have Hadoop running on HDFS with Hive installed. I am able to import the
> Wikipedia dump into HDFS with the command below:
>
> http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
>
> $ hadoop jar out.jar edu.umd.cloud9.collection.wikipedia.DumpWiki
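One way to get at the converted dump from Hive is an external table over
the job's output directory. This is only a sketch: the path
/user/hive/wikipedia and the tab-separated (id, title, text) layout are
assumptions about what the conversion job emits, not something stated in
the mail.

create external table wikipedia_raw (
  doc_id string,   -- assumed: document id written by the conversion job
  title  string,   -- assumed: article title
  body   string    -- assumed: article text, one record per line
)
row format delimited fields terminated by '\t'
stored as textfile
location '/user/hive/wikipedia';   -- assumed output directory of the job above

An external table only describes files already in place, so if the job
writes a different layout only the column list and location need to change.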
The next NYC edition of the Hive User Group Meetup is happening on October
28th, 6:30pm ET at the Hilton New York (East Suite). The format will be a
series of short 15-minute talks followed by unconference-style sessions and
networking. Space for the meetup along with food and refreshments are b…
I think there was an issue with reading views from Pig using HCatLoader
(0.4.x). Views definitely seem cleaner though.
On Wed, Oct 2, 2013 at 2:06 PM, Stephen Sprague wrote:
> Hi Tim,
> I guess there are several ways to do it and your method seems to be one of
> them. I have a need for the same …
Hi Tim,
I guess there are several ways to do it and your method seems to be one of
them. I have a need for the same thing and I create a view instead. It
points to the date that is the latest partition, e.g.:
create view foo_latest_vw as select * from foo_table where date=<latest date>  # and date is the partition column
btw ... this appears to work in my env - hive 0.9.0 (cdh 4.1.1) ... let me
know if there are any drawbacks to this approach.
Thanks.
Tim
On Wed, Oct 2, 2013 at 1:12 PM, Timothy Potter wrote:
> Hi,
>
> I'd like to implement a "latest" partition concept for one of my tables
> and believe I can simply …
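Spelled out a little more, the view approach might look like the sketch
below. The table name and the string partition column called date come from
the example in the mail; the '2013-10-01' literal is purely an illustration
of "whatever the newest partition is right now":

create or replace view foo_latest_vw as
select * from foo_table
where date = '2013-10-01';   -- the current latest partition value

Repointing "latest" is then just re-running the statement with the new
literal (or drop view + create view on versions without "or replace"),
which touches metadata only and never moves data.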
Hi,
I'd like to implement a "latest" partition concept for one of my tables and
believe I can simply update the location using alter table X partition
(date='latest') set location 'foo';
This assumes two partitions can point at the same location?
My other question is what happens to currently running …
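A sketch of what that could look like in HiveQL. The table name X, the
paths, and the idea of reserving a synthetic partition value 'latest' are
taken from or implied by the question rather than tested here; depending on
the Hive version, set location may want a full hdfs:// URI:

-- one-time setup: a synthetic partition whose value is literally 'latest'
alter table X add if not exists partition (date='latest')
  location '/data/X/2013-10-01';

-- when a new day lands, repoint the synthetic partition at its directory
alter table X partition (date='latest') set location '/data/X/2013-10-02';

After the second statement the real partition for 2013-10-02 and the
'latest' partition point at the same directory, which is exactly the "two
partitions, one location" situation the question raises.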
I'm trying to create a subset of a large table for testing. The following
approach works:
create table subset_table as
select * from large_table limit 1000
...but it only uses one reducer. I would like to speed up the process of
creating a subset by distributing the work across multiple reducers. I …
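One hedged alternative, not from the thread: the global limit is what
forces a single reducer (one task has to count to 1000), so sampling in the
where clause instead keeps the job map-only. The 0.001 fraction below is
illustrative and the resulting row count is approximate rather than exactly
1000:

create table subset_table as
select * from large_table
where rand() <= 0.001;   -- each mapper keeps ~0.1% of its rows; no single-reducer bottleneck

If an exact count matters, a second pass with limit 1000 over the
already-small sampled table is cheap.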
Yes, doing a group by or distinct on 50 columns is ugly.
One option (ugly as well) is to first select only those 9 columns and then
join a select * against that first result, something like:
(select distinct cols from table) a join (select * from table) b on (a.col
= b.col)
I am really not sure this works.
Hello all,
My employer, AIST, has given the thumbs up to open source our machine
learning library, named Hivemall.
Hivemall is a scalable machine learning library running on Hive/Hadoop,
licensed under the LGPL 2.1.
https://github.com/myui/hivemall
Hivemall provides machine learning functionality …
Thanks for the suggestion! Unfortunately, if you use group by in a query,
all columns in the select statement must also appear in the group by. I can
always select distinct on all 50 columns (or group by all 50 columns), but
that sounds very extreme and I feel that there has to be a better solution.
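If the Hive in use has windowing functions (0.11 or later), one way to keep
exactly one row per combination of the 9 key columns without listing all 50
columns in the group by is row_number(). This is a sketch rather than
something from the thread, and big_table / k1 ... k9 are placeholder names:

create table deduped_rn as
select * from (
  select t.*,
         row_number() over (partition by k1, k2, k3, k4, k5, k6, k7, k8, k9
                            order by k1) as rn   -- any deterministic ordering will do
  from big_table t
) ranked
where rn = 1;   -- one arbitrary row per 9-column key; the helper rn column rides along and can be ignored or dropped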
Maybe you want to try group by.
In Hive, select distinct col1, col2, col3 works, but if you want to select
all 50 columns it's tricky.
The other option would be to group by those 9 keys, which takes care of
making the combination of those 9 columns unique.
On Wed, Oct 2, 2013 at 12:34 PM, Ph…
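Written out, that group-by option has to wrap every non-key column in an
aggregate, since HiveQL only allows grouped or aggregated columns in the
select list. The names below are placeholders, and min() is just one
arbitrary-but-deterministic way to pick a value per group:

create table deduped_gb as
select k1, k2, k3, k4, k5, k6, k7, k8, k9,
       min(col10) as col10,   -- repeat for each of the remaining 41 columns
       min(col11) as col11
from big_table
group by k1, k2, k3, k4, k5, k6, k7, k8, k9;

One drawback of this trick is that col10 and col11 can come from different
source rows; if whole-row fidelity matters, the row_number() sketch above
keeps rows intact.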
Yes, that is correct.
On Tue, Oct 1, 2013 at 11:21 PM, Nitin Pawar wrote:
> So you have 50 columns and out of them you want to use 9 columns for
> finding unique rows?
>
> Am I correct in assuming that you want to make a key out of the combination
> of these 9 columns so that you have just one row for a …