Howdy,
I'm coming from cassandra, and I'm actually trying to count all columns in a
column family. I believe that is similar to counting the number of tuples in
a bag, in the lingo of the pig manual. It was harder than I expected, but I
think this works:
rows = LOAD 'cassandra://MyKeyspace/MyColumnF
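The script is cut off above; pieced together from the complete fragments
later in this thread, a minimal version of the counting script would look
like this (keyspace/column family names are illustrative):
rows = LOAD 'cassandra://MyKeyspace/MyColumnFamily' USING CassandraStorage()
    AS (key:chararray, columns:bag {column:tuple (name, value)});
-- one count of columns (tuples in the bag) per row
counts = FOREACH rows GENERATE COUNT(columns);
-- collapse everything into one group and sum the per-row counts
counts_in_bag = GROUP counts ALL;
sum_of_bag = FOREACH counts_in_bag GENERATE SUM($1);
dump sum_of_bag;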
> columns in a conventional sense, but your code will return 5. Is
> that what you want? If so, your code seems correct.
>
> D
>
> On Fri, Jun 3, 2011 at 12:53 PM, William Oberman
> wrote:
> > Howdy,
> >
> > I'm coming from cassandra, and I'm actually trying
hat's my theory). As a
workaround, can I have COUNT ignore/skip rows with null columns? I'll start
digging through the docs as well.
will
On Fri, Jun 3, 2011 at 4:09 PM, William Oberman wrote:
> That is exactly what I wanted, thanks for the confirm!
>
>
> On Fri, Jun 3, 2
counts_in_bag = GROUP counts ALL;
sum_of_bag = FOREACH counts_in_bag GENERATE SUM($1);
dump sum_of_bag;
On Tue, Jun 7, 2011 at 4:33 PM, William Oberman wrote:
> I tried this same script on closer to production data, and I'm getting
> errors. I'm 50% sure it's this:
> https://issues.apach
counts_in_bag = GROUP counts ALL;
sum_of_bag = FOREACH counts_in_bag GENERATE SUM($1);
dump sum_of_bag;
For some reason typing the bag was causing me problems.
On Tue, Jun 7, 2011 at 4:58 PM, William Oberman wrote:
> I think FILTER will do the trick? E.g.
>
>
> rows = LOAD 'cassandra://
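The quoted example is truncated; a sketch of the FILTER workaround being
described, with illustrative names (null/empty bags are dropped before
counting):
rows = LOAD 'cassandra://MyKeyspace/MyColumnFamily' USING CassandraStorage()
    AS (key:chararray, columns:bag {column:tuple (name, value)});
-- skip rows whose column bag is null or empty so COUNT never sees them
real_rows = FILTER rows BY columns IS NOT NULL AND NOT IsEmpty(columns);
counts = FOREACH real_rows GENERATE COUNT(columns);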
I think I'm stuck on typing issues trying to store data in cassandra. To
verify, cassandra wants (key, {tuples})
My pig script is fairly brief:
raw = LOAD 'cassandra://test_in/test_cf' USING CassandraStorage() AS
(key:chararray, columns:bag {column:tuple (name, value)});
--columns == timeUUID -> J
Rather than staying stuck, I wrote a custom function: TupleToBagOfTuple. I'm
curious if I could have avoided this though.
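For anyone else stuck here: one possible way to build the (key, {tuples})
shape without a custom UDF is the TOTUPLE/TOBAG builtins. A sketch under
made-up names, which may need adjusting for older Pig versions:
-- hypothetical input: one value per key, to become one cassandra column
kv = LOAD 'data.tsv' USING PigStorage('\t') AS (key:chararray, val:chararray);
-- wrap the value as a (name, value) tuple inside a bag
out = FOREACH kv GENERATE key, TOBAG(TOTUPLE('val', val));
STORE out INTO 'cassandra://test_in/test_cf' USING CassandraStorage();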
On Wed, Jun 15, 2011 at 2:17 PM, William Oberman
wrote:
> I think I'm stuck on typing issues trying to store data in cassandra. To
> verify, cassandra wan
ndraBag from
> pygmalion - it does the work for you to get it back into a form that
> cassandra understands.
>
> Others may know better how to massage the data into that form using just
> pig, but if all else fails, you could write a udf to do that.
>
> Jeremy
>
> On Jun 1
I'll do a reply all, to keep this more consistent (sorry!).
Rather than staying stuck, I wrote a custom function: TupleToBagOfTuple. I'm
curious if I could have avoided it with proper pig scripting though.
On Wed, Jun 15, 2011 at 3:08 PM, William Oberman
wrote:
> My problem is the
I tried out hadoop/pig in my test environment using tar.gz's. Before I roll
out to production, I thought I'd try the cdh3 packages, as that might be
easier to maintain (since I'm not a sysadmin). Following cloudera's install
guide worked like a charm, but I couldn't get pig to run until I did thi
I thought pig was the one trying to write to /tmp inside of hadoop?
will
On Fri, Jul 8, 2011 at 3:00 PM, Dmitriy Ryaboy wrote:
> Seems like a question you should ask Cloudera?
>
> On Fri, Jul 8, 2011 at 11:57 AM, William Oberman
> wrote:
> > I tried out hadoop/pig in my test
g as the same user so it didn't
matter.
So, that makes getting the permissions right for /tmp more important, but I
didn't think the hadoop crowd would care since it's pig that causes the
write to that location. But a newbie pig user might need the FYI....
On Fri, Jul 8, 2011 at 3:0
Dai wrote:
> Check
>
> http://pig.apache.org/docs/r0.8.1/piglatin_ref1.html#Storing+Intermediate+Data
>
> Daniel
>
> On Fri, Jul 8, 2011 at 12:04 PM, William Oberman
> wrote:
>
> > Sorry, to be more verbose, CDH3 actually respects permissions inside of
> > HD
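The section Daniel links documents pig.temp.dir, which controls where that
intermediate data lands. A sketch of overriding it (the path is illustrative,
and I'm assuming the property is honored when set from the script; it can
also go in pig.properties):
-- point intermediate data somewhere other than /tmp
SET pig.temp.dir '/user/myuser/pigtmp';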
Hello,
My setup is Pig + Hadoop + Cassandra for my "big data" and MySql for my
"relational/meta data". Up until now that has been fine, but now I need to
start creating metrics that "cross the lines". In particular, I need to
create aggregations of Cassandra data based on lookups from MySql.
Af
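A sketch of one way to do that kind of cross-source aggregation, assuming the
MySql lookup table is first exported to a flat file (every name below is
hypothetical):
big = LOAD 'cassandra://MyKeyspace/MyColumnFamily' USING CassandraStorage()
    AS (key:chararray, columns:bag {column:tuple (name, value)});
meta = LOAD '/lookups/accounts.tsv' USING PigStorage('\t')
    AS (key:chararray, category:chararray);
-- 'replicated' keeps the small MySql export in memory on the mappers
joined = JOIN big BY key, meta BY key USING 'replicated';
by_cat = GROUP joined BY meta::category;
agg = FOREACH by_cat GENERATE group, COUNT(joined);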
> sizes are within reason.
>
>
> On Tue, Sep 11, 2012 at 8:17 AM, William Oberman
> wrote:
>
> > Hello,
> >
> > My setup is Pig + Hadoop + Cassandra for my "big data" and MySql for my
> > "relational/meta data". Up until now that has been
he DB.
>
> 1 - I've never used this:
>
> http://svn.apache.org/repos/asf/cassandra/trunk/contrib/pig/src/java/org/apache/cassandra/hadoop/pig/CassandraStorage.java
>
> On Tue, Sep 11, 2012 at 8:54 AM, William Oberman
> wrote:
>
> > Great news (for me)! :-) My r
ing I planned to move on, rather than figure that out :-)
will
On Tue, Sep 11, 2012 at 2:09 PM, William Oberman
wrote:
> Thanks (again)!
>
> I'm already using CassandraStorage to load the JSON strings. I used Maps
> because I liked being able to name the fields, but I could e
I'm trying to play around with Amazon EMR, and I currently have self hosted
Cassandra as the source of data. I was going to try to do: Cassandra -> S3
-> EMR. I've traced my problems to PigStorage. At this point I can
recreate my problem "locally" without involving S3 or Amazon.
In my local tes
o reproduce your problem? 1 ~ 2
> rows would be sufficient.
>
> Thanks,
> Cheolsoo
>
> On Tue, Nov 6, 2012 at 12:20 PM, William Oberman
> wrote:
>
> > I'm trying to play around with Amazon EMR, and I currently have self
> hosted
> > Cassandra as the source of
correctly.
On Tue, Nov 6, 2012 at 4:35 PM, Cheolsoo Park wrote:
> >> This is a dumb question, but PigStorage escapes the delimiter, right?
>
> No it doesn't.
>
> On Tue, Nov 6, 2012 at 1:29 PM, William Oberman wrote:
>
> > This is a dumb question, but P
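To make the failure mode concrete, a sketch with made-up paths: since
PigStorage does not escape its delimiter, a value containing a literal tab
corrupts the row on reload:
data = LOAD 'in' USING PigStorage('\t') AS (k:chararray, v:chararray);
-- if v contains a literal tab it is written out unescaped...
STORE data INTO 'out' USING PigStorage('\t');
-- ...so this reload sees three fields where there were two
back = LOAD 'out' USING PigStorage('\t') AS (k:chararray, v:chararray);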
Just in case someone hits this thread with the same issue, please vote
for this bug:
https://issues.apache.org/jira/browse/PIG-1271
On Tue, Nov 6, 2012 at 4:50 PM, William Oberman wrote:
> Wow, ok. That is completely unexpected. Thanks for the heads up!
>
> In my case, because p
A couple of weeks ago I spent a bunch of time trying to get EMR + S3 + Avro
working:
https://forums.aws.amazon.com/thread.jspa?messageID=398194
Short story, yes I think PIG-2540 is the issue. I'm currently trying to
get pig 0.10 running in EMR with help from AWS support. You have to do:
--boot
I should have read more closely, you're not using EMR.
I'm guessing if you upgrade to pig 0.10 the issue will go away...
On Fri, Nov 30, 2012 at 4:09 PM, William Oberman
wrote:
> A couple of weeks ago I spent a bunch of time trying to get EMR + S3 +
> Avro
Your line numbers aren't matching up to the 1.1.7 release, which is weird.
Based on the "stock" 1.1.7 source, there was a null check on str
before predicateFromString(str),
making your code path impossible...
will
On Tue, Dec 11, 2012 at 1:00 PM, Jonathan Coveney wrote:
> If I were debugging t
g.file=pig.log
> -Dpig.home.dir=/Library/pig-0.10.0/bin/..
> /Library/hadoop-1.0.2/bin/hadoop jar
> /Library/pig-0.10.0/bin/../pig-0.10.0-withouthadoop.jar
> -Dudf.import.list=org.apache.cassandra.hadoop.pig -x local rowcount.pig
>
> On 12/11/12 1:10
We managed to piece this together. It's not fully generic (we assume a
single field). But, it gets the job done for unit testing.
--
package com.civicscience.util;
import org.apache.pig.ResourceSchema;
import org.apache.pig.builtin.Utf8StorageConverter;
import org.apache.pig.impl.uti
I'm trying to set useMatches=false in REGEX_EXTRACT_ALL as per the javadoc:
http://pig.apache.org/docs/r0.11.0/api/org/apache/pig/builtin/REGEX_EXTRACT_ALL.html
(and yes, I'm using pig 0.11).
But it doesn't work. I'm concerned about this post:
http://grokbase.com/t/pig/user/12b891a55k/boolean-pig
be a while...
I'm using: '([^?=&]+)(?:[&#]|=([^&#]*))'
will
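For context, my reading of that javadoc is that useMatches is passed as a
constructor argument through DEFINE; per the post linked above the boolean
may not be parsed, so treat this purely as a sketch (relation and alias names
are made up):
urls = LOAD 'urls.txt' AS (query:chararray);
-- assumption: 'false' here is the useMatches constructor argument
DEFINE ExtractParams REGEX_EXTRACT_ALL('false');
parts = FOREACH urls GENERATE ExtractParams(query, '([^?=&]+)(?:[&#]|=([^&#]*))');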
On Wed, May 8, 2013 at 1:20 PM, William Oberman wrote:
> I'm trying to set useMatches=false in REGEX_EXTRACT_ALL as per the javadoc:
>
> http://pig.apache.org/docs/r0.11.0/api/org/apache/pig/builtin/REGEX_EX
I'm using pig 0.11.2.
I had been processing ASCII files of json with schema: (key:chararray,
columns:bag {column:tuple (timeUUID:chararray, value:chararray,
timestamp:long)})
For what it's worth, this is cassandra data, at a fairly low level.
But, this was getting big, so I compressed it all with
They are all *.gz, I confirmed that first :-)
On Saturday, June 8, 2013, Niels Basjes wrote:
> What are the exact filenames you used?
> The decompression of input files is based on the filename extension.
>
> Niels
> On Jun 7, 2013 11:11 PM, "William Oberman"
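In other words, the codec is chosen purely from the suffix; with illustrative
paths:
a = LOAD '/data/part-00000.gz' USING PigStorage('\t'); -- gunzipped on read
b = LOAD '/data/part-00000' USING PigStorage('\t');    -- read raw, even if the bytes are gzip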
compressed" count.
I don't know how to debug hadoop/pig quite as well, though I'm trying now.
But, my working theory is that some combination of pig/hadoop aborts
processing the gz stream on a null character, but keeps chugging on a
non-gz stream. Does that sound familiar?
will
t be split that way.
>
>
> On Mon, Jun 10, 2013 at 12:06 PM, William Oberman
> wrote:
>
> > I still don't fully understand (and am still debugging), but I have a
> > "problem file" and a theory.
> >
> > The file has a "corrupt line" tha
1 PM, Alan Crosswell
> > > wrote:
> > >
> > > > Suggest that if you have a choice, you use bzip2 compression instead
> of
> > > > gzip as bzip2 is block-based and Pig can split reading a large
> bzipped
> > > file
> > > > across multiple
The docs say DISTINCT can take a custom partitioner. How does that work?
What is "K" and "V"?
I'm having some doubts the docs are correct. I wrote a test partitioner
that does a System.out of K and V. I then wrote simple scripts to do JOIN,
GROUP and DISTINCT. For JOIN and GROUP I see my syste
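For reference, the syntax the docs describe (though per the replies below the
docs are apparently wrong for DISTINCT); the partitioner class here is
hypothetical and must extend org.apache.hadoop.mapreduce.Partitioner:
data = LOAD 'in' AS (k:chararray, v:chararray);
d = DISTINCT data PARTITION BY com.example.MyPartitioner PARALLEL 10;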
)
On Wed, Jul 17, 2013 at 2:27 PM, William Oberman
wrote:
> The docs say DISTINCT can take a custom partitioner. How does that work?
> What is "K" and "V"?
> I'm having some doubts the docs are correct. I wrote a test partitioner
> that does a System.out of
t; partitioner. Could you file a JIRA against the docs so we can get that
> fixed?
>
> Alan.
>
> On Jul 17, 2013, at 11:27 AM, William Oberman wrote:
>
> > The docs say DISTINCT can take a custom partitioner. How does that work?
> > What is "K" and "V"?
I was debugging some warnings in a script I had:
FIELD_DISCARDED_TYPE_CONVERSION_FAILED
ACCESSING_NON_EXISTENT_FIELD
I got it down to basically these two lines:
--foo was stored using PigStorage
foo = LOAD '' AS (key:chararray, value:map[chararray]);
STORE foo INTO '...';
The problem is some
rage.html
> work
> for you?
>
> Thanks,
> Cheolsoo
>
>
> On Wed, Mar 26, 2014 at 10:51 AM, William Oberman
> wrote:
>
> > I was debugging some warnings in a script I had:
> > FIELD_DISCARDED_TYPE_CONVERSION_FAILED
> > ACCESSING_NON_EXISTE
ts, etc...
I'm using AWS's EMR, which claims to be hadoop 1.0.3 + pig 0.11.
will
---------- Forwarded message ----------
From: William Oberman
Date: Fri, Apr 4, 2014 at 12:24 PM
Subject: using hadoop + cassandra for CF mutations (delete)
To: "u...@cassandra.apache.org"
Hi,
ive at the top of your pig file:
> SET pig.maxCombinedSplitSize
>
> It works for my on CDH 4.4, although my data source is HDFS files and not
> Casandra
>
> Regards,
> Dotan
>
> On Fri, Apr 4, 2014 at 9:13 PM, William Oberman wrote:
>
> > A
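For reference, the directive Dotan describes, with an illustrative value:
-- cap combined input splits at 128 MB (value is in bytes)
SET pig.maxCombinedSplitSize 134217728;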
lmost same situation when I was trying to load or write
> very small data from/into Cassandra. It was launching 257 map tasks. When
> the num_tokens value was reduced to 1, Pig launched only 2 jobs. Do restart
> the Cassandra service after the change.
>
> Hope it might help..
>
> --
> Suraj