trying to count all tuples

2011-06-03 Thread William Oberman
Howdy, I'm coming from cassandra, and I'm actually trying to count all columns in a column family. I believe that is similar to counting the number of tuples in a bag, in the lingo of the Pig manual. It was harder than I expected, but I think this works: rows = LOAD 'cassandra://MyKeyspace/MyColumnF

Re: trying to count all tuples

2011-06-03 Thread William Oberman
columns in a conventional sense, but your code will return 5. Is > that what you want? If so, your code seems correct. > > D > > On Fri, Jun 3, 2011 at 12:53 PM, William Oberman > wrote: > > Howdy, > > > > I'm coming from cassandra, and I'm actually trying

Re: trying to count all tuples

2011-06-07 Thread William Oberman
(that's my theory). As a workaround, can I have COUNT ignore/skip rows with null columns? I'll start digging through the docs as well. will On Fri, Jun 3, 2011 at 4:09 PM, William Oberman wrote: > That is exactly what I wanted, thanks for the confirm! > > > On Fri, Jun 3, 2
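A minimal sketch of the workaround discussed here, using the CassandraStorage load signature from earlier in the thread (keyspace and column family names are placeholders):

```pig
-- Hypothetical keyspace/CF names; CassandraStorage is the Cassandra contrib loader.
rows = LOAD 'cassandra://MyKeyspace/MyColumnFamily' USING CassandraStorage()
       AS (key:chararray, columns:bag {column:tuple (name, value)});
-- Drop rows whose column bag is null so COUNT never sees them.
rows_with_columns = FILTER rows BY columns IS NOT NULL;
counts = FOREACH rows_with_columns GENERATE COUNT(columns);
```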

Re: trying to count all tuples

2011-06-07 Thread William Oberman
counts_in_bag = GROUP counts ALL; sum_of_bag = FOREACH counts_in_bag GENERATE SUM($1); dump sum_of_bag; On Tue, Jun 7, 2011 at 4:33 PM, William Oberman wrote: > I tried this same script on closer to production data, and I'm getting > errors. I'm 50% sure it's this: > https://issues.apach

Re: trying to count all tuples

2011-06-08 Thread William Oberman
counts_in_bag = GROUP counts ALL; sum_of_bag = FOREACH counts_in_bag GENERATE SUM($1); dump sum_of_bag; For some reason typing the bag was causing me problems. On Tue, Jun 7, 2011 at 4:58 PM, William Oberman wrote: > I think FILTER will do the trick? E.g. > > > rows = LOAD 'cassandra://
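Reassembled from the fragments in this thread, the final script appears to be roughly the following (a sketch, not a verbatim copy of the original):

```pig
rows = LOAD 'cassandra://MyKeyspace/MyColumnFamily' USING CassandraStorage()
       AS (key:chararray, columns:bag {column:tuple (name, value)});
-- Per-row column count, then sum the counts across all rows.
counts = FOREACH rows GENERATE COUNT(columns) AS c;
counts_in_bag = GROUP counts ALL;
sum_of_bag = FOREACH counts_in_bag GENERATE SUM(counts.c);
dump sum_of_bag;
```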

prep for cassandra storage from pig

2011-06-15 Thread William Oberman
I think I'm stuck on typing issues trying to store data in cassandra. To verify, cassandra wants (key, {tuples}) My pig script is fairly brief: raw = LOAD 'cassandra://test_in/test_cf' USING CassandraStorage() AS (key:chararray, columns:bag {column:tuple (name, value)}); --columns == timeUUID -> J

Re: prep for cassandra storage from pig

2011-06-15 Thread William Oberman
Rather than staying stuck, I wrote a custom function: TupleToBagOfTuple. I'm curious if I could have avoided this though. On Wed, Jun 15, 2011 at 2:17 PM, William Oberman wrote: > I think I'm stuck on typing issues trying to store data in cassandra. To > verify, cassandra wan
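For what it's worth, later Pig versions ship builtins that may avoid a custom UDF here. A hedged sketch using TOTUPLE and TOBAG (input path and field names are illustrative):

```pig
raw = LOAD 'some_input' AS (key:chararray, name:chararray, value:chararray);
-- Wrap each (name, value) pair back into a bag of tuples, the
-- (key, {tuples}) shape CassandraStorage expects for a row.
prepped = FOREACH raw GENERATE key, TOBAG(TOTUPLE(name, value));
STORE prepped INTO 'cassandra://test_out/test_cf' USING CassandraStorage();
```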

Re: prep for cassandra storage from pig

2011-06-15 Thread William Oberman
CassandraBag from > pygmalion - it does the work for you to get it back into a form that > cassandra understands. > > Others may know better how to massage the data into that form using just > pig, but if all else fails, you could write a udf to do that. > > Jeremy > > On Jun 1

Re: prep for cassandra storage from pig

2011-06-15 Thread William Oberman
I'll do a reply all, to keep this more consistent (sorry!). Rather than staying stuck, I wrote a custom function: TupleToBagOfTuple. I'm curious if I could have avoided it with proper pig scripting though. On Wed, Jun 15, 2011 at 3:08 PM, William Oberman wrote: > My problem is the

tar.gz to cdh3 package

2011-07-08 Thread William Oberman
I tried out hadoop/pig in my test environment using tar.gz's. Before I roll out to production, I thought I'd try the cdh3 packages, as that might be easier to maintain (since I'm not a sysadmin). Following cloudera's install guide worked like a charm, but I couldn't get pig to run until I did this

Re: tar.gz to cdh3 package

2011-07-08 Thread William Oberman
I thought pig is the one trying to write to /tmp inside of hadoop? will On Fri, Jul 8, 2011 at 3:00 PM, Dmitriy Ryaboy wrote: > Seems like a question you should ask Cloudera? > > On Fri, Jul 8, 2011 at 11:57 AM, William Oberman > wrote: > > I tried out hadoop/pig in my test

Re: tar.gz to cdh3 package

2011-07-08 Thread William Oberman
Running as the same user, so it didn't matter. So, that makes getting the permissions right for /tmp more important, but I didn't think the hadoop crowd would care since it's pig that causes the write to that location. But a newbie pig user might need the FYI... On Fri, Jul 8, 2011 at 3:0

Re: tar.gz to cdh3 package

2011-07-12 Thread William Oberman
Daniel Dai wrote: > Check > > http://pig.apache.org/docs/r0.8.1/piglatin_ref1.html#Storing+Intermediate+Data > > Daniel > > On Fri, Jul 8, 2011 at 12:04 PM, William Oberman > wrote: > > > Sorry, to be more verbose, CDH3 actually respects permissions inside of > > HD

best practice for Pig + MySql for meta data lookups

2012-09-11 Thread William Oberman
Hello, My setup is Pig + Hadoop + Cassandra for my "big data" and MySql for my "relational/meta data". Up until now that has been fine, but now I need to start creating metrics that "cross the lines". In particular, I need to create aggregations of Cassandra data based on lookups from MySql. Af

Re: best practice for Pig + MySql for meta data lookups

2012-09-11 Thread William Oberman
> sizes are within reason. > > > On Tue, Sep 11, 2012 at 8:17 AM, William Oberman > wrote: > > > Hello, > > > > My setup is Pig + Hadoop + Cassandra for my "big data" and MySql for my > > "relational/meta data". Up until now that has been
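One common pattern for this, sketched under the assumption that the MySQL lookup table has been exported to a flat file small enough to fit in memory (paths and field names are hypothetical):

```pig
big = LOAD 'cassandra://MyKeyspace/MyColumnFamily' USING CassandraStorage()
      AS (key:chararray, columns:bag {column:tuple (name, value)});
-- A small dump of the MySQL lookup table, e.g. via SELECT ... INTO OUTFILE.
meta = LOAD '/lookups/meta.tsv' USING PigStorage('\t')
       AS (key:chararray, category:chararray);
-- 'replicated' ships the small relation to every mapper (fragment-replicate join).
joined = JOIN big BY key, meta BY key USING 'replicated';
```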

Re: best practice for Pig + MySql for meta data lookups

2012-09-11 Thread William Oberman
the DB. > > 1 - I've never used this: > > http://svn.apache.org/repos/asf/cassandra/trunk/contrib/pig/src/java/org/apache/cassandra/hadoop/pig/CassandraStorage.java > > On Tue, Sep 11, 2012 at 8:54 AM, William Oberman > wrote: > > > Great news (for me)! :-) My r

Re: best practice for Pig + MySql for meta data lookups

2012-09-12 Thread William Oberman
ing I planned to move on, rather than figure that out :-) will On Tue, Sep 11, 2012 at 2:09 PM, William Oberman wrote: > Thanks (again)! > > I'm already using CassandraStorage to load the JSON strings. I used Maps > because I liked being able to name the fields, but I could e

Having troubles with PigStorage

2012-11-06 Thread William Oberman
I'm trying to play around with Amazon EMR, and I currently have self hosted Cassandra as the source of data. I was going to try to do: Cassandra -> S3 -> EMR. I've traced my problems to PigStorage. At this point I can recreate my problem "locally" without involving S3 or Amazon. In my local tes

Re: Having troubles with PigStorage

2012-11-06 Thread William Oberman
to reproduce your problem? 1 ~ 2 > rows would be sufficient. > > Thanks, > Cheolsoo > > On Tue, Nov 6, 2012 at 12:20 PM, William Oberman > wrote: > > > I'm trying to play around with Amazon EMR, and I currently have self > hosted > > Cassandra as the source of

Re: Having troubles with PigStorage

2012-11-06 Thread William Oberman
correctly. On Tue, Nov 6, 2012 at 4:35 PM, Cheolsoo Park wrote: > >> This is a dumb question, but PigStorage escapes the delimiter, right? > > No it doesn't. > > On Tue, Nov 6, 2012 at 1:29 PM, William Oberman >wrote: > > > This is a dumb question, but P
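Since PigStorage does not escape its delimiter, one workaround is to scrub the delimiter out of free-text fields before storing, or to pick a delimiter that provably never occurs in the data. A sketch (the pipe delimiter and field names are only examples and must be validated against the actual data):

```pig
data = LOAD 'input' AS (key:chararray, json:chararray);
-- Strip the delimiter from the free-text field, since PigStorage
-- will not escape it on write.
cleaned = FOREACH data GENERATE key, REPLACE(json, '\\|', ' ');
STORE cleaned INTO 'output' USING PigStorage('|');
```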

Re: Having troubles with PigStorage

2012-11-06 Thread William Oberman
Just in case someone hits this thread by having the same issue, please vote for this bug: https://issues.apache.org/jira/browse/PIG-1271 On Tue, Nov 6, 2012 at 4:50 PM, William Oberman wrote: > Wow, ok. That is completely unexpected. Thanks for the heads up! > > In my case, because p

Re: Pig 0.9.2 and avro on S3

2012-11-30 Thread William Oberman
A couple of weeks ago I spent a bunch of time trying to get EMR + S3 + Avro working: https://forums.aws.amazon.com/thread.jspa?messageID=398194 Short story, yes I think PIG-2540 is the issue. I'm currently trying to get pig 0.10 running in EMR with help from AWS support. You have to do: --boot

Re: Pig 0.9.2 and avro on S3

2012-11-30 Thread William Oberman
I should have read more closely, you're not using EMR. I'm guessing if you upgrade to pig 0.10 the issue will go away... On Fri, Nov 30, 2012 at 4:09 PM, William Oberman wrote: > A couple of weeks ago I spent a bunch of time trying to get EMR + S3 + > Avro

Re: ERROR 2999: Unexpected internal error. null

2012-12-11 Thread William Oberman
Your line numbers aren't matching up to the 1.1.7 release, which is weird. Based on the "stock" 1.1.7 source, there was a null check on str before predicateFromString(str), making your code path impossible... will On Tue, Dec 11, 2012 at 1:00 PM, Jonathan Coveney wrote: > If I were debugging t

Re: ERROR 2999: Unexpected internal error. null

2012-12-11 Thread William Oberman
g.file=pig.log > -Dpig.home.dir=/Library/pig-0.10.0/bin/.. > /Library/hadoop-1.0.2/bin/hadoop jar > /Library/pig-0.10.0/bin/../pig-0.10.0-withouthadoop.jar > -Dudf.import.list=org.apache.cassandra.hadoop.pig -x local rowcount.pig > > > > > > > > On 12/11/12 1:10

Re: String Representation of DataBag and its Schema

2013-03-21 Thread William Oberman
We managed to piece this together. It's not fully generic (we assume a single field). But, it gets the job done for unit testing. -- package com.civicscience.util; import org.apache.pig.ResourceSchema; import org.apache.pig.builtin.Utf8StorageConverter; import org.apache.pig.impl.uti

udf + boolean constructor

2013-05-08 Thread William Oberman
I'm trying to set useMatches=false in REGEX_EXTRACT_ALL as per the javadoc: http://pig.apache.org/docs/r0.11.0/api/org/apache/pig/builtin/REGEX_EXTRACT_ALL.html (and yes, I'm using pig 0.11). But it doesn't work. I'm concerned about this post: http://grokbase.com/t/pig/user/12b891a55k/boolean-pig

Re: udf + boolean constructor

2013-05-08 Thread William Oberman
be a while... I'm using: '([^?=&]+)(?:[&#]|=([^&#]*))' will On Wed, May 8, 2013 at 1:20 PM, William Oberman wrote: > I'm trying to set useMatches=false in REGEX_EXTRACT_ALL as per the javadoc: > > http://pig.apache.org/docs/r0.11.0/api/org/apache/pig/builtin/REGEX_EX

problems with .gz

2013-06-07 Thread William Oberman
I'm using pig 0.11.2. I had been processing ASCII files of json with schema: (key:chararray, columns:bag {column:tuple (timeUUID:chararray, value:chararray, timestamp:long)}) For what it's worth, this is cassandra data, at a fairly low level. But, this was getting big, so I compressed it all with

Re: problems with .gz

2013-06-08 Thread William Oberman
They are all *.gz, I confirmed that first :-) On Saturday, June 8, 2013, Niels Basjes wrote: > What are the exact filenames you used? > The decompression of input files is based on the filename extention. > > Niels > On Jun 7, 2013 11:11 PM, "William Oberman" > &g

Re: problems with .gz

2013-06-10 Thread William Oberman
compressed" count. I don't know how to debug hadoop/pig quite as well, though I'm trying now. But, my working theory is that some combination of pig/hadoop aborts processing the gz stream on a null character, but keeps chugging on a non-gz stream. Does that sound familiar? will

Re: problems with .gz

2013-06-10 Thread William Oberman
t be split that way. > > > On Mon, Jun 10, 2013 at 12:06 PM, William Oberman > wrote: > > > I still don't fully understand (and am still debugging), but I have a > > "problem file" and a theory. > > > > The file has a "corrupt line" tha

Re: problems with .gz

2013-06-12 Thread William Oberman
1 PM, Alan Crosswell > > > wrote: > > > > > > > Suggest that if you have a choice, you use bzip2 compression instead > of > > > > gzip as bzip2 is block-based and Pig can split reading a large > bzipped > > > file > > > > across multiple
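The advice above can be applied directly at load time; with the schema used earlier in the thread, a sketch (the path is a placeholder):

```pig
-- .bz2 input can be split across mappers; .gz cannot, so each .gz file is
-- processed by a single mapper. Decompression is keyed off the file extension.
data = LOAD '/data/part-*.bz2'
       AS (key:chararray, columns:bag {column:tuple (timeUUID:chararray, value:chararray, timestamp:long)});
```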

DISTINCT and partitioner

2013-07-17 Thread William Oberman
The docs say DISTINCT can take a custom partitioner. How does that work? What is "K" and "V"? I'm having some doubts the docs are correct. I wrote a test partitioner that does a System.out of K and V. I then wrote simple scripts to do JOIN, GROUP and DISTINCT. For JOIN and GROUP I see my System.out
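For reference, the manual's syntax is as follows; `com.example.TestPartitioner` is a hypothetical class extending `org.apache.hadoop.mapreduce.Partitioner`, and as this thread goes on to discuss, it is not clear the docs are correct about DISTINCT honoring it:

```pig
rows = LOAD 'input' AS (key:chararray, value:chararray);
-- PARTITION BY names the custom partitioner class; PARALLEL sets reducer count.
unique_rows = DISTINCT rows PARTITION BY com.example.TestPartitioner PARALLEL 4;
```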

Re: DISTINCT and partitioner

2013-07-17 Thread William Oberman
) On Wed, Jul 17, 2013 at 2:27 PM, William Oberman wrote: > The docs say DISTINCT can take a custom partitioner. How does that work? > What is "K" and "V"? > I'm having some doubts the docs are correct. I wrote a test partitioner > that does a System.out of

Re: DISTINCT and partitioner

2013-07-19 Thread William Oberman
> partitioner. Could you file a JIRA against the docs so we can get that > fixed? > > Alan. > > On Jul 17, 2013, at 11:27 AM, William Oberman wrote: > > > The docs say DISTINCT can take a custom partitioner. How does that work? > > What is "K" and "V"?

line feeds

2014-03-26 Thread William Oberman
I was debugging some warnings in a script I had: FIELD_DISCARDED_TYPE_CONVERSION_FAILED ACCESSING_NON_EXISTENT_FIELD I got it down to basically these two lines: --foo was stored using PigStorage foo = LOAD '' AS (key:chararray, value:map[chararray]); STORE foo INTO '...'; The problem is some

Re: line feeds

2014-03-26 Thread William Oberman
rage.html > >work > for you? > > Thanks, > Cheolsoo > > > On Wed, Mar 26, 2014 at 10:51 AM, William Oberman > wrote: > > > I was debugging some warnings in a script I had: > > FIELD_DISCARDED_TYPE_CONVERSION_FAILED > > ACCESSING_NON_EXISTE

Fwd: using hadoop + cassandra for CF mutations (delete)

2014-04-04 Thread William Oberman
ts, etc... I'm using AWS's EMR, which claims to be hadoop 1.0.3 + pig 11. will -- Forwarded message ------ From: William Oberman Date: Fri, Apr 4, 2014 at 12:24 PM Subject: using hadoop + cassandra for CF mutations (delete) To: "u...@cassandra.apache.org" Hi,

Re: using hadoop + cassandra for CF mutations (delete)

2014-04-07 Thread William Oberman
Add this directive at the top of your pig file: > SET pig.maxCombinedSplitSize > > It works for me on CDH 4.4, although my data source is HDFS files and not > Cassandra > > Regards, > Dotan > > > > > On Fri, Apr 4, 2014 at 9:13 PM, William Oberman > wrote: > > > A
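The directive quoted above takes a size in bytes; a sketch (128 MB is an arbitrary example value):

```pig
-- Cap the combined input split size so more (or fewer) map tasks are launched.
SET pig.maxCombinedSplitSize 134217728;
```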

Re: Fwd: using hadoop + cassandra for CF mutations (delete)

2014-04-07 Thread William Oberman
Almost the same situation when I was trying to load or write > very small data from/into Cassandra. It was launching 257 map tasks. When > the num_tokens value was reduced to 1, Pig launched only 2 jobs. Do restart > the Cassandra service after the change. > > Hope it might help. > > -- > Suraj