That looks like it may be a bug. Can you create a ticket at https://issues.apache.org/jira/browse/CASSANDRA ?
Thanks

-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 28/09/2012, at 7:50 AM, William Oberman <ober...@civicscience.com> wrote:

> I don't want to switch my cassandra to HEAD, but looking at the newest code for CassandraStorage, I'm concerned the Uri parsing for widerows isn't going to work. setLocation first calls setLocationFromUri (which sets widerows to the Uri value), but then sets widerows to a static value (which is defined as false), and then it sets widerows to the system setting if it exists. That doesn't seem right... ?
>
> But setLocationFromUri also gets called from setStoreLocation, and I don't really know the difference between setLocation and setStoreLocation in terms of the integration between cassandra/pig/hadoop.
>
> will
>
> On Thu, Sep 27, 2012 at 3:26 PM, William Oberman <ober...@civicscience.com> wrote:
> The next painful lesson for me was figuring out how to get logging working for a distributed hadoop process. In my test environment, I have a single node that runs the name/secondaryname/data/job trackers (call it "central"), and I have two cassandra nodes running tasktrackers. But I also have the cassandra libraries on the central box, and I invoke my pig script from there. I had been patching and recompiling cassandra (1.1.5 with my logging and the system env fix) on that central box, and SOME of the logging was appearing in the pig output. But eventually I decided to move that recompiled code to the tasktracker boxes, and then I found even more of the logging I had added in:
> /var/log/hadoop/userlogs/JOB_ID
> on each of the tasktrackers.
>
> Based on this new logging, I found out that the widerows setting wasn't propagating from the central box to the tasktrackers. I added:
> export PIG_WIDEROW_INPUT=true
> to hadoop-env.sh on each of the tasktrackers and it finally worked!
>
> So, long story short, to actually get all columns for a key I had to:
> 1.) patch 1.1.5 to honor the "PIG_WIDEROW_INPUT=true" system setting
> 2.) add the system setting to ALL nodes in the hadoop cluster
>
> I'm going to try to undo all of my other hacks for getting logging/printing working, to confirm those were actually the only two changes I had to make.
>
> will
>
> On Thu, Sep 27, 2012 at 1:43 PM, William Oberman <ober...@civicscience.com> wrote:
> Ok, this is painful. The first problem I found is that in stock 1.1.5 there is no way to set widerows to true! The new widerows URI parsing is NOT in 1.1.5. And for extra fun, getting the value from the system property is BROKEN (at least in my centos linux environment).
>
> Here are the key lines of code (in CassandraStorage); note the different ways of getting the property: getenv in the test, but getProperty in the assignment:
> widerows = DEFAULT_WIDEROW_INPUT;
> if (System.getenv(PIG_WIDEROW_INPUT) != null)
>     widerows = Boolean.valueOf(System.getProperty(PIG_WIDEROW_INPUT));
>
> I added this logging:
> logger.warn("widerows = " + widerows + " getenv=" + System.getenv(PIG_WIDEROW_INPUT) + " getProp=" + System.getProperty(PIG_WIDEROW_INPUT));
>
> And I saw:
> org.apache.cassandra.hadoop.pig.CassandraStorage - widerows = false getenv=true getProp=null
> So for me getProperty != getenv :-(
>
> For people trying to figure out how to debug cassandra + hadoop + pig: for me, the key to getting debugging and logging working was to focus on /etc/hadoop/conf (not /etc/pig/conf as I expected).
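The getenv/getProperty mismatch above can be avoided by consulting both sources before falling back to the default. A minimal sketch, assuming constants with the names discussed above (the class, the helper, and the constant values are hypothetical, not the actual CassandraStorage code):

    public class WiderowsLookup
    {
        // Names mirror the constants discussed above; values are assumptions.
        private static final String PIG_WIDEROW_INPUT = "PIG_WIDEROW_INPUT";
        private static final boolean DEFAULT_WIDEROW_INPUT = false;

        // Check the JVM system property first, then the process environment,
        // so the flag is honored however it was passed in.
        static boolean getWiderowsSetting()
        {
            String value = System.getProperty(PIG_WIDEROW_INPUT);
            if (value == null)
                value = System.getenv(PIG_WIDEROW_INPUT);
            return value == null ? DEFAULT_WIDEROW_INPUT : Boolean.parseBoolean(value);
        }
    }

With a lookup like this, the setting takes effect whether it arrives as an environment variable (as in the hadoop-env.sh export above) or as a JVM system property.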
> Also, if you want to compile your own cassandra (to add logging messages), make sure it appears first on the pig classpath (use pig -secretDebugCmd to see the fully qualified command line).
>
> The next thing I'm trying to figure out is why, when widerows == true, I'm STILL not seeing more than 1024 columns :-(
>
> will
>
> On Wed, Sep 26, 2012 at 3:42 PM, William Oberman <ober...@civicscience.com> wrote:
> Hi,
>
> I'm trying to figure out what's going on with my cassandra/hadoop/pig system. I created a "mini" copy of my main cassandra data by randomly subsampling to get ~50,000 keys. I was then writing pig scripts, but also the equivalent operation in simple single-threaded code to double check pig.
>
> Of course my very first test failed. After doing a pig DUMP on the raw data, what appears to be happening is that I'm only getting the first 1024 columns of a key. After some googling, this seems to be known behavior unless you add "?widerows=true" to the pig load URI. I tried this, but it didn't seem to fix anything :-( Here's the start of my pig script:
> foo = LOAD 'cassandra://KEYSPACE/COLUMN_FAMILY?widerows=true' USING CassandraStorage() AS (key:chararray, columns:bag {column:tuple (name, value)});
>
> I'm using cassandra 1.1.5 from the datastax rpms. I'm using hadoop (0.20.2+923.418-1) and pig (0.8.1+28.39-1) from the cloudera rpms.
>
> What am I doing wrong? Or, how can I enable debugging/logging to figure out what is going on? I haven't had to debug hadoop+pig+cassandra much, other than doing DUMP/ILLUSTRATE from pig.
>
> will
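On the setLocation ordering concern raised in the first message: the clobbering goes away if the sources are applied from lowest to highest precedence, so an explicit URI value always wins. A rough sketch of that ordering, assuming the same constant names as above (the class, field, and URI-parsing helper are simplified assumptions, not the actual CassandraStorage code):

    public class WiderowsPrecedenceSketch
    {
        private static final String PIG_WIDEROW_INPUT = "PIG_WIDEROW_INPUT";
        private static final boolean DEFAULT_WIDEROW_INPUT = false;
        private boolean widerows;

        // Apply the sources from lowest to highest precedence so the
        // per-query URI value is never clobbered by the static default.
        void setLocation(String location)
        {
            widerows = DEFAULT_WIDEROW_INPUT;              // 1. compiled-in default
            String env = System.getenv(PIG_WIDEROW_INPUT);
            if (env != null)
                widerows = Boolean.parseBoolean(env);      // 2. cluster-wide setting
            Boolean fromUri = parseWiderowsFromUri(location);
            if (fromUri != null)
                widerows = fromUri;                        // 3. explicit ?widerows=... wins
        }

        // Hypothetical helper: pull "widerows" out of a
        // cassandra://KEYSPACE/COLUMN_FAMILY?widerows=true style location.
        private Boolean parseWiderowsFromUri(String location)
        {
            int q = location.indexOf('?');
            if (q < 0)
                return null;
            for (String pair : location.substring(q + 1).split("&"))
            {
                String[] kv = pair.split("=", 2);
                if (kv.length == 2 && kv[0].equalsIgnoreCase("widerows"))
                    return Boolean.parseBoolean(kv[1]);
            }
            return null;
        }
    }

Under that ordering, a cluster-wide export PIG_WIDEROW_INPUT=true still works, but a 'cassandra://KEYSPACE/COLUMN_FAMILY?widerows=true' load URI can override it per query.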