OK, this is painful.  The first problem I found is that in stock 1.1.5
there is no way to set widerows to true!  The new widerows URI parsing is
NOT in 1.1.5.  And for extra fun, getting the value from the system
property is BROKEN (at least in my CentOS Linux environment).

Here are the key lines of code (in CassandraStorage).  Note the different
ways of getting the property: getenv in the null check, and getProperty in
the assignment:
        widerows = DEFAULT_WIDEROW_INPUT;
        if (System.getenv(PIG_WIDEROW_INPUT) != null)
            widerows = Boolean.valueOf(System.getProperty(PIG_WIDEROW_INPUT));

I added this logging:
        logger.warn("widerows = " + widerows
                + " getenv=" + System.getenv(PIG_WIDEROW_INPUT)
                + " getProp=" + System.getProperty(PIG_WIDEROW_INPUT));

And I saw:
org.apache.cassandra.hadoop.pig.CassandraStorage - widerows = false getenv=true getProp=null
So for me getProperty != getenv :-(
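
To spell out why the two differ: getenv reads the process environment,
while getProperty reads JVM system properties (set with -D), so an exported
variable never shows up in getProperty.  Below is a minimal sketch of a
lookup that tests and reads the same source.  It is NOT the code that
shipped in any Cassandra release; the class name WideRowFlag, the resolve
helper, the literal variable name, and the env-before-property order are
all my assumptions for illustration.

```java
// Hypothetical sketch: test and read the SAME source for the flag.
class WideRowFlag {
    // Assumed values for illustration; the real constants live in CassandraStorage.
    static final String PIG_WIDEROW_INPUT = "PIG_WIDEROW_INPUT";
    static final boolean DEFAULT_WIDEROW_INPUT = false;

    // Prefer the environment variable, then the -D system property,
    // and fall back to the default only when neither is set.
    static boolean resolve(String envValue, String propValue) {
        if (envValue != null)
            return Boolean.parseBoolean(envValue);
        if (propValue != null)
            return Boolean.parseBoolean(propValue);
        return DEFAULT_WIDEROW_INPUT;
    }

    public static void main(String[] args) {
        boolean widerows = resolve(System.getenv(PIG_WIDEROW_INPUT),
                                   System.getProperty(PIG_WIDEROW_INPUT));
        System.out.println("widerows = " + widerows);
    }
}
```

With this shape, getenv=true and getProp=null resolves to true instead of
silently falling through to false.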

For people trying to figure out how to debug cassandra + hadoop + pig, for
me the key to get debugging and logging working was to focus on
/etc/hadoop/conf (not /etc/pig/conf as I expected).

Also, if you want to compile your own cassandra (to add logging messages),
make sure it appears first on the pig classpath (use pig -secretDebugCmd
to see the fully qualified command line).
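
One way to double check which jar actually won on the classpath is to ask
the JVM where it loaded a class from.  This is a generic sketch, not
anything from Cassandra or Pig; the WhichJar class and locationOf helper
are names I made up, and you would pass the class you care about (for
example the CassandraStorage class name) as the argument.

```java
import java.security.CodeSource;

// Hypothetical helper: report the jar or directory a class was loaded from.
class WhichJar {
    static String locationOf(Class<?> cls) {
        CodeSource src = cls.getProtectionDomain().getCodeSource();
        // Classes loaded by the bootstrap loader typically have no code source.
        if (src == null || src.getLocation() == null)
            return "(bootstrap or unknown)";
        return src.getLocation().toString();
    }

    public static void main(String[] args) throws ClassNotFoundException {
        // Pass a fully qualified class name; defaults to this class itself.
        Class<?> cls = args.length > 0 ? Class.forName(args[0]) : WhichJar.class;
        System.out.println(cls.getName() + " loaded from " + locationOf(cls));
    }
}
```

Run it with the same classpath that pig uses; if the printed location is
not your own build, your jar is not first.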

The next thing I'm trying to figure out is why when widerows == true I'm
STILL not seeing more than 1024 columns :-(

will

On Wed, Sep 26, 2012 at 3:42 PM, William Oberman
<ober...@civicscience.com> wrote:

> Hi,
>
> I'm trying to figure out what's going on with my cassandra/hadoop/pig
> system.  I created a "mini" copy of my main cassandra data by randomly
> subsampling to get ~50,000 keys.  I was then writing pig scripts but also
> the equivalent operation using simple single threaded code to double check
> pig.
>
> Of course my very first test failed.  After doing a pig DUMP on the raw
> data, what appears to be happening is I'm only getting the first 1024
> columns of a key.  After some googling, this seems to be known behavior
> unless you add "?widerows=true" to the pig load URI. I tried this, but
> it didn't seem to fix anything :-(   Here's the start of my pig script:
> foo = LOAD 'cassandra://KEYSPACE/COLUMN_FAMILY?widerows=true' USING
> CassandraStorage() AS (key:chararray, columns:bag {column:tuple (name,
> value)});
>
> I'm using cassandra 1.1.5 from datastax rpms.  I'm using hadoop
> (0.20.2+923.418-1) and pig (0.8.1+28.39-1) from cloudera rpms.
>
> What am I doing wrong?  Or, how can I enable debugging/logging to
> figure out what is going on?  I haven't had to debug hadoop+pig+cassandra
> much, other than doing DUMP/ILLUSTRATE from pig.
>
> will
>
>
