That looks like it may be a bug. Can you create a ticket at https://issues.apache.org/jira/browse/CASSANDRA ?
Thanks

-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 28/09/2012, at 7:50 AM, William Oberman <ober...@civicscience.com> wrote:

> I don't want to switch my cassandra to HEAD, but looking at the newest code for CassandraStorage, I'm concerned the Uri parsing for widerows isn't going to work. setLocation first calls setLocationFromUri (which sets widerows to the Uri value), but then sets widerows to a static value (which is defined as false), and then it sets widerows to the system setting if it exists. That doesn't seem right... ?
>
> But setLocationFromUri also gets called from setStoreLocation, and I don't really know the difference between setLocation and setStoreLocation in terms of the integration between cassandra/pig/hadoop.
>
> will
>
> On Thu, Sep 27, 2012 at 3:26 PM, William Oberman <ober...@civicscience.com> wrote:
> The next painful lesson for me was figuring out how to get logging working for a distributed hadoop process. In my test environment, I have a single node that runs the name/secondaryname/data/job trackers (call it "central"), and I have two cassandra nodes running tasktrackers. But I also have the cassandra libraries on the central box, and I invoke my pig script from there. I had been patching and recompiling cassandra (1.1.5 with my logging and the system env fix) on that central box, and SOME of the logging was appearing in the pig output. But eventually I decided to move that recompiled code to the tasktracker boxes, and then I found even more of the logging I had added in:
> /var/log/hadoop/userlogs/JOB_ID
> on each of the tasktrackers.
>
> Based on this new logging, I found out that the widerows setting wasn't propagating from the central box to the tasktrackers. I added:
> export PIG_WIDEROW_INPUT=true
> to hadoop-env.sh on each of the tasktrackers and it finally worked!
>
> So, long story short, to actually get all columns for a key I had to:
> 1.) patch 1.1.5 to honor the "PIG_WIDEROW_INPUT=true" system setting
> 2.) add the system setting to ALL nodes in the hadoop cluster
>
> I'm going to try to undo all of my other hacks for getting logging/printing working, to confirm those were actually the only two changes I had to make.
>
> will
>
> On Thu, Sep 27, 2012 at 1:43 PM, William Oberman <ober...@civicscience.com> wrote:
> Ok, this is painful. The first problem I found is that in stock 1.1.5 there is no way to set widerows to true! The new widerows URI parsing is NOT in 1.1.5. And for extra fun, getting the value from the system property is BROKEN (at least in my centos linux environment).
>
> Here are the key lines of code (in CassandraStorage); note the different ways of getting the property: getenv in the test, but getProperty in the assignment:
> widerows = DEFAULT_WIDEROW_INPUT;
> if (System.getenv(PIG_WIDEROW_INPUT) != null)
>     widerows = Boolean.valueOf(System.getProperty(PIG_WIDEROW_INPUT));
>
> I added this logging:
> logger.warn("widerows = " + widerows + " getenv=" + System.getenv(PIG_WIDEROW_INPUT) + " getProp=" + System.getProperty(PIG_WIDEROW_INPUT));
>
> And I saw:
> org.apache.cassandra.hadoop.pig.CassandraStorage - widerows = false getenv=true getProp=null
> So for me getProperty != getenv :-(
>
> For people trying to figure out how to debug cassandra + hadoop + pig: for me, the key to getting debugging and logging working was to focus on /etc/hadoop/conf (not /etc/pig/conf as I expected).
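The getenv/getProperty mismatch above can be avoided by consulting both sources before falling back to the default. A minimal sketch, assuming constants with the names discussed above (the class, the helper, and the constant values are hypothetical, not the actual CassandraStorage code):

    public class WiderowsLookup
    {
        // Names mirror the constants discussed above; values are assumptions.
        private static final String PIG_WIDEROW_INPUT = "PIG_WIDEROW_INPUT";
        private static final boolean DEFAULT_WIDEROW_INPUT = false;

        // Check the JVM system property first, then the process environment,
        // so the flag is honored however it was passed in.
        static boolean getWiderowsSetting()
        {
            String value = System.getProperty(PIG_WIDEROW_INPUT);
            if (value == null)
                value = System.getenv(PIG_WIDEROW_INPUT);
            return value == null ? DEFAULT_WIDEROW_INPUT : Boolean.parseBoolean(value);
        }
    }

With a lookup like this, the setting takes effect whether it arrives as an environment variable (as in the hadoop-env.sh export above) or as a JVM system property.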
> Also, if you want to compile your own cassandra (to add logging messages), make sure it appears first on the pig classpath (use pig -secretDebugCmd to see the fully qualified command line).
>
> The next thing I'm trying to figure out is why, when widerows == true, I'm STILL not seeing more than 1024 columns :-(
>
> will
>
> On Wed, Sep 26, 2012 at 3:42 PM, William Oberman <ober...@civicscience.com> wrote:
> Hi,
>
> I'm trying to figure out what's going on with my cassandra/hadoop/pig system. I created a "mini" copy of my main cassandra data by randomly subsampling to get ~50,000 keys. I was then writing pig scripts, but also the equivalent operation in simple single-threaded code to double check pig.
>
> Of course my very first test failed. After doing a pig DUMP on the raw data, what appears to be happening is that I'm only getting the first 1024 columns of a key. After some googling, this seems to be known behavior unless you add "?widerows=true" to the pig load URI. I tried this, but it didn't seem to fix anything :-( Here's the start of my pig script:
> foo = LOAD 'cassandra://KEYSPACE/COLUMN_FAMILY?widerows=true' USING CassandraStorage() AS (key:chararray, columns:bag {column:tuple (name, value)});
>
> I'm using cassandra 1.1.5 from the datastax rpms. I'm using hadoop (0.20.2+923.418-1) and pig (0.8.1+28.39-1) from the cloudera rpms.
>
> What am I doing wrong? Or, how can I enable debugging/logging to figure out what is going on? I haven't had to debug hadoop+pig+cassandra much, other than doing DUMP/ILLUSTRATE from pig.
>
> will
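On the setLocation ordering concern raised in the first message: the clobbering goes away if the sources are applied from lowest to highest precedence, so an explicit URI value always wins. A rough sketch of that ordering, assuming the same constant names as above (the class, field, and URI-parsing helper are simplified assumptions, not the actual CassandraStorage code):

    public class WiderowsPrecedenceSketch
    {
        private static final String PIG_WIDEROW_INPUT = "PIG_WIDEROW_INPUT";
        private static final boolean DEFAULT_WIDEROW_INPUT = false;
        private boolean widerows;

        // Apply the sources from lowest to highest precedence so the
        // per-query URI value is never clobbered by the static default.
        void setLocation(String location)
        {
            widerows = DEFAULT_WIDEROW_INPUT;              // 1. compiled-in default
            String env = System.getenv(PIG_WIDEROW_INPUT);
            if (env != null)
                widerows = Boolean.parseBoolean(env);      // 2. cluster-wide setting
            Boolean fromUri = parseWiderowsFromUri(location);
            if (fromUri != null)
                widerows = fromUri;                        // 3. explicit ?widerows=... wins
        }

        // Hypothetical helper: pull "widerows" out of a
        // cassandra://KEYSPACE/COLUMN_FAMILY?widerows=true style location.
        private Boolean parseWiderowsFromUri(String location)
        {
            int q = location.indexOf('?');
            if (q < 0)
                return null;
            for (String pair : location.substring(q + 1).split("&"))
            {
                String[] kv = pair.split("=", 2);
                if (kv.length == 2 && kv[0].equalsIgnoreCase("widerows"))
                    return Boolean.parseBoolean(kv[1]);
            }
            return null;
        }
    }

Under that ordering, a cluster-wide export PIG_WIDEROW_INPUT=true still works, but a 'cassandra://KEYSPACE/COLUMN_FAMILY?widerows=true' load URI can override it per query.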