Re: Pig not reading all cassandra data

Matt Kennedy Fri, 11 Feb 2011 12:37:35 -0800

Sorry it has taken me a while to get back to this.  I'm still trying to
find where the disconnect is between the column family input format code
and the Pig optimizer.

I suspected that the problem was line 365 of:
http://svn.apache.org/viewvc/pig/tags/release-0.8.0/src/org/apache/pig/backend/hadoop/executionengine/util/MapRedUtil.java?view=markup

...but when I changed ColumnFamilySplit.java so that getLength() returns -1
instead of 0, the Pig job iterated over all of the Cassandra data that it
is supposed to, but with only one mapper.  So it looks like the Pig map
combiner isn't using the split.getLength() call to decide how the maps get
combined in the way I originally suspected.  I'll update when I figure out
more.
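
For reference, the kind of size-based combining I originally suspected can
be modeled with a toy sketch.  This is not Pig's actual code: the class
name, the combine() method, and the 128 MB threshold are all made up for
illustration.  It only shows that a purely greedy, length-driven combiner
would pack every split into a single group whether the reported length is
0 or -1, so it does not by itself explain why data went unread with 0.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not taken from Pig or Cassandra.
public class SplitCombineSketch {

    // Greedily packs consecutive split indexes into one group until the
    // group's summed reported length reaches maxCombinedSize.
    static List<List<Integer>> combine(long[] reportedLengths, long maxCombinedSize) {
        List<List<Integer>> groups = new ArrayList<>();
        List<Integer> current = new ArrayList<>();
        long currentSize = 0;
        for (int i = 0; i < reportedLengths.length; i++) {
            current.add(i);
            currentSize += reportedLengths[i];
            if (currentSize >= maxCombinedSize) {
                // Group is "full": close it and start a new one.
                groups.add(current);
                current = new ArrayList<>();
                currentSize = 0;
            }
        }
        if (!current.isEmpty()) {
            groups.add(current); // leftover splits form the last group
        }
        return groups;
    }

    public static void main(String[] args) {
        long max = 128L * 1024 * 1024; // assumed threshold, for illustration
        // Lengths of 0 never fill a group, so everything is combined.
        System.out.println(combine(new long[] {0, 0, 0, 0}, max).size());
        // Lengths of -1 behave the same way under this scheme.
        System.out.println(combine(new long[] {-1, -1, -1, -1}, max).size());
        // Splits that each report a full-sized length stay separate.
        System.out.println(combine(new long[] {max, max, max, max}, max).size());
    }
}
```

In this model the first two cases give one group (one mapper) and the third
gives four, which would mean the split needs to report a realistic size for
combining to leave more than one mapper.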

-Matt

On Sat, Feb 5, 2011 at 1:01 AM, Jonathan Ellis <jbel...@gmail.com> wrote:

> On Fri, Feb 4, 2011 at 9:47 PM, Matt Kennedy <stinkym...@gmail.com> wrote:
> > Found the culprit.  There is a new feature in Pig 0.8 that will try to
> > reduce the number of splits used to speed up the whole job.  Since the
> > ColumnFamilyInputFormat lists the input size as zero, this feature
> > eliminates all of the splits except for one.
> >
> > The workaround is to disable this feature for jobs that use
> > CassandraStorage by setting -Dpig.splitCombination=false in the
> > pig_cassandra script.
> >
> > Hope somebody finds this useful, you wouldn't believe how many dead-ends
> > I ran down trying to figure this out.
>
> Ouch, thanks for tracking that down.
>
> What should CFIF be returning differently?  Do you mean the
> InputSplit.getLength?
>
> --
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder of DataStax, the source for professional Cassandra support
> http://www.datastax.com
>
