Thanks Jeremy! Maybe figuring out how to do paging in pig would have been easier, but I found the widerow setting first, which led me to where I am today. I don't mind helping to blaze trails, or contributing back when doing so, but I usually try to follow rather than lead when it comes to the tools/software I choose to use. I didn't realize how close to the edge I was getting in this case :-)
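For anyone landing on this thread later, here's roughly what the setting in question looks like in a Pig script, if I remember the syntax right (MyKeyspace/MyCF are placeholders, and the exact schema/parameter names may vary by Cassandra version):

```pig
-- Sketch only: enabling widerow paging via the URL parameter.
-- Without it, each key is truncated at the static limit (default 1024 columns).
rows = LOAD 'cassandra://MyKeyspace/MyCF?widerows=true'
       USING org.apache.cassandra.hadoop.pig.CassandraStorage()
       AS (key, columns: bag {T: tuple(name, value)});
```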
On Thu, Oct 11, 2012 at 1:03 PM, Jeremy Hanna <jeremy.hanna1...@gmail.com> wrote:

> For our use case, we had a lot of narrow column families, and for the couple of column families that had wide rows, we did our own paging through them. I don't recall if we did the paging in pig or mapreduce, but you should be able to do it in both, since pig allows you to specify the slice start.
>
> On Oct 11, 2012, at 11:28 AM, William Oberman <ober...@civicscience.com> wrote:
>
> > If you don't mind me asking, how are you handling the fact that pre-widerow you are only getting a static number of columns per key (default 1024)? Or am I not understanding the "limit" concept?
> >
> > On Thu, Oct 11, 2012 at 11:25 AM, Jeremy Hanna <jeremy.hanna1...@gmail.com> wrote:
> >
> > > The Dachis Group (where I just came from; I'm now at DataStax) uses pig with cassandra for a lot of things. However, we weren't using the widerow implementation yet, since wide row support is new to 1.1.x and we were on 0.7, then 0.8, then 1.0.x.
> > >
> > > I think since it's new to 1.1's hadoop support, it sounds like there are some rough edges, like you say. But reproducible issues filed on tickets for any problems are much appreciated, and they will get addressed.
> > >
> > > On Oct 11, 2012, at 10:43 AM, William Oberman <ober...@civicscience.com> wrote:
> > >
> > > > I'm wondering how many people are using cassandra + pig out there? I recently went through the effort of validating things at a much higher level than I previously did (*), and found a few issues:
> > > > https://issues.apache.org/jira/browse/CASSANDRA-4748
> > > > https://issues.apache.org/jira/browse/CASSANDRA-4749
> > > > https://issues.apache.org/jira/browse/CASSANDRA-4789
> > > >
> > > > In general, it seems like the widerow implementation still has rough edges. I'm concerned that I don't understand why other people aren't using the feature and finding these problems. Is everyone else just setting a high static limit, e.g. LOAD 'cassandra://KEYSPACE/CF?limit=X' where X >= the max size of any key? Is everyone else using data models where keys always have fewer than 1024 columns? Do newer versions of hadoop consume the cassandra API in a way that works around these issues? I'm using CDH3 == hadoop 0.20.2, pig 0.8.1.
> > > >
> > > > (*) I took a random subsample of 50,000 keys of my production data (approx 1M total key/value pairs, some keys having only a single value and some having 1000's). I then wrote both a pig script and a simple procedural version of the pig script, and compared the results. Initially I got differences, but after locally patching my code to fix the above 3 bugs (really only two distinct issues), I now (finally) get the same results.
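The manual paging Jeremy describes boils down to repeatedly slicing a row, restarting each slice just past the last column name seen. A minimal sketch of that pattern in Python, using an in-memory stand-in for the slice call (fetch_slice and page_row are hypothetical illustrations here, not a real client API):

```python
# Sketch of manual wide-row paging: fetch page_size columns per slice,
# restarting each slice just after the last column name returned.
# `columns` (a dict) stands in for one Cassandra row; a real client
# would issue get_slice calls with a (start, count) slice predicate.

def fetch_slice(columns, start, count):
    """Stand-in for a slice call: return up to `count` (name, value)
    pairs with name >= start, in column-name order."""
    names = sorted(n for n in columns if n >= start)
    return [(n, columns[n]) for n in names[:count]]

def page_row(columns, page_size=1024):
    """Yield every column in the row, page_size columns per slice."""
    start = ""
    while True:
        page = fetch_slice(columns, start, page_size)
        if not page:
            break
        for name, value in page:
            yield (name, value)
        if len(page) < page_size:
            break  # short page: row exhausted
        # Next slice starts just past the last column seen; appending a
        # zero byte gives the smallest name strictly greater than it.
        start = page[-1][0] + "\x00"

# A row wider than the default 1024-column limit comes back whole:
row = {"col%04d" % i: i for i in range(2500)}
assert len(list(page_row(row, page_size=1024))) == 2500
```

The key detail is the restart position: a naive restart at the last name seen would re-fetch that column on every page.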