Thanks Jeremy! Maybe figuring out how to do paging in pig would have been easier, but I found the widerow setting first, which led me to where I am today. I don't mind helping to blaze trails, or contributing back when doing so, but I usually try to follow rather than lead when it comes to the tools/software I choose to use. I didn't realize how close to the edge I was getting in this case :-)
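For anyone landing on this thread later, here's roughly what the setting in question looks like in a Pig script, if I remember the syntax right (MyKeyspace/MyCF are placeholders, and the exact schema/parameter names may vary by Cassandra version):

```pig
-- Sketch only: enabling widerow paging via the URL parameter.
-- Without it, each key is truncated at the static limit (default 1024 columns).
rows = LOAD 'cassandra://MyKeyspace/MyCF?widerows=true'
       USING org.apache.cassandra.hadoop.pig.CassandraStorage()
       AS (key, columns: bag {T: tuple(name, value)});
```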
On Thu, Oct 11, 2012 at 1:03 PM, Jeremy Hanna <jeremy.hanna1...@gmail.com> wrote:

> For our use case, we had a lot of narrow column families, and for the couple of column families that had wide rows, we did our own paging through them. I don't recall if we did the paging in pig or mapreduce, but you should be able to do it in both, since pig allows you to specify the slice start.
>
> On Oct 11, 2012, at 11:28 AM, William Oberman <ober...@civicscience.com> wrote:
>
> > If you don't mind me asking, how are you handling the fact that pre-widerow you are only getting a static number of columns per key (default 1024)? Or am I not understanding the "limit" concept?
> >
> > On Thu, Oct 11, 2012 at 11:25 AM, Jeremy Hanna <jeremy.hanna1...@gmail.com> wrote:
> >
> > > The Dachis Group (where I just came from; I'm now at DataStax) uses pig with cassandra for a lot of things. However, we weren't using the widerow implementation yet, since wide row support is new to 1.1.x and we were on 0.7, then 0.8, then 1.0.x.
> > >
> > > I think since it's new to 1.1's hadoop support, it sounds like there are some rough edges, like you say. But reproducible issues filed on tickets for any problems are much appreciated, and they will get addressed.
> > >
> > > On Oct 11, 2012, at 10:43 AM, William Oberman <ober...@civicscience.com> wrote:
> > >
> > > > I'm wondering how many people are using cassandra + pig out there? I recently went through the effort of validating things at a much higher level than I previously did (*), and found a few issues:
> > > > https://issues.apache.org/jira/browse/CASSANDRA-4748
> > > > https://issues.apache.org/jira/browse/CASSANDRA-4749
> > > > https://issues.apache.org/jira/browse/CASSANDRA-4789
> > > >
> > > > In general, it seems like the widerow implementation still has rough edges. I'm concerned that I don't understand why other people aren't using the feature and finding these problems. Is everyone else just setting a high static limit, e.g. LOAD 'cassandra://KEYSPACE/CF?limit=X' where X >= the max size of any key? Is everyone else using data models where keys always have fewer than 1024 columns? Do newer versions of hadoop consume the cassandra API in a way that works around these issues? I'm using CDH3 == hadoop 0.20.2, pig 0.8.1.
> > > >
> > > > (*) I took a random subsample of 50,000 keys of my production data (approx 1M total key/value pairs, some keys having only a single value and some having 1000's). I then wrote both a pig script and a simple procedural version of the pig script, and compared the results. Initially I got differences, but after locally patching my code to fix the above 3 bugs (really only two distinct issues), I now (finally) get the same results.
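The manual paging Jeremy describes boils down to repeatedly slicing a row, restarting each slice just past the last column name seen. A minimal sketch of that pattern in Python, using an in-memory stand-in for the slice call (fetch_slice and page_row are hypothetical illustrations here, not a real client API):

```python
# Sketch of manual wide-row paging: fetch page_size columns per slice,
# restarting each slice just after the last column name returned.
# `columns` (a dict) stands in for one Cassandra row; a real client
# would issue get_slice calls with a (start, count) slice predicate.

def fetch_slice(columns, start, count):
    """Stand-in for a slice call: return up to `count` (name, value)
    pairs with name >= start, in column-name order."""
    names = sorted(n for n in columns if n >= start)
    return [(n, columns[n]) for n in names[:count]]

def page_row(columns, page_size=1024):
    """Yield every column in the row, page_size columns per slice."""
    start = ""
    while True:
        page = fetch_slice(columns, start, page_size)
        if not page:
            break
        for name, value in page:
            yield (name, value)
        if len(page) < page_size:
            break  # short page: row exhausted
        # Next slice starts just past the last column seen; appending a
        # zero byte gives the smallest name strictly greater than it.
        start = page[-1][0] + "\x00"

# A row wider than the default 1024-column limit comes back whole:
row = {"col%04d" % i: i for i in range(2500)}
assert len(list(page_row(row, page_size=1024))) == 2500
```

The key detail is the restart position: a naive restart at the last name seen would re-fetch that column on every page.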