Cool - thanks Dmitriy!

On Jun 15, 2011, at 12:54 PM, Dmitriy Ryaboy wrote:
> Another tip:
> If you parametrize your load statements, it becomes easy to switch
> between loading from something like Cassandra, and reading from HDFS
> or the local fs directly.
>
> Also:
> Try using Pig's "illustrate" command when working through your flows
> -- it does some clever things that go far beyond simple random
> sampling of source data, in order to ensure that you can see the
> effects of doing filters, that joins get (possibly artificial)
> matching keys even if you sampled in a way that didn't actually
> produce any, etc.
>
> D
>
> On Wed, Jun 15, 2011 at 10:35 AM, Jeremy Hanna
> <jeremy.hanna1...@gmail.com> wrote:
>> We started doing this recently and thought it might be useful to others.
>>
>> Pig (and Hive) have a sample function that allows you to sample data from
>> your data store.
>>
>> In Pig it looks something like this:
>> mysample = SAMPLE myrelation 0.01;
>>
>> One possible use for this, with Pig and Cassandra, is to solve the conundrum of
>> testing locally. We've wondered how to do this, so we decided to sample
>> a column family (or set of CFs), store the sample into HDFS (or CFS), download it
>> locally, then import it into your local Cassandra node. That gives you real
>> data to test against with Pig/Hive or for other purposes.
>>
>> That way, when you're flying out to the Hadoop Summit or the Cassandra SF
>> event, you can play with real data :).
>>
>> Maybe others have been doing this for years, but if not, we're finding it
>> handy.
>>
>> Jeremy
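For anyone following the thread, the two tips above (parametrized load statements plus SAMPLE) can be sketched together in Pig Latin roughly like this. The parameter names, paths, and the CassandraStorage URI are made up for illustration -- the exact loader class and URI scheme depend on your Cassandra version:

```pig
-- Parametrized load: pass -param on the command line to switch between
-- reading live data from Cassandra and a sampled copy on HDFS/local fs, e.g.
--   pig -param source='cassandra://MyKeyspace/MyColumnFamily' \
--       -param loader='org.apache.cassandra.hadoop.pig.CassandraStorage()' script.pig
--   pig -param source='/tmp/sampled_rows' -param loader='PigStorage()' script.pig
rows = LOAD '$source' USING $loader;

-- Take a roughly 1% random sample and store it for local testing.
mysample = SAMPLE rows 0.01;
STORE mysample INTO '/tmp/sampled_rows';

-- While developing the flow, ILLUSTRATE shows concise example rows at
-- each operator, as Dmitriy describes:
ILLUSTRATE mysample;
```

Parameter substitution is purely textual, which is why the whole `USING` clause can be swapped via `$loader`.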