We started doing this recently and thought it might be useful to others.

Pig (and Hive) can sample data from your data store, so you can work with a 
small random subset instead of the full data set.

In Pig it looks something like this:
mysample = SAMPLE myrelation 0.01;

One possible use for this, with Pig and Cassandra, is to solve the conundrum of 
testing locally.  We had wondered how to do that, so we decided to sample a 
column family (or set of CFs), store the sample in HDFS (or CFS), download it, 
and then import it into a local Cassandra node.  That gives you real data to 
test against with Pig/Hive, or for other purposes.
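
A rough sketch of the sampling half of that workflow, in case it helps. The 
keyspace, column family, and output path are made-up placeholders, and it 
assumes the Cassandra Pig jars (which provide CassandraStorage) are already 
registered or on the classpath. BinStorage is just one choice of output format, 
picked here because it keeps the nested column bag intact.

```pig
-- Sketch only: 'MyKeyspace', 'MyColumnFamily' and '/tmp/mycf_sample' are
-- hypothetical names; substitute your own.
rows = LOAD 'cassandra://MyKeyspace/MyColumnFamily'
       USING org.apache.cassandra.hadoop.pig.CassandraStorage();

-- Keep roughly 1% of the rows. SAMPLE is probabilistic, so the exact
-- number of rows returned will vary from run to run.
mysample = SAMPLE rows 0.01;

-- Write the sample to HDFS (or CFS) so it can be copied down afterwards.
STORE mysample INTO '/tmp/mycf_sample' USING BinStorage();
```

From there the output directory can be pulled down with something like 
"hadoop fs -get /tmp/mycf_sample ." and loaded into the local node however 
suits your setup.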

That way, when you're flying out to the Hadoop Summit or the Cassandra SF 
event, you can play with real data :).

Maybe others have been doing this for years, but if not, we're finding it handy.

Jeremy
