We started doing this recently and thought it might be useful to others. Pig (and Hive) have a sampling feature that lets you pull a random subset of the data in your data store.
In Pig it looks something like this:

    mysample = SAMPLE myrelation 0.01;

Here 0.01 keeps roughly 1% of the rows.

One possible use for this with Pig and Cassandra is to solve the conundrum of testing locally. We had wondered how to do this, so we decided to sample a column family (or set of CFs), store the sample in HDFS (or CFS), download it, and then import it into a local Cassandra node. That gives you real data to test against with Pig/Hive or for other purposes. That way, when you're flying out to the Hadoop Summit or the Cassandra SF event, you can play with real data :).

Maybe others have been doing this for years, but if not, we're finding it handy.

Jeremy
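
P.S. A rough sketch of what the sample-and-reload steps might look like in Pig, in case it helps. The keyspace, column family, and path names (MyKeyspace, MyCF, /tmp/mycf_sample) are made up for illustration, and it assumes the CassandraStorage load/store function that ships with Cassandra's Pig support is on the classpath (for example via the pig_cassandra wrapper script). BinStorage is used here as an assumption so that binary column names and values survive the round trip; treat this as a starting point rather than a tested recipe.

```pig
-- On the cluster: sample about 1% of the rows and write them to HDFS (or CFS).
-- MyKeyspace, MyCF and the output path are hypothetical names.
rows = LOAD 'cassandra://MyKeyspace/MyCF'
       USING org.apache.cassandra.hadoop.pig.CassandraStorage()
       AS (key, columns: bag {T: tuple(name, value)});
mysample = SAMPLE rows 0.01;
STORE mysample INTO '/tmp/mycf_sample' USING BinStorage();

-- Then copy the output down to your laptop, for example:
--   hadoop fs -get /tmp/mycf_sample ./mycf_sample

-- Locally (e.g. pig -x local, pointed at the local Cassandra node):
-- load the downloaded sample and write it into the local column family.
local_rows = LOAD 'mycf_sample' USING BinStorage()
             AS (key, columns: bag {T: tuple(name, value)});
STORE local_rows INTO 'cassandra://MyKeyspace/MyCF'
      USING org.apache.cassandra.hadoop.pig.CassandraStorage();
```

The keyspace and column family need to exist on the local node before the final STORE. Also, SAMPLE is probabilistic, so the number of rows you get back will vary a little from run to run.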