Cool - thanks Dmitriy!

On Jun 15, 2011, at 12:54 PM, Dmitriy Ryaboy wrote:

> Another tip:
> If you parametrize your load statements, it becomes easy to switch
> between loading from something like Cassandra, and reading from HDFS
> or local fs directly.
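> 
> For example (parameter and value names here are just illustrative),
> a script might read:
> 
>   raw = LOAD '$INPUT' USING $LOADER();
> 
> and you'd pass -param INPUT=... -param LOADER=... on the pig command
> line (or via a param file) to point the same script at Cassandra,
> HDFS, or the local fs without editing it.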
> 
> Also:
> Try using Pig's "illustrate" command when working through your flows
> -- it does some clever things that go far beyond simple random
> sampling of source data, in order to ensure that you can see the
> effects of doing filters, that joins get (possibly artificial)
> matching keys even if you sampled in a way that didn't actually
> produce any, etc.
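> 
> For example, in grunt, something like:
> 
>   illustrate joined;
> 
> (where "joined" is whatever alias you want to inspect) will walk the
> plan and show a small, possibly doctored-up set of rows at each step.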
> 
> D
> 
> On Wed, Jun 15, 2011 at 10:35 AM, Jeremy Hanna
> <jeremy.hanna1...@gmail.com> wrote:
>> We started doing this recently and thought it might be useful to others.
>> 
>> Pig (and Hive) have a sample function that allows you to sample data from 
>> your data store.
>> 
>> In Pig it looks something like this:
>> mysample = SAMPLE myrelation 0.01;
>> 
>> One possible use for this, with Pig and Cassandra, is to solve the 
>> conundrum of testing locally.  We'd wondered how to do this, so we 
>> decided to sample a column family (or set of CFs), store the sample 
>> into HDFS (or CFS), download it locally, then import it into a local 
>> Cassandra node.  That gives you real data to test against with 
>> Pig/Hive or for other purposes.
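>> 
>> Roughly, the sampling script looks like this (keyspace, CF, and path 
>> names are made up, and it assumes CassandraStorage is registered):
>> 
>>   rows = LOAD 'cassandra://MyKeyspace/MyCF' USING CassandraStorage();
>>   sampled = SAMPLE rows 0.01;
>>   STORE sampled INTO '/tmp/mycf_sample' USING PigStorage();
>> 
>> Then copy /tmp/mycf_sample down with hadoop fs -copyToLocal and 
>> import it into your local node.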
>> 
>> That way, when you're flying out to the Hadoop Summit or the Cassandra SF 
>> event, you can play with real data :).
>> 
>> Maybe others have been doing this for years, but if not, we're finding it 
>> handy.
>> 
>> Jeremy
