I haven't used that in particular, but it's trivial to do with Pig, and I'd 
imagine it would just do the right thing under the covers.  It's a simple join 
with Pig.  We use pygmalion to get the data out of the Cassandra bag.  A 
simple example would be:
DEFINE FromCassandraBag org.pygmalion.udf.FromCassandraBag();

raw_billing_account = LOAD 'cassandra://voltron/billing_account' USING 
org.apache.cassandra.hadoop.pig.CassandraStorage() AS (id:chararray, 
columns:bag {column:tuple (name, value)});
billing_account = FOREACH raw_billing_account GENERATE
        id,
        FLATTEN(FromCassandraBag('name, age, address, city, state, zip', columns)) AS (
                name:           chararray,
                age:            chararray,
                address:        chararray,
                city:           chararray,
                state:          chararray,
                zip:            chararray
        );

raw_game_account = LOAD 'cassandra://voltron/game_account' USING 
org.apache.cassandra.hadoop.pig.CassandraStorage() AS (id:chararray, 
columns:bag {column:tuple (name, value)});
game_account = FOREACH raw_game_account GENERATE
        id,
        FLATTEN(FromCassandraBag('username, level, experience_points, super_powers, vehicles', columns)) AS (
                username:               chararray,
                level:                  chararray,
                experience_points:      chararray,
                super_powers:           chararray,
                vehicles:               chararray
        );

composite_relation = FOREACH
        (JOIN billing_account BY id, game_account BY id)
        GENERATE
                billing_account::id AS id,
                name,
                username,
                level,
                super_powers;

Anyway, not sure if that's what you're looking for, but that's what we do a lot 
of with Pig: joins on arbitrary attributes, group-bys, and things like that.
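
For the group-bys, a minimal sketch along the same lines, reusing the 
billing_account relation from above (the accounts_by_state / state_counts 
names are just illustrative):

accounts_by_state = GROUP billing_account BY state;
state_counts = FOREACH accounts_by_state GENERATE
        group AS state,
        COUNT(billing_account) AS num_accounts;

Same idea as the join: once FromCassandraBag has flattened the columns into a 
normal schema, any Pig relational operator works on them.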


On Mar 1, 2012, at 4:45 AM, Benoit Mathieu wrote:

> Hi all,
> 
> I want to write a MapReduce job with a Map task taking its data from 2
> CFs. Those 2 CFs have the same row keys and are in same keyspace, so
> they are partionned the same way across my cluster and it would be
> nice that the Map task reads the both column families locally.
> 
> In hadoop package org.apache.hadoop.mapred.join, there is a
> CompositeInputFormat class, which seems to do what I want, but it
> seems related to HDFS files as the "compose" method takes "Path" args.
> 
> Does anyone have ever wrote a CompositeColumnFamilyInputFormat ? or
> have any insight about it ?
> 
> Cheers,
> 
> Benoit
