Re: Recommendations on moving to Hadoop/Hive with Cassandra + RDBMS

Jeremy Hanna Tue, 30 Aug 2011 10:27:26 -0700

FWIW, we are using Pig (and Hadoop) with Cassandra and are looking to 
potentially move to Brisk because of the simplicity of operations there.


Not sure what you mean about the true power of Hadoop.  In my mind the true 
power of Hadoop is the ability to parallelize jobs and send each task to where 
the data resides.  HDFS exists to enable that.  Brisk is just another 
HDFS-compatible implementation.  If you're already storing your data in Cassandra 
and are looking to use Hadoop with it, then I would seriously consider using 
Brisk.

That said, Cassandra with Hadoop works fine.
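
For the hourly roll-up in the quoted sample, the map/reduce shape is roughly
the following. This is only a plain-Python sketch, not Pig, Hive, or Hadoop
code. The column names come from the quoted sample; the function names, the
dict-per-row layout, epoch-second timestamps, and reading CUMULATIVE as a sum
are all assumptions made for illustration.

```python
from collections import defaultdict
from datetime import datetime, timezone


def map_row(row):
    """Map step: key each row by (ServerName, start of its UTC hour)."""
    # Assumption: timeStamp is stored as epoch seconds.
    ts = datetime.fromtimestamp(row["timeStamp"], tz=timezone.utc)
    hour = ts.replace(minute=0, second=0, microsecond=0)
    return (row["ServerName"], hour), row


def reduce_group(rows):
    """Reduce step: sum the two counters, average the response-time column."""
    rows = list(rows)
    return {
        # Assumption: CUMULATIVE in the sample means a running sum.
        "RequestCount": sum(r["RequestCount"] for r in rows),
        "ResponseCount": sum(r["ResponseCount"] for r in rows),
        "MaximumResponseTime": sum(r["MaximumResponseTime"] for r in rows) / len(rows),
    }


def hourly_rollup(rows):
    """Group rows by key in memory, then reduce each group independently."""
    groups = defaultdict(list)
    for key, row in map(map_row, rows):
        groups[key].append(row)
    return {key: reduce_group(group) for key, group in groups.items()}
```

Each group is reduced independently, which is the property that lets Hadoop
spread the work across the nodes that hold the data. In a real job the input
would come from a column family and the result would be written back to
another one instead of being returned as a dict.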

On Aug 30, 2011, at 11:58 AM, Tharindu Mathew wrote:

> Hi Eric,
> 
> Thanks for your response.
> 
> On Tue, Aug 30, 2011 at 5:35 PM, Eric Djatsa <djatsa...@gmail.com> wrote:
> 
>> Hi Tharindu, try having a look at Brisk
>> (http://www.datastax.com/products/brisk). It integrates Hadoop with
>> Cassandra and ships with Hive for SQL analysis. You can then install
>> Sqoop (http://www.cloudera.com/downloads/sqoop/) on top of Hadoop to
>> enable data import/export between Hadoop and MySQL.
>> Does this sound OK to you?
>> 
> These do sound OK, but I was looking at using something from Apache itself.
> 
> Brisk sounds nice, but I feel that disregarding HDFS and totally switching
> to Cassandra is not the right thing to do. Just my opinion there. I feel we
> are not using the true power of Hadoop then.
> 
> I feel Pig has more integration with Cassandra, so I might take a look
> there.
> 
> Whichever I choose, I will contribute the code back to the Apache projects I
> use. Here's a sample data analysis I do with my language. Maybe there is no
> generic way to do what I want to do.
> 
> 
> 
> <get name="NodeId">
> <index name="ServerName" start="" end=""/>
> <!--<index name="nodeId" start="AS" end="FB"/>-->
> <!--<groupBy index="nodeId"/>-->
> <granularity index="timeStamp" type="hour"/>
> </get>
> 
> <lookup name="Event"/>
> 
> <aggregate>
> <measure name="RequestCount" aggregationType="CUMULATIVE"/>
> <measure name="ResponseCount" aggregationType="CUMULATIVE"/>
> <measure name="MaximumResponseTime" aggregationType="AVG"/>
> </aggregate>
> 
> <put name="NodeResult" indexRow="allKeys"/>
> 
> <log/>
> 
> <get name="NodeResult">
> <index name="ServerName" start="" end=""/>
> <groupBy index="ServerName"/>
> </get>
> 
> <aggregate>
> <measure name="RequestCount" aggregationType="CUMULATIVE"/>
> <measure name="ResponseCount" aggregationType="CUMULATIVE"/>
> <measure name="MaximumResponseTime" aggregationType="AVG"/>
> </aggregate>
> 
> <put name="NodeAccumilator" indexRow="allKeys"/>
> 
> <log/>
> 
> 
>> 2011/8/29 Tharindu Mathew <mcclou...@gmail.com>
>> 
>>> Hi,
>>> 
>>> I have an already running system where I define a simple data flow (using
>>> a simple custom data flow language) and configure jobs to run against stored
>>> data. I use Quartz to schedule and run these jobs, and the data lives in
>>> various data stores (mainly Cassandra, but some data is in an RDBMS such as
>>> MySQL as well).
>>> 
>>> Thinking about scalability and already existing support for standard data
>>> flow languages in the form of Pig and HiveQL, I plan to move my system to
>>> Hadoop.
>>> 
>>> I've seen some efforts on the integration of Cassandra and Hadoop. I've
>>> been reading up and am still contemplating how to make this change.
>>> 
>>> It would be great to hear the recommended approach of doing this on Hadoop
>>> with the integration of Cassandra and other RDBMS. For example, a sample
>>> task that already runs on the system is "once every hour, get rows from
>>> column family X, aggregate data in columns A, B and C and write back to
>>> column family Y, and enter details of the last aggregated row into a table
>>> in MySQL".
>>> 
>>> Thanks in advance.
>>> 
>>> --
>>> Regards,
>>> 
>>> Tharindu
>>> 
>> 
>> 
>> 
>> --
>> *Eric Djatsa Yota*
>> *Double degree MsC Student in Computer Science Engineering and
>> Communication Networks
>> Télécom ParisTech (FRANCE) - Politecnico di Torino (ITALY)*
>> *Intern at AMADEUS S.A.S Sophia Antipolis*
>> djatsa...@gmail.com
>> *Tel : 0601791859*
>> 
>> 
> 
> 
> -- 
> Regards,
> 
> Tharindu
