Thanks, Jeremy, for your response. That gives me some encouragement that I might be on the right track.
I think I need to try out more stuff before coming to a conclusion on Brisk.

For Pig operations over Cassandra, the only resource I could find is http://svn.apache.org/repos/asf/cassandra/trunk/contrib/pig. Are there any other resources you can point me to? There seems to be a lack of samples on this subject.
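For reference, the sample in that contrib directory amounts to roughly the script below, as far as I can tell (Keyspace1 and Standard1 are the stock names from the example, and it is meant to be run through the bin/pig_cassandra wrapper with the cluster address, RPC port and partitioner set in the environment):

  -- Each Cassandra row arrives as (key, {(column_name, column_value), ...}).
  rows = LOAD 'cassandra://Keyspace1/Standard1'
         USING org.apache.cassandra.hadoop.pig.CassandraStorage()
         AS (key, columns: bag {T: tuple(name, value)});

  -- Flatten the column bags and count how often each column name occurs.
  cols       = FOREACH rows GENERATE FLATTEN(columns);
  colnames   = FOREACH cols GENERATE $0 AS name;
  namegroups = GROUP colnames BY name;
  namecounts = FOREACH namegroups GENERATE group AS name, COUNT(colnames) AS cnt;

  -- Show the 50 most frequent column names.
  ordered  = ORDER namecounts BY cnt DESC;
  topnames = LIMIT ordered 50;
  DUMP topnames;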
On Tue, Aug 30, 2011 at 10:56 PM, Jeremy Hanna <jeremy.hanna1...@gmail.com> wrote:

> FWIW, we are using Pig (and Hadoop) with Cassandra and are looking to potentially move to Brisk because of the simplicity of operations there.
>
> Not sure what you mean about the true power of Hadoop. In my mind, the true power of Hadoop is the ability to parallelize jobs and send each task to where the data resides. HDFS exists to enable that. Brisk is just another HDFS-compatible implementation. If you're already storing your data in Cassandra and are looking to use Hadoop with it, then I would seriously consider using Brisk.
>
> That said, Cassandra with Hadoop works fine.
>
> On Aug 30, 2011, at 11:58 AM, Tharindu Mathew wrote:
>
> > Hi Eric,
> >
> > Thanks for your response.
> >
> > On Tue, Aug 30, 2011 at 5:35 PM, Eric Djatsa <djatsa...@gmail.com> wrote:
> >
> >> Hi Tharindu, try having a look at Brisk (http://www.datastax.com/products/brisk). It integrates Hadoop with Cassandra and is shipped with Hive for SQL analysis. You can then install Sqoop (http://www.cloudera.com/downloads/sqoop/) on top of Hadoop in order to enable data import/export between Hadoop and MySQL.
> >> Does this sound ok to you?
> >
> > These do sound ok, but I was looking at using something from Apache itself.
> >
> > Brisk sounds nice, but I feel that disregarding HDFS and totally switching to Cassandra is not the right thing to do. Just my opinion there. I feel we are not using the true power of Hadoop then.
> >
> > I feel Pig has more integration with Cassandra, so I might take a look there.
> >
> > Whichever I choose, I will contribute the code back to the Apache projects I use. Here's a sample data analysis I do with my language. Maybe there is no generic way to do what I want to do.
> >
> > <get name="NodeId">
> >   <index name="ServerName" start="" end=""/>
> >   <!--<index name="nodeId" start="AS" end="FB"/>-->
> >   <!--<groupBy index="nodeId"/>-->
> >   <granularity index="timeStamp" type="hour"/>
> > </get>
> >
> > <lookup name="Event"/>
> >
> > <aggregate>
> >   <measure name="RequestCount" aggregationType="CUMULATIVE"/>
> >   <measure name="ResponseCount" aggregationType="CUMULATIVE"/>
> >   <measure name="MaximumResponseTime" aggregationType="AVG"/>
> > </aggregate>
> >
> > <put name="NodeResult" indexRow="allKeys"/>
> >
> > <log/>
> >
> > <get name="NodeResult">
> >   <index name="ServerName" start="" end=""/>
> >   <groupBy index="ServerName"/>
> > </get>
> >
> > <aggregate>
> >   <measure name="RequestCount" aggregationType="CUMULATIVE"/>
> >   <measure name="ResponseCount" aggregationType="CUMULATIVE"/>
> >   <measure name="MaximumResponseTime" aggregationType="AVG"/>
> > </aggregate>
> >
> > <put name="NodeAccumilator" indexRow="allKeys"/>
> >
> > <log/>
> >
> >> 2011/8/29 Tharindu Mathew <mcclou...@gmail.com>
> >>
> >>> Hi,
> >>>
> >>> I have an already running system where I define a simple data flow (using a simple custom data flow language) and configure jobs to run against stored data. I use quartz to schedule and run these jobs, and the data exists on various data stores (mainly Cassandra, but some data exists in RDBMSs like MySQL as well).
> >>>
> >>> Thinking about scalability and the already existing support for standard data flow languages in the form of Pig and HiveQL, I plan to move my system to Hadoop.
> >>>
> >>> I've seen some efforts on the integration of Cassandra and Hadoop. I've been reading up and am still contemplating how to make this change.
> >>>
> >>> It would be great to hear the recommended approach for doing this on Hadoop with the integration of Cassandra and other RDBMSs. For example, a sample task that already runs on the system is: "once every hour, get rows from column family X, aggregate data in columns A, B and C, write back to column family Y, and enter details of the last aggregated row into a table in MySQL".
> >>>
> >>> Thanks in advance.
> >>>
> >>> --
> >>> Regards,
> >>>
> >>> Tharindu
> >>
> >> --
> >> Eric Djatsa Yota
> >> Double-degree MSc student in Computer Science Engineering and Communication Networks
> >> Télécom ParisTech (FRANCE) - Politecnico di Torino (ITALY)
> >> Intern at AMADEUS S.A.S Sophia Antipolis
> >> djatsa...@gmail.com
> >> Tel: 0601791859
> >
> > --
> > Regards,
> >
> > Tharindu

--
Regards,

Tharindu
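P.S. To make my question concrete, here is my rough guess at what the hourly task from my first mail might look like in Pig Latin. MyKeyspace is a placeholder, I am assuming the measure values can be cast to longs, and I am assuming CassandraStorage can STORE back into Cassandra, so please treat this as a sketch rather than working code:

  -- Read column family X; each row is (key, {(column_name, column_value), ...}).
  rows = LOAD 'cassandra://MyKeyspace/X'
         USING org.apache.cassandra.hadoop.pig.CassandraStorage()
         AS (key, columns: bag {T: tuple(name, value)});

  -- Break rows into (key, name, value) triples, keep measures A, B and C,
  -- and cast the raw byte values to usable types.
  cols = FOREACH rows GENERATE key, FLATTEN(columns) AS (name, value);
  abc  = FILTER cols BY (chararray) name == 'A' OR (chararray) name == 'B'
                     OR (chararray) name == 'C';
  vals = FOREACH abc GENERATE key, (chararray) name AS name, (long) value AS val;

  -- Aggregate per key and column: SUM for the CUMULATIVE measures
  -- (AVG would replace SUM for something like MaximumResponseTime).
  grp = GROUP vals BY (key, name);
  agg = FOREACH grp GENERATE FLATTEN(group) AS (key, name), SUM(vals.val) AS total;

  -- Re-assemble rows as (key, {(name, value), ...}) and write them to Y.
  bykey = GROUP agg BY key;
  out   = FOREACH bykey GENERATE group AS key, agg.(name, total) AS columns;
  STORE out INTO 'cassandra://MyKeyspace/Y'
        USING org.apache.cassandra.hadoop.pig.CassandraStorage();

  -- The "enter details of the last aggregated row into MySQL" step is not
  -- shown; a Sqoop export (or Piggybank's DBStorage) could probably cover it.

Scheduling would presumably stay in quartz for now, or move to something like Oozie, since Pig itself does not schedule anything.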