Thanks Jeremy. These will be really useful.

On Wed, Aug 31, 2011 at 12:12 AM, Jeremy Hanna <jeremy.hanna1...@gmail.com> wrote:
> I've tried to help out with some UDFs and references that help with our use
> case: https://github.com/jeromatron/pygmalion/
>
> There are some Brisk docs on Pig as well that might be helpful:
> http://www.datastax.com/docs/0.8/brisk/about_pig
>
> On Aug 30, 2011, at 1:30 PM, Tharindu Mathew wrote:
>
> > Thanks Jeremy for your response. That gives me some encouragement that I
> > might be on the right track.
> >
> > I think I need to try out more stuff before coming to a conclusion on
> > Brisk.
> >
> > For Pig operations over Cassandra, I could only find
> > http://svn.apache.org/repos/asf/cassandra/trunk/contrib/pig. Are there
> > any other resources that you can point me to? There seems to be a lack
> > of samples on this subject.
> >
> > On Tue, Aug 30, 2011 at 10:56 PM, Jeremy Hanna
> > <jeremy.hanna1...@gmail.com> wrote:
> > FWIW, we are using Pig (and Hadoop) with Cassandra and are looking to
> > potentially move to Brisk because of the simplicity of operations there.
> >
> > Not sure what you mean about the true power of Hadoop. In my mind, the
> > true power of Hadoop is the ability to parallelize jobs and send each
> > task to where the data resides. HDFS exists to enable that. Brisk is
> > just another HDFS-compatible implementation. If you're already storing
> > your data in Cassandra and are looking to use Hadoop with it, then I
> > would seriously consider using Brisk.
> >
> > That said, Cassandra with Hadoop works fine.
> >
> > On Aug 30, 2011, at 11:58 AM, Tharindu Mathew wrote:
> >
> > > Hi Eric,
> > >
> > > Thanks for your response.
> > >
> > > On Tue, Aug 30, 2011 at 5:35 PM, Eric Djatsa <djatsa...@gmail.com> wrote:
> > >
> > >> Hi Tharindu, try having a look at Brisk
> > >> (http://www.datastax.com/products/brisk); it integrates Hadoop with
> > >> Cassandra and is shipped with Hive for SQL analysis. You can then
> > >> install Sqoop (http://www.cloudera.com/downloads/sqoop/) on top of
> > >> Hadoop in order to enable data import/export between Hadoop and MySQL.
> > >> Does this sound ok to you?
> > >
> > > These do sound ok. But I was looking at using something from Apache
> > > itself.
> > >
> > > Brisk sounds nice, but I feel that disregarding HDFS and totally
> > > switching to Cassandra is not the right thing to do. Just my opinion
> > > there. I feel we are not using the true power of Hadoop then.
> > >
> > > I feel Pig has more integration with Cassandra, so I might take a look
> > > there.
> > >
> > > Whichever I choose, I will contribute the code back to the Apache
> > > projects I use. Here's a sample data analysis I do with my language.
> > > Maybe there is no generic way to do what I want to do.
> > >
> > > <get name="NodeId">
> > >   <index name="ServerName" start="" end=""/>
> > >   <!--<index name="nodeId" start="AS" end="FB"/>-->
> > >   <!--<groupBy index="nodeId"/>-->
> > >   <granularity index="timeStamp" type="hour"/>
> > > </get>
> > >
> > > <lookup name="Event"/>
> > >
> > > <aggregate>
> > >   <measure name="RequestCount" aggregationType="CUMULATIVE"/>
> > >   <measure name="ResponseCount" aggregationType="CUMULATIVE"/>
> > >   <measure name="MaximumResponseTime" aggregationType="AVG"/>
> > > </aggregate>
> > >
> > > <put name="NodeResult" indexRow="allKeys"/>
> > >
> > > <log/>
> > >
> > > <get name="NodeResult">
> > >   <index name="ServerName" start="" end=""/>
> > >   <groupBy index="ServerName"/>
> > > </get>
> > >
> > > <aggregate>
> > >   <measure name="RequestCount" aggregationType="CUMULATIVE"/>
> > >   <measure name="ResponseCount" aggregationType="CUMULATIVE"/>
> > >   <measure name="MaximumResponseTime" aggregationType="AVG"/>
> > > </aggregate>
> > >
> > > <put name="NodeAccumilator" indexRow="allKeys"/>
> > >
> > > <log/>
> > >
> > >> 2011/8/29 Tharindu Mathew <mcclou...@gmail.com>
> > >>
> > >>> Hi,
> > >>>
> > >>> I have an already running system where I define a simple data flow
> > >>> (using a simple custom data flow language) and configure jobs to run
> > >>> against stored data. I use quartz to schedule and run these jobs, and
> > >>> the data exists on various data stores (mainly Cassandra, but some
> > >>> data exists in RDBMS like MySQL as well).
> > >>>
> > >>> Thinking about scalability and the already existing support for
> > >>> standard data flow languages in the form of Pig and HiveQL, I plan to
> > >>> move my system to Hadoop.
> > >>>
> > >>> I've seen some efforts on the integration of Cassandra and Hadoop.
> > >>> I've been reading up and am still contemplating how to make this
> > >>> change.
> > >>>
> > >>> It would be great to hear the recommended approach of doing this on
> > >>> Hadoop with the integration of Cassandra and other RDBMS. For example,
> > >>> a sample task that already runs on the system is "once in every hour,
> > >>> get rows from column family X, aggregate data in columns A, B and C,
> > >>> write back to column family Y, and enter details of the last
> > >>> aggregated row into a table in mysql".
> > >>>
> > >>> Thanks in advance.
> > >>>
> > >>> --
> > >>> Regards,
> > >>>
> > >>> Tharindu
> > >>
> > >>
> > >> --
> > >> *Eric Djatsa Yota*
> > >> *Double degree MSc Student in Computer Science Engineering and
> > >> Communication Networks
> > >> Télécom ParisTech (FRANCE) - Politecnico di Torino (ITALY)*
> > >> *Intern at AMADEUS S.A.S Sophia Antipolis*
> > >> djatsa...@gmail.com
> > >> *Tel : 0601791859*
> > >
> > >
> > > --
> > > Regards,
> > >
> > > Tharindu
> >
> > --
> > Regards,
> >
> > Tharindu

--
Regards,

Tharindu
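[Editor's sketch] The hourly job described in the thread ("get rows from column family X, aggregate columns A, B and C, write back to column family Y") and the `<aggregate>` block in the XML sample can be illustrated with a small, hedged Python sketch. All names here (`aggregate_hourly`, the row dicts, the field names taken from the sample) are illustrative assumptions; a real job would read and write Cassandra via Pig's `CassandraStorage` or Hadoop's `ColumnFamilyInputFormat`, not in-memory dicts.

```python
# Illustrative sketch only: reproduces the semantics of the <aggregate>
# block in the data-flow sample, where CUMULATIVE measures are summed and
# AVG measures are averaged, grouped per server per hour.
from collections import defaultdict


def aggregate_hourly(rows):
    """Group rows by (ServerName, hour) and reduce each group.

    CUMULATIVE measures (RequestCount, ResponseCount) are summed;
    the AVG measure (MaximumResponseTime) is averaged.
    """
    groups = defaultdict(list)
    for row in rows:
        hour = row["timeStamp"] - (row["timeStamp"] % 3600)  # truncate to hour
        groups[(row["ServerName"], hour)].append(row)

    results = {}
    for key, group in groups.items():
        results[key] = {
            "RequestCount": sum(r["RequestCount"] for r in group),
            "ResponseCount": sum(r["ResponseCount"] for r in group),
            "MaximumResponseTime":
                sum(r["MaximumResponseTime"] for r in group) / len(group),
        }
    return results


rows = [
    {"ServerName": "node1", "timeStamp": 7200, "RequestCount": 10,
     "ResponseCount": 9, "MaximumResponseTime": 1.2},
    {"ServerName": "node1", "timeStamp": 7260, "RequestCount": 20,
     "ResponseCount": 18, "MaximumResponseTime": 1.8},
]
print(aggregate_hourly(rows)[("node1", 7200)])
# {'RequestCount': 30, 'ResponseCount': 27, 'MaximumResponseTime': 1.5}
```

In a Pig port of this flow, the grouping step would become a `GROUP ... BY` and the reductions would be the built-in `SUM` and `AVG` functions, with the result `STORE`d back into the target column family.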