Did you want me to include specific examples from my employment at DataStax, or start from the ground up?

All Spark on Cassandra amounts to is a better option than the previous use of Hive.

The fact that DataStax hasn't provided any benchmarks of its own, other than glossy marketing statements, pretty much says it all. Where are your benchmarks? Maybe you could combine it with the in-memory option to really boogie... :)

(If I find time, I might just write a blog post about exactly how to do this. It involves the use of Parquet and partitioning with clustering, it doesn't cost anything to do, and it's in production at my company.)

--
Colin Clark
+1 612 859 6129
Skype colin.p.clark

> On Feb 11, 2015, at 6:51 AM, DuyHai Doan <doanduy...@gmail.com> wrote:
>
> "The very nature of cassandra's distributed nature vs partitioning data on
> hadoop makes spark on hdfs actually faster than on cassandra...."
>
> Prove it. Did you ever have a look into the source code of the
> Spark/Cassandra connector to see how data locality is achieved before
> throwing out such a statement?
>
>> On Wed, Feb 11, 2015 at 12:42 PM, Marcelo Valle (BLOOMBERG/ LONDON)
>> <mvallemil...@bloomberg.net> wrote:
>> > cassandra makes a very poor datawarehouse or long term time series store
>>
>> Really? This is not the impression I have... I think Cassandra is good for
>> storing large amounts of data and historical information; it's only not
>> good for storing temporary data.
>> Netflix has a large amount of data and it's all stored in Cassandra, AFAIK.
>>
>> > The very nature of cassandra's distributed nature vs partitioning data on
>> > hadoop makes spark on hdfs actually faster than on cassandra.
>>
>> I am not sure about the current state of Spark support for Cassandra, but I
>> guess if you create a map reduce job, the intermediate map results will
>> still be stored in HDFS, as happens with Hadoop. Is this right? I think the
>> problem with Spark + Cassandra or with Hadoop + Cassandra is that the hard
>> part Spark or Hadoop does, the shuffling, could be done out of the box with
>> Cassandra, but no one takes advantage of that. What if a map / reduce job
>> used a temporary CF in Cassandra to store intermediate results?
>>
>> From: user@cassandra.apache.org
>> Subject: Re: How to speed up SELECT * query in Cassandra
>>
>> I use Spark with Cassandra, and you don't need DSE.
>>
>> I see a lot of people ask this same question below (how do I get a lot of
>> data out of Cassandra?), and my question is always: why aren't you updating
>> both places at once?
>>
>> For example, we use Hadoop and Cassandra in conjunction with each other. We
>> use a message bus to store every event in both and aggregate in both, but
>> only keep current data in Cassandra (Cassandra makes a very poor data
>> warehouse or long-term time series store), and then use services to process
>> queries that merge data from Hadoop and Cassandra.
>>
>> Also, Spark on HDFS gives more flexibility in terms of large datasets and
>> performance. The very nature of Cassandra's distributed nature vs
>> partitioning data on Hadoop makes Spark on HDFS actually faster than on
>> Cassandra....
>>
>> --
>> Colin Clark
>> +1 612 859 6129
>> Skype colin.p.clark
>>
>>> On Feb 11, 2015, at 4:49 AM, Jens Rantil <jens.ran...@tink.se> wrote:
>>>
>>>> On Wed, Feb 11, 2015 at 11:40 AM, Marcelo Valle (BLOOMBERG/ LONDON)
>>>> <mvallemil...@bloomberg.net> wrote:
>>>> If you use Cassandra enterprise, you can use hive, AFAIK.
>>>
>>> Even better, you can use Spark/Shark with DSE.
>>>
>>> Cheers,
>>> Jens
>>>
>>> --
>>> Jens Rantil
>>> Backend engineer
>>> Tink AB
>>>
>>> Email: jens.ran...@tink.se
>>> Phone: +46 708 84 18 32
>>> Web: www.tink.se