Did you want me to include specific examples from my employment at DataStax, or 
start from the ground up? 

All that Spark on Cassandra amounts to is something better than the previous 
use of Hive. 

The fact that DataStax hasn't provided any benchmarks themselves, other than 
glossy marketing statements, pretty much says it all. Where are your benchmarks?  
Maybe you could combine it with the in-memory option to really boogie...

:)

(If I find time, I might just write a blog post about exactly how to do this. 
It involves the use of Parquet and partitioning with clustering, it doesn't 
cost anything to do, and it's in production at my company.)
--
Colin Clark 
+1 612 859 6129
Skype colin.p.clark

> On Feb 11, 2015, at 6:51 AM, DuyHai Doan <doanduy...@gmail.com> wrote:
> 
> "The very nature of cassandra's distributed nature vs partitioning data on 
> hadoop makes spark on hdfs actually faster than on cassandra...."
> 
> Prove it. Did you ever look at the source code of the Spark/Cassandra 
> connector to see how data locality is achieved before throwing out such a 
> statement?
> 
>> On Wed, Feb 11, 2015 at 12:42 PM, Marcelo Valle (BLOOMBERG/ LONDON) 
>> <mvallemil...@bloomberg.net> wrote:
>> > cassandra makes a very poor data warehouse or long term time series store
>> 
>> Really? This is not the impression I have... I think Cassandra is good for 
>> storing large amounts of data and historical information; it's only not 
>> good for storing temporary data.
>> Netflix has a large amount of data and it's all stored in Cassandra, AFAIK. 
>> 
>> > The very nature of cassandra's distributed nature vs partitioning data on 
>> > hadoop makes spark on hdfs actually faster than on cassandra.
>> 
>> I am not sure about the current state of Spark support for Cassandra, but I 
>> guess if you create a map reduce job, the intermediate map results will 
>> still be stored in HDFS, as happens with Hadoop. Is this right? I think the 
>> problem with Spark + Cassandra or with Hadoop + Cassandra is that the hard 
>> part Spark or Hadoop does, the shuffling, could be done out of the box with 
>> Cassandra, but no one takes advantage of that. What if a map / reduce job 
>> used a temporary CF in Cassandra to store intermediate results?
>> 
>> From: user@cassandra.apache.org 
>> Subject: Re: How to speed up SELECT * query in Cassandra
>> I use spark with cassandra, and you don't need DSE.
>> 
>> I see a lot of people ask this same question below (how do I get a lot of 
>> data out of cassandra?), and my question is always, why aren't you updating 
>> both places at once?
>> 
>> For example, we use hadoop and cassandra in conjunction with each other. We 
>> use a message bus to store every event in both, aggregate in both, but only 
>> keep current data in cassandra (cassandra makes a very poor data warehouse or 
>> long term time series store) and then use services to process queries that 
>> merge data from hadoop and cassandra.  
>> 
>> Also, spark on hdfs gives more flexibility in terms of large datasets and 
>> performance.  The very nature of cassandra's distributed nature vs 
>> partitioning data on hadoop makes spark on hdfs actually faster than on 
>> cassandra....
>> 
>> 
>> 
>> --
>> Colin Clark 
>> +1 612 859 6129
>> Skype colin.p.clark
>> 
>>> On Feb 11, 2015, at 4:49 AM, Jens Rantil <jens.ran...@tink.se> wrote:
>>> 
>>> 
>>>> On Wed, Feb 11, 2015 at 11:40 AM, Marcelo Valle (BLOOMBERG/ LONDON) 
>>>> <mvallemil...@bloomberg.net> wrote:
>>>> If you use Cassandra enterprise, you can use hive, AFAIK.
>>> 
>>> Even better, you can use Spark/Shark with DSE.
>>> 
>>> Cheers,
>>> Jens
>>> 
>>> 
>>> -- 
>>> Jens Rantil
>>> Backend engineer
>>> Tink AB
>>> 
>>> Email: jens.ran...@tink.se
>>> Phone: +46 708 84 18 32
>>> Web: www.tink.se
>>> 
>> 
> 
