Re: How to speed up SELECT * query in Cassandra

Colin Wed, 11 Feb 2015 07:07:07 -0800

No, the question isn't closed. You don't get to decide that.

I don't run a website making claims regarding cassandra and spark; your 
employer does.


Again, where are your benchmarks?

I will publish mine, then we'll see what you've got.

--
Colin Clark 
+1 612 859 6129
Skype colin.p.clark

> On Feb 11, 2015, at 8:39 AM, DuyHai Doan <doanduy...@gmail.com> wrote:
> 
> For your information Colin: http://en.wikipedia.org/wiki/List_of_fallacies. 
> Look at "Burden of proof"
> 
> You stated "The very nature of cassandra's distributed nature vs partitioning 
> data on hadoop makes spark on hdfs actually faster than on cassandra...."
> 
> It's up to YOU to prove it right, not up to me to prove it wrong.
> 
> Everything else is just trolling.
> 
> Come back to me once you have some decent benchmarks supporting your 
> statement. Until then, the question is closed.
> 
> 
> 
>> On Wed, Feb 11, 2015 at 3:17 PM, Colin <co...@clark.ws> wrote:
>> Did you want me to include specific examples from my employment at datastax 
>> or start from the ground up? 
>> 
>> All that spark on cassandra is, is an improvement over the previous use of hive. 
>> 
>> The fact that datastax hasn't provided any benchmarks themselves other than 
>> glossy marketing statements pretty much says it all. Where are your 
>> benchmarks?  Maybe you could combine it with the in-memory option to really 
>> boogie...
>> 
>> :)
>> 
>> (If I find time, I might just write a blog post about exactly how to do 
>> this. It involves the use of parquet and partitioning with clustering, it 
>> doesn't cost anything to do, and it's in production at my company.)
>> --
>> Colin Clark 
>> +1 612 859 6129
>> Skype colin.p.clark
>> 
>>> On Feb 11, 2015, at 6:51 AM, DuyHai Doan <doanduy...@gmail.com> wrote:
>>> 
>>> "The very nature of cassandra's distributed nature vs partitioning data on 
>>> hadoop makes spark on hdfs actually faster than on cassandra...."
>>> 
>>> Prove it. Did you ever look at the source code of the 
>>> Spark/Cassandra connector to see how data locality is achieved before 
>>> throwing out such a statement?
>>> 
>>>> On Wed, Feb 11, 2015 at 12:42 PM, Marcelo Valle (BLOOMBERG/ LONDON) 
>>>> <mvallemil...@bloomberg.net> wrote:
>>>> > cassandra makes a very poor data warehouse or long-term time-series store
>>>> 
>>>> Really? This is not the impression I have... I think Cassandra is good for 
>>>> storing large amounts of data and historical information; it's just not 
>>>> good for storing temporary data.
>>>> Netflix has a large amount of data and it's all stored in Cassandra, 
>>>> AFAIK. 
>>>> 
>>>> > The very nature of cassandra's distributed nature vs partitioning data 
>>>> > on hadoop makes spark on hdfs actually faster than on cassandra.
>>>> 
>>>> I am not sure about the current state of Spark support for Cassandra, but 
>>>> I guess if you create a map reduce job, the intermediate map results will 
>>>> still be stored in HDFS, as happens with hadoop. Is this right? I think 
>>>> the problem with Spark + Cassandra or with Hadoop + Cassandra is that the 
>>>> hard part spark or hadoop does, the shuffling, could be done out of the 
>>>> box with Cassandra, but no one takes advantage of that. What if a map / 
>>>> reduce job used a temporary CF in Cassandra to store intermediate results?
>>>> 
>>>> From: user@cassandra.apache.org 
>>>> Subject: Re: How to speed up SELECT * query in Cassandra
>>>> I use spark with cassandra, and you don't need DSE.
>>>> 
>>>> I see a lot of people ask this same question below (how do I get a lot of 
>>>> data out of cassandra?), and my question is always: why aren't you updating 
>>>> both places at once?
>>>> 
>>>> For example, we use hadoop and cassandra in conjunction with each other. 
>>>> We use a message bus to store every event in both and aggregate in both, 
>>>> but only keep current data in cassandra (cassandra makes a very poor 
>>>> data warehouse or long-term time-series store), and then use services to 
>>>> process queries that merge data from hadoop and cassandra.
>>>> 
>>>> Also, spark on hdfs gives more flexibility in terms of large datasets and 
>>>> performance. The very nature of cassandra's distributed nature vs 
>>>> partitioning data on hadoop makes spark on hdfs actually faster than on 
>>>> cassandra....
>>>> 
>>>> 
>>>> 
>>>> --
>>>> Colin Clark 
>>>> +1 612 859 6129
>>>> Skype colin.p.clark
>>>> 
>>>>> On Feb 11, 2015, at 4:49 AM, Jens Rantil <jens.ran...@tink.se> wrote:
>>>>> 
>>>>> 
>>>>>> On Wed, Feb 11, 2015 at 11:40 AM, Marcelo Valle (BLOOMBERG/ LONDON) 
>>>>>> <mvallemil...@bloomberg.net> wrote:
>>>>>> If you use Cassandra enterprise, you can use hive, AFAIK.
>>>>> 
>>>>> Even better, you can use Spark/Shark with DSE.
>>>>> 
>>>>> Cheers,
>>>>> Jens
>>>>> 
>>>>> 
>>>>> -- 
>>>>> Jens Rantil
>>>>> Backend engineer
>>>>> Tink AB
>>>>> 
>>>>> Email: jens.ran...@tink.se
>>>>> Phone: +46 708 84 18 32
>>>>> Web: www.tink.se
>>>>> 
>
