Spark can read HDFS directly, so locality is important there; but Spark cannot read Cassandra data directly, it can only connect through an API. So I think you don't need to install them on the same node.
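A rough sketch of the difference, in Python: an HDFS read is addressed by path, while a Cassandra read goes through the DataStax spark-cassandra-connector's DataSource name and options. The helper functions and host value here are hypothetical; the `format` string and option keys follow the connector's documented API.

```python
# Sketch: the two kinds of Spark reads discussed above.
# HDFS files are scanned directly by path, so executor/datanode locality matters.
# Cassandra is reached through the spark-cassandra-connector's DataSource layer,
# never by scanning SSTable files on disk.

def hdfs_read(path):
    """Options for a direct HDFS read, e.g. spark.read.parquet(path)."""
    return {"format": "parquet", "path": path}

def cassandra_read(keyspace, table, host="127.0.0.1"):
    """Options for a read via the spark-cassandra-connector (API access only)."""
    return {
        "format": "org.apache.spark.sql.cassandra",  # connector's DataSource name
        "options": {
            "keyspace": keyspace,
            "table": table,
            "spark.cassandra.connection.host": host,  # hypothetical contact point
        },
    }
```

In real code these dicts would be handed to `spark.read.format(...).options(...)`; the point is only that the Cassandra path always goes through the connector.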
On Sat, Aug 25, 2018 at 3:16 PM, Affan Syed <as...@an10.io> wrote:

> Tobias,
>
> This is very interesting. Can I ask a bit more about why you have both C*
> and Kudu in the system?
>
> Wouldn't keeping just Kudu work (that was its initial purpose)? Is there
> something to do with its production readiness? I ask because we have a
> similar concern.
>
> Finally, how are your dashboard apps talking to Kudu? Is there a backend
> that talks via Impala, or do you have some calls to bash-level scripts
> communicating over some file system?
>
> - Affan
>
> On Thu, Jun 8, 2017 at 7:01 PM Tobias Eriksson <
> tobias.eriks...@qvantel.com> wrote:
>
>> Hi
>>
>> What I wanted was a dashboard with graphs/diagrams, and it should not
>> take minutes for the page to load.
>>
>> So the problem with Spark on Cassandra was that we could not parallelize
>> far enough to have the diagrams rendered in seconds.
>>
>> Now with Kudu we get decent results rendering the diagrams/graphs.
>>
>> The way we transfer data from Cassandra, which is the production system
>> storage, to Kudu is through an Apache Kafka topic (or many topics,
>> actually), and then we have an application which ingests the data into
>> Kudu:
>>
>> Other Systems --> Domain Storage App(s) --> Cassandra --> KAFKA -->
>> KuduIngestion App --> Kudu <-- Dashboard App(s)
>>
>> If you want to play with really fast analytics then perhaps consider
>> looking at Apache Ignite
>>
>> https://ignite.apache.org
>>
>> which acts as a layer between Cassandra and the applications storing
>> into Cassandra (an in-memory data grid, I think it is called).
>>
>> Basically, think of it as a big cache. It is an in-memory thing ☺ and
>> then you can run some super fast queries.
>>
>> -Tobias
>>
>> *From: *DuyHai Doan <doanduy...@gmail.com>
>> *Date: *Thursday, 8 June 2017 at 15:42
>> *To: *Tobias Eriksson <tobias.eriks...@qvantel.com>
>> *Cc: *한 승호 <shha...@outlook.com>, "user@cassandra.apache.org" <
>> user@cassandra.apache.org>
>> *Subject: *Re: Cassandra & Spark
>>
>> Interesting
>>
>> Tobias, when you said "Instead we transferred the data to Apache Kudu",
>> did you transfer all Cassandra data into Kudu with a single migration and
>> then tap into Kudu for aggregation, or did you run a data import every
>> day/week/month from Cassandra into Kudu?
>>
>> From my point of view, the difficulty is not having a static set of data
>> and running aggregation on it; there are a lot of alternatives out there.
>> The difficulty is being able to run analytics on a live/production/changing
>> dataset, with all the data movement & updates that implies.
>>
>> Regards
>>
>> On Thu, Jun 8, 2017 at 3:37 PM, Tobias Eriksson <
>> tobias.eriks...@qvantel.com> wrote:
>>
>> Hi
>>
>> Something to consider before moving to Apache Spark and Cassandra.
>>
>> I have a background where we had tons of data in Cassandra, and we
>> wanted to use Apache Spark to run various jobs.
>>
>> We loved what we could do with Spark, BUT…
>>
>> We soon realized that we wanted to run multiple jobs in parallel. Some
>> jobs would take 30 minutes and some 45 seconds.
>>
>> Spark is by default arranged so that it will take up all the resources
>> there are; this can be tweaked by using Mesos or Yarn. But even with Mesos
>> and Yarn we found it complicated to run multiple jobs in parallel.
>>
>> So eventually we ended up throwing out Spark. Instead we transferred the
>> data to Apache Kudu and ran our analysis on Kudu, and what a difference!
>>
>> "my two cents!"
>>
>> -Tobias
>>
>> *From: *한 승호 <shha...@outlook.com>
>> *Date: *Thursday, 8 June 2017 at 10:25
>> *To: *"user@cassandra.apache.org" <user@cassandra.apache.org>
>> *Subject: *Cassandra & Spark
>>
>> Hello,
>>
>> I am Seung-ho and I work as a data engineer in Korea. I need some advice.
>> My company is recently considering replacing an RDBMS-based system with
>> Cassandra and Hadoop. The purpose of this system is to analyze Cassandra
>> and HDFS data with Spark.
>>
>> It seems many use cases put emphasis on data locality; for instance,
>> both Cassandra and the Spark executor should be on the same node.
>>
>> The thing is, my company's data analyst team wants to analyze
>> heterogeneous data sources, Cassandra and HDFS, using Spark. So, I wonder
>> what the best practices for using Cassandra and Hadoop would be in such a
>> case.
>>
>> Plan A: Both HDFS and Cassandra with NodeManager (Spark executor) on the
>> same node
>>
>> Plan B: Cassandra + NodeManager / HDFS + NodeManager on separate nodes,
>> but in the same cluster
>>
>> Which would be better, or is there a better way?
>>
>> I appreciate your advice in advance :)
>>
>> Best Regards,
>>
>> Seung-Ho Han
>>
>> Sent from Mail for Windows 10 <https://go.microsoft.com/fwlink/?LinkId=550986>
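The Cassandra --> Kafka --> KuduIngestion App --> Kudu path Tobias describes boils down to a transform step per change event. A minimal sketch in Python, where the event shape, field names, and table-naming scheme are all hypothetical (a real ingestion app would use a Kafka consumer and a Kudu client instead of plain dicts):

```python
import json

# Sketch of the ingestion step: turn one Kafka change event (JSON bytes,
# as published from the Cassandra side) into a Kudu-style upsert record.
# Kudu supports UPSERT, which is idempotent, so a redelivered Kafka
# message simply overwrites the same row.

def change_event_to_upsert(raw_event: bytes) -> dict:
    """Map a change event onto the dashboard table's schema."""
    event = json.loads(raw_event)
    return {
        "table": f"metrics_{event['entity_type']}",  # hypothetical naming scheme
        "op": "UPSERT",
        "row": {
            "id": event["id"],
            "ts": event["timestamp"],
            "value": event["payload"]["value"],
        },
    }

# Example event as it might arrive from the Kafka topic:
raw = json.dumps({
    "entity_type": "usage",
    "id": "abc-1",
    "timestamp": 1528452000,
    "payload": {"value": 42.0},
}).encode()

record = change_event_to_upsert(raw)
```

The design point worth noting is the idempotent upsert: because Kafka gives at-least-once delivery by default, making each write an upsert keyed on the row id keeps the Kudu tables correct across retries.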