Tobias,

This is very interesting. May I ask a bit more about why you have both C*
and Kudu in the system?

Wouldn't keeping just Kudu work (wasn't that its initial purpose)? Does it
have something to do with its production readiness? I ask because we have a
similar concern.

Finally, how are your dashboard apps talking to Kudu? Is there a backend
that talks to it via Impala, or do you have calls to bash-level scripts
communicating over some file system?
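
For instance, I imagine the backend could be little more than a JDBC
connection to an impalad; here is a rough sketch of what I have in mind
(Scala, using the Hive JDBC driver against an unsecured cluster; the host,
port and table name are just placeholders):

import java.sql.DriverManager

object DashboardQuery {
  def main(args: Array[String]): Unit = {
    // Impala speaks the HiveServer2 protocol, so the Hive JDBC driver can be
    // used; 21050 is Impala's usual HS2 port, dashboard_metrics is made up.
    Class.forName("org.apache.hive.jdbc.HiveDriver")
    val conn = DriverManager.getConnection("jdbc:hive2://impalad-host:21050/;auth=noSasl")
    val stmt = conn.createStatement()
    val rs = stmt.executeQuery(
      "SELECT day, COUNT(*) AS events FROM dashboard_metrics GROUP BY day ORDER BY day")
    while (rs.next()) {
      println(s"${rs.getString("day")}: ${rs.getLong("events")}")
    }
    rs.close(); stmt.close(); conn.close()
  }
}

Is it something along those lines, or something else entirely?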



- Affan


On Thu, Jun 8, 2017 at 7:01 PM Tobias Eriksson <tobias.eriks...@qvantel.com>
wrote:

> Hi
>
> What I wanted was a dashboard with graphs/diagrams and it should not take
> minutes for the page to load
>
> Thus, it was a problem that Spark with Cassandra could not parallelize the
> work to the extent that the diagrams could be rendered in seconds.
>
> Now with Kudu we get some decent results rendering the diagrams/graphs
>
>
>
> The way we transfer data from Cassandra, which is the production system
> storage, to Kudu is through an Apache Kafka topic (or many topics, actually),
> and then we have an application which ingests the data into Kudu
>
>
>
>
>
> Other Systems --> Domain Storage App(s) --> Cassandra --> KAFKA -->
> KuduIngestion App --> Kudu <-- Dashboard App(s)
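>
> In case it helps, the KuduIngestion App is conceptually nothing more than a
> Kafka consumer writing rows into Kudu. A rough sketch in Scala (the topic,
> table and column names here are made up, not our real ones):
>
> import java.time.Duration
> import java.util.{Collections, Properties}
>
> import org.apache.kafka.clients.consumer.KafkaConsumer
> import org.apache.kudu.client.KuduClient
>
> object KuduIngestionApp {
>   def main(args: Array[String]): Unit = {
>     // Consume the change events that the domain apps publish to Kafka
>     val props = new Properties()
>     props.put("bootstrap.servers", "kafka:9092")
>     props.put("group.id", "kudu-ingestion")
>     props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
>     props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
>     val consumer = new KafkaConsumer[String, String](props)
>     consumer.subscribe(Collections.singletonList("cassandra-changes"))
>
>     // Write each event as a row into a Kudu table
>     val kudu = new KuduClient.KuduClientBuilder("kudu-master:7051").build()
>     val table = kudu.openTable("dashboard_metrics")
>     val session = kudu.newSession()
>
>     while (true) {
>       val records = consumer.poll(Duration.ofSeconds(1))
>       val it = records.iterator()
>       while (it.hasNext) {
>         val rec = it.next()
>         val insert = table.newInsert()
>         val row = insert.getRow
>         row.addString("id", rec.key())
>         row.addString("payload", rec.value())
>         session.apply(insert)
>       }
>       session.flush()
>     }
>   }
> }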
>
>
>
>
>
> If you want to play with really fast analytics then perhaps consider
> looking at Apache Ignite
>
> https://ignite.apache.org
>
> It then acts as a layer between Cassandra and your applications storing
> into Cassandra (an in-memory data grid, I think it is called)
>
> Basically, think of it as a big cache
>
> It is an in-memory thingy ☺
>
> And then you can run some super fast queries
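>
> Getting a feel for it is basically just starting a node and using it as a
> key/value cache. A minimal single-node sketch in Scala (the cache name and
> values are made up):
>
> import org.apache.ignite.Ignition
>
> object IgniteCachePlayground {
>   def main(args: Array[String]): Unit = {
>     // Start an Ignite node with default configuration and use a cache in
>     // front of Cassandra; in a real setup this node would join a cluster
>     val ignite = Ignition.start()
>     val cache = ignite.getOrCreateCache[String, String]("customer-profile")
>     cache.put("customer-42", """{"name":"Jane","segment":"gold"}""")
>     println(cache.get("customer-42"))
>     ignite.close()
>   }
> }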
>
>
>
> -Tobias
>
>
>
> *From: *DuyHai Doan <doanduy...@gmail.com>
> *Date: *Thursday, 8 June 2017 at 15:42
> *To: *Tobias Eriksson <tobias.eriks...@qvantel.com>
> *Cc: *한 승호 <shha...@outlook.com>, "user@cassandra.apache.org" <
> user@cassandra.apache.org>
> *Subject: *Re: Cassandra & Spark
>
>
>
> Interesting
>
>
>
> Tobias, when you said "Instead we transferred the data to Apache Kudu",
> did you transfer all Cassandra data into Kudu with a single migration
> and then tap into Kudu for aggregation, or did you run a data import every
> day/week/month from Cassandra into Kudu?
>
>
>
> From my point of view, the difficulty is not having a static set of data
> and running aggregation on it; there are a lot of alternatives out there for
> that. The difficulty is being able to run analytics on a
> live/production/changing dataset, with all the data movement & updates that
> implies.
>
>
>
> Regards
>
>
>
> On Thu, Jun 8, 2017 at 3:37 PM, Tobias Eriksson <
> tobias.eriks...@qvantel.com> wrote:
>
> Hi
>
> Something to consider before moving to Apache Spark and Cassandra
>
> I have a background where we have tons of data in Cassandra, and we wanted
> to use Apache Spark to run various jobs
>
> We loved what we could do with Spark, BUT….
>
> We realized soon that we wanted to run multiple jobs in parallel
>
> Some jobs would take 30 minutes and some 45 seconds
>
> Spark is by default arranged so that it will take up all the resources
> there are; this can be tweaked by using Mesos or YARN.
>
> But even with Mesos and YARN we found it complicated to run multiple jobs
> in parallel.
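>
> For reference, part of what we mean by capping resources can also be
> expressed as per-application settings; a sketch in Scala where the numbers
> are only examples:
>
> import org.apache.spark.sql.SparkSession
>
> object CappedDashboardJob {
>   def main(args: Array[String]): Unit = {
>     val spark = SparkSession.builder()
>       .appName("dashboard-aggregation")
>       // Total cores for this app on standalone/Mesos; on YARN you would
>       // size the executors instead
>       .config("spark.cores.max", "8")
>       .config("spark.executor.memory", "4g")
>       // Hand resources back when idle (needs the external shuffle service)
>       .config("spark.dynamicAllocation.enabled", "true")
>       .getOrCreate()
>
>     // ... job logic ...
>
>     spark.stop()
>   }
> }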
>
> So eventually we ended up throwing out Spark.
>
> Instead we transferred the data to Apache Kudu, and then we ran our
> analysis on Kudu, and what a difference!
>
> “my two cents!”
>
> -Tobias
>
>
>
>
>
>
>
> *From: *한 승호 <shha...@outlook.com>
> *Date: *Thursday, 8 June 2017 at 10:25
> *To: *"user@cassandra.apache.org" <user@cassandra.apache.org>
> *Subject: *Cassandra & Spark
>
>
>
> Hello,
>
>
>
> I am Seung-ho and I work as a Data Engineer in Korea. I need some advice.
>
>
>
> My company has recently been considering replacing an RDBMS-based system
> with Cassandra and Hadoop.
>
> The purpose of this system is to analyze Cassandra and HDFS data with
> Spark.
>
>
>
> It seems many use cases put emphasis on data locality; for instance,
> Cassandra and the Spark executors should run on the same nodes.
>
>
>
> The thing is, my company's data analyst team wants to analyze heterogeneous
> data sources, Cassandra and HDFS, using Spark.
>
> So I wonder what the best practice would be for using Cassandra and Hadoop
> in such a case.
>
>
>
> Plan A: Both HDFS and Cassandra with the NodeManager (Spark executor) on the
> same node
>
>
>
> Plan B: Cassandra + NodeManager and HDFS + NodeManager on separate nodes,
> but in the same cluster
>
>
>
>
>
> Which would be better or more correct, or is there a better way?
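>
> For context, the kind of job the analyst team wants to run would look
> roughly like the Scala sketch below (the keyspace, table, HDFS path and
> column names are made up):
>
> import org.apache.spark.sql.SparkSession
>
> object MixedSourceAnalysis {
>   def main(args: Array[String]): Unit = {
>     val spark = SparkSession.builder()
>       .appName("cassandra-plus-hdfs")
>       .config("spark.cassandra.connection.host", "cassandra-host")
>       .getOrCreate()
>
>     // Read one side from Cassandra via the spark-cassandra-connector
>     val orders = spark.read
>       .format("org.apache.spark.sql.cassandra")
>       .options(Map("keyspace" -> "sales", "table" -> "orders"))
>       .load()
>
>     // Read the other side from HDFS (Parquet files in this example)
>     val customers = spark.read.parquet("hdfs:///warehouse/customers")
>
>     // Join and aggregate across the two sources
>     orders.join(customers, "customer_id")
>       .groupBy("region").count()
>       .show()
>   }
> }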
>
>
>
> I appreciate your advice in advance :)
>
>
>
> Best Regards,
>
> Seung-Ho Han
>
>
>
>
>
> Sent from Mail <https://go.microsoft.com/fwlink/?LinkId=550986> for Windows 10
>
>
>
>
>
