Tobias,

This is very interesting. Can I ask a bit more about why you have both C* and Kudu in the system?
Wouldn't keeping just Kudu work (wasn't that its initial purpose?). Does it have something to do with its production readiness? I ask because we have a similar concern.

Finally, how are your dashboard apps talking to Kudu? Is there a backend that talks via Impala, or do you have calls to bash-level scripts communicating over some file system?

- Affan

On Thu, Jun 8, 2017 at 7:01 PM Tobias Eriksson <tobias.eriks...@qvantel.com> wrote:

> Hi
>
> What I wanted was a dashboard with graphs/diagrams, and it should not take
> minutes for the page to load
>
> Thus, it was a problem to have Spark with Cassandra and not be able to solve
> the parallelization to such an extent that I could have the diagrams rendered
> in seconds.
>
> Now with Kudu we get some decent results rendering the diagrams/graphs
>
> The way we transfer data from Cassandra, which is the production system
> storage, to Kudu is through an Apache Kafka topic (or many topics actually),
> and then we have an application which ingests the data into Kudu
>
> Other Systems --> Domain Storage App(s) --> Cassandra --> KAFKA -->
> KuduIngestion App --> Kudu <-- Dashboard App(s)
>
> If you want to play with really fast analytics then perhaps consider
> looking at Apache Ignite
>
> https://ignite.apache.org
>
> Which then acts as a layer between Cassandra and your applications storing
> into Cassandra (in-memory data grid I think it is called)
>
> Basically, think of it as a big cache
>
> It is an in-memory thingi ☺
>
> And then you can run some super fast queries
>
> -Tobias
>
> *From:* DuyHai Doan <doanduy...@gmail.com>
> *Date:* Thursday, 8 June 2017 at 15:42
> *To:* Tobias Eriksson <tobias.eriks...@qvantel.com>
> *Cc:* 한 승호 <shha...@outlook.com>, "user@cassandra.apache.org"
> <user@cassandra.apache.org>
> *Subject:* Re: Cassandra & Spark
>
> Interesting
>
> Tobias, when you said "Instead we transferred the data to Apache Kudu",
> did you transfer all Cassandra data into Kudu with a single migration
> and then tap into Kudu for aggregation, or did you run a data import every
> day/week/month from Cassandra into Kudu?
>
> From my point of view, the difficulty is not to have a static set of data
> and run aggregation on it; there are a lot of alternatives out there. The
> difficulty is to be able to run analytics on a live/production/changing
> dataset with all the data movement & updates that it implies.
>
> Regards
>
> On Thu, Jun 8, 2017 at 3:37 PM, Tobias Eriksson
> <tobias.eriks...@qvantel.com> wrote:
>
> Hi
>
> Something to consider before moving to Apache Spark and Cassandra
>
> I have a background where we have tons of data in Cassandra, and we wanted
> to use Apache Spark to run various jobs
>
> We loved what we could do with Spark, BUT…
>
> We soon realized that we wanted to run multiple jobs in parallel
>
> Some jobs would take 30 minutes and some 45 seconds
>
> Spark is by default arranged so that it will take up all the resources
> there are; this can be tweaked by using Mesos or YARN
>
> But even with Mesos and YARN we found it complicated to run multiple jobs
> in parallel.
>
> So eventually we ended up throwing out Spark.
>
> Instead we transferred the data to Apache Kudu, and then we ran our
> analysis on Kudu, and what a difference!
>
> “my two cents!”
>
> -Tobias
>
> *From:* 한 승호 <shha...@outlook.com>
> *Date:* Thursday, 8 June 2017 at 10:25
> *To:* "user@cassandra.apache.org" <user@cassandra.apache.org>
> *Subject:* Cassandra & Spark
>
> Hello,
>
> I am Seung-ho and I work as a Data Engineer in Korea. I need some advice.
>
> My company is considering replacing an RDBMS-based system with Cassandra
> and Hadoop.
>
> The purpose of this system is to analyze Cassandra and HDFS data with
> Spark.
>
> It seems many use cases put emphasis on data locality; for instance, both
> Cassandra and the Spark executor should be on the same node.
>
> The thing is, my company's data analyst team wants to analyze
> heterogeneous data sources, Cassandra and HDFS, using Spark.
>
> So, I wonder what would be the best practices for using Cassandra and
> Hadoop in such a case.
>
> Plan A: Both HDFS and Cassandra with NodeManager (Spark executor) on the
> same node
>
> Plan B: Cassandra + NodeManager / HDFS + NodeManager on separate nodes
> but in the same cluster
>
> Which would be better or correct, or is there a better way?
>
> I appreciate your advice in advance :)
>
> Best Regards,
>
> Seung-Ho Han
>
> Sent from Mail <https://go.microsoft.com/fwlink/?LinkId=550986> for Windows 10