Spark can read HDFS directly, so locality is important there; but Spark cannot read Cassandra data directly, it can only connect through an API. So I think you don't need to install them on the same node.
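A rough sketch of the difference, in Python: an HDFS read is addressed by path, while a Cassandra read goes through the DataStax spark-cassandra-connector's DataSource name and options. The helper functions and host value here are hypothetical; the `format` string and option keys follow the connector's documented API.

```python
# Sketch: the two kinds of Spark reads discussed above.
# HDFS files are scanned directly by path, so executor/datanode locality matters.
# Cassandra is reached through the spark-cassandra-connector's DataSource layer,
# never by scanning SSTable files on disk.

def hdfs_read(path):
    """Options for a direct HDFS read, e.g. spark.read.parquet(path)."""
    return {"format": "parquet", "path": path}

def cassandra_read(keyspace, table, host="127.0.0.1"):
    """Options for a read via the spark-cassandra-connector (API access only)."""
    return {
        "format": "org.apache.spark.sql.cassandra",  # connector's DataSource name
        "options": {
            "keyspace": keyspace,
            "table": table,
            "spark.cassandra.connection.host": host,  # hypothetical contact point
        },
    }
```

In real code these dicts would be handed to `spark.read.format(...).options(...)`; the point is only that the Cassandra path always goes through the connector.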
On Sat, Aug 25, 2018 at 3:16 PM, Affan Syed <as...@an10.io> wrote:

> Tobias,
>
> This is very interesting. Can I ask a bit more about why you have both C*
> and Kudu in the system?
>
> Wouldn't keeping just Kudu work (that was its initial purpose)? Is there
> something to do with its production readiness? I ask because we have a
> similar concern.
>
> Finally, how are your dashboard apps talking to Kudu? Is there a backend
> that talks via Impala, or do you have some calls to bash-level scripts
> communicating over some file system?
>
> - Affan
>
> On Thu, Jun 8, 2017 at 7:01 PM Tobias Eriksson <
> tobias.eriks...@qvantel.com> wrote:
>
>> Hi
>>
>> What I wanted was a dashboard with graphs/diagrams, and it should not
>> take minutes for the page to load.
>>
>> So the problem with Spark on Cassandra was that we could not parallelize
>> far enough to have the diagrams rendered in seconds.
>>
>> Now with Kudu we get decent results rendering the diagrams/graphs.
>>
>> The way we transfer data from Cassandra, which is the production system
>> storage, to Kudu is through an Apache Kafka topic (or many topics,
>> actually), and then we have an application which ingests the data into
>> Kudu:
>>
>> Other Systems --> Domain Storage App(s) --> Cassandra --> KAFKA -->
>> KuduIngestion App --> Kudu <-- Dashboard App(s)
>>
>> If you want to play with really fast analytics then perhaps consider
>> looking at Apache Ignite
>>
>> https://ignite.apache.org
>>
>> which acts as a layer between Cassandra and the applications storing
>> into Cassandra (an in-memory data grid, I think it is called).
>>
>> Basically, think of it as a big cache. It is an in-memory thing ☺ and
>> then you can run some super fast queries.
>>
>> -Tobias
>>
>> *From: *DuyHai Doan <doanduy...@gmail.com>
>> *Date: *Thursday, 8 June 2017 at 15:42
>> *To: *Tobias Eriksson <tobias.eriks...@qvantel.com>
>> *Cc: *한 승호 <shha...@outlook.com>, "user@cassandra.apache.org" <
>> user@cassandra.apache.org>
>> *Subject: *Re: Cassandra & Spark
>>
>> Interesting
>>
>> Tobias, when you said "Instead we transferred the data to Apache Kudu",
>> did you transfer all Cassandra data into Kudu with a single migration and
>> then tap into Kudu for aggregation, or did you run a data import every
>> day/week/month from Cassandra into Kudu?
>>
>> From my point of view, the difficulty is not having a static set of data
>> and running aggregation on it; there are a lot of alternatives out there.
>> The difficulty is being able to run analytics on a live/production/changing
>> dataset, with all the data movement & updates that implies.
>>
>> Regards
>>
>> On Thu, Jun 8, 2017 at 3:37 PM, Tobias Eriksson <
>> tobias.eriks...@qvantel.com> wrote:
>>
>> Hi
>>
>> Something to consider before moving to Apache Spark and Cassandra.
>>
>> I have a background where we had tons of data in Cassandra, and we
>> wanted to use Apache Spark to run various jobs.
>>
>> We loved what we could do with Spark, BUT…
>>
>> We soon realized that we wanted to run multiple jobs in parallel. Some
>> jobs would take 30 minutes and some 45 seconds.
>>
>> Spark is by default arranged so that it will take up all the resources
>> there are; this can be tweaked by using Mesos or Yarn. But even with Mesos
>> and Yarn we found it complicated to run multiple jobs in parallel.
>>
>> So eventually we ended up throwing out Spark. Instead we transferred the
>> data to Apache Kudu and ran our analysis on Kudu, and what a difference!
>>
>> "my two cents!"
>>
>> -Tobias
>>
>> *From: *한 승호 <shha...@outlook.com>
>> *Date: *Thursday, 8 June 2017 at 10:25
>> *To: *"user@cassandra.apache.org" <user@cassandra.apache.org>
>> *Subject: *Cassandra & Spark
>>
>> Hello,
>>
>> I am Seung-ho and I work as a data engineer in Korea. I need some advice.
>> My company is recently considering replacing an RDBMS-based system with
>> Cassandra and Hadoop. The purpose of this system is to analyze Cassandra
>> and HDFS data with Spark.
>>
>> It seems many use cases put emphasis on data locality; for instance,
>> both Cassandra and the Spark executor should be on the same node.
>>
>> The thing is, my company's data analyst team wants to analyze
>> heterogeneous data sources, Cassandra and HDFS, using Spark. So, I wonder
>> what the best practices for using Cassandra and Hadoop would be in such a
>> case.
>>
>> Plan A: Both HDFS and Cassandra with NodeManager (Spark executor) on the
>> same node
>>
>> Plan B: Cassandra + NodeManager / HDFS + NodeManager on separate nodes,
>> but in the same cluster
>>
>> Which would be better, or is there a better way?
>>
>> I appreciate your advice in advance :)
>>
>> Best Regards,
>>
>> Seung-Ho Han
>>
>> Sent from Mail for Windows 10 <https://go.microsoft.com/fwlink/?LinkId=550986>
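The Cassandra --> Kafka --> KuduIngestion App --> Kudu path Tobias describes boils down to a transform step per change event. A minimal sketch in Python, where the event shape, field names, and table-naming scheme are all hypothetical (a real ingestion app would use a Kafka consumer and a Kudu client instead of plain dicts):

```python
import json

# Sketch of the ingestion step: turn one Kafka change event (JSON bytes,
# as published from the Cassandra side) into a Kudu-style upsert record.
# Kudu supports UPSERT, which is idempotent, so a redelivered Kafka
# message simply overwrites the same row.

def change_event_to_upsert(raw_event: bytes) -> dict:
    """Map a change event onto the dashboard table's schema."""
    event = json.loads(raw_event)
    return {
        "table": f"metrics_{event['entity_type']}",  # hypothetical naming scheme
        "op": "UPSERT",
        "row": {
            "id": event["id"],
            "ts": event["timestamp"],
            "value": event["payload"]["value"],
        },
    }

# Example event as it might arrive from the Kafka topic:
raw = json.dumps({
    "entity_type": "usage",
    "id": "abc-1",
    "timestamp": 1528452000,
    "payload": {"value": 42.0},
}).encode()

record = change_event_to_upsert(raw)
```

The design point worth noting is the idempotent upsert: because Kafka gives at-least-once delivery by default, making each write an upsert keyed on the row id keeps the Kudu tables correct across retries.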