Nope, the Spark Cassandra connector leverages data locality and gets tremendous improvements from it.

- Affan
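A minimal sketch of what that connector-level locality looks like in practice (keyspace, table, and contact point are made up): the connector maps each Spark partition to a Cassandra token range and reports the replicas owning that range as preferred locations, so when executors are co-located with Cassandra the tasks run next to their data.

    // Reading a Cassandra table with the spark-cassandra-connector.
    // Each RDD partition covers a token range; getPreferredLocations()
    // reports the replica nodes for that range, so Spark schedules the
    // task on a co-located executor when one exists.
    import com.datastax.spark.connector._
    import org.apache.spark.{SparkConf, SparkContext}

    object LocalityDemo extends App {
      val conf = new SparkConf()
        .setAppName("locality-demo")
        .set("spark.cassandra.connection.host", "10.0.0.1") // hypothetical seed node

      val sc = new SparkContext(conf)

      // "sensors"/"readings" are hypothetical names for the sketch.
      val readings = sc.cassandraTable("sensors", "readings")
      println(s"rows: ${readings.count()}")
    }

The connector also provides joinWithCassandraTable and repartitionByCassandraReplica for keeping point lookups on the owning replicas.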
On Sat, Aug 25, 2018 at 11:25 AM CharSyam <chars...@gmail.com> wrote:

> Spark can read HDFS directly, so locality is important there, but Spark
> cannot read Cassandra data directly; it can only connect through an API.
> So I think you don't need to install them on the same node.
>
> On Sat, Aug 25, 2018 at 3:16 PM, Affan Syed <as...@an10.io> wrote:
>
>> Tobias,
>>
>> This is very interesting. Can I inquire a bit more into why you have
>> both C* and Kudu in the system?
>>
>> Wouldn't keeping just Kudu work (that was its initial purpose?). Is
>> there something to do with its production readiness? I ask because we
>> have a similar concern.
>>
>> Finally, how are your dashboard apps talking to Kudu? Is there a
>> backend that talks via Impala, or do you have some calls to bash-level
>> scripts communicating over some file system?
>>
>> - Affan
>>
>> On Thu, Jun 8, 2017 at 7:01 PM Tobias Eriksson
>> <tobias.eriks...@qvantel.com> wrote:
>>
>>> Hi,
>>>
>>> What I wanted was a dashboard with graphs/diagrams, and it should not
>>> take minutes for the page to load.
>>>
>>> Thus, the problem with Spark on Cassandra was that we could not
>>> parallelize to such an extent that the diagrams rendered in seconds.
>>> Now with Kudu we get decent results rendering the diagrams/graphs.
>>>
>>> The way we transfer data from Cassandra, which is the production
>>> system storage, to Kudu is through an Apache Kafka topic (or many
>>> topics, actually), and then we have an application which ingests the
>>> data into Kudu:
>>>
>>> Other Systems --> Domain Storage App(s) --> Cassandra --> Kafka -->
>>> Kudu Ingestion App --> Kudu <-- Dashboard App(s)
>>>
>>> If you want to play with really fast analytics, then perhaps consider
>>> looking at Apache Ignite:
>>>
>>> https://ignite.apache.org
>>>
>>> It acts as a layer between Cassandra and the applications storing
>>> into Cassandra (an in-memory data grid, I think it is called).
>>> Basically, think of it as a big cache. It is an in-memory thing ☺
>>> And then you can run some super fast queries.
>>>
>>> -Tobias
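A minimal sketch of the kind of ingestion app Tobias describes above, consuming a Kafka topic and applying rows to a Kudu table. The broker, topic, table, column names, and the "id,value" payload format are all assumptions for illustration, not Qvantel's actual implementation.

    // Sketch: Kafka topic -> Kudu table, batching one flush per poll.
    import java.time.Duration
    import java.util.{Collections, Properties}
    import org.apache.kafka.clients.consumer.KafkaConsumer
    import org.apache.kudu.client.{KuduClient, SessionConfiguration}
    import scala.collection.JavaConverters._

    object KuduIngestionApp extends App {
      val props = new Properties()
      props.put("bootstrap.servers", "kafka-1:9092") // hypothetical broker
      props.put("group.id", "kudu-ingestion")
      props.put("key.deserializer",
        "org.apache.kafka.common.serialization.StringDeserializer")
      props.put("value.deserializer",
        "org.apache.kafka.common.serialization.StringDeserializer")

      val consumer = new KafkaConsumer[String, String](props)
      consumer.subscribe(Collections.singletonList("cassandra-events")) // hypothetical topic

      val client  = new KuduClient.KuduClientBuilder("kudu-master:7051").build()
      val table   = client.openTable("events") // hypothetical table
      val session = client.newSession()
      session.setFlushMode(SessionConfiguration.FlushMode.MANUAL_FLUSH)

      while (true) {
        for (rec <- consumer.poll(Duration.ofSeconds(1)).asScala) {
          // Assume "id,value" CSV payloads for the sketch.
          val Array(id, value) = rec.value.split(",", 2)
          val insert = table.newInsert()
          insert.getRow.addString("id", id)
          insert.getRow.addDouble("value", value.toDouble)
          session.apply(insert)
        }
        session.flush() // one batched flush per poll
      }
    }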
>>> From: DuyHai Doan <doanduy...@gmail.com>
>>> Date: Thursday, 8 June 2017 at 15:42
>>> To: Tobias Eriksson <tobias.eriks...@qvantel.com>
>>> Cc: 한 승호 <shha...@outlook.com>, "user@cassandra.apache.org"
>>> <user@cassandra.apache.org>
>>> Subject: Re: Cassandra & Spark
>>>
>>> Interesting.
>>>
>>> Tobias, when you said "Instead we transferred the data to Apache
>>> Kudu", did you transfer all the Cassandra data into Kudu in a single
>>> migration and then tap into Kudu for aggregation, or did you run a
>>> data import every day/week/month from Cassandra into Kudu?
>>>
>>> From my point of view, the difficulty is not having a static set of
>>> data and running aggregation on it; there are a lot of alternatives
>>> out there. The difficulty is being able to run analytics on a
>>> live/production/changing dataset, with all the data movement and
>>> updates that implies.
>>>
>>> Regards
>>>
>>> On Thu, Jun 8, 2017 at 3:37 PM, Tobias Eriksson
>>> <tobias.eriks...@qvantel.com> wrote:
>>>
>>> Hi,
>>>
>>> Something to consider before moving to Apache Spark and Cassandra.
>>>
>>> I have a background where we had tons of data in Cassandra, and we
>>> wanted to use Apache Spark to run various jobs. We loved what we
>>> could do with Spark, BUT…
>>>
>>> We realized soon that we wanted to run multiple jobs in parallel.
>>> Some jobs would take 30 minutes and some 45 seconds.
>>>
>>> Spark is by default arranged so that it will take up all the
>>> resources there are; this can be tweaked by using Mesos or YARN. But
>>> even with Mesos and YARN we found it complicated to run multiple
>>> jobs in parallel.
>>>
>>> So eventually we threw out Spark. Instead we transferred the data to
>>> Apache Kudu and ran our analysis on Kudu, and what a difference!
>>>
>>> "my two cents!"
>>>
>>> -Tobias
>>>
>>> From: 한 승호 <shha...@outlook.com>
>>> Date: Thursday, 8 June 2017 at 10:25
>>> To: "user@cassandra.apache.org" <user@cassandra.apache.org>
>>> Subject: Cassandra & Spark
>>>
>>> Hello,
>>>
>>> I am Seung-ho and I work as a data engineer in Korea. I need some
>>> advice.
>>>
>>> My company is recently considering replacing an RDBMS-based system
>>> with Cassandra and Hadoop. The purpose of this system is to analyze
>>> Cassandra and HDFS data with Spark.
>>>
>>> It seems many use cases put emphasis on data locality; for instance,
>>> Cassandra and the Spark executors should be on the same node.
>>>
>>> The thing is, my company's data analyst team wants to analyze
>>> heterogeneous data sources, Cassandra and HDFS, using Spark. So I
>>> wonder what the best practice for using Cassandra and Hadoop
>>> together would be in such a case.
>>>
>>> Plan A: Both HDFS and Cassandra with a NodeManager (Spark executor)
>>> on the same node
>>>
>>> Plan B: Cassandra + NodeManager and HDFS + NodeManager on separate
>>> nodes, but in the same cluster
>>>
>>> Which would be better or more correct, or is there a better way?
>>>
>>> I appreciate your advice in advance :)
>>>
>>> Best Regards,
>>> Seung-Ho Han
>>>
>>> Sent from Mail <https://go.microsoft.com/fwlink/?LinkId=550986> for
>>> Windows 10
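Since the whole thread turns on where to co-locate executors, here is a minimal sketch of the heterogeneous job Seung-ho describes: one Spark application reading Cassandra through the spark-cassandra-connector DataFrame source and reading HDFS Parquet directly. The keyspace, table, path, and join key are all made up; whichever of Plan A or Plan B is chosen only changes which scans get local reads, the code stays the same.

    // Sketch: one Spark job over both sources (all names are hypothetical).
    import org.apache.spark.sql.SparkSession

    object HeterogeneousAnalysis extends App {
      val spark = SparkSession.builder()
        .appName("cassandra-plus-hdfs")
        .config("spark.cassandra.connection.host", "10.0.0.1") // hypothetical seed node
        .getOrCreate()

      // Cassandra via the connector's DataFrame source.
      val events = spark.read
        .format("org.apache.spark.sql.cassandra")
        .options(Map("keyspace" -> "prod", "table" -> "events"))
        .load()

      // HDFS via a plain Parquet read.
      val users = spark.read.parquet("hdfs:///warehouse/users")

      // Locality is decided per scan: Cassandra partitions prefer replica
      // nodes, HDFS splits prefer DataNodes; the join shuffles either way.
      events.join(users, "user_id").groupBy("country").count().show()

      spark.stop()
    }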