Nope, the Spark Cassandra connector leverages data locality and gets
tremendous improvements from it.
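As a rough sketch of what that looks like (pyspark with the DataStax spark-cassandra-connector on the classpath; the keyspace, table, and contact host below are hypothetical):

```python
def cassandra_options(keyspace: str, table: str) -> dict:
    """Pure helper: the option map handed to the Cassandra data source."""
    return {"keyspace": keyspace, "table": table}

def read_cassandra_table(keyspace: str, table: str):
    """Read a Cassandra table through the connector. When executors run on
    the Cassandra nodes themselves, the connector maps Spark partitions onto
    token ranges so each executor mostly reads node-local data."""
    # pyspark (plus the connector jar) is assumed; import kept local so the
    # sketch loads without it.
    from pyspark.sql import SparkSession
    spark = (SparkSession.builder
             .appName("locality-sketch")
             .config("spark.cassandra.connection.host", "127.0.0.1")
             .getOrCreate())
    return (spark.read
            .format("org.apache.spark.sql.cassandra")
            .options(**cassandra_options(keyspace, table))
            .load())
```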


- Affan


On Sat, Aug 25, 2018 at 11:25 AM CharSyam <chars...@gmail.com> wrote:

> Spark can read HDFS directly, so locality is important there, but Spark
> can't read Cassandra data directly; it can only connect through an API. So
> I think you don't need to install them on the same node.
>
> On Sat, Aug 25, 2018 at 3:16 PM, Affan Syed <as...@an10.io> wrote:
>
>> Tobias,
>>
>> This is very interesting. Can I inquire a bit more on why you have both
>> C* and Kudu in the system?
>>
>> Wouldn't keeping just Kudu work (that was its initial purpose)? Is it
>> something to do with its production readiness? I ask because we have a
>> similar concern as well.
>>
>> Finally, how are your dashboard apps talking to Kudu? Is there a backend
>> that talks via Impala, or do you have some calls to bash-level scripts
>> communicating over some file system?
>>
>>
>>
>> - Affan
>>
>>
>> On Thu, Jun 8, 2017 at 7:01 PM Tobias Eriksson <
>> tobias.eriks...@qvantel.com> wrote:
>>
>>> Hi
>>>
>>> What I wanted was a dashboard with graphs/diagrams and it should not
>>> take minutes for the page to load
>>>
>>> Thus, the problem with Spark on Cassandra was that we could not
>>> parallelize the work to such an extent that the diagrams would render in
>>> seconds.
>>>
>>> Now with Kudu we get some decent results rendering the diagrams/graphs
>>>
>>>
>>>
>>> The way we transfer data from Cassandra (the production system storage) to
>>> Kudu is through an Apache Kafka topic (or many topics, actually), and then
>>> we have an application which ingests the data into Kudu
>>>
>>>
>>>
>>>
>>>
>>> Other Systems --> Domain Storage App(s) --> Cassandra --> Kafka -->
>>> Kudu Ingestion App --> Kudu <-- Dashboard App(s)
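A minimal sketch of such a Kafka-to-Kudu ingestion app (Python; the kafka-python and kudu-python client libraries are assumed, and the topic, table, and field names are hypothetical):

```python
import json

def decode_event(raw: bytes) -> dict:
    """Pure helper: one Kafka message value (JSON bytes) -> a Kudu row dict.
    The field names are made up for the sketch."""
    event = json.loads(raw)
    return {"id": int(event["id"]),
            "metric": event["metric"],
            "value": float(event["value"])}

def run_ingestion(brokers: list, topic: str, kudu_master: str) -> None:
    """Consume events from a Kafka topic and apply them to a Kudu table."""
    # Third-party clients are assumed; imports kept local to the sketch.
    from kafka import KafkaConsumer
    import kudu
    client = kudu.connect(host=kudu_master, port=7051)
    table = client.table("events")            # hypothetical table name
    session = client.new_session()
    consumer = KafkaConsumer(topic, bootstrap_servers=brokers)
    for message in consumer:
        session.apply(table.new_insert(decode_event(message.value)))
        session.flush()                       # batch flushes in real code
```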
>>>
>>>
>>>
>>>
>>>
>>> If you want to play with really fast analytics then perhaps consider
>>> looking at Apache Ignite
>>>
>>> https://ignite.apache.org
>>>
>>> which then acts as a layer between Cassandra and your applications
>>> storing into Cassandra (an in-memory data grid, I think it is called)
>>>
>>> Basically, think of it as a big cache
>>>
>>> It is an in-memory thingy ☺
>>>
>>> And then you can run some super fast queries
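A tiny sketch of that cache idea (Python with pyignite, the Ignite thin client; the cache name and key scheme are hypothetical):

```python
def cache_key(customer_id: int, metric: str) -> str:
    """Pure helper: a flat string key for one cached value."""
    return f"{customer_id}:{metric}"

def warm_and_query(rows):
    """Push precomputed values into an Ignite cache, then read one back."""
    # pyignite is assumed installed, with an Ignite node listening locally.
    from pyignite import Client
    client = Client()
    client.connect("127.0.0.1", 10800)        # default thin-client port
    cache = client.get_or_create_cache("readings")
    for customer_id, metric, value in rows:
        cache.put(cache_key(customer_id, metric), value)
    return cache.get(cache_key(rows[0][0], rows[0][1]))
```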
>>>
>>>
>>>
>>> -Tobias
>>>
>>>
>>>
>>> *From: *DuyHai Doan <doanduy...@gmail.com>
>>> *Date: *Thursday, 8 June 2017 at 15:42
>>> *To: *Tobias Eriksson <tobias.eriks...@qvantel.com>
>>> *Cc: *한 승호 <shha...@outlook.com>, "user@cassandra.apache.org" <
>>> user@cassandra.apache.org>
>>> *Subject: *Re: Cassandra & Spark
>>>
>>>
>>>
>>> Interesting
>>>
>>>
>>>
>>> Tobias, when you said "Instead we transferred the data to Apache Kudu",
>>> did you transfer all Cassandra data into Kudu with a single migration and
>>> then tap into Kudu for aggregation, or did you run a data import every
>>> day/week/month from Cassandra into Kudu?
>>>
>>>
>>>
>>> From my point of view, the difficulty is not having a static set of data
>>> and running aggregation on it; there are a lot of alternatives out there.
>>> The difficulty is being able to run analytics on a live/production/changing
>>> dataset, with all the data movement & updates that implies.
>>>
>>>
>>>
>>> Regards
>>>
>>>
>>>
>>> On Thu, Jun 8, 2017 at 3:37 PM, Tobias Eriksson <
>>> tobias.eriks...@qvantel.com> wrote:
>>>
>>> Hi
>>>
>>> Something to consider before moving to Apache Spark and Cassandra
>>>
>>> I have a background where we have tons of data in Cassandra, and we
>>> wanted to use Apache Spark to run various jobs
>>>
>>> We loved what we could do with Spark, BUT….
>>>
>>> We realized soon that we wanted to run multiple jobs in parallel
>>>
>>> Some jobs would take 30 minutes and some 45 seconds
>>>
>>> Spark is by default arranged so that it will take up all the resources
>>> there are; this can be tweaked by using Mesos or YARN.
>>>
>>> But even with Mesos and YARN we found it complicated to run multiple
>>> jobs in parallel.
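For reference, a sketch of the kind of knobs involved (pyspark; the property values are illustrative, not recommendations):

```python
def shared_cluster_conf(max_cores: int, executor_mem: str) -> dict:
    """Pure helper: Spark properties that cap one job's footprint so
    several jobs can coexist on the same cluster."""
    return {
        "spark.cores.max": str(max_cores),    # hard cap (standalone/Mesos)
        "spark.executor.memory": executor_mem,
        "spark.scheduler.mode": "FAIR",       # fair sharing within an app
    }

def build_session(app_name: str):
    """Build a SparkSession constrained by the properties above."""
    # pyspark is assumed installed; import kept local to the sketch.
    from pyspark.sql import SparkSession
    builder = SparkSession.builder.appName(app_name)
    for key, value in shared_cluster_conf(4, "2g").items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```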
>>>
>>> So eventually we ended up throwing out Spark,
>>>
>>> Instead we transferred the data to Apache Kudu, and then we ran our
>>> analysis on Kudu, and what a difference !
>>>
>>> “my two cents!”
>>>
>>> -Tobias
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> *From: *한 승호 <shha...@outlook.com>
>>> *Date: *Thursday, 8 June 2017 at 10:25
>>> *To: *"user@cassandra.apache.org" <user@cassandra.apache.org>
>>> *Subject: *Cassandra & Spark
>>>
>>>
>>>
>>> Hello,
>>>
>>>
>>>
>>> I am Seung-ho and I work as a Data Engineer in Korea. I need some advice.
>>>
>>>
>>>
>>> My company is currently considering replacing an RDBMS-based system with
>>> Cassandra and Hadoop.
>>>
>>> The purpose of this system is to analyze Cassandra and HDFS data with
>>> Spark.
>>>
>>>
>>>
>>> It seems many use cases put emphasis on data locality; for instance,
>>> both Cassandra and the Spark executors should be on the same node.
>>>
>>>
>>>
>>> The thing is, my company's data analyst team wants to analyze
>>> heterogeneous data sources, Cassandra and HDFS, using Spark.
>>>
>>> So, I wonder what the best practices would be for using Cassandra and
>>> Hadoop in such a case.
>>>
>>>
>>>
>>> Plan A: Both HDFS and Cassandra, with a NodeManager (Spark executor), on
>>> the same node
>>>
>>>
>>>
>>> Plan B: Cassandra + NodeManager and HDFS + NodeManager on separate nodes
>>> within the same cluster
>>>
>>>
>>>
>>>
>>>
>>> Which would be better or more correct, or is there a better way?
>>>
>>>
>>>
>>> I appreciate your advice in advance :)
>>>
>>>
>>>
>>> Best Regards,
>>>
>>> Seung-Ho Han
>>>
>>>
>>>
>>>
>>>
>>> Sent from Mail for Windows 10 <https://go.microsoft.com/fwlink/?LinkId=550986>
>>>
>>>
>>>
>>>
>>>
>>