Re: Upgrade from 2.1 to 3.11

2018-08-24 Thread Mohamadreza Rostami
You have a very large heap; most of the CPU time is being spent in the GC
stage. You should cap the heap at 12 GB and enable the row cache so your
cluster becomes faster.
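For reference, a sketch of where those two knobs live in Cassandra 3.11 (the 12 GB figure is the poster's suggestion, not a universal rule; values here are illustrative):

```shell
# conf/jvm.options (or MAX_HEAP_SIZE in conf/cassandra-env.sh):
# a fixed heap avoids resize pauses; CMS heaps much above ~16 GB
# tend to produce long GC pauses
-Xms12G
-Xmx12G

# conf/cassandra.yaml: the row cache is off by default
# row_cache_size_in_mb: 1024

# the row cache is also opt-in per table:
# ALTER TABLE ks.tbl WITH caching = {'keys': 'ALL', 'rows_per_partition': '100'};
```

Note the row cache only helps read-heavy tables with hot rows; for other workloads it can waste memory.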

On Friday, 24 August 2018, Mun Dega  wrote:

> 120G data
> 28G heap out of 48 on system
> 9 node cluster, RF3
>
>
> On Thu, Aug 23, 2018, 17:19 Mohamadreza Rostami <
> mohamadrezarosta...@gmail.com> wrote:
>
>> Hi,
>> How much data do you have? How much RAM do your servers have? How much do
>> you have a heep?
>> On Thu, Aug 23, 2018 at 10:14 PM Mun Dega  wrote:
>>
>>> Hello,
>>>
>>> We recently upgraded one cluster from Cassandra 2.1 to 3.11.2.  The
>>> process went OK, including upgradesstables, but afterwards we started to
>>> experience high read/write latency, occasional OOMs, and long GC pauses.
>>>
>>> For the same cluster on 2.1 we didn't have any issues like this.  We
>>> also kept the server specs, heap, and all other settings the same after
>>> the upgrade.
>>>
>>> Has anyone else had similar issues going to 3.11, and what major changes
>>> could cause such a setback in the new version?
>>>
>>> Ma Dega
>>>
>>


Re: Tombstone experience

2018-08-24 Thread Rahul Singh
Thanks! Great tips on clearing tombstones. The TTL-vs-business-rules challenge
is one we've seen in enterprises moving from relational to non-relational,
because no thought is given to planning a data retention policy.

Periodic business-rule-based cleaning via Spark works well if you use it to
set a short TTL on data you would otherwise have deleted; that data then
eventually clears out, depending on the TTL value you set. For those cases
where you must do business-rule deletions, my suggestion is a continuous
Spark job / Spark Streaming on another DC to maintain data hygiene.

Rahul Singh
Chief Executive Officer
m 202.905.2818

Anant Corporation
1010 Wisconsin Ave NW, Suite 250
Washington, D.C. 20007

We build and manage digital business technology platforms.
On Aug 24, 2018, 1:46 AM -0400, Charulata Sharma (charshar) 
, wrote:
> Hi All,
>
>    I have shared my experience of tombstone clearing in this blog post.
> Sharing it in this forum for wider distribution.
>
> https://medium.com/cassandra-tombstones-clearing-use-case/the-curios-case-of-tombstones-d897f681a378
>
>
> Thanks,
> Charu


Re: A blog about Cassandra in the IoT arena

2018-08-24 Thread DuyHai Doan
No, what I meant by infinite partitions is not auto sub-partitioning, even at
the server side. Ideally Cassandra should be able to support infinite
partition sizes and make compaction, repair and streaming of such partitions
manageable:

 - compaction: find a way to iterate super-efficiently through the whole
partition and merge-sort all SSTables containing data of the same
partition.

 - repair: find another approach than Merkle trees, because their resolution
is not granular enough. Ideally repair resolution should be at the clustering
level, or every xxx clustering values.

 - streaming: same idea as repair: in case of error/disconnection the
stream should be resumed at the latest clustering-level checkpoint, or at
least we should checkpoint every xxx clustering values.

 - partition index: find a way to index huge partitions efficiently. Right
now a huge partition has a dramatic impact on the partition index. The work
of Michael Kjellman on birch indices (CASSANDRA-9754) is going in the right
direction.

Regarding tombstones, there is a recent research paper on DottedDB and an
attempt to implement deletes without using tombstones:
http://haslab.uminho.pt/tome/files/dotteddb_srds.pdf



On Fri, Aug 24, 2018 at 12:38 AM, Rahul Singh 
wrote:

> Agreed. One of the ideas I had on partition size is to automatically and
> synthetically shard based on some basic patterns seen in the data.
>
> It could be implemented as a tool that would create a new table with an
> additional key component holding an automatically created shard, or it
> would use an existing key and then migrate the data.
>
> The internal automatic shard would adjust as needed and keep
> "subpartitions" or "rowsets", but return the full partition given some
> special CQL.
>
> This is done today at the data-access layer and in the data-model design,
> but it's pretty much a step-by-step process that could be done
> algorithmically.
>
> Regarding tombstones: maybe we need another thread dedicated to cleaning
> tombstones, separate from compaction. Depending on the amount of
> tombstones and a threshold, it would be dedicated to deletion. It may be
> an edge case, but people face issues with tombstones all the time because
> they don't know better.
>
> Rahul
> On Aug 23, 2018, 11:50 AM -0500, DuyHai Doan ,
> wrote:
>
> As I used to tell some people, the day we make :
>
> 1. partition size unlimited, or at least huge partition easily manageable
> (compaction, repair, streaming, partition index file)
> 2. tombstone a non-issue
>
> that day, Cassandra will dominate any other IoT technology out there
>
> Until then ...
>
> On Thu, Aug 23, 2018 at 4:54 PM, Rahul Singh  > wrote:
>
>> Good analysis of how the different key structures affect use cases and
>> performance. I think you could extend this article with potential
>> evaluation of FiloDB which specifically tries to solve the OLAP issue with
>> arbitrary queries.
>>
>> Another option is leveraging Elassandra (an Elasticsearch index
>> co-located with C*) or DataStax Enterprise (a Solr index co-located with
>> C*).
>>
>> I personally haven’t used SnappyData, but that’s another Spark-based DB
>> that could be leveraged for performant real-time queries on the OLTP side.
>>
>> Rahul
>> On Aug 23, 2018, 2:48 AM -0500, Affan Syed , wrote:
>>
>> Hi,
>>
>> we wrote a blog about some of the results that engineers from AN10 shared
>> earlier.
>>
>> I am sharing it here for greater comments and discussions.
>>
>> http://www.an10.io/technology/cassandra-and-iot-queries-are-
>> they-a-good-match/
>>
>>
>> Thank you.
>>
>>
>>
>> - Affan
>>
>>
>


data not deleted in data dir after keyspace dropped

2018-08-24 Thread Vitaliy Semochkin
Hi,
I'm using cassandra 3.11
When I drop a keyspace, its data is not deleted from the data dirs in the
cluster. What additional steps are needed to make the cluster nodes delete
the data from disk?

Regards,
Vitaliy

-
To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
For additional commands, e-mail: user-h...@cassandra.apache.org



benefits of HBase over Cassandra

2018-08-24 Thread Vitaliy Semochkin
Hi,

I read that Facebook once chose HBase over Cassandra for its Messenger,
but I never found out what the benefits of HBase over Cassandra are.
Can someone list them, if there are any?

Regards,
Vitaliy

-
To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
For additional commands, e-mail: user-h...@cassandra.apache.org



speeding up cassandra-unit startup

2018-08-24 Thread Vitaliy Semochkin
Hi,

I'm using cassandra-unit for integration tests,
which uses a regular cassandra.yaml to create a Cassandra instance.

Which parameters are recommended to change in order to speed up the
startup process?

Regards
Vitaliy
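A sketch of overrides commonly used for throwaway single-node test instances (illustrative values and my own assumptions, not benchmarks; verify against your cassandra-unit version):

```shell
# cassandra.yaml overrides for a disposable test instance
num_tokens: 1                 # skip generating 256 vnode tokens at startup
key_cache_size_in_mb: 0       # no point warming caches in tests
concurrent_reads: 2           # test workloads are tiny
concurrent_writes: 2

# JVM side: a small fixed heap keeps startup sizing and GC init short
# -Xms256M -Xmx256M
```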

-
To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
For additional commands, e-mail: user-h...@cassandra.apache.org



cqlsh --request-timeout=3600 doesn't seem to work

2018-08-24 Thread Vitaliy Semochkin
Hi,

I'm running a count query on a very small table (fewer than 1,000,000 records).
When the record count reaches 800,000 I receive a read-timeout
error in cqlsh.
I tried running cqlsh with --request-timeout=3600, but I receive the same
error. What should I do to avoid the timeout exception?

Regards,
Vitaliy

-
To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
For additional commands, e-mail: user-h...@cassandra.apache.org



Re: cqlsh --request-timeout=3600 doesn't seem to work

2018-08-24 Thread Pranay akula
You should change read_request_timeout_in_ms in the cassandra.yaml file.

The default is 5 seconds.

But doing counts in Cassandra is not recommended; it's better to avoid
them if you can.
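For completeness, cqlsh's --request-timeout is only the client-side timeout; the server enforces its own limits in cassandra.yaml. An unbounded COUNT(*) is a full range scan, so (assuming Cassandra 3.x setting names) the relevant knobs look roughly like:

```shell
# client side: keep cqlsh itself from giving up (value in seconds)
cqlsh --request-timeout=3600

# server side (cassandra.yaml, values in milliseconds); COUNT(*) over a
# whole table is a range read, so range_request_timeout_in_ms is usually
# the limit that fires, not read_request_timeout_in_ms:
# read_request_timeout_in_ms: 5000
# range_request_timeout_in_ms: 10000
```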


On Fri, Aug 24, 2018, 4:06 PM Vitaliy Semochkin 
wrote:



why is the returned achievedConsistencyLevel null

2018-08-24 Thread Vitaliy Semochkin
Hi,

While using the DataStax driver,
session.execute("some insert query").getExecutionInfo().getAchievedConsistencyLevel()
returns null even though the data is stored. Why could that be?

Is it possible to make the DataStax driver throw an exception when the
desired consistency level was not achieved during an insert?

Regards,
Vitaliy

-
To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
For additional commands, e-mail: user-h...@cassandra.apache.org



Re: data not deleted in data dir after keyspace dropped

2018-08-24 Thread Vineet G H
It takes a while for the drop to propagate through the cluster; this depends
on the amount of data and the network traffic between your storage nodes.
On Fri, Aug 24, 2018 at 1:54 PM Vitaliy Semochkin  wrote:

-
To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
For additional commands, e-mail: user-h...@cassandra.apache.org



Re: cqlsh --request-timeout=3600 doesn't seem to work

2018-08-24 Thread Vitaliy Semochkin
Thank you for the fast reply, Pranay!
This is a testing environment, and using count on it will do no harm.

On Sat, Aug 25, 2018 at 12:11 AM Pranay akula
 wrote:

-
To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
For additional commands, e-mail: user-h...@cassandra.apache.org



Re: data not deleted in data dir after keyspace dropped

2018-08-24 Thread Pranay akula
Cassandra creates a snapshot when you drop a keyspace, so you should run
nodetool clearsnapshot on all nodes to reclaim your space.
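A sketch of the cleanup (default data path assumed; the keyspace name is a placeholder):

```shell
# the drop leaves an auto-snapshot behind under each table's data dir:
ls /var/lib/cassandra/data/<keyspace>/*/snapshots/

# reclaim the space -- this must be run on every node; with a keyspace
# argument clearsnapshot removes that keyspace's snapshots, with no
# arguments it removes all snapshots on the node
nodetool clearsnapshot <keyspace>
nodetool clearsnapshot
```

If you never want these safety snapshots, `auto_snapshot: false` in cassandra.yaml disables them, at the cost of losing the undo option after an accidental DROP.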



On Fri, Aug 24, 2018, 4:14 PM Vineet G H  wrote:



Re: benefits of HBase over Cassandra

2018-08-24 Thread dinesh.jo...@yahoo.com.INVALID
I've worked with both databases. They're suitable for different use-cases. If 
you look at the CAP theorem; HBase is CP while Cassandra is a AP. If we talk 
about a specific use-case, it'll be easier to discuss.
Dinesh 

On Friday, August 24, 2018, 1:56:31 PM PDT, Vitaliy Semochkin wrote:

Re: benefits of HBase over Cassandra

2018-08-24 Thread Elliott Sims
At the time Facebook chose HBase, Cassandra was drastically less mature than
it is now, and I think the original creators had already left. There were
already various Hadoop variants running for data analytics etc., so lots of
operational and engineering experience around it was available. So it's
probably not a useful example for a technical comparison between current
HBase and current Cassandra. Also, FB has since abandoned HBase for
Messenger in favor of MyRocks.

On Fri, Aug 24, 2018 at 5:43 PM, dinesh.jo...@yahoo.com.INVALID <
dinesh.jo...@yahoo.com.invalid> wrote:

>


Re: data not deleted in data dir after keyspace dropped

2018-08-24 Thread Vitaliy Semochkin
Thank you very much for the fast reply, Vineet!

Is there any way to speed up this process, or to manually trigger
something analogous to VACUUM FULL in PostgreSQL?
On Sat, Aug 25, 2018 at 12:14 AM Vineet G H  wrote:

-
To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
For additional commands, e-mail: user-h...@cassandra.apache.org



Re: benefits of HBase over Cassandra

2018-08-24 Thread Vitaliy Semochkin
Thank you very much for the fast reply, Dinesh!
I was under the impression that with tunable consistency Cassandra can
act as CP when needed, e.g. by setting ALL on both reads and writes.
Do you agree with this statement?

PS Are there any other benefits of HBase you have found? I'd be glad
to hear a list of use cases.
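For reference, the setting being described is per-request or per-session, e.g. in cqlsh (ks.tbl is a placeholder):

```shell
cqlsh
cqlsh> CONSISTENCY ALL;   -- session-wide: applies to both reads and writes
cqlsh> INSERT INTO ks.tbl (id, val) VALUES (1, 'x');
cqlsh> SELECT val FROM ks.tbl WHERE id = 1;
```

In practice, QUORUM on both reads and writes gives the same read-your-writes overlap (R + W > RF) while still tolerating a down replica, which ALL does not; and even then, Cassandra does not provide HBase-style single-writer linearizability without lightweight transactions.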



On Sat, Aug 25, 2018 at 12:44 AM dinesh.jo...@yahoo.com.INVALID
 wrote:

-
To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
For additional commands, e-mail: user-h...@cassandra.apache.org



Re: data not deleted in data dir after keyspace dropped

2018-08-24 Thread Vitaliy Semochkin
Thank you very much Pranay, that was exactly what I needed!
On Sat, Aug 25, 2018 at 12:17 AM Pranay akula
 wrote:

-
To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
For additional commands, e-mail: user-h...@cassandra.apache.org



Re: Cassandra & Spark

2018-08-24 Thread Affan Syed
Tobias,

This is very interesting. Can I inquire a bit more into why you have both C*
and Kudu in the system?

Wouldn't keeping just Kudu work (that was its initial purpose)? Is there
something to do with its production readiness? I ask because we have a
similar concern.

Finally, how do your dashboard apps talk to Kudu? Is there a backend
that talks via Impala, or do you have calls to bash-level scripts
communicating over some file system?



- Affan


On Thu, Jun 8, 2017 at 7:01 PM Tobias Eriksson 
wrote:

> Hi
>
> What I wanted was a dashboard with graphs/diagrams and it should not take
> minutes for the page to load
>
> Thus, it was a problem to have Spark with Cassandra, and not solving the
> parallelization to such an extent that I could have the diagrams rendered
> in seconds.
>
> Now with Kudu we get some decent results rendering the diagrams/graphs
>
>
>
> The way we transfer data from Cassandra which is the Production system
> storage to Kudu, is through an Apache Kafka topic (or many topics actually)
> and then we have an application which ingests the data into Kudu
>
>
>
>
>
> Other Systems -- > Domain Storage App(s) -- > Cassandra -- > KAFKA -- >
> KuduIngestion App -- > Kudu < -- Dashboard App(s)
>
>
>
>
>
> If you want to play with really fast analytics then perhaps consider
> looking at Apache Ignite
>
> https://ignite.apache.org
>
> which then acts as a layer between Cassandra and the applications storing
> into Cassandra (an in-memory data grid, I think it is called)
>
> Basically, think of it as a big cache
>
> It is an in-memory thingi ☺
>
> And then you can run some super fast queries
>
>
>
> -Tobias
>
>
>
> *From: *DuyHai Doan 
> *Date: *Thursday, 8 June 2017 at 15:42
> *To: *Tobias Eriksson 
> *Cc: *한 승호 , "user@cassandra.apache.org" <
> user@cassandra.apache.org>
> *Subject: *Re: Cassandra & Spark
>
>
>
> Interesting
>
>
>
> Tobias, when you said "Instead we transferred the data to Apache Kudu",
> did you transfer all the Cassandra data into Kudu with a single migration
> and then tap into Kudu for aggregation, or did you run a data import every
> day/week/month from Cassandra into Kudu?
>
>
>
> From my point of view, the difficulty is not to have a static set of data
> and run aggregation on it, there are a lot of alternatives out there. The
> difficulty is to be able to run analytics on a live/production/changing
> dataset with all the data movement & update that it implies.
>
>
>
> Regards
>
>
>
> On Thu, Jun 8, 2017 at 3:37 PM, Tobias Eriksson <
> tobias.eriks...@qvantel.com> wrote:
>
> Hi
>
> Something to consider before moving to Apache Spark and Cassandra
>
> I have a background where we have tons of data in Cassandra, and we wanted
> to use Apache Spark to run various jobs
>
> We loved what we could do with Spark, BUT….
>
> We realized soon that we wanted to run multiple jobs in parallel
>
> Some jobs would take 30 minutes and some 45 seconds
>
> Spark is by default arranged so that it will take up all the resources
> there is, this can be tweaked by using Mesos or Yarn
>
> But even with Mesos and Yarn we found it complicated to run multiple jobs
> in parallel.
>
> So eventually we ended up throwing out Spark,
>
> Instead we transferred the data to Apache Kudu, and then we ran our
> analysis on Kudu, and what a difference !
>
> “my two cents!”
>
> -Tobias
>
>
>
>
>
>
>
> *From: *한 승호 
> *Date: *Thursday, 8 June 2017 at 10:25
> *To: *"user@cassandra.apache.org" 
> *Subject: *Cassandra & Spark
>
>
>
> Hello,
>
>
>
> I am Seung-ho and I work as a Data Engineer in Korea. I need some advice.
>
>
>
> My company is recently considering replacing an RDBMS-based system with
> Cassandra and Hadoop.
>
> The purpose of this system is to analyze Cassandra and HDFS data with
> Spark.
>
>
>
> It seems many use cases put emphasis on data locality; for instance, both
> the Cassandra node and the Spark executor should be on the same node.
>
>
>
> The thing is, my company's data analyst team wants to analyze
> heterogeneous data sources, Cassandra and HDFS, using Spark.
>
> So I wonder what the best practices are for using Cassandra and
> Hadoop in such a case.
>
>
>
> Plan A: Both HDFS and Cassandra with NodeManager(Spark Executor) on the
> same node
>
>
>
> Plan B: Cassandra + Node Manager / HDFS + NodeManager in each node
> separately but the same cluster
>
>
>
>
>
> Which would be better or more correct, or is there a better way?
>
>
>
> I appreciate your advice in advance :)
>
>
>
> Best Regards,
>
> Seung-Ho Han
>
>
>
>
>
> Sent from Mail for Windows 10
>
>
>
>
>


Re: Cassandra & Spark

2018-08-24 Thread CharSyam
Spark can read HDFS directly, so locality matters there, but Spark can't read
Cassandra's data files directly; it can only connect through the API. So I
think you don't need to install them on the same node.

On Sat, Aug 25, 2018 at 3:16 PM, Affan Syed wrote:


Re: Cassandra & Spark

2018-08-24 Thread Affan Syed
Nope, the Spark Cassandra Connector leverages data locality and gets
tremendous improvements from it.
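A sketch of the locality-aware path being referred to (Spark Cassandra Connector; `com.example.MyJob` and the jar name are placeholders, and exact behavior depends on the connector version):

```shell
# run Spark executors on the Cassandra nodes and point the connector at
# the local ring; the connector maps Spark partitions onto C* token
# ranges, so each executor mostly reads ranges owned by its own node
spark-submit \
  --conf spark.cassandra.connection.host=127.0.0.1 \
  --class com.example.MyJob my-job.jar
```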


- Affan


On Sat, Aug 25, 2018 at 11:25 AM CharSyam  wrote:
