From: user@cassandra.apache.org
Subject: Re: Spark and intermediate results
You can run spark against your Cassandra data directly without using a shared
filesystem.
https://github.com/datastax/spark-cassandra-connector
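For illustration, a minimal sketch of reading a table directly with that connector (the keyspace/table names and contact point are placeholders, not from the original thread):

    import org.apache.spark.{SparkConf, SparkContext}
    import com.datastax.spark.connector._ // brings cassandraTable() into scope

    object DirectRead {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("cassandra-direct-read")
          .set("spark.cassandra.connection.host", "127.0.0.1") // placeholder contact point
        val sc = new SparkContext(conf)

        // The RDD is built from Cassandra's own token ranges -- no shared
        // filesystem (HDFS or otherwise) is involved at any point.
        val rows = sc.cassandraTable("my_ks", "my_table") // placeholder names
        println(s"row count: ${rows.count()}")

        sc.stop()
      }
    }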
On Fri, Oct 9, 2015 at 6:09 AM Marcelo Valle
Hello,
I saw this nice link from an event:
http://www.datastax.com/dev/blog/zen-art-spark-maintenance
I would like to test using Spark to perform some operations on a column family,
I think there is a very important point in ScyllaDB - latency.
Performance can be an important requirement, but the fact that ScyllaDB is written
in C++ and uses lock-free algorithms inside means it should have lower latency
than Cassandra, which enables its use for a wider range of applications.
> When one partition's data is extremely large, will writes/reads become slow?
This is actually a good question. If a partition has nearly 2 billion rows, will
writes or reads get too slow? My understanding is that they shouldn't, as data is
indexed inside a partition, and when you read or write you are doing a
model, then investigate things like disk latency, noisy neighbours (if you are on VMs / in the cloud).
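To make the earlier point about large partitions concrete, here is a sketch (with a hypothetical table and invented names) of the kind of read that stays fast even inside a near-2-billion-row partition, because it is resolved through the partition's internal index rather than a scan:

    import com.datastax.driver.core.Cluster

    object WideRowRead extends App {
      val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
      val session = cluster.connect("my_ks") // hypothetical keyspace

      // Assumed schema: PRIMARY KEY (pk, ck) -- one partition key, one
      // clustering column. The read below seeks directly to one clustering
      // row inside the partition; it does not walk the other rows.
      val row = session.execute(
        "SELECT value FROM wide_table WHERE pk = ? AND ck = ?",
        "partition-1", java.lang.Long.valueOf(42L)
      ).one()
      println(row)

      cluster.close()
    }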
On 26 February 2015 at 03:01, Marcelo Valle (BLOOMBERG/ LONDON)
wrote:
I am sorry if it's too basic and you already looked at that, but the first
thing I would ask would be the data model.
What data model are you using (how is your data partitioned)? What queries are
you running? If you are using ALLOW FILTERING, for instance, it will be very
easy to say why it's slow.
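As an illustration of that point (table and column names are invented here), the difference between a keyed query and one that needs ALLOW FILTERING:

    import com.datastax.driver.core.Cluster

    object FilteringContrast extends App {
      val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
      val session = cluster.connect("my_ks") // hypothetical keyspace

      // Fast: restricted by the partition key, so it touches one partition.
      session.execute("SELECT * FROM events WHERE client_id = ?", "c-42")

      // Slow: no partition key restriction, so every replica scans every
      // partition. ALLOW FILTERING only suppresses the error; it does not
      // make the scan cheap.
      session.execute("SELECT * FROM events WHERE status = 'OPEN' ALLOW FILTERING")

      cluster.close()
    }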
On Feb 24, 2015 at 6:57 PM, Marcelo Valle (BLOOMBERG/ LONDON)
wrote:
Super column? Out of curiosity, which Cassandra version are you running?
From: user@cassandra.apache.org
Subject: Re: Cassandra Read Timeout
Hello
The structure is the same; the CFs are super column CFs, where the key is a long (a
timestamp to partition the index, so every 11 days a new row is created).
Indeed, I thought something odd could be happening to your cluster, but it
seems it's working fine; the request is simply taking too long to complete.
I noticed from your cfstats that the read count was about 10 in the first CF and
about 1000 in the second one... Would you be doing many more reads
Yulian,
Maybe other people have other clues, but I think that monitoring the
behavior in tpstats after the activity "Seeking to partition beginning in data
file" could help to find the problem. Which type of thread is getting stuck?
Do you see any number increasing continuously during the r
I will try it for sure Frens, very nice!
Thanks for sharing!
From: user@cassandra.apache.org
Subject: Re: PySpark and Cassandra integration
Hi all,
Wanted to let you know I've forked PySpark Cassandra on
https://github.com/TargetHolding/pyspark-cassandra. Unfortunately the original
code didn't
My two cents:
You could partition your data per date, and the second query would be easy.
If you need to query ALL the data for a client id it would be hard, but
querying the last 10 days for a client id would be easy, for instance.
If you need to query ALL, it would probably be better to create another
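As a sketch of that per-date layout (keyspace, table, and column names are invented for illustration), the "last 10 days for a client id" query becomes ten cheap single-partition reads:

    import java.time.LocalDate
    import scala.collection.JavaConverters._
    import com.datastax.driver.core.Cluster

    object LastTenDays extends App {
      val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
      val session = cluster.connect("my_ks") // hypothetical keyspace

      // Assumed schema: PRIMARY KEY ((client_id, day), event_time) --
      // one partition per client per day.
      val clientId = "c-42"
      val days = (0 until 10).map(i => LocalDate.now().minusDays(i.toLong).toString)

      // Ten cheap single-partition queries instead of one expensive scan.
      val rows = days.flatMap { day =>
        session.execute(
          "SELECT * FROM events_by_client_day WHERE client_id = ? AND day = ?",
          clientId, day
        ).all().asScala
      }
      println(s"fetched ${rows.size} rows")
      cluster.close()
    }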
Cassandra 2.1 comes with incremental repair, though I haven't read the details
myself:
http://www.datastax.com/documentation/cassandra/2.1/cassandra/operations/ops_repair_nodes_c.html
However, AFAIK, a full repair will rebuild all sstables, which is why you should
have more than 50% of disk space available.
Carlos Juzarte Rolo
Cassandra Consultant
Pythian - Love your data
rolo@pythian | Twitter: cjrolo | Linkedin: linkedin.com/in/carlosjuzarterolo
Tel: 1649
www.pythian.com
On Fri, Feb 13, 2015 at 3:05 PM, Marcelo Valle (BLOOMBERG/ LONDON)
wrote:
Actually, I am not the one looking for support, but I thank you a
Subject: Re: best supported spark connector for Cassandra
Of course, Stratio Deep and Stratio Cassandra are licensed Apache 2.0.
Regarding the Cassandra support, I can introduce you to someone in Stratio that
can help you.
2015-02-12 15:05 GMT+01:00 Marcelo Valle (BLOOMBERG/ LONDON):
Thanks for
There is no automatic indexing in Cassandra. There are secondary indexes, but
not for these cases.
You could use a solution like DSE to get data automatically indexed in Solr on
each node as soon as it arrives. Then you could run such a query against Solr.
If the query can be slow, you could run a M
On Feb 11, 2015 at 2:51 PM, Marcelo Valle (BLOOMBERG/ LONDON)
wrote:
I just finished a Scala course, so this is a nice exercise to check what I learned :D
Thanks for the answer!
From: user@cassandra.apache.org
Subject: Re: best supported spark connector for Cassandra
Start looking at the Spark/Cassandra connector
Thanks Jirka!
From: user@cassandra.apache.org
Subject: Re: How to speed up SELECT * query in Cassandra
Hi,
here are some snippets of code in Scala which should get you started.
Jirka H.
loop { lastRow => val query = lastRow match
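The snippet arrives truncated in the archive; below is a sketch of what I take the idea to be: paging through the whole table by restricting on token(partition_key) and feeding the last row of each page into the next query. The keyspace, table, and column names are my own placeholders, not from the original post.

    import scala.annotation.tailrec
    import scala.collection.JavaConverters._
    import com.datastax.driver.core.{Cluster, Row}

    object FullTableScan extends App {
      val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
      val session = cluster.connect("my_ks") // placeholder keyspace
      val pageSize = 1000

      def process(row: Row): Unit = println(row) // stand-in for real work

      @tailrec
      def loop(lastRow: Option[Row]): Unit = {
        val query = lastRow match {
          case None =>
            s"SELECT id, payload FROM my_table LIMIT $pageSize"
          case Some(row) =>
            // Resume strictly after the last token already seen.
            s"SELECT id, payload FROM my_table WHERE token(id) > token(${row.getLong("id")}) LIMIT $pageSize"
        }
        val page = session.execute(query).all().asScala
        page.foreach(process)
        if (page.size == pageSize) loop(Some(page.last))
      }

      loop(None)
      cluster.close()
    }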
-L336
Start digging from this all the way down the code.
As for Stratio Deep, I can't tell how they did the integration with Spark. Take
some time to dig through their code to understand the logic.
On Wed, Feb 11, 2015 at 2:25 PM, Marcelo Valle (BLOOMBERG/ LONDON)
wrote:
Taking the opportunity that Spark was being discussed in another thread, I decided
to start a new one, as I have an interest in using Spark + Cassandra in the future.
About 3 years ago, Spark was not an existing option and we tried to use Hadoop
to process Cassandra data. My experience was horrible and
Cassandra's distributed nature vs partitioning
data on Hadoop makes Spark on HDFS actually faster than on Cassandra
--
Colin Clark
+1 612 859 6129
Skype colin.p.clark
On Feb 11, 2015, at 4:49 AM, Jens Rantil wrote:
On Wed, Feb 11, 2015 at 11:40 AM, Marcelo Valle (BLOOMBERG/ LONDON)
Look for the message "Re: Fastest way to map/parallel read all values in a
table?" in the mailing list; it was recently discussed. You can have several
parallel processes, each one reading a slice of the data, by splitting min/max
murmur3 hash ranges.
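A sketch of the range-splitting itself (the Murmur3Partitioner's token space is the signed 64-bit range; the table/key names in the comment are placeholders):

    object TokenSlices {
      // The Murmur3Partitioner maps keys onto signed 64-bit tokens.
      private val MinToken = BigInt(Long.MinValue)
      private val MaxToken = BigInt(Long.MaxValue)

      // Split the full token range into `n` contiguous, non-overlapping slices.
      def slices(n: Int): Seq[(Long, Long)] = {
        val span = MaxToken - MinToken + 1
        (0 until n).map { i =>
          val lo = MinToken + span * i / n
          val hi = MinToken + span * (i + 1) / n - 1
          (lo.toLong, hi.toLong)
        }
      }
    }

    // Each worker then scans its own slice (placeholder table/key):
    //   SELECT * FROM my_table WHERE token(id) >= ? AND token(id) <= ?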
In the company I used to work we developed a
AFAIK, if you were using RF 3 in a 3-node cluster, all your nodes had all
your data.
When the number of nodes started to grow, this assumption stopped being true.
I think Cassandra will scale linearly from 9 nodes on, but comparing against a
situation where all your nodes hold all your data is not re
Just for the record, I was doing the exact same thing in an internal
application at the start-up I used to work for. We had the need to write
custom code to process all rows of a column family in parallel. Normally we would
use Spark for the job, but in our case the logic was a little more compl
ively, they won't see the changes for 20 to 50ms, unless they know to
read the details for that exact alert.
On Wed, Feb 4, 2015 at 11:57 AM, Marcelo Valle (BLOOMBERG/ LONDON)
wrote:
I don't want to optimize for reads or writes, I want to optimize for having the
smallest gap possible
How often do you expect to update alerts? How often do you expect to
read the alerts? I suspect you'll be doing 100x more reads (or more), in which
case optimizing for reads is definitely the right choice.
On Wed, Feb 4, 2015 at 9:50 AM, Marcelo Valle (BLOOMBERG/ LONDON)
wrote:
Hello everyone,
I am thin
the user
again, I will read the worst case very often.
Chris
On Wed, Feb 4, 2015 at 9:33 AM, Marcelo Valle (BLOOMBERG/ LONDON)
wrote:
> The data model lgtm. You may need to balance the size of the time buckets
> with the amount of alarms to prevent partitions from getting too large.
Hello everyone,
I am thinking about the architecture of my application using Cassandra and I am
asking myself if I should or shouldn't normalize an entity.
I have users and alerts in my application and for each user, several alerts.
The first model which came into my mind was creating an "alert
> The data model lgtm. You may need to balance the size of the time buckets
> with the amount of alarms to prevent partitions from getting too large.
1 month may be a little large; I would aim to keep the partitions below 25 MB
(you can check with nodetool cfstats) or so in size to keep everything happy.
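For concreteness, one hypothetical shape of such a bucketed table (all names are invented here), sized so each (user, bucket) partition stays small:

    import com.datastax.driver.core.Cluster

    object AlertsSchema extends App {
      val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
      val session = cluster.connect("my_ks") // placeholder keyspace

      // One partition per user per time bucket; shrink the bucket (e.g. from
      // one month to one week) if nodetool cfstats shows partitions > ~25 MB.
      session.execute(
        """CREATE TABLE IF NOT EXISTS alerts_by_user (
          |  user_id  uuid,
          |  bucket   text,     -- e.g. '2015-02' for monthly buckets
          |  alert_id timeuuid,
          |  payload  text,
          |  PRIMARY KEY ((user_id, bucket), alert_id)
          |) WITH CLUSTERING ORDER BY (alert_id DESC)""".stripMargin)

      cluster.close()
    }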