Two problems with Cassandra

2015-02-11 Thread Pavel Velikhov
Hi, I’m using Cassandra to store NLP data, the dataset is not that huge (about 1TB), but I need to iterate over it quite frequently, updating the full dataset (each record, but not necessarily each column). I’ve run into two problems (I’m using the latest Cassandra): 1. I was trying to c

How to speed up SELECT * query in Cassandra

2015-02-11 Thread Ja Sam
Is there a simple way (or even a complicated one) to speed up a SELECT * FROM [table] query? I need to get all rows from one table every day. I split the tables and create one for each day, but the query is still quite slow (200 million records). I was thinking about running this query in parallel, b

Re: How to speed up SELECT * query in Cassandra

2015-02-11 Thread Marcelo Valle (BLOOMBERG/ LONDON)
Look for the message "Re: Fastest way to map/parallel read all values in a table?" in the mailing list; it was recently discussed. You can have several parallel processes, each one reading a slice of the data, by splitting the min/max murmur3 hash ranges. At the company where I used to work, we developed a
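The range-splitting idea in this reply can be sketched as follows. This is only a minimal illustration, assuming the default Murmur3Partitioner, whose tokens span -2^63 to 2^63 - 1. The helper names `split_token_ranges` and `slice_query` are hypothetical and do not come from the thread.

```python
# Token bounds of the Murmur3Partitioner (signed 64-bit range).
MIN_TOKEN = -(2 ** 63)
MAX_TOKEN = 2 ** 63 - 1


def split_token_ranges(num_slices):
    """Split the full token range into contiguous, inclusive (start, end) slices."""
    if num_slices < 1:
        raise ValueError("num_slices must be at least 1")
    total = MAX_TOKEN - MIN_TOKEN + 1  # number of distinct tokens (2**64)
    ranges = []
    start = MIN_TOKEN
    for i in range(num_slices):
        # The last token of slice i; the final slice always ends at MAX_TOKEN.
        end = MIN_TOKEN + (total * (i + 1)) // num_slices - 1
        ranges.append((start, end))
        start = end + 1
    return ranges


def slice_query(table, partition_key, start, end):
    """Build the CQL text for one slice (table and key names are placeholders)."""
    return (
        "SELECT * FROM {t} WHERE token({k}) >= {s} AND token({k}) <= {e}"
        .format(t=table, k=partition_key, s=start, e=end)
    )
```

Each worker process would then run one of the generated statements, for example `slice_query("my_keyspace.my_table", "id", *split_token_ranges(16)[0])`, where the keyspace, table, and key names are made up for illustration.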

Re: Two problems with Cassandra

2015-02-11 Thread Carlos Rolo
Hello Pavel, What is the size of the cluster (# of nodes)? And do you need to iterate over the full 1TB every time you do the update, or just parts of it? IMO there is too little information to make any kind of assessment of the problem you are having. I can suggest trying a 2.0.x (or 2.1.1) release to see if

Re: How to speed up SELECT * query in Cassandra

2015-02-11 Thread Jens Rantil
On Wed, Feb 11, 2015 at 11:40 AM, Marcelo Valle (BLOOMBERG/ LONDON) < mvallemil...@bloomberg.net> wrote: > If you use Cassandra enterprise, you can use hive, AFAIK. Even better, you can use Spark/Shark with DSE. Cheers, Jens -- Jens Rantil Backend engineer Tink AB Email: jens.ran...@tink.se

Re: How to speed up SELECT * query in Cassandra

2015-02-11 Thread Colin
I use spark with cassandra, and you don't need DSE. I see a lot of people ask this same question below (how do I get a lot of data out of cassandra?), and my question is always, why aren't you updating both places at once? For example, we use hadoop and cassandra in conjunction with each other, w

Re: Two problems with Cassandra

2015-02-11 Thread Pavel Velikhov
Hi Carlos, I tried on a single node and a 4-node cluster. On the 4-node cluster I set up the tables with replication factor = 2. I usually iterate over a subset, but it can be about ~40% right now. Some of my column values could be quite big… I remember I was exporting to csv and I had to chan

Re: How to speed up SELECT * query in Cassandra

2015-02-11 Thread Jiri Horky
The fastest way I am aware of is to do the queries in parallel to multiple cassandra nodes and make sure that you only ask them for keys they are responsible for. Otherwise, the node needs to resend your query which is much slower and creates unnecessary objects (and thus GC pressure). You can man
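The "only ask nodes for keys they are responsible for" idea can be illustrated with a small lookup. This is a simplified sketch, assuming one token per node and ignoring replication and vnodes: a node with token T is treated as owning the range (previous token, T], wrapping around the ring. The function name `owner_for_token` and the ring layout are hypothetical.

```python
import bisect


def owner_for_token(ring, token):
    """Return the node that owns `token` in a simplified single-token ring.

    `ring` is a list of (node_token, node_name) pairs sorted by node_token.
    """
    node_tokens = [t for t, _ in ring]
    # First node whose token is >= the key's token owns the key.
    idx = bisect.bisect_left(node_tokens, token)
    if idx == len(ring):
        idx = 0  # wrap around: tokens past the last node belong to the first
    return ring[idx][1]
```

A client could use such a lookup to send each token-range query directly to the owning node instead of a random coordinator. Real drivers generally provide token-aware routing that takes replication into account, so this is only meant to show the principle.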

Re: How to speed up SELECT * query in Cassandra

2015-02-11 Thread Marcelo Valle (BLOOMBERG/ LONDON)
> cassandra makes a very poor datawarehouse or long term time series store Really? This is not the impression I have... I think Cassandra is good to store large amounts of data and historical information, it's only not good to store temporary data. Netflix has a large amount of data and it's al

road map for Cassandra 3.0

2015-02-11 Thread Ernesto Reinaldo Barreiro
Hi, Is there a public road map for Cassandra 3.0? Are there any estimates for the release date of 3.0? -- Regards - Ernesto Reinaldo Barreiro

Re: Two problems with Cassandra

2015-02-11 Thread Carlos Rolo
Update should not be a problem because no read is done, so no need to pull the data out. Is that row bigger than your memory capacity (Or HEAP size)? For dealing with large heaps you can refer to this ticket: CASSANDRA-8150. It provides some nice tips. If someone else can share experience would b

Re: How to speed up SELECT * query in Cassandra

2015-02-11 Thread DuyHai Doan
"The very nature of cassandra's distributed nature vs partitioning data on hadoop makes spark on hdfs actually faster than on cassandra" Prove it. Did you ever have a look into the source code of the Spark/Cassandra connector to see how data locality is achieved before throwing out such statem

Re: road map for Cassandra 3.0

2015-02-11 Thread DuyHai Doan
Look at the JIRA, filter by 3.0. But it's not very accurate. There are a lot of new features scheduled for 3.0. Some of them will make it in time for 3.0.0, like User Defined Functions I guess. Some other features will be shipped with future 3.x middle/minor versions. On Wed, Feb 11, 2015 at 1:25 PM,

Re: road map for Cassandra 3.0

2015-02-11 Thread Ernesto Reinaldo Barreiro
Thanks for your answer! On Wed, Feb 11, 2015 at 1:03 PM, DuyHai Doan wrote: > Look at the JIRA, filter by 3.0. But it's not very accurate. There are lot > of new features scheduled for 3.0. Some of them will make it on time for > 3.0.0 like User Defined Functions I guess. Some other features wil

Re: How to speed up SELECT * query in Cassandra

2015-02-11 Thread Ja Sam
Your answer looks very promising. How do you calculate start and stop? On Wed, Feb 11, 2015 at 12:09 PM, Jiri Horky wrote: > The fastest way I am aware of is to do the queries in parallel to > multiple cassandra nodes and make sure that you only ask them for keys > they are responsible for. Oth

best supported spark connector for Cassandra

2015-02-11 Thread Marcelo Valle (BLOOMBERG/ LONDON)
Since Spark was being discussed in another thread, I decided to start a new one, as I am interested in using Spark + Cassandra in the future. About 3 years ago, Spark was not an existing option and we tried to use hadoop to process Cassandra data. My experience was horrible and

Re: best supported spark connector for Cassandra

2015-02-11 Thread DuyHai Doan
Start looking at the Spark/Cassandra connector here (in Scala): https://github.com/datastax/spark-cassandra-connector/tree/master/spark-cassandra-connector/src/main/scala/com/datastax/spark/connector Data locality is provided by this method: https://github.com/datastax/spark-cassandra-connector/bl

Re: best supported spark connector for Cassandra

2015-02-11 Thread Marcelo Valle (BLOOMBERG/ LONDON)
I just finished a scala course, nice exercise to check what I learned :D Thanks for the answer! From: user@cassandra.apache.org Subject: Re: best supported spark connector for Cassandra Start looking at the Spark/Cassandra connector here (in Scala): https://github.com/datastax/spark-cassandra-

changes to metricsReporterConfigFile requires restart of cassandra?

2015-02-11 Thread Erik Forsberg
Hi! I was pleased to find out that cassandra 2.0.x has added support for pluggable metrics export, which even includes a graphite metrics sender. Question: Will changes to the metricsReporterConfigFile require a restart of cassandra to take effect? I.e, if I want to add a new exported metric to

Re: How to speed up SELECT * query in Cassandra

2015-02-11 Thread Colin
Did you want me to include specific examples from my employment at datastax or start from the ground up? All spark on cassandra is, is better than the previous use of hive. The fact that datastax hasn't provided any benchmarks themselves other than glossy marketing statements pretty much say

Re: How to speed up SELECT * query in Cassandra

2015-02-11 Thread DuyHai Doan
For your information Colin: http://en.wikipedia.org/wiki/List_of_fallacies. Look at "Burden of proof". You stated "The very nature of cassandra's distributed nature vs partitioning data on hadoop makes spark on hdfs actually faster than on

Re: Pagination support on Java Driver Query API

2015-02-11 Thread Eric Stevens
> I can't believe that everyone read & process all rows at once (without pagination). Probably not too many people try to read all rows in a table as a single rolling operation with a standard client driver. But those who do would use token() to keep track of where they are and be able to resume
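The token()-based bookkeeping described in this reply might look roughly like the loop below. It is a sketch under stated assumptions: `fetch_page` is a hypothetical callback standing in for a query of the form `SELECT token(pk), ... FROM t WHERE token(pk) > ? LIMIT ?`, returning (token, row) pairs in token order, and each partition is assumed to hold a single row so that a page boundary never cuts a partition in half.

```python
def scan_by_token(fetch_page, page_size, start_after=None):
    """Yield every row in token order, remembering the last token seen.

    fetch_page(after_token, limit) must return a list of (token, row) pairs
    with token > after_token (or from the start of the ring when
    after_token is None), sorted by token.
    """
    last_token = start_after
    while True:
        page = fetch_page(last_token, page_size)
        if not page:
            return
        for token, row in page:
            yield row
            last_token = token  # resume point if the scan is interrupted
        if len(page) < page_size:
            return  # a short page means the end of the ring was reached
```

Persisting `last_token` somewhere durable and passing it back in as `start_after` is what would let an interrupted scan resume where it stopped.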

Adding new node - OPSCenter problems

2015-02-11 Thread Batranut Bogdan
Hello all, I have added 3 new nodes to an existing cluster. I must point out that I copied the cassandra.yaml file from an existing node and just changed listen_address per the instructions here: Adding nodes to an existing cluster | DataStax Cassandra 2.0 Documentation

Re: Recommissioned a node

2015-02-11 Thread Eric Stevens
AFAIK it should be ok after the repair completed (it was missing all writes while it was decommissioning and while it was offline, and nobody would have been keeping hinted handoffs for it, so repair was the right thing to do). Unless RF=N you're now due for a cleanup on the other nodes. Generall

Re: Adding new node - OPSCenter problems

2015-02-11 Thread Carlos Rolo
Hello, What is the output of nodetool status? All nodes should appear, otherwise there is some configuration error. Regards, Carlos Juzarte Rolo Cassandra Consultant Pythian - Love your data rolo@pythian | Twitter: cjrolo | Linkedin: *linkedin.com/in/carlosjuzarterolo

Re: changes to metricsReporterConfigFile requires restart of cassandra?

2015-02-11 Thread Eric Stevens
AFAIK yes. If you want just a subset of the metrics, I would suggest exporting them all, and filtering on the Graphite side. On Wed, Feb 11, 2015 at 6:54 AM, Erik Forsberg wrote: > Hi! > > I was pleased to find out that cassandra 2.0.x has added support for > pluggable metrics export, which eve

Re: Adding new node - OPSCenter problems

2015-02-11 Thread Batranut Bogdan
Hello, nodetool status shows the existing nodes as UN and the 3 new ones as UJ. What is strange is that in the Owns column for the 3 new nodes I have ? instead of a percentage value. I also see that all nodes are in rack1 in the Rack column. On Wednesday, February 11, 2015 4:50 PM, Carlos Rolo wrote:

Re: How to speed up SELECT * query in Cassandra

2015-02-11 Thread Colin
No, the question isn't closed. You don't get to decide that. I don't run a website making claims regarding cassandra and spark - your employer does. Again, where are your benchmarks? I will publish mine, then we'll see what you've got. -- Colin Clark +1 612 859 6129 Skype colin.p.clark > On

Safely delete tmplink files - 2.1.2

2015-02-11 Thread Demian Berjman
Hi, we are expecting the 2.1.3 release to fix the deletion of tmplink files. In the meantime, is it safe to delete these files without shutting down Cassandra? Thanks,

Re: Recommissioned a node

2015-02-11 Thread Stefano Ortolani
Hi Eric, thanks for your answer. The reason why it got recommissioned was simply because the machine got restarted (with auto_bootstrap set to true). A cleaner, and correct, recommission would have just required wiping the data folder, am I correct? Or would I have needed to change something el

Re: How to speed up SELECT * query in Cassandra

2015-02-11 Thread Jiri Horky
Well, I always wondered how Cassandra can be used in a Hadoop-like environment where you basically need to do a full table scan. I need to say that our experience is that cassandra is perfect for writing and for reading specific values by key, but definitely not for reading all of the data out of it. Some of

How to use CqlBulkOutputFormat with MultipleOutputs

2015-02-11 Thread Aby Kuruvilla
Have been trying to import data into multiple Column Families through a Hadoop job. Was able to use CqlOutputFormat to move data to a single column family, but don't think this supports imports to multiple column families. From some searching saw that CqlBulkOutputFormat has support to write to mu

Re: best supported spark connector for Cassandra

2015-02-11 Thread shahab
I am using the Calliope cassandra-spark connector (http://tuplejump.github.io/calliope/), which is quite handy and easy to use! The only problem is that it is a bit outdated: it works with Spark 1.1.0. Hopefully a new version comes soon. best, /Shahab On Wed, Feb 11, 2015 at 2:51 PM, Marcelo Valle (BLOOM

Re: Recommissioned a node

2015-02-11 Thread Eric Stevens
Yes, including the system and commitlog directory. Then when it starts, it's like a brand new node and will bootstrap to join. On Wed, Feb 11, 2015 at 8:56 AM, Stefano Ortolani wrote: > Hi Eric, > > thanks for your answer. The reason why it got recommissioned was simply > because the machine go

Re: Pagination support on Java Driver Query API

2015-02-11 Thread Ajay
Hi Eric, Thanks for your reply. I am using Cassandra 2.0.11, and in that version I cannot append a condition like "last clustering key column > value of the last row in the previous batch". It fails with "Preceding column is either not restricted or by a non-EQ relation". It means I need to specify equal condition

Re: Pagination support on Java Driver Query API

2015-02-11 Thread Ajay
Basically I am trying different queries with your approach. One such query is like "Select * from mycf where <condition on partition key> order by ck1 asc, ck2 desc", where ck1 and ck2 are clustering keys in that order. How do we achieve pagination support here? Thanks Ajay On Feb 11, 2015 11:16 PM,

Re: Recommissioned a node

2015-02-11 Thread Robert Coli
On Tue, Feb 10, 2015 at 9:13 PM, Stefano Ortolani wrote: > I recommissioned a node after decommissioning it. > That happened (1) after a successful decommission (checked), (2) without > wiping the data directory on the node, (3) simply by restarting the > cassandra service. The node now reports h

Re: Recommissioned a node

2015-02-11 Thread Stefano Ortolani
Hi Robert, it all happened within 30 minutes, so well before the default gc_grace_seconds (864000) elapsed, so I should be fine. However, this is quite shocking if you ask me. The mere possibility of getting into an inconsistent state just by restarting a node is appalling... Can other people confirm that a
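For reference, the arithmetic behind the figures in this message: 864000 seconds is 10 days, so a 30-minute gap is far inside the window during which tombstones are still retained. A tiny sketch of that check, with the hypothetical helper name `within_gc_grace`:

```python
DEFAULT_GC_GRACE_SECONDS = 864000  # default quoted in the message above


def within_gc_grace(downtime_seconds, gc_grace_seconds=DEFAULT_GC_GRACE_SECONDS):
    """True if the node was away for less than the tombstone grace period."""
    return downtime_seconds < gc_grace_seconds


# 864000 s / 86400 s per day = 10 days
gc_grace_days = DEFAULT_GC_GRACE_SECONDS / 86400
```

With these values, `within_gc_grace(30 * 60)` holds, which matches the "I should be fine" reasoning in the message.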

Re: Recommissioned a node

2015-02-11 Thread Jonathan Haddad
It could, because the tombstones that mark data deleted may have been removed. There would be nothing that says "this data is gone". If you're worried about it, turn up your gc grace seconds. Also, don't revive nodes back into a cluster with old data sitting on them. On Wed Feb 11 2015 at 11:18

Re: Recommissioned a node

2015-02-11 Thread Robert Coli
On Wed, Feb 11, 2015 at 11:20 AM, Jonathan Haddad wrote: > It could, because the tombstones that mark data deleted may have been > removed. There would be nothing that says "this data is gone". > > If you're worried about it, turn up your gc grace seconds. Also, don't > revive nodes back into a

Re: Recommissioned a node

2015-02-11 Thread Jonathan Haddad
And after decreasing your RF (rare but happens) On Wed Feb 11 2015 at 11:31:38 AM Robert Coli wrote: > On Wed, Feb 11, 2015 at 11:20 AM, Jonathan Haddad > wrote: > >> It could, because the tombstones that mark data deleted may have been >> removed. There would be nothing that says "this data i

Re: Safely delete tmplink files - 2.1.2

2015-02-11 Thread Robert Coli
On Wed, Feb 11, 2015 at 7:52 AM, Demian Berjman wrote: > Hi, we are expecting the 2.1.3 release to fix the deletion of tmplink files. > In the meantime, is it safe to delete these files without shutting down > Cassandra? > If I were experiencing issues with 2.1.2, I would downgrade to 2.1.1, fwiw.

Re: Two problems with Cassandra

2015-02-11 Thread Robert Coli
On Wed, Feb 11, 2015 at 2:22 AM, Pavel Velikhov wrote: > 2. While trying to update the full dataset with a simple transformation > (again via python driver), single node and clustered Cassandra run out of > memory no matter what settings I try, even if I put a lot of sleeps into the > mix. However

Re: How to speed up SELECT * query in Cassandra

2015-02-11 Thread Jiri Horky
Hi, here are some snippets of code in scala which should get you started. Jirka H.
    loop { lastRow =>
      val query = lastRow match {
        case Some(row) => nextPageQuery(row, upperLimit)
        case None => initialQuery(lowerLimit)
      }
      session.execute(query).all
    }
    private def nextPageQuery(row: Row, upperLimit: String

Re: High GC activity on node with 4TB on data

2015-02-11 Thread Jiri Horky
Hi Chris, On 02/09/2015 04:22 PM, Chris Lohfink wrote: > - number of tombstones - how can I reliably find it out? > https://github.com/spotify/cassandra-opstools > https://github.com/cloudian/support-tools thanks. > > If not getting much compression it may be worth trying to disable it, > it may