read huge data from CSV and write into Cassandra

2014-07-25 Thread Akshay Ballarpure
How can I read data from a large CSV file that has 100+ columns and millions of rows, and insert it into Cassandra every minute? Thanks & Regards Akshay Ghanshyam Ballarpure Tata Consultancy Services

How to get rid of stale info in gossip

2014-07-25 Thread Rahul Neelakantan
Is there a way to get rid of stale information that shows up for removed/dead nodes in gossip, without a complete cluster bounce? Rahul Neelakantan

Re: How to get rid of stale info in gossip

2014-07-25 Thread Mark Reddy
After removing a node, its information can persist in the Gossiper for up to 3 days, after which time it should be removed. Are you having issues with a removed node state persisting for longer? Mark On Fri, Jul 25, 2014 at 11:33 AM, Rahul Neelakantan wrote: > Is there a way to get rid of s

Re: Hot, large row

2014-07-25 Thread Keith Wright
Answers to your questions are below, but in the end I believe the root issue here is that LCS is clearly not compacting away as it should, resulting in reads across many SSTables, which as you noted is “fishy”. I’m considering filing a JIRA for this; does that sound reasonable? We are running OOTB JVM tuning

Re: Hot, large row

2014-07-25 Thread Duncan Sands
Hi Keith, On 25/07/14 14:43, Keith Wright wrote: Answers to your questions below but in the end I believe the root issue here is that LCS is clearly not compacting away as it should resulting in reads across many SSTables which as you noted is “fishy”. I’m considering filing a JIRA for this, s

Re: How to get rid of stale info in gossip

2014-07-25 Thread Rahul Neelakantan
Yes, and this is a really old version of Cassandra, 1.0.8. Rahul Neelakantan 678-451-4545 > On Jul 25, 2014, at 7:29 AM, Mark Reddy wrote: > > After removing a node, it's information can persist in the Gossiper for up to > 3 days, after which time it should be removed. > > Are you having issu

Re: Hot, large row

2014-07-25 Thread DuyHai Doan
Hello Keith 1. Periodically seeing one node stuck in CMS GC causing high read latency. Seems to recover on its own after an hour or so How many nodes do you have? And roughly how many distinct user_id values are there? Looking at your JVM settings, it seems that you have the GC log enabled. It

Re: Hot, large row

2014-07-25 Thread Ken Hancock
Keith, If I'm understanding your schema, it sounds like you have rows/partitions with TTLs that continuously grow over time. This sounds a lot like https://issues.apache.org/jira/browse/CASSANDRA-6654 which was marked as working as designed, but is totally unintuitive. (Jonathan Ellis went off and

Re: Caffinitas Mapper - Java object mapper for Apache Cassandra

2014-07-25 Thread Vivek Mishra
How is it different than kundera? On 20/07/2014 9:03 pm, "Robert Stupp" wrote: > Hi all, > > I've just released the first beta version of Caffinitas Mapper. > > Caffinitas Mapper is an advanced Java object mapper for Apache Cassandra > NoSQL database. It offers an annotation based declaration mod

Re: Hot, large row

2014-07-25 Thread Keith Wright
Ha, check out who filed that ticket! Yes I’m aware of it. My hope is that it was mostly addressed in CASSANDRA-6563 so I may upgrade from 2.0.6 to 2.0.9. I’m really just surprised that others are not doing similar actions as I and thus experiencing similar issues. To answer DuyHai’s questio

Re: read huge data from CSV and write into Cassandra

2014-07-25 Thread Jack Krupansky
Read the csv file using a Java app and then index the rows using the Cassandra Java driver with multiple, parallel input streams. Oh, and make sure to provision your cluster with enough nodes to handle your desired ingestion and query rates. Do a proof of concept with a six node cluster with RF
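
A minimal sketch of the approach Jack describes, written in Python rather than Java for brevity. It streams the CSV instead of loading it whole and hands rows to a worker pool in bounded chunks. The names `load_csv` and `chunks` are made up for this illustration, and `insert_row` is a hypothetical stand-in for whatever driver call performs the INSERT; no real Cassandra driver API is shown here.

```python
import csv
import io
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def chunks(rows, size):
    """Yield lists of up to `size` rows from an iterator."""
    rows = iter(rows)
    while True:
        chunk = list(islice(rows, size))
        if not chunk:
            return
        yield chunk


def load_csv(stream, insert_row, workers=4, chunk_size=1000):
    """Stream a CSV and pass each row to `insert_row` on a worker pool.

    `insert_row` is a hypothetical callable standing in for a driver
    call (for example, executing a prepared INSERT). Returns the number
    of rows handed to it.
    """
    reader = csv.DictReader(stream)
    total = 0
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for chunk in chunks(reader, chunk_size):
            # list() waits for the chunk to finish, so at most one
            # chunk of rows is in flight at any time.
            list(pool.map(insert_row, chunk))
            total += len(chunk)
    return total


# Usage with an in-memory CSV and a list standing in for the database.
sample = io.StringIO("id,name\n1,a\n2,b\n3,c\n")
stored = []
count = load_csv(sample, stored.append, workers=2, chunk_size=2)
```

Bounding the work to one chunk at a time is a deliberate choice: with millions of rows, submitting everything at once would queue the whole file in memory.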

Re: Hot, large row

2014-07-25 Thread Jack Krupansky
Is it the accumulated tombstones on a row that make it act as if “wide”? Does cfhistograms count the tombstones or subtract them when reporting on cell-count for rows? (I don’t know.) -- Jack Krupansky From: Keith Wright Sent: Friday, July 25, 2014 10:24 AM To: user@cassandra.apache.org Cc: D

Re: Caffinitas Mapper - Java object mapper for Apache Cassandra

2014-07-25 Thread Robert Stupp
I don't know kundera in detail. Goal for Caffinitas Mapper is to provide a convenient object mapper for Apache Cassandra using an annotation based declarative approach that does not have the limitations that JPA annotations have. Am 25.07.2014 um 16:21 schrieb Vivek Mishra : > How is it differe

Replication factor 2 with immutable data

2014-07-25 Thread Jon Travis
I have a couple questions regarding the availability of my data in a RF=2 scenario. - The setup - I am currently storing immutable data in a CF with RF=2 and read_repair_chance = 0.0. There is a lot of data, so bumping up to RF=3 would increase my storage costs quite dramatically. For the most p

Re: Replication factor 2 with immutable data

2014-07-25 Thread Robert Coli
On Fri, Jul 25, 2014 at 10:46 AM, Jon Travis wrote: > I have a couple questions regarding the availability of my data in a RF=2 > scenario. > You have just explained why consistency does not work with Replication Factor of fewer than 3 and Consistency Level of less than QUORUM. Basically, with
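
The arithmetic behind this point can be sketched as follows, assuming the usual quorum definition of floor(RF / 2) + 1. The function names are made up for this illustration. With RF=2 a quorum is both replicas, so a QUORUM read or write cannot tolerate a single replica being down.

```python
def quorum(replication_factor):
    """Replicas that must respond for QUORUM: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def tolerated_failures(replication_factor):
    """Replicas that can be down while QUORUM is still reachable."""
    return replication_factor - quorum(replication_factor)


for rf in (1, 2, 3, 5):
    print(f"RF={rf}: quorum={quorum(rf)}, "
          f"tolerated failures={tolerated_failures(rf)}")
```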

Does SELECT … IN () use parallel dispatch?

2014-07-25 Thread Kevin Burton
Say I have about 50 primary keys I need to fetch. I'd like to use parallel dispatch, so that if I have 50 hosts, and each has a record, I can read from all 50 at once. I assume Cassandra does the right thing here? I believe it does… at least from reading the docs but it's still a bit unclear. K

Re: Does SELECT … IN () use parallel dispatch?

2014-07-25 Thread DuyHai Doan
Nope. Select ... IN() sends one request to a coordinator. This coordinator dispatches the request to 50 nodes as in your example and waits for 50 responses before sending back the final result. As you can guess this approach is not optimal since the global request latency is bound to the slowest late

Re: Index creation sometimes fails

2014-07-25 Thread Clint Kelly
Hi Tyler, FWIW I was not able to reproduce this problem with a smaller example. I'll go ahead and file the JIRA anyway. Thanks for your help! Best regards, Clint On Thu, Jul 17, 2014 at 3:05 PM, Tyler Hobbs wrote: > > On Thu, Jul 17, 2014 at 4:59 PM, Clint Kelly > wrote: > >> >> I will pos

Re: Does SELECT … IN () use parallel dispatch?

2014-07-25 Thread Graham Sanderson
Of course, the driver in question is allowed to be smarter, and can do so if you use a ? parameter for a list or even for individual elements. I'm not sure which drivers, if any, currently do this, but we plan to combine this with token-aware routing in our Scala driver in the future.

do all nodes actually send the data to the coordinator when doing a read?

2014-07-25 Thread Brian Tarbox
We're considering a C* setup with very large columns and I have a question about the details of read. I understand that a read request gets handled by the coordinator which sends read requests to of the nodes holding replicas of the data, and once nodes have replied with consistent data it is re

Re: Does SELECT … IN () use parallel dispatch?

2014-07-25 Thread Kevin Burton
On Fri, Jul 25, 2014 at 11:14 AM, DuyHai Doan wrote: > Nope. Select ... IN() sends one request to a coordinator. This coordinator > dispatch the request to 50 nodes as in your example and waits for 50 > responses before sending back the final result. As you can guess this > approach is not optima

Re: Does SELECT … IN () use parallel dispatch?

2014-07-25 Thread Kevin Burton
Perhaps the best strategy is to have the datastax java-driver do this and I just wait for each result individually. This will give me parallel dispatch. On Fri, Jul 25, 2014 at 11:40 AM, Graham Sanderson wrote: > Of course the driver in question is allowed to be smarter and can do so if > use u

Re: Does SELECT … IN () use parallel dispatch?

2014-07-25 Thread Laing, Michael
We use IN (keeping the number down). The coordinator does parallel dispatch AND applies ORDER BY to the aggregate results, which we would otherwise have to do ourselves. Anyway, worth it for us. ml On Fri, Jul 25, 2014 at 1:24 PM, Kevin Burton wrote: > Perhaps the best strategy is to have th

Re: Does SELECT … IN () use parallel dispatch?

2014-07-25 Thread Kevin Burton
Ah.. ok. Nice. That should work. Parallel dispatch on the client would work too.. using async. On Fri, Jul 25, 2014 at 1:37 PM, Laing, Michael wrote: > We use IN (keeping the number down). The coordinator does parallel > dispatch AND applies ORDERED BY to the aggregate results, which we would
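
Client-side parallel dispatch, as Kevin suggests, might look roughly like the sketch below. It uses a thread pool from the Python standard library instead of a driver's async API, and `fetch_one` is a hypothetical single-key lookup; `fetch_all` and `fake_table` are also made up for this illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(keys, fetch_one, max_workers=8):
    """Issue one lookup per key concurrently; return {key: result}.

    `fetch_one` is a hypothetical stand-in for a single-partition
    query issued through a driver.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as `keys`.
        return dict(zip(keys, pool.map(fetch_one, keys)))


# Usage with a dict standing in for the table.
fake_table = {k: f"row-{k}" for k in range(50)}
rows = fetch_all(list(fake_table), fake_table.get)
```

Each key gets its own request, so one slow replica delays only its own key rather than the whole batch.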

IN clause with composite primary key?

2014-07-25 Thread Kevin Burton
How the heck would you build an IN clause with a primary key which is composite? Say columns foo and bar are the primary key. If you just had foo as your column name, you could do where foo in () … but with two keys I don't see how it's possible. Specifying both actually builds a cartesian pr
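
The cartesian-product concern can be illustrated without a database. In this sketch the column values and the `wanted` pairs are invented for the example; it only shows that matching each column against its own list selects every combination, not just the intended pairs.

```python
from itertools import product

# Hypothetical values for the two key columns.
foo_values = [1, 2]
bar_values = ["x", "y"]

# The (foo, bar) pairs the caller actually wants.
wanted = {(1, "x"), (2, "y")}

# Matching foo against one list and bar against another selects
# every combination of the two lists.
matched = set(product(foo_values, bar_values))
extra = matched - wanted

print(sorted(matched))
print(sorted(extra))
```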

Re: do all nodes actually send the data to the coordinator when doing a read?

2014-07-25 Thread Mark Reddy
Hi Brian, A read request will be handled in the following manner: Once the coordinator receives a read request, it first determines the replicas responsible for the data. Those replicas are then sorted by "proximity" to the coordinator. The closest node as determined by proximity sorti

Re: IN clause with composite primary key?

2014-07-25 Thread DuyHai Doan
Below are the rules for IN clause a. composite partition keys: the IN clause only applies to the last composite component b. clustering keys: the IN clause only applies to the last clustering key Contrived example: CREATE TABLE test( pk1 int, pk2 int, clust1 int, clust2 int, clust

Re: Does SELECT … IN () use parallel dispatch?

2014-07-25 Thread Laing, Michael
Except then you have to merge results if you want them ordered. On Fri, Jul 25, 2014 at 2:15 PM, Kevin Burton wrote: > Ah.. ok. Nice. That should work. Parallel dispatch on the client would > work too.. using async. > > > On Fri, Jul 25, 2014 at 1:37 PM, Laing, Michael > wrote: > >> We use I
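
If each per-key result set already comes back sorted, the client-side merge Michael mentions is a standard k-way merge. A minimal sketch using the Python standard library follows; the row shape (a date string plus a payload) and the name `merge_ordered` are assumptions made for the example.

```python
import heapq


def merge_ordered(result_sets, key=None):
    """Merge several individually sorted result sets into one sorted list."""
    return list(heapq.merge(*result_sets, key=key))


# Two per-key result sets, each already ordered by date.
per_key_results = [
    [("2014-01-01", "a"), ("2014-03-01", "c")],
    [("2014-02-01", "b"), ("2014-04-01", "d")],
]
merged = merge_ordered(per_key_results, key=lambda row: row[0])
```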

Re: do all nodes actually send the data to the coordinator when doing a read?

2014-07-25 Thread DuyHai Doan
Thanks Mark for the very detailed explanation. However, what about timestamp checking? You're saying that the coordinator checks for the digest of data (cell value) from both nodes, but if the cells have different timestamps, would it still request a full data read to the node having the most

Re: do all nodes actually send the data to the coordinator when doing a read?

2014-07-25 Thread Jaydeep Chovatia
Yes. The digest includes the following: {name, value, timestamp, flags(deleted, expired, etc.)} On Fri, Jul 25, 2014 at 2:33 PM, DuyHai Doan wrote: > Thanks Mark for the very detailed explanation. > > However what's about timestamp checking ? You're saying that the > coordinator checks for the digest
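
To see why including the timestamp matters, here is an illustrative digest over the four fields listed above. This is not Cassandra's actual serialization or hash function; `cell_digest`, the field encoding, and the use of SHA-256 are all assumptions chosen so the sketch runs with the standard library alone.

```python
import hashlib


def cell_digest(name, value, timestamp, flags=0):
    """Hash a cell's name, value, timestamp and flags together.

    Illustrative only: the encoding and hash are assumptions, not
    Cassandra's real digest format.
    """
    h = hashlib.sha256()
    for part in (name, value, str(timestamp), str(flags)):
        data = part.encode("utf-8")
        # Length-prefix each field so field boundaries are unambiguous.
        h.update(len(data).to_bytes(4, "big"))
        h.update(data)
    return h.hexdigest()


# Same name and value on two replicas, but different timestamps.
replica_a = cell_digest("col", "v", 1000)
replica_b = cell_digest("col", "v", 2000)
print(replica_a == replica_b)
```

Because the timestamp feeds the hash, two replicas holding the same value written at different times produce different digests, which is what makes the mismatch detectable.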

Re: IN clause with composite primary key?

2014-07-25 Thread Laing, Michael
You may also want to use tuples for the clustering columns: The tuple notation may also be used for IN clauses on CLUSTERING COLUMNS: > > SELECT * FROM posts WHERE userid='john doe' AND (blog_title, posted_at) IN > (('John''s Blog', '2012-01-01'), ('Extreme Chess', '2014-06-01')) > > > from https:

Changing IPs of all nodes in a ring

2014-07-25 Thread Rahul Neelakantan
All, I need to change the IPs of all nodes in my ring in a flash cut, at the same time. Any recommendations on how to do this? Rahul Neelakantan

Re: Changing IPs of all nodes in a ring

2014-07-25 Thread Robert Coli
On Fri, Jul 25, 2014 at 3:54 PM, Rahul Neelakantan wrote: > I need to change the IPs of all nodes in my ring in a flash cut, at the > same time. Any recommendations on how to do this? > What are your uptime requirements as you do this? Because no, there's no way to change the ip address on all

Re: Changing IPs of all nodes in a ring

2014-07-25 Thread Rahul Neelakantan
I am OK with taking up to 2 hours of planned downtime. The problem is all the IPs will change at the same time and the previous IPs will no longer be available. So it's either all old IPs or all new IPs. Rahul Neelakantan 678-451-4545 > On Jul 25, 2014, at 7:23 PM, Robert Coli wrote: > >> On

Re: do all nodes actually send the data to the coordinator when doing a read?

2014-07-25 Thread Mark Reddy
> > However what's about timestamp checking ? You're saying that the > coordinator checks for the digest of data (cell value) from both nodes but > if the cell name have different timestamp would it still request a full > data read to the node having the most recent time ? When generating the has

Re: Changing IPs of all nodes in a ring

2014-07-25 Thread Robert Coli
On Fri, Jul 25, 2014 at 4:42 PM, Rahul Neelakantan wrote: > I am ok with taking upto 2 hours of planned downtime. The problem is all > the IPs will change at the same time and the previous IPs will no longer be > available. So it's either all old IPs or all new IPs. > Are the new IPs available b

Re: Changing IPs of all nodes in a ring

2014-07-25 Thread Rahul Neelakantan
The new IPs are not available before switch time, so I will try the all-down method you mentioned. Do I need to move the tokens of the new IPs to the old token -1 and then removeToken the old tokens? I ask this because the old IPs will continue to show in gossip info with a status of Nor

Re: IN clause with composite primary key?

2014-07-25 Thread Kevin Burton
ah.. this only works in clustering columns… hm.. won't work in our situation though :-( On Fri, Jul 25, 2014 at 3:45 PM, Laing, Michael wrote: > You may also want to use tuples for the clustering columns: > > The tuple notation may also be used for IN clauses on CLUSTERING COLUMNS: >> >> SELECT

any plans for coprocessors?

2014-07-25 Thread Kevin Burton
Are there any plans to add coprocessors to Cassandra? Embedding logic directly in a Cassandra daemon would be nice.