Re: Deduplicating data on a node (RF=1)

2014-11-17 Thread Eric Stevens
If the new node never formally joined the cluster (streaming never completed, it never entered UN state), shouldn't that node be safe to scrub and start over again? It shouldn't be taking primary writes while it's bootstrapping, should it? On Mon Nov 17 2014 at 6:34:04 PM Michael Shuler wrote:

Re: Getting the counters with the highest values

2014-11-24 Thread Eric Stevens
You're right that there's no way to use the counter data type to materialize a view ordered by the counter. Computing this post hoc is the way to go if your needs allow for it (if not, something like Summingbird or vanilla Storm may be necessary). I might suggest that you make your primary key fo

Re: Getting the counters with the highest values

2014-11-25 Thread Eric Stevens
local midnight instead of a global midnight. Interesting idea. > > Thanks for your response. > > Robert > > On Nov 24, 2014, at 9:40 AM, Eric Stevens wrote: > > You're right that there's no way to use the counter data type to > materialize a view ordered by the

Re: Issues in moving data from cassandra to elasticsearch in java.

2014-11-25 Thread Eric Stevens
Consider adding log_bucket timestamp, and then indexing that column. Your data loader can SELECT * FROM logs WHERE log_bucket = ?. The value you supply there would be the timestamp log bucket you're processing - in your case logged_at % 5. However, I'll caution against writing data to Cassandra

Re: multiple threads updating result in TransportException

2014-11-27 Thread Eric Stevens
A lot of people do a lot of multi-threaded work with the Datastax Java Driver. It looks like you're using Cassandra Driver 2.0.0-RC2; as a first step, might I suggest at least upgrading to 2.0.0 final? RC2 wasn't even the final release candidate for 2.0.0. On Wed Nov 26 2014 at 8:44:07 AM Brian Tarbox

Re: Data synchronization between 2 running clusters on different availability zone

2014-11-27 Thread Eric Stevens
There's no reason you can't run on multiple cloud providers as long as you treat them as logically distinct DC's. It should largely work the same way as running in several AWS regions, but you'll need to use something like GossipingPropertyFileSnitch because the EC2 snitches are specific to AWS.

Re: Column family ID mismatch-Error on concurrent schema modifications

2014-11-27 Thread Eric Stevens
Be careful with creating many dynamically created column families unless you're cleaning up old ones to keep the total number of CF's reasonable. Having many column families will increase memory pressure and reduce overall performance. On Thu Nov 27 2014 at 8:19:35 AM DuyHai Doan wrote: > Hello

Re: Date Tiered Compaction Strategy and collections

2014-11-28 Thread Eric Stevens
The underlying write time is still tracked for each value in the collection - it's part of how conflict resolution is managed - but it's not exposed through CQL. On Fri Nov 28 2014 at 4:18:47 AM Batranut Bogdan wrote: > Hello all, > > If one has a table like this: > id text, > ts timestamp > val

Re: Column family ID mismatch-Error on concurrent schema modifications

2014-11-28 Thread Eric Stevens
@Jens, > will "inactive" CFs be released from C*'s memory after i.e. a few days > or when under resource pressure? No, certain memory structures are allocated and will remain resident on each node for as long as the table exists. > These CFs are used as "time buckets", but are to be kept for spe

Re: Recommissioned node is much smaller

2014-12-03 Thread Eric Stevens
How does the difference in load compare to the effective ownership? If you deleted the system directory as well, you should end up with new ranges, so I'm wondering if perhaps you just ended up with a really bad shuffle. Did you run removenode on the old host after you took it down (I assume so si

Re: Cassandra taking snapshots automatically?

2014-12-03 Thread Eric Stevens
Do you have snapshot_before_compaction enabled? http://datastax.com/documentation/cassandra/2.0/cassandra/configuration/configCassandra_yaml_r.html#reference_ds_qfg_n1r_1k__snapshot_before_compaction On Wed Dec 03 2014 at 10:25:12 AM Robert Wille wrote: > I built my first multi-node cluster and

Re: Recommissioned node is much smaller

2014-12-03 Thread Eric Stevens
correlation. > > I think the moral of the story is that I shouldn’t delete the system > directory. If I have issues with a node, I should recommission it properly. > > Robert > > > On Dec 3, 2014, at 10:23 AM, Eric Stevens wrote: > > How does the difference in loa

Re: Pros and cons of lots of very small partitions versus fewer larger partitions

2014-12-06 Thread Eric Stevens
B would work better in the case where you need to do sequential or ranged style reads on the id, particularly if id has any significant sparseness (eg, id is a timeuuid). You can compute the buckets and do reads of entire buckets within your range. However if you're doing random access by id, the

Re: nodetool repair exception

2014-12-06 Thread Eric Stevens
The official recommendation is 100k: http://www.datastax.com/documentation/cassandra/2.0/cassandra/install/installRecommendSettings.html I wonder if there's an advantage to this over unlimited if you're running servers which are dedicated to your Cassandra cluster (which you should be for anything

Re: How to model data to achieve specific data locality

2014-12-06 Thread Eric Stevens
It depends on the size of your data, but if your data is reasonably small, there should be no trouble including thousands of records on the same partition key. So a data model using PRIMARY KEY ((seq_id), seq_type) ought to work fine. If the data size per partition exceeds some threshold that rep

Re: Keyspace and table/cf limits

2014-12-06 Thread Eric Stevens
Based on recent conversations with Datastax engineers, the recommendation is definitely still to run a finite and reasonable set of column families. The best way I know of to support multitenancy is to include tenant id in all of your partition keys. On Fri Dec 05 2014 at 7:39:47 PM Kai Wang wro

Re: How to model data to achieve specific data locality

2014-12-07 Thread Eric Stevens
" is defined as your clustering column, you can have many of > them using the same table structure ... > > > > > > On Sat, Dec 6, 2014 at 10:09 PM, Kai Wang wrote: > >> On Sat, Dec 6, 2014 at 11:18 AM, Eric Stevens wrote: >> >>> It depends on

Re: Cassandra Doesn't Get Linear Performance Increment in Stress Test on Amazon EC2

2014-12-07 Thread Eric Stevens
Hi Joy, Are you resetting your data after each test run? I wonder if your tests are actually causing you to fall behind on data grooming tasks such as compaction, and so performance suffers for your later tests. There are *so many* factors which can affect performance, without reviewing test met

Re: Cassandra Doesn't Get Linear Performance Increment in Stress Test on Amazon EC2

2014-12-07 Thread Eric Stevens
whelming a cluster, you're probably already in insane and difficult to recover crisis mode). On Sun Dec 07 2014 at 8:55:47 AM Eric Stevens wrote: > Hi Joy, > > Are you resetting your data after each test run? I wonder if your tests > are actually causing you to fall behind on data

Re: How to model data to achieve specific data locality

2014-12-08 Thread Eric Stevens
>>> even plain English, but not belabored with full CQL syntax.) would be very >>> helpful. I mean, Cassandra has no “subset” concept, nor a “load subset” >>> command, so what are we really talking about? >>> >>> Also, I presume we are talking CQL, but some of t

Re: Cassandra Doesn't Get Linear Performance Increment in Stress Test on Amazon EC2

2014-12-08 Thread Eric Stevens
999)? >>> I want to get the max QPS under a certain limited latency. >>> >>> I know my testing scenario are not the common case in production, I just >>> want to know how much burden my cluster can bear under stress. >>> >>> So, how did you test your

Re: UPDATE statement is failed

2014-12-10 Thread Eric Stevens
Writing then immediately reading the same data (or reading then immediately writing) are both antipatterns in any eventually consistent system, Cassandra included. You may need to investigate Compare and Set operations and see if they will work for your needs. Or else look into Serial consistency

Using Per-Table Keyspaces for Tunable Replication

2014-12-12 Thread Eric Stevens
We're considering moving to a model where we put each of our tables in a dedicated keyspace. This is so we can tune replication per table, and change our mind about that replication on a per-table basis without a major migration. The biggest driver for this is Solr integration; we want to tune RF

Re: Using Per-Table Keyspaces for Tunable Replication

2014-12-12 Thread Eric Stevens
ng up keyspaces into logical >> replication groups makes the most sense from a maintainability and >> performance standpoint. >> >> On Fri, Dec 12, 2014 at 11:21 AM, Eric Stevens wrote: >>> >>> We're considering moving to a model where we put each

Re: batch_size_warn_threshold_in_kb

2014-12-13 Thread Eric Stevens
Jon, > The really important thing to really take away from Ryan's original post is that batches are not there for performance. > tl;dr: you probably don't want batch, you most likely want many async calls My own rudimentary testing does not bear this out - at least not if you mean to say that bat

Re: batch_size_warn_threshold_in_kb

2014-12-13 Thread Eric Stevens
You can see what the partition key strategies are for each of the tables; test5 shows the least improvement. The set (aid, end) should be unique, and bckt is derived from end. Some of these layouts result in clustering on the same partition keys; that's actually tunable with the "~15 per bucket"

Re: batch_size_warn_threshold_in_kb

2014-12-13 Thread Eric Stevens
13, 2014 at 9:44 AM, Eric Stevens wrote: > You can seen what the partition key strategies are for each of the tables, > test5 shows the least improvement. The set (aid, end) should be unique, > and bckt is derived from end. Some of these layouts result in clustering > on the same

Re: batch_size_warn_threshold_in_kb

2014-12-13 Thread Eric Stevens
ldn't > optimize for that case. > > Can you post your test code in a gist or something? I can't really talk > about your benchmark without seeing it and you're basing your stance on the > premise that it is correct, which it may not be. > > > > On Sat Dec 13

Re: batch_size_warn_threshold_in_kb

2014-12-13 Thread Eric Stevens
ow to model my application. On Sat, Dec 13, 2014 at 10:58 AM, Eric Stevens wrote: > Isn't the net effect of coordination overhead incurred by batches > basically the same as the overhead incurred by RoundRobin or other > non-token-aware request routing? As the cluster size increases

Re: batch_size_warn_threshold_in_kb

2014-12-15 Thread Eric Stevens
, bckt), proto, end) no explicit ordering = 13,111,244,000 traverse test3 ((aid, bckt), end, proto) reverse order= 25,163,064,000 traverse test5 ((aid, bckt, end)) = 30,233,744,000 On Sat, Dec 13, 2014 at 11:07 AM, Jonathan Haddad wrote: > > > On Sat D

Re: batch_size_warn_threshold_in_kb

2014-12-16 Thread Eric Stevens
> of futures, you're now blocking requests from being submitted and limiting > the throughput. > > 4) you're still only using 3 servers. The horror of using batches > increases linearly as you add servers. > > 5) What exactly are you summing in the end? The total

Re: does consistency=ALL for deletes obviate the need for tombstones?

2014-12-16 Thread Eric Stevens
No, deletes are always written as a tombstone no matter the consistency. This is because data at rest is written to sstables which are immutable once written. The tombstone marks that a record in another sstable is now deleted, and so a read of that value should be treated as if it doesn't exist.

Re: Replacing nodes disks

2014-12-22 Thread Eric Stevens
You should be able to use Cassandra's built in tooling for sure. But just be aware that restoring from a backup of the data will be a lot faster and won't introduce any stress on the existing cluster. Repair and replace operations aren't free to the other nodes, so an offline backup and restore is

Re: installing cassandra

2014-12-22 Thread Eric Stevens
If you're just trying to get your feet wet with distributed software, and your node count is going to be reasonably low and won't grow any time soon, it's probably easier to just install it yourself rather than trying to also learn how to use software deployment technologies like puppet or chef. Th

Re: CQL3 vs Thrift

2014-12-24 Thread Eric Stevens
As Ryan mentioned, CQL is simply a translation layer to the underlying storage mechanism you're already familiar with from Thrift. There are definitely corner cases where it's not possible to get a one-for-one equivalent in CQL vs Thrift, and even when there are equivalents, the underlying data migh

Re: Counter Column

2014-12-26 Thread Eric Stevens
Timestamps are timezone independent. This is a property of timestamps, not a property of Cassandra. A given moment is the same timestamp everywhere in the world. To display this in a human readable form, you then need to know what timezone you're attempting to represent the timestamp as, this is
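To illustrate the point about timestamps being timezone independent, here is a minimal Python sketch. It is not tied to Cassandra or any driver; the sample epoch value and the UTC-7 offset are assumptions chosen purely for illustration. One stored instant is rendered for two viewers, and both renderings parse back to the same value.

```python
from datetime import datetime, timedelta, timezone

def render(epoch_seconds, utc_offset_hours):
    """Format one absolute instant for display at a fixed UTC offset."""
    tz = timezone(timedelta(hours=utc_offset_hours))
    return datetime.fromtimestamp(epoch_seconds, tz).isoformat()

# Hypothetical sample instant: seconds since the Unix epoch.
instant = 1419552000

utc_view = render(instant, 0)       # rendered for a UTC viewer
denver_view = render(instant, -7)   # rendered for a UTC-7 viewer

# The two strings differ, but both parse back to the same stored value.
same_instant = (
    datetime.fromisoformat(utc_view).timestamp()
    == datetime.fromisoformat(denver_view).timestamp()
    == instant
)
```

The stored number never changes; only the display step needs to know the viewer's zone.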

Re: Why read row is so slower than read column.

2014-12-26 Thread Eric Stevens
I would suggest enabling tracing in cqlsh and seeing what it has to say. There are many things which could cause this, but I'm thinking in particular you may have a lot of tombstones which get lifted when you read the whole row, and are missed when you read just one column. On Fri, Dec 26, 2014 at 6:

Re: Counter Column

2014-12-27 Thread Eric Stevens
h some articles which mentioned that the client to pass the >>> timestamp for insert and update. Is that anyway we can avoid it and >>> Cassandra assume the current time of the server? >>> >>> Thanks >>> Ajay >>> On Dec 26, 2014 10:50 PM, "

Re: any code to load large data from web into Cassandra

2014-12-27 Thread Eric Stevens
I think Joanne is talking not about bulk loading, but about general access through a standard client driver. Joanne, this is a pretty broad topic. You would need to have some part of a website built in some language such as Python or Java or some other language. Then you would use an appropria

Re: Re: Why read row is so slower than read column.

2014-12-27 Thread Eric Stevens
Can you send us your exact data model? Even though you normally use Thrift, you may also be able to access the data from CQL, and if so, query tracing is a very powerful feature in CQL which may describe why there is a performance difference. Do you do deletes of data? If so, tombstones really m

Re: Best practice for sorting on frequent updated column?

2014-12-29 Thread Eric Stevens
This is a bit difficult. Depending on your access patterns and data volume, I'd be inclined to keep a separate table with a (count, foreign_key) clustering key. Then do a client-side join to read the data back in the order you're looking for. That will at least make the heavily updated table hav

Re: User click count

2014-12-29 Thread Eric Stevens
> If the counters get incorrect, it could't be corrected You'd have to store something that allowed you to correct it. For example, the TimeUUID approach to keep true counts, which are slow to read but accurate, and a background process that trues up your counter columns periodically. On Mon, De

Re: CQL3 vs Thrift

2014-12-29 Thread Eric Stevens
So while not exactly the same, this seems like a good analogy for suggesting a third interface to fix problems with existing interfaces: http://xkcd.com/927/ Even if the CQL parsing code in Cassandra is subpar (I haven't studied it), that's not an especially compelling case to suggest replacing th

Re: User click count

2014-12-31 Thread Eric Stevens
ct the counters. Now in this case, there will be one row >> per day but as many columns as user click. >> >> Other way is to store row per hour >> user id : /mm/dd/hh as row key and dynamic columns for each click >> with column key as timestamp and value as empty. &g

Re: Storing large files for later processing through hadoop

2015-01-02 Thread Eric Stevens
> Can this split and combine be done automatically by cassandra when inserting/fetching the file without application being bothered about it? There are client libraries which offer recipes for this, but in general, no. You're trying to do something with Cassandra that it's not designed to do. You

Re: is primary key( foo, bar) the same as primary key ( foo ) with a ‘set' of bars?

2015-01-02 Thread Eric Stevens
> And also stored entirely for each UPDATE. Change one element, re-serialize the whole thing to disk. Is this true? I thought updates (adds, removes, but not overwrites) affected just the indicated columns. Isn't it just the reads that involve reading the entire collection? DS docs talk about r

Re: Why does C* repeatedly compact the same tables over and over?

2015-01-08 Thread Eric Stevens
Are you using Leveled compaction strategy? If you fall behind on compaction in leveled (and you will during bootstrap), by default Cassandra will fall back to size tiered compaction in level 0. This will cause SSTables larger than sstable_size_in_mb, and those will be recompacted away into level

Re: Is there a way to add a new node to a cluster but not sync old data?

2015-01-11 Thread Eric Stevens
Yes, but it won't do what I suspect you're hoping for. If you disable auto_bootstrap in cassandra.yaml the node will join the cluster and will not stream any old data from existing nodes. The cluster will now be in an inconsistent state. If you bring enough nodes online this way to violate your

Re: Growing SSTable count as Cassandra does not saturate the disk I/O

2015-01-12 Thread Eric Stevens
Are you using compression on the sstables? If so, possibly you're CPU bound instead of disk bound. On Mon, Jan 12, 2015 at 3:47 AM, William Saar wrote: > Hi, > > We are running a test with Cassandra 2.1.2 on Fusion I/O drives where we > load about 2 billion rows of data during a few hours each

Re: setting up prod cluster

2015-01-12 Thread Eric Stevens
Hi Tim, replies inline below. On Sun, Jan 11, 2015 at 8:03 PM, Tim Dunphy wrote: > Hey all, > > I've been experimenting with Cassandra on a small scale and in my own > sandbox for a while now. I'm pretty used to working with it to get small > clusters up and running and gossiping with each othe

Re: Permanent ReadTimeout

2015-01-12 Thread Eric Stevens
If you're getting 30 second GC's, this all by itself could and probably does explain the problem. If you're writing exclusively to A, and there are frequent partitions between A and B, then A is potentially working a lot harder than B, because it needs to keep track of hinted handoffs to replay to

Re: Permanent ReadTimeout

2015-01-13 Thread Eric Stevens
> >> 4) I will check it as soon as possible and post it here. If you have any >> suggestion what else should I check, you are welcome :) >> >> >> >> >> On Mon, Jan 12, 2015 at 7:28 PM, Eric Stevens wrote: >> >>> If you're getting 30 seco

Re: Timeseries: Include rows immediately adjacent to range query?

2015-01-15 Thread Eric Stevens
It seems like you should be able to solve it with two more queries immediately after your first query: SELECT * FROM timeseries WHERE tstamp < ${MIN(firstQuery.tstamp)} LIMIT 1 SELECT * FROM timeseries WHERE tstamp > ${MAX(firstQuery.tstamp)} LIMIT 1 On Tue, Jan 13, 2015 at 9:31 AM, Hugo José Pi
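A rough Python sketch of the three-query idea above, run against an in-memory list instead of a real table. The function names are hypothetical, not driver calls. It assumes the "before" lookup returns the nearest earlier row (the largest tstamp below the bound), which in CQL would depend on the clustering order or an explicit descending ORDER BY.

```python
def query_range(tstamps, start, end):
    """Stand-in for the first query: rows with start <= tstamp <= end."""
    return sorted(t for t in tstamps if start <= t <= end)

def query_before(tstamps, bound):
    """Stand-in for 'tstamp < bound LIMIT 1', nearest row first."""
    earlier = [t for t in tstamps if t < bound]
    return [max(earlier)] if earlier else []

def query_after(tstamps, bound):
    """Stand-in for 'tstamp > bound LIMIT 1', nearest row first."""
    later = [t for t in tstamps if t > bound]
    return [min(later)] if later else []

def range_with_neighbors(tstamps, start, end):
    """Range query plus the single rows immediately adjacent on each side."""
    first = query_range(tstamps, start, end)
    # If the range is empty, fall back to the requested bounds (an assumption).
    low = first[0] if first else start
    high = first[-1] if first else end
    return query_before(tstamps, low) + first + query_after(tstamps, high)

sample = [10, 20, 30, 40, 50]                        # hypothetical timestamps
widened = range_with_neighbors(sample, 25, 45)       # range rows plus neighbors
edge_case = range_with_neighbors(sample, 10, 50)     # nothing outside the range
```

In this sketch the two follow-up lookups are independent of each other, so a client could issue them concurrently once the first result is in.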

Re: Storing PDF data on Cassandra db

2015-01-15 Thread Eric Stevens
@DENIZ, Jon's point is that CQL is the new standard, Thrift is frozen and being deprecated. Anything you build using the Thrift interface will hurt you over time, so you ought to just go for CQL. There really is next to no reason not to use CQL aside from personal preference, and that argument do

Re: Many really small SSTables

2015-01-15 Thread Eric Stevens
Yes, many sstables can have a huge negative impact on read performance, and will also create memory pressure on that node. There are a lot of things which can produce this effect, and it also strongly suggests you're falling behind on compaction in general (check nodetool compactionstats, you should

Re: Growing SSTable count as Cassandra does not saturate the disk I/O

2015-01-15 Thread Eric Stevens
mber of concurrent compactors do not seem to help). Oddly > enough, one node has just 160 SSTables while the rest are at 500-600 > tables. > > > > Is size-tiered compaction easier on the CPU than leveled compaction? > > > > Thanks, > > William > > > >

Re: Many really small SSTables

2015-01-16 Thread Eric Stevens
ables accessed, that changed from 2.0 to 2.1 I think > as I see this only on testing cluster). > > I looks to me that compactions were not triggered. I tried a nodetool > compact on one node overnight - but that crashed the entire node. > > Roland > > Am 15.01.2015 um

Re: Retrieving all row keys of a CF

2015-01-16 Thread Eric Stevens
Note that getAllRows() is deprecated in Astyanax. You should prefer to use the AllRowsReader recipe: https://github.com/Netflix/astyanax/wiki/AllRowsReader-All-rows-query Note the

Re: Retrieving all row keys of a CF

2015-01-17 Thread Eric Stevens
If you're getting partial data back, then failing eventually, try setting .withCheckpointManager() - this will let you keep track of the token ranges you've successfully processed, and not attempt to reprocess them. This will also let you set up tasks on bigger data sets that take hours or days to

Re: "Not enough replica available” when consistency is ONE?

2015-01-18 Thread Eric Stevens
Check out http://www.datastax.com/documentation/cassandra/2.0/cassandra/dml/dml_tunable_consistency_c.html > Cassandra 2.0 uses the Paxos consensus protocol, which resembles 2-phase commit, to support linearizable consistency. All operations are quorum-based ... This kicks in whenever you do CAS

Re: number of replicas per data center?

2015-01-19 Thread Eric Stevens
> Ah.. six replicas. At least its super inexpensive that way (sarcasm!) Well it's up to you to decide what your data locality and fault tolerance requirements are. If you want to run two DC's, costs are going to increase since each DC has a full set of replicas within itself. But you get the be

Re: Cassandra fetches complete partition

2015-01-19 Thread Eric Stevens
It depends on your version of Cassandra. I would suggest starting with this, which describes the differences between 2.0 and 2.1 http://www.datastax.com/dev/blog/row-caching-in-cassandra-2-1 In particular: > In previous releases, this cache has required storing the entire partition in memory, wh

Re: Nodetool removenode stuck

2015-01-19 Thread Eric Stevens
I've seen removenode hang indefinitely also (per CASSANDRA-6542). Generally speaking, if a node is in good health and you want to take it out of the cluster for whatever reason (including the one you mentioned), nodetool decommission is a better choice. Removenode is for when a node is unrecoverab

Re: Compaction failing to trigger

2015-01-20 Thread Eric Stevens
@Rob - he's probably referring to the thread titled "Reasons for nodes not compacting?" where Tyler speculates that the tables are falling below the cold read threshold for compaction. He speculated it may be a bug. At the same time in a different thread, Roland had a similar problem, and Tyler's

Re: Is there a way to add a new node to a cluster but not sync old data?

2015-01-21 Thread Eric Stevens
atong Zhang wrote: > Thanks for the reply. The bootstrap of new node put a heavy burden on the > whole cluster and I don't know why. So that' the issue I want to fix > actually. > > On Mon, Jan 12, 2015 at 6:08 AM, Eric Stevens wrote: > >> Yes, but it won't d

Re: Which Topology fits best ?

2015-01-25 Thread Eric Stevens
As far as I know they're effectively the same. NetworkTopologyStrategy is useful when you want to set up separate RF per DC, such as if you want to have an analytics DC with lower RF to save money. On Sun, Jan 25, 2015 at 8:01 AM, SEGALIS Morgan wrote: > Hi everyone, > I need one more time your

Re: Re: Dynamic Columns

2015-01-26 Thread Eric Stevens
> are you really recommending I throw 4 years of work out and completely rewrite code that works and has been tested? Our codebase was about 3 years old, and we finished migrating it to CQL not that long ago. It can definitely be frustrating to have to touch stable code to modernize it. Our desi

Re: SStables can't compat automaticly

2015-01-26 Thread Eric Stevens
If you are doing only writes and no reads, then 'cold_reads_to_omit' is probably preventing your cluster from crossing a threshold where it decides it needs to engage in compaction. Setting it to 0.0 should fix this, but remember that you tuned it as you should be able to revert it to default thre

Re: Controlling the MAX SIZE of sstables after compaction

2015-01-26 Thread Eric Stevens
If you're concerned about impacting production performance, the steps of compacting and sstable2json will almost certainly also cause performance problems if performed on the same hardware. You won't get away from a production performance impact as long as you're using production hardware. If you

Re: Tombstone gc after gc grace seconds

2015-01-26 Thread Eric Stevens
My understanding is consistent with Alain's: there's no way to force a tombstone-only compaction; your only option is major compaction. If you're using size tiered, that comes with its own drawbacks. I wonder if there's a technical limitation that prevents introducing a shadowed data cleanup styl

Re: Using Cassandra for geospacial search

2015-01-26 Thread Eric Stevens
We're getting ready for geo-oriented interactions with C*; our plan is to use DSE Solr integration for this purpose. On Mon, Jan 26, 2015 at 8:47 AM, SEGALIS Morgan wrote: > Hi everyone, > > I wanted to know if someone has a feedback using geoHash algorithme with > cassandra ? > > I will have to

Re: Fixtures / CI docker

2015-01-26 Thread Eric Stevens
I don't have directly relevant advice, especially WRT getting a meaningful and coherent subset of your production data - that's probably too closely coupled with your business logic. Perhaps you can run a testing cluster with a default TTL on all your tables of ~2 weeks, feeding it with real produ

Re: Using Cassandra for geospacial search

2015-01-26 Thread Eric Stevens
wrote: > That's actually GREAT news !! + Solr will give a lot of feature to > Cassandra ! > > But while waiting for this huge feature (and wanted for a lot of users I > guess) > I guess that Prefix search will also be useful for using geohash... > > 2015-01-26 18:12 G

Re: Using Cassandra for geospacial search

2015-01-26 Thread Eric Stevens
Using Cassandra triggers is a fairly dangerous proposition, and generally not recommended. It's probably a better idea to load your search data with a separate process. On Mon, Jan 26, 2015 at 11:42 AM, Brian Sam-Bodden wrote: > I did an little experiment with a Geohash/Geocells > h

Re: error while bulk loading using copy command

2015-01-29 Thread Eric Stevens
As the error implies, you cannot insert into counters tables, you can only update them as increments or decrements (updating a counter that doesn't exist will create it with the initial delta as if it had started at zero). I would recommend this documentation which describes how to update counters
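The counter semantics described above can be modeled with a small in-memory stand-in. This is only an illustration of the behavior (no inserts, deltas only, a missing counter starting from zero). The class and method names are hypothetical and are not part of Cassandra or any driver.

```python
from collections import defaultdict

class CounterTable:
    """Toy model of a counter table: values change only by deltas."""

    def __init__(self):
        self._counts = defaultdict(int)

    def insert(self, key, value):
        # Mirrors the restriction: counters cannot be set directly.
        raise ValueError("counter tables accept increments and decrements only")

    def update(self, key, delta):
        # A counter that does not exist yet behaves as if it started at zero.
        self._counts[key] += delta
        return self._counts[key]

table = CounterTable()
after_first = table.update("page:home", 5)    # created by its first delta
after_second = table.update("page:home", -2)  # decrement applied on top
```

A bulk load into such a table therefore has to be expressed as a stream of increments and decrements, not as a set of literal values.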

Re: Performance difference between Regular Statement Vs PreparedStatement

2015-01-29 Thread Eric Stevens
Prepared statements can take advantage of token aware routing which IIRC non-prepared statements cannot in the DS Java Driver, so as your cluster grows you reduce the overhead of statement coordination (assuming you use token aware routing). There should also be less data to transfer for shipping

Re: Performance difference between Regular Statement Vs PreparedStatement

2015-01-29 Thread Eric Stevens
That's not a particularly good setup for load testing, I would try hard not to draw any conclusions from it. Most likely your biggest bottleneck is I/O in your VM's, and any savings from using prepared statements dwarf in comparison to the price of virtualization. Point 2's effects are also minim

Re: Help on modeling a table

2015-02-02 Thread Eric Stevens
Just a minor observation: those field names are extremely long. You store a copy of every field name with every value with only a couple of exceptions: http://www.datastax.com/documentation/cassandra/1.2/cassandra/architecture/architecturePlanningUserData_t.html Your partition key column name (lo

Re: Cassandra on Ceph

2015-02-02 Thread Eric Stevens
Colin, I'm not familiar with Ceph, but it sounds like it's a more sophisticated version of a SAN. Be aware that running Cassandra on absolutely anything other than local disks is an anti-pattern. It will have a profound negative impact on performance, scalability, and reliability of your cluster.

Re: Smart column searching for a particular rowKey

2015-02-03 Thread Eric Stevens
WHERE < + ORDER DESC + LIMIT should be able to accomplish that. On Tue, Feb 3, 2015 at 11:28 AM, Ravi Agrawal wrote: > Hi Guys, > > Need help with this. > > My rowKey is stockName like GOOGLE, APPLE. > > Columns are sorted as per timestamp and they include some set of data > fields like price a

Re: Smart column searching for a particular rowKey

2015-02-04 Thread Eric Stevens
2:44 PM > *To:* user@cassandra.apache.org > *Subject:* RE: Smart column searching for a particular rowKey > > > > Thanks, it does. > > How about in astyanax? > > > > *From:* Eric Stevens [mailto:migh...@gmail.com ] > *Sent:* Tuesday, February 03, 2015 1:49 PM >

Re: Mutable primary key in a table

2015-02-07 Thread Eric Stevens
I'm struggling to think of a model where it makes sense to update a primary key as a typical operation. It suggests, as Adil said, that you may be reasoning wrong about your data model. Maybe you can explain your problem in more detail - what kind of thing has you updating your PK on a regular ba

Re: Mutable primary key in a table

2015-02-08 Thread Eric Stevens
ting user data. > 3. Update the old user row to indicate that it is no longer a valid user. > Actually, you will have to decide on an application policy for old user > names. For example, can they be reused, or are they locked, or... whatever. > > > -- Jack Krupansky > > On

Re: Out of Memory Error While Opening SSTables on Startup

2015-02-10 Thread Eric Stevens
This kind of recovery is definitely not my strong point, so feedback on this approach would certainly be welcome. As I understand it, if you really want to keep that data, you ought to be able to mv it out of the way to get your node online, then move those files in several thousand at a time, n

Re: Pagination support on Java Driver Query API

2015-02-11 Thread Eric Stevens
> I can't believe that everyone read & process all rows at once (without pagination). Probably not too many people try to read all rows in a table as a single rolling operation with a standard client driver. But those who do would use token() to keep track of where they are and be able to resume

Re: Recommissioned a node

2015-02-11 Thread Eric Stevens
AFAIK it should be ok after the repair completed (it was missing all writes while it was decommissioning and while it was offline, and nobody would have been keeping hinted handoffs for it, so repair was the right thing to do). Unless RF=N you're now due for a cleanup on the other nodes. Generall

Re: changes to metricsReporterConfigFile requires restart of cassandra?

2015-02-11 Thread Eric Stevens
AFAIK yes. If you want just a subset of the metrics, I would suggest exporting them all, and filtering on the Graphite side. On Wed, Feb 11, 2015 at 6:54 AM, Erik Forsberg wrote: > Hi! > > I was pleased to find out that cassandra 2.0.x has added support for > pluggable metrics export, which eve

Re: Recommissioned a node

2015-02-11 Thread Eric Stevens
cause the machine got restarted (with auto_bootstrap set to true). A > cleaner, and correct, recommission would have just required wiping the data > folder, am I correct? Or would I have needed to change something else in > the node configuration? > > Cheers, > Stefano >

Re: Pagination support on Java Driver Query API

2015-02-12 Thread Eric Stevens
Your page state then needs to track the last ck1 and last ck2 you saw. Pages 2+ will end up needing to be up to two queries if the first query doesn't fill the page size. CREATE TABLE foo ( partitionkey int, ck1 int, ck2 int, col1 int, col2 int, PRIMARY KEY ((partitionkey), ck1, ck2) )
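
A rough sketch of the "up to two queries" idea against the table `foo` above, emulated over an in-memory list so the logic can be checked without a cluster. It assumes the default ascending clustering order; the two CQL strings are only illustrative of what a client might send, and `next_page` is a hypothetical helper, not part of any driver.

```python
from typing import List, Optional, Tuple

Row = Tuple[int, int, int]  # (partitionkey, ck1, ck2)

# Illustrative statements for pages 2+ (ascending clustering order assumed).
REST_OF_CK1 = "SELECT * FROM foo WHERE partitionkey = ? AND ck1 = ? AND ck2 > ? LIMIT ?"
FOLLOWING_CK1 = "SELECT * FROM foo WHERE partitionkey = ? AND ck1 > ? LIMIT ?"


def next_page(
    rows: List[Row],
    pk: int,
    last: Optional[Tuple[int, int]],
    page_size: int,
) -> List[Row]:
    """Return the next page after `last` = (last ck1, last ck2) seen."""
    partition = sorted(r for r in rows if r[0] == pk)
    if last is None:
        return partition[:page_size]  # first page needs no page state
    last_ck1, last_ck2 = last
    # Query 1: finish the ck1 group the previous page stopped in.
    page = [r for r in partition if r[1] == last_ck1 and r[2] > last_ck2]
    page = page[:page_size]
    # Query 2: only needed when query 1 did not fill the page.
    if len(page) < page_size:
        remaining = page_size - len(page)
        page += [r for r in partition if r[1] > last_ck1][:remaining]
    return page
```

The caller keeps `(ck1, ck2)` from the last row of each page as its page state and passes it back in for the next call.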

Re: Recommissioned a node

2015-02-12 Thread Eric Stevens
I definitely find it surprising that a node which was decommissioned is willing to rejoin a cluster. I can't think of any legitimate scenario where you'd want that, and I'm surprised the node doesn't track that it was decommissioned and refuse to rejoin without at least a -D flag to force it. Way

Re: High GC activity on node with 4TB on data

2015-02-12 Thread Eric Stevens
> each node has 256G of memory, 24x1T drives, 2x Xeon CPU I don't have first hand experience running Cassandra on such massive hardware, but it strikes me that these machines are dramatically oversized to be good candidates for Cassandra (though I wonder how many cores are in those CPUs; I'm guess

Re: Pagination support on Java Driver Query API

2015-02-12 Thread Eric Stevens
ol2 > > --+-+-+--+-- > > 1 | 2 | 2 |5 |5 > > 1 | 2 | 1 |4 |4 > > > > Basically you pass ck1 and ck2 values from the last row of the previous > result and tell C* you want results that are greater (

Re: Pagination support on Java Driver Query API

2015-02-12 Thread Eric Stevens
eployed Tomcat clusters. Hence whatever we cache to support > pagination, need to be cached in a distributed way for failover support. > > It (pagination support) is best done at the server side like ROWNUM in SQL > or better done in Java driver to hide the internal details and can be > optimi

Re: Recommissioned a node

2015-02-13 Thread Eric Stevens
I created an issue for this: https://issues.apache.org/jira/browse/CASSANDRA-8801 On Thu, Feb 12, 2015 at 10:18 AM, Robert Coli wrote: > On Thu, Feb 12, 2015 at 7:04 AM, Eric Stevens wrote: > >> IMO, especially with the threat to unrecoverable consistency violations, >>

Re: Adding new node to cluster

2015-02-17 Thread Eric Stevens
> Seed nodes apparently don’t bootstrap That's right, if a node has itself in its own seeds list, it assumes it's a foundational member of the cluster, and it will join immediately with no bootstrap. If you've done this by accident, you should do nodetool decommission on that node, and when it's

Re: Running Cassandra + Spark on AWS - architecture questions

2015-02-22 Thread Eric Stevens
I'm not sure if this is a good use case for you, but you might also consider setting up several keyspaces, one for any data you want available for analytics (such as business object tables), and one for data you don't want to do analytics on (such as custom secondary indices). Maybe a third one fo

Re: Why no virtual nodes for Cassandra on EC2?

2015-02-23 Thread Eric Stevens
Vnodes is officially disrecommended for DSE Solr integration (though a small number isn't ruinous). That might be why they still don't enable them by default. On Feb 21, 2015 3:58 PM, "mck" wrote: > At least the problem of hadoop and vnodes described in CASSANDRA-6091 > doesn't apply to spark. >

Re: Why no virtual nodes for Cassandra on EC2?

2015-02-23 Thread Eric Stevens
y. > > -- Jack Krupansky > > On Mon, Feb 23, 2015 at 11:23 AM, Eric Stevens wrote: > >> Vnodes is officially disrecommended for DSE Solr integration (though a >> small number isn't ruinous). That might be why they still don't enable them >> by default

Re: Why no virtual nodes for Cassandra on EC2?

2015-02-23 Thread Eric Stevens
ing out a mistake in the doc - that statement (for > Search/Solr) was simply a leftover from before 4.6. Besides, it's in the > Analytics section, which is not relevant for Search/Solr anyway. > > -- Jack Krupansky > > On Mon, Feb 23, 2015 at 11:54 AM, Eric Stevens wrote:

Re: best practices for time-series data with massive amounts of records

2015-03-07 Thread Eric Stevens
It's probably quite rare to query the whole set of data in an extremely large time series. Instead there's almost always a "Between X and Y dates" aspect to nearly every real time query you might have against a table like this (with the exception of "most recent N events"). Because of th
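
As an illustration of how a "Between X and Y dates" query might be served, here is a small sketch that assumes the table's partition key includes a one-day bucket stored as an ISO date string. That bucketing scheme and the `day_buckets` helper are assumptions made for this example, not something stated in the message.

```python
from datetime import date, timedelta
from typing import List


def day_buckets(start: date, end: date) -> List[str]:
    """List the day buckets (inclusive) that a date range touches.

    A client would issue one query per bucket and merge the results.
    """
    if end < start:
        raise ValueError("end must not be before start")
    span = (end - start).days
    return [(start + timedelta(days=i)).isoformat() for i in range(span + 1)]
```

A range of a few days then touches only a few partitions, however large the full series grows.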
