Re: Fast Writes to Cassandra Failing Through Python Script

2018-03-15 Thread Jonathan Haddad
Generally speaking, you don't need to. I almost never do. I've only set it in situations where I've had a large number of tables and I want to avoid a lot of flushing when commit log segments are removed. Setting it to 128 milliseconds means it's flushing 8 times per second, which gives no benef

Re: replace dead node vs remove node

2018-03-22 Thread Jonathan Haddad
Under normal circumstances this is not true. Take a look at org.apache.cassandra.service.StorageProxy#performWrite, it grabs both the natural endpoints and the pending endpoints (new nodes). They're eventually passed through to org.apache.cassandra.locator.AbstractReplicationStrategy#getWriteResp

Re: replace dead node vs remove node

2018-03-22 Thread Jonathan Haddad
Ah sorry - I misread the original post - for some reason I had it in my head the question was about bootstrap. Carry on. On Thu, Mar 22, 2018 at 8:35 PM Jonathan Haddad wrote: > Under normal circumstances this is not true. > > Take a look at org.apache.cassandra.service.Sto

Re: Using Spark to delete from Transactional Cluster

2018-03-23 Thread Jonathan Haddad
I'm confused as to what the difference between deleting with prepared statements and deleting through spark is? To the best of my knowledge either way it's the same thing - normal deletion with tombstones replicated. Is it that you're doing deletes in the analytics DC instead of your real time on

Re: Update to C* 3.0.14 from 3.0.10

2018-03-23 Thread Jonathan Haddad
3.0.16 is the latest, I recommend going all the way up. About a hundred bug fixes: Jon On Fri, Mar 23, 2018 at 2:22 PM Dmitry Saprykin wrote: > Hi, > > I successfully used 3.0.14 more than a year in production. And moreover > 3

Re: Can "data_file_directories" make use of multiple disks?

2018-03-27 Thread Jonathan Haddad
In Cassandra 3.2 and later, data is partitioned by token range, which should give you even distribution of data. If you're going to go into 3.x, please use the latest 3.11, which at this time is 3.11.2. On Tue, Mar 27, 2018 at 8:05 AM Venkata Hari Krishna Nukala <

Re: Is Cassandra used in Medical industry?

2018-03-29 Thread Jonathan Haddad
I haven't used Vormetric, but have worked with a couple teams doing disk encryption using LUKS: I haven't read through that FDA guideline, and tbh I'm not going to - if there's a specific question you have it would be better to ask it r

Re: Is Cassandra used in Medical industry?

2018-03-29 Thread Jonathan Haddad
If you require a full audit trail then you'll need to do this in your data model. I recommend looking to event sourcing, which is a way of tracking all changes to an entity over its lifetime. Instead of thinking of data as global mutable state,

Re: Latest version and Features

2018-04-11 Thread Jonathan Haddad
Move to the latest 3.0, or if you're feeling a little more adventurous, 3.11.2. 4.0 discussion is happening now, nothing is decided. On Wed, Apr 11, 2018 at 7:35 AM Abdul Patel wrote: > Hi All, > > Our company is planning for upgrading cassandra to maitain the audit > gudilines for patch cycle.

Re: JVM Tuning post

2018-04-11 Thread Jonathan Haddad
Re G1GC in Java 9, yes it's the default, but we explicitly specify the collector when we start Cassandra. Regarding load testing, some folks like cassandra-stress, but personally I think second to production itself, there's nothing better than an environment running the full application stack wit

Re: Latest version and Features

2018-04-11 Thread Jonathan Haddad
pache/cassandra/blob/trunk/NEWS.txt >>>>> >>>>> You'll find everything you need IMHO >>>>> >>>>> On 11 April 2018 at 17:05, Abdul Patel wrote: >>>>> >>>>>> Thanks. >>>>>> >>

Re: Cassandra datastax cerrification

2018-04-14 Thread Jonathan Haddad
The original question was about prepping. I think that might be a question best suited for datastax, since you’re paying them for the cert. On Sat, Apr 14, 2018 at 9:02 AM Ben Bromhead wrote: > Certification is only as good as the organizations that recognize it. > Identify what you want to get o

Re: 答复: Time serial column family design

2018-04-17 Thread Jonathan Haddad
To add to what Nate suggested, we have an entire blog post on scaling time series data models: Jon On Tue, Apr 17, 2018 at 7:39 PM Nate McCall wrote: > I disagree. Create date as a raw integer is an excellen

Re: Reading Cassandra's Blob from Apache Ignite

2018-04-25 Thread Jonathan Haddad
I think you’ll have better luck with the ignite list, as this looks like an ignite configuration problem. On Wed, Apr 25, 2018 at 3:09 AM wrote: > Dear Community, > > > > I'm trying to read the contents of Cassandra table from Ignite(acting as > cache). The table is given below:: > > CREATE TABLE

Re: Version Upgrade

2018-04-25 Thread Jonathan Haddad
There's no harm in running it during any upgrade, and I always recommend doing it just to be in the habit. My 2 cents. On Wed, Apr 25, 2018 at 3:39 PM Christophe Schmitz wrote: > Hi Pranay, > > You only need to upgrade your SSTables when you perform a major Cassandr

Re: Repair of 5GB data vs. disk throughput does not make sense

2018-04-26 Thread Jonathan Haddad
I can't say for sure, because I haven't measured it, but I've seen a combination of readahead + large chunk size with compression cause serious issues with read amplification, although I'm not sure if or how it would apply here. Likely depends on the size of your partitions and the fragmentation o

Re: Adding new nodes to cluster to speedup pending compactions

2018-04-27 Thread Jonathan Haddad
Your compaction time won't improve immediately simply by adding nodes because the old data still needs to be cleaned up. What's your end goal? Why is having a spike in pending compaction tasks following a massive write an issue? Are you seeing a dip in performance, violating an SLA, or do you ju

Re: Switching to TWCS

2018-04-27 Thread Jonathan Haddad
TWCS uses the max timestamp in an sstable to determine what to compact together, it won't anti-compact your data. The goal is to minimize I/O. You'll have to wait for all your mixed-timestamp sstable data to TTL out before TWCS's windowing kicks in optimally.
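
A minimal sketch of why mixed-timestamp sstables delay the windowing, assuming each sstable is described by a hypothetical (name, min timestamp, max timestamp) tuple and that windows are simple fixed-size slices of time. This is not Cassandra's actual TWCS code, only an illustration of grouping by max timestamp.

```python
from collections import defaultdict

def window_start(timestamp_s, window_size_s):
    """Lower bound of the fixed-size window that contains the timestamp."""
    return timestamp_s - (timestamp_s % window_size_s)

def group_by_window(sstables, window_size_s):
    """Group sstable names by the window of their MAX timestamp only."""
    windows = defaultdict(list)
    for name, _min_ts, max_ts in sstables:
        windows[window_start(max_ts, window_size_s)].append(name)
    return dict(windows)

HOUR = 3600
# (name, min_timestamp, max_timestamp) in seconds; values are made up.
sstables = [
    ("old", 10, 3599),     # entirely inside the first hour
    ("mixed", 20, 7300),   # spans three hours, grouped by its max only
    ("new", 7250, 7400),   # entirely inside the third hour
]
buckets = group_by_window(sstables, HOUR)
```

Here "mixed" lands in the same window as "new" even though most of its data is older, so the old data in it cannot be dropped as a unit until the whole file has expired.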

Re: Solve Busy pool at Cassandra side

2018-05-13 Thread Jonathan Haddad
This error comes from com.datastax.driver.core.HostConnectionPool#enqueue, which is the client side pool. Cassandra can handle more requests, the application needs to be fixed. As per the java docs: /** * Indicates that a connection pool has run out of available connections. * * This happens

Re: Reading from big partitions

2018-05-19 Thread Jonathan Haddad
What disks are you using? How many sstables are you hitting? Did you try tracing the request? On Sat, May 19, 2018 at 8:43 PM onmstester onmstester wrote: > Hi, > Due to some unpredictable behavior in input data i end up with some > hundred partitions having more than 300MB size. Reading any seq

Re: Question About Reaper

2018-05-20 Thread Jonathan Haddad
FWIW the largest deployment I know about is a single reaper instance managing 50 clusters and over 2000 nodes. There might be bigger, but I either don’t know about it or can’t remember. On Sun, May 20, 2018 at 10:04 AM Abdul Patel wrote: > Hi, > > I recently tested reaper and it actually helped

Re: cassandra update vs insert + delete

2018-05-27 Thread Jonathan Haddad
What is a “soft delete”? My 2 cents, if you want to update some information just update it. There’s no need to overthink it. Batches are good if they’re constrained to a single partition, not so hot otherwise. On Sun, May 27, 2018 at 8:19 AM Rahul Singh wrote: > Deletes create tombstones — no

Re: Time Series schema performance

2018-05-29 Thread Jonathan Haddad
I wrote a post on this topic a while ago, might be worth reading over: On Tue, May 29, 2018 at 8:02 AM Jeff Jirsa wrote: > There’s a third option which is doing bucketing by time instead of by hash, which tends

Re: Mongo DB vs Cassandra

2018-05-31 Thread Jonathan Haddad
I haven’t seen any query requirements, which is going to be the thing that makes Cassandra difficult. If you can’t define your queries beforehand, cassandra is a no go. If you just want to store data somewhere, and it’s just CSV, I’d go with a simple blob store like s3 and pick a DB later when you

Re: Compaction strategy for update heavy workload

2018-06-13 Thread Jonathan Haddad
I wouldn't use TWCS if there's updates, you're going to risk having data that's never deleted and really small sstables sticking around forever. If you use really large buckets, what's the point of TWCS? Honestly this is such a small workload you could easily use STCS or LCS and you'd likely neve

Re: Cassandra Client Program not Working with NettySSLOptions

2018-06-19 Thread Jonathan Haddad
Is the server configured to use encryption? On Tue, Jun 19, 2018 at 3:59 AM Jahar Tyagi wrote: > Hi, > > I referred to this link > > to > implement a simple Ca

Re: Incremental Backup Hardlinks

2018-07-19 Thread Jonathan Haddad
The hard links are created after the SSTables have finished writing. On Thu, Jul 19, 2018 at 9:51 AM David Payne wrote: > Hello Cassandra Experts and Committers, > > > > Hopefully this is just a dumb question, but without the skill set to read > the source code, I must ask. > > > > Consider in

Re: Timeout for only one keyspace in cluster

2018-07-23 Thread Jonathan Haddad
You don’t get this guarantee with counters. Do not use them for unique values. Use a UUID instead. On Mon, Jul 23, 2018 at 9:11 AM learner dba wrote: > James, > > Yes, counter is implemented due to valid reasons. We need this value > column to have unique values being used at the time of regis
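
A minimal sketch of the UUID suggestion, assuming a Python client; `new_registration_id` is a hypothetical helper name, not something from the thread. Random UUIDs need no coordination between clients, which is what a counter cannot offer.

```python
import uuid

def new_registration_id():
    """Return a random (version 4) UUID generated locally by the client."""
    return uuid.uuid4()

# Generate a batch of ids; each one is produced without talking to the cluster.
ids = {new_registration_id() for _ in range(1000)}
```

The value can be bound to a `uuid` column as-is by drivers that map Python's `uuid.UUID` type.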

Reaper 1.2 released

2018-07-24 Thread Jonathan Haddad
Hey folks, Just wanted to share with the list that after a bit of a long wait, we've released Reaper 1.2. We have a short blog post here outlining the new features: With each release we've worked on performance improvements and stabili

Re: Secure data

2018-08-01 Thread Jonathan Haddad
You can also get full disk encryption with LUKS, which I've used before. On Wed, Aug 1, 2018 at 12:36 PM Jeff Jirsa wrote: > EBS encryption worked well on gp2 volumes (never tried it on any others) > > -- > Jeff Jirsa > > > On Aug 1, 2018, at 7:57 AM, Rahul Reddy wrote: > > Hello, > > Any one t

Re: Secure data

2018-08-01 Thread Jonathan Haddad
his last year: > > > We also use encrypted GP2 EBS pretty widely without issue. > > Cheers > Ben > > On Thu, 2 Aug 2018 at 05:38 Jonathan Haddad wrote: > >> You can also get

Re: Apache Cassandra 3.11.3 Question

2018-08-04 Thread Jonathan Haddad
This strategy is a lot more work than just replacing nodes one at a time. For a large cluster it would be months of work instead of a couple days. On Sat, Aug 4, 2018 at 7:04 AM R1 J1 wrote: > Can a cluster having 3.11.0 node(s) accept a 3.11.3 node as a new node > for eventual migration and

Re: Bootstrap OOM issues with Cassandra 3.11.1

2018-08-07 Thread Jonathan Haddad
By default Cassandra is set to generate a heap dump on OOM. It can be a bit tricky to figure out what’s going on exactly but it’s the best evidence you can work with. On Tue, Aug 7, 2018 at 6:30 AM Laszlo Szabo wrote: > Hi, > > Thanks for the fast response! > > We are not using any materialized

Re: TWCS Compaction backed up

2018-08-07 Thread Jonathan Haddad
What's your window size? When you say backed up, how are you measuring that? Are there pending tasks or do you just see more files than you expect? On Tue, Aug 7, 2018 at 4:38 PM Brian Spindler wrote: > Hey guys, quick question: > > I've got a v2.1 cassandra cluster, 12 nodes on aws i3.2xl, co

Compression Tuning Tutorial

2018-08-08 Thread Jonathan Haddad
Hey folks, We've noticed a lot over the years that people create tables usually leaving the default compression parameters, and have spent a lot of time helping teams figure out the right settings for their cluster based on their workload. I finally managed to write some thoughts down along with

Re: Compression Tuning Tutorial

2018-08-09 Thread Jonathan Haddad
stack wasn't > implemented, except lack of resources to do this? > > > Regards, > > Kyrill > -- > *From:* Eric Plowe > *Sent:* Wednesday, August 8, 2018 9:39:44 PM > *To:* > *Subject:* Re: Compression Tuning Tutoria

Java 11 support in Cassandra 4.0 + Early Testing and Feedback

2018-08-16 Thread Jonathan Haddad
Hey folks, As we start to get ready to feature freeze trunk for 4.0, it's going to be important to get a lot of community feedback. This is going to be a big release for a number of reasons. * Virtual tables. Finally a nice way of querying for system metrics & status * Streaming optimizations (

Re: JBOD disk failure - just say no

2018-08-22 Thread Jonathan Haddad
We recently helped a team deal with some JBOD issues, they can be quite painful, and the experience depends a bit on the C* version in use. We wrote a blog post about it (published today): Hope this

Re: [EXTERNAL] Re: adding multiple node to a cluster, cleanup and num_tokens

2018-09-08 Thread Jonathan Haddad
256 tokens is a pretty terrible default setting especially post 3.0. I recommend folks use 4 tokens for new clusters, with some caveats. When you fire up a cluster, there's no way to make the initial tokens be distributed evenly, you'll get random ones. You'll want to set them explicitly using:

Re: [EXTERNAL] Re: adding multiple node to a cluster, cleanup and num_tokens

2018-09-08 Thread Jonathan Haddad
please explain more why should i run that python command and > config allocate_tokens_for_keyspace? i only have one keyspace per cluster. > Im using Network replication strategy, and a rack-aware topology config. > > Sent using Zoho Mail <> > &g

Re: [EXTERNAL] Re: adding multiple node to a cluster, cleanup and num_tokens

2018-09-08 Thread Jonathan Haddad
Shulgin wrote: > On Sat, 8 Sep 2018, 14:47 Jonathan Haddad, wrote: > >> 256 tokens is a pretty terrible default setting especially post 3.0. I >> recommend folks use 4 tokens for new clusters, >> > > I wonder why not setting it to a

Re: Using CDC Feature to Stream C* to Kafka (Design Proposal)

2018-09-09 Thread Jonathan Haddad
I'll be honest, I'm having a hard time wrapping my head around an architecture where you use CDC to push data into Kafka. I've worked on plenty of systems that use Kafka as a means of communication, and one of the consumers is a process that stores data in Cassandra. That's pretty normal. Sendin

Re: [EXTERNAL] Re: adding multiple node to a cluster, cleanup and num_tokens

2018-09-09 Thread Jonathan Haddad
DSE calculates which replication factor to use for their > token allocation logic, maybe they guess or take the highest or something. > Cassandra doesn’t - we require you to be explicit, but we could probably do > better here. > > > > On Sep 8, 2018, at 8:17 AM, Oleksandr Shulgin

Re: Large partitions

2018-09-13 Thread Jonathan Haddad
It depends on a number of factors, such as compaction strategy and read patterns. I recommend sticking to the 100MB per partition limit (and I aim for significantly less than that). If you're doing time series with TWCS & TTL'ed data and small enough windows, and you're only querying for a small

Re: SNAPSHOT builds?

2018-09-29 Thread Jonathan Haddad
Hey James, you’ll have to build it. Java 11 is out but the build instructions still apply: On Sat, Sep 29, 2018 at 7:01 AM James Carman wrote: > I am trying to find 4.x SNAPSHOT builds. Are they available anywhere > handy? I'm trying to w

Re: openjdk for cassandra production cluster

2018-10-10 Thread Jonathan Haddad
The warning should be removed (if it hasn’t already), it’s unnecessary at this point On Wed, Oct 10, 2018 at 7:41 AM Prachi Rath wrote: > HI users, > I have created a cassandra cluster with openjdk 1.8.0_181 > version.(cassandra 2.1.17) > started each node, cluster looks healthy,but in the log

Re: TWCS: Repair create new buckets with old data

2018-10-24 Thread Jonathan Haddad
Hey Meg, a couple thoughts. > Set a table level TTL with TWCS, and stop setting it with inserts/updates (insert TTL overrides table level TTL). So, that your entire sstable expires at the same time, as opposed to each insert expiring at its own pace. So that for tombstone clean up, the system ca

Re: Cassandra running Multiple JVM's

2018-10-24 Thread Jonathan Haddad
Another issue you'll need to consider is how the JVM allocates resources towards GC, especially if you're using G1 with a pause time goal. Specifically, if you let it pick its own numbers for ParallelGCThreads & ConcGCThreads they'll be based on the total number of CPUs, not the number you've rest

Re: Best compaction strategy

2018-10-25 Thread Jonathan Haddad
To add to what Alex suggested, if you know what keys use what TTL you could store them in different tables, with different window settings. Jon On Fri, Oct 26, 2018 at 1:28 AM Alexander Dejanovski wrote: > Hi Raman, > > TWCS is the best compaction strategy for TTL data, even if you have > diffe

Re: Cassandra | Cross Data Centre Replication Status

2018-10-30 Thread Jonathan Haddad
You need to run "nodetool rebuild -- " on each node in the new DC to get the old data to replicate. It doesn't do it automatically because Cassandra has no way of knowing if you're done adding nodes and if it were to migrate automatically, it could cause a lot of problems. Imagine streaming 100 no

Re: [ANNOUNCE] StratIO's Lucene plugin fork

2018-10-30 Thread Jonathan Haddad
Very cool Ben, thanks for sharing! On Tue, Oct 30, 2018 at 6:14 PM Ben Slater wrote: > For anyone who is interested, we’ve published a blog with some more > background on this and some more detail of our ongoing plans: > > >

Re: data modeling appointment scheduling

2018-11-04 Thread Jonathan Haddad
Maybe I’m missing something, but it seems to me that the bucket might be a little overkill for a scheduling system. Do you expect people to have millions of appointments? On Sun, Nov 4, 2018 at 12:46 PM I PVP wrote: > Could you please provide advice on the modeling approach for the following >

Re: data modeling appointment scheduling

2018-11-04 Thread Jonathan Haddad
ointment gets rescheduled ? > > thanks. > > IPVP > > On November 4, 2018 at 7:25:05 PM, Jonathan Haddad ( > wrote: > > Maybe I’m missing something, but it seems to me that the bucket might be a > little overkill for a scheduling system. Do you expect

Re: Multiple cluster for a single application

2018-11-07 Thread Jonathan Haddad
Interesting approach Eric, thanks for sharing that. Regarding this: > I've read documents recommended to use clusters with less than 50 or 100 nodes (Netflix got hundreds of clusters with less 100 nodes on each). Not sure where you read that, but it's nonsense. We work with quite a few clusters

Re: [EXTERNAL] Is Apache Cassandra supports Data at rest

2018-11-14 Thread Jonathan Haddad
Just because Cassandra doesn't do it doesn't mean you aren't able to encrypt your data at rest, and you definitely don't need DSE to do it. I recommend checking out the LUKS project. This, IMO, is a better option than having the data

Re: system_auth keyspace replication factor

2018-11-23 Thread Jonathan Haddad
Any chance you’re logging in with the Cassandra user? It uses quorum reads. On Fri, Nov 23, 2018 at 11:38 AM Vitali Dyachuk wrote: > Hi, > We have recently met a problem when we added 60 nodes in 1 region to the > cluster > and set an RF=60 for the system_auth ks, following this documentation >

Re: multiple node bootstrapping

2018-11-28 Thread Jonathan Haddad
Agree with Jeff here, using auto_bootstrap:false is probably not what you want. Have you increased your streaming throughput? Upgrading to 3.11 might reduce the time by quite a bit: You'd be doing committers a huge favor if you grabbed some hi

Re: upgrade Apache Cassandra 2.1.9 to 3.0.9

2018-12-01 Thread Jonathan Haddad
Dmitry is right. Generally speaking always go with the latest bug fix release. On Sat, Dec 1, 2018 at 10:14 AM Dmitry Saprykin wrote: > See more here > > > On Sat, Dec 1, 2018 at 1:02 PM Dmitry Saprykin > wrote: > >> Ev

Re: Sporadic high IO bandwidth and Linux OOM killer

2018-12-05 Thread Jonathan Haddad
Seeing high kswapd usage means there's a lot of churn in the page cache. It doesn't mean you're using swap, it means the box is spending time clearing pages out of the page cache to make room for the stuff you're reading now. The machines don't have enough memory - they are way undersized for a pr

Re: Cassandra Integrated Auth for JMX

2018-12-16 Thread Jonathan Haddad
Jolokia is running as an agent, which means it runs in process and has access to everything within the JVM. JMX credentials are supplied to the JMX server, which Jolokia is bypassing. You'll need to read up on Jolokia's security if you want to keep using it:

Re: Sub range repair

2019-01-01 Thread Jonathan Haddad
We (The Last Pickle) maintain an open source tool for dealing with this: On Tue, Jan 1, 2019 at 12:31 PM Rahul Reddy wrote: > Hello, > > Is it possible to find subrange needed for repair in Apache Cassandra like > dse which uses dsetool list_subranges like below doc >

Re: [EXTERNAL] Howto avoid tombstones when inserting NULL values

2019-01-04 Thread Jonathan Haddad
If you're overwriting values, it really doesn't matter much if it's a tombstone or any other value, they still need to be compacted and have the same overhead at read time. Tombstones are problematic when you try to use Cassandra as a queue (or something like a queue) and you need to scan over tho

Re: [EXTERNAL] Howto avoid tombstones when inserting NULL values

2019-01-04 Thread Jonathan Haddad
plicas also know about the deleted rows. With workloads > that generate a lot of tombstones, this can cause performance problems and > even exhaust the server heap. "* > > Regards, > Tomas > > On Fri, 4 Jan 2019, 7:06 pm Jonathan Haddad >> If you're overwriting v

Re: Good way of configuring Apache spark with Apache Cassandra

2019-01-04 Thread Jonathan Haddad
If you absolutely have to use Cassandra as the source of your data, I agree with Dor. That being said, if you're going to be doing a lot of analytics, I recommend using something other than Cassandra with Spark. The performance isn't particularly wonderful and you'll likely get anywhere from 10-5

Re: SSTableMetadata Util

2019-01-07 Thread Jonathan Haddad
Try installing the cassandra-tools package. On Mon, Jan 7, 2019 at 1:20 AM Igor Zubchenok wrote: > Same issue with 3.11.3: > > # find / -name sstable* > /usr/bin/sstableverify > /usr/bin/sstableupgrade > /usr/bin/sstableloader > /usr/bin/sstableutil > /usr/bin/sstablescrub > > only these sstable

Re: How seed nodes are working and how to upgrade/replace them?

2019-01-08 Thread Jonathan Haddad
I've done some gossip simulations in the past and found virtually no difference in the time it takes for messages to propagate in almost any sized cluster. IIRC it always converges by 17 iterations. Thus, I completely agree with Jeff's comment here. If you aren't pushing 800-1000 nodes, it's not
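
A toy simulation in the spirit of the one described above, as a rough sketch only. It assumes a simplified push-only model in which every informed node tells one random node per round; it is not Cassandra's Gossiper and says nothing about real timings.

```python
import random

def rounds_to_converge(num_nodes, seed=0):
    """Count rounds until every node has heard a rumor that starts at node 0."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    informed = {0}
    rounds = 0
    while len(informed) < num_nodes:
        # Each informed node pushes the rumor to one randomly chosen node.
        targets = {rng.randrange(num_nodes) for _ in informed}
        informed |= targets
        rounds += 1
    return rounds

for size in (10, 100, 1000):
    print(size, rounds_to_converge(size))
```

Because the informed set can roughly double each round, the round count grows with the logarithm of the cluster size, which is why going from tens to hundreds of nodes changes it so little.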

Re: Cassandra and Apache Arrow

2019-01-09 Thread Jonathan Haddad
Where are you seeing that it works with Cassandra? There's no mention of it under, and on the homepage it only says that a Cassandra developer worked on it. We (unfortunately) don't do anything with it at the moment. On Wed, Jan 9, 2019 at 3:24 PM Tomas

Re: Cassandra and Apache Arrow

2019-01-09 Thread Jonathan Haddad
There is a diagram on the homepage displaying Cassandra (with other > storages) as source of data. > > > Which made me think there should be some integration... > > On Thu, 10 Jan 2019, 12:38 am Jonathan Haddad >> Where are you seeing

Re: [EXTERNAL] Howto avoid tombstones when inserting NULL values

2019-01-09 Thread Jonathan Haddad
Lastname > - new Firstname, old Lastname > > having updates on columns atomically guarantees you to have new Firstname, > new Lastname > > On Fri, Jan 4, 2019 at 8:17 PM Jonathan Haddad wrote: > >> Those are two different cases though. It *sounds like* (again, I may be

Re: Released an ACID-compliant transaction library on top of Cassandra

2019-01-16 Thread Jonathan Haddad
Sounds a bit like RAMP: On Wed, Jan 16, 2019 at 12:51 PM Carl Mueller wrote: > "2) Overview: In essence, the protocol calls for each data item to > maintain the last committed and perhaps also the currently active version, > for the data and r

Re: Datastax Java Driver compatibility

2019-01-22 Thread Jonathan Haddad
The drivers are not maintained by the Cassandra project, it's up to each driver maintainer to list their compatibility. On Tue, Jan 22, 2019 at 10:48 AM Jai Bheemsen Rao Dhanwada wrote: > Thanks for the response Amanda, > > Yes we can go with the latest version but we are

Re: High CPU usage on reading single row with Set column with short TTL

2019-01-28 Thread Jonathan Haddad
Your fastest route might be to run a profiler on Cassandra and get some flame graphs. I'm a fan of the async-profiler: Joey Lynch did a nice write up in the documentation on a different process, which I haven't used yet: http://cassandra.apa

Re: datamodelling

2019-02-05 Thread Jonathan Haddad
We (The Last Pickle) wrote a blog post on scaling time series: Rather than an agent_type, you can use an application-determined bucket, so that agents with more data use more buckets. That'll keep your partition
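
One way an application-determined bucket could look, as a sketch under stated assumptions: the partition key is taken to be (agent_id, bucket), the per-agent bucket counts live in a hypothetical application-side dict, and CRC32 is just a convenient stable hash. None of these names come from the thread or the blog post.

```python
import zlib

def bucket_for(event_id, num_buckets):
    """Stable hash of the event id, folded into the agent's bucket count."""
    return zlib.crc32(event_id.encode("utf-8")) % num_buckets

def partition_key(agent_id, event_id, buckets_per_agent, default_buckets=1):
    """(agent_id, bucket) pair to write to; busy agents spread over more buckets."""
    n = buckets_per_agent.get(agent_id, default_buckets)
    return (agent_id, bucket_for(event_id, n))

def partitions_to_read(agent_id, buckets_per_agent, default_buckets=1):
    """All partitions a reader must query to see everything for one agent."""
    n = buckets_per_agent.get(agent_id, default_buckets)
    return [(agent_id, b) for b in range(n)]

# Hypothetical configuration: only the busy agent gets extra buckets.
buckets_per_agent = {"busy-agent": 8}
key = partition_key("busy-agent", "event-42", buckets_per_agent)
```

The trade-off is on the read side: an agent with eight buckets needs eight partition reads, so the bucket count is worth keeping as small as the data volume allows.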

Re: Max number of windows when using TWCS

2019-02-11 Thread Jonathan Haddad
Deleting SSTables manually can be useful if you don't know your TTL up front. For example, you have an ETL process that moves your raw Cassandra data into S3 as parquet files, and you want to be sure that process is completed before you delete the data. You could also start out without setting a

Re: Usage of allocate_tokens_for_keyspace for a new cluster

2019-02-14 Thread Jonathan Haddad
Create the first node, setting the tokens manually. Create the keyspace. Add the rest of the nodes with the allocate tokens uncommented. On Thu, Feb 14, 2019 at 11:43 AM DuyHai Doan wrote: > Hello users > > By looking at the mailing list archive, there was already some questions > about the flag
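
For the first step, a sketch of computing evenly spaced tokens for the first node. It assumes the Murmur3Partitioner token range of -2^63 to 2^63-1 and 4 tokens per node; `evenly_spaced_tokens` is a hypothetical helper, and the output is meant to be pasted into `initial_token` as a comma-separated list.

```python
def evenly_spaced_tokens(num_tokens):
    """Split the assumed Murmur3 ring into num_tokens equal slices."""
    ring_min = -(2 ** 63)
    ring_size = 2 ** 64
    step = ring_size // num_tokens
    return [ring_min + step * i for i in range(num_tokens)]

tokens = evenly_spaced_tokens(4)
# Comma-separated form for the first node's initial_token setting.
initial_token = ",".join(str(t) for t in tokens)
print(initial_token)
```

The remaining nodes would then leave `initial_token` unset and rely on the allocation setting mentioned above.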

Reaper 1.4 released

2019-02-15 Thread Jonathan Haddad
Hey folks, I'm happy to share we (The Last Pickle) have just released version 1.4 of Reaper. For those of you who aren't aware of the project, it's an open source tool for managing sub-range repairs, originally created by Spotify, which we picked up and adopted about two years ago. There's a blo

Re: [EXTERNAL] RE: SASI queries- cqlsh vs java driver

2019-02-27 Thread Jonathan Haddad
If the goal is arbitrary queries, I'd avoid Cassandra altogether. Don't use DSE Search or Elassandra, they're two solutions designed to solve problems that are Cassandra first, search second. I'd go straight to Elasticsearch for workloads that are primarily search driven, like you listed above.

Re: Maximum memory usage reached

2019-03-06 Thread Jonathan Haddad
That’s not an error. To the left of the log message is the severity, level INFO. Generally, I don’t recommend running Cassandra on only 2GB ram or for small datasets that can easily fit in memory. Is there a reason why you’re picking Cassandra for this dataset? On Thu, Mar 7, 2019 at 8:04 AM Kyry

Re: cassandra upgrades multi-DC in parallel

2019-03-12 Thread Jonathan Haddad
Nothing prevents it technically, but operationally you might not want to. Personally I’d prefer to have the safety net of a dc to fall back on in case there’s an issue with the upgrade. On Wed, Mar 13, 2019 at 7:48 AM Carl Mueller wrote: > If there are multiple DCs in a cluster, is it safe to upgra

Re: To Repair or Not to Repair

2019-03-14 Thread Jonathan Haddad
My coworker Alex (from The Last Pickle) wrote an in-depth blog post on TWCS. We recommend not running repair on tables that use TWCS. It's enough of a problem that we added a feature into Reaper to auto-blacklist TWCS / DTCS tables from be

Re: good monitoring tool for cassandra

2019-03-14 Thread Jonathan Haddad
I've worked with several teams using DataDog, folks are pretty happy with it. We (The Last Pickle) did the dashboards for them: Prometheus + Grafana is great if you want to host it yourself. On Fri, Mar 15, 2019 at 12:45 PM Jef

Re: Upgrading to SSD

2016-04-23 Thread Jonathan Haddad
You could do the following instead to minimize server downtime: 1. rsync while the server is running 2. rsync again to get any new files 3. shut server down 4. rsync for the 3rd time 5. change directory in yaml and start back up On Sat, Apr 23, 2016 at 12:23 PM Clint Martin < clintlmar...@coolf

Re: In memory code and query executions

2016-05-04 Thread Jonathan Haddad
Agreed with Nate. This is generally one of those "if you have to ask how it's done, you shouldn't be doing it" ideas. To add to his points above, deploying new versions of your app with this model is an operational nightmare. Now you've tightly coupled new versions of your app to doing a full clu

Re: SS Tables Files Streaming

2016-05-06 Thread Jonathan Haddad
Repairs, bootstrap, decommission. On Fri, May 6, 2016 at 1:16 PM Anubhav Kale wrote: > Hello, > > > > In what scenarios can SS Table files on disk from Node 1 go to Node 2 as > is ? I’m aware this happens in *nodetool rebuild* and I am assuming this > does *not* happen in repairs. Can someone c

Re: Setting bloom_filter_fp_chance < 0.01

2016-05-18 Thread Jonathan Haddad
The impact is it'll get massively bigger with very little performance benefit, if any. You can't get 0 because it's a probabilistic data structure. It tells you either: your data is definitely not here your data has a pretty decent chance of being here but never "it's here for sure" https://en
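
To put rough numbers on "massively bigger", a sketch using the textbook sizing formula for an optimally configured Bloom filter, bits per key = -ln(p) / (ln 2)^2. Cassandra's own sizing may differ in detail, so treat the figures as an approximation of the trend rather than of actual on-disk or in-memory size.

```python
import math

def bits_per_key(fp_chance):
    """Textbook bits-per-key for an optimal Bloom filter at the given false-positive rate."""
    return -math.log(fp_chance) / (math.log(2) ** 2)

for p in (0.1, 0.01, 0.001, 0.0001):
    print(p, round(bits_per_key(p), 2))
```

Each tenfold reduction in the false-positive chance adds about 4.8 bits per key, and the cost grows without bound as the target approaches 0, which is why 0 itself is not reachable.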

Re: Autobootstrap in Cassandra

2016-05-23 Thread Jonathan Haddad
find / -name 'cassandra.yaml' -exec grep -nH auto_bootstrap {} \; On Mon, May 23, 2016 at 3:44 PM Rajath Subramanyam wrote: > Hi Cassandra users, > > Is there a way to find if auto_bootstrap is set to false on a Cassandra > node if we didn't know the location of the cassandra.yaml or the cassan

Re: Cassandra monitoring

2016-06-14 Thread Jonathan Haddad
Depends what you want to monitor. I wouldn't use a lesser version of Cassandra for OpsCenter, it doesn't give you a ton you can't get elsewhere and it's not ever going to support OSS > 2.1, so you kind of limit yourself to a pretty old version of Cassandra for a non-good reason. What else do you

Re: Cassandra monitoring

2016-06-14 Thread Jonathan Haddad
the cheaper side. > > > > > On Tue, Jun 14, 2016 at 12:20 PM, Jonathan Haddad > wrote: > >> Depends what you want to monitor. I wouldn't use a lesser version of >> Cassandra for OpsCenter, it doesn't give you a ton you can't get elsewhere >> an

Re: Spark Cassandra Python Connector

2016-06-20 Thread Jonathan Haddad
I wouldn't recommend the TargetHolding lib. It's only useful for working with RDDs which are a terrible idea in Python, as the perf will make you cry with any reasonable sized dataset. The Datastax spark Cassandra connector works with Python + Dataframes without the crazy overhead of RDDs. Docs

Re: Question about hector api documentation

2016-06-24 Thread Jonathan Haddad
+1, do not use Hector. It hasn't had a commit in years and uses the thrift protocol which is now marked deprecated. The DataStax Java driver is recommended, possibly with Achilles to make things a bit nicer. On Thu, Jun 23, 2016 at 9:20 PM Noorul Islam K M wrote: > > The very first line README te

Re: Is my cluster normal?

2016-07-07 Thread Jonathan Haddad
What's your CPU looking like? If it's low, check your IO with iostat or dstat. I know some people have used EBS and say it's fine but I've been burned too many times. On Thu, Jul 7, 2016 at 6:12 PM Yuan Fang wrote: > Hi Riccardo, > > Very low IO-wait. About 0.3%. > No stolen CPU. It is a casssandr

Re: Is my cluster normal?

2016-07-12 Thread Jonathan Haddad
> are really inefficient for small reads served off of disk. If you drop the >> compression chunk size (4k, for example), you’ll probably see your read >> throughput increase significantly, which will give you more iops for >> commitlog, so write throughput likely goes up, too. >>

Re: Is my cluster normal?

2016-07-12 Thread Jonathan Haddad
When you have high system load it means your CPU is waiting for *something*, and in my experience it's usually slow disk. A disk connected over network has been a culprit for me many times. On Tue, Jul 12, 2016 at 12:33 PM Jonathan Haddad wrote: > Can you do: > > iostat -dmx 2

Re: Are counters faster than CAS or vice versa?

2016-07-20 Thread Jonathan Haddad
Just to make sure I understand, you've got a queue where you can stand missing processing the items in it? On Wed, Jul 20, 2016 at 1:13 PM Kevin Burton wrote: > On Wed, Jul 20, 2016 at 11:53 AM, Jeff Jirsa > wrote: > >> Can you tolerate the value being “close, but not perfectly accurate”? If >>

Re: Cassandra stable version for production

2016-07-20 Thread Jonathan Haddad
If I were starting a new project today, I'd go with 3.0. It's gotten over half a year of bug fixes. I personally would have a hard time putting a tick tock release into production, as you're either getting new features or putting a version in which won't receive any further bug fixes, so unless y

Re: Re : Purging tombstones from a particular row in SSTable

2016-07-29 Thread Jonathan Haddad
I think Aaron Morton may have talked about a couple optimizations that went into 3.0 for these use cases. I don't have a link handy (on my phone, typing quickly) but I think it's on The Last Pickle blog. On Fri, Jul 29, 2016 at 1:37 PM DuyHai Doan wrote: > @Eric > > Very interesting example. But

Re: [Marketing Mail] Re: Memory leak and lockup on our 2.2.7 Cassandra cluster.

2016-08-03 Thread Jonathan Haddad
Kevin, "Our scheme uses large buckets of content where we write to a bucket/partition for 5 minutes, then move to a new one." Are you writing to a single partition and only that partition for 5 minutes? If so, you should really rethink your data model. This method does not scale as you add node

Re: Mutation of X bytes is too large for the maximum size of Y

2016-08-03 Thread Jonathan Haddad
I haven't verified, so I'm not 100% certain, but I believe you'd get back an exception to the client. Yes, this belongs in the DB, but I don't think you're totally blind to what went wrong. My guess is this exception in the Python driver (but other drivers should have a similar exception): https:

Re: [Marketing Mail] Re: Memory leak and lockup on our 2.2.7 Cassandra cluster.

2016-08-04 Thread Jonathan Haddad
In the future you may find SASI indexes useful for indexing Cassandra data. Shameless blog post plug: Deep technical dive: On Thu, Aug 4, 2016 at 11:45 AM Kevin Burton wrote: > BTW. we

Re: Merging cells in compaction / compression?

2016-08-05 Thread Jonathan Haddad
Hadoop and Cassandra have very different use cases. If the ability to write a custom compression system is the primary factor in how you choose your database I suspect you may run into some trouble. Jon On Fri, Aug 5, 2016 at 6:14 AM Michael Burman wrote: > Hi, > > As Spark is an example of so
