Hi,

We have a web crawler project currently based on Cassandra (https://github.com/iParadigms/walker, written in Go and using the gocql driver), with the following relevant usage pattern:
- Big range reads over a CF to grab potentially millions of rows and dispatch new links to crawl
- Fast insert of new links (effectively using Cassandra to deduplicate)

We ultimately planned on doing the batch processing step (the dispatching) in a system like Spark, but for the time being it is also in Go. We believe this should work fine given that Cassandra now properly allows chunked iteration of columns in a CF.

The issue is that periodically, while doing a particularly large range read, other operations time out because that node is "busy". In an experimental cluster with only two nodes (and a replication factor of 2), I'll get an error like "Operation timed out - received only 1 responses.", indicating that the second node took too long to reply. At the moment I have the long range reads set to consistency level ANY, but the rest of the operations are on QUORUM, so on this cluster they require responses from both nodes. The relevant CF is also using LeveledCompactionStrategy. This happens in both Cassandra 2 and 2.1.

Despite this error I don't see any significant I/O, memory consumption, or CPU usage.

Here are some of the configuration values I've played with.

Increasing timeouts:

    read_request_timeout_in_ms: 15000
    range_request_timeout_in_ms: 30000
    write_request_timeout_in_ms: 10000
    request_timeout_in_ms: 10000

Getting rid of caches we don't need:

    key_cache_size_in_mb: 0
    row_cache_size_in_mb: 0

Each of the 2 nodes has one HDD for the commit log and a single HDD that I'm using for data. Hence the following thread config (maybe since I/O is not an issue I should increase these?):

    concurrent_reads: 16
    concurrent_writes: 32
    concurrent_counter_writes: 32

Because I have a large number of columns and am not doing random I/O, I've increased this:

    column_index_size_in_kb: 2048

It's something of a mystery why this error comes up.
Of course, with a 3rd node it will get masked if I am doing QUORUM operations, but it still seems like it should not happen, and that there is some kind of head-of-line blocking or other issue in Cassandra. I would like to increase the amount of dispatching I'm doing, but because of this issue the cluster bogs down if I do.

Any suggestions for other things we can try here would be appreciated.

-dan
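
P.S. To spell out the quorum arithmetic behind the timeout: a QUORUM operation needs floor(RF/2) + 1 replica responses, so with a replication factor of 2 a single slow node is enough to fail it, while with a replication factor of 3 one slow node would go unnoticed. Below is a minimal sketch of that calculation in Go. The function names `quorum` and `slowReplicasTolerated` are made up for illustration and are not part of walker or gocql.

```go
package main

import "fmt"

// quorum returns how many replica responses a QUORUM operation needs
// for a given replication factor: floor(rf/2) + 1.
// (Hypothetical helper for illustration only.)
func quorum(rf int) int {
	return rf/2 + 1
}

// slowReplicasTolerated returns how many replicas may fail to answer in
// time before a QUORUM operation times out.
// (Hypothetical helper for illustration only.)
func slowReplicasTolerated(rf int) int {
	return rf - quorum(rf)
}

func main() {
	// Compare the current two-node setup (RF=2) with a three-replica setup (RF=3).
	for _, rf := range []int{2, 3} {
		fmt.Printf("RF=%d: QUORUM needs %d responses, tolerates %d slow replica(s)\n",
			rf, quorum(rf), slowReplicasTolerated(rf))
	}
}
```

With RF=2 the tolerated count is zero, which matches the "received only 1 responses" error whenever the second node is slow. This only explains why the slowness surfaces as an error, not why the node is slow in the first place.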