These numbers don't match what I'd expect from e.g. AWS, so I'm guessing you are using local storage?
Making a billion dollar startup is easy: "take a human desire, preferably one that has been around for a really long time … Identify that desire and use modern technology to take out steps."

Daemeon C.M. Reiydelle
USA (+1) 415.501.0198
London (+44) (0) 20 8144 9872

On Fri, May 5, 2017 at 12:19 PM, Jonathan Guberman <j...@tineye.com> wrote:

> Hello,
>
> We’re currently testing Cassandra for use as a pure key-object store for
> data blobs around 10 kB - 60 kB each. Our use case is storing on the order
> of 10 billion objects with about 5-20 million new writes per day. A written
> object will never be updated or deleted. Objects will be read at least
> once, some time within 10 days of being written. This will generally happen
> as a batch; that is, all of the images written on a particular day will be
> read together at the same time. This batch read will only happen one time;
> future reads will happen on individual objects, with no grouping, and they
> will follow a long-tail distribution, with popular objects read thousands
> of times per year but most read never or virtually never.
>
> I’ve set up a small four-node test cluster and have written test scripts
> to benchmark writing and reading our data. The table I’ve set up is very
> simple: an ascii primary key column with the object ID and a blob column
> for the data. All other settings were left at their defaults.
>
> I’ve found write speeds to be very fast most of the time. However,
> periodically, writes will slow to a crawl for anywhere between half an hour
> and two hours, after which speeds recover to their previous levels. I
> assume this is some sort of compaction or flushing to disk, but I haven’t
> been able to figure out the exact cause.
>
> Read speeds have been more disappointing. Cached reads are very fast, but
> random read speed averages about 2 MB/sec, which is too slow when we need
> to read out a batch of several million objects. I don’t think it’s
> reasonable to assume that these rows will all still be cached by the time
> we need to read them for that first large batch read.
>
> My general question is whether anyone has any suggestions for how to
> improve performance for our use case. More specifically:
>
> - Is there a way to mitigate or eliminate the huge slowdowns I see when
>   writing millions of rows?
> - Are there settings I should be using in order to maximize read speeds
>   for random reads?
> - Is there a way to design our tables to improve the read speeds for the
>   initial large batched reads? I was thinking of using a batch ID column
>   that could be used to retrieve the data for the initial block. However,
>   future reads would need to be done by the object ID, not the batch ID,
>   so it seems I’d need to duplicate the data, keeping one copy in an
>   “objects by batch” table and the other in a simple “objects” table. Is
>   there a better approach than this?
>
> Thank you!
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: user-h...@cassandra.apache.org
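
For reference, the simple table described above would presumably look
something like the following CQL (the table and column names are my own
illustrative guesses, not taken from the original setup):

    CREATE TABLE objects (
        object_id ascii PRIMARY KEY,   -- the object ID described above
        data      blob                 -- the 10 kB - 60 kB payload
    );

With this layout every read is a single-partition point lookup by
object_id, which suits the long-tail individual reads but gives Cassandra
no way to serve a whole day's objects as a sequential scan.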
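On the last question, a minimal sketch of the duplicated-table layout being
proposed might look like this, assuming daily batches keyed by date; the
objects_by_batch name and the bucket column are illustrative assumptions,
not anything from the original message:

    -- Hypothetical "objects by batch" table: the batch date plus a
    -- sub-bucket form the partition key, so one day's objects can be
    -- fetched as a handful of partition scans rather than millions of
    -- point reads.
    CREATE TABLE objects_by_batch (
        batch_id  date,    -- e.g. the day the objects were written
        bucket    int,     -- illustrative sub-bucket to bound partition size
        object_id ascii,
        data      blob,
        PRIMARY KEY ((batch_id, bucket), object_id)
    );

    -- The one-time batch read, one bucket at a time:
    SELECT object_id, data
      FROM objects_by_batch
     WHERE batch_id = '2017-05-05' AND bucket = 0;

The sub-bucket matters: at 5-20 million objects of 10-60 kB per day, a
single per-day partition would run to hundreds of gigabytes, far beyond
the partition sizes Cassandra handles comfortably. The long-tail
per-object reads would still go to the plain objects table above, so the
data is indeed stored twice, as the original message anticipated.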