Bucket size is not disclosed. My recommendation is that partitions be no
more than about 10 MB (some people say 50 MB or even 100 MB).
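To make that concrete: a partition in this scheme is one bucket, so its
size is just bucketSize * chunkSize. A quick back-of-the-envelope sketch
(the names and values are assumptions based on the POC config, not figures
given in this thread):

// A partition holds one bucket of chunks, so
// partitionBytes = bucketSize * chunkSize.
val chunkSize = 1 << 20                      // 1 MB, the POC's current chunk size
val targetPartitionBytes = 10L << 20         // the ~10 MB partition ceiling above
val maxBucketSize = (targetPartitionBytes / chunkSize).toInt // = 10 chunks per bucket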
I think I'd recommend a smaller chunk size, like 128K or 256K. I would
note that Mongo's GridFS uses 256K chunks. I don't know enough about the
finer nuances of Cassandra internal row management to know whether your
chunks should be a little less than some power of 2, so that a single row
is not just over a power of 2 in size. You may need more heap as well.
Maybe you are hitting a high rate of GC that causes the timeouts.

-- Jack Krupansky

On Mon, Feb 8, 2016 at 7:46 PM, Giampaolo Trapasso <[email protected]> wrote:

> Sorry Jack for my poor description.
> I write the same array of 1 MB of bytes 600 times to make my life
> easier. This allows me to simulate a 600 MB file; it's just a
> simplification. Instead of generating a 600 MB random array (or reading
> a real 600 MB file) and dividing it into 600 chunks, I write the same
> random array 600 times. Every chunk corresponds to the data field in
> the table. I realize that the blob parameter of the write method can
> lead to confusion (going to update it on github at least).
>
> I think that the content of the file is not important for the test
> itself; I just need 1 MB of data to be written. Let me know if there
> are some other unclear spots.
>
> giampaolo
>
> 2016-02-09 1:28 GMT+01:00 Jack Krupansky <[email protected]>:
>
>> I'm a little lost now. Where are you specifying chunk size, which is
>> what should be varying, as opposed to blob size? And what exactly is
>> the number of records? It seems like you should be computing the
>> number of chunks from blob size divided by chunk size. And it still
>> seems like you are writing the same data for each chunk.
>>
>> -- Jack Krupansky
>>
>> On Mon, Feb 8, 2016 at 5:34 PM, Giampaolo Trapasso <
>> [email protected]> wrote:
>>
>>> At every step I write MyConfig.blobSize bytes, which I configured to
>>> range from 100000 to 1000000. This allows me to "simulate" the
>>> writing of a 600 MB file, as per the configuration on github (
>>> https://github.com/giampaolotrapasso/cassandratest/blob/master/src/main/resources/application.conf
>>> )
>>>
>>> Giampaolo
>>>
>>> 2016-02-08 23:25 GMT+01:00 Jack Krupansky <[email protected]>:
>>>
>>>> You appear to be writing the entire blob on each chunk rather than
>>>> the slice of the blob.
>>>>
>>>> -- Jack Krupansky
>>>>
>>>> On Mon, Feb 8, 2016 at 1:45 PM, Giampaolo Trapasso <
>>>> [email protected]> wrote:
>>>>
>>>>> Hi to all,
>>>>>
>>>>> I'm trying to put a large binary file (> 500 MB) on a C* cluster
>>>>> as fast as I can, but I get some (many) WriteTimeoutExceptions.
>>>>>
>>>>> I created a small POC that isolates the problem I'm facing. Here
>>>>> you will find the code:
>>>>> https://github.com/giampaolotrapasso/cassandratest
>>>>>
>>>>> *Main details about it:*
>>>>>
>>>>> - I try to write the file in chunks (the *data* field) of <= 1 MB
>>>>> (1 MB is the recommended max size for a single cell),
>>>>> - Chunks are grouped into buckets. Every bucket is a partition
>>>>> row,
>>>>> - Buckets are grouped by UUID.
>>>>> - Chunk size and bucket size are configurable from the app, so I
>>>>> can try different configurations and see what happens.
>>>>> - Trying to maximize throughput, I execute async insertions;
>>>>> however, to avoid too much pressure on the db, after a threshold
>>>>> I wait for at least one insert to finish before adding another
>>>>> (this part is quite raw in my code, but I think it's not so
>>>>> important; a sketch of the idea follows this list). This
>>>>> parameter is also configurable, to test different combinations.
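>>>>>
>>>>> A minimal sketch of that throttling idea, assuming the DataStax
>>>>> Java driver's executeAsync and a plain java.util.concurrent.Semaphore
>>>>> (the actual code in the POC is rougher and may differ):
>>>>>
>>>>> import java.util.concurrent.Semaphore
>>>>> import com.datastax.driver.core.{BoundStatement, ResultSet, ResultSetFuture, Session}
>>>>> import com.google.common.util.concurrent.{FutureCallback, Futures}
>>>>>
>>>>> // Allow at most maxConcurrentWrites in-flight inserts: take a permit
>>>>> // before submitting, give it back when the write succeeds or fails.
>>>>> val maxConcurrentWrites = 128   // assumed value (maxFutures in the config)
>>>>> val inFlight = new Semaphore(maxConcurrentWrites)
>>>>>
>>>>> def throttledWrite(session: Session, bound: BoundStatement): Unit = {
>>>>>   inFlight.acquire()  // blocks the producer once the limit is reached
>>>>>   val f: ResultSetFuture = session.executeAsync(bound)
>>>>>   Futures.addCallback(f, new FutureCallback[ResultSet] {
>>>>>     override def onSuccess(rs: ResultSet): Unit = inFlight.release()
>>>>>     override def onFailure(t: Throwable): Unit = inFlight.release() // plus log/retry
>>>>>   })
>>>>> }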
>>>>>
>>>>> This is the table on the db:
>>>>>
>>>>> CREATE TABLE blobtest.store (
>>>>>     uuid uuid,
>>>>>     bucket bigint,
>>>>>     start bigint,
>>>>>     data blob,
>>>>>     end bigint,
>>>>>     PRIMARY KEY ((uuid, bucket), start)
>>>>> )
>>>>>
>>>>> and this is the main code (Scala, but I hope it is generally
>>>>> readable):
>>>>>
>>>>> val statement = client.session.prepare(
>>>>>   "INSERT INTO blobTest.store(uuid, bucket, start, end, data) " +
>>>>>   "VALUES (?, ?, ?, ?, ?) if not exists;")
>>>>>
>>>>> val blob = new Array[Byte](MyConfig.blobSize)
>>>>> scala.util.Random.nextBytes(blob)
>>>>>
>>>>> write(client,
>>>>>   numberOfRecords = MyConfig.recordNumber,
>>>>>   bucketSize = MyConfig.bucketSize,
>>>>>   maxConcurrentWrites = MyConfig.maxFutures,
>>>>>   blob = blob,
>>>>>   statement = statement)
>>>>>
>>>>> where write is
>>>>>
>>>>> def write(database: Database, numberOfRecords: Int, bucketSize: Int,
>>>>>     maxConcurrentWrites: Int, blob: Array[Byte],
>>>>>     statement: PreparedStatement): Unit = {
>>>>>
>>>>>   val uuid: UUID = UUID.randomUUID()
>>>>>   var count = 0
>>>>>
>>>>>   // Javish loop
>>>>>   while (count < numberOfRecords) {
>>>>>     val record = Record(
>>>>>       uuid = uuid,
>>>>>       bucket = count / bucketSize,
>>>>>       start = (count % bucketSize) * blob.length,
>>>>>       end = ((count % bucketSize) + 1) * blob.length,
>>>>>       bytes = blob
>>>>>     )
>>>>>     asynchWrite(database, maxConcurrentWrites, statement, record)
>>>>>     count += 1
>>>>>   }
>>>>>
>>>>>   waitDbWrites()
>>>>> }
>>>>>
>>>>> and asynchWrite just binds the record to the statement.
>>>>>
>>>>> *Problem*
>>>>>
>>>>> The problem is that when I try to increase the chunk size, the
>>>>> number of async inserts, or the size of the bucket (i.e. the number
>>>>> of chunks per partition), the app becomes unstable, since the db
>>>>> starts throwing WriteTimeoutExceptions.
>>>>>
>>>>> I've tested the stuff on CCM (4 nodes) and on an EC2 cluster (5
>>>>> nodes, 8 GB heap). The problem seems the same in both environments.
>>>>>
>>>>> On my local cluster, I've tried changing, with respect to the
>>>>> default configuration:
>>>>>
>>>>> concurrent_writes: 128
>>>>>
>>>>> write_request_timeout_in_ms: 200000
>>>>>
>>>>> Other configuration is here:
>>>>> https://gist.github.com/giampaolotrapasso/ca21a83befd339075e07
>>>>>
>>>>> *Other*
>>>>>
>>>>> The exceptions seem random; sometimes they come right at the
>>>>> beginning of the write.
>>>>>
>>>>> *Questions:*
>>>>>
>>>>> 1. Is my model wrong? Am I missing some important detail?
>>>>>
>>>>> 2. What information is important to look at for this kind of
>>>>> problem?
>>>>>
>>>>> 3. Why are the exceptions so random?
>>>>>
>>>>> 4. Is there some other C* parameter I can set to ensure that
>>>>> WriteTimeoutException does not occur?
>>>>>
>>>>> I hope I provided enough information to get some help.
>>>>>
>>>>> Thank you in advance for any reply.
>>>>>
>>>>> Giampaolo
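For reference, the slicing Jack suggests upthread would bind only the
chunk's byte range to the data column instead of the whole array; a
minimal sketch (chunkOf is a hypothetical helper, not in the POC, and note
that in the POC start/end are offsets within a bucket rather than within
the whole file):

import java.util.Arrays

// Hypothetical helper: pass only the [start, end) slice of the file as
// the chunk payload, instead of the whole array for every insert.
def chunkOf(file: Array[Byte], start: Long, end: Long): Array[Byte] =
  Arrays.copyOfRange(file, start.toInt, math.min(end, file.length.toLong).toInt)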
