I'm a little lost now. Where are you specifying the chunk size, which is what
should be varying, as opposed to the blob size? And what exactly is the number
of records? It seems like you should be computing the number of chunks as the
blob size divided by the chunk size. And it still seems like you are writing
the same data for each chunk.
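
Something along these lines is what I'd expect (an untested sketch, not taken
from your repo; chunkSize and fileBytes are placeholder names):

    import java.util.Arrays

    // The number of chunks falls out of fileBytes.length / chunkSize
    // (rounded up); each chunk is its own slice of the file's bytes.
    def chunks(fileBytes: Array[Byte], chunkSize: Int): Iterator[Array[Byte]] =
      (0 until fileBytes.length by chunkSize).iterator.map { offset =>
        Arrays.copyOfRange(fileBytes, offset,
          math.min(offset + chunkSize, fileBytes.length))
      }

Each insert would then bind its own slice instead of the same blob every time.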

-- Jack Krupansky

On Mon, Feb 8, 2016 at 5:34 PM, Giampaolo Trapasso <
giampaolo.trapa...@radicalbit.io> wrote:

> At every step I write MyConfig.blobSize bytes, which I configured to range
> from 100000 to 1000000. This allows me to "simulate" the writing of a 600MB
> file, as per the configuration on GitHub (
> https://github.com/giampaolotrapasso/cassandratest/blob/master/src/main/resources/application.conf
> ).
>  Giampaolo
>
> 2016-02-08 23:25 GMT+01:00 Jack Krupansky <jack.krupan...@gmail.com>:
>
>> You appear to be writing the entire blob for each chunk rather than that
>> chunk's slice of the blob.
>>
>> -- Jack Krupansky
>>
>> On Mon, Feb 8, 2016 at 1:45 PM, Giampaolo Trapasso <
>> giampaolo.trapa...@radicalbit.io> wrote:
>>
>>> Hi to all,
>>>
>>> I'm trying to put a large binary file (> 500MB) on a C* cluster as fast
>>> as I can but I get some (many) WriteTimeoutExceptions.
>>>
>>> I created a small POC that isolates the problem I'm facing. You will find
>>> the code here: https://github.com/giampaolotrapasso/cassandratest
>>>
>>> *Main details about it:*
>>>
>>>    - I write the file in chunks (the *data* field), each <= 1MB (1MB is
>>>    the recommended max size for a single cell),
>>>
>>>    - Chunks are grouped into buckets. Every bucket is a partition,
>>>    - Buckets are grouped by UUID.
>>>
>>>    - Chunk size and bucket size are configurable from the app, so I can try
>>>    different configurations and see what happens.
>>>
>>>    - Trying to maximize throughput, I execute asynchronous inserts; however,
>>>    to avoid putting too much pressure on the db, once a threshold of
>>>    in-flight inserts is reached I wait for at least one of them to finish
>>>    before adding another (this part is quite raw in my code, but I don't
>>>    think it's so important). This threshold is also configurable, to test
>>>    different combinations; a rough sketch of the throttling pattern follows
>>>    this list.
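>>>
>>> The throttling is just the usual semaphore-style bound on in-flight
>>> futures. A simplified, untested sketch (not the exact code from the repo;
>>> permits and write are placeholder names):
>>>
>>>     import java.util.concurrent.Semaphore
>>>     import scala.concurrent.Future
>>>     import scala.concurrent.ExecutionContext.Implicits.global
>>>
>>>     // permits is created with MyConfig.maxFutures slots: acquire() blocks
>>>     // the producer once that many writes are in flight, and release()
>>>     // frees a slot when a write completes (successfully or not).
>>>     def throttled[T](permits: Semaphore)(write: => Future[T]): Future[T] = {
>>>       permits.acquire()
>>>       val f = write
>>>       f.onComplete(_ => permits.release())
>>>       f
>>>     }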
>>>
>>> This is the table on db:
>>>
>>> CREATE TABLE blobtest.store (
>>>     uuid uuid,
>>>     bucket bigint,
>>>     start bigint,
>>>     data blob,
>>>     end bigint,
>>>     PRIMARY KEY ((uuid, bucket), start)
>>> )
>>>
>>> and this is the main code (Scala, but I hope it's generally readable):
>>>
>>>     val statement = client.session.prepare(
>>>       "INSERT INTO blobTest.store(uuid, bucket, start, end, data) " +
>>>         "VALUES (?, ?, ?, ?, ?) if not exists;")
>>>
>>>     val blob = new Array[Byte](MyConfig.blobSize)
>>>     scala.util.Random.nextBytes(blob)
>>>
>>>     write(client,
>>>       numberOfRecords = MyConfig.recordNumber,
>>>       bucketSize = MyConfig.bucketSize,
>>>       maxConcurrentWrites = MyConfig.maxFutures,
>>>       blob,
>>>       statement)
>>>
>>> where write is
>>>
>>>   def write(database: Database, numberOfRecords: Int, bucketSize: Int,
>>>             maxConcurrentWrites: Int, blob: Array[Byte],
>>>             statement: PreparedStatement): Unit = {
>>>
>>>     val uuid: UUID = UUID.randomUUID()
>>>     var count = 0;
>>>
>>>     //Javish loop
>>>     while (count < numberOfRecords) {
>>>       val record = Record(
>>>         uuid = uuid,
>>>         bucket = count / bucketSize,
>>>         start = ((count % bucketSize)) * blob.length,
>>>         end = ((count % bucketSize) + 1) * blob.length,
>>>         bytes = blob
>>>       )
>>>       asynchWrite(database, maxConcurrentWrites, statement, record)
>>>       count += 1
>>>     }
>>>
>>>     waitDbWrites()
>>>   }
>>>
>>> and asynchWrite just binds the record to the prepared statement and
>>> executes the insert asynchronously.
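>>>
>>> Roughly like this (a simplified sketch rather than the literal code; the
>>> exact bindings and the throttling are in the repo linked above):
>>>
>>>     import java.nio.ByteBuffer
>>>
>>>     // Record, Database and PreparedStatement are the same types used in
>>>     // the write() snippet above; boxing/ByteBuffer details are simplified.
>>>     def asynchWrite(database: Database, maxConcurrentWrites: Int,
>>>                     statement: PreparedStatement, record: Record): Unit = {
>>>       val bound = statement.bind(
>>>         record.uuid,
>>>         java.lang.Long.valueOf(record.bucket),
>>>         java.lang.Long.valueOf(record.start),
>>>         java.lang.Long.valueOf(record.end),
>>>         ByteBuffer.wrap(record.bytes))     // blob column expects a ByteBuffer
>>>       database.session.executeAsync(bound) // in-flight limit handled elsewhere
>>>     }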
>>>
>>> *Problem*
>>>
>>> The problem is that when I try to increase the chunk size, the number of
>>> asynchronous inserts in flight, or the bucket size (i.e. the number of
>>> chunks per bucket), the app becomes unstable because the db starts throwing
>>> WriteTimeoutExceptions.
>>>
>>> I've tested this on CCM (4 nodes) and on an EC2 cluster (5 nodes, 8GB
>>> heap). The problem seems the same in both environments.
>>>
>>> On my local cluster, I've tried changing the following with respect to the
>>> default configuration:
>>>
>>> concurrent_writes: 128
>>>
>>> write_request_timeout_in_ms: 200000
>>>
>>> Other configurations are here:
>>> https://gist.github.com/giampaolotrapasso/ca21a83befd339075e07
>>>
>>> *Other*
>>>
>>> The exceptions seem random; sometimes they occur right at the beginning of
>>> the write.
>>>
>>> *Questions:*
>>>
>>> 1. Is my model wrong? Am I missing some important detail?
>>>
>>> 2. What information is important to look at for this kind of problem?
>>>
>>> 3. Why are the exceptions so random?
>>>
>>> 4. Is there some other C* parameter I can set to ensure that
>>> WriteTimeoutException does not occur?
>>>
>>> I hope I provided enough information to get some help.
>>>
>>> Thank you in advance for any reply.
>>>
>>>
>>> Giampaolo
