Hi Rohit,

Thanks a lot for looking at this. The intention of calculating the data
upfront is to benchmark only the storage time, in records/sec, eliminating
the generation factor (which will be different in the real scenario, reading
from HDFS).
I used a profiler today and indeed it's not the storage part but the
generation that's bloating the memory. Objects in memory take surprisingly
more space than one would expect based on the data they hold. In my case it
was 2.1x the size of the original data.
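
For reference, the kind of check I did looks roughly like this (just a
sketch; it assumes Spark's org.apache.spark.util.SizeEstimator is reachable
from your code, which may depend on the Spark version, and it uses the same
record shape as the gist):

    import org.apache.spark.util.SizeEstimator

    // One record as generated in the gist: five short strings.
    val record = Array("devy1", "aggr", "1000", "sum", (1 to 11).mkString(","))

    // Raw payload: just the characters, as they would sit in a text file.
    val rawBytes = record.map(_.length).sum

    // Estimated heap footprint: outer array + String objects + char[]
    // backing arrays + object headers and padding.
    val heapBytes = SizeEstimator.estimate(record)

    println(s"raw: $rawBytes bytes, on heap: ~$heapBytes bytes, " +
      s"factor: ${heapBytes.toDouble / rawBytes}")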

Now that we are talking about this, do you have any figures on how Calliope
compares, performance-wise, to a classic Cassandra driver (DataStax /
Astyanax)? That would be awesome.

Thanks again!

-kr, Gerard.





On Tue, Jun 17, 2014 at 4:27 PM, tj opensource <opensou...@tuplejump.com>
wrote:

> Dear Gerard,
>
> I just tried the code you posted in the gist (
> https://gist.github.com/maasg/68de6016bffe5e71b78c) and it does give an
> OOM. That is because the data is generated locally and then parallelized -
>
>
> ----------------------------------------------------------------------------------------------------------------------
>
>
>     val entries = for (i <- 1 to total) yield {
>       Array(s"devy$i", "aggr", "1000", "sum", (i to i + 10).mkString(","))
>     }
>
>     val rdd = sc.parallelize(entries, 8)
>
>
>
> ----------------------------------------------------------------------------------------------------------------------
>
>
>
> This will generate all the data on the local system and then try to
> partition it.
>
> Instead, we should parallelize the keys (i <- 1 to total) and generate the
> data in the map tasks. This is *closer* to what you will get if you
> distribute a file out on a DFS like HDFS/SnackFS.
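>
> For example, reading the same records from a file on the DFS would look
> roughly like this (just a sketch - the path and the comma-separated layout
> are assumptions, not something from the gist):
>
>     // Each line holds one record; the split happens inside the tasks,
>     // so the full dataset is never materialized on the driver.
>     val rdd = sc.textFile("hdfs://namenode/path/to/records.csv", 8)
>       .map(_.split(","))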
>
> I have made the change in the script here (
> https://gist.github.com/milliondreams/aac52e08953949057e7d)
>
>
> ----------------------------------------------------------------------------------------------------------------------
>
>
>
>     val rdd = sc.parallelize(1 to total, 8)
>       .map(i => Array(s"devy$i", "aggr", "1000", "sum", (i to i + 10).mkString(",")))
>
>
> ----------------------------------------------------------------------------------------------------------------------
>
>
>
> I was able to insert 50M records using just over 350M RAM. Attaching the
> log and screenshot.
>
> Let me know if you still face this issue... we can do a screen share and
> resolve the issue there.
>
> And thanks for using Calliope. I hope it serves your needs.
>
> Cheers,
> Rohit
>
>
> On Mon, Jun 16, 2014 at 9:57 PM, Gerard Maas <gerard.m...@gmail.com>
> wrote:
>
>> Hi,
>>
>> I've been doing some testing with Calliope as a way to do batch loads from
>> Spark into Cassandra.
>> My initial results are promising on the performance side, but worrisome on
>> the memory footprint side.
>>
>> I'm generating N records of about 50 bytes each and using the UPDATE
>> mutator to insert them into C*. I get an OOM if my memory is below 1 GB per
>> million records, i.e. per ~50 MB of raw data (without counting any
>> RDD/structural overhead). (See code [1])
>>
>> So, to avoid confusion: I need 4 GB of RAM to save 4M 50-byte records to
>> Cassandra. That's an order of magnitude more than the raw data.
>>
>> I understood that Calliope builds on top of the Hadoop support of
>> Cassandra, which builds on top of SSTables and sstableloader.
>>
>> I would like to know what the memory usage factor of Calliope is and which
>> parameters I could use to control/tune it.
>>
>> Any experience/advice on that?
>>
>>  -kr, Gerard.
>>
>> [1] https://gist.github.com/maasg/68de6016bffe5e71b78c
>>
>
>
