So you only have 1 Cassandra node? If you're only interested in getting the complete write pass done as fast as possible before you begin reading the data back, take a look at the new bulk loader in Cassandra:
http://www.datastax.com/dev/blog/bulk-loading

-Jake

On Thu, Aug 18, 2011 at 11:03 AM, Paul Loy <ketera...@gmail.com> wrote:
> Yeah, we're processing item similarities. So we are writing single columns
> at a time, although we do batch these into 400 mutations before sending to
> Cassy. We currently perform almost 2 billion calculations that then write
> almost 4 billion columns.
>
> Once all similarities are calculated, we just grab a slice per item and
> create a denormalised vector of similar items (trimmed down to the top N and
> only those above a certain threshold). This makes lookup super fast as we
> only get one column from Cassandra.
>
> So we just want to optimise the crunching and storing phase, as that's an
> O(n^2) complexity problem. The quicker we can make that, the quicker the
> whole process works.
>
> I'm going to try disabling minor compactions as a start.
>
> > Is the loading disk, CPU, or network bound?
>
> CPU is at 40% free.
> Only one Cassy node on the same box as the processor for now, so no network
> traffic.
> So I think it's disk access. Will find out for sure tomorrow after the
> current test runs.
>
> Thanks,
>
> Paul.
>
>
> On Thu, Aug 18, 2011 at 2:23 PM, Jake Luciani <jak...@gmail.com> wrote:
>
>> Are you writing lots of tiny rows or a few very large rows? Are you
>> batching mutations? Is the loading disk, CPU, or network bound?
>>
>> -Jake
>>
>> On Thu, Aug 18, 2011 at 7:08 AM, Paul Loy <ketera...@gmail.com> wrote:
>>
>>> Hi All,
>>>
>>> I have a program that crunches through around 3 billion calculations. We
>>> store the result of each of these in Cassandra to later query once, in
>>> order to create some vectors. Our processing is now limited by Cassandra
>>> rather than by the calculations themselves.
>>>
>>> I was wondering what settings I can change to increase the write
>>> throughput. Perhaps disabling all caching, etc., as I won't be able to
>>> keep it all in memory anyway and only want to query the results once.
>>>
>>> Any thoughts would be appreciated,
>>>
>>> Paul.
>>>
>>> --
>>> ---------------------------------------------
>>> Paul Loy
>>> p...@keteracel.com
>>> http://uk.linkedin.com/in/paulloy
>>
>>
>> --
>> http://twitter.com/tjake
>
>
> --
> ---------------------------------------------
> Paul Loy
> p...@keteracel.com
> http://uk.linkedin.com/in/paulloy

--
http://twitter.com/tjake
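
For reference, the approach in the linked post boils down to writing SSTables locally with SSTableSimpleUnsortedWriter and then streaming them into the cluster with bin/sstableloader. A minimal sketch along those lines follows; the keyspace ("Similarities"), column family ("ItemSimilarity") and item keys are made-up placeholders, and the exact writer constructor arguments vary between Cassandra versions, so treat it as an outline rather than a drop-in:

    import java.io.File;

    import org.apache.cassandra.db.marshal.AsciiType;
    import org.apache.cassandra.io.sstable.SSTableSimpleUnsortedWriter;
    import org.apache.cassandra.utils.ByteBufferUtil;

    public class SimilarityBulkWriter
    {
        public static void main(String[] args) throws Exception
        {
            // SSTables are written here; sstableloader infers the keyspace from
            // the directory name, so it must match ("Similarities" is made up).
            File dir = new File("/tmp/Similarities");

            // Buffers rows in memory (~64 MB here) and flushes them to disk as
            // SSTables. Constructor arguments are paraphrased from the blog post
            // and may differ slightly between Cassandra versions.
            SSTableSimpleUnsortedWriter writer = new SSTableSimpleUnsortedWriter(
                    dir, "Similarities", "ItemSimilarity", AsciiType.instance, null, 64);

            long timestamp = System.currentTimeMillis() * 1000; // microseconds

            // One row per item, one column per (similar item -> score) pair.
            writer.newRow(ByteBufferUtil.bytes("item-123"));
            writer.addColumn(ByteBufferUtil.bytes("item-456"),
                             ByteBufferUtil.bytes("0.87"), timestamp);
            writer.addColumn(ByteBufferUtil.bytes("item-789"),
                             ByteBufferUtil.bytes("0.42"), timestamp);

            writer.close();

            // Then stream the generated SSTables into the running node with:
            //   bin/sstableloader /tmp/Similarities
        }
    }

The main win over batched inserts is that the normal write path (commit log, memtables) is bypassed, which should help if the single node really is disk-bound.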
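
For comparison, the "400 mutations per batch" write path described above maps to Thrift's batch_mutate. A rough sketch, assuming an already-connected Cassandra.Client and the same made-up ItemSimilarity column family (the class, method name and data layout are illustrative only):

    import java.nio.ByteBuffer;
    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    import org.apache.cassandra.thrift.Cassandra;
    import org.apache.cassandra.thrift.Column;
    import org.apache.cassandra.thrift.ColumnOrSuperColumn;
    import org.apache.cassandra.thrift.ConsistencyLevel;
    import org.apache.cassandra.thrift.Mutation;
    import org.apache.cassandra.utils.ByteBufferUtil;

    public class SimilarityBatchWriter
    {
        private static final String CF = "ItemSimilarity"; // made-up column family name

        // Writes one (similar item -> score) column per pair, grouped by item row,
        // in a single batch_mutate round trip.
        public static void writeBatch(Cassandra.Client client,
                                      Map<String, Map<String, Double>> scoresByItem)
                throws Exception
        {
            long timestamp = System.currentTimeMillis() * 1000; // microseconds
            Map<ByteBuffer, Map<String, List<Mutation>>> mutationMap =
                    new HashMap<ByteBuffer, Map<String, List<Mutation>>>();

            for (Map.Entry<String, Map<String, Double>> row : scoresByItem.entrySet())
            {
                List<Mutation> rowMutations = new ArrayList<Mutation>();
                for (Map.Entry<String, Double> col : row.getValue().entrySet())
                {
                    Column column = new Column();
                    column.setName(ByteBufferUtil.bytes(col.getKey()));
                    column.setValue(ByteBufferUtil.bytes(col.getValue().toString()));
                    column.setTimestamp(timestamp);

                    ColumnOrSuperColumn cosc = new ColumnOrSuperColumn();
                    cosc.setColumn(column);

                    Mutation mutation = new Mutation();
                    mutation.setColumn_or_supercolumn(cosc);
                    rowMutations.add(mutation);
                }

                Map<String, List<Mutation>> byColumnFamily =
                        new HashMap<String, List<Mutation>>();
                byColumnFamily.put(CF, rowMutations);
                mutationMap.put(ByteBufferUtil.bytes(row.getKey()), byColumnFamily);
            }

            // One round trip for the whole batch; CL.ONE is enough with a single node.
            client.batch_mutate(mutationMap, ConsistencyLevel.ONE);
        }
    }

Bigger batches save round trips but make each request heavier, so the batch size is worth experimenting with alongside the compaction settings mentioned above.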