Nice one, thanks. We're now up to 500k writes a second on one box, which is pretty good (well, good enough until our data grows 5-fold). So maybe turning off durable_writes will speed us up some more!
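[Editor's note: the durable_writes toggle Aaron describes below is a keyspace attribute; a sketch of setting it from cassandra-cli of that era (the keyspace name "Similarity" is hypothetical, and exact syntax may vary by version, so check `help update keyspace;` on yours):

```
update keyspace Similarity with durable_writes = false;
```

As the quoted cli help notes, this makes all RowMutations on the keyspace bypass the CommitLog, so any writes not yet flushed from memtables to SSTables are lost if the node crashes, which is acceptable here only because the data can be recomputed.]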
Cheers, Paul.

On Thu, Aug 18, 2011 at 11:40 PM, aaron morton <aa...@thelastpickle.com> wrote:

> A couple of thoughts: 400 row mutations in a batch may be a bit high. More is
> not always better. Watch the TP stats to see if the mutation pool is backing
> up excessively.
>
> Also, if you feel like having fun, take a look at the durable_writes config
> setting for keyspaces, from the cli help…
>
> - durable_writes: When set to false all RowMutations on keyspace will
> by-pass CommitLog. Set to true by default.
>
> This will remove disk access from the write path, which sounds OK in your
> case.
>
> When you are doing the reads, the fastest slice predicate is one with no
> start, no finish, reversed = false
> (see http://thelastpickle.com/2011/07/04/Cassandra-Query-Plans/). You can now
> reverse the storage order of comparators, so if you are getting cols from
> the end of the row, consider changing the storage order.
>
> Cheers
>
> -----------------
> Aaron Morton
> Freelance Cassandra Developer
> @aaronmorton
> http://www.thelastpickle.com
>
> On 19/08/2011, at 3:43 AM, Paul Loy wrote:
>
> Yeah, the data after crunching drops to just 65,000 columns, so one Cassandra
> node is plenty. That will all go in memory on one box. It's only the crunching
> where we have lots of data and then need it arranged in a structured manner.
> That's why I don't use flat files that I just append to: I need them in
> order of similarity to generate the vectors.
>
> Bulk loading looks interesting.
>
> On Thu, Aug 18, 2011 at 4:21 PM, Jake Luciani <jak...@gmail.com> wrote:
>
>> So you only have 1 cassandra node?
>>
>> If you are interested only in getting the complete work done as fast as
>> possible before you begin reading, take a look at the new bulk loader in
>> cassandra:
>>
>> http://www.datastax.com/dev/blog/bulk-loading
>>
>> -Jake
>>
>> On Thu, Aug 18, 2011 at 11:03 AM, Paul Loy <ketera...@gmail.com> wrote:
>>
>>> Yeah, we're processing item similarities.
>>> So we are writing single columns at a time, although we do batch these into
>>> 400 mutations before sending to Cassy. We currently perform almost 2 billion
>>> calculations that then write almost 4 billion columns.
>>>
>>> Once all similarities are calculated, we just grab a slice per item and
>>> create a denormalised vector of similar items (trimmed down to topN and only
>>> those above a certain threshold). This makes lookup super fast, as we only
>>> get one column from cassandra.
>>>
>>> So we just want to optimise the crunching and storing phase, as that's an
>>> O(n^2) complexity problem. The quicker we can make that, the quicker the
>>> whole process works.
>>>
>>> I'm going to try disabling minor compactions as a start.
>>>
>>> > is the loading disk or cpu or network bound?
>>>
>>> CPU is at 40% free. There's only one cassy node, on the same box as the
>>> processor for now, so no network traffic. So I think it's disk access;
>>> will find out for sure tomorrow after the current test runs.
>>>
>>> Thanks,
>>>
>>> Paul.
>>>
>>> On Thu, Aug 18, 2011 at 2:23 PM, Jake Luciani <jak...@gmail.com> wrote:
>>>
>>>> Are you writing lots of tiny rows or a few very large rows? Are you
>>>> batching mutations? Is the loading disk, cpu, or network bound?
>>>>
>>>> -Jake
>>>>
>>>> On Thu, Aug 18, 2011 at 7:08 AM, Paul Loy <ketera...@gmail.com> wrote:
>>>>
>>>>> Hi All,
>>>>>
>>>>> I have a program that crunches through around 3 billion calculations.
>>>>> We store the result of each of these in cassandra to later query once, in
>>>>> order to create some vectors. Our processing is now limited by Cassandra,
>>>>> rather than by the calculations themselves.
>>>>>
>>>>> I was wondering what settings I can change to increase the write
>>>>> throughput. Perhaps disabling all caching, etc., as I won't be able to keep
>>>>> it all in memory anyway and only want to query the results once.
>>>>>
>>>>> Any thoughts would be appreciated,
>>>>>
>>>>> Paul.
>>>>>
>>>>> --
>>>>> ---------------------------------------------
>>>>> Paul Loy
>>>>> p...@keteracel.com
>>>>> http://uk.linkedin.com/in/paulloy
>>>>
>>>> --
>>>> http://twitter.com/tjake
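[Editor's note: Aaron's advice about batch size is worth making concrete. The batching Paul describes is just chunking a stream of mutations, and the sweet spot is found by measuring (e.g. via TP stats) rather than by pushing the count up. A minimal Python sketch of the chunking, with the 400-mutation figure taken from the thread; `send_batch` is a hypothetical stand-in for whatever your client library's batch-mutate call is:

```python
from itertools import islice


def batches(mutations, size=400):
    """Yield successive chunks of at most `size` mutations,
    preserving their order."""
    it = iter(mutations)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


# Hypothetical usage: send each chunk as a single batched write,
# then tune `size` down if the mutation thread pool backs up.
# for chunk in batches(generate_mutations(), size=400):
#     send_batch(chunk)
```

Because the chunk size is a parameter, it is easy to sweep a few values (e.g. 50, 100, 400) against the measured write throughput and keep the smallest size that saturates the node.]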