Re: Suggested settings for number crunching

2011-08-19 Thread Paul Loy
Nice one, thanks. We're now up to 500k a second on one box, which is pretty good (well, good enough until our data grows 5-fold). So maybe (un)durable_writes may speed us up some more! Cheers, Paul.
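For reference, durable_writes is a keyspace-level setting. A sketch of how it might be toggled from the cassandra-cli of that era (the keyspace name `crunch` is hypothetical; check `help update keyspace;` for the exact syntax on your version):

```
update keyspace crunch with durable_writes = false;
```

Disabling durable_writes skips the commit log on writes, trading durability for throughput: if a node dies before a memtable is flushed, those writes are lost, which may be acceptable for a recomputable crunching job.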

Re: Suggested settings for number crunching

2011-08-18 Thread aaron morton
A couple of thoughts: 400 row mutations in a batch may be a bit high; more is not always better. Watch the TP stats to see if the mutation pool is backing up excessively. Also, if you feel like having fun, take a look at the durable_writes config setting for keyspaces, from the cli help…
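The batch-size advice above can be sketched generically. This is a hypothetical helper (not from the thread) that chunks a stream of mutations into batches, with the batch size as the knob to tune down from 400 if the mutation pool backs up:

```python
from itertools import islice

def batches(mutations, batch_size=100):
    """Yield lists of at most batch_size mutations.

    batch_size is the tuning knob: the thread used 400, which may be
    high; smaller batches keep the server's mutation pool from backing up.
    """
    it = iter(mutations)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

# Example: 1050 mutations in batches of 100 -> 10 full batches + 1 of 50
sizes = [len(b) for b in batches(range(1050), 100)]
```

The right batch size is workload-dependent; the point is to make it a single parameter you can sweep while watching `tpstats`.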

Re: Suggested settings for number crunching

2011-08-18 Thread Paul Loy
Yeah, the data after crunching drops to just 65,000 columns, so one Cassandra node is plenty. That will all go in memory on one box. It's only during the crunching that we have lots of data and need it arranged in a structured manner. That's why I don't use flat files that I just append to. I need them in…

Re: Suggested settings for number crunching

2011-08-18 Thread Paul Loy
Yup, we do that. We currently have 200 threads that push mutations into a pool of Mutators (think Pelops, although that was too slow so we rolled our own much lower-level version). We have around 50 Thrift clients that mutations are then pushed through to Cassandra.
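The architecture described (many producer threads feeding a smaller pool of clients) can be sketched with a bounded queue. The thread counts are scaled down and `send_batch` is a hypothetical stand-in for a real Thrift client call:

```python
import queue
import threading

NUM_PRODUCERS = 8   # the thread used 200 producer threads
NUM_CLIENTS = 4     # the thread used ~50 Thrift clients
PER_PRODUCER = 250  # mutations each producer generates

q = queue.Queue(maxsize=1000)  # bounded, so producers back off if clients lag
sent = []
lock = threading.Lock()

def send_batch(batch):
    # Stand-in for a real Thrift batch_mutate call.
    with lock:
        sent.append(len(batch))

def producer(n):
    for i in range(n):
        q.put(("row-%d" % i, "col", "value"))

def client_worker():
    batch = []
    while True:
        item = q.get()
        if item is None:          # poison pill: flush remainder and exit
            if batch:
                send_batch(batch)
            return
        batch.append(item)
        if len(batch) >= 100:     # batch size knob, see earlier discussion
            send_batch(batch)
            batch = []

clients = [threading.Thread(target=client_worker) for _ in range(NUM_CLIENTS)]
for c in clients:
    c.start()
producers = [threading.Thread(target=producer, args=(PER_PRODUCER,))
             for _ in range(NUM_PRODUCERS)]
for p in producers:
    p.start()
for p in producers:
    p.join()
for _ in clients:
    q.put(None)                   # one poison pill per client
for c in clients:
    c.join()

total = sum(sent)                 # 8 producers x 250 mutations = 2000
```

The bounded queue is the important design choice: it applies back-pressure so producers cannot outrun the client pool indefinitely.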

Re: Suggested settings for number crunching

2011-08-18 Thread Jonathan Ellis
Step 0: use multiple threads to insert.

Re: Suggested settings for number crunching

2011-08-18 Thread Jake Luciani
So you only have one Cassandra node? If you are interested only in getting the complete write workload done as fast as possible before you begin reading, take a look at the new bulk loader in Cassandra: http://www.datastax.com/dev/blog/bulk-loading -Jake
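The bulk loader in the linked post is driven by the `sstableloader` tool, which streams pre-built SSTables into a cluster rather than going through the write path. Invocation details vary by Cassandra version, so treat this as an illustrative shape only (the path is hypothetical):

```shell
# Point sstableloader at a directory of pre-built SSTables;
# the expected directory layout and flags depend on the Cassandra version.
sstableloader /var/lib/cassandra/bulk/MyKeyspace/
```

This only pays off if the data can be written as SSTables up front; for incremental results produced during a long computation, the normal write path is usually simpler.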

Re: Suggested settings for number crunching

2011-08-18 Thread Paul Loy
Yeah, we're processing item similarities, so we are writing single columns at a time, although we do batch these into 400 mutations before sending to Cassy. We currently perform almost 2 billion calculations that then write almost 4 billion columns. Once all similarities are calculated, we just gr…
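As a rough sanity check on the numbers in this thread (almost 4 billion columns, and the ~500k writes/sec reported on one box elsewhere in the thread), the total write time works out to a couple of hours:

```python
columns = 4_000_000_000   # ~4 billion columns written
rate = 500_000            # ~500k writes/sec reported in the thread

seconds = columns / rate  # total write time in seconds
hours = seconds / 3600    # ~2.2 hours
```

That is before any speedup from disabling durable_writes or minor compactions, and assumes the single node sustains the reported rate for the whole run.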

Re: Suggested settings for number crunching

2011-08-18 Thread Jake Luciani
Are you writing lots of tiny rows or a few very large rows? Are you batching mutations? Is the loading disk-, CPU-, or network-bound? -Jake On Thu, Aug 18, 2011 at 7:08 AM, Paul Loy wrote: > Hi All, > > I have a program that crunches through around 3 billion calculations. We > store the result o…

Re: Suggested settings for number crunching

2011-08-18 Thread Sam Overton
Hi Paul, here's one idea: if you're going to write just once and then read once after the writes are finished, you could disable minor compactions whilst performing the writes. This means all disk IO bandwidth will be available for writes. After you have finished writing, re-enable minor compactions.
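One way minor compactions are typically toggled is via nodetool's compaction thresholds; exact arguments differ across Cassandra versions, so this is an illustrative sketch (keyspace `crunch` and column family `similarities` are hypothetical):

```shell
# Setting min and max thresholds to 0 disables minor compactions
nodetool -h localhost setcompactionthreshold crunch similarities 0 0

# ... run the bulk write workload ...

# Restore the historical defaults (4 and 32) afterwards
nodetool -h localhost setcompactionthreshold crunch similarities 4 32
```

Expect read performance to be poor until the backlog of SSTables is compacted after re-enabling, which fits the write-once-then-read pattern described above.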

Re: Suggested settings for number crunching

2011-08-18 Thread Jonathan Ellis
Where is your bottleneck? http://spyced.blogspot.com/2010/01/linux-performance-basics.html On Thu, Aug 18, 2011 at 6:08 AM, Paul Loy wrote: > Hi All, > > I have a program that crunches through around 3 billion calculations. We > store the result of each of these in cassandra to later query once i…