Why not decouple the Twitter stream processing from the indexing? More than
likely you have a single process consuming the spritzer stream, so you can
put the fetched tweets in a queue (HornetQ, beanstalkd, or even a simple
Redis queue) and then have workers pull from the queue and insert into Riak.
If you run one worker per node, the inserts go into all nodes in parallel.
If you need to keep CPU free (e.g. for searches), throttle the workers to a
rate the nodes can sustain. If the queue keeps growing, add another Riak
node (and with it another local worker).
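
A rough sketch of the queue-and-workers idea, in Python. It is only an
illustration under some assumptions: the standard library's in-process
queue.Queue stands in for a real broker (HornetQ, beanstalkd, Redis), and
`store` is a hypothetical placeholder for whatever call your Riak client
uses to write an object. The names `worker`, `run_demo` and `max_per_sec`
are made up for this example.

```python
import queue
import threading
import time

STOP = object()  # sentinel telling a worker to exit


def worker(tweets, store, max_per_sec=None):
    """Pull tweets off the queue and hand each one to `store`.

    `store` is a placeholder for the real Riak write (an assumption here).
    `max_per_sec` throttles the worker so the node keeps some CPU free.
    """
    min_interval = 1.0 / max_per_sec if max_per_sec else 0.0
    while True:
        item = tweets.get()
        if item is STOP:
            break
        started = time.monotonic()
        store(item)
        # Sleep off whatever is left of this item's time slot.
        remaining = min_interval - (time.monotonic() - started)
        if remaining > 0:
            time.sleep(remaining)


def run_demo(n_tweets=20, n_workers=2):
    """Feed fake tweets through the queue; return what the workers stored."""
    tweets = queue.Queue()
    stored = []
    lock = threading.Lock()

    def fake_store(tweet):
        # Stand-in for the Riak insert: just record the tweet.
        with lock:
            stored.append(tweet)

    threads = [
        threading.Thread(target=worker, args=(tweets, fake_store, 1000))
        for _ in range(n_workers)
    ]
    for t in threads:
        t.start()

    # The stream consumer only enqueues; it never talks to Riak itself.
    for i in range(n_tweets):
        tweets.put({"id": i, "text": "tweet %d" % i})

    # One sentinel per worker so every worker shuts down cleanly.
    for _ in threads:
        tweets.put(STOP)
    for t in threads:
        t.join()
    return stored


if __name__ == "__main__":
    print(len(run_demo()), "tweets stored")
```

In a real setup each worker would be a separate process on its own Riak
node, and watching the queue depth over time tells you whether the workers
are keeping up with the stream.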

-jd

2011/6/13 Steve Webb <sw...@gnip.com>

> Ok, I've changed my two VMs to each have:
>
> 3 CPUs, 1GB ram, 120GB disk
>
> I'm ingesting the twitter spritzer stream (about 10-20 tweets per second,
> approx 2k of data per tweet).  One bucket is storing the non-indexed tweets
> in full.  Another bucket is storing the indexed tweet string, id, date and
> username.  A maximum of 20 clients can be hitting the 'cluster' at any one
> time.
>
> I'm using n_val=2 so there is replication going on behind the scenes.
>
> I'm using a hardware load-balancer to distribute the work amongst the two
> nodes and now I'm seeing about 75% CPU usage as opposed to 100% on one node
> and 50% on the replicating-only node.
>
> I've monitored the VM over the last few days and it seems to be mostly
> CPU-bound.  The disk I/O is low.  The Network I/O is low.
>
> Q: Can I change the pre-commit to a post-commit trigger or something
> perhaps or will that make any difference at all?  I'm ok if the tweet stuff
> doesn't get indexed immediately and there's a slight lag in indexing if it
> saves on CPU.
>
>
_______________________________________________
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
