Hi Steve,

Thanks for sending over more details.

The pre- vs. post-commit hook question is a good one. We chose a pre-commit hook over a post-commit hook for Riak Search indexing because a post-commit hook doesn't currently provide back-pressure to the Riak KV side of the system. With a post-commit hook, it would be possible to get into a situation where the queue of objects waiting to be indexed grows so large that it exhausts available resources. The pre-commit hook prevents that situation.

Everything in your setup below appears correct, so it may be time to look into batching as a way to increase the speed of indexing.

Best,
Rusty

On Mon, Jun 13, 2011 at 1:45 PM, Steve Webb <sw...@gnip.com> wrote:

> Ok, I've changed my two VMs to each have:
>
> 3 CPUs, 1GB ram, 120GB disk
>
> I'm ingesting the twitter spritzer stream (about 10-20 tweets per second,
> approx 2k of data per tweet). One bucket is storing the non-indexed tweets
> in full. Another bucket is storing the indexed tweet string, id, date and
> username. A maximum of 20 clients can be hitting the 'cluster' at any one
> time.
>
> I'm using n_val=2 so there is replication going on behind the scenes.
>
> I'm using a hardware load-balancer to distribute the work amongst the two
> nodes and now I'm seeing about 75% CPU usage as opposed to 100% on one node
> and 50% on the replicating-only node.
>
> I've monitored the VM over the last few days and it seems to be mostly
> CPU-bound. The disk I/O is low. The Network I/O is low.
>
> Q: Can I change the pre-commit to a post-commit trigger or something
> perhaps or will that make any difference at all? I'm ok if the tweet stuff
> doesn't get indexed immediately and there's a slight lag in indexing if it
> saves on CPU.
>
> Here's my search schema (the default, I think):
>
> root@ha2:/var/log/riaksearch# search-cmd show_schema Index
> Attempting to restart script through sudo -u riak
>
> %% Schema for 'Index'
>
> {
>     schema,
>     [
>         {version, "1.1"},
>         {n_val, 3},
>         {default_field, "value"},
>         {analyzer_factory, {erlang, text_analyzers, whitespace_analyzer_factory}}
>     ],
>     [
>         %% Field names ending in "_num" are indexed as integers
>         {dynamic_field, [
>             {name, "*_num"},
>             {type, integer},
>             {analyzer_factory, {erlang, text_analyzers, integer_analyzer_factory}}
>         ]},
>
>         %% Field names ending in "_int" are indexed as integers
>         {dynamic_field, [
>             {name, "*_int"},
>             {type, integer},
>             {analyzer_factory, {erlang, text_analyzers, integer_analyzer_factory}}
>         ]},
>
>         %% Field names ending in "_dt" are indexed as dates
>         {dynamic_field, [
>             {name, "*_dt"},
>             {type, date},
>             {analyzer_factory, {erlang, text_analyzers, noop_analyzer_factory}}
>         ]},
>
>         %% Field names ending in "_date" are indexed as dates
>         {dynamic_field, [
>             {name, "*_date"},
>             {type, date},
>             {analyzer_factory, {erlang, text_analyzers, noop_analyzer_factory}}
>         ]},
>
>         %% Field names ending in "_txt" are indexed as full text"
>         {dynamic_field, [
>             {name, "*_txt"},
>             {type, string},
>             {analyzer_factory, {erlang, text_analyzers, standard_analyzer_factory}}
>         ]},
>
>         %% Field names ending in "_text" are indexed as full text"
>         {dynamic_field, [
>             {name, "*_text"},
>             {type, string},
>             {analyzer_factory, {erlang, text_analyzers, standard_analyzer_factory}}
>         ]},
>
>         %% Everything else is a string
>         {dynamic_field, [
>             {name, "*"},
>             {type, string},
>             {analyzer_factory, {erlang, text_analyzers, whitespace_analyzer_factory}}
>         ]}
>     ]
> }.
>
> Here's an indexed record:
>
> root@ha1:~# curl -s http://ha:8098/riak/gnip/80329247314550784 | json_xs
> {
>    "created_at" : "Mon Jun 13 17:42:39 +0000 2011",
>    "tweet" : "@NielJDBSimpson yeaah",
>    "screen_name" : "SophieBieber69"
> }
>
> A non-indexed record:
>
> root@ha1:~# curl -s http://ha:8098/riak/tweets/80329247314550784
>
> "{\"entities\":{\"urls\":[],\"hashtags\":[],\"user_mentions\":[{\"indices\":[0,15],\"screen_name\":\"NielJDBSimpson\",\"name\":\"\\u2665Enielle Anne\\u2665 \\u25d5\\u203f\\u25d5\",\"id_str\":\"197405933\",\"id\":197405933}]},\"retweet_count\":0,\"truncated\":false,\"text\":\"@NielJDBSimpson yeaah\",\"created_at\":\"Mon Jun 13 17:42:39 +0000 2011\",\"place\":null,\"in_reply_to_status_id\":77609368182472704,\"coordinates\":null,\"source\":\"web\",\"geo\":null,\"favorited\":false,\"in_reply_to_status_id_str\":\"77609368182472704\",\"id_str\":\"80329247314550784\",\"in_reply_to_screen_name\":\"NielJDBSimpson\",\"in_reply_to_user_id_str\":\"197405933\",\"user\":{\"lang\":\"en\",\"created_at\":\"Wed May 04 17:02:14 +0000 2011\",\"profile_text_color\":\"3D1957\",\"profile_image_url\":\"http:\\\/\\\/a3.twimg.com\\\/profile_images\\\/1372500926\\\/IMG01256-20110418-1827_normal.jpg\",\"is_translator\":false,\"statuses_count\":124,\"profile_sidebar_fill_color\":\"7AC3EE\",\"listed_count\":0,\"following\":null,\"profile_background_tile\":true,\"friends_count\":425,\"description\":\"I love Justin Bieber i saw him 23\\\/03\\\/2011 in concert best nite ever. Never say Never imma beliber.Follow me I will follow back xx :P\",\"screen_name\":\"SophieBieber69\",\"contributors_enabled\":false,\"verified\":false,\"profile_link_color\":\"FF0000\",\"url\":null,\"profile_sidebar_border_color\":\"65B0DA\",\"default_profile_image\":false,\"time_zone\":null,\"protected\":false,\"id_str\":\"293033762\",\"notifications\":null,\"profile_use_background_image\":true,\"favourites_count\":6,\"location\":\"Sheffield\",\"name\":\"Sophie Bieber \",\"profile_background_color\":\"642D8B\",\"id\":293033762,\"default_profile\":false,\"show_all_inline_media\":false,\"follow_request_sent\":null,\"geo_enabled\":false,\"profile_background_image_url\":\"http:\\\/\\\/a1.twimg.com\\\/images\\\/themes\\\/theme10\\\/bg.gif\",\"utc_offset\":null,\"followers_count\":184},\"id\":80329247314550784,\"contributors\":null,\"retweeted\":false,\"in_reply_to_user_id\":197405933}\r"
>
> - Steve Webb
>
> --
> Steve Webb - Senior System Administrator for gnip.com
> http://twitter.com/GnipWebb
>
> On Thu, 9 Jun 2011, Rusty Klophaus wrote:
>
>> Hi Steve,
>>
>> Riak does best with a lot of memory and a fast disk. Depending on how much
>> data you have in the system, putting two nodes into 1GB of memory on a
>> single VM may be causing the system to overrun available resources and
>> page out to disk, and depending on how you've set up your virtualized
>> environment, you could be paying extra costs with each disk access,
>> compounding the problem. My first recommendation would be to either run
>> the test again while monitoring disk operations using iostat to see if
>> disk is the problem, or to just go ahead and test on bigger hardware. I
>> suspect you will see much less of a performance difference between the
>> tests once there are ample resources.
>>
>> That said, some slowdown is expected when you turn on indexing, as Riak
>> Search adds quite a bit of overhead in parsing and tokenizing the
>> document, and then storing the results.
>>
>> There are two ways to speed up indexing:
>>
>> 1. Reduce the size of your documents. If your documents are large, but
>>    you only need one or two fields indexed, you can create smaller
>>    "surrogate" documents with just the fields you need indexed, plus a
>>    link back to your original document.
>> 2. Batch your writes using the Solr interface. Riak Search uses
>>    "term-based partitioning". Term-based partitioning reduces complexity
>>    during queries, at the cost of increased complexity during writes. You
>>    can gain back some of the lost performance by batching your writes.
>>    This allows the system to plan which messages it sends more
>>    intelligently, thus sending fewer messages and reducing overhead. The
>>    downside here is that you can't use the Riak KV interface, you need to
>>    switch to the Solr interface.
>>
>> Would you mind describing a bit more about the size and shape of your
>> data (how many objects, average object size, object format, throughput,
>> etc.) and ideally attach your Riak Search schema?
>>
>> Thanks,
>> Rusty
>>
>> On Tue, Jun 7, 2011 at 4:35 PM, Steve Webb <sw...@gnip.com> wrote:
>>
>>> Hey there.
>>>
>>> I'm inserting twitter spritzer tweets into a bucket that doesn't have a
>>> precommit index hook, and a few fields from the tweet into a second
>>> bucket that does have the precommit hook.
>>>
>>> Speeds on the inserts into the indexed bucket are an order of magnitude
>>> slower than the non-indexed bucket.
>>>
>>> I'm using a 1GB ram, 20GB disk vmware VM, 2-node cluster, ubuntu 10.4,
>>> riaksearch 0.14.0 with n_val = 2.
>>>
>>> Is there a way to do a more lazy indexing to where it doesn't slow down
>>> inserts so much?
>>>
>>> - Steve
>>>
>>> --
>>> Steve Webb - Senior System Administrator for gnip.com
>>> http://twitter.com/GnipWebb
>>>
>>> _______________________________________________
>>> riak-users mailing list
>>> riak-users@lists.basho.com
>>> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
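
To make the two suggestions quoted above a bit more concrete (smaller "surrogate" documents, and batched writes through the Solr interface), here is a minimal Python sketch. Several things in it are assumptions rather than anything confirmed in this thread: the update URL (a Solr-style /solr/<index>/update endpoint on the host and port from the curl examples), the index name "gnip", the batch size of 100, and all function names. It also assumes each raw tweet is a plain JSON object string, so a double-encoded value like the one shown earlier would need one extra json.loads first. Treat it as a starting point to adapt, not a tested client.

```python
import json
import urllib.request
from xml.sax.saxutils import escape, quoteattr

# Assumption: Solr-style update endpoint at /solr/<index>/update on the
# same host and port as the curl examples. Verify against your install.
SOLR_UPDATE_URL = "http://ha:8098/solr/gnip/update"
BATCH_SIZE = 100  # arbitrary starting point, tune by measurement


def make_surrogate(raw_tweet):
    """Reduce a full tweet (a JSON object string) to the indexed fields."""
    tweet = json.loads(raw_tweet)
    return {
        "id": tweet["id_str"],  # doubles as the link back to the full tweet
        "created_at": tweet["created_at"],
        "tweet": tweet["text"],
        "screen_name": tweet["user"]["screen_name"],
    }


def build_add_xml(docs):
    """Render flat dicts as a single Solr-style <add> request body."""
    parts = ["<add>"]
    for doc in docs:
        parts.append("<doc>")
        for name, value in doc.items():
            parts.append(
                "<field name=%s>%s</field>" % (quoteattr(name), escape(str(value)))
            )
        parts.append("</doc>")
    parts.append("</add>")
    return "".join(parts)


def batches(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def post_batch(xml_body, url=SOLR_UPDATE_URL):
    """POST one batch and return the HTTP status code."""
    request = urllib.request.Request(
        url,
        data=xml_body.encode("utf-8"),
        headers={"Content-Type": "text/xml"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status


def index_tweets(raw_tweets, send=post_batch):
    """Index tweets in batches instead of issuing one write per tweet."""
    surrogates = [make_surrogate(raw) for raw in raw_tweets]
    for batch in batches(surrogates, BATCH_SIZE):
        send(build_add_xml(batch))
```

The `send` parameter is there so the batching logic can be exercised without a running cluster, for example by passing `print` or a list's `append` in place of `post_batch`.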
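
As a rough illustration of the back-pressure argument at the top of this message, the toy model below compares an indexing queue with no limit (loosely, the post-commit case) against one where writers must wait once a limit is reached (loosely, the pre-commit case). It is a deliberately simplified simulation and does not reflect Riak's internals. The function name, the tick-based model, and every number in it are made up for the example.

```python
def simulate(ticks, arrivals_per_tick, indexed_per_tick, max_queue=None):
    """Toy model of an indexing queue.

    Each tick, writers try to add `arrivals_per_tick` objects and the
    indexer then drains up to `indexed_per_tick` of them. With `max_queue`
    set, writers stop adding once the queue is full (back-pressure).
    With `max_queue=None`, nothing slows the writers down.

    Returns (objects accepted, peak queue depth).
    """
    depth = 0
    accepted = 0
    peak = 0
    for _ in range(ticks):
        for _ in range(arrivals_per_tick):
            if max_queue is not None and depth >= max_queue:
                break  # writers are made to wait for the indexer
            depth += 1
            accepted += 1
        peak = max(peak, depth)
        depth = max(0, depth - indexed_per_tick)
    return accepted, peak


if __name__ == "__main__":
    # Writers produce twice as fast as the indexer can consume.
    print("no back-pressure:  ", simulate(100, 20, 10))
    print("with back-pressure:", simulate(100, 20, 10, max_queue=50))
```

In the unbounded run the peak queue depth keeps growing for as long as writes outpace indexing, which is the resource-exhaustion risk described above. In the bounded run the peak never exceeds the limit, at the cost of writers being throttled to the indexer's pace.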