Re: speeding up riaksearch precommit indexing

Rusty Klophaus Wed, 15 Jun 2011 05:30:19 -0700

Hi Steve,

Thanks for sending over more details.


The pre- vs. post-commit hook question is a good one. We chose a pre-commit
hook over a post-commit hook for Riak Search indexing because a post-commit
hook doesn't currently provide back-pressure to the Riak KV side of the
system. With a post-commit hook, the queue of objects waiting to be indexed
could grow so large that it exhausts available resources. The pre-commit
hook prevents that situation.
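
To make the back-pressure point concrete, here is a toy Python model. It is
not Riak code: the function name, the per-tick rates, and the backlog limit
are all made up for illustration. It only shows why a writer that is held
back keeps the backlog bounded, while a writer that is never held back lets
it grow without limit.

```python
def simulate(writes_per_tick, indexed_per_tick, ticks, max_backlog=None):
    """Return (accepted_writes, backlog) after `ticks` steps.

    max_backlog=None loosely models a post-commit style hook: every write is
    accepted and unindexed objects pile up. A finite max_backlog loosely
    models back-pressure: a write is accepted only when there is room.
    """
    backlog = 0
    accepted = 0
    for _ in range(ticks):
        for _ in range(writes_per_tick):
            if max_backlog is not None and backlog >= max_backlog:
                break  # the writer is held back until the indexer catches up
            backlog += 1
            accepted += 1
        # the indexer drains a fixed number of objects per tick
        backlog = max(0, backlog - indexed_per_tick)
    return accepted, backlog


# No back-pressure: all 2000 writes are accepted, 1500 are still unindexed.
print(simulate(20, 5, 100))                   # (2000, 1500)
# With back-pressure: fewer writes are accepted, but the backlog stays small.
print(simulate(20, 5, 100, max_backlog=10))   # (505, 5)
```

The trade-off is visible in the two results: back-pressure slows the writer
down to roughly the indexer's pace instead of letting the queue grow.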

Everything in your setup below appears correct, so it may be time to look
into batching as a way to increase the speed of indexing.
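
As a rough starting point for batching, here is a Python sketch that groups
documents into a single Solr-style <add> request. Treat the details as
assumptions to check against your install: the update URL
(http://ha:8098/solr/gnip/update), the "id" field, the batch size, and the
helper names are illustrative, and the field names are simply taken from your
indexed record.

```python
import urllib.request
import xml.etree.ElementTree as ET


def build_add_xml(docs):
    """Build one Solr-style <add> body holding many documents."""
    add = ET.Element("add")
    for doc in docs:
        doc_el = ET.SubElement(add, "doc")
        for name, value in doc.items():
            field = ET.SubElement(doc_el, "field", name=name)
            field.text = str(value)  # ElementTree escapes &, <, > on output
    return ET.tostring(add, encoding="utf-8")


def batches(items, size):
    """Yield consecutive slices of `items`, each at most `size` long."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def post_batch(docs, url="http://ha:8098/solr/gnip/update"):
    """POST one batch. The URL is an assumption; adjust it for your cluster."""
    req = urllib.request.Request(
        url,
        data=build_add_xml(docs),
        headers={"Content-Type": "text/xml"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status


if __name__ == "__main__":
    tweets = [
        {"id": "80329247314550784",
         "created_at": "Mon Jun 13 17:42:39 +0000 2011",
         "tweet": "@NielJDBSimpson yeaah",
         "screen_name": "SophieBieber69"},
    ]
    for batch in batches(tweets, 100):  # 100 is an arbitrary batch size
        print(build_add_xml(batch).decode("utf-8"))
        # post_batch(batch)  # uncomment once the URL matches your setup
```

The idea is to buffer tweets on the client for a short while and send them
together, so each request carries many documents instead of one.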

Best,
Rusty


On Mon, Jun 13, 2011 at 1:45 PM, Steve Webb <sw...@gnip.com> wrote:

> Ok, I've changed my two VMs to each have:
>
> 3 CPUs, 1GB ram, 120GB disk
>
> I'm ingesting the twitter spritzer stream (about 10-20 tweets per second,
> approx 2k of data per tweet).  One bucket is storing the non-indexed tweets
> in full.  Another bucket is storing the indexed tweet string, id, date and
> username.  A maximum of 20 clients can be hitting the 'cluster' at any one
> time.
>
> I'm using n_val=2 so there is replication going on behind the scenes.
>
> I'm using a hardware load-balancer to distribute the work amongst the two
> nodes and now I'm seeing about 75% CPU usage as opposed to 100% on one node
> and 50% on the replicating-only node.
>
> I've monitored the VM over the last few days and it seems to be mostly
> CPU-bound.  The disk I/O is low.  The Network I/O is low.
>
> Q: Can I change the pre-commit to a post-commit trigger or something
> perhaps or will that make any difference at all?  I'm ok if the tweet stuff
> doesn't get indexed immediately and there's a slight lag in indexing if it
> saves on CPU.
>
> Here's my search schema (the default, I think):
>
> root@ha2:/var/log/riaksearch# search-cmd show_schema Index
> Attempting to restart script through sudo -u riak
>
> %% Schema for 'Index'
>
> {
>    schema,
>    [
>        {version, "1.1"},
>        {n_val, 3},
>        {default_field, "value"},
>        {analyzer_factory, {erlang, text_analyzers, whitespace_analyzer_factory}}
>    ],
>    [
>        %% Field names ending in "_num" are indexed as integers
>        {dynamic_field, [
>            {name, "*_num"},
>            {type, integer},
>            {analyzer_factory, {erlang, text_analyzers, integer_analyzer_factory}}
>        ]},
>
>        %% Field names ending in "_int" are indexed as integers
>        {dynamic_field, [
>            {name, "*_int"},
>            {type, integer},
>            {analyzer_factory, {erlang, text_analyzers, integer_analyzer_factory}}
>        ]},
>
>        %% Field names ending in "_dt" are indexed as dates
>        {dynamic_field, [
>            {name, "*_dt"},
>            {type, date},
>            {analyzer_factory, {erlang, text_analyzers, noop_analyzer_factory}}
>        ]},
>
>        %% Field names ending in "_date" are indexed as dates
>        {dynamic_field, [
>            {name, "*_date"},
>            {type, date},
>            {analyzer_factory, {erlang, text_analyzers, noop_analyzer_factory}}
>        ]},
>
>        %% Field names ending in "_txt" are indexed as full text"
>        {dynamic_field, [
>            {name, "*_txt"},
>            {type, string},
>            {analyzer_factory, {erlang, text_analyzers, standard_analyzer_factory}}
>        ]},
>
>        %% Field names ending in "_text" are indexed as full text"
>        {dynamic_field, [
>            {name, "*_text"},
>            {type, string},
>            {analyzer_factory, {erlang, text_analyzers, standard_analyzer_factory}}
>        ]},
>
>        %% Everything else is a string
>        {dynamic_field, [
>            {name, "*"},
>            {type, string},
>            {analyzer_factory, {erlang, text_analyzers, whitespace_analyzer_factory}}
>        ]}
>    ]
> }.
>
> Here's an indexed record:
>
> root@ha1:~# curl -s http://ha:8098/riak/gnip/80329247314550784 | json_xs
> {
>   "created_at" : "Mon Jun 13 17:42:39 +0000 2011",
>   "tweet" : "@NielJDBSimpson yeaah",
>   "screen_name" : "SophieBieber69"
> }
>
> A non-indexed record:
>
> root@ha1:~# curl -s http://ha:8098/riak/tweets/80329247314550784
>
> "{\"entities\":{\"urls\":[],\"hashtags\":[],\"user_mentions\":[{\"indices\":[0,15],\"screen_name\":\"NielJDBSimpson\",\"name\":\"\\u2665Enielle Anne\\u2665 \\u25d5\\u203f\\u25d5\",\"id_str\":\"197405933\",\"id\":197405933}]},\"retweet_count\":0,\"truncated\":false,\"text\":\"@NielJDBSimpson yeaah\",\"created_at\":\"Mon Jun 13 17:42:39 +0000 2011\",\"place\":null,\"in_reply_to_status_id\":77609368182472704,\"coordinates\":null,\"source\":\"web\",\"geo\":null,\"favorited\":false,\"in_reply_to_status_id_str\":\"77609368182472704\",\"id_str\":\"80329247314550784\",\"in_reply_to_screen_name\":\"NielJDBSimpson\",\"in_reply_to_user_id_str\":\"197405933\",\"user\":{\"lang\":\"en\",\"created_at\":\"Wed May 04 17:02:14 +0000 2011\",\"profile_text_color\":\"3D1957\",\"profile_image_url\":\"http:\\\/\\\/a3.twimg.com\\\/profile_images\\\/1372500926\\\/IMG01256-20110418-1827_normal.jpg\",\"is_translator\":false,\"statuses_count\":124,\"profile_sidebar_fill_color\":\"7AC3EE\",\"listed_count\":0,\"following\":null,\"profile_background_tile\":true,\"friends_count\":425,\"description\":\"I love Justin Bieber i saw him 23\\\/03\\\/2011 in concert best nite ever. Never say Never imma beliber.Follow me I will follow back xx :P\",\"screen_name\":\"SophieBieber69\",\"contributors_enabled\":false,\"verified\":false,\"profile_link_color\":\"FF0000\",\"url\":null,\"profile_sidebar_border_color\":\"65B0DA\",\"default_profile_image\":false,\"time_zone\":null,\"protected\":false,\"id_str\":\"293033762\",\"notifications\":null,\"profile_use_background_image\":true,\"favourites_count\":6,\"location\":\"Sheffield\",\"name\":\"Sophie Bieber \",\"profile_background_color\":\"642D8B\",\"id\":293033762,\"default_profile\":false,\"show_all_inline_media\":false,\"follow_request_sent\":null,\"geo_enabled\":false,\"profile_background_image_url\":\"http:\\\/\\\/a1.twimg.com\\\/images\\\/themes\\\/theme10\\\/bg.gif\",\"utc_offset\":null,\"followers_count\":184},\"id\":80329247314550784,\"contributors\":null,\"retweeted\":false,\"in_reply_to_user_id\":197405933}\r"
>
> - Steve Webb
>
> -- Steve Webb - Senior System Administrator for gnip.com
> http://twitter.com/GnipWebb
>
>
> On Thu, 9 Jun 2011, Rusty Klophaus wrote:
>
>> Hi Steve,
>>
>> Riak does best with a lot of memory and a fast disk. Depending on how much
>> data you have in the system, putting two nodes into 1GB of memory on a
>> single VM may be causing the system to overrun available resources and page
>> out to disk, and depending on how you've set up your virtualized
>> environment, you could be paying extra costs with each disk access,
>> compounding the problem. My first recommendation would be to either run the
>> test again while monitoring disk operations using iostat to see if disk is
>> the problem, or to just go ahead and test on bigger hardware. I suspect you
>> will see much less of a performance difference between the tests once there
>> are ample resources.
>>
>> That said, some slowdown is expected when you turn on indexing, as Riak
>> Search adds quite a bit of overhead in parsing and tokenizing the document,
>> and then storing the results.
>>
>> There are two ways to speed up indexing:
>>
>>  1. Reduce the size of your documents. If your documents are large, but
>>     you only need one or two fields indexed, you can create smaller
>>     "surrogate" documents with just the fields you need indexed, plus a
>>     link back to your original document.
>>  2. Batch your writes using the Solr interface. Riak Search uses
>>     "term-based partitioning". Term-based partitioning reduces complexity
>>     during queries, at the cost of increased complexity during writes. You
>>     can gain back some of the lost performance by batching your writes.
>>     This allows the system to plan which messages it sends more
>>     intelligently, thus sending fewer messages and reducing overhead. The
>>     downside here is that you can't use the Riak KV interface, you need to
>>     switch to the Solr interface.
>>
>> Would you mind describing a bit more about the size and shape of your
>> data (how many objects, average object size, object format, throughput,
>> etc.) and ideally attaching your Riak Search schema?
>>
>> Thanks,
>> Rusty
>>
>>
>> On Tue, Jun 7, 2011 at 4:35 PM, Steve Webb <sw...@gnip.com> wrote:
>>
>>> Hey there.
>>>
>>> I'm inserting twitter spritzer tweets into a bucket that doesn't have a
>>> precommit index hook, and a few fields from the tweet into a second bucket
>>> that does have the precommit hook.
>>>
>>> Speeds on the inserts into the indexed bucket are an order of magnitude
>>> slower than the non-indexed bucket.
>>>
>>> I'm using a 1GB ram, 20GB disk vmware VM, 2-node cluster, ubuntu 10.4,
>>> riaksearch 0.14.0 with n_val = 2.
>>>
>>> Is there a way to do a more lazy indexing to where it doesn't slow down
>>> inserts so much?
>>>
>>> - Steve
>>>
>>> --
>>> Steve Webb - Senior System Administrator for gnip.com
>>> http://twitter.com/GnipWebb
>>>
>>> _______________________________________________
>>> riak-users mailing list
>>> riak-users@lists.basho.com
>>> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
>>>
>>>
>>

