[PERFORM] Tsearch2 performance on big database
Hi, I'm looking for a *fast* solution to search through ~4 million records of book descriptions. I've installed PostgreSQL 8.0.1 on a dual Opteron server with 8G of memory, running Linux 2.6. I haven't done a lot of tuning on PostgreSQL itself, but here are the settings I have changed so far:

shared_buffers = 2000 (anything much bigger and the kernel doesn't allow it, still have to look into that)
effective_cache_size = 32768

Here's my table:

ilab=# \d books
                                     Table "public.books"
    Column     |          Type          |                        Modifiers
---------------+------------------------+----------------------------------------------------------
 recordnumber  | integer                | not null default nextval('books_recordnumber_seq'::text)
 membernumber  | integer                | not null default 0
 booknumber    | character varying(20)  | not null default ''::character varying
 author        | character varying(60)  | not null default ''::character varying
 titel         | text                   | not null
 description   | character varying(100) | not null default ''::character varying
 descriprest   | text                   | not null
 price         | bigint                 | not null default 0::bigint
 keywords      | character varying(100) | not null default ''::character varying
 dollarprice   | bigint                 | not null default 0::bigint
 countrynumber | smallint               | not null default 0::smallint
 entrydate     | date                   | not null
 status        | smallint               | not null default 0::smallint
 recordtype    | smallint               | not null default 0::smallint
 bookflags     | smallint               | not null default 0::smallint
 year          | smallint               | not null default 0::smallint
 firstedition  | smallint               | not null default 0::smallint
 dustwrapper   | smallint               | not null default 0::smallint
 signed        | smallint               | not null default 0::smallint
 cover         | smallint               | not null default 0::smallint
 specialfield  | smallint               | not null default 0::smallint
 idxfti        | tsvector               |
Indexes:
    "recordnumber_idx" unique, btree (recordnumber)
    "idxfti_idx" gist (idxfti)

idxfti is a tsvector of concatenated description and descriprest.

ilab=# select avg(character_length(description)), avg(character_length(descriprest)) from books;
         avg         |          avg
---------------------+----------------------
 89.1596992873947218 | 133.0468689304200538

Queries take forever to run.
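[Editor's note: for readers unfamiliar with the (pre-8.3, contrib) tsearch2 module, a setup producing the idxfti column and GiST index above would look roughly like the following. Column names come from the \d output; the trigger name is illustrative.]

```sql
-- Sketch of a tsearch2-era full-text setup matching the table above
-- (assumes the tsearch2 contrib module is installed in the database).
ALTER TABLE books ADD COLUMN idxfti tsvector;

-- Build the tsvector from the two description fields.
UPDATE books
   SET idxfti = to_tsvector('default', description || ' ' || descriprest);

-- Keep it current on insert/update via the tsearch2 helper trigger.
CREATE TRIGGER books_fti_update BEFORE INSERT OR UPDATE ON books
  FOR EACH ROW EXECUTE PROCEDURE tsearch2(idxfti, description, descriprest);

-- The GiST index used by the queries in this thread.
CREATE INDEX idxfti_idx ON books USING gist (idxfti);
VACUUM ANALYZE books;
```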
Right now we run a MySQL server, on which we maintain our own indices: we split the description fields by word and have separate tables for words and the book descriptions they appear in. For example, a query for the word 'terminology' on our MySQL search takes 5.8 seconds and returns 375 results. The same query on PostgreSQL using the tsearch2 index takes 30802.105 ms and returns 298 results.

How do I speed this up? Should I change settings, add or change indexes, or... what?

Rick Jansen
--
Systems Administrator for Rockingstone IT
http://www.rockingstone.com
http://www.megabooksearch.com - Search many book listing sites at once

---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send "unregister YourEmailAddressHere" to [EMAIL PROTECTED])
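[Editor's note: the hand-maintained MySQL index described above is essentially an inverted index. A minimal sketch of such a schema — table and column names are assumptions, not the poster's actual schema:]

```sql
-- Hand-rolled inverted index of the kind described above (illustrative).
CREATE TABLE words (
    wordid INTEGER PRIMARY KEY,
    word   VARCHAR(60) UNIQUE NOT NULL
);
CREATE TABLE word_book (
    wordid       INTEGER NOT NULL,  -- references words(wordid)
    recordnumber INTEGER NOT NULL,  -- the book the word appears in
    PRIMARY KEY (wordid, recordnumber)
);

-- A one-word search then becomes two indexed lookups and a join:
SELECT b.*
  FROM words w
  JOIN word_book wb ON wb.wordid = w.wordid
  JOIN books b      ON b.recordnumber = wb.recordnumber
 WHERE w.word = 'terminology';
```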
Re: [PERFORM] Tsearch2 performance on big database
Mike Rylander wrote:
> On Tue, 22 Mar 2005 15:36:11 +0300 (MSK), Oleg Bartunov wrote:
>> use something like
>>   echo "15000" > /proc/sys/kernel/shmmax
>> to increase shared memory. In your case you could dedicate much more
>> memory.
>>
>> Regards, Oleg

Thanks, I'll check that out.

> And Oleg should know. Unless I'm mistaken, he (co)wrote tsearch2.
> Other than shared buffers, I can't imagine what could be causing that
> kind of slowness. EXPLAIN ANALYZE, please?

ilab=# explain analyze select count(titel) from books where idxfti @@ to_tsquery('default', 'buckingham | palace');
                                                              QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=35547.99..35547.99 rows=1 width=56) (actual time=125968.119..125968.120 rows=1 loops=1)
   ->  Index Scan using idxfti_idx on books  (cost=0.00..35525.81 rows=8869 width=56) (actual time=0.394..125958.245 rows=3080 loops=1)
         Index Cond: (idxfti @@ '\'buckingham\' | \'palac\''::tsquery)
 Total runtime: 125968.212 ms
(4 rows)

Time: 125969.264 ms
ilab=#

> As an example of what I think you *should* be seeing, I have a similar
> box (4 procs, but that doesn't matter for one query) and I can search
> a column with tens of millions of rows in around a second.

That sounds very promising, I'd love to get those results... could you tell me what your settings are, how much memory you have and such? Thanks.

Rick
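[Editor's note: shared_buffers are counted in 8 KB pages, so the kernel's SHMMAX must be at least shared_buffers × 8192 bytes, plus some overhead for other shared structures. A quick sketch of the arithmetic:]

```shell
# Shared memory needed for a given shared_buffers setting (8 KB pages).
# Overhead is ignored, so treat the result as a lower bound for
# /proc/sys/kernel/shmmax.
buffers=45000
bytes=$((buffers * 8192))
echo "$buffers buffers need at least $bytes bytes of shared memory"
```

Note that a value written to /proc/sys/kernel/shmmax does not survive a reboot; the usual fix is a kernel.shmmax line in /etc/sysctl.conf.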
Re: [PERFORM] Tsearch2 performance on big database
Oleg Bartunov wrote:
> Mike, no comments before Rick posts tsearch configs and increased
> buffers! Union shouldn't be faster than (term1|term2).
> The tsearch2 internals description might help you understand tsearch2
> limitations. See
> http://www.sai.msu.su/~megera/oddmuse/index.cgi/Tsearch_V2_internals
> Also, don't miss my notes:
> http://www.sai.msu.su/~megera/oddmuse/index.cgi/Tsearch_V2_Notes
>
> Oleg

Thanks Oleg, I've seen those pages before :) I've set shared_buffers to 45000 now (yes, that's probably very much, isn't it?) and it already seems a lot quicker.

How do I find out what my tsearch config is? I followed the intro (http://www.sai.msu.su/~megera/oddmuse/index.cgi/tsearch-v2-intro) and applied it to our books table, that's all; I didn't change anything else about the configs.

On Tue, 22 Mar 2005, Mike Rylander wrote:
> Ahh... I should have qualified my claim. I am creating a google-esque
> search interface and almost every query uses '&' as the term joiner.
> 'AND' queries and one-term queries are orders of magnitude faster than
> 'OR' queries, and fortunately are the expected default for most users.
> (Think, "I typed in these words, therefore I want to match these
> words"...)
>
> An interesting test may be to time multiple queries independently, one
> for each search term, and see if the combined cost is less than a
> single 'OR' search. If so, you could use UNION to join the results.

Well, I just asked my colleagues and OR queries aren't used by us anyway, so I'll test for AND queries instead.

> However, the example you originally gave ('terminology') should be
> very fast. On a comparable query ("select count(value) from
> metabib.full_rec where index_vector @@ to_tsquery('default','jane');")
> I get 12ms.
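[Editor's note: Mike's UNION suggestion, applied to the query from earlier in the thread, would look something like the sketch below. As Oleg points out above, it shouldn't actually beat the planner's own handling of '|'.]

```sql
-- An OR tsquery rewritten as a UNION of two single-term searches,
-- per Mike's suggestion (UNION also deduplicates the result rows).
SELECT recordnumber FROM books
 WHERE idxfti @@ to_tsquery('default', 'buckingham')
UNION
SELECT recordnumber FROM books
 WHERE idxfti @@ to_tsquery('default', 'palace');
```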
ilab=# select count(*) from books where idxfti @@ to_tsquery('default', 'jane');
 count
-------
  4093
(1 row)

Time: 217395.820 ms

:(

ilab=# explain analyze select count(*) from books where idxfti @@ to_tsquery('default', 'jane');
                                                           QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=16591.95..16591.95 rows=1 width=0) (actual time=4634.931..4634.932 rows=1 loops=1)
   ->  Index Scan using idxfti_idx on books  (cost=0.00..16581.69 rows=4102 width=0) (actual time=0.395..4631.454 rows=4093 loops=1)
         Index Cond: (idxfti @@ '\'jane\''::tsquery)
 Total runtime: 4635.023 ms
(4 rows)

Time: 4636.028 ms
ilab=#

> 16G of RAM on a dedicated machine.
>
> shared_buffers = 15000        # min 16, at least max_connections*2, 8KB each
> work_mem = 10240              # min 64, size in KB
> maintenance_work_mem = 100    # min 1024, size in KB
>                               # big m_w_m for loading data...
> random_page_cost = 2.5        # units are one sequential page fetch cost
>                               # fast drives, and tons of RAM

Right... well, I'll try copying these settings, see how that works out, thanks :)

Rick
Re: [PERFORM] Tsearch2 performance on big database
Oleg Bartunov wrote:
> On Tue, 22 Mar 2005, Rick Jansen wrote:
>
> Hmm, default configuration is too eager, you index every lexem using
> the simple dictionary! Probably, it's too much. Here is what I have for
> my russian configuration in the dictionary database:
>
>  default_russian | lword        | {en_ispell,en_stem}
>  default_russian | lpart_hword  | {en_ispell,en_stem}
>  default_russian | lhword       | {en_ispell,en_stem}
>  default_russian | nlword       | {ru_ispell,ru_stem}
>  default_russian | nlpart_hword | {ru_ispell,ru_stem}
>  default_russian | nlhword      | {ru_ispell,ru_stem}
>
> Notice, I index only russian and english words, no numbers, urls, etc.
> You may just delete unwanted rows in pg_ts_cfgmap for your
> configuration, but I'd recommend instead updating them, setting
> dict_name to NULL. For example, to stop indexing integers:
>
> update pg_ts_cfgmap set dict_name=NULL where ts_name='default_russian' and tok_alias='int';
>
> voc=# select token,dict_name,tok_type,tsvector from ts_debug('Do you have +7 bucks');
>  token |      dict_name      | tok_type | tsvector
> -------+---------------------+----------+----------
>  Do    | {en_ispell,en_stem} | lword    |
>  you   | {en_ispell,en_stem} | lword    |
>  have  | {en_ispell,en_stem} | lword    |
>  +7    |                     | int      |
>  bucks | {en_ispell,en_stem} | lword    | 'buck'
>
> Only 'bucks' gets indexed :) Hmm, probably I should add this to the
> documentation.
>
> What about word statistics (# of unique words, for example)?

I'm now following the guide to add the ispell dictionary, and I've updated most of the rows, setting dict_name to NULL:

 ts_name | tok_alias    | dict_name
---------+--------------+-----------
 default | lword        | {en_stem}
 default | nlword       | {simple}
 default | word         | {simple}
 default | part_hword   | {simple}
 default | nlpart_hword | {simple}
 default | lpart_hword  | {en_stem}
 default | hword        | {simple}
 default | lhword       | {en_stem}
 default | nlhword      | {simple}

These are the ones left, but I have no idea what an 'hword' or 'nlhword' or any of the other token types are. Anyway, how do I find out the number of unique words, or other word statistics?
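[Editor's note: applied to the 'default' configuration Rick is using, Oleg's recipe would look something like the sketch below. The exact set of tok_alias values to disable is a judgment call; the names used here are common tsearch2 token types, but check pg_ts_cfgmap for the ones present in your configuration.]

```sql
-- Stop indexing integers, URLs, hosts and email addresses in the
-- 'default' configuration, following Oleg's advice above.
UPDATE pg_ts_cfgmap
   SET dict_name = NULL
 WHERE ts_name = 'default'
   AND tok_alias IN ('int', 'url', 'host', 'email');

-- Verify with ts_debug: tokens with a NULL dict_name are not indexed.
SELECT token, dict_name, tok_type
  FROM ts_debug('See http://www.example.com for 42 details');
```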
Rick
Re: [PERFORM] Tsearch2 performance on big database
Oleg Bartunov wrote:
> from my notes:
> http://www.sai.msu.su/~megera/oddmuse/index.cgi/Tsearch_V2_Notes
>
> It's useful to see word statistics, for example, to check how well your
> dictionaries work or how you configured pg_ts_cfgmap. Also, you may
> notice probable stop words relevant for your collection. Tsearch
> provides the stat() function:
>
> ...
>
> Don't hesitate to read it, and if you find some bugs or know better
> wording I'd be glad to improve my notes.

Thanks, but that stat() query takes way too long... I let it run for like 4 hours and still nothing. The database I am testing tsearch2 on is also the production database (mysql) server, so I have to be careful not to use too many resources :o

Anyway, here's my pg_ts_cfgmap now (well, the relevant bits):

 default_english | lhword      | {en_ispell,en_stem}
 default_english | lpart_hword | {en_ispell,en_stem}
 default_english | lword       | {en_ispell,en_stem}

Is it normal that queries for single words (or perhaps they are words that are common) take a really long time? Like this:

ilab=# explain analyze select count(*) from books where description_fti @@ to_tsquery('default', 'hispanic');
                                                              QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=20369.81..20369.81 rows=1 width=0) (actual time=261512.031..261512.031 rows=1 loops=1)
   ->  Index Scan using idxfti_idx on books  (cost=0.00..20349.70 rows=8041 width=0) (actual time=45777.760..261509.288 rows=674 loops=1)
         Index Cond: (description_fti @@ '\'hispan\''::tsquery)
 Total runtime: 261518.529 ms
(4 rows)

ilab=# explain analyze select titel from books where description_fti @@ to_tsquery('default', 'buckingham & palace');
                                                              QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using idxfti_idx on books  (cost=0.00..20349.70 rows=8041 width=57) (actual time=18992.045..48863.385 rows=185 loops=1)
   Index Cond: (description_fti @@ '\'buckingham\' & \'palac\''::tsquery)
 Total runtime: 48863.874 ms
(3 rows)

I don't know what happened, these queries were a lot faster 2 days ago... what the feck is going on?!
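[Editor's note: the tsearch2 stat() scan is expensive because it reads every tsvector in the table, so it is usually materialized into a table once and queried from there. A sketch, with an illustrative table name:]

```sql
-- Save word statistics once; stat() returns (word, ndoc, nentry).
SELECT * INTO TABLE fti_stat
  FROM stat('SELECT idxfti FROM books');

-- Most frequent lexemes: candidates for stop words.
SELECT word, ndoc, nentry
  FROM fti_stat
 ORDER BY ndoc DESC
 LIMIT 20;
```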
Rick
Re: [PERFORM] Tsearch2 performance on big database
Oleg Bartunov wrote:
> stat() is indeed a big dog; it was designed for developers' needs, so
> we recommend saving its results to a table.
>
>> Anyway, here's my pg_ts_cfgmap now (well, the relevant bits):
>>
>>  default_english | lhword      | {en_ispell,en_stem}
>>  default_english | lpart_hword | {en_ispell,en_stem}
>>  default_english | lword       | {en_ispell,en_stem}
>>
>> Is it normal that queries for single words (or perhaps they are words
>> that are common) take a really long time? Like this:
>
> 'hispanic' isn't common, I see you get only 674 rows, and 'buckingham &
> palace' returns 185 rows. Did you run 'vacuum analyze'? I see a big
> discrepancy between the estimated rows (8041) and the actual rows.

Yes, I did a vacuum analyze right before executing these queries. I'm going to recreate the gist index now, and do a vacuum full analyze after that... see if that makes a difference.

Rick
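[Editor's note: the rebuild Rick describes would go something like the following, with the index definition taken from the \d output earlier in the thread:]

```sql
-- Recreate the GiST index and refresh planner statistics, as described
-- above (VACUUM FULL also compacts the table, but takes an exclusive lock).
DROP INDEX idxfti_idx;
CREATE INDEX idxfti_idx ON books USING gist (idxfti);
VACUUM FULL ANALYZE books;
```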