[PERFORM] Tsearch2 performance on big database

2005-03-22 Thread Rick Jansen
Hi,
I'm looking for a *fast* solution to search through ~4 million records of 
book descriptions. I've installed PostgreSQL 8.0.1 on a dual Opteron 
server with 8G of memory, running Linux 2.6. I haven't done a lot of 
tuning on PostgreSQL itself, but here are the settings I have changed so far:

shared_buffers = 2000 (anything much bigger and the kernel says it isn't 
 allowed; I still have to look into that)
effective_cache_size = 32768

Here's my table:
ilab=# \d books
                                   Table "public.books"
    Column     |          Type          |                        Modifiers
---------------+------------------------+-----------------------------------------------------------
 recordnumber  | integer                | not null default nextval('books_recordnumber_seq'::text)
 membernumber  | integer                | not null default 0
 booknumber    | character varying(20)  | not null default ''::character varying
 author        | character varying(60)  | not null default ''::character varying
 titel         | text                   | not null
 description   | character varying(100) | not null default ''::character varying
 descriprest   | text                   | not null
 price         | bigint                 | not null default 0::bigint
 keywords      | character varying(100) | not null default ''::character varying
 dollarprice   | bigint                 | not null default 0::bigint
 countrynumber | smallint               | not null default 0::smallint
 entrydate     | date                   | not null
 status        | smallint               | not null default 0::smallint
 recordtype    | smallint               | not null default 0::smallint
 bookflags     | smallint               | not null default 0::smallint
 year          | smallint               | not null default 0::smallint
 firstedition  | smallint               | not null default 0::smallint
 dustwrapper   | smallint               | not null default 0::smallint
 signed        | smallint               | not null default 0::smallint
 cover         | smallint               | not null default 0::smallint
 specialfield  | smallint               | not null default 0::smallint
 idxfti        | tsvector               |
Indexes:
    "recordnumber_idx" unique, btree (recordnumber)
    "idxfti_idx" gist (idxfti)

idxfti is a tsvector of concatenated description and descriprest.
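The thread doesn't show how idxfti was built; for readers following along, a typical tsearch2-era setup for such a column would look roughly like this (an illustrative sketch, assuming the contrib/tsearch2 module is installed; the trigger name is hypothetical):

```sql
-- Populate the tsvector column from the two text fields (tsearch2 syntax,
-- PostgreSQL 8.0 era: to_tsvector takes the configuration name first).
UPDATE books
   SET idxfti = to_tsvector('default', description || ' ' || descriprest);

-- Keep it current on INSERT/UPDATE using the trigger function tsearch2 ships.
CREATE TRIGGER books_fti_update
    BEFORE INSERT OR UPDATE ON books
    FOR EACH ROW
    EXECUTE PROCEDURE tsearch2(idxfti, description, descriprest);

CREATE INDEX idxfti_idx ON books USING gist (idxfti);
VACUUM ANALYZE books;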
ilab=# select avg(character_length(description)), avg(character_length(descriprest)) from books;
         avg         |         avg
---------------------+----------------------
 89.1596992873947218 | 133.0468689304200538

Queries take forever to run. Right now we run a MySQL server, on which 
we maintain our own indices (we split the description fields by word and 
keep separate tables for words and the book descriptions they appear in).

For example, a query for the word 'terminology' on our MySQL search 
takes 5.8 seconds and returns 375 results. The same query on PostgreSQL 
using the tsearch2 index takes 30802.105 ms and returns 298 results.

How do I speed this up? Should I change settings, add or change indexes 
or.. what?

Rick Jansen
--
Systems Administrator for Rockingstone IT
http://www.rockingstone.com
http://www.megabooksearch.com - Search many book listing sites at once
---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
   (send "unregister YourEmailAddressHere" to [EMAIL PROTECTED])


Re: [PERFORM] Tsearch2 performance on big database

2005-03-22 Thread Rick Jansen
Mike Rylander wrote:
On Tue, 22 Mar 2005 15:36:11 +0300 (MSK), Oleg Bartunov  wrote:
use something like
echo "15000" > /proc/sys/kernel/shmmax
to increase shared memory. In your case you could dedicate much more
memory.
   Regards,
   Oleg

Thanks, I'll check that out.
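As a back-of-the-envelope check on that kernel setting: shared_buffers are 8 KB pages in 8.0, so the shared memory segment must hold at least shared_buffers * 8192 bytes (the real segment PostgreSQL requests is somewhat larger due to overhead, and the shmmax figure quoted above looks truncated in the archive). A sketch, using Mike's shared_buffers value of 15000 as an example:

```shell
# Rough lower bound for kernel.shmmax, given a shared_buffers target.
# Illustrative only: the actual segment is a bit larger than this.
SHARED_BUFFERS=15000                  # pages of 8 KB each
echo $((SHARED_BUFFERS * 8192))       # prints 122880000
# Then, as root (and persist it via /etc/sysctl.conf):
#   sysctl -w kernel.shmmax=134217728
```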
And Oleg should know.  Unless I'm mistaken, he (co)wrote tsearch2. 
Other than shared buffers, I can't imagine what could be causing that
kind of slowness.  EXPLAIN ANALYZE, please?

ilab=# explain analyze select count(titel) from books where idxfti @@ to_tsquery('default', 'buckingham | palace');
                                                              QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=35547.99..35547.99 rows=1 width=56) (actual time=125968.119..125968.120 rows=1 loops=1)
   ->  Index Scan using idxfti_idx on books  (cost=0.00..35525.81 rows=8869 width=56) (actual time=0.394..125958.245 rows=3080 loops=1)
         Index Cond: (idxfti @@ '\'buckingham\' | \'palac\''::tsquery)
 Total runtime: 125968.212 ms
(4 rows)

Time: 125969.264 ms
ilab=#
> As an example of what I think you *should* be seeing, I have a similar
> box (4 procs, but that doesn't matter for one query) and I can search
> a column with tens of millions of rows in around a second.
>
That sounds very promising, I'd love to get those results.. could you 
tell me what your settings are, how much memory you have, and so on? Thanks.

Rick
--
Systems Administrator for Rockingstone IT
http://www.rockingstone.com
http://www.megabooksearch.com - Search many book listing sites at once


Re: [PERFORM] Tsearch2 performance on big database

2005-03-22 Thread Rick Jansen
Oleg Bartunov wrote:
Mike,
no comments until Rick posts his tsearch configs and increases his buffers!
A UNION shouldn't be faster than (term1|term2).
The tsearch2 internals description might help you understand tsearch2's 
limitations.
See  http://www.sai.msu.su/~megera/oddmuse/index.cgi/Tsearch_V2_internals
Also, don't miss my notes:
http://www.sai.msu.su/~megera/oddmuse/index.cgi/Tsearch_V2_Notes

Oleg
Thanks Oleg, I've seen those pages before :) I've set shared_buffers to 
45000 now (yes, that's probably very much, isn't it?) and it already seems 
a lot quicker.

How do I find out what my tsearch config is? I followed the intro 
(http://www.sai.msu.su/~megera/oddmuse/index.cgi/tsearch-v2-intro) and 
applied it to our books table; that's all, I didn't change anything else 
about the configs.


On Tue, 22 Mar 2005, Mike Rylander wrote:
Ahh... I should have qualified my claim.  I am creating a google-esqe
search interface and almost every query uses '&' as the term joiner.
'AND' queries and one-term queries are orders of magnitude faster than
'OR' queries, and fortunately are the expected default for most users.
(Think, "I typed in these words, therefore I want to match these
words"...)  An interesting test may be to time multiple queries
independently, one for each search term, and see if the combined cost
is less than a single 'OR' search.  If so, you could use UNION to join
the results.
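The UNION rewrite Mike suggests might be sketched like this (a hypothetical test, not something from the thread; the subquery alias is required in 8.0):

```sql
-- The slow OR form:
SELECT count(*) FROM books
 WHERE idxfti @@ to_tsquery('default', 'buckingham | palace');

-- Equivalent rewrite to time against it: one index probe per term,
-- with UNION deduplicating rows that match both terms.
SELECT count(*) FROM (
    SELECT recordnumber FROM books
     WHERE idxfti @@ to_tsquery('default', 'buckingham')
    UNION
    SELECT recordnumber FROM books
     WHERE idxfti @@ to_tsquery('default', 'palace')
) AS u;
```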
Well, I just asked my colleagues, and OR queries aren't used by us anyway, 
so I'll test AND queries instead.

However, the example you originally gave ('terminology') should be
very fast.  On a comparable query ("select count(value) from
metabib.full_rec where index_vector @@ to_tsquery('default','jane');")
I get 12ms.
ilab=# select count(*) from books where idxfti @@ to_tsquery('default', 'jane');
 count
-------
  4093
(1 row)

Time: 217395.820 ms

:(
ilab=# explain analyze select count(*) from books where idxfti @@ to_tsquery('default', 'jane');
                                                          QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=16591.95..16591.95 rows=1 width=0) (actual time=4634.931..4634.932 rows=1 loops=1)
   ->  Index Scan using idxfti_idx on books  (cost=0.00..16581.69 rows=4102 width=0) (actual time=0.395..4631.454 rows=4093 loops=1)
         Index Cond: (idxfti @@ '\'jane\''::tsquery)
 Total runtime: 4635.023 ms
(4 rows)

Time: 4636.028 ms
ilab=#
16G of RAM on a dedicated machine.
shared_buffers = 15000          # min 16, at least max_connections*2, 8KB each
work_mem = 10240                # min 64, size in KB
maintenance_work_mem = 100      # min 1024, size in KB
                                # big m_w_m for loading data...

random_page_cost = 2.5          # units are one sequential page fetch cost
                                # fast drives, and tons of RAM

Right.. well I'll try copying these settings, see how that works out, 
thanks :)

Rick
--
Systems Administrator for Rockingstone IT
http://www.rockingstone.com
http://www.megabooksearch.com - Search many book listing sites at once


Re: [PERFORM] Tsearch2 performance on big database

2005-03-23 Thread Rick Jansen
Oleg Bartunov wrote:
On Tue, 22 Mar 2005, Rick Jansen wrote:
Hmm, the default configuration is too eager: you index every lexeme using 
the simple dictionary! That's probably too much. Here is what I have for my 
Russian configuration in the dictionary database:

 default_russian | lword| {en_ispell,en_stem}
 default_russian | lpart_hword  | {en_ispell,en_stem}
 default_russian | lhword   | {en_ispell,en_stem}
 default_russian | nlword   | {ru_ispell,ru_stem}
 default_russian | nlpart_hword | {ru_ispell,ru_stem}
 default_russian | nlhword  | {ru_ispell,ru_stem}
Notice that I index only Russian and English words: no numbers, URLs, etc.
You may just delete the unwanted rows in pg_ts_cfgmap for your configuration,
but I'd recommend updating them instead, setting dict_name to NULL.
For example, to stop indexing integers:
update pg_ts_cfgmap set dict_name=NULL where ts_name='default_russian' 
and tok_alias='int';

voc=# select token, dict_name, tok_type, tsvector from ts_debug('Do you have +7 bucks');
 token |      dict_name      | tok_type | tsvector
-------+---------------------+----------+----------
 Do    | {en_ispell,en_stem} | lword    |
 you   | {en_ispell,en_stem} | lword    |
 have  | {en_ispell,en_stem} | lword    |
 +7    |                     | int      |
 bucks | {en_ispell,en_stem} | lword    | 'buck'

Only 'bucks' gets indexed :)
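Generalizing Oleg's single-row update, a hypothetical batch variant for the 'default' configuration might be (token aliases vary by tsearch2 version; tsearch2's token_type() function lists the ones your build knows):

```sql
-- Disable several noise token types in one statement. The list of
-- aliases here is illustrative; verify with: select * from token_type();
UPDATE pg_ts_cfgmap
   SET dict_name = NULL
 WHERE ts_name = 'default'
   AND tok_alias IN ('int', 'uint', 'float', 'uri', 'url', 'host', 'email');
```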
Hmm, probably I should add this to the documentation.
What about word statistics (the number of unique words, for example)?
I'm now following the guide to add the ispell dictionary and I've 
updated most of the rows setting dict_name to NULL:

 ts_name |  tok_alias   | dict_name
---------+--------------+-----------
 default | lword        | {en_stem}
 default | nlword       | {simple}
 default | word         | {simple}
 default | part_hword   | {simple}
 default | nlpart_hword | {simple}
 default | lpart_hword  | {en_stem}
 default | hword        | {simple}
 default | lhword       | {en_stem}
 default | nlhword      | {simple}
These are left, but I have no idea what 'hword', 'nlhword', or any of the 
other token types mean.

Anyway, how do I find out the number of unique words or other word 
statistics?
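A sketch of how those statistics could be pulled with tsearch2's stat() function (it scans every tsvector the given query returns, so expect it to be slow on a 4M-row table):

```sql
-- Number of unique lexemes in the index.
SELECT count(*) AS unique_words
  FROM stat('SELECT idxfti FROM books');

-- Most frequent lexemes: candidates for stop words.
SELECT word, ndoc, nentry
  FROM stat('SELECT idxfti FROM books')
 ORDER BY ndoc DESC
 LIMIT 10;
```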

Rick
--
Systems Administrator for Rockingstone IT
http://www.rockingstone.com
http://www.megabooksearch.com - Search many book listing sites at once


Re: [PERFORM] Tsearch2 performance on big database

2005-03-24 Thread Rick Jansen
Oleg Bartunov wrote:
> from my notes
> http://www.sai.msu.su/~megera/oddmuse/index.cgi/Tsearch_V2_Notes
>
> It's useful to see word statistics, for example, to check how well
> your dictionaries work or how you configured pg_ts_cfgmap. Also, you
> may notice probable stop words relevant for your collection. Tsearch
> provides the stat() function:
>
> ...
>
> Don't hesitate to read it, and if you find some bugs or know better wording
> I'd be glad to improve my notes.
>
>

Thanks, but that stat() query takes way too long.. I let it run for about
4 hours and still nothing. The database I am testing tsearch2 on is also
the production database (MySQL) server, so I have to be careful not to
use too many resources :o
Anyway, here's my pg_ts_cfgmap now (well the relevant bits):
default_english | lhword   | {en_ispell,en_stem}
default_english | lpart_hword  | {en_ispell,en_stem}
default_english | lword| {en_ispell,en_stem}
Is it normal that queries for single words (or perhaps they are words
that are common) take a really long time? Like this:
ilab=# explain analyze select count(*) from books where description_fti @@ to_tsquery('default', 'hispanic');
                                                            QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=20369.81..20369.81 rows=1 width=0) (actual time=261512.031..261512.031 rows=1 loops=1)
   ->  Index Scan using idxfti_idx on books  (cost=0.00..20349.70 rows=8041 width=0) (actual time=45777.760..261509.288 rows=674 loops=1)
         Index Cond: (description_fti @@ '\'hispan\''::tsquery)
 Total runtime: 261518.529 ms
(4 rows)

ilab=# explain analyze select titel from books where description_fti @@ to_tsquery('default', 'buckingham & palace');
                                                         QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
 Index Scan using idxfti_idx on books  (cost=0.00..20349.70 rows=8041 width=57) (actual time=18992.045..48863.385 rows=185 loops=1)
   Index Cond: (description_fti @@ '\'buckingham\' & \'palac\''::tsquery)
 Total runtime: 48863.874 ms
(3 rows)

I don't know what happened; these queries were a lot faster two days 
ago.. what the feck is going on?!

Rick
--
Systems Administrator for Rockingstone IT
http://www.rockingstone.com
http://www.megabooksearch.com - Search many book listing sites at once


Re: [PERFORM] Tsearch2 performance on big database

2005-03-24 Thread Rick Jansen
Oleg Bartunov wrote:
stat() is indeed a big dog; it was designed for developers' needs,
so we recommend saving its results in a table.
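Concretely, that suggestion might look like this (the table name is illustrative, not from the thread):

```sql
-- Materialize the expensive stat() scan once...
CREATE TABLE books_stat AS
    SELECT * FROM stat('SELECT idxfti FROM books');

-- ...then explore the word statistics cheaply.
SELECT count(*) FROM books_stat;
SELECT word, ndoc FROM books_stat ORDER BY ndoc DESC LIMIT 20;
```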
Anyway, here's my pg_ts_cfgmap now (well the relevant bits):
default_english | lhword   | {en_ispell,en_stem}
default_english | lpart_hword  | {en_ispell,en_stem}
default_english | lword| {en_ispell,en_stem}
Is it normal that queries for single words (or perhaps they are words
that are common) take a really long time? Like this:
'hispanic' isn't common; I see you get only 674 rows, and 'buckingham & 
palace' returns 185 rows. Did you run 'vacuum analyze'?
I see a big discrepancy between estimated rows (8041) and actual rows.


Yes, I did a vacuum analyze right before executing these queries.
I'm going to recreate the gist index now, and do a vacuum full analyze 
after that.. see if that makes a difference.
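For completeness, the rebuild described above might look like this (object names taken from earlier in the thread; in 8.0 a plain REINDEX would also work):

```sql
DROP INDEX idxfti_idx;
CREATE INDEX idxfti_idx ON books USING gist (idxfti);
VACUUM FULL ANALYZE books;
```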

Rick
--
Systems Administrator for Rockingstone IT
http://www.rockingstone.com
http://www.megabooksearch.com - Search many book listing sites at once