Re: [PERFORM] tsearch2 headline and postgresql.conf

2006-01-22 Thread Oleg Bartunov

You didn't provide us any query with EXPLAIN ANALYZE.
Just to make sure you're fine.

Oleg
On Sun, 22 Jan 2006, [EMAIL PROTECTED] wrote:


Hi folks,

I'm not sure if this is the right place for this but thought I'd ask.  I'm 
relatively new to Postgres, having only used it on 3 projects, and am just 
delving into the setup and admin for the second time.


I decided to try tsearch2 for this project's search requirements but am 
having trouble attaining adequate performance.  I think I've nailed it down 
to trouble with the headline() function in tsearch2. 
In short, there is a crawler that grabs HTML docs and places them in a 
database.  The search is done using tsearch2 pretty much installed according 
to instructions.  I have read a couple online guides suggested by this list 
for tuning the postgresql.conf file.  I only made modest adjustments because 
I'm not working with top-end hardware and am still uncertain of the actual 
impact of the different parameters.


I've been learning 'explain' and over the course of reading I have done 
enough query tweaking to discover the source of my headache seems to be 
headline().


On a query of 429 documents, of which the avg size of the stripped down 
document as stored is 21KB, and the max is 518KB (an anomaly), tsearch2 
performs exceptionally well returning most queries in about 100ms.


On the other hand, following the tsearch2 guide, which suggests returning that 
first portion as a subquery and then generating the headline() from those 
results, I see the query time increase to 4 seconds!


This seems to be directly related to document size.  If I filter out that 
518KB doc along with some 100KB docs by returning "substring( stripped_text 
FROM 0 FOR 5) AS stripped_text" I decrease the time to 1.4 seconds, but 
increase the risk of not getting a headline.


Seeing as how this problem is directly tied to document size, I'm wondering 
if there are any specific settings in postgresql.conf that may help, or is 
this just a fact of life for the headline() function?  Or, does anyone know 
what the problem is and how to overcome it?
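The truncation workaround described above can be sketched roughly like this — feeding headline() only a bounded prefix of each document so it has less text to parse.  The character cap shown is an illustrative assumption, and the column and table names follow the query shapes discussed later in this thread:

```sql
-- Sketch: cap how much text headline() must parse per document.
-- The "FOR 5" cap is an assumed placeholder; tune it so snippets
-- still appear for matches deep in long documents.
SELECT url,
       headline(substring(stripped_text FROM 1 FOR 5), q,
                'MaxWords=75, MinWords=25') AS snippet
FROM web_page, to_tsquery('big&search') AS q
WHERE index_text @@ q;
```

The trade-off is the one noted above: if the match lies past the cap, headline() has nothing to highlight.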


---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

 http://www.postgresql.org/docs/faq



Regards,
Oleg
_
Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru),
Sternberg Astronomical Institute, Moscow University, Russia
Internet: oleg@sai.msu.su, http://www.sai.msu.su/~megera/
phone: +007(495)939-16-83, +007(495)939-23-83



Re: [PERFORM] Suspending SELECTs

2006-01-22 Thread August Zajonc
Alessandro Baretta wrote:
> 
> What I could do relatively easily is instantiate a thread to iteratively
> scan a traditional cursor N rows at a time, retrieving only record keys,
> and finally send them to the query-cache-manager. The application thread
> would then scan through the cursor results by fetching the rows
> associated to a given "page" of keys. I would have to keep the full
> cursor keyset in the application server's session state, but, hopefully,
> this is not nearly as bad as storing the entire recordset.
> 
> Alex
> 
> 
> 

Alessandro,

I've very much enjoyed reading your thoughts, the problem you're facing,
and everyone's responses.

Since you control the middle layer, could you not use a cookie to keep a
cursor open on the middle layer and tie into it on subsequent queries?

If you are concerned with too many connections open, you could timeout
the sessions quickly and recreate the cursor if someone came back. If
they waited 5 minutes to make the next query, certainly they could wait
a few extra seconds to offset and reacquire a cursor?
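The timeout-and-reacquire idea above could look roughly like this on the middle layer.  The cursor name, page size, and ordering are hypothetical, sketched against the table names used elsewhere in this digest:

```sql
-- Re-create the cursor after the session timed out, then skip back
-- to the page the user was viewing.  Names and sizes are hypothetical.
BEGIN;
DECLARE search_cur SCROLL CURSOR FOR
    SELECT url_id, url, title
    FROM web_page
    ORDER BY date_last_changed DESC;
MOVE FORWARD 40 IN search_cur;   -- e.g. user was on page 5, 10 rows/page
FETCH 10 FROM search_cur;
COMMIT;                          -- a non-holdable cursor closes here
```

The MOVE replays the offset cheaply, which is the "few extra seconds" cost mentioned above.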

The hitlist idea was also a nice one if the size of the data returned is
not overwhelming and does not need to track the underlying db at all
(i.e., no refresh).

Mark had a number of good general suggestions though, and I'd like to
echo the materialized views as an option that I could see a lot of uses
for (and have worked around in the past with SELECT INTOs and the like).
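A minimal sketch of that SELECT INTO workaround as a poor man's materialized view — the temp-table name and the drop-and-rebuild refresh policy are assumptions:

```sql
-- Materialize the hitlist once, then page cheaply from the snapshot.
-- Table name and refresh strategy are assumptions for illustration.
SELECT url_id, url, title, rank(index_text, q) AS rank
INTO TEMP hitlist
FROM web_page, to_tsquery('big&search') AS q
WHERE index_text @@ q;

-- Subsequent pages read the snapshot, not the live tables:
SELECT * FROM hitlist ORDER BY rank DESC LIMIT 10 OFFSET 20;
```

This fits the "no refresh" caveat above: the snapshot does not track later changes to the underlying tables.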

Interesting stuff.

- August



Re: [PERFORM] tsearch2 headline and postgresql.conf

2006-01-22 Thread pgsql-performance

Oleg Bartunov wrote:


You didn't provide us any query with EXPLAIN ANALYZE.
Just to make sure you're fine.

Oleg
Hi Oleg,

Thanks for taking time to look at this.  Pardon my omission, I was 
writing that email rather late at night.


The following results from EXPLAIN ANALYZE are from my
development machine, which is a dual PIII 600MHz running Debian
Linux and Postgres 8.1.2 with 512 MB RAM.  The production machine
yields similar results, but it is a virtual server so the
resources are rather unpredictable.  It is a quad processor and
has a larger result set in its DB.


The original query is:
explain analyze
SELECT url, title,
       headline(stripped_text, q,
                'MaxWords=75, MinWords=25, StartSel=!!!REPLACE_ME!!!,StopSel=!!!/REPLACE_ME!!!'),
       rank,
       to_char(timezone('CST', date_last_changed), 'DD Mon ') AS date_last_changed
FROM
( SELECT url_id, url, title, stripped_text, date_last_changed, q,
         rank(index_text, q) AS rank
  FROM (web_page w LEFT JOIN url u USING (url_id)),
       to_tsquery('big&search') AS q
  WHERE (index_text <> '') AND (index_text @@ q) AND (w.url_id NOT IN (1,2))
    AND (url NOT LIKE '%badurl.com%')
  ORDER BY rank DESC, date_last_changed DESC
  LIMIT 10 OFFSET 0
) AS useless;


...and the resultant output of EXPLAIN ANALYZE is:

Subquery Scan useless  (cost=8.02..8.04 rows=1 width=624) (actual time=769.131..2769.320 rows=10 loops=1)
  ->  Limit  (cost=8.02..8.02 rows=1 width=282) (actual time=566.798..566.932 rows=10 loops=1)
        ->  Sort  (cost=8.02..8.02 rows=1 width=282) (actual time=566.792..566.870 rows=10 loops=1)
              Sort Key: rank(w.index_text, q.q), w.date_last_changed
              ->  Nested Loop  (cost=2.00..8.01 rows=1 width=282) (actual time=4.068..563.128 rows=178 loops=1)
                    ->  Nested Loop  (cost=2.00..4.96 rows=1 width=221) (actual time=3.179..388.610 rows=179 loops=1)
                          ->  Function Scan on q  (cost=0.00..0.01 rows=1 width=32) (actual time=0.025..0.028 rows=1 loops=1)
                          ->  Bitmap Heap Scan on web_page w  (cost=2.00..4.94 rows=1 width=189) (actual time=3.123..387.547 rows=179 loops=1)
                                Filter: ((w.index_text <> ''::tsvector) AND (w.url_id <> 1) AND (w.url_id <> 2) AND (w.index_text @@ "outer".q))
                                ->  Bitmap Index Scan on idx_index_text  (cost=0.00..2.00 rows=1 width=0) (actual time=1.173..1.173 rows=277 loops=1)
                                      Index Cond: (w.index_text @@ "outer".q)
                    ->  Index Scan using pk_url on url u  (cost=0.00..3.03 rows=1 width=65) (actual time=0.044..0.049 rows=1 loops=179)
                          Index Cond: ("outer".url_id = u.url_id)
                          Filter: (url !~~ '%badurl.com%'::text)
Total runtime: 2771.023 ms
(15 rows)
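Since the inner query alone finishes in well under a second in that plan, one way to confirm that headline() parsing dominates is to bound its input in the outer query, mirroring the substring trick described earlier in the thread.  The character cap here is an illustrative assumption:

```sql
-- Variant of the query above: headline() sees only a bounded prefix
-- of each of the 10 rows the subquery returns.  The "FOR 5" cap is
-- an assumed placeholder, not a recommended value.
explain analyze
SELECT url, title,
       headline(substring(stripped_text FROM 1 FOR 5), q,
                'MaxWords=75, MinWords=25'),
       rank
FROM
( SELECT url, title, stripped_text, q,
         rank(index_text, q) AS rank
  FROM (web_page w LEFT JOIN url u USING (url_id)),
       to_tsquery('big&search') AS q
  WHERE (index_text <> '') AND (index_text @@ q)
  ORDER BY rank DESC
  LIMIT 10
) AS useless;
```

If the total runtime drops roughly in proportion to the cap, the cost is headline() parsing the large documents rather than the search itself.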