Does it run any differently if you split out the tag?

select count(*) from search_rm
where to_tsvector('english', textsearch) @@ to_tsquery('english', 'wommmman & batt')
  and to_tsvector('english', textsearch) @@ 'ftx1'::tsquery;
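A quick way to time the two shapes side by side, if it helps -- just a rough sketch, assuming the existing full-text index on to_tsvector('english', textsearch) and \timing turned on in psql, and reusing the terms above:

\timing on

-- Combined form, with the tag inside the tsquery:
explain analyze
select count(*) from search_rm
where to_tsvector('english', textsearch)
      @@ to_tsquery('english', 'wommmman & batt & ftx1');

-- Split form, with the tag as a separate condition:
explain analyze
select count(*) from search_rm
where to_tsvector('english', textsearch)
      @@ to_tsquery('english', 'wommmman & batt')
  and to_tsvector('english', textsearch) @@ 'ftx1'::tsquery;

The explain analyze output should also show whether both conditions are used as index conditions or whether one of them ends up as a recheck/filter step.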
Steve

On 16 July 2010 05:22, Howard Rogers <h...@diznix.com> wrote:
> On Fri, Jul 16, 2010 at 10:23 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> > Howard Rogers <h...@diznix.com> writes:
> >> I have 10 million rows in a table, with full text index created on one
> >> of the columns. I submit this query:
> >
> >> ims=# select count(*) from search_rm
> >> ims-# where to_tsvector('english', textsearch)
> >> ims-# @@ to_tsquery('english', 'woman & beach & ball');
> >>  count
> >> -------
> >>    646
> >> (1 row)
> >> Time: 107.570 ms
> >
> >> ...and those are excellent times. But if I alter the query to read:
> >
> >> ims=# select count(*) from search_rm
> >> where to_tsvector('english', textsearch)
> >> @@ to_tsquery('english', 'woman & beach & ftx1');
> >>  count
> >> -------
> >>  38343
> >> (1 row)
> >> Time: 640.985 ms
> >
> >> ...then, as you see, it slows the query down by a factor of about 6,
> >
> > ... um, but it increased the number of matching rows by a factor of
> > almost 60. I think your complaint of poor scaling is misplaced.
> >
> >> which is not so good! The problem is that we need to be able to search
> >> for "ftx1", since that's a flag we put in our document records to tell
> >> us the file type, and we need to be able to retrieve different file
> >> types at different times.
> >
> > You might want to rethink how you're doing that --- it seems like a file
> > type flag ought to be a separate column rather than a word in a text
> > field.
> >
> >                        regards, tom lane
>
> OK, Tom: I did actually account for the number of rows difference
> before I posted, though I accept I didn't show you that. So here goes:
>
> ims=# select count(*)
> ims-# from search_rm
> ims-# where to_tsvector('english', textsearch) @@
> to_tsquery('english', 'wommmman & batt & ftxa')
> ims-# limit 20;
>  count
> -------
>      0
> (1 row)
>
> Time: 0.593 ms
>
> ims=# select count(*)
> from search_rm
> where to_tsvector('english', textsearch) @@ to_tsquery('english',
> 'wommmman & batt & ftx1')
> limit 20;
>  count
> -------
>      0
> (1 row)
>
> Time: 489.362 ms
>
> Both queries return zero rows. One takes an awful lot longer than the
> other. The only difference between them is that one searches for
> 'ftx1' and the other searches for 'ftxa'. My complaint of poor
> scalability (actually, it was an enquiry about the role of dictionary
> types!) is valid, I think. As a PostgreSQL newbie, I'm happy to accept
> that I've done something plonkingly stupid to account for these
> results, but I'd then like to know what it is I've done wrong! A
> simple scale-up of the number of hits isn't, however, the problem, I
> don't think.
>
> With this amount of data, and with 45 different document attributes
> that may or may not be searched for, some of them involving names and
> places and dates, some just yes/no flags, it is utterly impossible to
> have them as separate attribute columns and search on them with
> anything like decent performance. We adopted this approach with Oracle
> Text two years ago precisely because it was the only way to keep
> web-based searches of 10,000,000 records coming back in less than a
> second. So, no, we're not going to re-think the storage of 'attribute
> data' as part of the searchable keyword field, though I'm more than
> prepared to alter the precise format of that data if it helps
> PostgreSQL any.
> That said, however, we have people supplying us with document
> references in the form DA3-76374YY-001, so alpha-numerics simply have
> to be searchable with good speed, and I can't always magic-away the
> alpha-numeric components, even if I wanted to.
>
> So, I would still like to know if this performance difference when
> encountering alpha-numeric "words" is dictionary-related, and if so
> what I can do to fix that, please.
>
> Cheers,
> HJR
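For what it's worth, one way to see whether the dictionary setup is what makes the alpha-numeric tokens behave differently is to run them through ts_debug and ts_lexize -- just a sketch, using the built-in 'english' configuration and the default dictionaries:

-- Show how the 'english' configuration tokenises and normalises each term:
select alias, token, dictionaries, lexemes
from ts_debug('english', 'ftxa ftx1 DA3-76374YY-001');

-- Compare what individual dictionaries do with an alpha-numeric token:
select ts_lexize('english_stem', 'ftx1') as english_stem,
       ts_lexize('simple', 'ftx1') as simple;

If those show 'ftx1' and the document reference going through a different token class or dictionary than a plain word like 'ftxa', that would point at the parser/dictionary path rather than at the index itself.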