I have attached a patch that emits the parts of host, url, email and file tokens as additional tokens. It also makes sure that a host/url/email/file token and its first part-token get the same position in the tsvector.

The two major changes are:

1. Tokenization changes: The patch uses the special handlers in the text parser to reset the parser position to the start of a host/url/email/file token when it finds one. Special handlers were already used for extracting the host and urlpath from a full url, so this is an extension of the same idea.

2. Position changes: We do not advance the position when we encounter a host/url/email/file token. As a result, the first part of that token aligns with the token itself.

Attachments:

tokens_output.txt: sample queries and results with the patch
token_v1.patch: patch wrt cvs head

Currently, the patch outputs the parts of the tokens as normal tokens like WORD, NUMWORD etc. Tom argued earlier that this will break backward compatibility and that the parts should instead be output as parts of the respective tokens. If there is agreement with Tom's suggestion, the current patch can be modified to output subtokens as parts. However, before I complicate the patch with that, I wanted to get feedback on any other major problems with the patch.

-Sushant.

On Mon, 2010-08-02 at 10:20 -0400, Tom Lane wrote:
> Sushant Sinha <sushant...@gmail.com> writes:
> >> This would needlessly increase the number of tokens. Instead you'd
> >> better make it work like compound word support, having just "wikipedia"
> >> and "org" as tokens.
>
> > The current text parser already returns url and url_path. That already
> > increases the number of unique tokens. I am only asking for adding of
> > normal english words as well so that if someone types only "wikipedia"
> > he gets a match.
>
> The suggestion to make it work like compound words is still a good one,
> ie given wikipedia.org you'd get back
>
>       host            wikipedia.org
>       host-part       wikipedia
>       host-part       org
>
> not just the "host" token as at present.
>
> Then the user could decide whether he needed to index hostname
> components or not, by choosing whether to forward hostname-part
> tokens to a dictionary or just discard them.
>
> If you submit a patch that tries to force the issue by classifying
> hostname parts as plain words, it'll probably get rejected out of
> hand on backwards-compatibility grounds.
>
>                       regards, tom lane
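
To make the trade-off in the quoted suggestion concrete, here is a minimal standalone sketch of how a per-type mapping could let a user either index or discard hostname parts. It is not PostgreSQL code: the `TokenMapping` struct and both functions are invented for illustration, and `host_part` is a hypothetical alias standing in for the proposed "host-part" type, which does not exist in the stock parser.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * One row of a hypothetical per-configuration mapping: token type alias
 * -> dictionary name.  "host_part" is NOT an existing token type; it
 * stands in for the "host-part" type proposed in the quoted reply.
 */
typedef struct
{
	const char *alias;
	const char *dictionary;
} TokenMapping;

/*
 * Return the dictionary mapped to alias, or NULL if the type is unmapped.
 * In this sketch, tokens of unmapped types are discarded, not indexed.
 */
const char *
lookup_dictionary(const TokenMapping *map, size_t nmap, const char *alias)
{
	size_t		i;

	assert(map != NULL && alias != NULL);
	for (i = 0; i < nmap; i++)
		if (strcmp(map[i].alias, alias) == 0)
			return map[i].dictionary;
	return NULL;
}

/* Count how many tokens from the parser output would end up indexed. */
size_t
count_indexed(const TokenMapping *map, size_t nmap,
			  const char *const *aliases, size_t ntokens)
{
	size_t		i;
	size_t		n = 0;

	for (i = 0; i < ntokens; i++)
		if (lookup_dictionary(map, nmap, aliases[i]) != NULL)
			n++;
	return n;
}
```

With a mapping that contains only `host`, the three tokens produced for wikipedia.org (one `host`, two `host_part`) yield a single indexed entry, which is the existing behaviour; adding a `host_part` row opts in to indexing the components.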

1. FILEPATH

testdb=# SELECT ts_debug('/stuff/index.html');
                                     ts_debug
----------------------------------------------------------------------------------
 (file,"File or path name",/stuff/index.html,{simple},simple,{/stuff/index.html})
 (blank,"Space symbols",/,{},,)
 (asciiword,"Word, all ASCII",stuff,{english_stem},english_stem,{stuff})
 (blank,"Space symbols",/,{},,)
 (asciiword,"Word, all ASCII",index,{english_stem},english_stem,{index})
 (blank,"Space symbols",.,{},,)
 (asciiword,"Word, all ASCII",html,{english_stem},english_stem,{html})

SELECT to_tsvector('english', '/stuff/index.html');
                    to_tsvector
----------------------------------------------------
 '/stuff/index.html':0 'html':2 'index':1 'stuff':0
(1 row)

2. URL

testdb=# SELECT ts_debug('http://example.com/stuff/index.html');
                                       ts_debug
---------------------------------------------------------------------------------------
 (protocol,"Protocol head",http://,{},,)
 (url,URL,example.com/stuff/index.html,{simple},simple,{example.com/stuff/index.html})
 (host,Host,example.com,{simple},simple,{example.com})
 (asciiword,"Word, all ASCII",example,{english_stem},english_stem,{exampl})
 (blank,"Space symbols",.,{},,)
 (asciiword,"Word, all ASCII",com,{english_stem},english_stem,{com})
 (url_path,"URL path",/stuff/index.html,{simple},simple,{/stuff/index.html})
 (blank,"Space symbols",/,{},,)
 (asciiword,"Word, all ASCII",stuff,{english_stem},english_stem,{stuff})
 (blank,"Space symbols",/,{},,)
 (asciiword,"Word, all ASCII",index,{english_stem},english_stem,{index})
 (blank,"Space symbols",.,{},,)
 (asciiword,"Word, all ASCII",html,{english_stem},english_stem,{html})
(13 rows)

testdb=# SELECT to_tsvector('english', 'http://example.com/stuff/index.html');
                                                      to_tsvector
------------------------------------------------------------------------------------------------------------------------
 '/stuff/index.html':2 'com':1 'exampl':0 'example.com':0 'example.com/stuff/index.html':0 'html':4 'index':3 'stuff':2

3. EMAIL

testdb=# SELECT ts_debug('sush...@foo.bar');
                                  ts_debug
-----------------------------------------------------------------------------
 (email,"Email address",sush...@foo.bar,{simple},simple,{sush...@foo.bar})
 (asciiword,"Word, all ASCII",sushant,{english_stem},english_stem,{sushant})
 (blank,"Space symbols",@,{},,)
 (asciiword,"Word, all ASCII",foo,{english_stem},english_stem,{foo})
 (blank,"Space symbols",.,{},,)
 (asciiword,"Word, all ASCII",bar,{english_stem},english_stem,{bar})

testdb=# SELECT to_tsvector('english', 'sush...@foo.bar');
                   to_tsvector
-------------------------------------------------
 'bar':2 'foo':1 'sushant':0 'sush...@foo.bar':0

4. HOST

testdb=# SELECT ts_debug('foo.bar.com');
                              ts_debug
---------------------------------------------------------------------
 (host,Host,foo.bar.com,{simple},simple,{foo.bar.com})
 (asciiword,"Word, all ASCII",foo,{english_stem},english_stem,{foo})
 (blank,"Space symbols",.,{},,)
 (asciiword,"Word, all ASCII",bar,{english_stem},english_stem,{bar})
 (blank,"Space symbols",.,{},,)
 (asciiword,"Word, all ASCII",com,{english_stem},english_stem,{com})

testdb=# SELECT to_tsvector('english', 'foo.bar.com');
               to_tsvector
-----------------------------------------
 'bar':1 'com':2 'foo':0 'foo.bar.com':0

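
The positions in the sample results follow from the "do not advance on a complex token" rule. The sketch below is a simplified standalone model of that rule, not the code in ts_parse.c: the function names are made up, the type numbers are the ones used by the patch's COMPLEX_TOKEN macro and the regression output, and it assumes that blank (12) and protocol (14) tokens produce no lexemes, as in the ts_debug output for the 'english' configuration.

```c
#include <assert.h>
#include <stddef.h>

/* Token types assumed to yield no lexemes in this sketch. */
#define TOK_BLANK		12
#define TOK_PROTOCOL	14

/* Same set of type numbers as the patch's COMPLEX_TOKEN macro. */
int
is_complex_token(int type)
{
	return type == 4 || type == 5 || type == 6 ||
		type == 17 || type == 18 || type == 19;
}

/*
 * Fill pos_out[i] with the tsvector position of token i, or -1 if the
 * token is not indexed.  Positions start at 0 and advance only after a
 * non-complex token, so a complex token shares its position with the
 * first part that follows it.
 */
void
assign_positions(const int *types, size_t n, int *pos_out)
{
	size_t		i;
	int			pos = 0;

	assert(types != NULL && pos_out != NULL);
	for (i = 0; i < n; i++)
	{
		if (types[i] == TOK_BLANK || types[i] == TOK_PROTOCOL)
		{
			pos_out[i] = -1;
			continue;
		}
		pos_out[i] = pos;
		if (!is_complex_token(types[i]))
			pos++;
	}
}
```

Feeding it the 13 token types from the URL sample (14, 5, 6, 1, 12, 1, 18, 12, 1, 12, 1, 12, 1) gives url, host and 'exampl' position 0, 'com' 1, url_path and 'stuff' 2, 'index' 3 and 'html' 4, which matches the to_tsvector output in sample 2.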

Index: src/backend/tsearch/ts_parse.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/tsearch/ts_parse.c,v
retrieving revision 1.17
diff -u -r1.17 ts_parse.c
--- src/backend/tsearch/ts_parse.c	26 Feb 2010 02:01:05 -0000	1.17
+++ src/backend/tsearch/ts_parse.c	1 Sep 2010 05:59:35 -0000
@@ -19,7 +19,7 @@
 #include "tsearch/ts_utils.h"

 #define IGNORE_LONGLEXEME	1
-
+#define COMPLEX_TOKEN(x) ( x == 4 || x == 5 || x == 6 || x == 18 || x == 17 || x == 18 || x == 19)
 /*
  * Lexize subsystem
  */
@@ -407,8 +407,6 @@
 		{
 			TSLexeme   *ptr = norms;

-			prs->pos++;		/* set pos */
-
 			while (ptr->lexeme)
 			{
 				if (prs->curwords == prs->lenwords)
@@ -429,6 +427,10 @@
 				prs->curwords++;
 			}
 			pfree(norms);
+
+			if (!COMPLEX_TOKEN(type))
+				prs->pos++;		/* set pos */
+
 		}
 	} while (type > 0);

Index: src/backend/tsearch/wparser_def.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/tsearch/wparser_def.c,v
retrieving revision 1.33
diff -u -r1.33 wparser_def.c
--- src/backend/tsearch/wparser_def.c	19 Aug 2010 05:57:34 -0000	1.33
+++ src/backend/tsearch/wparser_def.c	1 Sep 2010 05:59:36 -0000
@@ -23,7 +23,7 @@


 /* Define me to enable tracing of parser behavior */
-/* #define WPARSER_TRACE */
+#define WPARSER_TRACE


 /* Output token categories */
@@ -249,7 +249,8 @@
 	TParserPosition *state;
 	bool		ignore;
 	bool		wanthost;
-
+	int			partstop;
+	TParserState afterpart;
 	/* silly char */
 	char		c;

@@ -617,7 +618,32 @@
 	}
 	return 1;
 }
+static int
+p_ispartbingo(TParser *prs)
+{
+	int ret = 0;
+	if (prs->partstop > 0)
+	{
+		ret = 1;
+		if (prs->partstop <= prs->state->posbyte)
+		{
+			prs->state->state = prs->afterpart;
+			prs->partstop = 0;
+		}
+		else
+			prs->state->state = TPS_Base;
+	}
+	return ret;
+}
+static int
+p_ispart(TParser *prs)
+{
+	if (prs->partstop > 0)
+		return 1;
+	else
+		return 0;
+}

 /* deliberately suppress unused-function complaints for the above */
 void		_make_compiler_happy(void);
@@ -688,6 +714,21 @@
 }

 static void
+SpecialPart(TParser *prs)
+{
+	prs->partstop = prs->state->posbyte;
+	prs->state->posbyte -= prs->state->lenbytetoken;
+	prs->state->poschar -= prs->state->lenchartoken;
+	prs->afterpart = TPS_Base;
+}
+static void
+SpecialUrlPart(TParser *prs)
+{
+	SpecialPart(prs);
+	prs->afterpart = TPS_InURLPathStart;
+}
+
+static void
 SpecialVerVersion(TParser *prs)
 {
 	prs->state->posbyte -= prs->state->lenbytetoken;
@@ -1057,6 +1098,7 @@
 	{p_iseqC, '-', A_PUSH, TPS_InSignedIntFirst, 0, NULL},
 	{p_iseqC, '+', A_PUSH, TPS_InSignedIntFirst, 0, NULL},
 	{p_iseqC, '&', A_PUSH, TPS_InXMLEntityFirst, 0, NULL},
+	{p_ispart, 0, A_NEXT, TPS_InSpace, 0, NULL},
 	{p_iseqC, '~', A_PUSH, TPS_InFileTwiddle, 0, NULL},
 	{p_iseqC, '/', A_PUSH, TPS_InFileFirst, 0, NULL},
 	{p_iseqC, '.', A_PUSH, TPS_InPathFirstFirst, 0, NULL},
@@ -1068,6 +1110,7 @@
 	{p_isEOF, 0, A_BINGO, TPS_Base, NUMWORD, NULL},
 	{p_isalnum, 0, A_NEXT, TPS_InNumWord, 0, NULL},
 	{p_isspecial, 0, A_NEXT, TPS_InNumWord, 0, NULL},
+	{p_ispartbingo, 0, A_BINGO, TPS_Null, NUMWORD, NULL},
 	{p_iseqC, '@', A_PUSH, TPS_InEmail, 0, NULL},
 	{p_iseqC, '/', A_PUSH, TPS_InFileFirst, 0, NULL},
 	{p_iseqC, '.', A_PUSH, TPS_InFileNext, 0, NULL},
@@ -1078,6 +1121,7 @@
 static const TParserStateActionItem actionTPS_InAsciiWord[] = {
 	{p_isEOF, 0, A_BINGO, TPS_Base, ASCIIWORD, NULL},
 	{p_isasclet, 0, A_NEXT, TPS_Null, 0, NULL},
+	{p_ispartbingo, 0, A_BINGO, TPS_Null, ASCIIWORD, NULL},
 	{p_iseqC, '.', A_PUSH, TPS_InHostFirstDomain, 0, NULL},
 	{p_iseqC, '.', A_PUSH, TPS_InFileNext, 0, NULL},
 	{p_iseqC, '-', A_PUSH, TPS_InHostFirstAN, 0, NULL},
@@ -1105,13 +1149,14 @@
 static const TParserStateActionItem actionTPS_InUnsignedInt[] = {
 	{p_isEOF, 0, A_BINGO, TPS_Base, UNSIGNEDINT, NULL},
 	{p_isdigit, 0, A_NEXT, TPS_Null, 0, NULL},
+	{p_isasclet, 0, A_PUSH, TPS_InHost, 0, NULL},
+	{p_isalpha, 0, A_NEXT, TPS_InNumWord, 0, NULL},
+	{p_isspecial, 0, A_NEXT, TPS_InNumWord, 0, NULL},
+	{p_ispartbingo, 0, A_BINGO, TPS_Null, UNSIGNEDINT, NULL},
 	{p_iseqC, '.', A_PUSH, TPS_InHostFirstDomain, 0, NULL},
 	{p_iseqC, '.', A_PUSH, TPS_InUDecimalFirst, 0, NULL},
 	{p_iseqC, 'e', A_PUSH, TPS_InMantissaFirst, 0, NULL},
 	{p_iseqC, 'E', A_PUSH, TPS_InMantissaFirst, 0, NULL},
-	{p_isasclet, 0, A_PUSH, TPS_InHost, 0, NULL},
-	{p_isalpha, 0, A_NEXT, TPS_InNumWord, 0, NULL},
-	{p_isspecial, 0, A_NEXT, TPS_InNumWord, 0, NULL},
 	{p_iseqC, '/', A_PUSH, TPS_InFileFirst, 0, NULL},
 	{NULL, 0, A_BINGO, TPS_Base, UNSIGNEDINT, NULL}
 };
@@ -1418,7 +1463,7 @@
 };

 static const TParserStateActionItem actionTPS_InHostDomain[] = {
-	{p_isEOF, 0, A_BINGO | A_CLRALL, TPS_Base, HOST, NULL},
+	{p_isEOF, 0, A_BINGO | A_CLRALL, TPS_Base, HOST, SpecialPart},
 	{p_isasclet, 0, A_NEXT, TPS_InHostDomain, 0, NULL},
 	{p_isdigit, 0, A_PUSH, TPS_InHost, 0, NULL},
 	{p_iseqC, ':', A_PUSH, TPS_InPortFirst, 0, NULL},
@@ -1427,9 +1472,9 @@
 	{p_iseqC, '.', A_PUSH, TPS_InHostFirstDomain, 0, NULL},
 	{p_iseqC, '@', A_PUSH, TPS_InEmail, 0, NULL},
 	{p_isdigit, 0, A_POP, TPS_Null, 0, NULL},
-	{p_isstophost, 0, A_BINGO | A_CLRALL, TPS_InURLPathStart, HOST, NULL},
+	{p_isstophost, 0, A_BINGO | A_CLRALL, TPS_Base, HOST, SpecialUrlPart},
 	{p_iseqC, '/', A_PUSH, TPS_InFURL, 0, NULL},
-	{NULL, 0, A_BINGO | A_CLRALL, TPS_Base, HOST, NULL}
+	{NULL, 0, A_BINGO | A_CLRALL, TPS_Base, HOST, SpecialPart}
 };

 static const TParserStateActionItem actionTPS_InPortFirst[] = {
@@ -1439,11 +1484,11 @@
 };

 static const TParserStateActionItem actionTPS_InPort[] = {
-	{p_isEOF, 0, A_BINGO | A_CLRALL, TPS_Base, HOST, NULL},
+	{p_isEOF, 0, A_BINGO | A_CLRALL, TPS_Base, HOST, SpecialPart},
 	{p_isdigit, 0, A_NEXT, TPS_InPort, 0, NULL},
-	{p_isstophost, 0, A_BINGO | A_CLRALL, TPS_InURLPathStart, HOST, NULL},
+	{p_isstophost, 0, A_BINGO | A_CLRALL, TPS_Base, HOST, SpecialUrlPart},
 	{p_iseqC, '/', A_PUSH, TPS_InFURL, 0, NULL},
-	{NULL, 0, A_BINGO | A_CLRALL, TPS_Base, HOST, NULL}
+	{NULL, 0, A_BINGO | A_CLRALL, TPS_Base, HOST, SpecialPart}
 };

 static const TParserStateActionItem actionTPS_InHostFirstAN[] = {
@@ -1457,6 +1502,7 @@
 	{p_isEOF, 0, A_POP, TPS_Null, 0, NULL},
 	{p_isdigit, 0, A_NEXT, TPS_InHost, 0, NULL},
 	{p_isasclet, 0, A_NEXT, TPS_InHost, 0, NULL},
+	{p_ispartbingo, 0, A_BINGO | A_CLRALL, TPS_Null, WORD_T, NULL},
 	{p_iseqC, '@', A_PUSH, TPS_InEmail, 0, NULL},
 	{p_iseqC, '.', A_PUSH, TPS_InHostFirstDomain, 0, NULL},
 	{p_iseqC, '-', A_PUSH, TPS_InHostFirstAN, 0, NULL},
@@ -1466,7 +1512,7 @@

 static const TParserStateActionItem actionTPS_InEmail[] = {
 	{p_isstophost, 0, A_POP, TPS_Null, 0, NULL},
-	{p_ishost, 0, A_BINGO | A_CLRALL, TPS_Base, EMAIL, NULL},
+	{p_ishost, 0, A_BINGO | A_CLRALL, TPS_Base, EMAIL, SpecialPart},
 	{NULL, 0, A_POP, TPS_Null, 0, NULL}
 };

@@ -1507,22 +1553,22 @@
 };

 static const TParserStateActionItem actionTPS_InPathSecond[] = {
-	{p_isEOF, 0, A_BINGO | A_CLEAR, TPS_Base, FILEPATH, NULL},
+	{p_isEOF, 0, A_BINGO | A_CLEAR, TPS_Base, FILEPATH, SpecialPart},
 	{p_iseqC, '/', A_NEXT | A_PUSH, TPS_InFileFirst, 0, NULL},
-	{p_iseqC, '/', A_BINGO | A_CLEAR, TPS_Base, FILEPATH, NULL},
-	{p_isspace, 0, A_BINGO | A_CLEAR, TPS_Base, FILEPATH, NULL},
+	{p_iseqC, '/', A_BINGO | A_CLEAR, TPS_Base, FILEPATH, SpecialPart},
+	{p_isspace, 0, A_BINGO | A_CLEAR, TPS_Base, FILEPATH, SpecialPart},
 	{NULL, 0, A_POP, TPS_Null, 0, NULL}
 };

 static const TParserStateActionItem actionTPS_InFile[] = {
-	{p_isEOF, 0, A_BINGO, TPS_Base, FILEPATH, NULL},
+	{p_isEOF, 0, A_BINGO, TPS_Base, FILEPATH, SpecialPart},
 	{p_isasclet, 0, A_NEXT, TPS_InFile, 0, NULL},
 	{p_isdigit, 0, A_NEXT, TPS_InFile, 0, NULL},
 	{p_iseqC, '.', A_PUSH, TPS_InFileNext, 0, NULL},
 	{p_iseqC, '_', A_NEXT, TPS_InFile, 0, NULL},
 	{p_iseqC, '-', A_NEXT, TPS_InFile, 0, NULL},
 	{p_iseqC, '/', A_PUSH, TPS_InFileFirst, 0, NULL},
-	{NULL, 0, A_BINGO, TPS_Base, FILEPATH, NULL}
+	{NULL, 0, A_BINGO, TPS_Base, FILEPATH,
SpecialPart} }; static const TParserStateActionItem actionTPS_InFileNext[] = { @@ -1544,9 +1590,9 @@ }; static const TParserStateActionItem actionTPS_InURLPath[] = { - {p_isEOF, 0, A_BINGO, TPS_Base, URLPATH, NULL}, + {p_isEOF, 0, A_BINGO, TPS_Base, URLPATH, SpecialPart}, {p_isurlchar, 0, A_NEXT, TPS_InURLPath, 0, NULL}, - {NULL, 0, A_BINGO, TPS_Base, URLPATH, NULL} + {NULL, 0, A_BINGO, TPS_Base, URLPATH, SpecialPart} }; static const TParserStateActionItem actionTPS_InFURL[] = { Index: src/test/regress/expected/tsdicts.out =================================================================== RCS file: /projects/cvsroot/pgsql/src/test/regress/expected/tsdicts.out,v retrieving revision 1.6 diff -u -r1.6 tsdicts.out --- src/test/regress/expected/tsdicts.out 14 Aug 2009 14:53:20 -0000 1.6 +++ src/test/regress/expected/tsdicts.out 1 Sep 2010 05:59:37 -0000 @@ -236,9 +236,9 @@ word, numword, asciiword, hword, numhword, asciihword, hword_part, hword_numpart, hword_asciipart WITH ispell, english_stem; SELECT to_tsvector('ispell_tst', 'Booking the skies after rebookings for footballklubber from a foot'); - to_tsvector ----------------------------------------------------------------------------------------------------- - 'ball':7 'book':1,5 'booking':1,5 'foot':7,10 'football':7 'footballklubber':7 'klubber':7 'sky':3 + to_tsvector +--------------------------------------------------------------------------------------------------- + 'ball':6 'book':0,4 'booking':0,4 'foot':6,9 'football':6 'footballklubber':6 'klubber':6 'sky':2 (1 row) SELECT to_tsquery('ispell_tst', 'footballklubber'); @@ -260,9 +260,9 @@ ALTER TEXT SEARCH CONFIGURATION hunspell_tst ALTER MAPPING REPLACE ispell WITH hunspell; SELECT to_tsvector('hunspell_tst', 'Booking the skies after rebookings for footballklubber from a foot'); - to_tsvector ----------------------------------------------------------------------------------------------------- - 'ball':7 'book':1,5 'booking':1,5 'foot':7,10 'football':7 
'footballklubber':7 'klubber':7 'sky':3 + to_tsvector +--------------------------------------------------------------------------------------------------- + 'ball':6 'book':0,4 'booking':0,4 'foot':6,9 'football':6 'footballklubber':6 'klubber':6 'sky':2 (1 row) SELECT to_tsquery('hunspell_tst', 'footballklubber'); @@ -285,21 +285,21 @@ asciiword, hword_asciipart, asciihword WITH synonym, english_stem; SELECT to_tsvector('synonym_tst', 'Postgresql is often called as postgres or pgsql and pronounced as postgre'); - to_tsvector ---------------------------------------------------- - 'call':4 'often':3 'pgsql':1,6,8,12 'pronounc':10 + to_tsvector +-------------------------------------------------- + 'call':3 'often':2 'pgsql':0,5,7,11 'pronounc':9 (1 row) SELECT to_tsvector('synonym_tst', 'Most common mistake is to write Gogle instead of Google'); - to_tsvector ----------------------------------------------------------- - 'common':2 'googl':7,10 'instead':8 'mistak':3 'write':6 + to_tsvector +--------------------------------------------------------- + 'common':1 'googl':6,9 'instead':7 'mistak':2 'write':5 (1 row) SELECT to_tsvector('synonym_tst', 'Indexes or indices - Which is right plural form of index?'); - to_tsvector ----------------------------------------------- - 'form':8 'index':1,3,10 'plural':7 'right':6 + to_tsvector +--------------------------------------------- + 'form':7 'index':0,2,9 'plural':6 'right':5 (1 row) SELECT to_tsquery('synonym_tst', 'Index & indices'); @@ -319,18 +319,18 @@ SELECT to_tsvector('thesaurus_tst', 'one postgres one two one two three one'); to_tsvector ---------------------------------- - '1':1,5 '12':3 '123':4 'pgsql':2 + '1':0,4 '12':2 '123':3 'pgsql':1 (1 row) SELECT to_tsvector('thesaurus_tst', 'Supernovae star is very new star and usually called supernovae (abbrevation SN)'); - to_tsvector -------------------------------------------------------------- - 'abbrev':10 'call':8 'new':4 'sn':1,9,11 'star':5 'usual':7 + to_tsvector 
+------------------------------------------------------------ + 'abbrev':9 'call':7 'new':3 'sn':0,8,10 'star':4 'usual':6 (1 row) SELECT to_tsvector('thesaurus_tst', 'Booking tickets is looking like a booking a tickets'); - to_tsvector -------------------------------------------------------- - 'card':3,10 'invit':2,9 'like':6 'look':5 'order':1,8 + to_tsvector +------------------------------------------------------ + 'card':2,9 'invit':1,8 'like':5 'look':4 'order':0,7 (1 row) Index: src/test/regress/expected/tsearch.out =================================================================== RCS file: /projects/cvsroot/pgsql/src/test/regress/expected/tsearch.out,v retrieving revision 1.18 diff -u -r1.18 tsearch.out --- src/test/regress/expected/tsearch.out 28 Apr 2010 02:04:16 -0000 1.18 +++ src/test/regress/expected/tsearch.out 1 Sep 2010 05:59:37 -0000 @@ -263,34 +263,90 @@ 1 | qwe 12 | @ 19 | efd.r + 1 | efd + 12 | . + 1 | r 12 | ' 14 | http:// 6 | www.com + 1 | www + 12 | . + 1 | com 12 | / 14 | http:// 5 | aew.werc.ewr/?ad=qwe&dw 6 | aew.werc.ewr + 1 | aew + 12 | . + 1 | werc + 12 | . + 1 | ewr 18 | /?ad=qwe&dw + 12 | /? + 1 | ad + 12 | = + 1 | qwe + 12 | & + 1 | dw 12 | 5 | 1aew.werc.ewr/?ad=qwe&dw 6 | 1aew.werc.ewr + 2 | 1aew + 12 | . + 1 | werc + 12 | . + 1 | ewr 18 | /?ad=qwe&dw + 12 | /? + 1 | ad + 12 | = + 1 | qwe + 12 | & + 1 | dw 12 | 6 | 2aew.werc.ewr + 2 | 2aew + 12 | . + 1 | werc + 12 | . + 1 | ewr 12 | 14 | http:// 5 | 3aew.werc.ewr/?ad=qwe&dw 6 | 3aew.werc.ewr + 2 | 3aew + 12 | . + 1 | werc + 12 | . + 1 | ewr 18 | /?ad=qwe&dw + 12 | /? + 1 | ad + 12 | = + 1 | qwe + 12 | & + 1 | dw 12 | 14 | http:// 6 | 4aew.werc.ewr + 2 | 4aew + 12 | . + 1 | werc + 12 | . + 1 | ewr 12 | 14 | http:// 5 | 5aew.werc.ewr:8100/? 6 | 5aew.werc.ewr:8100 + 2 | 5aew + 12 | . + 1 | werc + 12 | . + 1 | ewr + 12 | : + 22 | 8100 18 | /? - 12 | + 12 | /? 1 | ad 12 | = 1 | qwe @@ -299,11 +355,41 @@ 12 | 5 | 6aew.werc.ewr:8100/?ad=qwe&dw 6 | 6aew.werc.ewr:8100 + 2 | 6aew + 12 | . 
+ 1 | werc + 12 | . + 1 | ewr + 12 | : + 22 | 8100 18 | /?ad=qwe&dw + 12 | /? + 1 | ad + 12 | = + 1 | qwe + 12 | & + 1 | dw 12 | 5 | 7aew.werc.ewr:8100/?ad=qwe&dw=%20%32 6 | 7aew.werc.ewr:8100 + 2 | 7aew + 12 | . + 1 | werc + 12 | . + 1 | ewr + 12 | : + 22 | 8100 18 | /?ad=qwe&dw=%20%32 + 12 | /? + 1 | ad + 12 | = + 1 | qwe + 12 | & + 1 | dw + 12 | =% + 22 | 20 + 12 | % + 22 | 32 12 | 7 | +4.0e-10 12 | @@ -320,6 +406,11 @@ 20 | 5.005 12 | 4 | teo...@stack.net + 1 | teodor + 12 | @ + 1 | stack + 12 | . + 1 | net 12 | 16 | qwe-wer 11 | qwe @@ -349,20 +440,51 @@ 12 | + | 19 | /usr/local/fff + 12 | / + 1 | usr + 12 | / + 1 | local + 12 | / + 1 | fff 12 | 19 | /awdf/dwqe/4325 + 12 | / + 1 | awdf + 12 | / + 1 | dwqe + 12 | / + 22 | 4325 12 | 19 | rewt/ewr + 1 | rewt + 12 | / + 1 | ewr 12 | 1 | wefjn 12 | 19 | /wqe-324/ewr + 12 | / + 1 | wqe + 21 | -324 + 12 | / + 1 | ewr 12 | 19 | gist.h + 1 | gist + 12 | . + 1 | h 12 | 19 | gist.h.c + 1 | gist + 12 | . + 1 | h + 12 | . + 1 | c 12 | 19 | gist.c + 1 | gist + 12 | . + 1 | c 12 | . 1 | readline 12 | @@ -393,14 +515,14 @@ 12 | 12 | <> 1 | qwerty -(133 rows) +(255 rows) SELECT to_tsvector('english', '345 q...@efd.r '' http://www.com/ http://aew.werc.ewr/?ad=qwe&dw 1aew.werc.ewr/?ad=qwe&dw 2aew.werc.ewr http://3aew.werc.ewr/?ad=qwe&dw http://4aew.werc.ewr http://5aew.werc.ewr:8100/? ad=qwe&dw 6aew.werc.ewr:8100/?ad=qwe&dw 7aew.werc.ewr:8100/?ad=qwe&dw=%20%32 +4.0e-10 qwe qwe qwqwe 234.435 455 5.005 teo...@stack.net qwe-wer asdf <fr>qwer jf sdjk<we hjwer <werrwe> ewr1> ewri2 <a href="qwe<qwe>"> /usr/local/fff /awdf/dwqe/4325 rewt/ewr wefjn /wqe-324/ewr gist.h gist.h.c gist.c. readline 4.2 4.2. 4.2, readline-4.2 readline-4.2. 
234 <i <b> wow < jqw <> qwerty'); - to_tsvector ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- - '+4.0e-10':28 '-4.2':60,62 '/?':18 '/?ad=qwe&dw':7,10,14,24 '/?ad=qwe&dw=%20%32':27 '/awdf/dwqe/4325':48 '/usr/local/fff':47 '/wqe-324/ewr':51 '1aew.werc.ewr':9 '1aew.werc.ewr/?ad=qwe&dw':8 '234':63 '234.435':32 '2aew.werc.ewr':11 '345':1 '3aew.werc.ewr':13 '3aew.werc.ewr/?ad=qwe&dw':12 '4.2':56,57,58 '455':33 '4aew.werc.ewr':15 '5.005':34 '5aew.werc.ewr:8100':17 '5aew.werc.ewr:8100/?':16 '6aew.werc.ewr:8100':23 '6aew.werc.ewr:8100/?ad=qwe&dw':22 '7aew.werc.ewr:8100':26 '7aew.werc.ewr:8100/?ad=qwe&dw=%20%32':25 'ad':19 'aew.werc.ewr':6 'aew.werc.ewr/?ad=qwe&dw':5 'asdf':39 'dw':21 'efd.r':3 'ewr1':45 'ewri2':46 'gist.c':54 'gist.h':52 'gist.h.c':53 'hjwer':44 'jf':41 'jqw':66 'qwe':2,20,29,30,37 'qwe-wer':36 'qwer':40 'qwerti':67 'qwqwe':31 'readlin':55,59,61 'rewt/ewr':49 'sdjk':42 'teo...@stack.net':35 'wefjn':50 'wer':38 'wow':65 'www.com':4 + to_tsvector 
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- + '+4.0e-10':53 '-324':84 '-4.2':98,100 '/?':34 '/?ad=qwe&dw':9,15,24,41 '/?ad=qwe&dw=%20%32':48 '/awdf/dwqe/4325':77 '/usr/local/fff':74 '/wqe-324/ewr':83 '1aew':12 '1aew.werc.ewr':12 '1aew.werc.ewr/?ad=qwe&dw':12 '20':51 '234':101 '234.435':57 '2aew':18 '2aew.werc.ewr':18 '32':52 '345':0 '3aew':21 '3aew.werc.ewr':21 '3aew.werc.ewr/?ad=qwe&dw':21 '4.2':94,95,96 '4325':79 '455':58 '4aew':27 '4aew.werc.ewr':27 '5.005':59 '5aew':30 '5aew.werc.ewr:8100':30 '5aew.werc.ewr:8100/?':30 '6aew':37 '6aew.werc.ewr:8100':37 '6aew.werc.ewr:8100/?ad=qwe&dw':37 '7aew':44 '7aew.werc.ewr:8100':44 '7aew.werc.ewr:8100/?ad=qwe&dw=%20%32':44 '8100':33,40,47 'ad':9,15,24,34,41,48 'aew':6 'aew.werc.ewr':6 'aew.werc.ewr/?ad=qwe&dw':6 
'asdf':66 'awdf':77 'c':90,92 'com':5 'dw':11,17,26,36,43,50 'dwqe':78 'efd':2 'efd.r':2 'ewr':8,14,20,23,29,32,39,46,81,85 'ewr1':72 'ewri2':73 'fff':76 'gist':86,88,91 'gist.c':91 'gist.h':86 'gist.h.c':88 'h':87,89 'hjwer':71 'jf':68 'jqw':104 'local':75 'net':62 'qwe':1,10,16,25,35,42,49,54,55,64 'qwe-wer':63 'qwer':67 'qwerti':105 'qwqwe':56 'r':3 'readlin':93,97,99 'rewt':80 'rewt/ewr':80 'sdjk':69 'stack':61 'teodor':60 'teo...@stack.net':60 'usr':74 'wefjn':82 'wer':65 'werc':7,13,19,22,28,31,38,45 'wow':103 'wqe':83 'www':4 'www.com':4 (1 row) SELECT length(to_tsvector('english', '345 q...@efd.r '' http://www.com/ http://aew.werc.ewr/?ad=qwe&dw 1aew.werc.ewr/?ad=qwe&dw 2aew.werc.ewr http://3aew.werc.ewr/?ad=qwe&dw http://4aew.werc.ewr http://5aew.werc.ewr:8100/? ad=qwe&dw 6aew.werc.ewr:8100/?ad=qwe&dw 7aew.werc.ewr:8100/?ad=qwe&dw=%20%32 +4.0e-10 qwe qwe qwqwe 234.435 455 5.005 teo...@stack.net qwe-wer asdf <fr>qwer jf sdjk<we hjwer <werrwe> ewr1> ewri2 <a href="qwe<qwe>"> @@ -408,7 +530,7 @@ <i <b> wow < jqw <> qwerty')); length -------- - 53 + 85 (1 row) -- ts_debug @@ -428,41 +550,83 @@ -- check parsing of URLs SELECT * from ts_debug('english', 'http://www.harewoodsolutions.co.uk/press.aspx</span>'); - alias | description | token | dictionaries | dictionary | lexemes -----------+---------------+----------------------------------------+--------------+------------+------------------------------------------ - protocol | Protocol head | http:// | {} | | - url | URL | www.harewoodsolutions.co.uk/press.aspx | {simple} | simple | {www.harewoodsolutions.co.uk/press.aspx} - host | Host | www.harewoodsolutions.co.uk | {simple} | simple | {www.harewoodsolutions.co.uk} - url_path | URL path | /press.aspx | {simple} | simple | {/press.aspx} - tag | XML tag | </span> | {} | | -(5 rows) + alias | description | token | dictionaries | dictionary | lexemes 
+-----------+-----------------+----------------------------------------+----------------+--------------+------------------------------------------ + protocol | Protocol head | http:// | {} | | + url | URL | www.harewoodsolutions.co.uk/press.aspx | {simple} | simple | {www.harewoodsolutions.co.uk/press.aspx} + host | Host | www.harewoodsolutions.co.uk | {simple} | simple | {www.harewoodsolutions.co.uk} + asciiword | Word, all ASCII | www | {english_stem} | english_stem | {www} + blank | Space symbols | . | {} | | + asciiword | Word, all ASCII | harewoodsolutions | {english_stem} | english_stem | {harewoodsolut} + blank | Space symbols | . | {} | | + asciiword | Word, all ASCII | co | {english_stem} | english_stem | {co} + blank | Space symbols | . | {} | | + asciiword | Word, all ASCII | uk | {english_stem} | english_stem | {uk} + url_path | URL path | /press.aspx | {simple} | simple | {/press.aspx} + blank | Space symbols | / | {} | | + asciiword | Word, all ASCII | press | {english_stem} | english_stem | {press} + blank | Space symbols | . 
| {} | | + asciiword | Word, all ASCII | aspx | {english_stem} | english_stem | {aspx} + tag | XML tag | </span> | {} | | +(16 rows) SELECT * from ts_debug('english', 'http://aew.wer0c.ewr/id?ad=qwe&dw<span>'); - alias | description | token | dictionaries | dictionary | lexemes -----------+---------------+----------------------------+--------------+------------+------------------------------ - protocol | Protocol head | http:// | {} | | - url | URL | aew.wer0c.ewr/id?ad=qwe&dw | {simple} | simple | {aew.wer0c.ewr/id?ad=qwe&dw} - host | Host | aew.wer0c.ewr | {simple} | simple | {aew.wer0c.ewr} - url_path | URL path | /id?ad=qwe&dw | {simple} | simple | {/id?ad=qwe&dw} - tag | XML tag | <span> | {} | | -(5 rows) + alias | description | token | dictionaries | dictionary | lexemes +-----------+-------------------+----------------------------+----------------+--------------+------------------------------ + protocol | Protocol head | http:// | {} | | + url | URL | aew.wer0c.ewr/id?ad=qwe&dw | {simple} | simple | {aew.wer0c.ewr/id?ad=qwe&dw} + host | Host | aew.wer0c.ewr | {simple} | simple | {aew.wer0c.ewr} + asciiword | Word, all ASCII | aew | {english_stem} | english_stem | {aew} + blank | Space symbols | . | {} | | + asciiword | Word, all ASCII | wer | {english_stem} | english_stem | {wer} + word | Word, all letters | 0c | {english_stem} | english_stem | {0c} + blank | Space symbols | . | {} | | + asciiword | Word, all ASCII | ewr | {english_stem} | english_stem | {ewr} + url_path | URL path | /id?ad=qwe&dw | {simple} | simple | {/id?ad=qwe&dw} + blank | Space symbols | / | {} | | + asciiword | Word, all ASCII | id | {english_stem} | english_stem | {id} + blank | Space symbols | ? 
| {} | | + asciiword | Word, all ASCII | ad | {english_stem} | english_stem | {ad} + blank | Space symbols | = | {} | | + asciiword | Word, all ASCII | qwe | {english_stem} | english_stem | {qwe} + blank | Space symbols | & | {} | | + asciiword | Word, all ASCII | dw | {english_stem} | english_stem | {dw} + tag | XML tag | <span> | {} | | +(19 rows) SELECT * from ts_debug('english', 'http://5aew.werc.ewr:8100/?'); - alias | description | token | dictionaries | dictionary | lexemes -----------+---------------+----------------------+--------------+------------+------------------------ - protocol | Protocol head | http:// | {} | | - url | URL | 5aew.werc.ewr:8100/? | {simple} | simple | {5aew.werc.ewr:8100/?} - host | Host | 5aew.werc.ewr:8100 | {simple} | simple | {5aew.werc.ewr:8100} - url_path | URL path | /? | {simple} | simple | {/?} -(4 rows) + alias | description | token | dictionaries | dictionary | lexemes +-----------+-------------------+----------------------+----------------+--------------+------------------------ + protocol | Protocol head | http:// | {} | | + url | URL | 5aew.werc.ewr:8100/? | {simple} | simple | {5aew.werc.ewr:8100/?} + host | Host | 5aew.werc.ewr:8100 | {simple} | simple | {5aew.werc.ewr:8100} + word | Word, all letters | 5aew | {english_stem} | english_stem | {5aew} + blank | Space symbols | . | {} | | + asciiword | Word, all ASCII | werc | {english_stem} | english_stem | {werc} + blank | Space symbols | . | {} | | + asciiword | Word, all ASCII | ewr | {english_stem} | english_stem | {ewr} + blank | Space symbols | : | {} | | + uint | Unsigned integer | 8100 | {simple} | simple | {8100} + url_path | URL path | /? | {simple} | simple | {/?} + blank | Space symbols | /? 
| {} | | +(12 rows) SELECT * from ts_debug('english', '5aew.werc.ewr:8100/?xx'); - alias | description | token | dictionaries | dictionary | lexemes -----------+-------------+------------------------+--------------+------------+-------------------------- - url | URL | 5aew.werc.ewr:8100/?xx | {simple} | simple | {5aew.werc.ewr:8100/?xx} - host | Host | 5aew.werc.ewr:8100 | {simple} | simple | {5aew.werc.ewr:8100} - url_path | URL path | /?xx | {simple} | simple | {/?xx} -(3 rows) + alias | description | token | dictionaries | dictionary | lexemes +-----------+-------------------+------------------------+----------------+--------------+-------------------------- + url | URL | 5aew.werc.ewr:8100/?xx | {simple} | simple | {5aew.werc.ewr:8100/?xx} + host | Host | 5aew.werc.ewr:8100 | {simple} | simple | {5aew.werc.ewr:8100} + word | Word, all letters | 5aew | {english_stem} | english_stem | {5aew} + blank | Space symbols | . | {} | | + asciiword | Word, all ASCII | werc | {english_stem} | english_stem | {werc} + blank | Space symbols | . | {} | | + asciiword | Word, all ASCII | ewr | {english_stem} | english_stem | {ewr} + blank | Space symbols | : | {} | | + uint | Unsigned integer | 8100 | {simple} | simple | {8100} + url_path | URL path | /?xx | {simple} | simple | {/?xx} + blank | Space symbols | /? 
| {} | | + asciiword | Word, all ASCII | xx | {english_stem} | english_stem | {xx} +(12 rows) -- to_tsquery SELECT to_tsquery('english', 'qwe & sKies '); @@ -1002,7 +1166,7 @@ SELECT to_tsvector('SKIES My booKs'); to_tsvector ---------------------------- - 'books':3 'my':2 'skies':1 + 'books':2 'my':1 'skies':0 (1 row) SELECT plainto_tsquery('SKIES My booKs'); @@ -1021,7 +1185,7 @@ SELECT to_tsvector('SKIES My booKs'); to_tsvector ------------------ - 'book':3 'sky':1 + 'book':2 'sky':0 (1 row) SELECT plainto_tsquery('SKIES My booKs'); @@ -1075,20 +1239,20 @@ select * from pendtest where 'ipsu:*'::tsquery @@ ts; ts -------------------- - 'ipsum':2 'lore':1 + 'ipsum':1 'lore':0 (1 row) select * from pendtest where 'ipsa:*'::tsquery @@ ts; ts -------------------- - 'ipsam':2 'lore':1 + 'ipsam':1 'lore':0 (1 row) select * from pendtest where 'ips:*'::tsquery @@ ts; ts -------------------- - 'ipsam':2 'lore':1 - 'ipsum':2 'lore':1 + 'ipsam':1 'lore':0 + 'ipsum':1 'lore':0 (2 rows) select * from pendtest where 'ipt:*'::tsquery @@ ts;
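
As a rough illustration of the rewind-and-rescan idea behind SpecialPart and p_ispartbingo in the patch, here is a minimal standalone sketch. It is not the parser code: `Span`, `push_span` and `emit_with_parts` are invented names, the real parser is a state machine that also emits the separators as blank tokens, and here a "part" is simply assumed to be a maximal run of ASCII alphanumeric characters.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* A token found in the input text (hypothetical type for this sketch). */
typedef struct
{
	size_t		start;		/* byte offset of the token */
	size_t		len;		/* length in bytes */
	int			is_part;	/* 0 = whole token, 1 = sub-token */
} Span;

/* Append a span if there is room; returns the new count. */
size_t
push_span(Span *out, size_t n, size_t max_out,
		  size_t start, size_t len, int is_part)
{
	if (n < max_out)
	{
		out[n].start = start;
		out[n].len = len;
		out[n].is_part = is_part;
		n++;
	}
	return n;
}

/*
 * Emit the whole token first, then rewind to its start and re-scan up to
 * the remembered stop offset (the role prs->partstop plays in the patch),
 * emitting every alphanumeric run as a part.  Separators are skipped.
 */
size_t
emit_with_parts(const char *text, size_t start, size_t len,
				Span *out, size_t max_out)
{
	size_t		n = 0;
	size_t		partstop = start + len;
	size_t		pos = start;	/* rewound position */
	size_t		run;

	assert(text != NULL && out != NULL);
	n = push_span(out, n, max_out, start, len, 0);

	while (pos < partstop)
	{
		while (pos < partstop && !isalnum((unsigned char) text[pos]))
			pos++;
		run = pos;
		while (pos < partstop && isalnum((unsigned char) text[pos]))
			pos++;
		if (pos > run)
			n = push_span(out, n, max_out, run, pos - run, 1);
	}
	return n;
}
```

For the input "foo.bar.com" this produces four spans: the full host followed by the parts at offsets 0, 4 and 8, each three bytes long, which corresponds to the host sample in tokens_output.txt with the blank tokens left out.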

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers