On Mon, Feb 2, 2009 at 4:34 PM, Oleg Bartunov <o...@sai.msu.su> wrote:
> On Mon, 2 Feb 2009, Oleg Bartunov wrote:
>
>> On Mon, 2 Feb 2009, Mohamed wrote:
>>>
>>> Hehe, ok..
>>> I don't know either, but I took some lines from Al-Jazeera :
>>> http://aljazeera.net/portal
>>>
>>> I just made the change you said, created it successfully and tried this :
>>>
>>> select ts_lexize('ayaspell', '?????? ??????? ????? ????? ?? ????
>>> ????????? ?????')
>>>
>>> but I got nothing... :(
>>>
>>
>> Mohamed, what did you expect from ts_lexize ? Please, provide us
>> valuable information, else we can't help you.
>>

What I expected was something to be returned. After all, they are valid
words taken from an article. (Perhaps you don't see the words, only
???...) Am I wrong to expect something? Should I set up the configuration
completely first? I expected output along the lines of the Norwegian
example:

SELECT ts_lexize('norwegian_ispell', 'overbuljongterningpakkmesterassistent');
 {over,buljong,terning,pakk,mester,assistent}

Check out this article if you need a sample:
http://www.aljazeera.net/NR/exeres/103CFC06-0195-47FD-A29F-2C84B5A15DD0.htm

>>> Is there a way of making sure that words not recognized also get
>>> indexed/searched for ? (Not that I think this is the problem)
>>>
>>
>> yes
>>
>
> Read
> http://www.postgresql.org/docs/8.3/static/textsearch-dictionaries.html
>
> "A text search configuration binds a parser together with a set of
> dictionaries to process the parser's output tokens. For each token type
> that the parser can return, a separate list of dictionaries is specified
> by the configuration. When a token of that type is found by the parser,
> each dictionary in the list is consulted in turn, until some dictionary
> recognizes it as a known word. If it is identified as a stop word, or if
> no dictionary recognizes the token, it will be discarded and not indexed
> or searched for. The general rule for configuring a list of dictionaries
> is to place first the most narrow, most specific dictionary, then the
> more general dictionaries, finishing with a very general dictionary,
> like a Snowball stemmer or simple, which recognizes everything."

Ok, but I don't have a thesaurus or a Snowball stemmer to fall back on.
So words that are valid but for some reason not recognized "will be
discarded and not indexed or searched for", which I consider a problem
since I don't trust my configuration to cover everything. Is this not a
valid concern?

> quick example:
>
> CREATE TEXT SEARCH CONFIGURATION arabic (
>     COPY = english
> );
>
> =# \dF+ arabic
> Text search configuration "public.arabic"
> Parser: "pg_catalog.default"
>       Token      | Dictionaries
> -----------------+--------------
>  asciihword      | english_stem
>  asciiword       | english_stem
>  email           | simple
>  file            | simple
>  float           | simple
>  host            | simple
>  hword           | english_stem
>  hword_asciipart | english_stem
>  hword_numpart   | simple
>  hword_part      | english_stem
>  int             | simple
>  numhword        | simple
>  numword         | simple
>  sfloat          | simple
>  uint            | simple
>  url             | simple
>  url_path        | simple
>  version         | simple
>  word            | english_stem
>
> Then you can alter this configuration.

Yes, I figured that's the next step, but I thought I should get
ts_lexize to work first? What do you think?
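On the unrecognized-words concern, would it be enough to just append the
simple dictionary at the end of whatever list I end up using, as the docs
suggest? Something like this is what I have in mind (only a sketch, not
tested yet; I'm using the arabic configuration from your example and my
ar_ispell dictionary):

-- Sketch: keep the narrow Arabic dictionary first and fall back to
-- "simple", so tokens that ar_ispell doesn't recognize are still
-- indexed as-is instead of being discarded.
ALTER TEXT SEARCH CONFIGURATION arabic
    ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
                      word, hword, hword_part
    WITH ar_ispell, simple;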
Just a thought, say I have this:

ALTER TEXT SEARCH CONFIGURATION pg
    ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
                      word, hword, hword_part
    WITH pga_ardict, ar_ispell, ar_stem;

Is it possible to keep adding dictionaries, to get both Arabic and
English matches on the same column (Arabic speakers tend to mix the two),
like this:

ALTER TEXT SEARCH CONFIGURATION pg
    ALTER MAPPING FOR asciiword, asciihword, hword_asciipart,
                      word, hword, hword_part
    WITH pga_ardict, ar_ispell, ar_stem,
         pg_english_dict, english_ispell, english_stem;

Will something like that work?

/ Moe
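
P.S. Once the mapping is in place, this is roughly how I plan to check
which dictionary in the chain ends up handling each token (again just a
sketch; 'pg' is the configuration from the ALTER statements above, and
the sample text is made up):

-- ts_debug shows, for every token, the dictionary list that was
-- consulted and the dictionary that finally recognized it
-- (dictionary is NULL if nothing did).
SELECT alias, token, dictionaries, dictionary, lexemes
FROM ts_debug('pg', 'some mixed Arabic and English text');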