Hi!

I dug through my list archives: I used to have the very same problem you describe, with special characters being swallowed by the tsearch2 functions.

The source of the problem was that I had initdb'ed my cluster with [EMAIL PROTECTED] as the locale, whereas my databases used Unicode encoding. This combination does not work correctly. I had to dump, re-run initdb with the correct UTF-8 locale (de_DE.UTF-8 in my case), and reload to get tsearch2 to work correctly.

You may find the original discussion here:
http://archives.postgresql.org/pgsql-general/2004-07/msg00620.php

If you wish to find out which locale was used during initdb for your cluster, you can use the pg_controldata program that is supplied with PostgreSQL.
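In case it helps, here is a rough sketch of how that check could be scripted. It assumes a PostgreSQL release of this era, where pg_controldata prints LC_COLLATE and LC_CTYPE lines in "key: value" form; I believe later releases store the locale per database instead, in which case the result would simply be empty. The function names and the UTF-8 heuristic are my own illustration, not part of PostgreSQL or tsearch2.

```python
import subprocess


def parse_locale_settings(controldata_output: str) -> dict:
    """Extract the LC_* entries from pg_controldata-style 'key: value' output."""
    settings = {}
    for line in controldata_output.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().startswith("LC_"):
            settings[key.strip()] = value.strip()
    return settings


def cluster_locale(data_dir: str) -> dict:
    """Run pg_controldata on a data directory and return its LC_* settings."""
    result = subprocess.run(
        ["pg_controldata", data_dir],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_locale_settings(result.stdout)


def looks_utf8(locale_name: str) -> bool:
    """Heuristic only: does the locale name carry a UTF-8 codeset suffix?"""
    # "de_DE.UTF-8@euro" -> codeset "UTF-8"; names without a "." have none.
    codeset = locale_name.partition(".")[2].partition("@")[0]
    return codeset.lower().replace("-", "") == "utf8"
```

You would call cluster_locale() with the path to your own data directory and pass each returned value to looks_utf8(). If a Unicode database sits on a cluster where that check fails, you are probably in the situation I described above.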
Kind regards
Markus

> -----Original Message-----
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED] On behalf of
> Dawid Kuroczko
> Sent: Wednesday, 17 November 2004 17:17
> To: Pgsql General
> Subject: [GENERAL] Tsearch2 and Unicode?
>
> I'm trying to use tsearch2 with database which is in 'UNICODE' encoding.
> It works fine for English text, but as I intend to search Polish texts I did:
>
> insert into pg_ts_cfg('default_polish', 'default', 'pl_PL.UTF-8');
> (and I updated other pg_ts_* tables as written in manual).
>
> However, Polish-specific chars are being eaten alive, it seems.
> I.e. doing select to_tsvector('default_polish', body) from messages;
> results in list of words but with national chars stripped...
>
> I wonder, am I doing something wrong, or just tsearch2 doesn't grok
> Unicode, despite the locales setting? This also is a good question
> regarding ispell_dict and its feelings regarding Unicode, but that's
> another story.
>
> Assuming Unicode unsupported means I should perhaps... oh, convert the
> data to iso8859 prior feeding it to_tsvector()... interesting idea, but
> so far I have failed to actually do it. Maybe store the data as 'bytea'
> and add a column with encoding information (assuming I don't want to
> recreate whole database with new encoding, and that I want to use
> unicode for some columns (so I don't have to keep encoding with every
> text everywhere...).
>
> And while we are at it, how do you feel -- an extra column with tsvector
> and its index -- would it be OK to keep it away from my data (so I can
> safely get rid of them if need be)?
> [ I intend to keep index of around 2 000 000 records, few KBs of text
> each ]...
>
> Regards,
> Dawid Kuroczko
>
> ---------------------------(end of broadcast)---------------------------
> TIP 5: Have you checked our extensive FAQ?
>
> http://www.postgresql.org/docs/faqs/FAQ.html