Re: [GENERAL] Tsearch2 and Unicode?

Markus Wollny Mon, 22 Nov 2004 05:25:27 -0800

 Hi!

I dug through my list archives. I used to have the very same problem that
you describe: special characters being swallowed by the tsearch2 functions.
The cause was that I had run initdb for my cluster with [EMAIL PROTECTED] as
the locale, whereas my databases used Unicode encoding. That combination does
not work correctly. I had to dump, run initdb again with the matching UTF-8
locale (de_DE.UTF-8 in my case) and reload to get tsearch2 to work correctly.
You can find the original discussion here:
http://archives.postgresql.org/pgsql-general/2004-07/msg00620.php
To find out which locale was used at initdb time for your cluster, you can
use the pg_controldata program that is supplied with PostgreSQL.
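
For reference, a minimal sketch of that check in shell. It assumes a
PostgreSQL release of this era, where pg_controldata prints LC_COLLATE and
LC_CTYPE lines (later releases record the locale per database instead). The
function names cluster_locale and locale_is_utf8 are made up for this
example, and the data directory is assumed to be in $PGDATA.

```shell
# Keep only the locale lines from pg_controldata output read on stdin.
# Assumption: the output contains lines starting with LC_COLLATE: / LC_CTYPE:.
cluster_locale() {
    grep -E '^LC_(COLLATE|CTYPE):'
}

# Succeed only if locale lines are present and every one of them names a
# UTF-8 codeset (e.g. de_DE.UTF-8 or de_DE.utf8).
locale_is_utf8() {
    lines=$(cluster_locale) || return 1   # no LC_ lines found at all
    # grep -v selects lines WITHOUT a UTF-8 suffix; any hit means a mismatch.
    ! printf '%s\n' "$lines" | grep -qivE '\.utf-?8[[:space:]]*$'
}

# Typical use (hypothetical data directory location):
#   pg_controldata "$PGDATA" | cluster_locale
#   pg_controldata "$PGDATA" | locale_is_utf8 || echo "cluster locale is not UTF-8"
```

If the second command reports a mismatch while the databases use Unicode
encoding, the cluster is in the situation described above.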


Kind regards

   Markus



> -----Original Message-----
> From: [EMAIL PROTECTED] 
> [mailto:[EMAIL PROTECTED] On behalf of 
> Dawid Kuroczko
> Sent: Wednesday, 17 November 2004 17:17
> To: Pgsql General
> Subject: [GENERAL] Tsearch2 and Unicode?
> 
> I'm trying to use tsearch2 with a database which is in 
> 'UNICODE' encoding.
> It works fine for English text, but as I intend to search 
> Polish texts, I did:
> 
> insert into pg_ts_cfg values ('default_polish', 'default', 
> 'pl_PL.UTF-8'); (and I updated the other pg_ts_* tables as 
> described in the manual).
> 
> However, Polish-specific chars are being eaten alive, it seems.
> I.e. doing select to_tsvector('default_polish', body) from 
> messages; results in a list of words, but with the national chars stripped...
> 
> I wonder, am I doing something wrong, or does tsearch2 just 
> not grok Unicode, despite the locale setting?  This is also 
> a good question regarding ispell_dict and its feelings 
> about Unicode, but that's another story.
> 
> Assuming Unicode is unsupported, I should perhaps... oh, 
> convert the data to iso8859 prior to feeding it to 
> to_tsvector()... an interesting idea, but so far I have 
> failed to actually do it.  Maybe store the data as 'bytea' 
> and add a column with encoding information (assuming I don't 
> want to recreate the whole database with a new encoding, and 
> that I want to use Unicode for some columns, so I don't have 
> to keep the encoding with every text everywhere).
> 
> And while we are at it, how do you feel -- an extra table 
> with the tsvector column and its index -- would it be OK to 
> keep them away from my data (so I can safely get rid of them 
> if need be)?
> [ I intend to keep an index of around 2 000 000 records, a 
> few KB of text each ]...
> 
>   Regards,
>       Dawid Kuroczko
> 

