On 03.09.2013 05:28, Boguk, Maksym wrote:
Target usage: the ability to store UTF8 national characters in selected fields inside a single-byte-encoded database. For example, if I have a ru-RU.koi8r-encoded database with mostly Russian text inside, it would be nice to be able to store Japanese text in one field without converting the whole database to UTF8 (converting such a database to UTF8 could easily almost double its size, even if only one field in the whole database uses any symbols outside of the ru-RU.koi8r encoding).
Ok.
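For reference, the size claim checks out: KOI8-R stores each Cyrillic letter in one byte while UTF-8 needs two, so mostly-Russian text roughly doubles after conversion. A quick sketch (plain Python, with an arbitrary sample string):

    # Cyrillic letters: 1 byte each in KOI8-R, 2 bytes each in UTF-8;
    # ASCII characters such as the space stay 1 byte in both.
    text = 'привет мир'    # arbitrary Russian sample, 10 characters
    print(len(text.encode('koi8_r')))   # 10 bytes
    print(len(text.encode('utf-8')))    # 19 bytes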
What has been done: 1) Addition of the new string data types NATIONAL CHARACTER and NATIONAL CHARACTER VARYING. These types differ from the char/varchar data types in one important respect: NATIONAL string types always have UTF8 encoding, independent of the database encoding.
I don't like the approach of adding a new data type for this. The encoding used for a text field should be an implementation detail, not something that's exposed to users at the schema level. A separate data type makes an nvarchar field behave slightly differently from text, for example when it's passed to and from functions. It will also require drivers and client applications to know about it.
What needs to be done: 1) A full set of string functions and operators for the NATIONAL types (we cannot use the generic text functions because they assume that the strings have the database encoding). Currently only a basic set is implemented. 2) Some way to define the default collation for the NATIONAL types. 3) Some way to input UTF8 characters into NATIONAL types via SQL (there is a serious open problem here... it is described later in the text).
Yeah, all of these issues stem from the fact that the NATIONAL types are separate from text.
I think we should take a completely different approach to this. Two alternatives spring to mind:
1. Implement a new encoding. The new encoding would be some variant of UTF-8 that encodes languages like Russian more efficiently. Then just use that in the whole database. Something like SCSU (http://www.unicode.org/reports/tr6/) should do the trick, although I'm not sure whether SCSU can be used as a server encoding. A lot of code relies on the fact that in a server encoding, every byte of a multi-byte character must have the high bit set. That's why SJIS, for example, can only be used as a client encoding. But surely you could come up with some subset or variant of SCSU that satisfies that requirement.
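To make that requirement concrete: in UTF-8 every byte of a multi-byte character is >= 0x80, so an ASCII byte such as 0x5C (backslash) can never appear in the middle of a character; in Shift-JIS it can. A small illustration (plain Python, not PostgreSQL code; '表' is just a convenient example character):

    # U+8868: all three UTF-8 bytes have the high bit set,
    # but the second Shift-JIS byte is 0x5C, the ASCII backslash.
    ch = '表'
    print(ch.encode('utf-8').hex())      # e8a1a8
    print(ch.encode('shift_jis').hex())  # 955c
    print(all(b >= 0x80 for b in ch.encode('shift_jis')))  # False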
2. Compress the column. Simply do "ALTER TABLE foo ALTER COLUMN bar SET STORAGE MAIN". That will make Postgres compress the field. Today that might not be very efficient for short Cyrillic text encoded in UTF-8, but it could be improved. There has been discussion in the past about supporting more compression algorithms, and one of those algorithms could again be something like SCSU.
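To see why today's compression doesn't help much here: a general-purpose compressor carries per-value overhead that can exceed any savings on short strings, whereas SCSU would bring 2-byte UTF-8 Cyrillic down to roughly one byte per letter. A rough illustration using zlib as a stand-in (an assumption: pglz fares no better on inputs this short):

    import zlib
    short = 'привет мир'.encode('utf-8')   # 19 bytes of mostly-Cyrillic UTF-8
    print(len(short), len(zlib.compress(short, 9)))
    # prints something like: 19 25  (the output is larger than the input)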
- Heikki