On Thu, 2024-02-29 at 17:02 -0800, Jeff Davis wrote:
> Attached is an implementation of a per-database option STRICT_UNICODE
> which enforces the use of assigned code points only.
I'm withdrawing this patch due to lack of interest.
Regards,
Jeff Davis
On Thu, 2024-02-29 at 17:02 -0800, Jeff Davis wrote:
> Attached is an implementation of a per-database option STRICT_UNICODE
> which enforces the use of assigned code points only.
The CF app doesn't seem to point at the latest patch:
https://www.postgresql.org/message-id/a0e85aca6e03042881924c4b3
On Mon, 2023-10-02 at 16:06 -0400, Robert Haas wrote:
> It seems to me that this overlooks one of the major points of Jeff's
> proposal, which is that we don't reject text input that contains
> unassigned code points. That decision turns out to be really painful.
Attached is an implementation of a
On Sat, 4 Nov 2023 at 10:57, Thomas Munro wrote:
>
> On Fri, Nov 3, 2023 at 9:01 PM David Rowley wrote:
> > On Fri, 3 Nov 2023 at 20:49, Jeff Davis wrote:
> > > I think I just need to add unicode_category.c to @pgcommonallfiles in
> > > Mkvcbuild.pm. I'll do a trial commit tomorrow and see if th
On Fri, Nov 3, 2023 at 9:01 PM David Rowley wrote:
> On Fri, 3 Nov 2023 at 20:49, Jeff Davis wrote:
> > On Fri, 2023-11-03 at 10:51 +1300, Thomas Munro wrote:
> > > bowerbird and hammerkop didn't like commit a02b37fc. They're still
> > > using the old 3rd build system that is not tested by CI.
On 2023-10-04 23:32, Chapman Flack wrote:
Well, for what reason does anybody run PG now with the encoding set
to anything besides UTF-8? I don't really have my finger on that pulse.
Could it be that it bloats common strings in their local script, and
with enough of those to store, it could matter
On Fri, 2023-11-03 at 17:11 +0700, John Naylor wrote:
> On Sat, Oct 28, 2023 at 4:15 AM Jeff Davis wrote:
> >
> > I plan to commit something like v3 early next week unless someone
> > else
> > has additional comments or I missed a concern.
>
> Hi Jeff, is the CF entry titled "Unicode character g
On Fri, 2023-11-03 at 21:01 +1300, David Rowley wrote:
> Thomas mentioned this to me earlier today. After looking I also
> concluded that unicode_category.c needed to be added to
> @pgcommonallfiles. After looking at the time, I didn't expect you to
> be around so opted just to push that to fix the
On Sat, Oct 28, 2023 at 4:15 AM Jeff Davis wrote:
>
> I plan to commit something like v3 early next week unless someone else
> has additional comments or I missed a concern.
Hi Jeff, is the CF entry titled "Unicode character general category
functions" ready to be marked committed?
On Fri, 3 Nov 2023 at 20:49, Jeff Davis wrote:
>
> On Fri, 2023-11-03 at 10:51 +1300, Thomas Munro wrote:
> > bowerbird and hammerkop didn't like commit a02b37fc. They're still
> > using the old 3rd build system that is not tested by CI. It's due
> > for
> > removal in the 17 cycle IIUC but in t
On Fri, 2023-11-03 at 10:51 +1300, Thomas Munro wrote:
> bowerbird and hammerkop didn't like commit a02b37fc. They're still
> using the old 3rd build system that is not tested by CI. It's due
> for
> removal in the 17 cycle IIUC but in the meantime I guess the new
> codegen script needs to be inv
On Wed, Oct 04, 2023 at 01:15:03PM -0700, Jeff Davis wrote:
> > The fact that there are multiple types of normalization and multiple
> > notions of equality doesn't make this easier.
And then there's text that isn't normalized to any of them.
> NFC is really the only one that makes sense.
Yes.
On Tue, Oct 17, 2023 at 05:07:40PM +0200, Daniel Verite wrote:
> > * Add a per-database option to enforce only storing assigned unicode
> > code points.
>
> There's a problem in the fact that the set of assigned code points is
> expanding with every Unicode release, which happens about every year.
On Wed, Oct 04, 2023 at 01:16:22PM -0400, Robert Haas wrote:
> There's a very popular commercial database where, or so I have been
> led to believe, any byte sequence at all is accepted when you try to
> put values into the database. [...]
In other circles we call this "just-use-8".
ZFS, for exam
On Fri, Oct 06, 2023 at 02:37:06PM -0400, Robert Haas wrote:
> > Sure, because TEXT in PG doesn't have codeset+encoding as part of it --
> > it's whatever the database's encoding is. Collation can and should be a
> > property of a column, since for Unicode it wouldn't be reasonable to
> > make tha
bowerbird and hammerkop didn't like commit a02b37fc. They're still
using the old 3rd build system that is not tested by CI. It's due for
removal in the 17 cycle IIUC but in the meantime I guess the new
codegen script needs to be invoked by something under src/tools/msvc?
varlena.obj : error LN
On Mon, 2023-10-16 at 20:32 -0700, Jeff Davis wrote:
> On Wed, 2023-10-11 at 08:56 +0200, Peter Eisentraut wrote:
> > We need to be careful about precise terminology. "Valid" has a
> > defined
> > meaning for Unicode. A byte sequence can be valid or not as UTF-
> > 8.
> > But
> > a string cont
On Tue, 2023-10-17 at 17:07 +0200, Daniel Verite wrote:
> There's a problem in the fact that the set of assigned code points is
> expanding with every Unicode release, which happens about every year.
>
> If we had this option in Postgres 11 released in 2018 it would use
> Unicode 11, and in 2023 t
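
To illustrate the point that the set of assigned code points depends on the Unicode version in use, here is a minimal Python sketch. It is not PostgreSQL code. It relies on Python's standard `unicodedata` module, which ships both the current tables and a frozen Unicode 3.2.0 copy (`ucd_3_2_0`). The choice of U+20BF (added in Unicode 10.0) as the probe character and the variable names are assumptions made for illustration.

```python
import unicodedata

# Frozen Unicode 3.2.0 tables that ship alongside the current ones.
old_tables = unicodedata.ucd_3_2_0

# U+20BF BITCOIN SIGN was added in Unicode 10.0, long after 3.2.0.
BITCOIN_SIGN = "\u20bf"

# General category "Cn" means the code point is not assigned in that version.
assigned_in_3_2 = old_tables.category(BITCOIN_SIGN) != "Cn"
assigned_now = unicodedata.category(BITCOIN_SIGN) != "Cn"

print("old tables:", old_tables.unidata_version, "assigned:", assigned_in_3_2)
print("new tables:", unicodedata.unidata_version, "assigned:", assigned_now)
```

The same string would be rejected under the older tables and accepted under the newer ones, which is the versioning concern raised in the message above.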
On Tue, Oct 17, 2023 at 11:38 AM Isaac Morland wrote:
> On Tue, 17 Oct 2023 at 11:15, Robert Haas wrote:
>> Are code points assigned from a gapless sequence? That is, is the
>> implementation of codepoint_is_assigned(char) just 'codepoint <
>> SOME_VALUE' and SOME_VALUE increases over time?
>
> N
On Tue, 17 Oct 2023 at 11:15, Robert Haas wrote:
> Are code points assigned from a gapless sequence? That is, is the
> implementation of codepoint_is_assigned(char) just 'codepoint <
> SOME_VALUE' and SOME_VALUE increases over time?
>
Not even close. Code points are organized in blocks, e.g. fo
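
A quick way to see that assignment is not a simple `codepoint < SOME_VALUE` test is to scan a small range for gaps. The sketch below uses Python's `unicodedata` module rather than PostgreSQL's in-core tables, and the helper name `is_assigned` is hypothetical. Note that category `Cn` also covers noncharacters, so this is an approximation of "unassigned".

```python
import unicodedata


def is_assigned(codepoint: int) -> bool:
    """Return True if this Python build's Unicode tables assign the code point.

    General category "Cn" is reported for unassigned code points (and for
    noncharacters), so anything else is treated as assigned here.
    """
    return unicodedata.category(chr(codepoint)) != "Cn"


# Scan part of the Greek and Coptic block and collect code points that are
# unassigned even though higher code points in the same block are assigned.
gaps = [hex(cp) for cp in range(0x0370, 0x0380) if not is_assigned(cp)]
print(gaps)
```

Because gaps like these sit between assigned code points, a lookup table or range list is needed rather than a single threshold.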
On Tue, Oct 17, 2023 at 11:07 AM Daniel Verite wrote:
> There's a problem in the fact that the set of assigned code points is
> expanding with every Unicode release, which happens about every year.
>
> If we had this option in Postgres 11 released in 2018 it would use
> Unicode 11, and in 2023 thi
Jeff Davis wrote:
> I believe the patch has utility as-is, but I've been brainstorming a
> few more ideas that could build on it:
>
> * Add a per-database option to enforce only storing assigned unicode
> code points.
There's a problem in the fact that the set of assigned code points is
On Wed, 2023-10-11 at 08:51 +0200, Peter Eisentraut wrote:
> I don't see how this would really work in practice. Whether your
> data
> has unassigned code points or not, when the collations are updated to
> the next Unicode version, the collations will have a new version
> number,
> and so you n
On Wed, 2023-10-11 at 08:56 +0200, Peter Eisentraut wrote:
> On 11.10.23 03:08, Jeff Davis wrote:
> > * unicode_is_valid(text): returns true if all codepoints are
> > assigned, false otherwise
>
> We need to be careful about precise terminology. "Valid" has a
> defined
> meaning for Unicode.
On 11.10.23 03:08, Jeff Davis wrote:
* unicode_is_valid(text): returns true if all codepoints are
assigned, false otherwise
We need to be careful about precise terminology. "Valid" has a defined
meaning for Unicode. A byte sequence can be valid or not as UTF-8. But
a string containing u
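
The terminology distinction can be made concrete with a small Python sketch. It assumes nothing about the actual patch: `is_valid_utf8` and `all_assigned` are hypothetical helper names, and the check uses Python's `unicodedata` tables rather than PostgreSQL's.

```python
import unicodedata


def is_valid_utf8(data: bytes) -> bool:
    """True if the byte sequence is well-formed UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def all_assigned(text: str) -> bool:
    """True if no code point in the string has general category Cn."""
    return all(unicodedata.category(ch) != "Cn" for ch in text)


# U+0378 is unassigned, yet its UTF-8 encoding is perfectly well-formed.
unassigned = "\u0378"
encoded = unassigned.encode("utf-8")

print(is_valid_utf8(encoded))    # well-formedness of the bytes
print(all_assigned(unassigned))  # assignment status of the code points
print(is_valid_utf8(b"\xff"))    # 0xFF can never appear in UTF-8
```

The two properties are independent, which is why a name other than "valid" seems preferable for the assigned-only check.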
On 10.10.23 16:02, Robert Haas wrote:
On Tue, Oct 10, 2023 at 2:44 AM Peter Eisentraut wrote:
Can you restate what this is supposed to be for? This thread appears to
have morphed from "let's normalize everything" to "let's check for
unassigned code points", but I'm not sure what we are aiming
On Tue, Oct 10, 2023 at 2:44 AM Peter Eisentraut wrote:
> Can you restate what this is supposed to be for? This thread appears to
> have morphed from "let's normalize everything" to "let's check for
> unassigned code points", but I'm not sure what we are aiming for now.
Jeff can say what he want
On 06.10.23 19:22, Jeff Davis wrote:
On Fri, 2023-10-06 at 09:58 +0200, Peter Eisentraut wrote:
If you want to be rigid about it, you also need to consider whether
the
Unicode version used by the ICU library in use matches the one used
by
the in-core tables.
What problem are you concerned about
On 07.10.23 03:18, Jeff Davis wrote:
On Wed, 2023-10-04 at 13:16 -0400, Robert Haas wrote:
At minimum I think we need to have some internal functions to check
for
unassigned code points. That belongs in core, because we generate
the
unicode tables from a specific version.
That's a good idea.
P
On Fri, Oct 6, 2023 at 3:07 PM Jeff Davis wrote:
> On Fri, 2023-10-06 at 13:33 -0400, Robert Haas wrote:
> > What I think people really want is a whole column in
> > some encoding that isn't the normal one for that database.
>
> Do people really want that? I'd be curious to know why.
Because it's
On Fri, 6 Oct 2023, 21:08 Jeff Davis, wrote:
> On Fri, 2023-10-06 at 13:33 -0400, Robert Haas wrote:
> > What I think people really want is a whole column in
> > some encoding that isn't the normal one for that database.
>
> Do people really want that? I'd be curious to know why.
>
One reason so
On Fri, 6 Oct 2023 at 15:07, Jeff Davis wrote:
> On Fri, 2023-10-06 at 13:33 -0400, Robert Haas wrote:
> > What I think people really want is a whole column in
> > some encoding that isn't the normal one for that database.
>
> Do people really want that? I'd be curious to know why.
>
> A lot of m
On Fri, 2023-10-06 at 13:33 -0400, Robert Haas wrote:
> What I think people really want is a whole column in
> some encoding that isn't the normal one for that database.
Do people really want that? I'd be curious to know why.
A lot of modern projects are simply declaring UTF-8 to be the "one true
On Fri, Oct 6, 2023 at 2:25 PM Nico Williams wrote:
> > > > Well, that would be making the encoding a per-value property, rather
> > > > than a per-column property like collation as I proposed. I can't see
> > >
> > > On-disk it would be just a property of the type, not part of the value.
> >
> >
On Fri, Oct 06, 2023 at 02:17:32PM -0400, Robert Haas wrote:
> On Fri, Oct 6, 2023 at 1:38 PM Nico Williams wrote:
> > On Fri, Oct 06, 2023 at 01:33:06PM -0400, Robert Haas wrote:
> > > On Thu, Oct 5, 2023 at 3:15 PM Nico Williams
> > > wrote:
> > > > Text+encoding can be just like bytea with a
On Fri, Oct 6, 2023 at 1:38 PM Nico Williams wrote:
> On Fri, Oct 06, 2023 at 01:33:06PM -0400, Robert Haas wrote:
> > On Thu, Oct 5, 2023 at 3:15 PM Nico Williams wrote:
> > > Text+encoding can be just like bytea with a one- or two-byte prefix
> > > indicating what codeset+encoding it's in. Tha
On Thu, 2023-10-05 at 14:52 -0500, Nico Williams wrote:
> This is just how you encode the type of the string. You have any
> number
> of options. The point is that already PG can encode binary data, so
> if
> how to encode text of disparate encodings on the wire, building on
> top
> of the encodi
On Fri, Oct 06, 2023 at 01:33:06PM -0400, Robert Haas wrote:
> On Thu, Oct 5, 2023 at 3:15 PM Nico Williams wrote:
> > Text+encoding can be just like bytea with a one- or two-byte prefix
> > indicating what codeset+encoding it's in. That'd be how to encode
> > such text values on the wire, though
On Thu, Oct 5, 2023 at 3:15 PM Nico Williams wrote:
> Text+encoding can be just like bytea with a one- or two-byte prefix
> indicating what codeset+encoding it's in. That'd be how to encode
> such text values on the wire, though on disk the column's type should
> indicate the codeset+encoding, so
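
As a rough illustration of the "bytea with a one-byte prefix" idea for the wire format, here is a Python sketch. The tag values, the `ENCODING_TAGS` table, and the function names are all hypothetical; nothing like this exists in the PostgreSQL protocol.

```python
# Hypothetical one-byte tags identifying the encoding of the payload.
ENCODING_TAGS = {0x01: "utf-8", 0x02: "latin-1"}
TAG_FOR_ENCODING = {name: tag for tag, name in ENCODING_TAGS.items()}


def pack_text(text: str, encoding: str) -> bytes:
    """Encode text and prepend a one-byte tag naming its encoding."""
    return bytes([TAG_FOR_ENCODING[encoding]]) + text.encode(encoding)


def unpack_text(payload: bytes) -> str:
    """Read the tag byte, then decode the rest with the matching encoding."""
    encoding = ENCODING_TAGS[payload[0]]
    return payload[1:].decode(encoding)


wire_latin1 = pack_text("é", "latin-1")
wire_utf8 = pack_text("é", "utf-8")
print(wire_latin1, wire_utf8, unpack_text(wire_latin1) == unpack_text(wire_utf8))
```

The same character yields different payload bytes under each tag, so any comparison has to happen after decoding, not on the raw bytes.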
On Fri, 2023-10-06 at 09:58 +0200, Peter Eisentraut wrote:
> If you want to be rigid about it, you also need to consider whether
> the
> Unicode version used by the ICU library in use matches the one used
> by
> the in-core tables.
What problem are you concerned about here? I thought about it an
On 05.10.23 19:30, Jeff Davis wrote:
Agreed, at least until we understand the set of users per-column
encoding is important to. I acknowledge that the presence of per-column
encoding in the standard is some kind of signal there, but not enough
by itself to justify something so invasive.
The per
On 03.10.23 21:54, Jeff Davis wrote:
Here, Jeff mentions normalization, but I think it's a major issue
with
collation support. If new code points are added, users can put them
into the database before they are known to the collation library, and
then when they become known to the collation librar
On Thu, Oct 05, 2023 at 03:49:37PM -0400, Tom Lane wrote:
> Nico Williams writes:
> > Text+encoding can be just like bytea with a one- or two-byte prefix
> > indicating what codeset+encoding it's in. That'd be how to encode
> > such text values on the wire, though on disk the column's type should
Nico Williams writes:
> Text+encoding can be just like bytea with a one- or two-byte prefix
> indicating what codeset+encoding it's in. That'd be how to encode
> such text values on the wire, though on disk the column's type should
> indicate the codeset+encoding, so no need to add a prefix to th
On Thu, 2023-10-05 at 09:10 -0400, Isaac Morland wrote:
> In the case you describe, the users don’t have text at all; they have
> bytes, and a vague belief about what encoding the bytes might be in
> and therefore what characters they are intended to represent. The
> correct way to store that in th
On Thu, Oct 05, 2023 at 07:31:54AM -0400, Robert Haas wrote:
> [...] On the other hand, to do that in PostgreSQL, we'd need to
> propagate the character set/encoding information into all of the
> places that currently get the typmod and collation, and that is not a
> small number of places. It's a
On Thu, 2023-10-05 at 07:31 -0400, Robert Haas wrote:
> It's a lot of infrastructure for the project to carry
> around for a feature that's probably only going to continue to become
> less relevant.
Agreed, at least until we understand the set of users per-column
encoding is important to. I acknow
On Thu, 5 Oct 2023 at 07:32, Robert Haas wrote:
> But I do think that sometimes users are reluctant to perform encoding
> conversions on the data that they have. Sometimes they're not
> completely certain what encoding their data is in, and sometimes
> they're worried that the encoding conversio
On Wed, Oct 4, 2023 at 9:02 PM Isaac Morland wrote:
>> > What about characters not in UTF-8?
>>
>> Honestly I'm not clear on this topic. Are the "private use" areas in
>> unicode enough to cover use cases for characters not recognized by
>> unicode? Which encodings in postgres can represent charac
On Wed, 4 Oct 2023 at 17:37, Jeff Davis wrote:
> On Wed, 2023-10-04 at 14:14 -0400, Isaac Morland wrote:
> > Always store only UTF-8 in the database
>
> What problem does that solve? I don't see our encoding support as a big
> source of problems, given that database-wide UTF-8 already works fine.
On Wed, Oct 04, 2023 at 04:01:26PM -0700, Jeff Davis wrote:
> On Wed, 2023-10-04 at 16:15 -0500, Nico Williams wrote:
> > Better that than TEXT blobs w/ the encoding given by the `CREATE
> > DATABASE` or `initdb` default!
>
> From an engineering perspective, yes, per-column encodings would be
> mo
On Wed, 2023-10-04 at 16:15 -0500, Nico Williams wrote:
> Better that than TEXT blobs w/ the encoding given by the `CREATE
> DATABASE` or `initdb` default!
From an engineering perspective, yes, per-column encodings would be
more flexible. But I still don't understand who exactly would use that,
a
On Wed, Oct 04, 2023 at 05:32:50PM -0400, Chapman Flack wrote:
> Well, for what reason does anybody run PG now with the encoding set
> to anything besides UTF-8? I don't really have my finger on that pulse.
Because they still have databases that didn't use UTF-8 10 or 20 years
ago that they haven'
On Wed, 2023-10-04 at 14:14 -0400, Isaac Morland wrote:
> Always store only UTF-8 in the database
What problem does that solve? I don't see our encoding support as a big
source of problems, given that database-wide UTF-8 already works fine.
In fact, some postgres features only work with UTF-8.
I
On 2023-10-04 16:38, Jeff Davis wrote:
On Wed, 2023-10-04 at 14:02 -0400, Chapman Flack wrote:
The SQL standard would have me able to:
CREATE TABLE foo (
a CHARACTER VARYING CHARACTER SET UTF8,
b CHARACTER VARYING CHARACTER SET LATIN1
)
and so on
Is there a use case for that? UTF-8 is
On Wed, Oct 04, 2023 at 01:38:15PM -0700, Jeff Davis wrote:
> On Wed, 2023-10-04 at 14:02 -0400, Chapman Flack wrote:
> > The SQL standard would have me able to:
> >
> > [...]
> > _UTF8'Hello, world!' and _LATIN1'Hello, world!'
>
> Is there a use case for that? UTF-8 is able to encode any unicode
On Wed, 2023-10-04 at 14:02 -0400, Chapman Flack wrote:
> The SQL standard would have me able to:
>
> CREATE TABLE foo (
> a CHARACTER VARYING CHARACTER SET UTF8,
> b CHARACTER VARYING CHARACTER SET LATIN1
> )
>
> and so on, and write character literals like
>
> _UTF8'Hello, world!' and _L
On Wed, 2023-10-04 at 13:16 -0400, Robert Haas wrote:
> any byte sequence at all is accepted when you try to
> put values into the database.
We support SQL_ASCII, which allows something similar.
> At any rate, if we were to go in the direction of rejecting code
> points that aren't yet assigned,
On Wed, 4 Oct 2023 at 14:05, Chapman Flack wrote:
> On 2023-10-04 13:47, Robert Haas wrote:
>
> The SQL standard would have me able to:
>
> CREATE TABLE foo (
> a CHARACTER VARYING CHARACTER SET UTF8,
> b CHARACTER VARYING CHARACTER SET LATIN1
> )
>
> and so on, and write character litera
On Wed, Oct 4, 2023 at 2:02 PM Chapman Flack wrote:
> Clearly, part of the job would involve making the wire protocol
> able to transmit binary values and identify their encodings.
Right. Which unfortunately is moving the goal posts into the
stratosphere compared to any other work mentioned so fa
On 2023-10-04 13:47, Robert Haas wrote:
On Wed, Oct 4, 2023 at 1:27 PM Nico Williams
wrote:
A UTEXT type would be helpful for specifying that the text must be
Unicode (in which transform?) even if the character data encoding for
the database is not UTF-8.
That's actually pretty thorny ... bec
On Wed, Oct 4, 2023 at 1:27 PM Nico Williams wrote:
> A UTEXT type would be helpful for specifying that the text must be
> Unicode (in which transform?) even if the character data encoding for
> the database is not UTF-8.
That's actually pretty thorny ... because right now client_encoding
specifi
On Tue, Sep 12, 2023 at 03:47:10PM -0700, Jeff Davis wrote:
> The idea is to have a new data type, say "UTEXT", that normalizes the
> input so that it can have an improved notion of equality while still
> using memcmp().
A UTEXT type would be helpful for specifying that the text must be
Unicode (i
On Tue, Oct 3, 2023 at 3:54 PM Jeff Davis wrote:
> I assume you mean because we reject invalid byte sequences? Yeah, I'm
> sure that causes a problem for some (especially migrations), but it's
> difficult for me to imagine a database working well with no rules at
> all for the basic data types
On Tue, Oct 03, 2023 at 03:34:44PM -0700, Jeff Davis wrote:
> On Tue, 2023-10-03 at 15:15 -0500, Nico Williams wrote:
> > Ugh, My client is not displaying 'a' correctly
>
> Ugh. Is that an argument in favor of normalization or against?
Heheh, well, it's an argument in favor of more software gettin
On Mon, 2023-10-02 at 10:47 +0200, Peter Eisentraut wrote:
> I think a better direction here would be to work toward making
> nondeterministic collations usable on the global/database level and
> then
> encouraging users to use those.
>
> It's also not clear which way the performance tradeoffs w
On Tue, 2023-10-03 at 15:15 -0500, Nico Williams wrote:
> Ugh, My client is not displaying 'a' correctly
Ugh. Is that an argument in favor of normalization or against?
I've also noticed that some fonts render the same character a bit
differently depending on the constituent code points. For instan
On Tue, Oct 03, 2023 at 12:15:10PM -0700, Jeff Davis wrote:
> On Mon, 2023-10-02 at 15:27 -0500, Nico Williams wrote:
> > I think you misunderstand Unicode normalization and equivalence.
> > There is no standard Unicode `normalize()` that would cause the
> > above equality predicate to be true. I
On Mon, 2023-10-02 at 16:06 -0400, Robert Haas wrote:
> It seems to me that this overlooks one of the major points of Jeff's
> proposal, which is that we don't reject text input that contains
> unassigned code points. That decision turns out to be really painful.
Yeah, because we lose forward-comp
On Mon, 2023-10-02 at 15:27 -0500, Nico Williams wrote:
> I think you misunderstand Unicode normalization and equivalence.
> There
> is no standard Unicode `normalize()` that would cause the above
> equality
> predicate to be true. If you normalize to NFD (normal form
> decomposed)
> then a _pref
On Tue, Sep 12, 2023 at 03:47:10PM -0700, Jeff Davis wrote:
> One of the frustrations with using the "C" locale (or any deterministic
> locale) is that the following returns false:
>
> SELECT 'á' = 'á'; -- false
>
> because those are the unicode sequences U&'\0061\0301' and U&'\00E1',
> respec
On Mon, Oct 2, 2023 at 3:42 PM Peter Eisentraut wrote:
> I think a better direction here would be to work toward making
> nondeterministic collations usable on the global/database level and then
> encouraging users to use those.
It seems to me that this overlooks one of the major points of Jeff's
On 13.09.23 00:47, Jeff Davis wrote:
The idea is to have a new data type, say "UTEXT", that normalizes the
input so that it can have an improved notion of equality while still
using memcmp().
I think a new type like this would obviously be suboptimal because it's
nonstandard and most people wo
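
The "normalize on input, compare with memcmp()" idea behind the proposed type can be sketched in Python. The class name `UText` and the choice of NFC are assumptions for illustration only. This is not how the proposed PostgreSQL type would be implemented.

```python
import unicodedata


class UText:
    """Toy text wrapper that normalizes to NFC once, at input time."""

    def __init__(self, value: str) -> None:
        # Store the normalized UTF-8 bytes so later comparisons are bytewise.
        self._bytes = unicodedata.normalize("NFC", value).encode("utf-8")

    def __eq__(self, other: object) -> bool:
        if not isinstance(other, UText):
            return NotImplemented
        # Plain byte comparison, analogous to memcmp() on stored values.
        return self._bytes == other._bytes

    def __hash__(self) -> int:
        return hash(self._bytes)


same = UText("a\u0301") == UText("\u00e1")
print(same)
```

The cost of normalization is paid once on input, and equality and hashing afterwards need no Unicode awareness.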
One of the frustrations with using the "C" locale (or any deterministic
locale) is that the following returns false:
SELECT 'á' = 'á'; -- false
because those are the unicode sequences U&'\0061\0301' and U&'\00E1',
respectively, so memcmp() returns non-zero. But it's really the same
character
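
The comparison described above can be reproduced outside the database. This is a minimal Python sketch using the standard `unicodedata` module. The variable names are illustrative, and comparing encoded bytes stands in for `memcmp()`.

```python
import unicodedata

decomposed = "a\u0301"  # U+0061 followed by U+0301 COMBINING ACUTE ACCENT
precomposed = "\u00e1"  # U+00E1 LATIN SMALL LETTER A WITH ACUTE

# Bytewise comparison of the UTF-8 encodings, as memcmp() would see them.
bytewise_equal = decomposed.encode("utf-8") == precomposed.encode("utf-8")

# After normalizing both sides to NFC the encodings are identical.
nfc_equal = (
    unicodedata.normalize("NFC", decomposed).encode("utf-8")
    == unicodedata.normalize("NFC", precomposed).encode("utf-8")
)

print(bytewise_equal, nfc_equal)
```

The two strings differ as bytes but agree after NFC normalization, which is the gap a normalizing type or option would close.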
74 matches