Re: [HACKERS] invalidly encoded strings

Tom Lane Mon, 10 Sep 2007 06:57:31 -0700

Andrew Dunstan <[EMAIL PROTECTED]> writes:
> Perhaps we're talking at cross purposes.


> The problem with doing encoding validation in scan.l is that it lacks 
> context. Null bytes are only the tip of the bytea iceberg, since any 
> arbitrary sequence of bytes can be valid for a bytea.

If you think that, then we're definitely talking at cross purposes.
I assert we should require the post-scanning value of a string literal
to be valid in the database encoding.  If you want to produce an
arbitrary byte sequence within a bytea value, the way to get there is
for the bytea input function to do the de-escaping, not for the string
literal parser to do it.

The current situation where there is overlapping functionality is a bit
unfortunate, but once standard_conforming_strings is on by default,
it'll get a lot easier to work with.  I'm not eager to contort APIs
throughout the backend in order to produce a more usable solution for
the standard_conforming_strings = off case, given the expected limited
lifespan of that usage.

The only reason I was considering not doing it in scan.l is that
scan.l's behavior ideally shouldn't depend on any changeable variables.
But until there's some prospect of database_encoding actually being
mutable at run time, there's not much point in investing a lot of sweat
on that either.

> I still don't see why it's OK for us to do validation from the foo_recv 
> functions but not the corresponding foo_in functions.

Basically, a CSTRING handed to an input function should already have
been encoding-validated by somebody.  The most obvious reason why this
must be so is the embedded-null problem, but in general it will already
have been validated (typically as part of a larger string such as the
whole SQL query or whole COPY data line), and doing that over is
pointless and expensive.  On the other hand, the entire point of a recv
function is that it gets raw data that no one else in the backend knows
the format of; so if the data is to be considered textual, the recv
function has to be the one that considers it so and invokes appropriate
conversion or validation.

The reason backslash escapes in string literals are a problem is that
they can produce incorrect-encoding results from what had been a
validated string.

>  At least in the 
> short term that would provide us with fairly complete protection against 
> accepting invalidly encoded data into the database, once we fix up 
> chr(), without having to mess with the scanner, parser, COPY code etc. 

Instead, we have to mess with an unknown number of UDTs ...

                        regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 3: Have you checked our extensive FAQ?

               http://www.postgresql.org/docs/faq

Re: [HACKERS] invalidly encoded strings

Reply via email to