Re: [patch] glib20, UTF-8 and string collation

Alexander Nedotsukov Wed, 09 Jan 2008 22:00:27 -0800

Alexandre "Sunny" Kovalenko wrote:

On Thu, 2008-01-10 at 13:18 +0900, Alexander Nedotsukov wrote:
Alexandre,
The problem you exposed have its roots in Evo code. g_utf8_* stuffdefined to work on *utf-8* strings only and have undefined behaviour onMBCS strings. It may sound stupid but crashes are allowed in this case:-) Even we apply your latest patch the true problem solution will beonly postponed. We have to continue with Evo source. Find subject parserpart and ensure that it will be utf-8 encoded string at the end.
<rant>
While I see this as the noble goal, I fail to understand why asserts are
OK 10 lines above my latest patch, but not in this specific place. Nor
does this latest patch mask any issues in Evo -- you still get
Glib-CRITICAL assert. Hell, you even get assert without the patch if
your CHARSET is ASCII, but core dump if your CHARSET is xx_YY.UTF-8. If
we are talking noble goals here, how about some consistency in the error
handling?
</rant>

Those asserts does trivial arguments validation. Yours are slightly morethan trivial which may be against existing glib practice. In any case Iam against introducing such change locally. It lead to differentexecution paths in callers compared to other platforms. Also in acurrent form it suffer form potential memory leaks (guess where).Please save your energy for debates with glib developers if you want to.Even you convince hundredths like me it still will not make a validargument for them ;-)

On the more productive note: I have picked up more glib programming
today than I had before (or cared to) -- it's easy to improve when you
started from the veritable zero ;) However, it might not be sufficient
to do what needs to be done to Evolution. One thing you can help me
with, is to suggest glib function, I can feed arbitrary string of bytes
(either counted or zero-terminated) and it will tell me whether this is
valid UTF-8. g_print() apparently does this somehow, so there must be a
way.

See g_utf8_validate().

As promised, I'll crawl back under my rock and wait for rainy weekend to
read some Evolution code.

You better kick Evo authors into $some_painful_part_of_the_body :-)

All the best,
Alexander.

BTW: What happened to "no top posting" rule?

Alexandre "Sunny" Kovalenko wrote:

On Wed, 2008-01-09 at 20:16 -0500, Joe Marcus Clarke wrote:

On Wed, 2008-01-09 at 19:40 -0500, Alexandre "Sunny" Kovalenko wrote:

On Wed, 2008-01-09 at 12:35 -0500, Joe Marcus Clarke wrote:

On Wed, 2008-01-09 at 10:53 -0500, Alexandre "Sunny" Kovalenko wrote:

I have seen recent commit WRT string collation in devel/glib20 by
marcus, so I have decided to check if there is an interest to fix SEGV
in g_utf8_collate when it is given 8-bit non-UTF-8 string(s) to collate.

Any commits I have made in the area of UTF-8 are completely accidental.
I am not the UTF-8 guy.  Both bland and jylefort have expressed interest
in this.  Perhaps one of them will comment.

I hope so. Just in case, they would decide to, I have reduced the

situation to the small program below. I get

GLib-CRITICAL **: g_convert: assertion `str != NULL' failed

and no core dump from this simple program, whereas Evolution manages to
pass NULL to strcoll further down in g_utf8_collate and get SEGV for its
pains.

That sounds like a no-no for Evolution to be dereferencing a NULL
pointer.  Hopefully they'd fix this to prevent the problem.

It's not Evolution, it is glib, specifically g_utf8_collate, which would
call strcoll(3) blindly on the return of g_utf8_normalize inside
gunicollate.c. And now, I can get core dumped out of this simple program
as well, merely by setting CHARSET=en_US.UTF-8 (I had it is ASCII in the
terminal window, which would trigger different path within
g_utf8_collate).

Conversely, if the answer still is "Evolution should not have done
that", I will happily crawl back under my rock and keep my patch
locally.

I can't imagine you're alone in this.  But then again, any Cyrillic mail
that comes my way is always spam, so what do I know.

More importantly, it is UTF-8 spam -- in order to trigger this, you need
KOI8-R or CP1251, and in the sorted column to boot. I suspect that
Latin1 or ShiftJIS would do the trick too.

Now, how about this: would you be amenable to this Really Harmless(tm)
patch, which merely adds error checking along the lines used in the same
function, about dozen lines up ;)

--- glib/gunicollate.c.B 2008-01-09 20:48:25.000000000 -0500
+++ glib/gunicollate.c  2008-01-09 20:49:35.000000000 -0500
@@ -166,6 +166,9 @@
   str1_norm = g_utf8_normalize (str1, -1, G_NORMALIZE_ALL_COMPOSE);
   str2_norm = g_utf8_normalize (str2, -1, G_NORMALIZE_ALL_COMPOSE);

+ g_return_val_if_fail (str1_norm != NULL, 0);

+  g_return_val_if_fail (str2_norm != NULL, 0);
+
   if (g_get_charset (&charset))
     {
       result = strcoll (str1_norm, str2_norm);

I can add it to your files/extra-patch-glib_gunicollate.c, or packageit separately -- I really hate it when I start Evolution after portupgrade

to write some E-mails real quick, only to find out that I have forgotten
to patch glib... again.


_______________________________________________
freebsd-ports@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-ports
To unsubscribe, send any mail to "[EMAIL PROTECTED]"

Re: [patch] glib20, UTF-8 and string collation

Reply via email to