Hi Pádraig, Bo,

On 2008-05-08, when I mentioned the possibility of a filter program that
reads from standard input and writes the canonicalized output to standard
output, you liked the idea:
  <http://lists.gnu.org/archive/html/bug-coreutils/2008-05/msg00062.html>
  <http://lists.gnu.org/archive/html/bug-coreutils/2008-05/msg00063.html>

I have now added to gnulib a module for Unicode normalization of streams of
Unicode characters. It's called 'uninorm/filter'; the API is declared in
uninorm.h:

-------------------------------------------------------------------------------
/* Normalization of a stream of Unicode characters.

   A "stream of Unicode characters" is essentially a function that accepts an
   ucs4_t argument repeatedly, optionally combined with a function that
   "flushes" the stream.  */

/* Data type of a stream of Unicode characters that normalizes its input
   according to a given normalization form and passes the normalized character
   sequence to the encapsulated stream of Unicode characters.  */
struct uninorm_filter;

/* Create and return a normalization filter for Unicode characters.
   The pair (stream_func, stream_data) is the encapsulated stream.
   stream_func (stream_data, uc) receives the Unicode character uc
   and returns 0 if successful, or -1 with errno set upon failure.
   Return the new filter, or NULL with errno set upon failure.  */
extern struct uninorm_filter *
       uninorm_filter_create (uninorm_t nf,
                              int (*stream_func) (void *stream_data, ucs4_t uc),
                              void *stream_data);

/* Stuff a Unicode character into a normalizing filter.
   Return 0 if successful, or -1 with errno set upon failure.  */
extern int
       uninorm_filter_write (struct uninorm_filter *filter, ucs4_t uc);

/* Bring data buffered in the filter to its destination, the encapsulated
   stream.
   Return 0 if successful, or -1 with errno set upon failure.
   Note! If after calling this function, additional characters are written
   into the filter, the resulting character sequence in the encapsulated stream
   will not necessarily be normalized.  */
extern int
       uninorm_filter_flush (struct uninorm_filter *filter);

/* Bring data buffered in the filter to its destination, the encapsulated
   stream, then close and free the filter.
   Return 0 if successful, or -1 with errno set upon failure.  */
extern int
       uninorm_filter_free (struct uninorm_filter *filter);
-------------------------------------------------------------------------------

With this, you can easily create a program that reads UTF-8 from stdin and
outputs it as canonicalized UTF-8 on stdout:

  - Create a "stream" that takes a Unicode character and outputs it to
    stdout. (Gnulib module 'unistr/u8-uctomb'.)
  - Wrap a Unicode normalizing filter around it.
    (Gnulib module 'uninorm/filter'.)
  - Feed it with Unicode characters from standard input.
    (Gnulib module 'unistr/u8-mbtouc'.)

I would love to see such a program in coreutils. But I am not a coreutils
maintainer.

Bruno

_______________________________________________
Bug-coreutils mailing list
Bug-coreutils@gnu.org
http://lists.gnu.org/mailman/listinfo/bug-coreutils
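
As a rough illustration of the first and third steps listed in the message
above, here is a minimal sketch in plain C. It is not the gnulib code: the
names utf8_sink, sink_put and feed_utf8 are hypothetical, ucs4_t is
redefined locally, and the UTF-8 encoder and decoder are hand-written
stand-ins for what 'unistr/u8-uctomb' and 'unistr/u8-mbtouc' provide. The
normalizing filter itself is left out so that the sketch compiles without
gnulib; in a real program it would presumably sit between the two
functions, with sink_put passed as the stream_func argument of
uninorm_filter_create.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumption: local stand-in for gnulib's ucs4_t.  */
typedef uint32_t ucs4_t;

/* Hypothetical sink: a fixed buffer that collects UTF-8 output.  */
struct utf8_sink { unsigned char buf[256]; size_t len; };

/* Stream function in the shape described for stream_func:
   return 0 if successful, or -1 with errno set upon failure.  */
static int
sink_put (void *stream_data, ucs4_t uc)
{
  struct utf8_sink *s = stream_data;
  unsigned char tmp[4];
  size_t n;

  if (uc < 0x80)
    { tmp[0] = (unsigned char) uc; n = 1; }
  else if (uc < 0x800)
    {
      tmp[0] = (unsigned char) (0xC0 | (uc >> 6));
      tmp[1] = (unsigned char) (0x80 | (uc & 0x3F));
      n = 2;
    }
  else if (uc < 0x10000 && !(uc >= 0xD800 && uc < 0xE000))
    {
      tmp[0] = (unsigned char) (0xE0 | (uc >> 12));
      tmp[1] = (unsigned char) (0x80 | ((uc >> 6) & 0x3F));
      tmp[2] = (unsigned char) (0x80 | (uc & 0x3F));
      n = 3;
    }
  else if (uc >= 0x10000 && uc < 0x110000)
    {
      tmp[0] = (unsigned char) (0xF0 | (uc >> 18));
      tmp[1] = (unsigned char) (0x80 | ((uc >> 12) & 0x3F));
      tmp[2] = (unsigned char) (0x80 | ((uc >> 6) & 0x3F));
      tmp[3] = (unsigned char) (0x80 | (uc & 0x3F));
      n = 4;
    }
  else
    { errno = EILSEQ; return -1; }   /* surrogate or out of range */

  if (s->len + n > sizeof s->buf)
    { errno = ENOMEM; return -1; }   /* sink is full */
  memcpy (s->buf + s->len, tmp, n);
  s->len += n;
  return 0;
}

/* Decode n bytes of UTF-8 at p and pass each character to the stream
   (put, data).  Malformed input is replaced by U+FFFD.  */
static int
feed_utf8 (const unsigned char *p, size_t n,
           int (*put) (void *stream_data, ucs4_t uc), void *data)
{
  size_t i = 0;

  while (i < n)
    {
      unsigned char c = p[i];
      ucs4_t uc, min;
      size_t need, k = 0;   /* continuation bytes expected / consumed */

      if (c < 0x80)                { uc = c;        need = 0; min = 0; }
      else if ((c & 0xE0) == 0xC0) { uc = c & 0x1F; need = 1; min = 0x80; }
      else if ((c & 0xF0) == 0xE0) { uc = c & 0x0F; need = 2; min = 0x800; }
      else if ((c & 0xF8) == 0xF0) { uc = c & 0x07; need = 3; min = 0x10000; }
      else                         { uc = 0xFFFD;   need = 0; min = 0; }

      while (k < need && i + 1 + k < n && (p[i + 1 + k] & 0xC0) == 0x80)
        {
          uc = (uc << 6) | (p[i + 1 + k] & 0x3F);
          k++;
        }
      /* Truncated, overlong, surrogate, or beyond U+10FFFF.  */
      if (k < need || uc < min || uc > 0x10FFFF
          || (uc >= 0xD800 && uc < 0xE000))
        uc = 0xFFFD;
      i += 1 + k;

      if (put (data, uc) < 0)
        return -1;
    }
  return 0;
}
```

A caller would zero-initialize a struct utf8_sink, pass chunks of input to
feed_utf8 with sink_put as the stream, and then write the collected bytes
out. Note that this sketch treats each chunk independently, so a character
split across two chunks would be decoded as U+FFFD; a real filter reading
stdin would need to carry incomplete sequences over to the next read.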