Re: utf-8-strings

Thomas Morley Sun, 08 Jul 2012 04:39:45 -0700

2012/7/8 David Kastrup <d...@gnu.org>:
> Thomas Morley <thomasmorle...@googlemail.com> writes:
>
>> Hi,
>>
>> together with Arnold I worked on a method to compress or stretch a
>> text by changing only the space between characters, i.e. the letters
>> themselves shouldn't be scaled.
>> (This came out of a discussion on the German LilyPond forum:
>> http://www.lilypondforum.de/index.php?topic=1152.0 )
>>
>> The difficulty is to write a function that turns a string into a
>> list of single-character strings and works with accented letters,
>> German umlauts, non-European scripts etc.,
>> e.g.:
>> "áèçäöüテスト" → '("á" "è" "ç" "ä" "ö" "ü" "テ" "ス" "ト")
>>
>> We came up with the attached code.
>>
>> Problems:
>> Unicode keeps growing, so the code needs updating from time to time.
>> Once LilyPond uses Guile 2.0 the situation may be completely
>> different. (I haven't a clue about Guile 2.0.)
>>
>> What do you think?
>> Or, to put it differently: are there any objections to turning it
>> into a patch?
>
> Several observations:
>
> a) guilev2 is going to become a definite issue this year.  We may either
>    decide to support both guilev1 and guilev2, or ditch guilev1 support
>    completely.
>
>    So it does not make sense to design a solution that is not easy to
>    support with guilev2.
>
> b) LilyPond's lexer goes to considerable lengths to not let any invalid
>    utf8 pass into strings.  It would be reasonably straightforward, if
>    required, to make sure that this also holds for embedded Scheme.  In
>    that case, the only way to arrive at invalid utf-8 would be
>    synthesizing strings in Scheme from bytes.  So I'd not bother about
>    invalid utf-8.  This means that, diacriticals apart, you can just
>    split the string before any byte outside the hex range 80-bf (the
>    utf-8 continuation bytes).
>
> This can basically be done using charsets.  I tried doing this with
> regexps, but curiously enough, in contrast to Guile proper, those appear
> to be already utf-8 aware, so
>
> #(use-modules (ice-9 regex))
>
> #(define (utf8-substrings str)
>    (define char-pat (make-regexp "."))
>    (map match:substring (list-matches char-pat str)))
>
> #(write (utf8-substrings "áèçäöüテスト"))
>
> works just fine (if you overlook the fact that write misbehaves, writing
> some byte codes quoted as \xhh inside of a string and others literally).
>
> --
> David Kastrup
>
>
> _______________________________________________
> lilypond-devel mailing list
> lilypond-devel@gnu.org
> https://lists.gnu.org/mailman/listinfo/lilypond-devel


Wow!
Following your suggestion I managed to drop about 300 lines, reducing
it to a quarter of the original.
You definitely should earn more money!!

Of course I had to redefine `string-list->string'. I used recursion,
which was the best I could think of.
(`string-list->string' isn't used here, but I need it elsewhere)
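
For readers without the attachment, the byte-level rule from point b)
above (start a new piece before every byte outside hex 80-bf) can be
sketched roughly as follows. This is Python rather than Guile, only to
illustrate the rule independently of the Guile version; it is not the
attached code, the function names are hypothetical, and it assumes
valid utf-8 input and ignores combining diacriticals.

```python
def utf8_substrings(data: bytes) -> list[str]:
    """Split valid UTF-8 bytes into one string per code point.

    Assumption: `data` is valid UTF-8. Combining diacriticals are not
    merged with their base character.
    """
    chunks: list[bytearray] = []
    for byte in data:
        if 0x80 <= byte <= 0xBF and chunks:
            # Continuation byte: it belongs to the current character.
            chunks[-1].append(byte)
        else:
            # Any other byte starts a new character.
            chunks.append(bytearray([byte]))
    return [bytes(chunk).decode("utf-8") for chunk in chunks]


def string_list_to_string(parts: list[str]) -> str:
    """Inverse operation: join the single-character strings again."""
    return "".join(parts)


text = "áèçäöüテスト"
parts = utf8_substrings(text.encode("utf-8"))
print(parts)
print(string_list_to_string(parts) == text)
```

Joining the pieces needs no recursion here; in Scheme, applying
`string-append' to the list would presumably do the same job.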

Would you agree if I turned it into a patch?
I think `string->string-list' and `string-list->string' are very
useful tools and `char-space' might be of interest, too.


Thanks a lot,
  Harm

utf-8-strings-rev-02.ly
Description: Binary data
