GUILE 2/3 and string encoding cost

Han-Wen Nienhuys Wed, 22 Jan 2020 01:00:57 -0800

I looked a bit through the GUILE source code to see what is going on.

I believe our current hypothesis (LilyPond's slowdown is caused by
expensive unicode transcoding into 32-bit strings) is incorrect.


If you look into the source code, you can see that the UTF-8 -> SCM
conversion first checks whether any code point is above 255:


https://git.savannah.nongnu.org/cgit/guile.git//tree/libguile/strings.c/?id=1b8e9ca0e37fab366435436995248abdfc780a10#n1620

If there are none, it uses Latin1 encoding ("narrow == 1") to store the
string as a normal byte array. This code walks the string twice, but the
second pass is very cheap due to CPU cache locality, so it should be
essentially equivalent to whatever GUILE 1.8 was doing.
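
To make the two-pass idea concrete, here is a rough sketch in Python (not the
GUILE code itself). The function name and the ("narrow"/"wide", payload)
return shape are made up for illustration; only the "scan for code points
over 255, then pick the storage width" logic is taken from the description
above.

```python
from array import array


def decode_utf8_two_pass(data: bytes):
    """Hypothetical helper mimicking a narrow/wide string decision.

    Returns ("narrow", bytes) when every code point fits in one byte,
    otherwise ("wide", array of unsigned ints, one per code point).
    """
    text = data.decode("utf-8")

    # Pass 1: scan for any code point above 255 (outside Latin1).
    is_narrow = all(ord(ch) <= 0xFF for ch in text)

    # Pass 2: store the characters in the chosen representation.
    if is_narrow:
        # One byte per character; pure ASCII input comes out byte-identical.
        return "narrow", text.encode("latin-1")
    # Assumption: array("I") stands in for a 32-bit code point buffer;
    # its item size is platform dependent (usually 4 bytes).
    return "wide", array("I", (ord(ch) for ch in text))


if __name__ == "__main__":
    print(decode_utf8_two_pass(b"c'4 d'8 e'")[0])
    print(decode_utf8_two_pass("Dvořák".encode("utf-8"))[0])
```

For ASCII-only input, such as LilyPond identifiers and glyph names, the
narrow branch is always taken.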

The conversion in the other direction is here:
https://git.savannah.nongnu.org/cgit/guile.git//tree/libguile/strings.c/?id=1b8e9ca0e37fab366435436995248abdfc780a10#n2065

As you can see, if the string is narrow (Latin1/ASCII), it takes the cheap
path as well.
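
A similar sketch for the SCM -> UTF-8 direction, again in Python and again
only an illustration: `encode_to_utf8` and its (kind, payload) arguments are
hypothetical, not GUILE's API. The point is that a narrow string needs no
32-bit handling at all, and ASCII bytes pass through unchanged.

```python
def encode_to_utf8(kind: str, payload) -> bytes:
    """Hypothetical helper: turn a narrow or wide string back into UTF-8."""
    if kind == "narrow":
        # Cheap path: each byte is one Latin1 code point. Bytes below 128
        # are copied as-is; bytes 128..255 become two-byte UTF-8 sequences.
        return bytes(payload).decode("latin-1").encode("utf-8")
    # Wide path: payload is a sequence of integer code points.
    return "".join(chr(cp) for cp in payload).encode("utf-8")


if __name__ == "__main__":
    print(encode_to_utf8("narrow", b"noteheads.s2"))
    print(encode_to_utf8("wide", [0x44, 0x76, 0x6F, 0x159, 0xE1, 0x6B]))
```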

LilyPond internally doesn't use any Unicode strings: all our identifiers
are pure ASCII, as are internal strings (e.g. font glyph names). This
means that files that do not use Unicode characters at all should have the
same overhead for strings as GUILE 1.8.

Even so, if the input file does use UTF-8, there should be little overhead,
because the number of texts that we process is always small. LilyPond is
not a text processor.

So, what hard data do we have on GUILE 2/3 slowness, and what does that
data say?

-- 
Han-Wen Nienhuys - hanw...@gmail.com - http://www.xs4all.nl/~hanwen
