On Tue, Oct 7, 2014 at 3:57 PM, Henri Sivonen <[email protected]> wrote:
> > UTF-8 strings will mean that we will have to copy all non-7-bit ASCII
> > strings between the DOM and JS.
>
> Not if JS stores strings as WTF-8. I think it would be tragic not to
> bother to try to make the JS engine use WTF-8 when having the
> opportunity to fix things and thereby miss the opportunity to use
> UTF-8 in the DOM in Servo. UTF-16 is such a mistake.

When I added Latin1 to SpiderMonkey, we did consider using UTF-8, but
it's complicated. As mentioned, we have to ensure charAt/charCodeAt stay
fast (crypto benchmarks etc. rely on this, sadly). Many other string
operations are also very perf-sensitive, and extra branches in tight
loops can hurt a lot. Also, the regular expression engine currently
emits JIT code to load and compare multiple characters at once. All of
this is fixable to work on WTF-8 strings, but it's a lot of work and
performance is a risk.

Also note that the copying we do for strings passed from JS to Gecko is
not only necessary for moving GC, but also to inflate Latin1 strings
(= most strings) to TwoByte Gecko strings. If Servo or Gecko could deal
with both Latin1 and TwoByte strings, we could think about ways to avoid
the copying. Though, as Boris said, I'm not aware of any
(non-micro-)benchmark regressions from the copying, so I don't expect
big wins from optimizing this.

But again, a Latin1 -> TwoByte copy is a very tight loop that compilers
can probably vectorize. UTF-8/WTF-8 -> TwoByte is more complicated and
probably slower.

Jan
_______________________________________________
dev-servo mailing list
[email protected]
https://lists.mozilla.org/listinfo/dev-servo
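To illustrate the charCodeAt concern Jan raises: over Latin1 or TwoByte storage, a UTF-16 index maps to a single array load, while over WTF-8 it requires a scan from the start of the string (or an index side table). The following is a minimal Rust sketch of that difference; the function names are hypothetical and not SpiderMonkey's actual API.

```rust
// Sketch: constant-time charCodeAt over fixed-width storage versus a
// linear scan over WTF-8. Names are illustrative, not SpiderMonkey's API.

/// TwoByte storage: UTF-16 index i maps directly to one array element.
fn char_code_at_two_byte(units: &[u16], i: usize) -> u16 {
    units[i] // single bounds-checked load, O(1)
}

/// WTF-8 storage: code points are variable width, so reaching UTF-16
/// index i means decoding from the start, O(i), unless extra index
/// structures are maintained.
fn char_code_at_wtf8(bytes: &[u8], i: usize) -> u16 {
    let mut utf16_index = 0;
    let mut pos = 0;
    while pos < bytes.len() {
        let b = bytes[pos];
        // Decode one WTF-8 sequence (1 to 4 bytes) into a code point.
        let (len, cp) = if b < 0x80 {
            (1, b as u32)
        } else if b < 0xE0 {
            (2, ((b as u32 & 0x1F) << 6) | (bytes[pos + 1] as u32 & 0x3F))
        } else if b < 0xF0 {
            (3, ((b as u32 & 0x0F) << 12)
                | ((bytes[pos + 1] as u32 & 0x3F) << 6)
                | (bytes[pos + 2] as u32 & 0x3F))
        } else {
            (4, ((b as u32 & 0x07) << 18)
                | ((bytes[pos + 1] as u32 & 0x3F) << 12)
                | ((bytes[pos + 2] as u32 & 0x3F) << 6)
                | (bytes[pos + 3] as u32 & 0x3F))
        };
        // Supplementary code points occupy two UTF-16 code units.
        let units = if cp >= 0x10000 { 2 } else { 1 };
        if i < utf16_index + units {
            return if units == 1 {
                cp as u16
            } else if i == utf16_index {
                (0xD800 + ((cp - 0x10000) >> 10)) as u16 // high surrogate
            } else {
                (0xDC00 + ((cp - 0x10000) & 0x3FF)) as u16 // low surrogate
            };
        }
        utf16_index += units;
        pos += len;
    }
    panic!("index out of range");
}

fn main() {
    // 'a' U+0061, 'é' U+00E9, '𝄞' U+1D11E (a surrogate pair in UTF-16).
    let s = "aé𝄞";
    let units: Vec<u16> = s.encode_utf16().collect();
    let bytes = s.as_bytes(); // well-formed UTF-8 is also well-formed WTF-8
    for i in 0..units.len() {
        assert_eq!(char_code_at_wtf8(bytes, i), char_code_at_two_byte(&units, i));
    }
}
```

The per-byte branching in the WTF-8 path is exactly the kind of extra work in a tight loop that the email warns about; real engines would amortize it with caches or chunked index tables, which is part of why the change is "a lot of work".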
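On the last point about inflation cost: a hedged Rust sketch of the two conversions, with hypothetical function names (not Gecko's actual API). Latin1 → TwoByte is a pure widening loop, since every Latin1 byte is already the UTF-16 code unit; UTF-8 → TwoByte needs a data-dependent branch per character plus surrogate-pair expansion.

```rust
// Sketch of the two copies discussed above; names are illustrative.

/// Latin1 -> TwoByte: each byte IS the code unit, so this is a pure
/// zero-extension loop that autovectorizers handle well
/// (e.g. pmovzxbw on x86).
fn inflate_latin1(src: &[u8]) -> Vec<u16> {
    src.iter().map(|&b| b as u16).collect()
}

/// UTF-8 -> TwoByte: variable-width input forces a branch per decoded
/// scalar value and possible surrogate-pair output, which is much harder
/// to vectorize and generally slower.
fn inflate_utf8(src: &str) -> Vec<u16> {
    src.encode_utf16().collect()
}

fn main() {
    // Latin1 bytes for "café": 0x63 0x61 0x66 0xE9 (4 bytes).
    assert_eq!(inflate_latin1(&[0x63, 0x61, 0x66, 0xE9]),
               vec![0x63, 0x61, 0x66, 0xE9]);
    // The same text as UTF-8 is 5 bytes ('é' is 0xC3 0xA9), and the
    // decoder must merge those bytes back into one code unit.
    assert_eq!(inflate_utf8("café"), vec![0x63, 0x61, 0x66, 0xE9]);
}
```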

