Hi Chen,
thanks for your feedback. Indeed, it does not make sense to optimize
UTF-8 processing for a rather vague set of beneficiaries when there are
realistic counterexamples.
Still, I don't want to give up on my idea too early :-)
I tried this modification:
* harvest pure ASCII bytes before
Hi Johannes,
I think the 3rd scenario you've mentioned is likely: we have Swedish and other
languages that extend ASCII with diacritics, so non-ASCII bytes frequently
interrupt runs of ASCII. For non-ASCII-heavy languages like
Chinese, sometimes the text can include spaces or a