On Sun, Apr 2, 2017 at 9:25 AM, Mikhail V <mikhail...@gmail.com> wrote:
> On 2 April 2017 at 00:22, Chris Angelico <ros...@gmail.com> wrote:
>> On Sun, Apr 2, 2017 at 8:16 AM, Mikhail V <mikhail...@gmail.com> wrote:
>>> For multiple-alphabet rendering I will use some
>>> custom text format, e.g. with tags
>>> <s="Voynich"> ... </s>, and for latin
>>> <s="Latin">...</s> and etc.
>>>
>>> Simple and effective.
>>
>> For multi-alphabet rendering, I would rather use an even simpler
>> format: Remove the tags and use a consistent encoding.
>
> No, flat encoding would not be simpler. It would be simpler only
> if you took a text with several alphabets and mixed the data randomly.
> In real situations, data chunks that use different glyph sets for
> representation are not mixed in a random manner.
> Also, for different processing purposes a tagged structure will be way
> more effective, e.g. if I want to extract all chunks in alphabet A
> into a single list of strings, or use advanced search, etc.

https://github.com/Rosuav/LetItTrans/blob/master/25%20languages.srt

Not exactly random, but that's a single file, a single document, using
characters from several different scripts. And this is far from the
only case of this sort of thing happening.
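
As for extracting "all chunks in alphabet A": with one consistent
encoding, the characters themselves already say which script they
belong to, so no tags are needed. Here is a rough sketch of the idea.
The names script_of and runs_by_script are made up for this example,
and taking the first word of the Unicode character name is only an
approximation of the real Script property, which the standard library
doesn't expose directly.

```python
import unicodedata

def script_of(ch):
    """Rough script guess: the first word of the Unicode character name.

    This is an approximation, not the real Unicode Script property.
    Non-letters (spaces, digits, punctuation) are treated as neutral.
    """
    if not ch.isalpha():
        return None
    return unicodedata.name(ch, "UNKNOWN").split()[0]

def runs_by_script(text):
    """Split text into (script, chunk) runs.

    Neutral characters are attached to whichever run is current,
    so "Hello, world" stays in one LATIN chunk.
    """
    runs = []  # each entry is [script, chunk]
    for ch in text:
        script = script_of(ch)
        if runs and (script is None or runs[-1][0] in (None, script)):
            if runs[-1][0] is None:
                # A run that began with neutral characters adopts
                # the script of the first letter that follows.
                runs[-1][0] = script
            runs[-1][1] += ch
        else:
            runs.append([script, ch])
    return [(script, chunk.strip()) for script, chunk in runs]

sample = "Hello мир שלום"
for script, chunk in runs_by_script(sample):
    print(script, repr(chunk))

# All chunks in one "alphabet", collected into a single list of strings:
cyrillic = [chunk for script, chunk in runs_by_script(sample)
            if script == "CYRILLIC"]
```

The same loop works on a file like the one linked above, since it
never needs to know in advance which scripts will show up.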

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list