On Thu, Apr 2, 2009 at 9:49 AM, Ray Saintonge <sainto...@telus.net> wrote:
> Aryeh Gregor wrote:
>> On Wed, Apr 1, 2009 at 11:32 AM, Ziko van Dijk <zvand...@googlemail.com>
>> wrote:
>>
>>> I am sceptical about automatic conversion. As you said, it is mainly a
>>> solution for reading, but not for writing, because the source text is in one
>>> specific spelling or character system.
>>>
>> Why couldn't that be converted on the fly as well? Choose one variant
>> as the canonical one, and store only that in the database. Anyone
>> wanting to use other formats would have the text in the edit box
>> automatically converted to their preferred variant on the fly, and
>> converted back when they saved.
>
> When you declare one version canonical the risk is that you will have
> supporters of the losing version(s) becoming irrationally angry.

Not just that... It is computationally unsustainable. Even in the simplest cases, such as Serbian script conversion, the conversion is not fully reversible (although the loss is small and an approximation works well enough). So, one of the simplest cases involves the following:

* The Serbian Cyrillic alphabet is usually thought to carry more information than Serbian Latin. In Cyrillic, the sound "dzh" is written with the single letter "џ", while in Latin it is written as the digraph "dž". However, there are cases where the sequence "d" + "zh" is a regular combination of two separate sounds: Cyrillic writes it as "дж", while Latin writes it exactly like the single sound "dzh", as "dž". This means that if you keep the text in Cyrillic as the canonical version, you can regenerate the Latin, but not vice versa.

* However, because of those digraphs, Latin distinguishes an initial capital from all-capital (heading) text. If you convert the Cyrillic capital letter "Џ" into Latin, you get "Dž" as its counterpart. But if it is part of an all-capital heading, say "ЏАК", you get "DžAK", while the correct form is "DŽAK". Of course, this can be solved by testing whether the surrounding letters are capitals (and it is not a big deal in Serbian anyway).

However, this is a very simple case for conversion rules. Usually, it is much cheaper to do the conversion when text is added or changed and to keep both versions in the database, since there are two different sets of conversion rules.

The other option is to keep a single meta text in the database, with internal markup. The previous example might then look like "{Latin: {DŽ}AK}".

And, of course, if there are more than two script or orthography versions (Kurdish is an example), conversion rules would be needed for every combination. Many generalizations are possible, but not all of the rules can be generalized.
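
To make the two Serbian points concrete, here is a minimal sketch of a Cyrillic-to-Latin converter that tests the neighbouring letters to pick between "Dž" and "DŽ". The names CYR_TO_LAT, cyr_to_lat and _neighbor_is_upper are made up for this illustration, and the neighbour test is only one assumed heuristic, not how any existing converter necessarily works.

```python
# Lowercase Serbian Cyrillic -> Latin table. The digraphs are stored as two
# ordinary characters ("d" + "ž"), not as single digraph code points.
CYR_TO_LAT = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "ђ": "đ",
    "е": "e", "ж": "ž", "з": "z", "и": "i", "ј": "j", "к": "k",
    "л": "l", "љ": "lj", "м": "m", "н": "n", "њ": "nj", "о": "o",
    "п": "p", "р": "r", "с": "s", "т": "t", "ћ": "ć", "у": "u",
    "ф": "f", "х": "h", "ц": "c", "ч": "č", "џ": "dž", "ш": "š",
}


def _neighbor_is_upper(text: str, i: int) -> bool:
    """Assumed heuristic: a capital next to position i means heading text."""
    prev_up = i > 0 and text[i - 1].isupper()
    next_up = i + 1 < len(text) and text[i + 1].isupper()
    return prev_up or next_up


def cyr_to_lat(text: str) -> str:
    """Convert Serbian Cyrillic to Latin, choosing the case of digraphs."""
    out = []
    for i, ch in enumerate(text):
        lat = CYR_TO_LAT.get(ch.lower())
        if lat is None:
            out.append(ch)                 # not in the table: pass through
        elif not ch.isupper():
            out.append(lat)                # lowercase letter
        elif len(lat) == 2 and _neighbor_is_upper(text, i):
            out.append(lat.upper())        # heading context: "DŽ"
        else:
            out.append(lat.capitalize())   # initial capital: "Dž"
    return "".join(out)


print(cyr_to_lat("ЏАК"))        # DŽAK
print(cyr_to_lat("Џак"))        # Džak
print(cyr_to_lat("надживети"))  # nadživeti
```

Note that "џ" and "дж" both come out as "dž", so the Latin text alone no longer says which Cyrillic spelling it came from. That is the irreversibility described in the first point: going back from Latin would need a word list or markup of some kind.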