On 6/28/19 4:25 PM, Tobiah wrote:
A guy comes in and enters his last name as RÖnngren.

So what did the browser really give me; is it encoded
in some way, like latin-1?  Does it depend on whether
the name was cut and pasted from a Word doc. etc?
Should I handle these internally as unicode?  Right
now my database tables are latin-1 and things seem
to usually work, but not always.

Also, what do people do when searching for a record.
Is there some way to get 'Ronngren' to match the other
possible foreign spellings?

The first thing I'd want to do is to produce a front-end that discovers the character set of the input (latin-1, whatever) and converts it to one standard encoding, UTF-8. e.g.:

   data.decode('latin1').encode('utf8')

That gets rid of character set variations in the data, simplifying things before any of the hard work has to be done.
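A slightly fuller sketch of that front-end, for Python 3 where the raw form data arrives as bytes. The function name to_utf8 and the order of the guesses are assumptions made for illustration: strict UTF-8 is tried first because text in other encodings rarely decodes as valid UTF-8, and latin-1 is the fallback because it accepts every byte (so it can silently produce the wrong characters if the input was really something else).

```python
def to_utf8(raw: bytes) -> bytes:
    """Return raw re-encoded as UTF-8, guessing the source encoding."""
    try:
        # Strict decode: raises on byte sequences that are not valid UTF-8.
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        # Assumed fallback: latin-1 maps every byte to a character,
        # so this never raises, but it is only a guess.
        text = raw.decode("latin-1")
    return text.encode("utf-8")


# The same name submitted as latin-1 bytes and as UTF-8 bytes
# ends up as identical UTF-8.
from_latin1 = to_utf8(b"R\xf6nngren")
from_utf8 = to_utf8("Rönngren".encode("utf-8"))
```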

Then you have a choice - store and index everything as UTF-8, or transliterate some or all strings to 7-bit US ASCII. Either way, you may have to perform the same processing on input search strings.

I have not used it myself but there is a Python port of a Perl module by Sean M. Burke called Unidecode. It will transliterate non-ASCII strings into US ASCII using reasonable substitutions for the non-ASCII characters. I believe that there are other packages that can also do this.
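For a feel of what such transliteration does, here is a rough sketch using only the standard library's unicodedata module. This is not Unidecode and not how Unidecode works: it is a cruder technique that decomposes each character (NFKD) and drops the combining marks. Characters with no decomposition, such as 'ø', pass through unchanged, which a table-driven package would handle better. The name strip_accents is made up for this example.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Drop combining marks (accents, umlauts) after NFKD decomposition."""
    decomposed = unicodedata.normalize("NFKD", text)
    # unicodedata.combining() returns non-zero for combining marks.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


plain = strip_accents("Rönngren")
```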

The easy way to use packages like this is to transliterate entire records before putting them into your database, but then you may perplex or even offend some users who will look at a record and say "What's this? That's not French!" You'll also have to transliterate all input search strings.

A more sophisticated way is to leave the records in Unicode, but add transliterated index strings for any index strings that wind up containing non-ASCII characters.

There are various ways to do this that trade off time, space, and programming effort. You can store two versions of each record, search one and display the other. You can just process index strings and add the transliterations to the record. What to choose depends on your needs and resources.
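One of these variants, keeping the original string for display and adding a transliterated key for searching, might look like the sketch below. Everything here is an assumption for illustration: the in-memory SQLite database, the people table, the name_key column, and the helpers search_key, add_person and find_people. The key is built with the standard library (accent stripping plus case folding) rather than a transliteration package.

```python
import sqlite3
import unicodedata


def search_key(text: str) -> str:
    """Hypothetical helper: accent-stripped, case-folded form used only for matching."""
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


conn = sqlite3.connect(":memory:")
# name is what users see; name_key is what searches run against.
conn.execute("CREATE TABLE people (name TEXT, name_key TEXT)")
conn.execute("CREATE INDEX people_key ON people (name_key)")


def add_person(name: str) -> None:
    conn.execute("INSERT INTO people VALUES (?, ?)", (name, search_key(name)))


def find_people(query: str) -> list:
    # The search string gets the same processing as the stored key.
    rows = conn.execute(
        "SELECT name FROM people WHERE name_key = ?", (search_key(query),)
    )
    return [row[0] for row in rows]


add_person("Rönngren")
add_person("Smith")
matches = find_people("Ronngren")
```

The cost is one extra column and index per searchable field; the displayed record is never altered.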

And of course all bets are off if some of your data is Chinese, Japanese, Hebrew, or maybe even Russian or Greek.

Sometimes I think, Why don't we all just learn Esperanto? But we all know that that isn't going to happen.

    Alan
--
https://mail.python.org/mailman/listinfo/python-list
