Re: Handle foreign character web input

Alan Meyer via Python-list Sat, 29 Jun 2019 13:03:58 -0700

On 6/28/19 4:25 PM, Tobiah wrote:

A guy comes in and enters his last name as RÖnngren.


So what did the browser really give me; is it encoded
in some way, like latin-1?  Does it depend on whether
the name was cut and pasted from a Word doc. etc?
Should I handle these internally as unicode?  Right
now my database tables are latin-1 and things seem
to usually work, but not always.

Also, what do people do when searching for a record.
Is there some way to get 'Ronngren' to match the other
possible foreign spellings?

The first thing I'd want to do is to produce a front-end to discover the character set (latin-1, whatever) and convert it to standard UTF-8, e.g.:


   data.decode('latin1').encode('utf8')

That gets rid of character set variations in the data, simplifying things before any of the hard work has to be done.
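A minimal sketch of that front-end step, assuming Python 3 and assuming the input arrives as raw bytes in either UTF-8 or Latin-1. The try-UTF-8-first heuristic and the function names `to_text` and `to_utf8` are my own illustration, not a general charset detector:

```python
def to_text(raw: bytes) -> str:
    """Decode submitted bytes to str: try UTF-8 first, fall back to Latin-1."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a code point, so this never raises,
        # but the result is only correct if the input really was Latin-1.
        return raw.decode("latin-1")


def to_utf8(raw: bytes) -> bytes:
    """Normalize incoming bytes to UTF-8 for storage."""
    return to_text(raw).encode("utf-8")


if __name__ == "__main__":
    print(to_utf8("RÖnngren".encode("latin-1")))
```

UTF-8 is tried first because text in other single-byte encodings rarely happens to be valid UTF-8, whereas a Latin-1 decode accepts anything and so cannot signal a wrong guess.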

Then you have a choice - store and index everything as utf-8, or transliterate some or all strings to 7 bit US ASCII. You may have to perform the same processing on input search strings.

I have not used it myself but there is a Python port of a Perl module by Sean M. Burke called Unidecode. It will transliterate non-US ASCII strings into ASCII using reasonable substitutions of non-ASCII sequences. I believe that there are other packages that can also do this.
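To give a feel for what transliteration does, here is a rough standard-library stand-in. It is not Unidecode and is much cruder: it only strips accents that Unicode can decompose, and simply drops anything else that is not ASCII (so letters such as "ø" or "ß" are lost, not substituted). The function names are hypothetical:

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Remove combining marks after decomposing, e.g. 'Ö' becomes 'O'."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def ascii_key(text: str) -> str:
    """Accent-stripped, ASCII-only, case-folded form for comparisons."""
    stripped = strip_accents(text)
    # Characters with no ASCII decomposition are dropped here, which is
    # exactly where a real transliteration package does a better job.
    return stripped.encode("ascii", "ignore").decode("ascii").casefold()


if __name__ == "__main__":
    print(strip_accents("RÖnngren"))
    print(ascii_key("RÖnngren"))
```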

The easy way to use packages like this is to transliterate entire records before putting them into your database, but then you may perplex or even offend some users who will look at a record and say "What's this? That's not French!" You'll also have to transliterate all input search strings.

A more sophisticated way is to leave the records in Unicode, but add transliterated index strings for those index strings that wind up containing utf-8 non-ASCII chars.

There are various ways to do this that trade off time, space, and programming effort. You can store two versions of each record, search one and display the other. You can just process index strings and add the transliterations to the record. What to choose depends on your needs and resources.
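The "search one, display the other" idea can be sketched with an in-memory index. This is only an illustration under my own assumptions: the `NameIndex` class and `search_key` helper are hypothetical, the key is built with a crude standard-library accent strip, and a real system would keep the key in an indexed database column instead of a dict:

```python
import unicodedata
from collections import defaultdict


def search_key(text: str) -> str:
    """ASCII, case-folded key used only for matching, never for display."""
    decomposed = unicodedata.normalize("NFKD", text)
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return no_marks.encode("ascii", "ignore").decode("ascii").casefold()


class NameIndex:
    """Keeps the original Unicode names, looked up by transliterated key."""

    def __init__(self) -> None:
        self._by_key = defaultdict(list)

    def add(self, name: str) -> None:
        # Store the name exactly as entered; only the key is transliterated.
        self._by_key[search_key(name)].append(name)

    def find(self, query: str) -> list:
        # The query gets the same processing as the stored names.
        return list(self._by_key.get(search_key(query), []))


if __name__ == "__main__":
    index = NameIndex()
    index.add("RÖnngren")
    index.add("Ronngren")
    print(index.find("ronngren"))
```

The user still sees "RÖnngren" as entered, while a search for "Ronngren" finds it.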

And of course all bets are off if some of your data is Chinese, Japanese, Hebrew, or maybe even Russian or Greek.

Sometimes I think, Why don't we all just learn Esperanto? But we all know that that isn't going to happen.


    Alan
--
https://mail.python.org/mailman/listinfo/python-list
