### Description

In Unicode, some CJK characters such as 化 have a single codepoint but are 
drawn differently in Simplified Chinese (<span 
lang="zh-Hans">化</span>), Traditional Chinese (<span 
lang="zh-Hant">化</span>), and Japanese (<span 
lang="ja">化</span>). This issue is known as 
[Han unification](https://en.wikipedia.org/wiki/Han_unification), 
and it has appeared over the years [in many software 
projects](https://issues.chromium.org/issues/41315603). On the frontend, we 
can display names correctly using an HTML attribute such as `lang="zh-Hant"`.
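
To illustrate the frontend side, here is a minimal sketch of wrapping a name in a `lang`-tagged span. The helper name `name_with_lang` is hypothetical and is not taken from this patch; it only shows the idea of emitting the attribute while escaping the name.

```ruby
require "erb"

# Hypothetical helper (not from the patch): wraps a name in a span that
# carries a lang attribute so the browser picks region-appropriate glyphs.
def name_with_lang(name, lang)
  escaped = ERB::Util.html_escape(name)
  # Without a language tag, leave the name as plain escaped text.
  return escaped if lang.nil?

  %(<span lang="#{ERB::Util.html_escape(lang)}">#{escaped}</span>)
end
```

For example, `name_with_lang("彰化", "zh-Hant")` produces a span tagged `zh-Hant`, while passing `nil` as the language leaves the name untagged.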

This was addressed in iD (https://github.com/openstreetmap/iD/pull/10716) and 
is the subject of a long-running discussion in openstreetmap-carto.

If we add `&addressdetails=1` to Nominatim queries, we can read the 
`country_code` and display the best label for mainland China, Hong Kong, 
Japan, or Taiwan.
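
As a rough sketch of that lookup, the snippet below maps an address hash to a language tag. The method name, the constant, the exact tags chosen, and the handling of Hong Kong through an `ISO3166-2` subdivision value are all assumptions made for illustration; the actual patch may do this differently.

```ruby
# Hypothetical mapping from country code to language tag; the tags used
# in the real patch may differ.
CJK_LANG_BY_COUNTRY = {
  "cn" => "zh-Hans",
  "hk" => "zh-HK",
  "jp" => "ja",
  "tw" => "zh-Hant"
}.freeze

# Returns a language tag for an address hash, or nil when no rule applies.
def lang_for_address(address)
  return nil if address.nil?

  # Assumption: Hong Kong may be reported as a subdivision of "cn", in a
  # key whose name starts with "ISO3166-2".
  subdivisions = address.select { |key, _| key.to_s.start_with?("ISO3166-2") }.values
  return "zh-HK" if subdivisions.include?("CN-HK")

  CJK_LANG_BY_COUNTRY[address["country_code"].to_s.downcase]
end
```

Returning `nil` for every other country means those results stay untagged and keep the browser's default rendering.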

### How has this been tested?

This can be tricky to test, as **many names do not change**, and the 
`display_name` will be in your browser's language if it's available.

- Search results will have a `lang` attribute, such as `lang="zh-HK"` or 
`lang="ja"`, regardless of the language of `display_name`
- In Taiwan, a search result for <span 
lang="zh-Hant">彰化</span> should show a horizontal bar in 
<span lang="zh-Hant">化</span>
- In mainland China, a search result for <span 
lang="zh-Hans">玉门 expressway</span> should show the split-frame 
form <span lang="zh-Hans">门</span> as the second 
character, not the 门 with a +

### Notes

As an alternative to adding `&addressdetails=1` to queries, we could 
possibly parse `display_name` (which varies with the browser language) or use 
geographic bounding boxes?

This matching of languages is imperfect, but without a language tag we 
always fall back to the browser's default rendering for every CJK character. 
It would be difficult to make exceptions (for example, Japanese restaurants 
in these countries) without a name regex, a language tag, or access to other 
tags.

This does not affect Chinese names in other countries.

I have heard that there are some variations for Cyrillic in 
[Bulgaria](https://en.wikipedia.org/wiki/Bulgarian_alphabet) and 
[Serbia](https://en.wikipedia.org/wiki/Serbian_Cyrillic_alphabet#Differences_from_other_Cyrillic_alphabets), 
particularly in italics, but I don't know how universal they are. [Additional 
info](https://commons.wikimedia.org/wiki/File:Special_Cyrillics_BGDPT.svg)

You can view, comment on, or merge this pull request online at:

  https://github.com/openstreetmap/openstreetmap-website/pull/6079

-- Commit Summary --

  * add lang attribute to results from CJK countries, plus Cyrillic
  * remove Bulgaria/Serbia for now
  * fix HK subregion

-- File Changes --

    M app/controllers/concerns/nominatim_methods.rb (2)
    M app/controllers/searches/nominatim_queries_controller.rb (7)
    M app/helpers/geocoder_helper.rb (2)

-- Patch Links --

https://github.com/openstreetmap/openstreetmap-website/pull/6079.patch
https://github.com/openstreetmap/openstreetmap-website/pull/6079.diff
