Re: [PATCH] Completed unaccent dictionary with many missing characters

2023-01-31 Thread vignesh C
On Mon, 16 Jan 2023 at 20:07, vignesh C wrote: > > On Fri, 4 Nov 2022 at 04:59, Ian Lawrence Barwick wrote: > > > > On Wed, 13 Jul 2022 at 19:13, Przemysław Sztoch wrote: > > > > > > Dear Michael P., > > > > > > 3. The matter is not that simple. When I change priorities (i.e. > > > Latin-ASCII.xml is less import

Re: [PATCH] Completed unaccent dictionary with many missing characters

2023-01-16 Thread vignesh C
On Fri, 4 Nov 2022 at 04:59, Ian Lawrence Barwick wrote: > > On Wed, 13 Jul 2022 at 19:13, Przemysław Sztoch wrote: > > > > Dear Michael P., > > > > 3. The matter is not that simple. When I change priorities (i.e. > > Latin-ASCII.xml is less important than Unicode decomposition), > > then "U+33D7" changes not t

Re: [PATCH] Completed unaccent dictionary with many missing characters

2022-11-03 Thread Ian Lawrence Barwick
On Wed, 13 Jul 2022 at 19:13, Przemysław Sztoch wrote: > > Dear Michael P., > > 3. The matter is not that simple. When I change priorities (i.e. > Latin-ASCII.xml is less important than Unicode decomposition), > then "U+33D7" changes not to pH but to PH. > In the end, I left it like it was before ... > > If you

Re: [PATCH] Completed unaccent dictionary with many missing characters

2022-07-13 Thread Michael Paquier
On Tue, Jul 05, 2022 at 09:24:49PM +0200, Przemysław Sztoch wrote: > I do not add more, because they probably concern older languages. > An alternative might be to rely entirely on Unicode decomposition ... > However, after the change, only one additional Ukrainian letter with an > accent was added

Re: [PATCH] Completed unaccent dictionary with many missing characters

2022-07-13 Thread Przemysław Sztoch
Dear Michael P., 3. The matter is not that simple. When I change priorities (i.e. Latin-ASCII.xml is less important than Unicode decomposition), then "U+33D7" changes not to pH but to PH. In the end, I left it like it was before ... If you decide what to do with point 3, I will correct it and s
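
The priority question above can be sketched in a few lines of Python. This is only an illustration, not code from the patch: `LATIN_ASCII` is a hypothetical stand-in for the rules parsed from Latin-ASCII.xml (its "pH" value is taken from the message above, not read from the CLDR file), and `translate` is an invented helper name. The Unicode side uses the standard `unicodedata` module.

```python
import unicodedata

# Hypothetical stand-in for rules parsed from Latin-ASCII.xml. The "pH"
# value comes from the message above, not from the CLDR file itself.
LATIN_ASCII = {"\u33d7": "pH"}


def translate(ch: str, prefer_latin_ascii: bool) -> str:
    """Pick a replacement for ch from one of two sources, by priority."""
    from_cldr = LATIN_ASCII.get(ch)
    # Compatibility decomposition as defined by the Unicode database.
    from_unicode = unicodedata.normalize("NFKD", ch)
    if prefer_latin_ascii and from_cldr is not None:
        return from_cldr
    return from_unicode


# U+33D7 is SQUARE PH; the two sources disagree on letter case.
print(translate("\u33d7", prefer_latin_ascii=True))
print(translate("\u33d7", prefer_latin_ascii=False))
```

With the Latin-ASCII table preferred, the sketch yields "pH"; with Unicode decomposition preferred, it yields "PH", which matches the behaviour described in the message.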

Re: [PATCH] Completed unaccent dictionary with many missing characters

2022-07-05 Thread Przemysław Sztoch
Michael Paquier wrote on 7/5/2022 9:22 AM: On Tue, Jun 28, 2022 at 02:14:53PM +0900, Michael Paquier wrote: Well, the addition of Cyrillic does not make necessary the removal of SOUND RECORDING COPYRIGHT or the DEGREEs, that implies the use of a dictionary when manipulating the set of codepoint

Re: [PATCH] Completed unaccent dictionary with many missing characters

2022-07-05 Thread Michael Paquier
On Tue, Jun 28, 2022 at 02:14:53PM +0900, Michael Paquier wrote: > Well, the addition of Cyrillic does not make necessary the removal of > SOUND RECORDING COPYRIGHT or the DEGREEs, that implies the use of a > dictionary when manipulating the set of codepoints, but that's me > being too picky. Jus

Re: [PATCH] Completed unaccent dictionary with many missing characters

2022-07-03 Thread Przemysław Sztoch
Michael Paquier wrote on 6/28/2022 7:14 AM: On Thu, Jun 23, 2022 at 02:10:42PM +0200, Przemysław Sztoch wrote: The only division that is probably possible is the one attached. Well, the addition of Cyrillic does not make necessary the removal of SOUND RECORDING COPYRIGHT or the DEGREEs, that im

Re: [PATCH] Completed unaccent dictionary with many missing characters

2022-06-27 Thread Michael Paquier
On Thu, Jun 23, 2022 at 02:10:42PM +0200, Przemysław Sztoch wrote: > The only division that is probably possible is the one attached. Well, the addition of Cyrillic does not make necessary the removal of SOUND RECORDING COPYRIGHT or the DEGREEs, that implies the use of a dictionary when manipulat

Re: [PATCH] Completed unaccent dictionary with many missing characters

2022-06-23 Thread Przemysław Sztoch
Michael Paquier wrote on 23.06.2022 06:39: That'd leave just DEGREE CELSIUS and DEGREE FAHRENHEIT. Not sure how to kill those last two special cases -- they should be directly replaced by their decomposition. [1] https://unicode-org.atlassian.net/browse/CLDR-11383 I patch v3 support for cirili
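
For reference, the decompositions of the two remaining special cases can be inspected with Python's standard `unicodedata` module. This is a minimal sketch for illustration only; `decomposed_text` is an invented helper and is not part of generate_unaccent_rules.py.

```python
import unicodedata


def decomposed_text(ch: str) -> str:
    """Return the characters named in ch's UnicodeData decomposition field."""
    fields = unicodedata.decomposition(ch).split()
    # Skip the formatting tag such as "<compat>", keep the hex code points.
    codepoints = [f for f in fields if not f.startswith("<")]
    return "".join(chr(int(cp, 16)) for cp in codepoints)


for ch in ("\u2103", "\u2109"):
    # DEGREE CELSIUS and DEGREE FAHRENHEIT decompose to a degree sign
    # followed by a plain letter.
    print(unicodedata.name(ch), "->", decomposed_text(ch))
```

Both characters decompose to U+00B0 DEGREE SIGN followed by "C" or "F", so the result still contains a non-ASCII character unless the degree sign is handled separately.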

Re: [PATCH] Completed unaccent dictionary with many missing characters

2022-06-22 Thread Michael Paquier
On Tue, Jun 21, 2022 at 03:41:48PM +0200, Przemysław Sztoch wrote: > Thomas Munro wrote on 21.06.2022 02:53: >> Oh, we're using CLDR 41, which reminds me: CLDR 36 added SOUND >> RECORDING COPYRIGHT[1] so we could drop it from special_cases(). Indeed. >> Hmm, is it possible to get rid of CYRILLIC

Re: [PATCH] Completed unaccent dictionary with many missing characters

2022-06-21 Thread Przemysław Sztoch
Thomas Munro wrote on 21.06.2022 02:53: On Tue, Jun 21, 2022 at 12:11 PM Michael Paquier wrote: Yeah, Latin-ASCII.xml is getting it wrong here, then. unaccent fetches the thing from this URL currently: https://raw.githubusercontent.com/unicode-org/cldr/release-41/common/transforms/Latin-ASCII

Re: [PATCH] Completed unaccent dictionary with many missing characters

2022-06-21 Thread Przemysław Sztoch
Michael Paquier wrote on 21.06.2022 02:11: On Mon, Jun 20, 2022 at 10:37:57AM +0200, Przemysław Sztoch wrote: But ligature check is performed on combining_ids (result of translation), not on base codepoint. Without it, you will get assertions in get_plain_letters. Hmm. I am wondering if we c

Re: [PATCH] Completed unaccent dictionary with many missing characters

2022-06-20 Thread Thomas Munro
On Tue, Jun 21, 2022 at 12:11 PM Michael Paquier wrote: > Yeah, Latin-ASCII.xml is getting it wrong here, then. unaccent > fetches the thing from this URL currently: > https://raw.githubusercontent.com/unicode-org/cldr/release-41/common/transforms/Latin-ASCII.xml Oh, we're using CLDR 41, which r

Re: [PATCH] Completed unaccent dictionary with many missing characters

2022-06-20 Thread Michael Paquier
On Mon, Jun 20, 2022 at 10:37:57AM +0200, Przemysław Sztoch wrote: > But ligature check is performed on combining_ids (result of translation), > not on base codepoint. > Without it, you will get assertions in get_plain_letters. Hmm. I am wondering if we could make the whole logic a bit more intui

Re: [PATCH] Completed unaccent dictionary with many missing characters

2022-06-20 Thread Przemysław Sztoch
Michael Paquier wrote on 20.06.2022 03:49:
On Wed, Jun 15, 2022 at 01:01:37PM +0200, Przemysław Sztoch wrote:
Two fixes (bad comment and fixed Latin-ASCII.xml).

     if codepoint.general_category.startswith('L') and \
-        len(codepoint.combining_ids) > 1:
+        len(codepoin

Re: [PATCH] Completed unaccent dictionary with many missing characters

2022-06-19 Thread Michael Paquier
On Wed, Jun 15, 2022 at 01:01:37PM +0200, Przemysław Sztoch wrote:
> Two fixes (bad comment and fixed Latin-ASCII.xml).

     if codepoint.general_category.startswith('L') and \
-        len(codepoint.combining_ids) > 1:
+        len(codepoint.combining_ids) > 0:

So, this one checks for t

Re: [PATCH] Completed unaccent dictionary with many missing characters

2022-06-15 Thread Przemysław Sztoch
Two fixes (bad comment and fixed Latin-ASCII.xml). Michael Paquier wrote on 17.05.2022 09:11: On Thu, May 05, 2022 at 09:44:15PM +0200, Przemysław Sztoch wrote: Tom, I disagree with you because many similar numerical conversions are already taking place, e.g. 1/2, 1/4... This part sounds like

Re: [PATCH] Completed unaccent dictionary with many missing characters

2022-05-17 Thread Michael Paquier
On Thu, May 05, 2022 at 09:44:15PM +0200, Przemysław Sztoch wrote: > Tom, I disagree with you because many similar numerical conversions are > already taking place, e.g. 1/2, 1/4... This part sounds like a valid argument to me. unaccent.rules does already the conversion of some mathematical signs

Re: [PATCH] Completed unaccent dictionary with many missing characters

2022-05-05 Thread Przemysław Sztoch
Tom Lane wrote on 5/4/2022 5:32 PM: Peter Eisentraut writes: On 28.04.22 18:50, Przemysław Sztoch wrote: Current unaccent dictionary does not include many popular numeric symbols, for example: "m²" -> "m2" Seems reasonable. It kinda feels like this is outside the charter of an "unaccent" dic

Re: [PATCH] Completed unaccent dictionary with many missing characters

2022-05-05 Thread Przemysław Sztoch
Peter Eisentraut wrote on 5/4/2022 5:17 PM: On 28.04.22 18:50, Przemysław Sztoch wrote: Current unaccent dictionary does not include many popular numeric symbols, for example: "m²" -> "m2" Seems reasonable. Can you explain what your patch does to achieve this? I used an existing python imple

Re: [PATCH] Completed unaccent dictionary with many missing characters

2022-05-04 Thread Tom Lane
Peter Eisentraut writes: > On 28.04.22 18:50, Przemysław Sztoch wrote: >> Current unaccent dictionary does not include many popular numeric symbols, >> for example: "m²" -> "m2" > Seems reasonable. It kinda feels like this is outside the charter of an "unaccent" dictionary. I don't object to ha

Re: [PATCH] Completed unaccent dictionary with many missing characters

2022-05-04 Thread Peter Eisentraut
On 28.04.22 18:50, Przemysław Sztoch wrote: Current unaccent dictionary does not include many popular numeric symbols, for example: "m²" -> "m2" Seems reasonable. Can you explain what your patch does to achieve this?
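
As a rough illustration of the mapping being requested (not the patch itself), Python's standard `unicodedata` module already exposes the compatibility decompositions that turn a superscript digit into a plain one. The helper name `plain_form` is hypothetical, and the approach shown, NFKD normalization followed by dropping combining marks, is an assumption about one way to get "m²" -> "m2", not a description of how the unaccent rules are generated.

```python
import unicodedata


def plain_form(text: str) -> str:
    """Apply compatibility decomposition (NFKD), then drop combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(plain_form("m\u00b2"))  # SUPERSCRIPT TWO becomes a plain digit
print(plain_form("\u00e9"))   # the combining acute accent is dropped
```

The first call covers the numeric-symbol case raised in the thread; the second shows the classic accent-stripping case that the same decomposition data handles.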

[PATCH] Completed unaccent dictionary with many missing characters

2022-04-28 Thread Przemysław Sztoch
Current unaccent dictionary does not include many popular numeric symbols, for example: "m²" -> "m2" diff --git a/contrib/unaccent/generate_unaccent_rules.py b/contrib/unaccent/generate_unaccent_rules.py index c405e231b3..a1a1a65112 100644 --- a/con