kerbrose khaled wrote:

> I would like to update unaccent.rules file to support Arabic letters. so
> could someone help me or tell me how could I add such contribution. I
> attached the file including the modifications, only the last 4 lines.

The Arabic letters are found in the Unicode block U+0600 to U+06FF
(https://www.fileformat.info/info/unicode/block/arabic/list.htm).
Until now, the unaccent module has had no coverage of this block.
Since Arabic uses several diacritics [1], it would be best to figure
out all the transliterations that should go in and add them in one go
(plus coding that in the Python script).

The canonical way to unaccent is normally to apply a Unicode
transformation: NFC -> NFD and remove the non-spacing marks.

I've tentatively done that with each codepoint in the 0600-06FF block
in SQL with icu_transform in icu_ext [2], and it produces the attached
result, with 60 (!) entries, along with Unicode names for readability.

Does that make sense to people who know Arabic?

For the record, here's the query:

  WITH block(cp) AS (
    select * FROM generate_series(x'600'::int,x'6ff'::int) AS cp
  ),
  dest AS (
    select cp,
           icu_transform(chr(cp),
             'any-NFD;[:nonspacing mark:] any-remove; any-NFC') AS unaccented
    FROM block
  )
  SELECT
    chr(cp) as "src",
    icu_transform(chr(cp), 'Name') as "srcName",
    dest.unaccented as "dest",
    icu_transform(dest.unaccented, 'Name') as "destName"
  FROM dest
  WHERE chr(cp) <> dest.unaccented;

[1] https://en.wikipedia.org/wiki/Arabic_diacritics
[2] https://github.com/dverite/icu_ext#icu_transform

Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite
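
For comparison, the same NFD / strip non-spacing marks / NFC transformation can be sketched with Python's standard unicodedata module instead of icu_ext. This is only an illustration under assumptions: the function names strip_marks and arabic_block_mappings are made up for this example and are not part of the unaccent tooling, and Python's bundled Unicode database may differ in version from the ICU library used by the query above, so the set of rows it yields is not guaranteed to match the attached result.

```python
import unicodedata


def strip_marks(text):
    """Decompose to NFD, drop non-spacing marks (category Mn), recompose to NFC."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return unicodedata.normalize("NFC", kept)


def names(text):
    """Unicode names of each character, joined for readability (empty if unnamed)."""
    return " + ".join(unicodedata.name(ch, "") for ch in text)


def arabic_block_mappings(first=0x0600, last=0x06FF):
    """Yield (src, srcName, dest, destName) for codepoints changed by strip_marks."""
    for cp in range(first, last + 1):
        src = chr(cp)
        dest = strip_marks(src)
        if dest != src:
            yield src, names(src), dest, names(dest)


if __name__ == "__main__":
    for src, src_name, dest, dest_name in arabic_block_mappings():
        print(f"U+{ord(src):04X}\t{src_name}\t->\t{dest_name}")
```

Note that a codepoint that is itself a non-spacing mark maps to the empty string here, just as it does in the SQL query, since the empty result differs from the source character.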
Attachment: unaccent-arabic-block.utf8.output