Hi Hugh,

I tested the script under Python 2.7 and it works perfectly. The problem occurs under Python 3.7 (and possibly only on Windows, since you were not seeing the issue), where I get the following error:
UnicodeEncodeError: 'charmap' codec can't encode character '\u0100' in position 0: character maps to <undefined>

I went through the Python script and found that the stdout encoding is set to UTF-8 only if the Python version is <= 2. I have made the same change for Python 3 as well. Please find the patch attached. Let me know if it makes sense.

Regards,
Ram.

On Tue, 12 Feb 2019 at 00:50, Hugh Ranalli <h...@whtc.ca> wrote:

> On Sun, 10 Feb 2019 at 15:07, raam narayana <raam.s...@gmail.com> wrote:
>
>> Hi,
>>
>> After the latest commit in master branch, I was trying to test the python
>> script. Ironically I still see that the output from the script is
>> completely different from the unaccent.rules file content. Am I missing
>> anything? My testing includes the following
>>
>> Downloaded the following files
>>
>> http://unicode.org/Public/8.0.0/ucd/UnicodeData.txt
>>
>> http://unicode.org/cldr/trac/export/14746/tags/release-34/common/transforms/Latin-ASCII.xml
>>
>> Executed the below python script
>>
>> python generate_unaccent_rules.py --unicode-data-file UnicodeData.txt
>> --latin-ascii-file Latin-ASCII.xml > unaccent.rules
>>
>> I am using python 3.7.1 and running on Windows 10 Platform
>>
>> The new status of this patch is: Needs review
>>
>
> Hi Raam,
> I just ran generate_unaccent_rules.py under two environments, using the
> data files given above:
> - Python 3.4.3 on Linux Mint 17.3 (equivalent to Ubuntu 14.04)
> - Python 3.6.7 on Ubuntu 18.04
>
> In both cases, the output was identical to that generated by the program
> under Python 2.7. So yes, more information would help. Unfortunately I
> don't have a Windows Python environment readily available, but could set
> one up if I had to.
>
> Thanks,
> Hugh
>

--
Cheers
Ram 4.0
generate_unaccent_rules-remove-combining-diacritical-accents-03.patch
Description: Binary data
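
For readers without the attachment, here is a minimal sketch of the kind of change described in the message above: forcing UTF-8 output on Python 3 as well as Python 2, so that redirecting stdout on Windows does not fall back to the 'charmap' codec. This is not the contents of the attached patch; the helper names utf8_writer and demo are hypothetical and used only for illustration.

```python
import codecs
import io


def utf8_writer(stream):
    """Return a writer that encodes everything written to it as UTF-8.

    Hypothetical helper, not taken from generate_unaccent_rules.py.
    `stream` may be a text stream exposing a binary `.buffer` (such as
    Python 3's sys.stdout) or a byte-oriented stream (such as Python 2's
    sys.stdout or an io.BytesIO).
    """
    # Use the underlying binary buffer when there is one, else the stream itself.
    binary = getattr(stream, "buffer", stream)
    return codecs.getwriter("utf-8")(binary)


def demo():
    # Write U+0100, the character from the error message, to an in-memory
    # byte stream and return the bytes that were produced.
    sink = io.BytesIO()
    writer = utf8_writer(sink)
    writer.write(u"\u0100")
    return sink.getvalue()
```

In a script this would be applied once, before anything is printed, as sys.stdout = utf8_writer(sys.stdout). On Python 3.7 and later, sys.stdout.reconfigure(encoding="utf-8") is another way to get the same effect.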