Hi Ram,

Thanks for doing this; I've been overestimating my ability to get to things over the last couple of weeks.
I've looked at the patch and have made one minor change. I had moved all the imports up to the top to keep them in one place (and I think some had originally been used only by the Python 2 code). You added them there but didn't remove them from their original positions, so I've incorporated that into your patch, attached as v2. I've tested this under Python 2 and 3 on Linux, but not on Windows. Everything else looks correct.

I apologise for not having replied to your question in the original bug report. I had intended to, but as I said, there's been an increase in the things I need to juggle at the moment.

Best wishes,
Hugh

On Sat, 16 Mar 2019 at 22:58, Ramanarayana <raam.s...@gmail.com> wrote:
> Hi Hugh,
>
> I have abstracted out the Windows compatibility changes from your patch to
> a new patch and tested it. Added the patch to
> https://commitfest.postgresql.org/23/
>
> Please feel free to change it if it requires any changes.
>
> Cheers
> Ram 4.0
>
diff --git a/contrib/unaccent/generate_unaccent_rules.py b/contrib/unaccent/generate_unaccent_rules.py
index 58b6e7d..7a0a96e 100644
--- a/contrib/unaccent/generate_unaccent_rules.py
+++ b/contrib/unaccent/generate_unaccent_rules.py
@@ -32,9 +32,15 @@
 # The approach is to be Python3 compatible with Python2 "backports".
 from __future__ import print_function
 from __future__ import unicode_literals
+# END: Python 2/3 compatibility - remove when Python 2 compatibility dropped
+
+import argparse
 import codecs
+import re
 import sys
+import xml.etree.ElementTree as ET
 
+# BEGIN: Python 2/3 compatibility - remove when Python 2 compatibility dropped
 if sys.version_info[0] <= 2:
     # Encode stdout as UTF-8, so we can just print to it
     sys.stdout = codecs.getwriter('utf8')(sys.stdout)
@@ -45,12 +51,9 @@ if sys.version_info[0] <= 2:
     # Python 2 and 3 compatible bytes call
     def bytes(source, encoding='ascii', errors='strict'):
         return source.encode(encoding=encoding, errors=errors)
+else:
 # END: Python 2/3 compatibility - remove when Python 2 compatibility dropped
-
-import re
-import argparse
-import sys
-import xml.etree.ElementTree as ET
+    sys.stdout = codecs.getwriter('utf8')(sys.stdout.buffer)
 
 # The ranges of Unicode characters that we consider to be "plain letters".
 # For now we are being conservative by including only Latin and Greek.
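For anyone skimming the hunk above: the technique it relies on is wrapping an output stream in a codecs StreamWriter so that text is encoded to UTF-8 on write (on Python 3, the wrapper goes around sys.stdout.buffer). A minimal standalone sketch of that mechanism, demonstrated against an in-memory binary stream rather than real stdout so it has no side effects:

```python
import codecs
import io

# Simulate what the patch does to sys.stdout: wrap a *binary* stream
# (here a BytesIO stand-in for sys.stdout.buffer) in a UTF-8 StreamWriter,
# so that writing text emits UTF-8 bytes regardless of the locale encoding.
buf = io.BytesIO()
writer = codecs.getwriter('utf8')(buf)

# Writing a non-ASCII character goes through the encoder.
writer.write(u'caf\u00e9')
```

After the write, the underlying stream holds the UTF-8 encoding of "café" (the "é" becomes the two bytes C3 A9), which is exactly the behaviour the patch wants from print() on both Python versions.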
@@ -233,21 +236,22 @@ def main(args):
     charactersSet = set()
 
     # read file UnicodeData.txt
-    unicodeDataFile = open(args.unicodeDataFilePath, 'r')
-
-    # read everything we need into memory
-    for line in unicodeDataFile:
-        fields = line.split(";")
-        if len(fields) > 5:
-            # http://www.unicode.org/reports/tr44/tr44-14.html#UnicodeData.txt
-            general_category = fields[2]
-            decomposition = fields[5]
-            decomposition = re.sub(decomposition_type_pattern, ' ', decomposition)
-            id = int(fields[0], 16)
-            combining_ids = [int(s, 16) for s in decomposition.split(" ") if s != ""]
-            codepoint = Codepoint(id, general_category, combining_ids)
-            table[id] = codepoint
-            all.append(codepoint)
+    with codecs.open(
+            args.unicodeDataFilePath, mode='r', encoding='UTF-8',
+            ) as unicodeDataFile:
+        # read everything we need into memory
+        for line in unicodeDataFile:
+            fields = line.split(";")
+            if len(fields) > 5:
+                # http://www.unicode.org/reports/tr44/tr44-14.html#UnicodeData.txt
+                general_category = fields[2]
+                decomposition = fields[5]
+                decomposition = re.sub(decomposition_type_pattern, ' ', decomposition)
+                id = int(fields[0], 16)
+                combining_ids = [int(s, 16) for s in decomposition.split(" ") if s != ""]
+                codepoint = Codepoint(id, general_category, combining_ids)
+                table[id] = codepoint
+                all.append(codepoint)
 
     # walk through all the codepoints looking for interesting mappings
     for codepoint in all:
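To make the parsing in that hunk concrete, here is a self-contained sketch of what the loop body extracts from a single semicolon-delimited UnicodeData.txt record. Note the tag-stripping regex below is a hypothetical stand-in for the script's actual decomposition_type_pattern, and parse_unicodedata_line is an illustrative helper, not a function from the patch:

```python
import re

# Hypothetical stand-in for decomposition_type_pattern: strips
# "<compat>"-style decomposition type tags before the hex fields are parsed.
decomposition_type_pattern = re.compile(r'<[^>]+>')

def parse_unicodedata_line(line):
    """Return (codepoint, general_category, combining_ids) for one
    UnicodeData.txt record, mirroring the loop body in the patch."""
    fields = line.split(";")
    if len(fields) <= 5:
        return None
    general_category = fields[2]
    # Field 5 is the decomposition mapping, e.g. "0041 0300".
    decomposition = re.sub(decomposition_type_pattern, ' ', fields[5])
    codepoint_id = int(fields[0], 16)
    combining_ids = [int(s, 16) for s in decomposition.split(" ") if s != ""]
    return codepoint_id, general_category, combining_ids

# U+00C0 LATIN CAPITAL LETTER A WITH GRAVE decomposes to "A" + combining grave.
record = "00C0;LATIN CAPITAL LETTER A WITH GRAVE;Lu;0;L;0041 0300;;;;N;;;;0041;"
result = parse_unicodedata_line(record)
```

For the U+00C0 record this yields the codepoint 0xC0, category "Lu", and the combining ids [0x41, 0x300], which is the data the rest of the script walks to build the unaccent rules.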