Steve Dower <steve.do...@python.org> added the comment:
The benchmark may not be triggering that much work. NFKC normalization only kicks in for identifiers containing characters outside the basic Latin (ASCII) range, so ASCII-only names skip it entirely. I ran the benchmarks below and saw a huge difference. Granted, it's a very degenerate case with collections this big, but the cost appears to be linear in len(NAMES), which suggests that the normalization is the expensive part.

>>> import random, timeit, unicodedata
>>> CHRS = [c for c in (chr(i) for i in range(65535)) if c.isidentifier()]
>>> def makename():
...     return ''.join(random.choice(CHRS) for _ in range(10))
...
>>> NAMES = [makename() for _ in range(10000)]
>>> timeit.timeit('len(set(NAMES))', globals=globals(), number=100000)
38.04007526000004
>>> timeit.timeit('len(set(unicodedata.normalize("NFKC", n) for n in NAMES))', globals=globals(), number=100000)
820.2586788580002

I wonder if it's better to catch the SyntaxError and do the check there? That way we don't really have a performance impact, since it's only going to show up in exceptional cases anyway.

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue33881>
_______________________________________
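To illustrate the "catch the SyntaxError and do the check there" idea from the comment above, here is a rough pure-Python sketch. It is not CPython's parser code; non_nfkc_identifiers() and the demo source string are made up for this example. The NFKC comparison only runs after compile() has already failed, so the successful-compile path pays nothing extra:

import io
import tokenize
import unicodedata

def non_nfkc_identifiers(source):
    """Return (raw, normalized) pairs for names not already in NFKC form."""
    pairs = []
    try:
        for tok in tokenize.generate_tokens(io.StringIO(source).readline):
            if tok.type == tokenize.NAME:
                norm = unicodedata.normalize("NFKC", tok.string)
                if norm != tok.string:
                    pairs.append((tok.string, norm))
    except (tokenize.TokenError, SyntaxError):
        pass  # source too broken to tokenize; no hint available
    return pairs

source = "\ufb01le = = 1\n"  # 'fi' ligature in the name, plus a syntax error
try:
    compile(source, "<test>", "exec")
except SyntaxError:
    # The expensive normalization happens only here, on the error path.
    for raw, norm in non_nfkc_identifiers(source):
        print(f"note: identifier {raw!r} normalizes to {norm!r}")

In CPython itself the equivalent check would presumably live near the point where the SyntaxError is raised, but the shape of the idea is the same: keep the hot path untouched and pay the normalization cost only when an error is already being reported.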