cor3ntin added a comment.

So that this doesn't get lost, I'm opposed to this change because it does not 
actually follow the TR39 guidance or algorithm.
  For an input string X, define skeleton(X) to be the following transformation 
on the string:
  
  1. Convert X to NFD format, as described in [UAX15].
  2. Concatenate the prototypes for each character in X according to the 
specified data, producing a string of exemplar characters.
  3. Reapply NFD.

Step 1 and 3 not being performed will lead to false negative.

That being said, given that it is for clang tidy and that adding decomposition 
to llvm is a bit involved (I could look into it if there is interest), this is 
not a strong opposition, but i wanted to make sure people are aware of the 
limitation.
I think we should have a fixme in the code so that we don't forget.



================
Comment at: 
clang-tools-extra/clang-tidy/misc/ConfusableTable/build_confusable_table.cpp:59
+  llvm::raw_fd_ostream os(argv[2], ec);
+  os << "struct {llvm::UTF32 codepoint; llvm::UTF32 values[32];} "
+        "ConfusableEntries[] = {\n";
----------------
This is terribly menory inneficient - most confusables are 1-3 code point longs.
Consider
 *  Replacing the hard coded 32 by the actual maximum length that you can 
extract when buikding the table
  * Having a table for the common case of 1-3 codepoints for the common case 
and another one for the long sequence. 


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D112916/new/

https://reviews.llvm.org/D112916

_______________________________________________
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits

Reply via email to