On Friday, December 14, 2012 2:07:51 PM UTC+1, Pander Musubi wrote:
> On Friday, December 14, 2012 1:06:23 AM UTC+1, Steven D'Aprano wrote:
> > On Thu, 13 Dec 2012 07:30:57 -0800, Pander Musubi wrote:
> >
> > > I was expecting PyPI. Here is the code, please advise on where to submit
> > > it:
> > > http://pastebin.com/dbzeasyq
> >
> > If anywhere, either a third-party module, or the unicodedata standard
> > library module.
> >
> > Some unanswered questions:
> >
> > - when would somebody need this function?
>
> When working with Unicode metadata, see below.
>
> > - why is it called "decodeUnicodeGeneralCategory" when it
> >   doesn't seem to have anything to do with decoding?
>
> It is actually a simple LUT. I like your improvements below.
>
> > - why is the parameter "sortable" called sortable, when it
> >   doesn't seem to have anything to do with sorting?
>
> The values returned are alphabetically sortable.
>
> > If this is useful at all, it would be more useful to just expose the data
> > as a dict, and forget about an unnecessary wrapper function:
> >
> > from collections import namedtuple
> > r = namedtuple("record", "other name desc")  # better field names needed!
> >
> > GC = {
> >     'C' : r('Other', 'Other', 'Cc | Cf | Cn | Co | Cs'),
> >     'Cc': r('Control', 'Control',
> >             'a C0 or C1 control code'),  # a.k.a. cntrl
> >     'Cf': r('Format', 'Format', 'a format control character'),
> >     'Cn': r('Unassigned', 'Unassigned',
> >             'a reserved unassigned code point or a noncharacter'),
> >     'Co': r('Private Use', 'Private_Use', 'a private-use character'),
> >     'Cs': r('Surrogate', 'Surrogate', 'a surrogate code point'),
> >     'L' : r('Letter', 'Letter', 'Ll | Lm | Lo | Lt | Lu'),
> >     'LC': r('Letter, Cased', 'Cased_Letter', 'Ll | Lt | Lu'),
> >     'Ll': r('Letter, Lowercase', 'Lowercase_Letter',
> >             'a lowercase letter'),
> >     'Lm': r('Letter, Modifier', 'Modifier_Letter', 'a modifier letter'),
> >     'Lo': r('Letter, Other', 'Other_Letter',
> >             'other letters, including syllables and ideographs'),
> >     'Lt': r('Letter, Titlecase', 'Titlecase_Letter',
> >             'a digraphic character, with first part uppercase'),
> >     'Lu': r('Letter, Uppercase', 'Uppercase_Letter',
> >             'an uppercase letter'),
> >     'M' : r('Mark', 'Mark', 'Mc | Me | Mn '),  # a.k.a. Combining_Mark
> >     'Mc': r('Mark, Spacing', 'Spacing_Mark',
> >             'a spacing combining mark (positive advance width)'),
> >     'Me': r('Mark, Enclosing', 'Enclosing_Mark',
> >             'an enclosing combining mark'),
> >     'Mn': r('Mark, Nonspacing', 'Nonspacing_Mark',
> >             'a nonspacing combining mark (zero advance width)'),
> >     'N' : r('Number', 'Number', 'Nd | Nl | No'),
> >     'Nd': r('Number, Decimal', 'Decimal_Number',
> >             'a decimal digit'),  # a.k.a. digit
> >     'Nl': r('Number, Letter', 'Letter_Number',
> >             'a letterlike numeric character'),
> >     'No': r('Number, Other', 'Other_Number',
> >             'a numeric character of other type'),
> >     'P' : r('Punctuation', 'Punctuation',
> >             'Pc | Pd | Pe | Pf | Pi | Po | Ps'),  # a.k.a. punct
> >     'Pc': r('Punctuation, Connector', 'Connector_Punctuation',
> >             'a connecting punctuation mark, like a tie'),
> >     'Pd': r('Punctuation, Dash', 'Dash_Punctuation',
> >             'a dash or hyphen punctuation mark'),
> >     'Pe': r('Punctuation, Close', 'Close_Punctuation',
> >             'a closing punctuation mark (of a pair)'),
> >     'Pf': r('Punctuation, Final', 'Final_Punctuation',
> >             'a final quotation mark'),
> >     'Pi': r('Punctuation, Initial', 'Initial_Punctuation',
> >             'an initial quotation mark'),
> >     'Po': r('Punctuation, Other', 'Other_Punctuation',
> >             'a punctuation mark of other type'),
> >     'Ps': r('Punctuation, Open', 'Open_Punctuation',
> >             'an opening punctuation mark (of a pair)'),
> >     'S' : r('Symbol', 'Symbol', 'Sc | Sk | Sm | So'),
> >     'Sc': r('Symbol, Currency', 'Currency_Symbol', 'a currency sign'),
> >     'Sk': r('Symbol, Modifier', 'Modifier_Symbol',
> >             'a non-letterlike modifier symbol'),
> >     'Sm': r('Symbol, Math', 'Math_Symbol',
> >             'a symbol of mathematical use'),
> >     'So': r('Symbol, Other', 'Other_Symbol', 'a symbol of other type'),
> >     'Z' : r('Separator', 'Separator', 'Zl | Zp | Zs'),
> >     'Zl': r('Separator, Line', 'Line_Separator',
> >             'U+2028 LINE SEPARATOR only'),
> >     'Zp': r('Separator, Paragraph', 'Paragraph_Separator',
> >             'U+2029 PARAGRAPH SEPARATOR only'),
> >     'Zs': r('Separator, Space', 'Space_Separator',
> >             'a space character (of various non-zero widths)'),
> > }
> >
> > del r
> >
> > Usage is then trivially the same as normal dict and attribute access:
> >
> > py> GC['Ps'].desc
> > 'an opening punctuation mark (of a pair)'
>
> Thank you for the improvements. I have some more extra dicts in this way,
> such as:
>
> http://www.unicode.org/Public/UNIDATA/UnicodeData.txt
>
> where this general category is being used. This information is useful when
> handling Unicode metadata.
>
> I think I will approach both
>
> http://pypi.python.org/pypi/unicodeblocks/
>
> and
>
> http://pypi.python.org/pypi/unicodescript/
>
> to see who will adopt this.
>
> Perhaps it might be in their mutual interest to join their packages to e.g.
> unicodemetadata or something similar. Extra ideas on this are still welcome.
>
> Thanks for all your help,
>
> Pander
>
> > --
> > Steven
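
To illustrate the kind of use I have in mind, here is a minimal sketch (not the pastebin code). The standard library's unicodedata.category() already returns the two-letter abbreviation, and a lookup table like the one quoted above turns it into a readable name. The `Record` type, the `describe` helper and the five-entry excerpt are made up for this example; a real table would carry all categories.

```python
import unicodedata
from collections import namedtuple

# Hypothetical record type; the field names are illustrative only.
Record = namedtuple("Record", "name desc")

# Small excerpt of a general-category table (entries copied from the
# table quoted above); a complete table would list every category.
GC = {
    "Lu": Record("Uppercase_Letter", "an uppercase letter"),
    "Ll": Record("Lowercase_Letter", "a lowercase letter"),
    "Nd": Record("Decimal_Number", "a decimal digit"),
    "Ps": Record("Open_Punctuation", "an opening punctuation mark (of a pair)"),
    "Zs": Record("Space_Separator", "a space character (of various non-zero widths)"),
}


def describe(ch):
    """Return (abbreviation, long name) for a single character.

    The long name is None when the abbreviation is not in the excerpt.
    """
    abbr = unicodedata.category(ch)  # two-letter general category
    record = GC.get(abbr)
    return abbr, (record.name if record else None)


if __name__ == "__main__":
    for ch in "Aa1( ":
        print(repr(ch), *describe(ch))
```

Keeping the data in a plain dict means the wrapper stays optional: callers who only need the description can index `GC` directly with whatever unicodedata.category() returns.
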
Ah, it will become a feature request for
http://docs.python.org/3/library/unicodedata.html

--
http://mail.python.org/mailman/listinfo/python-list