New submission from Inada Naoki <songofaca...@gmail.com>:

Assume you are writing an extension module that reads strings, for example an HTML escaper or a JSON encoder. There are two approaches:

(a) Support the three KINDs of the flexible unicode representation.
(b) Get UTF-8 data from the unicode object.

(a) will be the fastest on CPython, but it has a few drawbacks:

* It is tightly coupled to the CPython implementation. It will be slow on PyPy.
* CPython may change its internal representation to UTF-8 in the future, like PyPy.
* You cannot easily reuse algorithms written in C that handle `char*`.

So I believe (b) should be the preferred way.
But CPython doesn't provide an efficient way to get UTF-8 from a unicode object.

* PyUnicode_AsUTF8AndSize(): When the unicode contains a non-ASCII character, it creates a UTF-8 cache. The cache is kept for longer than required, and there is an additional malloc + memcpy to create it.
* PyUnicode_AsUTF8String(): It creates a bytes object even when the unicode object is ASCII-only or already has a UTF-8 cache.

For speed and efficiency, I propose a new API:

```
/* Borrow the UTF-8 C string from the unicode.
 *
 * Store a pointer to the UTF-8 encoding of the unicode to *utf8* and its size to *len*.
 * The returned object is the owner of the *utf8*. You need to Py_DECREF() it after
 * you have finished using the *utf8*. The owner may not be the unicode.
 * Returns NULL when an error occurred while encoding the unicode.
 */
PyObject* PyUnicode_BorrowUTF8(PyObject *unicode, const char **utf8, Py_ssize_t *len);
```

When the unicode object is ASCII or has a UTF-8 cache, this API increments the refcount of the unicode and returns it. Otherwise, it calls `_PyUnicode_AsUTF8String(unicode, NULL)` and returns the result.

----------
components: C API
messages: 358623
nosy: inada.naoki
priority: normal
severity: normal
status: open
title: No efficient API to get UTF-8 string from unicode object.
type: enhancement
versions: Python 3.9

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue39087>
_______________________________________
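
To illustrate the point about reusing C code that handles `char*`, here is a minimal sketch of an HTML escaper that works directly on a UTF-8 buffer. The helper name `html_escape_utf8` and its signature are hypothetical and not part of CPython. The caller shown in the trailing comment uses the proposed `PyUnicode_BorrowUTF8`, which does not exist today.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not a CPython API): HTML-escape a UTF-8 buffer.
 *
 * It works byte by byte without decoding. The five escaped characters are
 * ASCII, and ASCII bytes never appear inside a multi-byte UTF-8 sequence,
 * so non-ASCII text passes through unchanged.
 *
 * Returns a malloc'd, NUL-terminated string and stores its length in
 * *out_len (if out_len is not NULL). Returns NULL on allocation failure
 * or if the worst-case output size would overflow size_t.
 */
char *html_escape_utf8(const char *utf8, size_t len, size_t *out_len)
{
    assert(utf8 != NULL || len == 0);

    /* Worst case: every input byte expands to 6 bytes ("&quot;"). */
    if (len > (SIZE_MAX - 1) / 6) {
        return NULL;
    }
    char *out = malloc(len * 6 + 1);
    if (out == NULL) {
        return NULL;
    }

    size_t n = 0;
    for (size_t i = 0; i < len; i++) {
        const char *rep = NULL;
        switch (utf8[i]) {
        case '&':  rep = "&amp;";  break;
        case '<':  rep = "&lt;";   break;
        case '>':  rep = "&gt;";   break;
        case '"':  rep = "&quot;"; break;
        case '\'': rep = "&#x27;"; break;
        default:   break;
        }
        if (rep != NULL) {
            size_t rlen = strlen(rep);
            memcpy(out + n, rep, rlen);
            n += rlen;
        } else {
            out[n++] = utf8[i];
        }
    }
    out[n] = '\0';
    if (out_len != NULL) {
        *out_len = n;
    }
    return out;
}

/* Sketch of a caller under the proposed API (PyUnicode_BorrowUTF8 is the
 * proposal in this issue, not an existing function):
 *
 *     const char *utf8;
 *     Py_ssize_t len;
 *     size_t n;
 *     PyObject *owner = PyUnicode_BorrowUTF8(unicode, &utf8, &len);
 *     if (owner == NULL) return NULL;
 *     char *escaped = html_escape_utf8(utf8, (size_t)len, &n);
 *     Py_DECREF(owner);   // utf8 must not be used after this point
 */
```

Because the helper only sees a pointer and a length, it does not depend on the three KINDs and could be shared with non-CPython callers.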