Thanks to MRAB and Chris Angelico for your help. Here is how I implemented the string conversion, and it works correctly now for a library call that needs a list converted to a string (error handling not shown):

// " ".join(sentence): build the separator, join the list, release the separator
PyObject* separator = PyUnicode_FromString(" ");
PyObject* str_join = PyUnicode_Join(separator, pSentence);
Py_DECREF(separator);

// nltk.word_tokenize(str_join); the argument list must end with NULL
PyObject* pNltk_WTok = PyObject_GetAttrString(pModule_mstr, "word_tokenize");
PyObject* pWTok = PyObject_CallFunctionObjArgs(pNltk_WTok, str_join, NULL);

That produces what I need (this is the REPR of pWTok):

"['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']']"

Thanks again to both of you.

Jen

Mar 7, 2022, 11:03 by pyt...@mrabarnett.plus.com:

> On 2022-03-07 17:05, Jen Kris wrote:
>
>> Thank you MRAB for your reply.
>>
>> Regarding your first question, pSentence is a list. In the nltk library,
>> nltk.word_tokenize takes a string, so we convert sentence to string before
>> we call nltk.word_tokenize:
>>
>> >>> sentence = " ".join(sentence)
>> >>> pt = nltk.word_tokenize(sentence)
>> >>> print(sentence)
>> [ Emma by Jane Austen 1816 ]
>>
>> But with the C API it looks like this:
>>
>> PyObject *pSentence = PySequence_GetItem(pSents, sent_count);
>> PyObject* str_sentence = PyObject_Str(pSentence); // Convert to string
>>
>> // See what str_sentence looks like:
>> PyObject* repr_str = PyObject_Repr(str_sentence);
>> PyObject* str_str = PyUnicode_AsEncodedString(repr_str, "utf-8", "~E~");
>> const char *bytes_str = PyBytes_AS_STRING(str_str);
>> printf("REPR_String: %s\n", bytes_str);
>>
>> REPR_String: "['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']']"
>>
>> So the two string representations are not the same – or at least the
>> PyUnicode_AsEncodedString is not the same, as each item is surrounded by
>> single quotes.
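
The difference between the two conversions can be reproduced in plain Python without nltk. This is only a minimal sketch: the sentence list is hard-coded from the output shown in this thread, and it relies on the documented equivalences that PyObject_Str(o) corresponds to str(o) and PyUnicode_Join(sep, seq) corresponds to sep.join(seq).

```python
# The first Gutenberg sentence from this thread, hard-coded (assumption:
# this matches what gutenberg.sents(fileids[0])[0] returns) so nltk is not needed.
sentence = ['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']']

# str() on a list returns the list's repr, with brackets, quotes and commas.
# This is the Python-level equivalent of PyObject_Str(pSentence).
as_str = str(sentence)

# " ".join() concatenates the items with a single space between them.
# This is the Python-level equivalent of PyUnicode_Join(separator, pSentence).
joined = " ".join(sentence)

print(as_str)   # ['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']']
print(joined)   # [ Emma by Jane Austen 1816 ]
```

Passing the first string to a tokenizer means the quotes and commas get tokenized too, which would explain the much longer list; the second string is what the Python version of the code passes.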
>>
>> Assuming that the conversion to bytes object for the REPR is an accurate
>> representation of str_sentence, it looks like I need to strip the quotes
>> from str_sentence before “PyObject* pWTok =
>> PyObject_CallFunctionObjArgs(pNltk_WTok, str_sentence, 0).”
>>
>> So my questions now are (1) is there a C API function that will convert a
>> list to a string exactly the same way as " ".join, and if not then (2) how
>> can I strip characters from a string object in the C API?
>>
> Your Python code is joining the list with a space as the separator.
>
> The equivalent using the C API is:
>
> PyObject* separator;
> PyObject* joined;
>
> separator = PyUnicode_FromString(" ");
> joined = PyUnicode_Join(separator, pSentence);
> Py_DECREF(separator);
>
>>
>> Mar 6, 2022, 17:42 by pyt...@mrabarnett.plus.com:
>>
>> On 2022-03-07 00:32, Jen Kris via Python-list wrote:
>>
>> I am using the C API in Python 3.8 with the nltk library, and
>> I have a problem with the return from a library call
>> implemented with PyObject_CallFunctionObjArgs.
>>
>> This is the relevant Python code:
>>
>> import nltk
>> from nltk.corpus import gutenberg
>> fileids = gutenberg.fileids()
>> sentences = gutenberg.sents(fileids[0])
>> sentence = sentences[0]
>> sentence = " ".join(sentence)
>> pt = nltk.word_tokenize(sentence)
>>
>> I run this at the Python command prompt to show how it works:
>>
>> sentence = " ".join(sentence)
>> pt = nltk.word_tokenize(sentence)
>> print(pt)
>>
>> ['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']']
>>
>> type(pt)
>>
>> <class 'list'>
>>
>> This is the relevant part of the C API code:
>>
>> PyObject* str_sentence = PyObject_Str(pSentence);
>> // nltk.word_tokenize(sentence)
>> PyObject* pNltk_WTok = PyObject_GetAttrString(pModule_mstr,
>> "word_tokenize");
>> PyObject* pWTok = PyObject_CallFunctionObjArgs(pNltk_WTok,
>> str_sentence, 0);
>>
>> (where pModule_mstr is the nltk library).
>>
>> That should produce a list with a length of 7 that looks like
>> it does on the command line version shown above:
>>
>> ['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']']
>>
>> But instead the C API produces a list with a length of 24, and
>> the REPR looks like this:
>>
>> '[\'[\', "\'", \'[\', "\'", \',\', "\'Emma", "\'", \',\',
>> "\'by", "\'", \',\', "\'Jane", "\'", \',\', "\'Austen", "\'",
>> \',\', "\'1816", "\'", \',\', "\'", \']\', "\'", \']\']'
>>
>> I also tried this with PyObject_CallMethodObjArgs and
>> PyObject_Call without success.
>>
>> Thanks for any help on this.
>>
>> What is pSentence? Is it what you think it is?
>> To me it looks like it's either the list:
>>
>> ['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']']
>>
>> or that list as a string:
>>
>> "['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']']"
>>
>> and that's what you're tokenising.
>> -- https://mail.python.org/mailman/listinfo/python-list