Thank you MRAB for your reply. Regarding your first question, pSentence is a list. In the nltk library, nltk.word_tokenize takes a string, so in Python we convert the sentence to a string before we call nltk.word_tokenize:

>>> sentence = " ".join(sentence)
>>> pt = nltk.word_tokenize(sentence)
>>> print(sentence)
[ Emma by Jane Austen 1816 ]

But with the C API it looks like this:

PyObject *pSentence = PySequence_GetItem(pSents, sent_count);
PyObject* str_sentence = PyObject_Str(pSentence);  // Convert to string

// See what str_sentence looks like:
PyObject* repr_str = PyObject_Repr(str_sentence);
PyObject* str_str = PyUnicode_AsEncodedString(repr_str, "utf-8", "~E~");
const char *bytes_str = PyBytes_AS_STRING(str_str);
printf("REPR_String: %s\n", bytes_str);

REPR_String: "['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']']"

So the two string representations are not the same, or at least the output of PyUnicode_AsEncodedString is not the same, since each item is surrounded by single quotes.

Assuming that the conversion to a bytes object for the REPR is an accurate representation of str_sentence, it looks like I need to strip the quotes from str_sentence before calling:

PyObject* pWTok = PyObject_CallFunctionObjArgs(pNltk_WTok, str_sentence, 0);

So my questions now are (1) is there a C API function that will convert a list to a string exactly the same way as " ".join, and if not, then (2) how can I strip characters from a string object in the C API?

Thanks.


Mar 6, 2022, 17:42 by pyt...@mrabarnett.plus.com:

> On 2022-03-07 00:32, Jen Kris via Python-list wrote:
>
>> I am using the C API in Python 3.8 with the nltk library, and I have a
>> problem with the return from a library call implemented with
>> PyObject_CallFunctionObjArgs.
>>
>> This is the relevant Python code:
>>
>> import nltk
>> from nltk.corpus import gutenberg
>> fileids = gutenberg.fileids()
>> sentences = gutenberg.sents(fileids[0])
>> sentence = sentences[0]
>> sentence = " ".join(sentence)
>> pt = nltk.word_tokenize(sentence)
>>
>> I run this at the Python command prompt to show how it works:
>>
>> >>> sentence = " ".join(sentence)
>> >>> pt = nltk.word_tokenize(sentence)
>> >>> print(pt)
>> ['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']']
>>
>> >>> type(pt)
>> <class 'list'>
>>
>> This is the relevant part of the C API code:
>>
>> PyObject* str_sentence = PyObject_Str(pSentence);
>> // nltk.word_tokenize(sentence)
>> PyObject* pNltk_WTok = PyObject_GetAttrString(pModule_mstr, "word_tokenize");
>> PyObject* pWTok = PyObject_CallFunctionObjArgs(pNltk_WTok, str_sentence, 0);
>>
>> (where pModule_mstr is the nltk library).
>>
>> That should produce a list with a length of 7 that looks like it does on the
>> command line version shown above:
>>
>> ['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']']
>>
>> But instead the C API produces a list with a length of 24, and the REPR
>> looks like this:
>>
>> '[\'[\', "\'", \'[\', "\'", \',\', "\'Emma", "\'", \',\', "\'by", "\'",
>> \',\', "\'Jane", "\'", \',\', "\'Austen", "\'", \',\', "\'1816", "\'",
>> \',\', "\'", \']\', "\'", \']\']'
>>
>> I also tried this with PyObject_CallMethodObjArgs and PyObject_Call without
>> success.
>>
>> Thanks for any help on this.
>>
> What is pSentence? Is it what you think it is?
> To me it looks like it's either the list:
>
> ['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']']
>
> or that list as a string:
>
> "['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']']"
>
> and that's what you're tokenising.

-- 
https://mail.python.org/mailman/listinfo/python-list
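
A note on the two conversions compared in this message, offered as a minimal sketch rather than a definitive answer. PyObject_Str(obj) is the C API counterpart of Python's str(obj), so when it is applied to a list it returns the list's display form, with brackets, quotes and commas included. The C API function that corresponds to str.join is PyUnicode_Join(separator, seq). The plain-Python snippet below needs no nltk. Its variable names are illustrative, and the token list is copied from the output quoted in this thread.

```python
# Tokens of the first sentence, copied from the output quoted in this thread.
tokens = ['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']']

# str(list) is what PyObject_Str(pSentence) computes when pSentence is a list:
# the list's display form, with brackets, quotes and commas.
as_str = str(tokens)

# " ".join(list) is what PyUnicode_Join(separator, pSentence) computes when
# separator is the str " ": the items themselves, separated by single spaces.
joined = " ".join(tokens)

print(as_str)
print(joined)
```

In the C code this suggests creating a one-space str object (for example with PyUnicode_FromString(" ")) and passing the result of PyUnicode_Join(separator, pSentence) to word_tokenize in place of str_sentence. That change has not been tested here.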