On 2022-03-07 00:32, Jen Kris via Python-list wrote:
I am using the C API in Python 3.8 with the nltk library, and I have a problem 
with the return from a library call implemented with 
PyObject_CallFunctionObjArgs.

This is the relevant Python code:

import nltk
from nltk.corpus import gutenberg
fileids = gutenberg.fileids()
sentences = gutenberg.sents(fileids[0])
sentence = sentences[0]
sentence = " ".join(sentence)
pt = nltk.word_tokenize(sentence)

I run this at the Python command prompt to show how it works:
>>> sentence = " ".join(sentence)
>>> pt = nltk.word_tokenize(sentence)
>>> print(pt)
['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']']
>>> type(pt)
<class 'list'>

This is the relevant part of the C API code:

PyObject* str_sentence = PyObject_Str(pSentence);
// nltk.word_tokenize(sentence)
PyObject* pNltk_WTok = PyObject_GetAttrString(pModule_mstr, "word_tokenize");
PyObject* pWTok = PyObject_CallFunctionObjArgs(pNltk_WTok, str_sentence, NULL);

(where pModule_mstr is the nltk library).

That should produce a list with a length of 7 that looks like it does on the 
command line version shown above:

['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']']

But instead the C API produces a list with a length of 24, and its repr looks
like this:

'[\'[\', "\'", \'[\', "\'", \',\', "\'Emma", "\'", \',\', "\'by", "\'", \',\', "\'Jane", "\'", \',\', "\'Austen", "\'", 
\',\', "\'1816", "\'", \',\', "\'", \']\', "\'", \']\']'

I also tried this with PyObject_CallMethodObjArgs and PyObject_Call without 
success.

Thanks for any help on this.

What is pSentence? Is it what you think it is?
To me it looks like it's either the list:

    ['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']']

or that list as a string:

    "['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']']"

and that's what you're tokenising.
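
One way to check this without nltk: PyObject_Str(o) is the equivalent of
str(o) in Python, so if pSentence is still the list of words (an assumption,
since the code that builds pSentence isn't shown), str_sentence holds the
list's repr, not the joined sentence. A minimal sketch, with the word list
copied from the output quoted above:

```python
# Words of the first sentence, as shown in the original post.
words = ['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']']

# Python-level equivalent of PyObject_Str(pSentence), ASSUMING pSentence
# is the list: the result keeps the brackets, quotes and commas.
as_str = str(words)

# What the Python version actually passes to nltk.word_tokenize().
joined = " ".join(words)

print(as_str)   # ['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']']
print(joined)   # [ Emma by Jane Austen 1816 ]
```

If that is the cause, the C side needs the join step as well, for example
PyUnicode_Join() with a " " separator, before calling word_tokenize.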
--
https://mail.python.org/mailman/listinfo/python-list
