This is a follow-on to a question I asked yesterday, which was answered by 
MRAB.   I'm using the Python C API to load the Gutenberg corpus from the nltk 
library and iterate through the sentences.  The Python code I am trying to 
replicate is:

from nltk.corpus import gutenberg
for i, fileid in enumerate(gutenberg.fileids()):
    sentences = gutenberg.sents(fileid)
    # etc.

I have everything finished down to the last line (sentences = 
gutenberg.sents(fileid)), where I use PyObject_Call to call gutenberg.sents, 
but it segfaults.  The fileid is a string -- the first fileid in this corpus is 
"austen-emma.txt".

pName = PyUnicode_FromString("nltk.corpus");
pModule = PyImport_Import(pName);

pSubMod = PyObject_GetAttrString(pModule, "gutenberg");
pFidMod = PyObject_GetAttrString(pSubMod, "fileids");
pSentMod = PyObject_GetAttrString(pSubMod, "sents");

pFileIds = PyObject_CallObject(pFidMod, 0);
pListItem = PyList_GetItem(pFileIds, listIndex);
pListStrE = PyUnicode_AsEncodedString(pListItem, "UTF-8", "strict");
pListStr = PyBytes_AS_STRING(pListStrE);
Py_DECREF(pListStrE);

// sentences = gutenberg.sents(fileid)
PyObject *c_args = Py_BuildValue("s", pListStr);  
PyObject *NullPtr = 0;
pSents = PyObject_Call(pSentMod, c_args, NullPtr);

The final line segfaults:
Program received signal SIGSEGV, Segmentation fault.
0x00007ffff6e4e8d5 in _PyEval_EvalCodeWithName ()
   from /usr/lib/x86_64-linux-gnu/libpython3.8.so.1.0

My guess is the problem is in Py_BuildValue: it returns a non-null pointer, but 
the object it builds may not be what PyObject_Call expects.  I also tried it 
with the "O" format; that doesn't segfault, but it returns 0x0 (NULL). 
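
A quick way to check that guess without rebuilding the C program is to call 
Py_BuildValue from Python through ctypes and look at what each format string 
produces.  This is only a sketch under stated assumptions: fake_sents is a 
hypothetical stand-in for gutenberg.sents so that nltk is not required, and 
the "(s)" format is shown for comparison, not as a confirmed fix.  The C API 
documentation says Py_BuildValue returns a tuple only when the format holds 
two or more units (or is parenthesized), and that PyObject_Call expects its 
args to be a tuple.

```python
import ctypes

# Py_BuildValue is variadic, so only the fixed format argument is declared;
# the extra argument is passed explicitly as a C string.
build_value = ctypes.pythonapi.Py_BuildValue
build_value.restype = ctypes.py_object
build_value.argtypes = [ctypes.c_char_p]

fileid = b"austen-emma.txt"

bare = build_value(b"s", ctypes.c_char_p(fileid))       # single unit
wrapped = build_value(b"(s)", ctypes.c_char_p(fileid))  # parenthesized unit

print(type(bare).__name__, repr(bare))
print(type(wrapped).__name__, repr(wrapped))

# PyObject_Call(callable, args, kwargs): args is expected to be a tuple,
# kwargs may be NULL (passed here as None through c_void_p).
object_call = ctypes.pythonapi.PyObject_Call
object_call.restype = ctypes.py_object
object_call.argtypes = [ctypes.py_object, ctypes.py_object, ctypes.c_void_p]


def fake_sents(name):
    """Hypothetical stand-in for gutenberg.sents (assumption, not nltk)."""
    return "sentences for " + name


# Only the tuple form is passed on; the bare str is deliberately not used
# as args, since that is the case suspected of crashing.
result = object_call(fake_sents, wrapped, None)
print(result)
```

If the first line printed shows a str and the second a tuple, then the "s" 
format in the C code is handing PyObject_Call a plain string where it expects 
an argument tuple, which would be one possible explanation for the crash.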

I'm new to using the C API.  Thanks for any help. 

Jen


