On Mon, 14 Nov 2005, enas khalil wrote:
> hello all

[program cut]

Hi Enas,

You may want to try talking with the NLTK folks about this, as what you're dealing with is a specialized subject. Also, have you gone through the tokenization tutorial at:

    http://nltk.sourceforge.net/tutorial/tokenization/nochunks.html#AEN276

and have you tried comparing your program to the ones in the tutorial's examples?

Let's look at the error message.

> File "F:\MSC first Chapters\unigramgtag1.py", line 14, in -toplevel-
>     for tok in train_tokens: mytagger.train(tok)
> File "C:\Python24\Lib\site-packages\nltk\tagger\__init__.py", line 324, in train
>     assert chktype(1, tagged_token, Token)
> File "C:\Python24\Lib\site-packages\nltk\chktype.py", line 316, in chktype
>     raise TypeError(errstr)
> TypeError:
>     Argument 1 to train() must have type: Token
>     (got a str)

This error message implies that each element in your train_tokens list is a string and not a token.

The 'train_tokens' variable gets its values in this block of code:

###########################################
train_tokens = []
xx=Token(TEXT=open('fataha2.txt').read())
WhitespaceTokenizer().tokenize(xx)
for l in xx:
    train_tokens.append(l)
###########################################

Ok. I see something suspicious here. The for loop:

######
for l in xx:
    train_tokens.append(l)
######

assumes that we get tokens from the 'xx' token. Is this true? Are you sure you don't have to specifically say:

######
for l in xx['SUBTOKENS']:
    ...
######

The example in the tutorial explicitly does something like this to iterate across the subtokens of a token. But what you're doing instead is to iterate across all the property names of a token, which is almost certainly not what you want.
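To make the difference concrete, here is a minimal sketch that runs without NLTK installed. It assumes a token behaves like a dictionary of properties, which is what the error above suggests. The Token class and the whitespace_tokenize() helper below are hypothetical stand-ins written for illustration; they are not the real NLTK classes, so check the tutorial for the actual API.

```python
# Stand-in for the old NLTK Token: a plain dict subclass. This is an
# assumption for illustration, not the real nltk Token class.
class Token(dict):
    pass


def whitespace_tokenize(token):
    """Hypothetical stand-in for WhitespaceTokenizer().tokenize(): stores
    one sub-token per whitespace-separated word under 'SUBTOKENS'."""
    token['SUBTOKENS'] = [Token(TEXT=word) for word in token['TEXT'].split()]


xx = Token(TEXT='the quick brown fox')
whitespace_tokenize(xx)

# Iterating over the token itself walks its property names, which are strings.
wrong = [item for item in xx]

# Iterating over xx['SUBTOKENS'] walks the actual sub-tokens.
train_tokens = [tok for tok in xx['SUBTOKENS']]

print(wrong)
print([type(tok).__name__ for tok in train_tokens])
```

In this sketch the first list holds only strings (the property names), which is the kind of value train() rejects with "got a str", while the second list holds Token objects.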