Hi,

Thanks for the confirmation. IM50re.txt is a plain text corpus. Say we want to count the words in this corpus. The NLTK book has an example:
>>> import nltk
>>> nltk.corpus.gutenberg.fileids()
['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt',
'bible-kjv.txt', 'blake-poems.txt', 'bryant-stories.txt',
'burgess-busterbrown.txt', 'carroll-alice.txt', 'chesterton-ball.txt',
'chesterton-brown.txt', 'chesterton-thursday.txt', 'edgeworth-parents.txt',
'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt',
'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt']

These are the texts that come with NLTK.

>>> emma = nltk.corpus.gutenberg.words('austen-emma.txt')
>>> len(emma)
192427

So this is the number of words in one particular text, 'austen-emma.txt'. How would I do this with my IM50re.txt? It seems the code "nltk.corpus.gutenberg.words" is specific to the Gutenberg corpus installed with NLTK. Many examples like this are given for the different analyses that can be done with NLTK. However, they all seem to be specific to one of the texts above or to another one already installed with NLTK. I am not sure how to apply these examples to my own corpus.

Thank you. You are my only source of help right now; I have been trying to figure this out all day.

________________________________
From: Kent Johnson <ken...@tds.net>
To: Ishan Puri <ballerz4i...@sbcglobal.net>
Cc: *tutor python <tutor@python.org>
Sent: Friday, August 28, 2009 7:03:15 PM
Subject: Re: [Tutor] NLTK

On Fri, Aug 28, 2009 at 7:29 PM, Ishan Puri<ballerz4i...@sbcglobal.net> wrote:
> Hi,
>>>> from nltk.corpus import PlaintextCorpusReader
>>>> corpus_root='C:\Users\Ishan\Documents'
>>>> wordlists = PlaintextCorpusReader(corpus_root, 'IM50re.txt')
>>>> wordlists.fileids()
> ['IM50re.txt']
>
> This is the result I get.

That seems to be working then. You should be able to get a list of words with
wordlists.words('IM50re.txt')

> I was wondering how I can use the packages on
> IM50re.txt? I followed successfully the steps detailed under Using Your Own
> Corpus. What do I do next, say, if I wanted to use the lemmatizer on this
> .txt document?

I have no idea. Is IM50re.txt a plain text corpus? What is a package? What is a lemmatizer? I don't know anything about NLTK, I'm just good at reading manuals. You have to give me more help than that. What have you tried? Can you find an example that is similar to what you want to do? Don't assume I know what you are talking about :-)

Kent
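
A minimal sketch of how the Gutenberg word count might carry over to a file of your own, assuming NLTK is installed. PlaintextCorpusReader and words() are the calls already used in this thread; the helper name count_words and the demo.txt stand-in file are made up for illustration and are not part of NLTK.

```python
import os
import tempfile


def count_words(corpus_root, fileid):
    """Return (token count, distinct lowercased token count) for one file.

    count_words is a hypothetical helper, not an NLTK function.
    """
    # Imported inside the function so this sketch can still be loaded
    # on a machine where NLTK is not installed.
    from nltk.corpus import PlaintextCorpusReader

    reader = PlaintextCorpusReader(corpus_root, [fileid])
    # words() plays the same role here as nltk.corpus.gutenberg.words()
    # does for the bundled texts.
    words = reader.words(fileid)
    return len(words), len(set(w.lower() for w in words))


if __name__ == "__main__":
    # Stand-in corpus so the sketch runs anywhere; for the real case,
    # pass your own directory and 'IM50re.txt' instead.
    demo_dir = tempfile.mkdtemp()
    with open(os.path.join(demo_dir, "demo.txt"), "w") as f:
        f.write("The cat sat. The cat ran.")
    print(count_words(demo_dir, "demo.txt"))
```

Note that with the reader's default tokenizer, punctuation marks count as tokens too, so the total is a token count rather than a strict word count.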
_______________________________________________
Tutor maillist - Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor