Re: [Tutor] NLTK

Ishan Puri Fri, 28 Aug 2009 19:24:21 -0700

Hi,
    Thanks for the confirmation. IM50re.txt is a plain text corpus. Say we
want to count the words in this corpus. The NLTK book has an example of this:

>>> import nltk
>>> nltk.corpus.gutenberg.fileids()
['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt',
'bible-kjv.txt', 'blake-poems.txt', 'bryant-stories.txt',
'burgess-busterbrown.txt', 'carroll-alice.txt', 'chesterton-ball.txt',
'chesterton-brown.txt', 'chesterton-thursday.txt', 'edgeworth-parents.txt',
'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt',
'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt']

These are the texts that come with NLTK.

>>> emma = nltk.corpus.gutenberg.words('austen-emma.txt')
>>> len(emma)
192427

So this is the number of words in 'austen-emma.txt'. How would I do this with
my IM50re.txt? The call "nltk.corpus.gutenberg.words" seems to be specific to
the Gutenberg corpus installed with NLTK. The book gives many examples like
this for the different analyses that can be done with NLTK, but they all seem
to be tied to one of the texts above or to another corpus already installed
with NLTK. I am not sure how to apply these examples to my own corpus.

        Thank you. You are my only source of help right now; I have been trying
to figure this out all day.

________________________________
From: Kent Johnson <ken...@tds.net>
To: Ishan Puri <ballerz4i...@sbcglobal.net>
Cc: *tutor python <tutor@python.org>
Sent: Friday, August 28, 2009 7:03:15 PM
Subject: Re: [Tutor] NLTK

On Fri, Aug 28, 2009 at 7:29 PM, Ishan Puri<ballerz4i...@sbcglobal.net> wrote:
> Hi,
>>>> from nltk.corpus import PlaintextCorpusReader
>>>> corpus_root='C:\Users\Ishan\Documents'
>>>> wordlists = PlaintextCorpusReader(corpus_root, 'IM50re.txt')
>>>> wordlists.fileids()
> ['IM50re.txt']
>
> This is the result I get.

That seems to be working then. You should be able to get a list of words with
wordlists.words('IM50re.txt')
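
A minimal sketch of how that word list could be counted, assuming the corpus
reader is built the same way as in the quoted session. The helper names
load_words and summarize are hypothetical, not part of NLTK. As far as I can
tell, the reader's default tokenizer treats punctuation marks as tokens too, so
the total is a token count rather than a strict word count.

```python
from collections import Counter


def load_words(corpus_root, fileid):
    """Return the tokens of one file in a plain-text corpus.

    NLTK is imported inside the function so that the counting helper
    below can be used (and tested) without NLTK installed.
    """
    from nltk.corpus import PlaintextCorpusReader

    reader = PlaintextCorpusReader(corpus_root, [fileid])
    return reader.words(fileid)


def summarize(words):
    """Return (total tokens, distinct tokens, Counter of token frequencies)."""
    counts = Counter(words)
    return sum(counts.values()), len(counts), counts


# Hypothetical usage; the path and file name are the ones from this thread.
# The raw string (r'...') keeps the backslashes literal on any Python version.
#
#   words = load_words(r'C:\Users\Ishan\Documents', 'IM50re.txt')
#   total, distinct, counts = summarize(words)
#   print(total, distinct, counts.most_common(10))
```

The total here should match what len() reports on the same word list, which is
what the austen-emma.txt example in the NLTK book does.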

> I was wondering how I can use the packages on
> IM50re.txt? I followed successfully the steps detailed under Using Your Own
> Corpus. What do I do next, say, if I wanted to use the lemmatizer on this
> .txt document?

I have no idea. Is IM50re.txt a plain text corpus? What is a package?
What is a lemmatizer?

I don't know anything about NLTK, I'm just good at reading manuals.
You have to give me more help than that. What have you tried? Can you
find an example that is similar to what you want to do? Don't assume I
know what you are talking about :-)

Kent

_______________________________________________
Tutor maillist  -  Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor
