Finding keywords

2011-03-07 Thread Cross

Hello

I have got a project in which I have to extract keywords given a URL. I would 
like to know methods for extraction of keywords. Frequency of occurrence is one;
but it seems naive. I would prefer something more robust. Please suggest.


Regards
Cross

--- news://freenews.netfront.net/ - complaints: n...@netfront.net ---
--
http://mail.python.org/mailman/listinfo/python-list


Re: Finding keywords

2011-03-08 Thread Cross

On 03/08/2011 01:27 PM, Chris Rebert wrote:


Complaint: This question is not Python-specific in any way.

Regards,
Chris


Well Chris, my implementation is in Python. :) That is as Python-specific
as it gets.


Well, the question is general of course, and I want to discuss the problem here.



Re: Finding keywords

2011-03-08 Thread Cross

On 03/08/2011 06:09 PM, Heather Brown wrote:


The keywords are an attribute in a tag called <meta>, in the section called
<head>. Are you having trouble parsing the xhtml to that point?

Be more specific in your question, and somebody is likely to chime in. Although
I'm not the one, if it's a question of parsing the xhtml.

DaveA

I know meta tags contain keywords, but they are not always reliable. I can parse
xhtml to obtain keywords from meta tags, but how do I verify them? To obtain
reliable keywords, I have to parse the plain text obtained from the URL.


Cross
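
One way to sanity-check meta keywords, sketched here under assumptions that do
not come from the thread: collect the content of <meta name="keywords">, gather
the visible text, and count how often each declared keyword actually occurs in
that text. The names KeywordPageParser and check_meta_keywords are made up for
illustration, and only the Python 3 standard library is used.

```python
import re
from html.parser import HTMLParser


class KeywordPageParser(HTMLParser):
    """Collect <meta name="keywords"> entries and the visible page text."""

    def __init__(self):
        super().__init__()
        self.meta_keywords = []
        self.text_parts = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "keywords":
            content = attrs.get("content") or ""
            self.meta_keywords += [
                k.strip().lower() for k in content.split(",") if k.strip()
            ]
        elif tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Ignore script and style bodies; keep everything else as text.
        if not self._skip_depth:
            self.text_parts.append(data)


def check_meta_keywords(html):
    """Return {keyword: number of whole-word occurrences in the page text}."""
    parser = KeywordPageParser()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.text_parts).lower()
    return {
        kw: len(re.findall(r"\b" + re.escape(kw) + r"\b", text))
        for kw in parser.meta_keywords
    }
```

A declared keyword whose count comes back as zero never appears in the page
text, so it would be a candidate for rejection. Where to draw the threshold is
a judgment call that this sketch does not settle.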



Re: Finding keywords

2011-03-09 Thread Cross

On 03/09/2011 01:21 AM, Vlastimil Brom wrote:


Hi,
if you need to extract meaningful keywords in terms of data mining
using natural language processing, it might become quite a complex
task, depending on the requirements; the NLTK toolkit may help with
some approaches [ http://www.nltk.org/ ].
One possibility would be to filter out very frequent but less
meaningful words ("stopwords") and extract the more frequent words
from the remainder, e.g. (with some simplifications/hacks in the
interactive mode):


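# Python 2 session (urllib2); nltk.clean_html() was dropped in NLTK 3
import re, urllib2, nltk
# fetch the page and decode it to unicode
page_src = urllib2.urlopen("http://www.python.org/doc/essays/foreword/").read().decode("utf-8")
# strip the markup and lowercase the remaining text
page_plain = nltk.clean_html(page_src).lower()
# keep only the words that are not English stopwords
stopwords = set(nltk.corpus.stopwords.words("english"))
txt_filtered = nltk.Text(word for word in re.findall(r"(?u)\w+", page_plain) if word not in stopwords)
# count the remaining words and show those occurring more than twice
frequency_dist = nltk.FreqDist(txt_filtered)
[(word, freq) for (word, freq) in frequency_dist.items() if freq > 2]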

[(u'python', 39), (u'abc', 11), (u'code', 10), (u'c', 7),
(u'language', 7), (u'programming', 7), (u'unix', 7), (u'foreword', 5),
(u'new', 5), (u'would', 5), (u'1st', 4), (u'book', 4), (u'ed', 4),
(u'features', 4), (u'many', 4), (u'one', 4), (u'programmer', 4),
(u'time', 4), (u'use', 4), (u'community', 3), (u'documentation', 3),
(u'early', 3), (u'enough', 3), (u'even', 3), (u'first', 3), (u'help',
3), (u'indentation', 3), (u'instance', 3), (u'less', 3), (u'like', 3),
(u'makes', 3), (u'personal', 3), (u'programmers', 3), (u'readability',
3), (u'readable', 3), (u'write', 3)]




Another possibility would be to extract parts of speech (e.g. nouns,
adjectives, verbs) using e.g. nltk.pos_tag(input_txt) etc.;
for more convoluted HTML code, e.g. BeautifulSoup might be used, and
there are likely many other options.

hth,
   vbr

I had considered nltk. That is why I said that straightforward frequency
calculation of words would be naive. I have to look into this BeautifulSoup thing.
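
The stopword-filtering idea above can also be sketched without NLTK, in
Python 3 with only the standard library. This is a rough illustration, not a
replacement for the NLTK session: the stopword list is a tiny hand-picked
sample, the function name top_keywords and its min_count parameter are
hypothetical, and the input is assumed to be plain text already stripped of
markup.

```python
import re
from collections import Counter

# A tiny illustrative stopword list (an assumption for this sketch); a real
# list such as NLTK's English stopwords corpus is much longer.
STOPWORDS = frozenset(
    """a an and are as at be but by for from has have i in is it
    of on or that the this to was were with""".split()
)


def top_keywords(plain_text, min_count=2):
    """Return (word, count) pairs for non-stopwords seen at least min_count times."""
    words = re.findall(r"\w+", plain_text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    # most_common() sorts by descending count
    return [(w, n) for w, n in counts.most_common() if n >= min_count]


if __name__ == "__main__":
    sample = ("Python is readable. The readable code is the point of "
              "Python, and Python is fun.")
    print(top_keywords(sample))
```

Raw counts like these still favour whatever a page repeats most, so they are
only a starting point for the more robust ranking asked about earlier in the
thread.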




Re: [Python-Dev] compiling python2.5 on linux under wine

2009-01-08 Thread Simon Cross
On Sat, Jan 3, 2009 at 11:22 PM, Luke Kenneth Casson Leighton wrote:
> secondly, i want a python25.lib which i can use to cross-compile
> modules for poor windows users _despite_ sticking to my principles and
> keeping my integrity as a free software developer.

If this eventually leads to being able to compile Python software for
Windows under Wine (using for example, py2exe) it would make my life a
lot easier.

Schiavo
Simon


Re: [ctpug] Introducing Kids to Programming: 2 or 3?

2010-09-27 Thread Simon Cross
On Mon, Sep 27, 2010 at 5:48 PM, Marco Gallotta wrote:
> We received a grant from Google to reach 1,000 kids in South Africa
> with our course in 2011. People have also shown interest in running
> the course in Croatia, Poland and Egypt. We're also eyeing developing
> African countries in the long-term. As such, we're taking the time now
> to write our very own specialised course notes and exercises, and
> this is why we need to decide *now* which path to take: 2 or 3? As we
> will be translating the notes we'll probably stick with our choice for
> the next few years.

If you were going to start running the course tomorrow I'd suggest
sticking with Python 2. Python 3 ports are rapidly becoming available
but few have had the bugs shaken out of them yet. In three or four
months I expect that the important bugs will have been dealt with.
Given that 2.x will not receive any new features, I think it is
effectively dead.

I would explicitly mention the existence of 2.7 and 3.2 [1] to
students (perhaps near the end of the first day or whenever they're
about to go off and download Python for themselves).

One caveat is that web applications may only start to migrate to 3.x
late next year. There are a number of reasons for this. First, it's not
yet clear what form the WSGI standard will take under Python 3 (and if
3.2 is released before this decision is made, it will effectively have
to wait for 3.3 to be included). Secondly, the software stack involved
is quite deep in some places. For example, database support might
require porting MySQLdb, then SQLAlchemy, then the web framework and
only after that the web application itself.

[1] Which should hopefully make it out before 2011. :)

Schiavo
Simon