Finding keywords
Hello, I have a project in which I have to extract keywords given a URL. I would like to know methods for extracting keywords. Frequency of occurrence is one, but it seems naive. I would prefer something more robust. Please suggest. Regards, Cross --- news://freenews.netfront.net/ - complaints: n...@netfront.net --- -- http://mail.python.org/mailman/listinfo/python-list
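A common step up from raw frequency counts is TF-IDF weighting, which scores a word highly only if it is frequent in one document but rare across the rest, so generic words score low without a hand-made stopword list. A minimal pure-Python sketch (the `tf_idf` helper and the sample documents are illustrative, not from any library):

```python
import math

def tf_idf(docs):
    """Score each word in each document by term frequency times
    inverse document frequency; words shared by many documents
    are down-weighted."""
    n = len(docs)
    # document frequency: in how many documents does each word appear?
    df = {}
    for doc in docs:
        for word in set(doc):
            df[word] = df.get(word, 0) + 1
    scores = []
    for doc in docs:
        # raw term frequency within this document
        tf = {}
        for word in doc:
            tf[word] = tf.get(word, 0) + 1
        scores.append({w: (c / len(doc)) * math.log(n / df[w])
                       for w, c in tf.items()})
    return scores

docs = [
    "python is a programming language".split(),
    "python code is readable code".split(),
    "the language features readable indentation".split(),
]
scores = tf_idf(docs)
# "code" outranks "python" in the second document, because "python"
# also appears in the first document while "code" is unique here
top = max(scores[1], key=scores[1].get)
```

The point of the toy example: "python" is the most frequent word overall, yet for the second document TF-IDF promotes "code", which is distinctive to it.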
Re: Finding keywords
On 03/08/2011 01:27 PM, Chris Rebert wrote: Complaint: This question is not Python-specific in any way. Regards, Chris Well Chris, my implementation is in Python. :) That is as much Python-specific as it gets. The question is general, of course, and I want to discuss the problem here.
Re: Finding keywords
On 03/08/2011 06:09 PM, Heather Brown wrote: The keywords are an attribute in a tag called <meta>, in the section called <head>. Are you having trouble parsing the xhtml to that point? Be more specific in your question, and somebody is likely to chime in. Although I'm not the one, if it's a question of parsing the xhtml. DaveA I know meta tags contain keywords, but they are not always reliable. I can parse xhtml to obtain keywords from meta tags; but how do I verify them? To obtain reliable keywords, I have to parse the plain text obtained from the URL. Cross
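Pulling the declared keywords out of the <meta> tag is straightforward with the standard library alone; one sketch using `html.parser` (the `MetaKeywordsParser` class and the sample page are illustrative, not an established recipe):

```python
from html.parser import HTMLParser

class MetaKeywordsParser(HTMLParser):
    """Collect the comma-separated values of <meta name="keywords" content="...">."""
    def __init__(self):
        super().__init__()
        self.keywords = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs, names lowercased
        d = dict(attrs)
        if tag == "meta" and d.get("name", "").lower() == "keywords":
            self.keywords = [k.strip() for k in d.get("content", "").split(",")]

page = ('<html><head>'
        '<meta name="keywords" content="python, parsing, keywords">'
        '</head><body>...</body></html>')
parser = MetaKeywordsParser()
parser.feed(page)
# parser.keywords is now ['python', 'parsing', 'keywords']
```

As noted above, though, these declared keywords are whatever the page author chose to write, so verifying them still requires comparing against the page's actual text.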
Re: Finding keywords
On 03/09/2011 01:21 AM, Vlastimil Brom wrote: 2011/3/8 Cross: > I know meta tags contain keywords but they are not always reliable. I can > parse xhtml to obtain keywords from meta tags; but how do I verify them? > To obtain reliable keywords, I have to parse the plain text obtained from > the URL. > Cross Hi, if you need to extract meaningful keywords in terms of data mining using natural language processing, it might become quite a complex task, depending on the requirements; the NLTK toolkit may help with some approaches [ http://www.nltk.org/ ]. One possibility would be to filter out the frequent but less meaningful words ("stopwords") and extract the more frequent words from the remainder, e.g.
(with some simplifications/hacks in the interactive mode):

    import re, urllib2, nltk
    page_src = urllib2.urlopen("http://www.python.org/doc/essays/foreword/").read().decode("utf-8")
    page_plain = nltk.clean_html(page_src).lower()
    stopwords = set(nltk.corpus.stopwords.words("english"))
    txt_filtered = nltk.Text(word for word in re.findall(r"(?u)\w+", page_plain)
                             if word not in stopwords)
    frequency_dist = nltk.FreqDist(txt_filtered)
    [(word, freq) for (word, freq) in frequency_dist.items() if freq > 2]

    [(u'python', 39), (u'abc', 11), (u'code', 10), (u'c', 7), (u'language', 7), (u'programming', 7), (u'unix', 7), (u'foreword', 5), (u'new', 5), (u'would', 5), (u'1st', 4), (u'book', 4), (u'ed', 4), (u'features', 4), (u'many', 4), (u'one', 4), (u'programmer', 4), (u'time', 4), (u'use', 4), (u'community', 3), (u'documentation', 3), (u'early', 3), (u'enough', 3), (u'even', 3), (u'first', 3), (u'help', 3), (u'indentation', 3), (u'instance', 3), (u'less', 3), (u'like', 3), (u'makes', 3), (u'personal', 3), (u'programmers', 3), (u'readability', 3), (u'readable', 3), (u'write', 3)]

Another possibility would be to extract parts of speech (e.g. nouns, adjectives, verbs) using e.g. nltk.pos_tag(input_txt) etc.; for more convoluted HTML, e.g. BeautifulSoup might be used, and there are likely many other options. hth, vbr

I had considered nltk. That is why I said that straightforward frequency calculation of words would be naive. I have to look into this BeautifulSoup thing.
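The listing above is Python 2 era (urllib2, and nltk.clean_html, which later NLTK releases removed). The same stopword-filtered frequency idea can be sketched with only the Python 3 standard library: strip the markup with html.parser, then count with collections.Counter. The `TextExtractor` class, the toy stopword list, and the sample page are all illustrative assumptions, not parts of any established API:

```python
import re
from collections import Counter
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect a page's visible text, skipping <script> and <style> bodies."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0          # depth inside script/style elements

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)

# a deliberately tiny toy stopword list; a real one (e.g. NLTK's) is far larger
STOPWORDS = {"the", "a", "an", "is", "are", "of", "to", "and", "in", "it"}

def keywords(html, min_freq=2):
    """Return (word, count) pairs for non-stopwords occurring >= min_freq times."""
    extractor = TextExtractor()
    extractor.feed(html)
    words = re.findall(r"\w+", " ".join(extractor.parts).lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return [(w, c) for w, c in counts.most_common() if c >= min_freq]

html = ("<html><head><style>body { color: red }</style></head>"
        "<body><p>Python is readable. Python code is readable code.</p></body></html>")
result = keywords(html)
# result is [('python', 2), ('readable', 2), ('code', 2)];
# 'is' is dropped as a stopword and the <style> body never reaches the counts
```

For fetching the page itself, `urllib.request.urlopen` is the Python 3 counterpart of the urllib2 call above.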
Re: [Python-Dev] compiling python2.5 on linux under wine
On Sat, Jan 3, 2009 at 11:22 PM, Luke Kenneth Casson Leighton wrote: > secondly, i want a python25.lib which i can use to cross-compile > modules for poor windows users _despite_ sticking to my principles and > keeping my integrity as a free software developer. If this eventually leads to being able to compile Python software for Windows under Wine (using, for example, py2exe) it would make my life a lot easier. Schiavo Simon
Re: [ctpug] Introducing Kids to Programming: 2 or 3?
On Mon, Sep 27, 2010 at 5:48 PM, Marco Gallotta wrote: > We received a grant from Google to reach 1,000 kids in South Africa > with our course in 2011. People have also shown interest in running > the course in Croatia, Poland and Egypt. We're also eyeing developing > African countries in the long-term. As such, we're taking the time now > to write our very own specialised course notes and exercises, and > this is why we need to decide *now* which path to take: 2 or 3? As we > will be translating the notes we'll probably stick with our choice for > the next few years. If you were going to start running the course tomorrow I'd suggest sticking with Python 2. Python 3 ports are rapidly becoming available but few have had the bugs shaken out of them yet. In three or four months I expect that the important bugs will have been dealt with. Given that 2.x will not receive any new features, I think it is effectively dead. I would explicitly mention the existence of 2.7 and 3.2 [1] to students (perhaps near the end of the first day, or whenever they're about to go off and download Python for themselves). One caveat is that web applications may only start to migrate to 3.x late next year. There are a number of reasons for this. First, it's not yet clear what form the WSGI standard will take under Python 3 (and if 3.2 is released before this decision is made, it will effectively have to wait for 3.3 to be included). Secondly, the software stack involved is quite deep in some places. For example, database support might require porting MySQLdb, then SQLAlchemy, then the web framework, and only after that the web application itself. [1] Which should hopefully make it out before 2011. :) Schiavo Simon