hi simon....

like a hole in my head!!!!

what i really need is a way to recursively crawl through a site, and to
be able to selectively handle the 'form' elements on a given page.
i.e., if i visually analyze a site and determine that the 1st-level page
has a form, and i need to set the first two elements of that form, i'd
like to be able to apply those settings when i crawl through the site,
as opposed to submitting every potential combination of the form, or
only ever using the form's defaults...

i'd also like to be able to handle forms at lower levels in a similar
manner. this should allow the app to return all the required pages.
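to make the idea concrete, here's a rough sketch in python (purely
illustrative -- the preset table, paths, and field names are all made
up; a real version would hook into whatever fetcher you use). the point
is just a per-page table of form-field presets that get merged over the
form's defaults at crawl time:

```python
from html.parser import HTMLParser
from urllib.parse import urlencode

# Per-page presets: which form fields to pin when crawling a given page.
# Everything in this table is hypothetical -- adapt it to the real site.
FORM_PRESETS = {
    "/registrar/search": {"term": "FALL2006", "dept": "CS"},
}

class FormScraper(HTMLParser):
    """Collect each form's action and its input fields with defaults."""
    def __init__(self):
        super().__init__()
        self.forms = []          # list of (action, {name: default})
        self._current = None

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "form":
            self._current = (a.get("action", ""), {})
            self.forms.append(self._current)
        elif tag == "input" and self._current is not None:
            name = a.get("name")
            if name:
                self._current[1][name] = a.get("value", "")

def build_submission(page_path, html):
    """Merge the page's preset over the form defaults; return (action, body)."""
    scraper = FormScraper()
    scraper.feed(html)
    if not scraper.forms:
        return None
    action, fields = scraper.forms[0]      # first form on the page
    fields.update(FORM_PRESETS.get(page_path, {}))
    return action, urlencode(fields)

html = """<form action="/registrar/results">
            <input name="term" value="">
            <input name="dept" value="">
            <input name="format" value="html">
          </form>"""
print(build_submission("/registrar/search", html))
```

fields not named in the preset keep whatever default the page ships, so
you only ever describe the handful of inputs you actually care about.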

i could then create the parser(s) to extract information from each
page, although it would be better to be able to manage the data
extraction from the DOM within the crawling application itself...
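the "tell it which DOM elements matter" step could be as simple as a
declarative rule table applied while crawling. a toy sketch, again in
python with made-up tag/class names you'd fill in after eyeballing each
site:

```python
from html.parser import HTMLParser

# Extraction rules keyed by (tag, class) -> output field name.
# The (tag, class) pairs here are hypothetical placeholders.
RULES = {("td", "course-id"): "course", ("td", "title"): "title"}

class RuleExtractor(HTMLParser):
    """Pull the text of any element matching one of the (tag, class) rules."""
    def __init__(self, rules):
        super().__init__()
        self.rules = rules
        self.records = {}        # field name -> list of extracted strings
        self._field = None       # rule matched by the currently open tag

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class", "")
        self._field = self.rules.get((tag, cls))

    def handle_data(self, data):
        if self._field:
            self.records.setdefault(self._field, []).append(data.strip())

    def handle_endtag(self, tag):
        self._field = None

ex = RuleExtractor(RULES)
ex.feed('<tr><td class="course-id">CS101</td><td class="title">Intro</td></tr>')
print(ex.records)   # {'course': ['CS101'], 'title': ['Intro']}
```

the crawler would run one of these per page type, so adding a new site
means writing a new rule table rather than a new perl script.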

-bruce


-----Original Message-----
From: Simon Courtenage [mailto:[EMAIL PROTECTED]
Sent: Thursday, June 22, 2006 2:08 PM
To: java-user@lucene.apache.org
Subject: Re: lucene in combination with pattern recognition...


You might also check out an old paper by Kruger, Giles, Lawrence et al.
on a search engine called Deadliner (see
http://clgiles.ist.psu.edu/papers/CIKM-2000-deadliner.pdf).
Deadliner crawled for Calls for Papers for conferences, using Support
Vector Machines trained to recognise relevant pages, and then applying
sets of regular expressions to extract information from the CFP pages.
Lawrence is now with Google, I believe.

Hope this helps,

Simon


Bob Carpenter wrote:
> Check out Andrew McCallum's paper:
>
> http://www.cs.umass.edu/~mccallum/papers/acm-queue-ie.pdf
>
> It mentions this very problem.  There are
> also some more technical presentations around.
>
> He was part of the WhizBang! Labs team that took
> on the problem.  The fact that the company's
> out of business is a testament to how hard
> this problem is in general.
>
> - Bob Carpenter
>   Alias-i
>
>>
>> i'm looking at a problem and i can't figure out how to "easily" solve
>> it...
>>
>> basically, i'm trying to figure out if there's a way to use lucene/nutch
>> with some form of pattern matching to extract course information from a
>> College/Registrar's course section...
>>
>> Assume I can point to a Registrar's section of a College site.
>> Assume I can then crawl through the section, and capture
>>  all the underlying information, including the Course
>>  information...
>> Is there a way to somehow use pattern matching/recognition
>>  to somehow interpret the DOM to pull out the class schedule
>>  information. I'm pretty sure there's no vanilla approach,
>>  so I'd even consider some kind of solution where I might
>>  have to initially evaluate/analyze the site, to tell it
>>  what DOM elements are "important"...
>>
>> anyone done any work/projects like this...
>> any research/papers/sample apps i could look at...
>> any thoughts/comments/etc....
>>
>> i could brute force this by writing a bunch of perl
>> scripts, with each script tied to a given registrar site,
>> but i'd like a more generalizable solution if one exists..
>>
>> thanks
>>
>> -bruce
>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [EMAIL PROTECTED]
>> For additional commands, e-mail: [EMAIL PROTECTED]
>


--
Dr. Simon Courtenage
Software Systems Engineering Research Group
Dept. of Software Engineering, Cavendish School of Computer Science
University of Westminster, London, UK
Email: [EMAIL PROTECTED]   Web: http://users.cscs.wmin.ac.uk/~courtes |
http://www.sse.wmin.ac.uk



