Hi Dermot/Ray,
Please check out the MRS system (mrs.cmbi.ru.nl). It has a SOAP
interface for Perl and other languages, and is extremely fast at
indexing and retrieval. MRS is a generic tool: you can index your own
data, but you can also download already-indexed bio-databanks. The
source code is in C++ and is available as well.
Teaching material is available but tailored towards biologists.
Best,
Marc
Raymond Wan wrote:
Hi Dermot,
Off-topic, so I hope no one minds if I reply.
Perl is good at manipulating text strings, but that doesn't usually
help search engine implementations. A search engine (or information
retrieval system) has to be fast, and once it has tokenized the
document collection or query, you're basically comparing integers
(i.e., a lookup table maps each word in the dictionary to an
integer). Actually, even during the initial mapping, a C-style
strcmp would be sufficient. I doubt a fast search engine would actually
perform string matching using regular expressions.
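As a rough sketch of what I mean by "comparing integers" (Python is
used here only to keep it short, and the names tokenize and TinyIndex
are made up for illustration; this is not how any particular engine is
written): words are turned into integer IDs once, at indexing time, and
a query is then answered by intersecting sets of integers.

```python
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens (no regexes)."""
    tokens, word = [], []
    for ch in text.lower():
        if ch.isalnum():
            word.append(ch)
        elif word:
            tokens.append("".join(word))
            word = []
    if word:
        tokens.append("".join(word))
    return tokens


class TinyIndex:
    """Hypothetical toy inverted index: word -> term ID -> document IDs."""

    def __init__(self):
        self.term_ids = {}                # word -> integer term ID
        self.postings = defaultdict(set)  # term ID -> set of document IDs

    def add(self, doc_id, text):
        for word in tokenize(text):
            # Strings are only handled here, during the initial mapping.
            term_id = self.term_ids.setdefault(word, len(self.term_ids))
            self.postings[term_id].add(doc_id)

    def search(self, query):
        """Return the sorted IDs of documents that contain every query term."""
        ids = []
        for word in tokenize(query):
            term_id = self.term_ids.get(word)
            if term_id is None:
                return []  # unknown word, so no document can match
            ids.append(term_id)
        if not ids:
            return []
        # From here on, only integers are compared.
        return sorted(set.intersection(*(self.postings[t] for t in ids)))


index = TinyIndex()
index.add(1, "Perl is good at manipulating text strings")
index.add(2, "A search engine has to be fast")
index.add(3, "A fast Perl search engine")
print(index.search("perl fast"))  # prints [3]
```

A real system would keep the postings in compressed, sorted lists on
disk rather than in in-memory sets, but the point is the same: after
tokenization, no string matching is involved.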
Of course, a Perl implementation might be interesting as a learning
tool for students. But as an IR system that is supposed to be run in
the "real world" and not in the classroom...I don't think you will
see a Perl system anytime soon. I think if you wrote quick Perl and
C/C++ implementations that merely tokenize a collection (let's say in
the range of GBs), you'll know what I am talking about. Of course, in
the classroom, a lecturer might just want the students to play with
something that is a MB or less...if so, I think Perl would be good and
students might even prefer it... :-)
Ray
Dermot wrote:
I am looking for a text search engine that has a Perl interface. I
have found a few: Lucene, OpenFTS and Swish-E. OpenFTS hasn't had a
release in the last 3 years. That makes me nervous about using it.
Lucene is Java-based. I have zero Java experience, but there is a Perl
module that wraps a 'C++ port API of Lucene'. There is also a thread on
perlmonks about the performance penalty of tying Perl to Java. I am a
bit surprised that there isn't a more native Perl text search
engine, given Perl's agility with text strings.
Could anyone recommend any of the above or suggest an alternative?