A while back I asked folks how they deployed Lucene in a PHP environment. This summarizes how we ended up doing it.
The responses to the initial question were quite helpful. Kelvin Tan asked "How about XML-RPC/SOAP, or REST?", pedja did a great job of presenting the PHP-Java-Bridge, and Maurits suggested a proxy approach. A great example of how useful this list is!
The solution we (http://redfish.com) chose was REST, i.e. build a servlet which provides access to the index, with a few bells and whistles unique to our application. The servlet is then accessed from PHP using the enhanced fopen(url, 'r'), which allows the filename to be a URL. The PHP code just reads in the search results line by line and makes them available as dynamic web pages. The PHP is used in two ways: first as a fairly standard text-search capability, and more creatively as the feed to a Flash graphical interface which lets you "fly" through the collection. The servlet itself emits only plain text, no HTML; it probably should be converted to emit XML.
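For the curious, here is a minimal sketch of what such a servlet can look like. The class name, field names, the "q" parameter, and the index path are illustrative, not our production code; the Lucene calls are the 1.4-era API:

    import java.io.IOException;
    import java.io.PrintWriter;
    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;

    public class SearchServlet extends HttpServlet {
        public void doGet(HttpServletRequest req, HttpServletResponse res)
                throws IOException {
            res.setContentType("text/plain");
            PrintWriter out = res.getWriter();
            try {
                IndexSearcher searcher = new IndexSearcher("/data/sfi/index");
                // Static parse() is the Lucene 1.4 API; later versions
                // use a QueryParser instance instead.
                Query query = QueryParser.parse(req.getParameter("q"),
                        "contents", new StandardAnalyzer());
                Hits hits = searcher.search(query);
                // One tab-separated hit per line -- trivial for the PHP
                // side to consume with fopen($url, 'r') and fgets().
                for (int i = 0; i < hits.length(); i++) {
                    Document doc = hits.doc(i);
                    out.println(doc.get("title") + "\t" + doc.get("url"));
                }
                searcher.close();
            } catch (Exception e) {
                res.sendError(HttpServletResponse.SC_INTERNAL_SERVER_ERROR,
                        e.getMessage());
            }
        }
    }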
We chose this approach because it fits into a broader desire of the client to form a general "institutional repository". Each group could have such a servlet exporting its data as a "web service" that others can listen in on. One example of a study this would enable: examining co-authorship (collaboration) in relation to the "events" group -- the folks putting on workshops and conferences. It would be interesting to see whether event attendees eventually increase their collaboration with others as a result of the event. We would link the events data with the working-papers data to see whether collaborations increase. Loosely coupled, tightly aligned.
The collection we're providing access to is a very innovative scientific set: 1200 working papers of the Santa Fe Institute. "Similarity" searching has proven very useful: a user looks at a document and can then ask for similar ones. Another extremely useful secondary search is for co-authors: search for all of the documents by a given author, collect all their collaborators, and provide that as the result.
These secondary searches are done with a general interface built on two searches: a primary search whose results are used as input to a second, batched search. So for co-author searches, we perform a primary search for an author, collect all their documents, and pull out the authors of each document. This list of authors forms the secondary search, which in effect returns all the documents whose authors have co-authored with the initial author.
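In Lucene terms, the co-author search boils down to something like the sketch below. Again illustrative: the "author" field name and the 1.4-era BooleanQuery.add(query, required, prohibited) signature are assumptions, not our exact code:

    import java.util.HashSet;
    import java.util.Iterator;
    import java.util.Set;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.TermQuery;

    public class CoauthorSearch {
        public static Hits coauthors(IndexSearcher searcher, String author)
                throws Exception {
            // Primary search: every document by the given author.
            Hits primary = searcher.search(
                    new TermQuery(new Term("author", author)));

            // Collect all the authors appearing on those documents.
            Set names = new HashSet();
            for (int i = 0; i < primary.length(); i++) {
                String[] all = primary.doc(i).getValues("author");
                for (int j = 0; all != null && j < all.length; j++) {
                    names.add(all[j]);
                }
            }

            // Secondary search: OR the collaborators together, returning
            // every document by anyone who co-authored with the original.
            BooleanQuery secondary = new BooleanQuery();
            for (Iterator it = names.iterator(); it.hasNext();) {
                secondary.add(
                        new TermQuery(new Term("author", (String) it.next())),
                        false, false); // neither required nor prohibited = OR
            }
            return searcher.search(secondary);
        }
    }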
This is extremely general and lets us perform a poor man's clustering. We find the documents most representative of a set of documents our client wants to use as a cluster, then apply the similarity searching above, with those representative documents as the primary search. The secondary search is much like the Lucene book's example: give the authors of the retrieved documents a boost of 2, and then tack on a search over all the relevant text terms.
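The boosted secondary query amounts to something like this sketch, where setBoost(2.0f) is the "boost of 2" and the field names are again hypothetical:

    import java.util.Iterator;
    import java.util.Set;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.TermQuery;

    public class ClusterQuery {
        // Build the secondary query from the authors and text terms
        // collected out of the primary search's hits.
        public static BooleanQuery build(Set authors, Set textTerms) {
            BooleanQuery query = new BooleanQuery();
            for (Iterator it = authors.iterator(); it.hasNext();) {
                TermQuery tq = new TermQuery(
                        new Term("author", (String) it.next()));
                tq.setBoost(2.0f); // authors count double
                query.add(tq, false, false);
            }
            for (Iterator it = textTerms.iterator(); it.hasNext();) {
                query.add(new TermQuery(
                        new Term("contents", (String) it.next())),
                        false, false);
            }
            return query;
        }
    }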
We wanted to provide additional examples of clustering, so we took some earlier work done by the institute's library and information-technology experts and created a second set of similarity searches. These worked quite well, and the similarity technique helped bridge a two-year gap caused by dropping the professional classification project. Indeed, it may breathe life back into that project now that we've shown how useful it is.
For comic relief we provided a third classification built upon astrological signs! We captured the descriptions of the 12 signs and used them as our primary search. Then we took the documents recovered by those searches and used their terms to find similar documents. It was great fun, and naturally enough it helped make the technique understandable to the clients. We can't wait to find out which authors are "Aries" and so on.
The improvement over the traditional searching used by the institute is quite dramatic. My partner and I find ourselves getting lost for tens of minutes tracking down papers we simply didn't know were there.
We are hoping the institute can afford to have us work on true clustering techniques such as Carrot2 uses. (Thanks to Dawid and all the Poznan University folks whose papers were so stimulating!) We did a quick LSA/SVD run on a random set of the papers to see what the performance (both CPU cost and clustering quality) would be like. Our results are encouraging, and I think the frequent-phrases approach would be best for this collection. It is quite a clustering challenge due to its extreme cross-discipline nature.
BTW: my partner uncovered an interesting solution which lets us mix the "keyword" and "text" worlds nicely. The papers use key-phrases which are entirely author-derived: one paper may use "evolution" and another "human evolution". We liked the looseness of treating them as text, but we also need to search for the exact phrases as used by the authors. A simple solution was to create a set of relations, in the RDB sense, during the indexing phase. "Evolution" might then have a keyphrase id of 22, say, which we can use for unambiguous keyphrase searching when we want. (Note that phrase quoting does not solve this: a quoted search for "evolution" will still hit on "human evolution". The relations remove that problem.)
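One way to carry that relation into the index itself is sketched below; the field names and the in-memory id map are illustrative (the 1.4-era Field.Text/Field.Keyword factories give the tokenized/untokenized split):

    import java.util.HashMap;
    import java.util.Map;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    public class KeyphraseIndexer {
        private Map phraseIds = new HashMap(); // exact keyphrase -> Integer id
        private int nextId = 1;

        // Add one author-supplied keyphrase to a document.
        public void addKeyphrase(Document doc, String phrase) {
            Integer id = (Integer) phraseIds.get(phrase);
            if (id == null) {
                id = new Integer(nextId++);
                phraseIds.put(phrase, id);
            }
            // Tokenized text keeps the looseness: a search for
            // "evolution" also hits "human evolution".
            doc.add(Field.Text("keyphrase", phrase));
            // Untokenized id: a TermQuery on keyid:22 matches only
            // documents tagged with the exact phrase "evolution".
            doc.add(Field.Keyword("keyid", id.toString()));
        }
    }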
I'd like to take this opportunity to thank everyone for all the help. Thanks!
Owen