Hi Thomas, (and Mike, for questions)

On Thu, 27 Oct 2011, Thomas Koch wrote:

while I was playing with the SynonymAnalyzer stuff (pylucene-3.4 samples) I
discovered that the wordnet example is broken due to an outdated wordnet
database: The SynonymAnalyzerTest works fine, but the SynonymAnalyzerViewer
fails with:
...lucene.JavaError: org.apache.lucene.index.IndexFormatTooOldException:
Format version is not supported in file 'segments': 44132 (needs to be
between -1 and -11). This version of Lucene only supports indexes created
with release 3.0 and later.

The WordNetSynonymEngine uses an index contained in the indexes.tgz file
which is looked up in indexes\wordnet - this index (dated 2004) seems to be
in an old Lucene index format. I managed to find the files required to build
the index for lucene-3.4, adjusted the WordNetSynonymEngine to work with
lucene 3.4 and all seems to be working again. I've created an archive with
the relevant changes and uploaded it to the pylucene-extra project - just
in case anyone is interested:
http://code.google.com/a/apache-extras.org/p/pylucene-extra/downloads/list

BTW, who is maintaining/updating the samples that are included in the
distribution?

Most PyLucene samples are ports of the samples from the first edition of the Lucene in Action book. I ported them all and I'm the maintainer. If you find bugs, patches are of course welcome.

It looks like the wordnet index file is coming directly from the Lucene in Action downloadable sample. Given that this is now seven years old, something like that was bound to happen.

It looks like the second edition of the book has samples that were written for Lucene 3.0.2:
  http://www.manning.com/hatcher3/
  http://www.manning.com/hatcher3/LIAsourcecode.zip

So, I downloaded the new version of the samples, hoping to find a new version of the wordnet index. But first, following instructions in README, running 'ant test' in the lia2e directory fails with:
    Testcase: testWriteLock(lia.indexing.LockTest):     Caused an ERROR
    [junit] Unknown format version: -11
during the CreateTestIndex step. Mike, what could that be? (I'm running Lucene 3.4.0.)

Ignoring this failure, 'ant SynonymAnalyzerViewer' runs fine. The new version doesn't seem to use the wordnet index anymore. Yet the code that would use it is still there, commented out, so I'm wondering what the intent was.

But since you did the work, Thomas, I followed your instructions and rebuilt the wordnet index used by this sample in the earlier version and refreshed the indexes.tar.gz archive with the new wordnet one built from Lucene 3.4.0. The other two indexes in there, t9 and distributed, most likely suffer from the same problem but I didn't check.
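For anyone who wants to check the t9 and distributed indexes without opening them through Lucene, here is a minimal sketch. It assumes that a segments file starts with a 4-byte big-endian signed integer holding the format version, and that the supported range is the one quoted in the error message above (-1 down to -11). The helper names and the decision to skip segments.gen are mine, not anything from Lucene or PyLucene.

```python
import os
import struct

# Range taken from the IndexFormatTooOldException message quoted above;
# treated here as an assumption about what Lucene 3.4 accepts.
NEWEST_FORMAT = -11
OLDEST_FORMAT = -1


def read_segments_format(path):
    """Return the leading 4-byte big-endian signed int of a segments file."""
    with open(path, "rb") as f:
        header = f.read(4)
    if len(header) < 4:
        raise ValueError("%s is too short to hold a format header" % path)
    return struct.unpack(">i", header)[0]


def is_supported(fmt):
    """True if fmt falls inside the assumed supported range."""
    return NEWEST_FORMAT <= fmt <= OLDEST_FORMAT


def check_index_dir(index_dir):
    """Map each segments* file in index_dir to a (format, supported) tuple."""
    report = {}
    for name in sorted(os.listdir(index_dir)):
        # segments.gen is a separate bookkeeping file, so it is skipped here.
        if name.startswith("segments") and name != "segments.gen":
            fmt = read_segments_format(os.path.join(index_dir, name))
            report[name] = (fmt, is_supported(fmt))
    return report


if __name__ == "__main__":
    import sys
    for index_dir in sys.argv[1:]:
        for name, (fmt, ok) in check_index_dir(index_dir).items():
            status = "ok" if ok else "unsupported"
            print("%s/%s: format %d, %s" % (index_dir, name, fmt, status))
```

Saved under a name of your choice and run with the unpacked index directories as arguments, this prints one line per segments file. A value such as the 44132 seen above would be reported as unsupported.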

SynonymAnalyzerViewer.py is now run as part of the 'make test' suite.

This is checked into rev 1189735 of branch_3x.

Many thanks!

Andi..

It should be noted that the SynonymAnalyzer examples are based on the LIA
book and implement their own synonym support, whereas java-lucene-3.4
already ships synonym support in the package
org.apache.lucene.analysis.synonym (in contrib).

see CHANGELOG
LUCENE-3233, LUCENE-3375: Added SynonymFilter for applying multi-word
synonyms during indexing or querying (with parsers for wordnet and solr
formats). Removed contrib/wordnet.

It's already included in the PyLucene core: lucene.SynonymFilter - however I
couldn't find any samples / tests for this new feature - will have to play
with this one as well... Let me know if anyone has experience with the
new lucene.SynonymFilter and its possible advantages over the Python-based
implementation (in
pylucene-3.4\samples\LuceneInAction\lia\analysis\synonym).


regards
Thomas
--
OrbiTeam Software GmbH & Co. KG
Endenicher Allee 35
53121 Bonn - Germany
http://www.orbiteam.de


