Please add me as a wiki editor

2013-06-09 Thread Lance Norskog
I'm responsible for the OpenNLP wiki page: https://wiki.apache.org/solr/OpenNLP Please add me to the list of editors.

RE: Build your own Lucene finite state transducer

2013-06-09 Thread Doug Turnbull
Awesome work Mike! Kudos!

Build your own Lucene finite state transducer

2013-06-09 Thread Michael McCandless
For those of you curious about Lucene's finite state transducers (FSTs)... I just built a simple web app that lets you enter input/output pairs and see the resulting FST. It's running here: http://examples.mikemccandless.com/fst.py And here's a quick blog post showing some examples/details:

Re: setSegmentsPerTier >= setMaxMergeAtOnce ?

2013-06-09 Thread Michael McCandless
Hi Boaz, That's correct! But what counts as "too big" a merge is an app-level decision / requires testing in the "real" context / depends on things like how much free RAM the OS can dedicate to bytes read-ahead, whether you have an SSD, whether you throttle merge rate (RateLimitedDirectoryWrapper), etc.

Re: setSegmentsPerTier >= setMaxMergeAtOnce ?

2013-06-09 Thread Boaz Leskes
Hi Mike, Thanks for the quick answer. So if I understand correctly, collapsing tiers in one go leads to too many big merges. The goal is then to avoid merges that are too big, which will happen if we allow complete tiers to be collapsed in one merge. We would rather have a tier collapsed partially (and thus more

Re: setSegmentsPerTier >= setMaxMergeAtOnce ?

2013-06-09 Thread Michael McCandless
The two settings let you decouple your tolerance for how many segments are allowed to accumulate (setSegmentsPerTier), from how large a single merge can be (setMaxMergeAtOnce). E.g. say setSegmentsPerTier is 20 and setMaxMergeAtOnce is 10. The 20 gives TMP a "generous" budget to allow up to 20 se
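
The relationship between the two settings in the example above (20 and 10) can be sketched as follows. This is a minimal illustration, not Lucene code: `MergeSettings` and `followsNote` are hypothetical names invented here, and the class only mirrors the two values so the "segments per tier should be at least max merge at once" recommendation can be checked. In a real application the values would be passed to the `setSegmentsPerTier` and `setMaxMergeAtOnce` setters named in the thread.

```java
// Hypothetical helper, NOT part of Lucene: it only holds the two values
// discussed in the thread so their relationship can be checked.
final class MergeSettings {
    // Tolerance for how many segments are allowed to accumulate.
    final double segmentsPerTier;
    // How large a single merge can be, counted in segments.
    final int maxMergeAtOnce;

    MergeSettings(double segmentsPerTier, int maxMergeAtOnce) {
        this.segmentsPerTier = segmentsPerTier;
        this.maxMergeAtOnce = maxMergeAtOnce;
    }

    // True when segmentsPerTier >= maxMergeAtOnce, the relationship
    // the note on setSegmentsPerTier recommends.
    boolean followsNote() {
        return segmentsPerTier >= maxMergeAtOnce;
    }

    public static void main(String[] args) {
        // Values taken from the example in the message above.
        MergeSettings example = new MergeSettings(20.0, 10);
        System.out.println("follows the note: " + example.followsNote());
    }
}
```

Keeping the two values separate reflects the point of the message: one expresses how many segments may pile up, the other caps the size of any single merge.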

How to get the most frequent words for a set of documents in Lucene?

2013-06-09 Thread Gucko Gucko
Hello all, I'm trying to cluster documents that were indexed using Lucene 4.3. The result of the clustering algorithm is a set of clusters where each cluster contains the most similar documents (I only store their docIDs in each cluster). What I want is to get the most frequent words for each clu

setSegmentsPerTier >= setMaxMergeAtOnce ?

2013-06-09 Thread Boaz Leskes
Hi All, I recently looked at the settings for the TieredMergePolicy [1] and was puzzled by the note on the setSegmentsPerTier method indicating it should be equal to or larger than the MaxMergeAtOnce setting, in order to not cause too many merges. I understood segments per tier to indicate the goal