Re: distributing the indexing process

2011-07-06 Thread Otis Gospodnetic
We've used Hadoop MapReduce with Solr to parallelize indexing for a customer, and that brought their multi-hour indexing process down to a couple of minutes.  There is/was also a Lucene-level contrib in Hadoop that makes use of MapReduce to parallelize indexing. Otis Sematext :: http://
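Otis's map/reduce-style split can be sketched without Hadoop at all: partition the documents across workers, build a partial index per shard, then merge the results. This is a hypothetical toy (term counts stand in for a real index; class and method names are mine, not the Hadoop contrib's):

```java
import java.util.*;
import java.util.concurrent.*;

public class ParallelIndexSketch {
    // "Map" phase: each worker builds a partial term->count index for its shard.
    static Map<String, Integer> indexShard(List<String> docs) {
        Map<String, Integer> partial = new HashMap<>();
        for (String doc : docs)
            for (String term : doc.toLowerCase().split("\\s+"))
                partial.merge(term, 1, Integer::sum);
        return partial;
    }

    // "Reduce" phase: merge the partial indexes into one.
    static Map<String, Integer> merge(List<Map<String, Integer>> partials) {
        Map<String, Integer> merged = new HashMap<>();
        for (Map<String, Integer> p : partials)
            p.forEach((t, c) -> merged.merge(t, c, Integer::sum));
        return merged;
    }

    public static void main(String[] args) throws Exception {
        List<String> docs = Arrays.asList("lucene in action", "hadoop map reduce", "lucene index");
        int shards = 2;
        ExecutorService pool = Executors.newFixedThreadPool(shards);
        List<Future<Map<String, Integer>>> futures = new ArrayList<>();
        for (int i = 0; i < shards; i++) {
            // Round-robin the documents across shards.
            List<String> slice = new ArrayList<>();
            for (int d = i; d < docs.size(); d += shards) slice.add(docs.get(d));
            futures.add(pool.submit(() -> indexShard(slice)));
        }
        List<Map<String, Integer>> partials = new ArrayList<>();
        for (Future<Map<String, Integer>> f : futures) partials.add(f.get());
        pool.shutdown();
        System.out.println(merge(partials).get("lucene")); // 2
    }
}
```

The speedup Otis describes comes from the map phase: each shard indexes independently, so wall-clock time drops roughly with the worker count until the merge dominates.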

Re: Extracting span terms using WeightedSpanTermExtractor

2011-07-06 Thread Mark Miller
Sorry - kind of my fault. When I fixed this to use maxDocCharsToAnalyze, I didn't set a default other than 0 because I didn't really count on this being used beyond how it is in the Highlighter - which always sets maxDocCharsToAnalyze with its default. You've got to explicitly set it higher t

Re: Extracting span terms using WeightedSpanTermExtractor

2011-07-06 Thread Michael Sokolov
I tried something similar, and failed - I think the API is lacking there? My only advice is to vote for this: https://issues.apache.org/jira/browse/LUCENE-2878 which should provide an alternative better API, but it's not near completion. -Mike On 7/6/2011 5:34 PM, Jahangir Anwari wrote: I h

Extracting span terms using WeightedSpanTermExtractor

2011-07-06 Thread Jahangir Anwari
I have a CustomHighlighter that extends the SolrHighlighter and overrides the doHighlighting() method. Then for each document I am trying to extract the span terms so that later I can use it to get the span Positions. I tried to get the weightedSpanTerms using WeightedSpanTermExtractor but was unsu

Re: Autocompletion on large index

2011-07-06 Thread Elmer
I just profiled the application and tst.TernaryTreeNode takes 99.99..% of the memory. I'll test further tomorrow and report on mem usage for runnable smaller indexes. I will email you privately for sharing the index to work with. BR, Elmer -Oorspronkelijk bericht- From: Michael McC
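The per-node overhead that makes a ternary search tree heavy on the heap can be sketched with a toy TST (class and method names here are hypothetical, and the ~40 bytes/node figure is a rough 64-bit JVM estimate, not a measurement of Lucene's tst.TernaryTreeNode):

```java
public class TstSizeSketch {
    // One node per character along each distinct path through the tree.
    static class Node { char c; Node lo, eq, hi; boolean end; }
    static int nodes = 0;

    static Node insert(Node n, String s, int i) {
        char c = s.charAt(i);
        if (n == null) { n = new Node(); n.c = c; nodes++; }
        if (c < n.c) n.lo = insert(n.lo, s, i);
        else if (c > n.c) n.hi = insert(n.hi, s, i);
        else if (i + 1 < s.length()) n.eq = insert(n.eq, s, i + 1);
        else n.end = true;
        return n;
    }

    static int countNodes(String[] words) {
        nodes = 0;
        Node root = null;
        for (String w : words) root = insert(root, w, 0);
        return nodes;
    }

    public static void main(String[] args) {
        int n = countNodes(new String[] {"lucene", "lucid", "luke"});
        // Each node carries a char plus three references; with JVM object
        // header and alignment, figure very roughly ~40 bytes per node.
        System.out.println(n + " nodes, ~" + n * 40 + " bytes"); // 10 nodes, ~400 bytes
    }
}
```

At tens of bytes per character node, millions of titles multiply out quickly, which is consistent with TernaryTreeNode dominating the profile.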

Re: Autocompletion on large index

2011-07-06 Thread Michael McCandless
Hmm... so I suspect the fst suggest module must first gather up all titles, then sort them, in RAM, and then build the actual FST. Maybe it's this gather + sort that's taking so much RAM? 1.3 M publications times 100 chars times 2 bytes/char = ~248 MB. So that shouldn't be it... Is this an ac
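Mike's back-of-the-envelope arithmetic can be checked directly (hypothetical class name; the 1.3 M titles and 100 chars/title figures are the thread's own assumptions):

```java
public class SuggestRamEstimate {
    // Rough upper bound on the raw character data held in RAM while the
    // suggester gathers and sorts all titles before building the FST.
    static long rawBytes(long titles, long charsPerTitle) {
        return titles * charsPerTitle * 2; // a Java char is 2 bytes (UTF-16)
    }

    public static void main(String[] args) {
        long bytes = rawBytes(1_300_000L, 100);
        System.out.println("~" + bytes / (1024 * 1024) + " MB"); // ~247 MB
    }
}
```

A quarter gigabyte for the gather + sort buffer alone would already strain a small heap, even before sort overhead and the FST build itself.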

Re: Autocompletion on large index

2011-07-06 Thread Elmer
You could try storing your autocomplete index in a RAMDirectory? I forgot to mention. I tried this previously, but that also resulted in heap space problems. That's why I was interested in using the new suggest classes :) BR, Elmer -Oorspronkelijk bericht- From: Michael McCandless

Re: Autocompletion on large index

2011-07-06 Thread Elmer
Hi Mike, That's what I thought when I started indexing it. To be clear, it happens at build time. I don't know if memory efficiency is better once building has finished. The titles I index are titles from the dblp computer science bibliography. They can take up to... say 100 characters. Examp

Re: Autocompletion on large index

2011-07-06 Thread Michael McCandless
You could try storing your autocomplete index in a RAMDirectory? But: I'm surprised you see the FST suggest impl using up so much RAM; very low memory usage is one of the strengths of the FST approach. Can you share the text (titles) you are feeding to the suggest module? Mike McCandless http://

Autocompletion on large index

2011-07-06 Thread Elmer
Hi again. I have created my own autocompleter based on the spellchecker. This works well in a sense that it is able to create an auto completion index from my 'publication' index. However, integrated in my web application, each keypress asks autocompleter to search the index, which is stored on di

# as a special character?

2011-07-06 Thread Aradon Strider
Hello, First off, I am using the QueryParser with the StandardAnalyzer. It seems that whenever I search for the # symbol, nothing is found. This wouldn't be a problem, but the documents I am searching contain C#, which needs to be searchable. I have tried escaping the # symbol but when I d
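As a rough illustration of why escaping doesn't help here — this is not Lucene's StandardAnalyzer, just a crude letter/digit tokenizer with a similar effect — punctuation like '#' is discarded at analysis time, so "C#" and "c" end up as the same token regardless of query-syntax escaping:

```java
public class HashTokenSketch {
    // Crude stand-in for an analyzer that keeps only letters and digits:
    // '#' is stripped during tokenization, so "C#" indexes as just "c".
    static String tokenize(String text) {
        return text.toLowerCase().replaceAll("[^\\p{L}\\p{N}]+", " ").trim();
    }

    public static void main(String[] args) {
        System.out.println(tokenize("I code in C#")); // i code in c
    }
}
```

Because the symbol is gone before the query ever matches terms, the usual fix is to analyze that field with an analyzer that preserves '#' (e.g. whitespace-based or a custom tokenizer), at both index and query time.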

name matching / mapping

2011-07-06 Thread Thomas Rewig
Hello, until now we have used a naive %like% SQL query script to assign the following terms for Id/Item mapping in different id-spaces: john wayne == john wayne wayne, john == john wayne I can imagine that Lucene offers many more possibilities for this assignment. Maybe with Lucene it is also pos
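Before reaching for fuzzy matching, the two example forms in the post can already be unified with plain token normalization: lowercase, strip punctuation, sort the tokens into a canonical key. A hypothetical sketch:

```java
import java.util.*;

public class NameNormalizer {
    // Normalize "wayne, john" and "john wayne" to one canonical key by
    // lowercasing, stripping punctuation, and sorting the tokens.
    static String canonical(String name) {
        String[] tokens = name.toLowerCase()
                .replaceAll("[^\\p{L}\\s]", " ")
                .trim()
                .split("\\s+");
        Arrays.sort(tokens);
        return String.join(" ", tokens);
    }

    public static void main(String[] args) {
        System.out.println(canonical("wayne, john")); // john wayne
        System.out.println(canonical("john wayne"));  // john wayne
    }
}
```

Lucene then adds what this can't: tokenized fields, fuzzy and phonetic matching for misspellings, and scoring to rank candidate matches instead of a binary %like% hit.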

[ANN] Luke 3.3.0 released.

2011-07-06 Thread Andrzej Bialecki
Hi all, Luke 3.3.0 has been released and is available for download here: http://code.google.com/p/luke/ Apart from the updated Lucene libraries there were no changes in functionality. -- Best regards, Andrzej Bialecki

RE: how are built the packages in the maven repository?

2011-07-06 Thread Steven A Rowe
Ant is the official Lucene/Solr build system. Snapshot and release artifacts are produced with Ant. While Maven is capable of producing artifacts, the artifacts produced in this way may not be the same as the official Ant artifacts. For this reason: no, the artifacts should not be built with

Re: Index statistics

2011-07-06 Thread Andres Taylor
Thanks. It was what I expected, but it's nice to have it confirmed. On Tue, Jul 5, 2011 at 9:39 PM, Michael McCandless < luc...@mikemccandless.com> wrote: > This API doesn't exist today. > > Lucene has long needed for query impls to do this, so that we can > properly plan/optimize how the query

how are built the packages in the maven repository?

2011-07-06 Thread jedim
Hi I'm looking inside the jenkins maven repository. For example the package in https://builds.apache.org/job/Lucene-Solr-Maven-trunk/lastSuccessfulBuild/artifact/maven_artifacts/org/apache/lucene/lucene-misc/4.0-SNAPSHOT/lucene-misc-4.0-20110705.223250-1.jar seems to be built with ant instea

Re: deleting 8,000,000 indexes takes forever!!!! any solution to this...

2011-07-06 Thread Toke Eskildsen
On Tue, 2011-07-05 at 17:50 +0200, Hiller, Dean x66079 wrote: > We are using a sort of nosql environment and deleting 200 gig on one machine > from the database is fast, but then we go and delete 5 gigs of indexes that > were created and it takes forever. 8 million indexes is at a minimum 16
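The thread is cut off, but for the underlying task — removing millions of small index files, where the cost is per-file filesystem metadata work rather than data volume — the usual approach is one bottom-up walk that deletes files and then their directories. A sketch with java.nio (hypothetical class name):

```java
import java.io.IOException;
import java.nio.file.*;
import java.nio.file.attribute.BasicFileAttributes;

public class TreeDelete {
    // Delete a directory tree in a single bottom-up walk: each file is
    // removed as it is visited, and each directory after its contents.
    static void deleteTree(Path root) throws IOException {
        Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult visitFile(Path f, BasicFileAttributes a) throws IOException {
                Files.delete(f);
                return FileVisitResult.CONTINUE;
            }
            @Override
            public FileVisitResult postVisitDirectory(Path d, IOException e) throws IOException {
                Files.delete(d);
                return FileVisitResult.CONTINUE;
            }
        });
    }

    public static void main(String[] args) throws IOException {
        Path root = Files.createTempDirectory("idx");
        Files.createFile(root.resolve("segment_1"));
        deleteTree(root);
        System.out.println(Files.exists(root)); // false
    }
}
```

Even so, with millions of files the per-delete syscall cost dominates; keeping each index in its own directory (so the whole tree can be dropped at once) or consolidating into fewer, larger indexes avoids the problem at its source.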