Hello,
I have a question about index size. I am testing a program that uses Lucene to index a dataset. The final index size varies a little between runs; since I haven't finished all the tests yet, I am wondering whether it is normal for the index size to vary between different test runs.
Regards.
> dataset. Please don't compare file MD5/SHA1 hashes; the files will *not* be
> identical, because the order of documents may still vary.
>
> -
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: u...@thetaphi.de
There is a small problem in your problem formulation: Lucene doesn't count words, it counts terms, produced by an Analyzer that you define during index writing. That analyzer tokenizes a sequence of strings, and tokenizing does not simply mean splitting on the whitespace between words.
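Lucene's real analyzers are configurable chains of tokenizers and filters, but the idea that term extraction differs from a plain whitespace split can be sketched without Lucene at all. The class and method names below are made up for illustration:

```java
import java.util.ArrayList;
import java.util.List;

// Rough sketch of what an Analyzer does: produce index terms from raw text.
// Real Lucene analyzers are pluggable chains of tokenizers and filters;
// this toy version just lower-cases and splits on non-alphanumeric runs.
public class TokenSketch {
    public static List<String> tokenize(String text) {
        List<String> terms = new ArrayList<>();
        for (String raw : text.toLowerCase().split("[^\\p{L}\\p{N}]+")) {
            if (!raw.isEmpty()) {
                terms.add(raw);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        // A plain whitespace split would have kept "quick," and "FOX!" as terms.
        System.out.println(tokenize("The quick, Brown FOX!"));
    }
}
```

This is why the term count reported by the index depends on which analyzer you chose, not on a naive word count of the input.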
Hi,
I haven't reached this point myself, but it seems that Lucene has a "suggester" module that works over Lucene's index itself, which simplifies collecting terms for query suggestion. I saw something on GitHub meant to be used with JavaScript, but I can't remember the name of the project right now.
Regards.
I've remembered the name now: it is liblevenshtein, and it works with Node.js, if I am not wrong.
"Lucene suggest" works on the same algorithm, which in practice is enough for words with the same character sequence.
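liblevenshtein and Lucene's fuzzy suggesters build automata for speed, but the underlying measure is standard Levenshtein edit distance. A plain dynamic-programming version, written from scratch here for illustration:

```java
// Classic Levenshtein edit distance via dynamic programming with two rows.
// Fuzzy suggesters accept candidate terms whose distance to the typed text
// is below a threshold; automata make that lookup fast, but the distance
// itself is the same measure computed here.
public class EditDistance {
    public static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("lucene", "lucine")); // one substitution
    }
}
```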
It seems that you want to force the maximum number of segments to 1 (IndexWriter.forceMerge(1) does this).
On a previous thread someone answered that the number of segments affects the index size and is not related to index integrity (i.e. the size of the index may vary with the number of segments).
On version 4.6 there is a small issue
One thing that may affect this, and that I usually forget, is that if your object has a unique identifier (client_no), that identifier must be used in the override of the equals method and be part of the hashCode computation; otherwise, if you store the object in a collection and different routines access it, the collection will not recognize two instances with the same identifier as the same object.
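A minimal sketch of that point, using a hypothetical Client class keyed by a client_no field:

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

// Hypothetical value class keyed by a unique identifier (clientNo).
// equals and hashCode must both use the identifier; otherwise two
// instances representing the same client look distinct to a HashSet.
public class Client {
    private final String clientNo;
    private final String name;

    public Client(String clientNo, String name) {
        this.clientNo = clientNo;
        this.name = name;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof Client)) return false;
        return clientNo.equals(((Client) o).clientNo);
    }

    @Override
    public int hashCode() {
        return Objects.hash(clientNo);
    }

    public static void main(String[] args) {
        Set<Client> clients = new HashSet<>();
        clients.add(new Client("42", "Alice"));
        // Same clientNo created by a different routine: seen as the same client.
        System.out.println(clients.contains(new Client("42", "Alice (copy)")));
    }
}
```

If equals or hashCode were left at the Object defaults, the contains call above would return false and duplicates would accumulate silently.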
Hmm,
You don't have a document weight; you have a document score, computed relative to the other documents in the index during a search. In practice the document score is the sum of the weights of the matching terms with respect to the index.
You might find this presentation useful:
http://www.cs.cmu.edu/
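Lucene's actual scoring has more factors (TF-IDF in the 4.x default similarity, with norms and boosts), but the "score is a sum of term weights relative to the index" idea can be sketched with a bare TF-IDF sum. Everything here is simplified for illustration:

```java
import java.util.List;
import java.util.Map;

// Toy scoring: score(doc, query) = sum over query terms of tf * idf,
// where idf depends on how rare the term is across the whole index.
// This is why a score only makes sense relative to a particular index.
public class TfIdfSketch {
    public static double score(List<String> doc, List<String> query,
                               Map<String, Integer> docFreq, int numDocs) {
        double s = 0.0;
        for (String term : query) {
            long tf = doc.stream().filter(term::equals).count();
            int df = docFreq.getOrDefault(term, 0);
            if (tf > 0 && df > 0) {
                double idf = Math.log((double) numDocs / df);
                s += tf * idf;
            }
        }
        return s;
    }

    public static void main(String[] args) {
        Map<String, Integer> df = Map.of("lucene", 1, "index", 2);
        List<String> doc = List.of("lucene", "index", "lucene");
        // "lucene" is rarer (df=1 of 2 docs), so it contributes more than "index".
        System.out.println(score(doc, List.of("lucene", "index"), df, 2));
    }
}
```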
Hello,
That's because NRTCachingDirectory uses an in-memory cache to mimic the Directory that you used to index your files. In theory the commit is needed because you have to flush the recently added documents; otherwise those documents will not be available for search until the end of the
In fact you have both: the documents you see first are the results that match all words (AND), followed by the ORed results, which makes perfect sense. Google sometimes marks on a result which word was not found, with a strikethrough.
But it is not as powerful as logical operators on queries.
e of the terms to be ranked higher, so it merely LOOKS like the
> terms were ANDed. This gives you the best of both worlds.
>
> Using explicit operators gives you "precision", which power users will
> appreciate. Average users just get annoyed when the search engine is
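The "looks like AND" effect described above can be sketched by ranking documents by how many distinct query terms they match, so full matches naturally come first. This is a toy version, not Lucene's actual scoring (Lucene 4.x's coord() factor pushed scores in the same direction):

```java
import java.util.Comparator;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// OR semantics (any matching term qualifies a document), but documents
// matching more distinct query terms sort first, so all-terms documents
// appear at the top as if the query had been ANDed.
public class CoordSketch {
    static long matched(Set<String> doc, List<String> query) {
        return query.stream().distinct().filter(doc::contains).count();
    }

    public static List<Set<String>> rank(List<Set<String>> docs, List<String> query) {
        return docs.stream()
                .filter(d -> matched(d, query) > 0)          // OR: any match qualifies
                .sorted(Comparator.comparingLong(
                        (Set<String> d) -> matched(d, query)).reversed())
                .collect(Collectors.toList());               // more matched terms first
    }

    public static void main(String[] args) {
        List<Set<String>> docs = List.of(
                Set.of("lucene"), Set.of("lucene", "index"), Set.of("solr"));
        System.out.println(rank(docs, List.of("lucene", "index")));
    }
}
```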
You can persist the index configuration somewhere as a Serializable object: write the configuration to a file using an ObjectOutputStream, store it in a persistent mechanism such as a database (or, in the fever of the moment, a JSON store), or, like Solr, use an XML file.
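A minimal sketch of the "file using an ObjectOutputStream" option, with a made-up IndexConfiguration class standing in for whatever settings you need to keep:

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical configuration holder; any Serializable bean works the same way.
public class IndexConfiguration implements Serializable {
    private static final long serialVersionUID = 1L;
    public String analyzerName;
    public int maxSegments;

    public static void save(IndexConfiguration cfg, Path file) throws IOException {
        try (ObjectOutputStream out =
                     new ObjectOutputStream(Files.newOutputStream(file))) {
            out.writeObject(cfg);
        }
    }

    public static IndexConfiguration load(Path file)
            throws IOException, ClassNotFoundException {
        try (ObjectInputStream in =
                     new ObjectInputStream(Files.newInputStream(file))) {
            return (IndexConfiguration) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        IndexConfiguration cfg = new IndexConfiguration();
        cfg.analyzerName = "standard";
        cfg.maxSegments = 1;
        Path file = Files.createTempFile("index-config", ".ser");
        save(cfg, file);
        IndexConfiguration back = load(file);
        System.out.println(back.analyzerName + " " + back.maxSegments);
    }
}
```

Java serialization ties you to the class's binary shape, which is one reason a database, JSON, or XML representation is often the more maintainable choice.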
My suggestion is that you not worry about the docId. In practice it is an internal Lucene id, quite similar to a rowId in a database; each index may generate a different docId for the same translated document (that is the index's problem). You should use your own ID to relate one document to another across different indexes.
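The point can be sketched without Lucene: two indexes over the same documents may assign different internal positions (docIds), so only your own identifier is a stable join key. The names here are illustrative:

```java
import java.util.List;

// Two "indexes" holding the same documents in different order: the internal
// position (docId) differs per index, while the external id stays stable.
public class DocIdSketch {
    public static class Doc {
        final String externalId;
        final String body;
        public Doc(String externalId, String body) {
            this.externalId = externalId;
            this.body = body;
        }
    }

    public static int docIdOf(List<Doc> index, String externalId) {
        for (int i = 0; i < index.size(); i++) {
            if (index.get(i).externalId.equals(externalId)) return i;
        }
        return -1;
    }

    public static void main(String[] args) {
        Doc a = new Doc("doc-1", "hello");
        Doc b = new Doc("doc-2", "world");
        List<Doc> indexA = List.of(a, b);
        List<Doc> indexB = List.of(b, a); // same documents, different order
        // Positions disagree (0 vs 1); only the external id is a safe join key.
        System.out.println(docIdOf(indexA, "doc-1") + " vs " + docIdOf(indexB, "doc-1"));
    }
}
```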
Try to use the Lucene wildcard: *John*Mail*.
The analyzer determines how you want the terms segmented in your index; the query parser determines how you tokenize the terms that you want to query against the index (something like that). But Lucene lets you use the wildcard to handle the "other cases".
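Lucene's wildcard syntax treats * as any sequence of characters and ? as exactly one character. A Lucene-free sketch of that matching, translating the pattern to a regex:

```java
import java.util.regex.Pattern;

// Translate a Lucene-style wildcard pattern (* = any run, ? = one char)
// into an equivalent regex, quoting every other character literally.
public class WildcardSketch {
    public static boolean matches(String pattern, String value) {
        StringBuilder regex = new StringBuilder();
        for (char c : pattern.toCharArray()) {
            if (c == '*') regex.append(".*");
            else if (c == '?') regex.append('.');
            else regex.append(Pattern.quote(String.valueOf(c)));
        }
        return Pattern.matches(regex.toString(), value);
    }

    public static void main(String[] args) {
        System.out.println(matches("*John*Mail*", "DearJohnGMail2014"));
    }
}
```

Note that in real Lucene a leading wildcard forces a scan over the whole term dictionary, which is why patterns starting with * are slow on large indexes.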