Index size for Same DataSet.

2014-03-24 Thread Jose Carlos Canova
Hello, I have a question about index size. I am testing a program that uses Lucene to index a dataset. The resulting index size varies a little between runs. I haven't finished all the tests yet, so I would like to know whether it is normal for the index size to vary between different tests on the same dataset. Regards.

Re: Index size for Same DataSet.

2014-03-25 Thread Jose Carlos Canova
dataset. Please don't compare file MD5/SHA1, the files will *not* be identical, because order of documents may still vary. - Uwe Schindler, H.-H.-Meier-Allee 63, D-28213 Bremen, http://www.thetaphi.de, eMail: u...@thetaphi.de -Original

Re: Apache Lucene 4.x word counting

2014-03-28 Thread Jose Carlos Canova
There is a small problem in how your question is formulated with respect to Lucene: Lucene doesn't count words. It counts terms, based on the Analyzer that you defined at indexing time (when writing with the IndexWriter). That analyzer tokenizes (which does not necessarily mean splitting on the white space between the words) a sequence of strings
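To illustrate why "words" and "terms" can differ, here is a minimal sketch in plain Java. It does not use Lucene; the two methods are hypothetical stand-ins for two different analyzers, and the class and method names are assumptions made for this example only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustration only: two toy "analyzers" that tokenize the same text differently.
// Neither is a Lucene class; they only show that the term count depends on the tokenizer.
final class TermCountSketch {

    /** Toy analyzer 1: split on runs of white space only. */
    static List<String> whitespaceTerms(String text) {
        List<String> terms = new ArrayList<>();
        for (String piece : text.trim().split("\\s+")) {
            if (!piece.isEmpty()) {
                terms.add(piece);
            }
        }
        return terms;
    }

    /** Toy analyzer 2: lowercase, then split on anything that is not a letter or digit. */
    static List<String> letterDigitTerms(String text) {
        List<String> terms = new ArrayList<>();
        for (String piece : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!piece.isEmpty()) {
                terms.add(piece);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        String text = "State-of-the-art search: e-mail me!";
        // Same input, different term lists and therefore different counts.
        System.out.println(whitespaceTerms(text));
        System.out.println(letterDigitTerms(text));
    }
}
```

The same sentence yields a different number of terms under each tokenizer, so any "word count" read from an index really reflects the analyzer that produced the terms.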

Re: Lucene Suggest Phrase

2014-04-01 Thread Jose Carlos Canova
Hi, I haven't reached this point yet, but it seems that Lucene has a "suggester" project that works over the Lucene index itself, which simplifies collecting terms for query suggestion. I saw something on GitHub meant to be used with JavaScript, but I can't remember the name of the project right now. Regards. On Tue

Re: Lucene Suggest Phrase

2014-04-01 Thread Jose Carlos Canova
I remember now: the name is /liblevenshtein, and it works with Node.js if I am not mistaken. "Lucene suggest" works on the same algorithm, which in practice is enough for words with the same "character sequence". On Tue, Apr 1, 2014 at 1:30 PM, Jose Carlos Canova < jose
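For readers unfamiliar with the algorithm the project name refers to, here is a minimal sketch of the classic Levenshtein edit distance (the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another). This is the textbook dynamic-programming version, written for illustration; it is not the code of liblevenshtein or of Lucene's suggesters, and the class name is an assumption.

```java
// Textbook Levenshtein distance using two rolling rows of the DP table.
final class EditDistanceSketch {

    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];

        // Distance from the empty prefix of a to each prefix of b: j insertions.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }

        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // i deletions to reach the empty prefix of b
            for (int j = 1; j <= b.length(); j++) {
                int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                int insertion = curr[j - 1] + 1;
                int deletion = prev[j] + 1;
                int substitution = prev[j - 1] + cost;
                curr[j] = Math.min(Math.min(insertion, deletion), substitution);
            }
            // Reuse the arrays: the row just computed becomes the previous row.
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("kitten", "sitting"));
        System.out.println(levenshtein("lucene", "lucine"));
    }
}
```

A suggester can treat candidates within a small distance of the typed text as likely corrections, which is why this measure works well for words that share most of their character sequence.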

Re: background merge hit exception

2014-04-05 Thread Jose Carlos Canova
It seems that you want to force the maximum number of segments to 1. On a previous thread someone answered that the number of segments affects the index size and is not related to index integrity (that is, the size of the index may vary according to the number of segments). In version 4.6 there is a small issue o

Re: background merge hit exception

2014-04-08 Thread Jose Carlos Canova
throw new OutOfMemoryError(e.toString()); } } } } } } else { FileInpu

Re: NRT facet issue (bug?), hard to reproduce, please advise

2014-04-12 Thread Jose Carlos Canova
One thing that may affect this, and that I usually forget, is that if your object has a unique identifier (client_no), that identifier must be used in the overridden "equals" method and be part of the hashCode computation; otherwise, if you store this object in a collection and different routines ac
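A minimal sketch of that advice, assuming a hypothetical Client class whose identity is its client number (the class, its fields, and the sample values are invented for illustration):

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical domain object: two instances are "the same client" when clientNo matches.
final class Client {
    private final long clientNo; // unique identifier (client_no)
    private final String name;   // descriptive data, not part of identity

    Client(long clientNo, String name) {
        this.clientNo = clientNo;
        this.name = name;
    }

    long clientNo() {
        return clientNo;
    }

    String name() {
        return name;
    }

    @Override
    public boolean equals(Object other) {
        if (this == other) {
            return true;
        }
        if (!(other instanceof Client)) {
            return false;
        }
        return clientNo == ((Client) other).clientNo;
    }

    @Override
    public int hashCode() {
        // Must be derived from the same field(s) that equals() compares.
        return Long.hashCode(clientNo);
    }

    public static void main(String[] args) {
        Set<Client> clients = new HashSet<>();
        clients.add(new Client(42L, "loaded by routine A"));
        clients.add(new Client(42L, "loaded by routine B"));
        // Both instances describe client 42, so the set keeps a single entry.
        System.out.println(clients.size());
    }
}
```

Without these overrides, hash-based collections fall back to object identity, and two routines that each load the same client would end up with entries that never match each other.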

Re: Multiply instead of summing two scores

2014-04-12 Thread Jose Carlos Canova
Hmm, you don't have a document weight; you have a document score relative to the other documents in the index during a search. In practice, the document weight will be the sum of the weights of the terms in relation to an index. You might find this presentation useful. http://www.cs.cmu.edu/

Re: make data search as index progress.

2014-04-14 Thread Jose Carlos Canova
Hello, that's because NRTCachingDirectory uses an in-memory cache to "mimic in memory the Directory that you used to index your files". In theory the commit is needed because you need to flush the recently added documents; otherwise these documents will not be available for search until the end of th

Re: make data search as index progress.

2014-04-15 Thread Jose Carlos Canova
e what went wrong. /Jason On Mon, Apr 14, 2014 at 9:01 PM, Jose Carlos Canova <jose.carlos.can...@gmail.com> wrote: > Hello, > That's because NRTCachingDirectory uses a in cache memory to "

Re: is there a historical reason why default conjunction operator is "OR"?

2014-04-16 Thread Jose Carlos Canova
In fact you have both: the documents you see at first glance are first the results with all words (AND), then the ORed results, which makes perfect sense. Google sometimes marks in the result which word was not found with a "strike through". But it is not as powerful as logical operators on qu

Re: is there a historical reason why default conjunction operator is "OR"?

2014-04-16 Thread Jose Carlos Canova
e of the terms to be ranked higher, so it merely LOOKS like the terms were ANDed. This gives you the best of both worlds. Using explicit operators gives you "precision", which power users will appreciate. Average users just get annoyed when the search engine is

Re: Getting IndexWriterConfig details for a closed index

2014-04-22 Thread Jose Carlos Canova
You can persist the index configuration somewhere yourself: use a Serializable object and write the configuration to a file using an ObjectOutputStream, store it in a persistent mechanism such as a database or, the fashion of the moment, a JSON store, or, like Solr, use an XML file. I
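A minimal sketch of the first option (a Serializable object written with an ObjectOutputStream). IndexSettings is a hypothetical holder class invented for this example, and its fields (analyzerName, ramBufferSizeMb) are assumptions about what one might want to remember; it is not a Lucene class, and the values would have to be copied to and from the real writer configuration by hand.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

// Hypothetical snapshot of the settings used to build an index.
final class IndexSettings implements Serializable {
    private static final long serialVersionUID = 1L;

    final String analyzerName;    // e.g. the class name of the analyzer that was used
    final double ramBufferSizeMb; // e.g. the RAM buffer size that was configured

    IndexSettings(String analyzerName, double ramBufferSizeMb) {
        this.analyzerName = analyzerName;
        this.ramBufferSizeMb = ramBufferSizeMb;
    }

    /** Writes the settings next to the index (or anywhere else) as a serialized object. */
    static void save(IndexSettings settings, File file) {
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(file))) {
            out.writeObject(settings);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads the settings back, for example before reopening a closed index. */
    static IndexSettings load(File file) {
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(file))) {
            return (IndexSettings) in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException("Settings class not on the classpath", e);
        }
    }

    public static void main(String[] args) {
        File file = new File(System.getProperty("java.io.tmpdir"), "index-settings.ser");
        save(new IndexSettings("StandardAnalyzer", 16.0), file);
        IndexSettings restored = load(file);
        System.out.println(restored.analyzerName + " / " + restored.ramBufferSizeMb);
    }
}
```

Java serialization is compact to write but tied to the class layout; a JSON or XML file, as mentioned above, is easier to inspect and to keep compatible across versions.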

Re: Fields, Index segments and docIds (second Try)

2014-04-29 Thread Jose Carlos Canova
My suggestion is that you not worry about the docId. In practice it is an internal Lucene id, quite similar to a rowId in a database; each index may generate a different docId (that is its own concern) for a translated document. You may use your own ID that relates one document to another on different

Re: How to locate a Phrase inside text (like a Browser text searcher)

2014-05-11 Thread Jose Carlos Canova
Try using the Lucene wildcard: *John*Mail*. The analyzer is just how you want the terms segmented in your index. The query parser is how you tokenize the terms that you want to query against the index (something like that). But Lucene allows you to use the wildcard to handle "other cases" th
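To make the pattern semantics concrete, here is a minimal plain-Java sketch in which `*` matches any run of characters (including none) and `?` matches exactly one character. It is a stand-in written for illustration, not Lucene's wildcard implementation: it ignores escaping, and it matches against a whole string, whereas Lucene applies a wildcard to individual indexed terms. The class and method names are assumptions.

```java
import java.util.regex.Pattern;

// Illustration of "*" and "?" wildcard semantics by translating the pattern to a regex.
final class WildcardSketch {

    static Pattern toRegex(String wildcard) {
        StringBuilder regex = new StringBuilder();
        StringBuilder literal = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '*' || c == '?') {
                // Flush pending literal text, quoted so regex metacharacters stay literal.
                if (literal.length() > 0) {
                    regex.append(Pattern.quote(literal.toString()));
                    literal.setLength(0);
                }
                regex.append(c == '*' ? ".*" : ".");
            } else {
                literal.append(c);
            }
        }
        if (literal.length() > 0) {
            regex.append(Pattern.quote(literal.toString()));
        }
        return Pattern.compile(regex.toString(), Pattern.DOTALL);
    }

    /** True when the whole text matches the wildcard pattern (case-sensitive). */
    static boolean matches(String wildcard, String text) {
        return toRegex(wildcard).matcher(text).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches("*John*Mail*", "Dear John, check your Mail today"));
        System.out.println(matches("*John*Mail*", "Mail from John"));
    }
}
```

Note that the pattern is order-sensitive: "John" must appear before "Mail" for a match.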