ICUTokenizer and CJK

2010-11-22 Thread Burton-West, Tom
Hi all, I see in the javadoc for the ICUTokenizer that it has special handling for Lao, Myanmar, and Khmer word breaking, but no details in the javadoc about what it does with CJK, which for C and J appears to be breaking into unigrams. Is this correct? Tom
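For anyone who wants to verify the behavior directly, here is a minimal sketch that prints the tokens ICUTokenizer emits for a Chinese string (the no-arg constructor and setReader are from recent Lucene releases; 3.x-era code passed the Reader to the constructor, and the sample sentence is arbitrary):

    import java.io.StringReader;
    import org.apache.lucene.analysis.icu.segmentation.ICUTokenizer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public class ICUTokenizerCJKDemo {
        public static void main(String[] args) throws Exception {
            ICUTokenizer tok = new ICUTokenizer();
            tok.setReader(new StringReader("我是中国人"));
            CharTermAttribute term = tok.addAttribute(CharTermAttribute.class);
            tok.reset();
            while (tok.incrementToken()) {
                // with the ICU of that era, Han text came out one character per
                // token (unigrams); newer ICU versions instead use dictionary-based
                // segmentation for Chinese and Japanese
                System.out.println(term.toString());
            }
            tok.end();
            tok.close();
        }
    }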

Re: best practice: 1.4 billions documents

2010-11-22 Thread Luca Rondanini
eheheheh, 1.4 billion documents = 1,400,000,000 documents for almost 2T = 2 terabytes = 2000 gigabytes on HD! On Mon, Nov 22, 2010 at 10:16 AM, wrote: > > of course I will distribute my index over many machines: > > store everything on > > one computer is just crazy, 1.4B docs is going to b

RE: best practice: 1.4 billions documents

2010-11-22 Thread spring
> of course I will distribute my index over many machines: > store everything on > one computer is just crazy, 1.4B docs is going to be an index > of almost 2T > (in my case) Note that billion = giga (10^9) in English, while "Billion" = tera (10^12) in many other languages, so 2T docs would be 2,000,000,000,000 docs... ;) AFAIK 2^31 - 1 docs is
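The ceiling spring is alluding to comes from Lucene addressing documents with a Java int, so a single index tops out at Integer.MAX_VALUE documents; a one-liner confirms the number the thread keeps calling the "2.1 B marker":

    public class DocLimit {
        public static void main(String[] args) {
            // Lucene document numbers are Java ints, so one index can hold
            // at most 2^31 - 1 documents
            System.out.println(Integer.MAX_VALUE); // 2147483647, i.e. ~2.1 billion
        }
    }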

Re: best practice: 1.4 billions documents

2010-11-22 Thread Luca Rondanini
Thank you all, I really got some good hints! Of course I will distribute my index over many machines: storing everything on one computer is just crazy; 1.4B docs is going to be an index of almost 2T (in my case). The best solution for me at the moment (from your suggestions) seems to be to identify a crit

RE: best practice: 1.4 billions documents

2010-11-22 Thread Uwe Schindler
Hi Yonik, can we do the same for Lucene? The problem is combining the rewritten queries using the broken method in Query. As far as I know, the problem is that e.g. MTQs (multi-term queries) rewrite *per searcher*, so each searcher uses a different rewritten query (with different terms). So the scores are totally diff

Re: best practice: 1.4 billions documents

2010-11-22 Thread Yonik Seeley
On Mon, Nov 22, 2010 at 12:17 PM, Uwe Schindler wrote: > The latest discussion was more about MultiReader vs. MultiSearcher. > > But you are right, 1.4 B documents is not easy to handle, especially when your > index grows and you hit the 2.1 B mark; then no MultiSearcher or > whatever helps. > > O

RE: best practice: 1.4 billions documents

2010-11-22 Thread Uwe Schindler
The latest discussion was more about MultiReader vs. MultiSearcher. But you are right, 1.4 B documents is not easy to handle, especially when your index grows and you hit the 2.1 B mark; then no MultiSearcher or whatever helps. On the other hand, even distributed Solr has the same problems as Mu

Re: best practice: 1.4 billions documents

2010-11-22 Thread eks dev
Am I the only one who thinks this is not the way to go? MultiReader (or MultiSearcher) is not going to fix your problems. Having 1.4B documents on one machine is a big number, no matter how you partition them (unless you have some really expensive hardware at your disposal). Did I miss the poin

RE: best practice: 1.4 billions documents

2010-11-22 Thread Uwe Schindler
A local multithreaded search can be done in another way even for a single index, but not using the impl of (Parallel)MultiSearcher. This may be a new class directly extending IndexSearcher, which may even do parallel search on e.g. different segments (because searching a MultiReader is no longer
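What Uwe sketches here is what Lucene later shipped: from 3.1 on, IndexSearcher takes an optional ExecutorService and searches each segment in its own task. A minimal sketch against that 3.1+ API (the index path is a placeholder):

    import java.io.File;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.store.FSDirectory;

    public class ParallelSegmentSearch {
        public static void main(String[] args) throws Exception {
            IndexReader reader = IndexReader.open(FSDirectory.open(new File("/path/to/index")));
            ExecutorService pool = Executors.newFixedThreadPool(4);
            // each index segment is searched by a task submitted to the pool
            IndexSearcher searcher = new IndexSearcher(reader, pool);
            // ... searcher.search(query, n) as usual ...
            searcher.close();   // does not close the reader or the pool
            reader.close();
            pool.shutdown();
        }
    }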

RE: best practice: 1.4 billions documents

2010-11-22 Thread David Fertig
> it has only problems. Perhaps these known problems should be added to the API docs, so users who are encouraged to start clean with the 3.x API don't build bad applications from scratch? Parallel searching is extremely powerful and should not be abandoned.

incremental indexation

2010-11-22 Thread ZYWALEWSKI, DANIEL (DANIEL)
Hello, I'm stuck on one problem and don't know how to solve it. I'm indexing objects that live in memory (they exist only in my Java code). I don't have any problems indexing them, but I have no idea how to re-index them if they change during the e
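A common pattern for this is to give every object a stable unique key and call IndexWriter.updateDocument, which deletes any earlier document with that key and adds the new version in one atomic call. A sketch against the 3.x Field API (the field names are illustrative):

    import java.io.IOException;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;

    public class IncrementalIndexer {
        // re-index one in-memory object; "id" must be unique and not analyzed
        public static void reindex(IndexWriter writer, String id, String text)
                throws IOException {
            Document doc = new Document();
            doc.add(new Field("id", id, Field.Store.YES, Field.Index.NOT_ANALYZED));
            doc.add(new Field("body", text, Field.Store.NO, Field.Index.ANALYZED));
            // deletes the old version (if any) and adds the new one
            writer.updateDocument(new Term("id", id), doc);
        }
    }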

RE: best practice: 1.4 billions documents

2010-11-22 Thread Uwe Schindler
There is no reason to use MultiSearcher instead of the much more consistent and effective MultiReader! We (Robert and I) are already planning to deprecate it. MultiSearcher itself has had no benefit over a simple IndexSearcher on top of a MultiReader since Lucene 2.9; it has only problems. Use case
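For reference, the recommended setup looks like this against the 3.x API (the shard paths are placeholders); one IndexSearcher over a MultiReader gives consistent scoring statistics across all sub-indexes:

    import java.io.File;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.MultiReader;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.store.FSDirectory;

    public class MultiReaderSearch {
        public static void main(String[] args) throws Exception {
            IndexReader r1 = IndexReader.open(FSDirectory.open(new File("/indexes/shard1")));
            IndexReader r2 = IndexReader.open(FSDirectory.open(new File("/indexes/shard2")));
            MultiReader all = new MultiReader(r1, r2); // one logical index
            IndexSearcher searcher = new IndexSearcher(all);
            // ... search exactly as with a single-index IndexSearcher ...
            searcher.close(); // does not close the reader passed in
            all.close();      // closes r1 and r2 as well by default
        }
    }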

RE: best practice: 1.4 billions documents

2010-11-22 Thread David Fertig
>> We have a couple billion docs in our archives as well... Breaking them up by >> day worked well for us. We do not have 2 billion segments in one index; we have roughly 5-10 million documents per index. We are currently using a MultiSearcher, but unresolved Lucene issues in this will force us to

Re: best practice: 1.4 billions documents

2010-11-22 Thread Erick Erickson
Are you looking at Solr? It has a lot of the infrastructure you'd otherwise be building yourself on Lucene already built in, including replication, distributed searching, etc. Yes, there's a learning curve for something new, but your Lucene experience will help you a LOT with that. It has support for shard
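Distributed search in Solr of that era is driven by the shards request parameter, which fans the query out to every listed shard and merges the results; a sketch of such a request (host names are placeholders):

    http://host1:8983/solr/select?q=foo&shards=host1:8983/solr,host2:8983/solr,host3:8983/solr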

Re: [SOLR] DisMaxQParserPlugin and Tokenization

2010-11-22 Thread Ian Lea
> if there is a Solr newsgroup better suited for my question, please point me > there. http://lucene.apache.org/solr/mailing_lists.html -- Ian.

[SOLR] DisMaxQParserPlugin and Tokenization

2010-11-22 Thread jan.kurella
Hi, if there is a Solr newsgroup better suited for my question, please point me there. Using the SearchHandler with the defType="dismax" option enables the DisMaxQParserPlugin. From investigating, it seems it just tokenizes by whitespace, although by looking at the code I could not find t
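For context: dismax splits the q string on whitespace (respecting quoted phrases) and then runs each chunk through the analyzers of the fields listed in qf, so per-field tokenization still applies after that initial split. A typical request, with illustrative field names and boosts:

    http://localhost:8983/solr/select?defType=dismax&q=ipod+nano&qf=title^2.0+body&pf=title&mm=2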