Hi Erick,

Thank you very much for your explanations. 588 years is a rather long way off, so you're right that I probably don't need to worry about that problem for the moment. To answer your final question: no, indeed, I won't need to store a lot of data, just some keys so I can find the data in Cassandra later on.
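As a quick sanity check on the 588-year figure discussed in this thread (worst case: signed 32-bit doc IDs, 10,000 new documents a day):

```java
public class DocIdOverflow {
    public static void main(String[] args) {
        long docsPerDay = 10000;
        long maxDocId = Integer.MAX_VALUE;          // 2^31 - 1 = 2147483647
        long days = maxDocId / docsPerDay;          // 214748 days
        System.out.println(days / 365 + " years");  // prints "588 years"
    }
}
```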
If you don't mind, please let me ask you another question: is it really worthwhile to begin with Lucene rather than going directly to Solr (or Nutch)? What I mean is: is it just as easy to implement a search with Solr and stay with it, instead of first implementing a search with Lucene and then, when the project becomes very big, migrating to a new system? My goal is to have a system that can evolve over time, even if 1 million documents are added daily.

Thank you,
Victor

2010/6/21 Erick Erickson <erickerick...@gmail.com>

> By and large, you won't ever actually be interested in very many
> documents. What's returned in the TopDocs structure is the internal
> document ID and score, in score order. But retrieval by document ID is
> quite efficient; it's not a search. I'm quite sure this won't be a
> problem.
>
> Adding 10,000 documents a day means that it would take 588 years to
> exceed a 31-bit number. I don't think you really need to worry about that
> either, and that's the worst case, assuming the ints are signed.
>
> What you will have to worry about is the time needed to get the top N
> highest-scoring documents. That is, IndexSearcher.search() will be your
> limiting factor long before you reach these numbers. By that time,
> though, you'll have moved to Solr or some other distributed search
> mechanism.
>
> Performance is influenced by the complexity of the queries and the
> structure and size of your index. The time spent retrieving the top few
> matches is completely dwarfed by the search time for an index of any
> size.
>
> All this may be irrelevant if you really want to retrieve a very large
> number of documents rather than, say, the top 100. But the use case would
> have to be very interesting for it to be a requirement to return, say,
> 100,000 documents to a user.
>
> But do be aware that you're not retrieving the *original* text with
> IndexSearcher.
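The retrieval flow Erick describes above — search() returning a TopDocs of internal doc IDs and scores, with searcher.doc() as a cheap per-ID lookup rather than a second search — can be sketched roughly like this against the Lucene 3.x API current at the time of this thread (the index path and the cassandra_key field name are made-up placeholders):

```java
import java.io.File;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;

public class TopDocsSketch {
    public static void main(String[] args) throws Exception {
        // Open the index read-only ("/path/to/index" is a placeholder).
        IndexSearcher searcher = new IndexSearcher(
                FSDirectory.open(new File("/path/to/index")), true);

        // search() returns only the top N hits: doc IDs plus scores,
        // in score order -- not the documents themselves.
        TopDocs top = searcher.search(
                new TermQuery(new Term("text_comment", "test")), 10);

        // Fetching a document by its internal ID is a direct lookup.
        for (ScoreDoc sd : top.scoreDocs) {
            Document doc = searcher.doc(sd.doc);
            System.out.println(sd.doc + " score=" + sd.score
                    + " key=" + doc.get("cassandra_key"));
        }
        searcher.close();
    }
}
```

Note that `doc.get("cassandra_key")` only returns something if that field was *stored* at indexing time, which is exactly the distinction Erick turns to next.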
> Typically, the relevant data is indexed but not stored. These two
> concepts are confusing when you start using Lucene, especially since
> they're specified in the same call. Indexing a field splits it up into
> tokens and normalizes them (e.g. lowercasing, stemming, adding synonyms,
> etc.). The indexed data is the part that's searched. You can also store
> the input verbatim, but the stored part is just a copy that's never
> searched; it's only available for retrieval.
>
> Which brings up one of the central decisions you need to make. Are you
> indeed going to store all the data for retrieval in your index, or just
> index the relevant text to be searched, along with some locator
> information pointing to the original document? You mention Cassandra,
> which leads me to speculate that it's the latter.
>
> HTH
> Erick
>
> On Sun, Jun 20, 2010 at 4:04 PM, Victor Kabdebon
> <victor.kabde...@gmail.com> wrote:
>
> > Hello Simon,
> >
> > As I told you, I am quite new to Lucene, so there are many things I
> > might be getting wrong. I'm using Lucene to build a search service for
> > a website that receives a large amount of information daily. This
> > information is directly available as text in a Cassandra database.
> > There might be as many as 10,000 new documents added daily, and yes, my
> > concern is: is it possible to retrieve more documents than the integer
> > max value?
> > I also don't really see how IndexSearcher.doc() works, because it seems
> > we give this method an ID and it goes searching in the indexed
> > documents. So what exactly does IndexSearcher.doc(int) do?
> >
> > *Or are you concerned about retrieving all documents containing term
> > "XY" if the number of documents matching is large?*
> >
> > Yes, I'm also concerned by this problem.
> >
> > Could you explain to me a little how this works, and how Lucene enables
> > one to retrieve a very large number of documents even though it uses
> > int?
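The indexed-versus-stored distinction Erick describes is made per field in the same Field constructor (again the Lucene 3.x-era API; the field names and the rowKey parameter are illustrative, matching the Cassandra-locator setup he speculates about):

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class FieldSetupSketch {
    // Build a document whose text is searchable but not kept verbatim,
    // plus a stored-only key for fetching the original from Cassandra.
    public static Document makeDoc(String commentText, String rowKey) {
        Document doc = new Document();
        // Indexed (tokenized, normalized, searchable) but NOT stored:
        doc.add(new Field("text_comment", commentText,
                Field.Store.NO, Field.Index.ANALYZED));
        // Stored verbatim for retrieval, never searched:
        doc.add(new Field("cassandra_key", rowKey,
                Field.Store.YES, Field.Index.NO));
        return doc;
    }
}
```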
> >
> > Thank you for your answers,
> > Victor
> >
> > 2010/6/20 Simon Willnauer <simon.willna...@googlemail.com>
> >
> > > Hi, maybe I don't understand your question correctly. Are you asking
> > > whether you could run into problems if you retrieve more documents
> > > than the integer max value? Or are you concerned about retrieving all
> > > documents containing term "XY" if the number of documents matching is
> > > large? If you are afraid of loading all matched documents from a
> > > stored field, I guess you are doing something wrong.
> > > What are you using Lucene for?
> > >
> > > simon
> > >
> > > On Sun, Jun 20, 2010 at 8:00 PM, Victor Kabdebon
> > > <victor.kabde...@gmail.com> wrote:
> > >
> > > > Hello everybody,
> > > >
> > > > I am new to Apache Lucene and it seems to fit my needs perfectly
> > > > for my application. However, I'm a little concerned about something
> > > > (pardon me if it's a recurrent question; I've searched the archives
> > > > but didn't find anything about it).
> > > >
> > > > So here is my case: I have indexed a few files (around 10) and I'm
> > > > trying to search for something trivial in them, the word "test". So
> > > > after opening everything etc. (assuming that works too), I do:
> > > >
> > > > Term test = new Term("text_comment", "test");
> > > > Query query = new TermQuery(test);
> > > > TopDocs top = searcher.search(query, 10);
> > > >
> > > > I want to recover the first document (I have 2 documents in
> > > > TopDocs), so I do:
> > > >
> > > > Document doc = searcher.doc(top.scoreDocs[0].doc);
> > > >
> > > > I searched a little in the javadoc and saw that this method takes
> > > > an int as a parameter, and I'm a little concerned about that. At
> > > > the moment I have 10 documents, so that's OK, but if I index far
> > > > more files, how will IndexSearcher.doc(int) be able to retrieve
> > > > documents?
> > > > Same problem if 100,000 files have the word "test" in
> > > > "text_comment": will I still be able to get those 100,000
> > > > documents, or is that going to be a problem?
> > > >
> > > > Thank you very much.
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> > > For additional commands, e-mail: java-user-h...@lucene.apache.org
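If the use case ever does call for every matching document (the 100,000-hit question above), the usual Lucene 3.x answer is a custom Collector passed to search(): it receives each matching internal doc ID without the cost of scoring and ranking an enormous TopDocs. A minimal sketch (the class name is made up):

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Collector;
import org.apache.lucene.search.Scorer;

// Gathers every matching internal doc ID; no ranking, no stored-field
// loads. Usage: searcher.search(query, new AllDocIdsCollector());
public class AllDocIdsCollector extends Collector {
    private final List<Integer> ids = new ArrayList<Integer>();
    private int docBase;

    public void setScorer(Scorer scorer) {
        // Scores are not needed, so the scorer is ignored.
    }

    public void collect(int doc) {
        // doc is relative to the current segment; add the base offset.
        ids.add(docBase + doc);
    }

    public void setNextReader(IndexReader reader, int docBase) {
        this.docBase = docBase;
    }

    public boolean acceptsDocsOutOfOrder() {
        return true; // order doesn't matter when just gathering IDs
    }

    public List<Integer> getIds() {
        return ids;
    }
}
```

The IDs can then be turned into Cassandra keys in batches via searcher.doc(), rather than materializing 100,000 full documents at once.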