Hi,

> You need just the counts? And you want to do just whole-field matching,
> not word matching? In that case, Lucene might be overkill for you. Or, if
> you do use Lucene, make sure to use "keyword" (untokenized) fields, not
> "tokenized" fields.
Sorry for not elaborating on my requirement. Some of my fields need word matching and some do not: I have used NO_NORMS for the whole-field matches and TOKENIZED for the fields that need analysis. I need the counts, and I also need to show the fields that were indexed. For example, the user might give the following criteria:

USER:john AND MSG:ftp

Here USER is a NO_NORMS field and MSG is a tokenized field. The original log message looks like this:

2007 Jan 27 10:10:01 User John accessed ftp url images.html

So I cannot precompute the counts in memory, because the criteria are selected by the user and are not predefined. I have also read the following thread from 2002:
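To make the field distinction concrete, here is a minimal sketch of building such a document with the Lucene 2.x API that this thread is discussing. The field names ("USER", "MSG") come from the example above; the sample values are illustrative:

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class LogDocumentBuilder {
    public static Document build(String user, String msg) {
        Document doc = new Document();
        // USER is matched as a whole value: indexed as-is, without norms,
        // so "USER:john" only matches the exact field value.
        doc.add(new Field("USER", user, Field.Store.YES, Field.Index.NO_NORMS));
        // MSG needs word matching: the analyzer tokenizes it, so
        // "MSG:ftp" matches any message containing the word "ftp".
        doc.add(new Field("MSG", msg, Field.Store.YES, Field.Index.TOKENIZED));
        return doc;
    }
}
```

With this split, a query such as `USER:john AND MSG:ftp` behaves as described: exact match on USER, per-word match on MSG.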
Thread from 2002: my experience is that writing to the index takes the most time, aside from any parsing done by the user. I have been working on XML indexes, and there collecting the data takes just as much time as writing it. To increase *speed* I have done three things that reduced my index time from 11 hours to 2.5 hours for the same dataset (1.3 GB of XML documents):

1: I index 50 documents into a ramdir; when that limit is reached I merge the ramdir into an fsdir and flush the ramdir. This speeds things up because I don't have to touch the fsdir as often, and the ramdir is much faster.

2: Merging a large index into a large index takes nearly as much time as merging a small index into a large index, so I have 4 (any number will do) fsdirs that I write ramdirs to, and then I merge these fsdirs into one large fsdir at the end of a large index run.

3: I multithreaded my application, creating worker threads that each index into their own separate ramdir and then flush those ramdirs into separate fsdirs (hence I have an fsdir per worker thread), because a directory can only be written to by one thread.

In the end this improved my *indexing* time a lot... hope some of this can help you!

mvh
karl øie

Does this still hold good now? Thanks for your reply.

regards,
MSK

---------- Forwarded message ----------
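Karl's point 1 above (batch into a ramdir, then merge into an fsdir) can be sketched with the Lucene 2.x API. This is illustrative only; the class name, batch size, and use of StandardAnalyzer are assumptions, not the original poster's code:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.RAMDirectory;

public class BatchedIndexer {
    private static final int BATCH_SIZE = 50;  // Karl used 50 docs per batch

    public static void indexAll(Iterable<Document> docs, String indexPath) throws Exception {
        IndexWriter fsWriter = new IndexWriter(
                FSDirectory.getDirectory(indexPath), new StandardAnalyzer(), true);
        RAMDirectory ramDir = new RAMDirectory();
        IndexWriter ramWriter = new IndexWriter(ramDir, new StandardAnalyzer(), true);
        int count = 0;
        for (Document doc : docs) {
            ramWriter.addDocument(doc);
            if (++count % BATCH_SIZE == 0) {
                ramWriter.close();
                fsWriter.addIndexes(new Directory[] { ramDir });  // merge batch to disk
                ramDir = new RAMDirectory();                      // "flush": fresh ramdir
                ramWriter = new IndexWriter(ramDir, new StandardAnalyzer(), true);
            }
        }
        ramWriter.close();
        fsWriter.addIndexes(new Directory[] { ramDir });  // remaining partial batch
        fsWriter.close();
    }
}
```

Points 2 and 3 extend the same idea: keep several intermediate FSDirectory targets (one per worker thread, since each Directory has a single writer) and merge them once at the end of the run.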
From: "Nadav Har'El" <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Date: Thu, 1 Mar 2007 10:28:07 +0200
Subject: Re: indexing performance

On Tue, Feb 27, 2007, Saravana wrote about "indexing performance":
> Hi,
>
> Is it possible to scale lucene indexing like 2000/3000 documents per
> second?

I don't know about the actual numbers, but one trick I've used in the past to get really fast indexing was to create several independent indexes in parallel. Simply, if you have, say, 4 CPUs and perhaps even several physical disks, run 4 indexing processes, each indexing a quarter of the files and creating a separate index (on separate disks, on separate I/O channels, if possible). At the end you have 4 indexes, which you can actually search together without any real need to merge them, unless query performance is very important to you as well.

> I need to index 10 fields, each 20 bytes long. I should be able to
> search by just giving any of the field values as criteria. I need to
> get the count that has the same field values.

You need just the counts? And you want to do just whole-field matching, not word matching? In that case, Lucene might be overkill for you. Or, if you do use Lucene, make sure to use "keyword" (untokenized) fields, not "tokenized" fields.

--
Nadav Har'El            | Thursday, Mar 1 2007, 11 Adar 5767
IBM Haifa Research Lab  | Open your arms to change, but don't let
http://nadav.harel.org.il | go of your values.