Hi,

> You need just the counts? And you want to do just whole-field matching, not
> word matching? In that case, Lucene might be an overkill for you. Or, if you
> do use Lucene, make sure to use "keyword" (untokenized) fields, not
> "tokenized" fields.

Sorry for not elaborating my requirement more. Some of my fields need word
matching and some do not. I have used NO_NORMS for the whole-field matches and
TOKENIZED for the fields that need tokenization. I need the count, and I also
need to show the fields that are indexed.
For example, the user can give the following criteria:

USER:john AND MSG:ftp

Here USER is a NO_NORMS field and MSG is a TOKENIZED field. The original log
message is as follows:

2007 Jan 27 10:10:01 User John accessed ftp url images.html
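For illustration, the difference between the two field types on that log line
can be sketched in plain Java (a hypothetical `FieldMatchDemo`, not actual
Lucene API; the field names and values are taken from the example above):

```java
import java.util.Arrays;
import java.util.List;

public class FieldMatchDemo {
    public static void main(String[] args) {
        // USER is stored as a whole, untokenized value (NO_NORMS-style field).
        String user = "john";
        // MSG holds the full log line (TOKENIZED-style field).
        String msg = "2007 Jan 27 10:10:01 User John accessed ftp url images.html";

        // USER:john -> exact, whole-field comparison.
        boolean userMatches = user.equalsIgnoreCase("john");

        // MSG:ftp -> the message is split into words first, then matched per word.
        List<String> tokens = Arrays.asList(msg.toLowerCase().split("\\s+"));
        boolean msgMatches = tokens.contains("ftp");

        // Both clauses of "USER:john AND MSG:ftp" match this log line.
        System.out.println(userMatches && msgMatches);
    }
}
```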

So I cannot precompute the counts in memory, since the criteria are chosen by
the user and are not predefined. Moreover, I have read the following thread
from 2002:


Thread from 2002:

my experience is that writing to the index takes the most time, apart from any
parsing done by the user. I have been working on XML indexes, and there the
collection of data takes just as much time as the writing. To increase speed I
have done three things that reduced my index time from 11 hours to 2.5 hours
for the same dataset (1.3 GB of XML documents).

1: I index 50 documents into a ramdir; when the limit is reached I merge this
ramdir into an fsdir and flush the ramdir. This speeds things up, as I then
don't have to use the fsdir as much, and the ramdir is much faster.

2: Merging a large index into a large index takes nearly as much time as
merging a small index into a large index, so I have 4 (any number will do)
fsdirs that I write ramdirs to, and then I merge these fsdirs into one large
fsdir at the end of a large index run.

3: I multithreaded my application, creating worker threads that each index
into their own separate ramdir, then flush these ramdirs into separate fsdirs
(hence I have an fsdir for each worker thread), because only one thread can
write to a dir at a time.

In the end this improved my indexing time a lot...

hope some of this can help you!

regards, karl øie
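The buffer-then-flush pattern in point 1 above can be sketched in plain Java.
This is a minimal illustration of the idea only: the `BatchedIndexer` class is
hypothetical, a plain file stands in for the fsdir, and an in-memory list
stands in for the ramdir.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;

public class BatchedIndexer {
    private static final int BATCH_SIZE = 50;       // same limit as in the thread
    private final List<String> ramBuffer = new ArrayList<>(); // stands in for the ramdir
    private final Path fsIndex;                     // stands in for the fsdir

    public BatchedIndexer(Path fsIndex) { this.fsIndex = fsIndex; }

    public void addDocument(String doc) throws IOException {
        ramBuffer.add(doc);                         // cheap in-memory write
        if (ramBuffer.size() >= BATCH_SIZE) flush(); // expensive disk write, amortized
    }

    public void flush() throws IOException {        // "merge the ramdir into the fsdir"
        Files.write(fsIndex, ramBuffer,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        ramBuffer.clear();                          // "flush the ramdir"
    }

    public static void main(String[] args) throws IOException {
        Path index = Files.createTempFile("fsdir", ".idx");
        BatchedIndexer indexer = new BatchedIndexer(index);
        for (int i = 0; i < 120; i++) indexer.addDocument("doc-" + i);
        indexer.flush();                            // flush the final partial batch
        System.out.println(Files.readAllLines(index).size());
    }
}
```

The 120 documents reach disk in three writes (two full batches plus the final
flush) instead of 120, which is the whole point of the trick.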



Does this still hold good now? Thanks for your reply.

regards,
MSK

---------- Forwarded message ----------
From: "Nadav Har'El" <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Date: Thu, 1 Mar 2007 10:28:07 +0200
Subject: Re: indexing performance
On Tue, Feb 27, 2007, Saravana wrote about "indexing performance":
> Hi,
>
> Is it possible to scale lucene indexing like 2000/3000 documents per
> second?

I don't know about the actual numbers, but one trick I've used in the past
to get really fast indexing was to create several independent indexes in
parallel. Simply, if you have, say, 4 CPUs and perhaps even several physical
disks, run 4 indexing processes, each indexing 1/4 of the files and creating
a separate index (on separate disks on separate IO channels, if possible).

At the end, you have 4 indexes which you can actually search together without
any real need to merge them, unless query performance is very important to
you as well.
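The partitioning idea can be sketched with the JDK alone (threads rather than
processes, and a simple in-memory set standing in for each index; the class
name and numbers are illustrative, not Lucene API):

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelIndexDemo {
    public static void main(String[] args) throws Exception {
        int workers = 4;                                   // e.g. one per CPU
        List<String> docs = new ArrayList<>();
        for (int i = 0; i < 1000; i++) docs.add("doc" + i);

        // Each worker builds its own independent index over 1/4 of the documents.
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        List<Future<Set<String>>> partials = new ArrayList<>();
        for (int w = 0; w < workers; w++) {
            final int id = w;
            partials.add(pool.submit(() -> {
                Set<String> index = new HashSet<>();       // this worker's private index
                for (int i = id; i < docs.size(); i += workers) index.add(docs.get(i));
                return index;
            }));
        }
        pool.shutdown();

        // Search all partial indexes together -- no merge step needed.
        int hits = 0;
        for (Future<Set<String>> f : partials) {
            if (f.get().contains("doc42")) hits++;
        }
        System.out.println(hits);   // each document lives in exactly one partition
    }
}
```

Because the partitions are disjoint, searching all of them and combining the
results gives the same answer as searching one merged index.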

> I need to index 10 fields, each 20 bytes long. I should be able to search
> by giving any of the field values as criteria. I need to get the count of
> documents that have the same field values.

You need just the counts? And you want to do just whole-field matching, not
word matching? In that case, Lucene might be an overkill for you. Or, if you
do use Lucene, make sure to use "keyword" (untokenized) fields, not
"tokenized" fields.

--
Nadav Har'El                        | Thursday, Mar 1 2007, 11 Adar 5767
IBM Haifa Research Lab              |-----------------------------------------
                                    | Open your arms to change, but don't let
http://nadav.harel.org.il           | go of your values.
