Lokeya <[EMAIL PROTECTED]> wrote on 21/03/2007 22:09:06:
>
> Initially I was writing into the index 7,00,000 times. I changed the code
> to write only 70 times, which means I put a lot of data into an array
> list, add it to a doc, and index it in one shot. This is where the
> improvement came from.

Initially I was writing into the index 7,00,000 (700,000) times. I changed the
code to write only 70 times, which means I put a lot of data into an array
list, add it to a doc, and index it in one shot. This is where the improvement
came from. To be precise, IndexWriter is now adding documents 70 times vs.
7,00,000 times.
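
The batching change Lokeya describes can be pictured roughly as below. This is
only a sketch, not code from the thread: the indexDumpFile method, the field
names, and the idea of one Lucene Document per dump file are assumptions, and
the Lucene 2.x-era Field API of the time is assumed.

import java.util.List;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class BatchedIndexer {
    // "records" stands for the ~10,000 record texts parsed out of one dump file.
    public static void indexDumpFile(IndexWriter writer, String dumpFileName,
                                     List<String> records) throws Exception {
        StringBuilder batched = new StringBuilder();
        for (String record : records) {
            batched.append(record).append('\n');      // accumulate everything first
        }
        Document doc = new Document();
        doc.add(new Field("dump", dumpFileName, Field.Store.YES, Field.Index.UN_TOKENIZED));
        doc.add(new Field("contents", batched.toString(), Field.Store.NO, Field.Index.TOKENIZED));
        writer.addDocument(doc);                       // one write per dump file: 70 calls, not 7,00,000
    }
}
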
Lokeya <[EMAIL PROTECTED]> wrote on 18/03/2007 13:19:45:
>
> Yep, I did that, and now my code looks as follows.
> The time taken for indexing one file is now
> => Elapsed Time in Minutes :: 0.3531
> which is really great.
I am jumping in late, so apologies if I am missing something. However, I
don't ...
Thanks, all, for the valuable suggestions.
The lock issue also got resolved, and all 7 lakh (700,000) files are indexed
in around 85 minutes, which is like wow!
To get away with the lock issue I followed the suggestion given in this:
http://mail-archives.apache.org/mod_mbox/lucene-java-user/200601.mbox/[EMAIL
I will try to explain what I am trying to do like an algorithm:
1. There are 70 dump files, each with 10,000 record tags, which I have pasted
in my earlier mails. I split every dump file and create 10,000 XML files, each
with a single record tag and its child tags. This is because there are some
parsing issues ...
I'm not sure what the lock issue is. What version of Lucene are you
using? And what is your filesystem like? There are some known locking
issues with some versions of Lucene and some filesystems,
particularly NFS mounts, as I remember... It would help if you told
us the entire stack trace rather than ...
Yep, I did that, and now my code looks as follows.
The time taken for indexing one file is now => Elapsed Time in Minutes ::
0.3531, which is really great, but after processing 4 dump files (which means
40,000 small XML files), I get:
caught a class java.io.IOException
40114 with message: Lock obtain timed out ...
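
That IOException is the one Lucene raises when a new IndexWriter cannot
acquire the index's write.lock, typically because an earlier writer was never
closed (or a writer is opened per file instead of once). The archived
suggestion Lokeya followed is not visible here; the sketch below only shows
one era-appropriate way to check for and clear a stale lock (the index path is
a placeholder), and it is safe only when no other writer is actually running.

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class ClearStaleLock {
    public static void main(String[] args) throws Exception {
        Directory dir = FSDirectory.getDirectory("/path/to/index");  // placeholder path
        if (IndexReader.isLocked(dir)) {
            // A leftover write.lock from a writer that was never closed; remove it.
            IndexReader.unlock(dir);
        }
        dir.close();
    }
}

The more robust fix, echoed in the replies below, is to open a single
IndexWriter for the whole run and always close it in a finally block so the
lock is released even when an exception is thrown.
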
Move index writer creation, optimization and closure outside of your
loop. I would also use a SAX parser. Take a look at the demo code
to see an example of indexing.
Cheers,
Grant
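
A rough sketch of the structure Grant suggests, under the same hedges as the
earlier snippets (Lucene 2.x-era API; the index path and the listDumpFiles and
parseFile helpers are hypothetical stand-ins for the real file handling and
SAX parsing): the writer is created once, documents are added inside the loop,
and optimize() and close() run exactly once at the end.

import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;

public class IndexAllDumps {
    public static void main(String[] args) throws Exception {
        // Created once, outside the loop over dump files.
        IndexWriter writer = new IndexWriter("/path/to/index",           // placeholder path
                                             new StandardAnalyzer(), true);
        try {
            for (String dumpFile : listDumpFiles()) {                    // e.g. the 70 dump files
                for (Document doc : parseFile(dumpFile)) {               // SAX-based parsing goes here
                    writer.addDocument(doc);
                }
            }
            writer.optimize();   // once, after all documents are added
        } finally {
            writer.close();      // once, releasing the write lock even on failure
        }
    }

    // Hypothetical stand-ins for the real file listing and SAX parsing code.
    static List<String> listDumpFiles() { return new ArrayList<String>(); }
    static List<Document> parseFile(String dumpFile) { return new ArrayList<Document>(); }
}
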
On Mar 18, 2007, at 12:31 PM, Lokeya wrote:
Erick Erickson wrote:
Grant:
I think that "Parsing 70 file
Erick Erickson wrote:
>
> Grant:
>
> I think that "Parsing 70 files totally takes 80 minutes" really
> means parsing 70 metadata files containing 10,000 XML
> files each.
>
> One metadata file is split into 10,000 XML files, each of which looks as
> below:
>
> [XML record sample; its tags were stripped by the archive rendering and only
> the identifier prefix "oai:CiteSee..." survives]
Grant:
I think that "Parsing 70 files totally takes 80 minutes" really
means parsing 70 metadata files containing 10,000 XML
files each.
Lokeya:
Can you confirm my supposition? And I'd still post the code
Grant requested, if you can.
So, you're talking about indexing 10,000 XML files in ...
Can you post the relevant indexing code? Are you doing things like
optimizing after every file? Both the parsing and the indexing sound
really long. How big are these files?
Also, I assume your machine is at least somewhat current, right?
On Mar 18, 2007, at 1:00 AM, Lokeya wrote:
Thanks ...
Thanks for your reply. I tried to measure the I/O and parsing time separately
from the indexing time. I observed that I/O and parsing for all 70 files takes
80 minutes in total, whereas when I combine this with indexing, a single
metadata file takes nearly 2 to 3 hours. So it looks like IndexWriter ...
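
Splitting the timing like that can be done crudely with two timestamps per
phase; the snippet below is only an illustration of the measurement, with
parseRecords and indexRecords as hypothetical stand-ins for the real parsing
and IndexWriter code.

import java.util.ArrayList;
import java.util.List;

public class TimingSketch {
    public static void main(String[] args) {
        long t0 = System.currentTimeMillis();
        List<String> records = parseRecords("dump1.xml");   // I/O + XML parsing only
        long t1 = System.currentTimeMillis();
        indexRecords(records);                               // IndexWriter.addDocument calls only
        long t2 = System.currentTimeMillis();
        System.out.println("Parse minutes : " + (t1 - t0) / 60000.0);
        System.out.println("Index minutes : " + (t2 - t1) / 60000.0);
    }

    // Hypothetical stand-ins for the real code.
    static List<String> parseRecords(String dumpFile) { return new ArrayList<String>(); }
    static void indexRecords(List<String> records) { /* writer.addDocument(...) here */ }
}
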
See below...
On 3/17/07, Lokeya <[EMAIL PROTECTED]> wrote:
Hi,
I am trying to index the content from XML files which are basically the
metadata collected from a website which has a huge collection of documents.
This metadata XML has control characters which cause errors while trying to
parse ...
Hi,
I am trying to index the content from XML files which are basically the
metadata collected from a website which has a huge collection of documents.
This metadata XML has control characters which cause errors while trying to
parse using the DOM parser. I tried to use encoding = UTF-8, but ...
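
The thread does not show how the control-character problem was finally
solved. One common workaround, offered only as a sketch (the class and method
names are made up for illustration), is to strip characters that are not legal
in XML 1.0 before handing the text to the DOM parser:

public class XmlCleaner {
    // Keep only characters legal in XML 1.0: #x9, #xA, #xD, #x20-#xD7FF, #xE000-#xFFFD.
    // (This simple char-based version also drops supplementary, non-BMP characters.)
    public static String stripInvalidXmlChars(String in) {
        StringBuilder out = new StringBuilder(in.length());
        for (int i = 0; i < in.length(); i++) {
            char c = in.charAt(i);
            boolean legal = c == 0x9 || c == 0xA || c == 0xD
                    || (c >= 0x20 && c <= 0xD7FF)
                    || (c >= 0xE000 && c <= 0xFFFD);
            if (legal) {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String dirty = "value\u0001with\u0008control chars";
        System.out.println(stripInvalidXmlChars(dirty));   // prints: valuewithcontrol chars
    }
}
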