Re: Issue while parsing XML files due to control characters, help appreciated.

2007-03-21 Thread Doron Cohen
Lokeya <[EMAIL PROTECTED]> wrote on 21/03/2007 22:09:06: > > Initially I was writing into the Index 7,00,000 times. I changed the code to > now write only 70 times, which means I am putting a lot of data in an array > list and adding it to the doc and indexing it in one shot. This is where the improvement > came from

Re: Issue while parsing XML files due to control characters, help appreciated.

2007-03-21 Thread Lokeya
Initially I was writing into the Index 7,00,000 times. I changed the code to now write only 70 times, which means I am putting a lot of data in an array list and adding it to the doc and indexing it in one shot. This is where the improvement came from. To be precise, IndexWriter is now adding a document 70 times vs. 7,0

Re: Issue while parsing XML files due to control characters, help appreciated.

2007-03-20 Thread Doron Cohen
Lokeya <[EMAIL PROTECTED]> wrote on 18/03/2007 13:19:45: > > Yep I did that, and now my code looks as follows. > The time taken for indexing one file is now > => Elapsed Time in Minutes :: 0.3531 > which is really great I am jumping in late, so apologies if I am missing something. However I don't

Re: Issue while parsing XML files due to control characters, help appreciated.

2007-03-19 Thread Lokeya
Thanks all for the valuable suggestions. The lock issue also got resolved and all 7 lakh files are indexed in around 85 minutes, which is like wow! To get around the lock issue I followed the suggestion given in this: http://mail-archives.apache.org/mod_mbox/lucene-java-user/200601.mbox/[EMAIL

Re: Issue while parsing XML files due to control characters, help appreciated.

2007-03-18 Thread Lokeya
I will try to explain what I am trying to do as an algorithm: 1. There are 70 dump files, each of which has 10,000 record tags, which I have pasted in my earlier mails. I split every dumpfile and create 10,000 xml files each with a single and its child tags. This is because there are some parsing is

Re: Issue while parsing XML files due to control characters, help appreciated.

2007-03-18 Thread Erick Erickson
I'm not sure what the lock issue is. What version of Lucene are you using? And what is your filesystem like? There are some known locking issues with some versions of Lucene and some filesystems, particularly NFS mounts as I remember... It would help if you told us the entire stack trace rather t

Re: Issue while parsing XML files due to control characters, help appreciated.

2007-03-18 Thread Lokeya
Yep I did that, and now my code looks as follows. The time taken for indexing one file is now => Elapsed Time in Minutes :: 0.3531, which is really great, but after processing 4 dumpfiles (which means 40,000 small XML files), I get: caught a class java.io.IOException 40114 with message: Lock obta

Re: Issue while parsing XML files due to control characters, help appreciated.

2007-03-18 Thread Grant Ingersoll
Move index writer creation, optimization and closure outside of your loop. I would also use a SAX parser. Take a look at the demo code to see an example of indexing. Cheers, Grant On Mar 18, 2007, at 12:31 PM, Lokeya wrote: Erick Erickson wrote: Grant: I think that "Parsing 70 file
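For readers following the SAX suggestion above, here is a minimal sketch of event-based parsing with the JDK's built-in SAX parser. It is an illustration only, not code from the thread: the class name RecordSaxSketch, the helper parseTitles, and the element name title are all hypothetical, since the real tag names in the dump files are not shown here. The point is that the handler sees one element at a time, so a dump file does not have to be split into thousands of small files or loaded as a DOM tree first.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public final class RecordSaxSketch {
    private RecordSaxSketch() {}

    /** Collects the text of every <title> element. The element name is a hypothetical stand-in. */
    static final class TitleHandler extends DefaultHandler {
        final List<String> titles = new ArrayList<>();
        private StringBuilder current; // non-null only while inside a <title> element

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if ("title".equals(qName)) {
                current = new StringBuilder();
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // May be called several times for one element, so append rather than assign.
            if (current != null) {
                current.append(ch, start, length);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("title".equals(qName) && current != null) {
                titles.add(current.toString().trim());
                current = null;
            }
        }
    }

    /** Parses the given XML text and returns the collected titles in document order. */
    static List<String> parseTitles(String xml) {
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            TitleHandler handler = new TitleHandler();
            parser.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
            return handler.titles;
        } catch (Exception e) {
            // Wrapped so callers of this sketch do not need checked-exception handling.
            throw new IllegalStateException("XML parsing failed", e);
        }
    }
}
```

In an indexing loop, endElement would be the natural place to hand a finished record to a single, already-open index writer, in line with the advice to create, optimize and close the writer outside the loop.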

Re: Issue while parsing XML files due to control characters, help appreciated.

2007-03-18 Thread Lokeya
Erick Erickson wrote: > > Grant: > > I think that "Parsing 70 files totally takes 80 minutes" really > means parsing 70 metadata files containing 10,000 XML > files each. > > One Metadata File is split into 10,000 XML files which looks as below: > > > > > oai:CiteSee

Re: Issue while parsing XML files due to control characters, help appreciated.

2007-03-18 Thread Erick Erickson
Grant: I think that "Parsing 70 files totally takes 80 minutes" really means parsing 70 metadata files containing 10,000 XML files each. Lokeya: Can you confirm my supposition? And I'd still post the code Grant requested if you can. So, you're talking about indexing 10,000 xml files in

Re: Issue while parsing XML files due to control characters, help appreciated.

2007-03-18 Thread Grant Ingersoll
Can you post the relevant indexing code? Are you doing things like optimizing after every file? Both the parsing and the indexing sound really long. How big are these files? Also, I assume your machine is at least somewhat current, right? On Mar 18, 2007, at 1:00 AM, Lokeya wrote: Thank

Re: Issue while parsing XML files due to control characters, help appreciated.

2007-03-17 Thread Lokeya
Thanks for your reply. I tried to measure the I/O and parsing time separately from the indexing time. I observed that I/O and parsing of 70 files takes 80 minutes in total, whereas when I combine this with indexing, a single Metadata file takes nearly 2 to 3 hours. So looks like IndexWriter

Re: Issue while parsing XML files due to control characters, help appreciated.

2007-03-17 Thread Erick Erickson
See below... On 3/17/07, Lokeya <[EMAIL PROTECTED]> wrote: Hi, I am trying to index the content from XML files which are basically the metadata collected from a website which has a huge collection of documents. This metadata xml has control characters which cause errors while trying to pars

Re: Issue while parsing XML files due to control characters, help appreciated.

2007-03-16 Thread Cheolgoo Kang
On 3/17/07, Lokeya <[EMAIL PROTECTED]> wrote: Hi, I am trying to index the content from XML files which are basically the metadata collected from a website which has a huge collection of documents. This metadata xml has control characters which cause errors while trying to parse using the DOM

Issue while parsing XML files due to control characters, help appreciated.

2007-03-16 Thread Lokeya
Hi, I am trying to index the content from XML files which are basically the metadata collected from a website which has a huge collection of documents. This metadata xml has control characters which cause errors while trying to parse using the DOM parser. I tried to use encoding = UTF-8 but lo
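
One common way to deal with the problem described in this first message is to drop characters that XML 1.0 does not allow before handing the text to a parser. The sketch below is an assumption about how that could be done, not the fix the thread settled on: the class XmlCharFilter and its methods are hypothetical names, and the allowed ranges are taken from the Char production of the XML 1.0 specification (tab, line feed, carriage return, U+0020 to U+D7FF, U+E000 to U+FFFD, U+10000 to U+10FFFF).

```java
public final class XmlCharFilter {
    private XmlCharFilter() {}

    /** Returns true if the code point is permitted by the XML 1.0 Char production. */
    static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Returns a copy of the input with every code point that XML 1.0 forbids removed. */
    static String stripInvalidXmlChars(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); ) {
            int cp = input.codePointAt(i);
            if (isValidXmlChar(cp)) {
                out.appendCodePoint(cp);
            }
            // Advance by one or two chars depending on whether cp is a supplementary code point.
            i += Character.charCount(cp);
        }
        return out.toString();
    }
}
```

The filter would be applied to the raw file contents (decoded with the correct charset) before parsing. Note that it silently discards data, so it suits cases where the control characters are noise rather than meaningful content.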