Yep, I did that, and now my code looks as follows.
The time taken for indexing one file is now => Elapsed Time in Minutes ::
0.3531, which is really great, but after processing 4 dump files (which
means 40,000 small XMLs), I get:

caught a class java.io.IOException
40114 with message: Lock obtain timed out:
Lock@/tmp/lucene-e36d478e46e88f594d57a03c10ee0b3b-write.lock

This is the new issue now. What could be the reason? I am surprised,
because I am only ever writing to the index under ./LUCENE/ and not doing
anything else with it (of course, to avoid exactly such synchronization
issues!).
for (int i = 0; i < children1.length; i++)
{
    // Get filename of file or directory
    String filename = children1[i];
    if (!filename.startsWith("oai_citeseer"))
    {
        continue;
    }
    String dirName = filename;
    System.out.println("File => " + filename);
    NodeList nl = null;
    // Testing the creation of the index
    try
    {
        File dir = new File(dirName);
        String[] children = dir.list();
        if (children == null) {
            // Either dir does not exist or is not a directory
        }
        else
        {
            ArrayList alist_Title = new ArrayList(children.length);
            ArrayList alist_Descr = new ArrayList(children.length);
            System.out.println("Size of Array List : " + alist_Title.size());
            String title = null;
            String descr = null;
            long startTime = System.currentTimeMillis();
            System.out.println(dir + " start time ==> " + System.currentTimeMillis());
            for (int ii = 0; ii < children.length; ii++)
            {
                // Get filename of file or directory
                String file = children[ii];
                System.out.println("Name of Dir : Record ~~~~~~~ " + dirName + " : " + file);
                //System.out.println("The name of file parsed now ==> " + file);
                // Get the value of the record's metadata tag
                nl = ValueGetter.getNodeList(filename + "/" + file, "metadata");
                if (nl == null)
                {
                    System.out.println("Error shouldn't be thrown ...");
                    alist_Title.add(ii, "title");
                    alist_Descr.add(ii, "descr");
                    continue;
                }
                // Get the metadata element (title and description tags) from the dump file
                ValueGetter vg = new ValueGetter();
                // Get the extracted tags Title, Identifier and Description
                try
                {
                    title = vg.getNodeVal(nl, "dc:title");
                    alist_Title.add(ii, title);
                    descr = vg.getNodeVal(nl, "dc:description");
                    alist_Descr.add(ii, descr);
                }
                catch (Exception ex)
                {
                    ex.printStackTrace();
                    System.out.println("Excep ==> " + ex);
                }
            } // End of for
            // Create an index under LUCENE
            IndexWriter writer = new IndexWriter("./LUCENE", new StopStemmingAnalyzer(), false);
            Document doc = new Document();
            // Get ArrayList elements and add them as fields to doc
            for (int k = 0; k < alist_Title.size(); k++)
            {
                doc.add(new Field("Title", alist_Title.get(k).toString(),
                        Field.Store.YES, Field.Index.UN_TOKENIZED));
            }
            for (int k = 0; k < alist_Descr.size(); k++)
            {
                doc.add(new Field("Description", alist_Descr.get(k).toString(),
                        Field.Store.YES, Field.Index.UN_TOKENIZED));
            }
            // Add the document created out of those fields to the IndexWriter,
            // which will create the index
            writer.addDocument(doc);
            writer.optimize();
            writer.close();
            long elapsedTimeMillis = System.currentTimeMillis() - startTime;
            System.out.println("Elapsed Time for " + dirName + " :: " + elapsedTimeMillis);
            float elapsedTimeMin = elapsedTimeMillis / (60 * 1000F);
            System.out.println("Elapsed Time in Minutes :: " + elapsedTimeMin);
        } // End of else
    } // End of try
    catch (Exception ee)
    {
        ee.printStackTrace();
        System.out.println("caught a " + ee.getClass() + "\n with message: " + ee.getMessage());
    }
    System.out.println("Total Record ==> " + total_records);
} // End of for
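
For reference, a minimal sketch of the lifecycle Grant suggests below:
the IndexWriter is created once before the loop and closed in a finally
block, so the write.lock is always released even if one directory fails.
This assumes the same Lucene 2.x constructor and StopStemmingAnalyzer as
above; the create flag is set to true on the assumption that each run
builds a fresh index:

IndexWriter writer = new IndexWriter("./LUCENE", new StopStemmingAnalyzer(), true);
try
{
    for (int i = 0; i < children1.length; i++)
    {
        // ... parse one dump directory and build its Document, as above ...
        Document doc = new Document();
        // doc.add(new Field(...)) calls go here
        writer.addDocument(doc);
    }
    // Optimize once at the end of the whole run, not once per directory.
    writer.optimize();
}
finally
{
    // Closing in finally prevents a stale /tmp/lucene-*-write.lock
    // when an exception escapes the loop.
    writer.close();
}
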
Grant Ingersoll wrote:
>
> Move index writer creation, optimization and closure outside of your
> loop. I would also use a SAX parser. Take a look at the demo code
> to see an example of indexing.
>
> Cheers,
> Grant
>
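
A rough illustration of the SAX approach Grant mentions, in case it is
useful: a handler that streams each file once and collects only the
dc:title and dc:description text, instead of building a whole DOM per
file. The class name and wiring below are illustrative, not taken from
the Lucene demo:

import java.io.File;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Illustrative handler: remembers the text of dc:title and dc:description.
public class CiteSeerHandler extends DefaultHandler {
    private final StringBuffer buf = new StringBuffer();
    private boolean capture = false;
    public String title = "";
    public String descr = "";

    public void startElement(String uri, String local, String qName, Attributes atts) {
        capture = qName.equals("dc:title") || qName.equals("dc:description");
        buf.setLength(0);
    }

    public void characters(char[] ch, int start, int length) {
        if (capture) buf.append(ch, start, length);
    }

    public void endElement(String uri, String local, String qName) {
        if (qName.equals("dc:title")) title = buf.toString();
        else if (qName.equals("dc:description")) descr = buf.toString();
        capture = false;
    }
}

// Usage, per file:
//   SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
//   CiteSeerHandler handler = new CiteSeerHandler();
//   parser.parse(new File(dirName, file), handler);
//   ... then add handler.title and handler.descr to the Document ...

Since SAX streams the file in one pass, there is no per-file DOM tree to
build and garbage-collect, which is usually where per-file DOM parsing
loses its time.
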
> On Mar 18, 2007, at 12:31 PM, Lokeya wrote:
>
>>
>>
>> Erick Erickson wrote:
>>>
>>> Grant:
>>>
>>> I think that "Parsing 70 files totally takes 80 minutes" really
>>> means parsing 70 metadata files containing 10,000 XML
>>> files each.....
>>>
>>> One metadata file is split into 10,000 XML files, each of which
>>> looks as below:
>>>
>>> <root>
>>>  <record>
>>>   <header>
>>>    <identifier>oai:CiteSeerPSU:1</identifier>
>>>    <datestamp>1993-08-11</datestamp>
>>>    <setSpec>CiteSeerPSUset</setSpec>
>>>   </header>
>>>   <metadata>
>>>    <oai_citeseer:oai_citeseer
>>>      xmlns:oai_citeseer="http://copper.ist.psu.edu/oai/oai_citeseer/"
>>>      xmlns:dc="http://purl.org/dc/elements/1.1/"
>>>      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
>>>      xsi:schemaLocation="http://copper.ist.psu.edu/oai/oai_citeseer/
>>>      http://copper.ist.psu.edu/oai/oai_citeseer.xsd ">
>>>     <dc:title>36 Problems for Semantic Interpretation</dc:title>
>>>     <oai_citeseer:author name="Gabriele Scheler">
>>>      <address>80290 Munchen , Germany</address>
>>>      <affiliation>Institut fur Informatik; Technische Universitat
>>>      Munchen</affiliation>
>>>     </oai_citeseer:author>
>>>     <dc:subject>Gabriele Scheler 36 Problems for Semantic
>>>     Interpretation</dc:subject>
>>>     <dc:description>This paper presents a collection of problems for
>>>     natural language analysisderived mainly from theoretical
>>>     linguistics. Most of these problemspresent major obstacles for
>>>     computational systems of language interpretation.The set of given
>>>     sentences can easily be scaled up by introducing moreexamples per
>>>     problem. The construction of computational systems couldbenefit
>>>     from such a collection, either using it directly for training
>>>     andtesting or as a set of benchmarks to qualify the performance
>>>     of a NLPsystem.1 IntroductionThe main part of this paper consists
>>>     of a collection of problems for semanticanalysis of natural
>>>     language. The problems are arranged in the following way:example
>>>     sentencesconcise description of the problemkeyword for the type
>>>     of problemThe sources (first appearance in print) of the
>>>     sentences have been left out,because they are sometimes hard to
>>>     track and will usually not be of much use,as they indicate a
>>>     starting-point of discussion only. The keywords howeve...
>>>     </dc:description>
>>>     <dc:contributor>The Pennsylvania State University CiteSeer
>>>     Archives</dc:contributor>
>>>     <dc:publisher>unknown</dc:publisher>
>>>     <dc:date>1993-08-11</dc:date>
>>>     <dc:format>ps</dc:format>
>>>     <dc:identifier>http://citeseer.ist.psu.edu/1.html</dc:identifier>
>>>     <dc:source>ftp://flop.informatik.tu-muenchen.de/pub/fki/fki-179-93.ps.gz</dc:source>
>>>     <dc:language>en</dc:language>
>>>     <dc:rights>unrestricted</dc:rights>
>>>    </oai_citeseer:oai_citeseer>
>>>   </metadata>
>>>  </record>
>>> </root>
>>>
>>>
>>> From the above I will extract the Title and the Description tags
>>> to index.
>>>
>>> Code to do this:
>>>
>>> 1. I have 70 directories with names like oai_citeseerXYZ/
>>> 2. Under each of the above directories, I have 10,000 XML files,
>>> each containing the above XML data.
>>> 3. The program does the following:
>>>
>>> File dir = new File(dirName);
>>> String[] children = dir.list();
>>> if (children == null) {
>>>     // Either dir does not exist or is not a directory
>>> }
>>> else
>>> {
>>>     for (int ii = 0; ii < children.length; ii++)
>>>     {
>>>         // Get filename of file or directory
>>>         String file = children[ii];
>>>         //System.out.println("The name of file parsed now ==> " + file);
>>>         nl = ReadDump.getNodeList(filename + "/" + file, "metadata");
>>>         if (nl == null)
>>>         {
>>>             //System.out.println("Error shouldn't be thrown ...");
>>>         }
>>>         // Get the metadata element tags from the xml file
>>>         ReadDump rd = new ReadDump();
>>>
>>>         // Get the extracted tags Title, Identifier and Description
>>>         ArrayList alist_Title = rd.getElements(nl, "dc:title");
>>>         ArrayList alist_Descr = rd.getElements(nl, "dc:description");
>>>
>>>         // Create an index under DIR
>>>         IndexWriter writer = new IndexWriter("./FINAL/", new StopStemmingAnalyzer(), false);
>>>         Document doc = new Document();
>>>
>>>         // Get ArrayList elements and add them as fields to doc
>>>         for (int k = 0; k < alist_Title.size(); k++)
>>>         {
>>>             doc.add(new Field("Title", alist_Title.get(k).toString(),
>>>                     Field.Store.YES, Field.Index.UN_TOKENIZED));
>>>         }
>>>
>>>         for (int k = 0; k < alist_Descr.size(); k++)
>>>         {
>>>             doc.add(new Field("Description", alist_Descr.get(k).toString(),
>>>                     Field.Store.YES, Field.Index.UN_TOKENIZED));
>>>         }
>>>
>>>         // Add the document created out of those fields to the
>>>         // IndexWriter, which will create the index
>>>         writer.addDocument(doc);
>>>         writer.optimize();
>>>         writer.close();
>>>     }
>>>
>>>
>>> This is the main file which does indexing.
>>>
>>> Hope this will give you an idea.
>>>
>>>
>>> Lokeya:
>>> Can you confirm my supposition? And I'd still post the code
>>> Grant requested if you can.....
>>>
>>> So, you're talking about indexing 10,000 xml files in 2-3 hours, of
>>> which 8 minutes or so is spent reading/parsing, right? It'll be
>>> important to know how much data you're indexing and how, so the
>>> code snippet is doubly important....
>>>
>>> Erick
>>>
>>> On 3/18/07, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
>>>>
>>>> Can you post the relevant indexing code? Are you doing things like
>>>> optimizing after every file? Both the parsing and the indexing
>>>> sound
>>>> really long. How big are these files?
>>>>
>>>> Also, I assume your machine is at least somewhat current, right?
>>>>
>>>> On Mar 18, 2007, at 1:00 AM, Lokeya wrote:
>>>>
>>>>>
>>>>> Thanks for your reply. I tried to measure the I/O and parsing time
>>>>> separately from the indexing time. I observed that I/O and parsing
>>>>> for all 70 files takes 80 minutes in total, whereas when I combine
>>>>> this with indexing, a single metadata file takes nearly 2 to 3
>>>>> hours. So it looks like the IndexWriter is what takes the time,
>>>>> and this happens especially when appending to an existing index.
>>>>>
>>>>> So what is the best approach to handle this?
>>>>>
>>>>> Thanks in Advance.
>>>>>
>>>>>
>>>>> Erick Erickson wrote:
>>>>>>
>>>>>> See below...
>>>>>>
>>>>>> On 3/17/07, Lokeya <[EMAIL PROTECTED]> wrote:
>>>>>>>
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I am trying to index the content of XML files which are
>>>>>>> basically the metadata collected from a website that has a huge
>>>>>>> collection of documents. This metadata XML has control
>>>>>>> characters which cause errors when trying to parse it with the
>>>>>>> DOM parser. I tried to use encoding = UTF-8, but it looks like
>>>>>>> it doesn't cover all the Unicode characters and I get an error.
>>>>>>> Also, when I tried to use UTF-16, I got "Prolog content not
>>>>>>> allowed here". So my guess is there is no encoding which is
>>>>>>> going to cover almost all Unicode characters. So I tried to
>>>>>>> split my metadata files into small files and to process only the
>>>>>>> records which don't throw a parsing error.
>>>>>>>
>>>>>>> But by breaking each metadata file into smaller files I get
>>>>>>> 10,000 XML files per metadata file. I have 70 metadata files, so
>>>>>>> altogether it becomes 700,000 files. Processing them
>>>>>>> individually takes a really long time using Lucene; my guess is
>>>>>>> that the I/O is time consuming: opening every small XML file,
>>>>>>> loading it into a DOM, extracting the required data and
>>>>>>> processing it.
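
As an aside, one way to sidestep the control-character problem without
splitting the dumps is to filter out the characters that XML 1.0 forbids
before handing the text to the parser; no choice of encoding will make
raw control characters legal XML. A rough sketch (the method name is
illustrative; supplementary characters outside the BMP are ignored for
brevity):

// Keep only characters the XML 1.0 spec allows:
// #x9, #xA, #xD, #x20-#xD7FF, #xE000-#xFFFD.
public static String stripInvalidXmlChars(String in) {
    StringBuffer out = new StringBuffer(in.length());
    for (int i = 0; i < in.length(); i++) {
        char c = in.charAt(i);
        if (c == 0x9 || c == 0xA || c == 0xD
                || (c >= 0x20 && c <= 0xD7FF)
                || (c >= 0xE000 && c <= 0xFFFD)) {
            out.append(c);
        }
    }
    return out.toString();
}
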
>>>>>>
>>>>>>
>>>>>>
>>>>>> So why don't you measure and find out before trying to make the
>>>>>> indexing
>>>>>> step more efficient? You simply cannot optimize without knowing
>>>>>> where
>>>>>> you're spending your time. I can't tell you how often I've been
>>>>>> wrong
>>>>>> about
>>>>>> "why my program was slow" <G>.
>>>>>>
>>>>>> In this case, it should be really simple. Just comment out the
>>>>>> part where
>>>>>> you index the data and run, say, one of your metadata files. I
>>>>>> suspect
>>>>>> that
>>>>>> Cheolgoo Kang's response is cogent, and you indeed are spending
>>>>>> your
>>>>>> time parsing the XML. I further suspect that the problem is not
>>>>>> disk IO,
>>>>>> but the time spent parsing. But until you measure, you have no
>>>>>> clue
>>>>>> whether you should mess around with the Lucene parameters, or find
>>>>>> another parser, or just live with it. Assuming that you
>>>>>> comment out
>>>>>> Lucene and things are still slow, the next step would be to just
>>>>>> read in
>>>>>> each file and NOT parse it to figure out whether it's the IO or
>>>>>> the
>>>>>> parsing.
>>>>>>
>>>>>> Then you can worry about how to fix it.
>>>>>>
>>>>>> Best
>>>>>> Erick
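
A concrete way to take that measurement, for what it's worth: time the
read, parse, and index phases separately and compare the totals. A
sketch using only standard APIs (assume it runs in a method declared to
throw Exception; the field-extraction step is elided):

long readMs = 0, parseMs = 0, indexMs = 0;
javax.xml.parsers.DocumentBuilder db =
    javax.xml.parsers.DocumentBuilderFactory.newInstance().newDocumentBuilder();
for (int ii = 0; ii < children.length; ii++) {
    java.io.File f = new java.io.File(dirName, children[ii]);

    long t0 = System.currentTimeMillis();
    byte[] raw = new byte[(int) f.length()];             // raw I/O only
    java.io.FileInputStream in = new java.io.FileInputStream(f);
    int off = 0;
    while (off < raw.length) {                           // read the whole file
        int n = in.read(raw, off, raw.length - off);
        if (n < 0) break;
        off += n;
    }
    in.close();
    long t1 = System.currentTimeMillis();

    org.w3c.dom.Document xml =
        db.parse(new java.io.ByteArrayInputStream(raw)); // DOM parse only
    long t2 = System.currentTimeMillis();

    // ... extract fields from xml and writer.addDocument(doc) ...  // Lucene only
    long t3 = System.currentTimeMillis();

    readMs += t1 - t0; parseMs += t2 - t1; indexMs += t3 - t2;
}
System.out.println("read=" + readMs + "ms, parse=" + parseMs
    + "ms, index=" + indexMs + "ms");
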
>>>>>>
>>>>>>
>>>>>>> Qn 1: Any suggestion to get this indexing time reduced? It
>>>>>>> would be really great.
>>>>>>>
>>>>>>> Qn 2: Am I overlooking something in Lucene with respect to
>>>>>>> indexing?
>>>>>>>
>>>>>>> Right now 12 metadata files take nearly 10 hrs, which is really
>>>>>>> a long time.
>>>>>>>
>>>>>>> Help Appreciated.
>>>>>>>
>>>>>>> Much Thanks.
>>>>>>
>>>>>>
>>>>>
>>>>
>>>> --------------------------
>>>> Grant Ingersoll
>>>> Center for Natural Language Processing
>>>> http://www.cnlp.org
>>>>
>>>> Read the Lucene Java FAQ at http://wiki.apache.org/jakarta-lucene/
>>>> LuceneFAQ
>>>>
>>>>
>>>>
>>>
>>>
>>
>
> ------------------------------------------------------
> Grant Ingersoll
> http://www.grantingersoll.com/
> http://lucene.grantingersoll.com
> http://www.paperoftheweek.com/
>