Yep, I did that, and now my code looks as follows. The time taken for indexing one file is now => Elapsed Time in Minutes :: 0.3531, which is really great. But after processing 4 dump files (which means 40,000 small XMLs), I get:
caught a class java.io.IOException 40114
with message: Lock obtain timed out: Lock@/tmp/lucene-e36d478e46e88f594d57a03c10ee0b3b-write.lock

This is the new issue now. What could be the reason? I am surprised, because every time I only write to the index under ./LUCENE/ and do nothing else with the index (of course, to avoid exactly such synchronization issues!).

for (int i = 0; i < children1.length; i++) {
    // Get filename of file or directory
    String filename = children1[i];
    if (!filename.startsWith("oai_citeseer")) {
        continue;
    }
    String dirName = filename;
    System.out.println("File => " + filename);
    NodeList nl = null;

    // Testing the creation of the index
    try {
        File dir = new File(dirName);
        String[] children = dir.list();
        if (children == null) {
            // Either dir does not exist or is not a directory
        } else {
            ArrayList alist_Title = new ArrayList(children.length);
            ArrayList alist_Descr = new ArrayList(children.length);
            System.out.println("Size of Array List : " + alist_Title.size());
            String title = null;
            String descr = null;
            long startTime = System.currentTimeMillis();
            System.out.println(dir + " start time ==> " + System.currentTimeMillis());

            for (int ii = 0; ii < children.length; ii++) {
                // Get filename of file or directory
                String file = children[ii];
                System.out.println(" Name of Dir : Record ~~~~~~~ " + dirName + " : " + file);
                //System.out.println("The name of file parsed now ==> " + file);

                // Get the value of the record's metadata tag
                nl = ValueGetter.getNodeList(filename + "/" + file, "metadata");
                if (nl == null) {
                    System.out.println("Error shouldn't be thrown ...");
                    alist_Title.add(ii, "title");
                    alist_Descr.add(ii, "descr");
                    continue;
                }

                // Get the metadata elements (title and description tags) from the dump file
                ValueGetter vg = new ValueGetter();

                // Get the extracted tags: Title, Identifier and Description
                try {
                    title = vg.getNodeVal(nl, "dc:title");
                    alist_Title.add(ii, title);
                    descr = vg.getNodeVal(nl, "dc:description");
                    alist_Descr.add(ii, descr);
                } catch (Exception ex) {
                    ex.printStackTrace();
                    System.out.println("Excep ==> " + ex);
                }
            } // End of for

            // Create an index under LUCENE
            IndexWriter writer = new IndexWriter("./LUCENE", new StopStemmingAnalyzer(), false);
            Document doc = new Document();

            // Get the ArrayList elements and add them as fields to the doc
            for (int k = 0; k < alist_Title.size(); k++) {
                doc.add(new Field("Title", alist_Title.get(k).toString(),
                        Field.Store.YES, Field.Index.UN_TOKENIZED));
            }
            for (int k = 0; k < alist_Descr.size(); k++) {
                doc.add(new Field("Description", alist_Descr.get(k).toString(),
                        Field.Store.YES, Field.Index.UN_TOKENIZED));
            }

            // Add the document created out of those fields to the IndexWriter,
            // which will create an index
            writer.addDocument(doc);
            writer.optimize();
            writer.close();

            long elapsedTimeMillis = System.currentTimeMillis() - startTime;
            System.out.println("Elapsed Time for " + dirName + " :: " + elapsedTimeMillis);
            float elapsedTimeMin = elapsedTimeMillis / (60 * 1000F);
            System.out.println("Elapsed Time in Minutes :: " + elapsedTimeMin);
        } // End of else
    } // End of try
    catch (Exception ee) {
        ee.printStackTrace();
        System.out.println("caught a " + ee.getClass() + "\n with message: " + ee.getMessage());
    }
    System.out.println("Total Record ==> " + total_records);
} // End of for


Grant Ingersoll-5 wrote:
>
> Move index writer creation, optimization and closure outside of your
> loop. I would also use a SAX parser. Take a look at the demo code
> to see an example of indexing.
>
> Cheers,
> Grant
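A minimal sketch of that restructuring, assuming the same StopStemmingAnalyzer and the document-building code from the post above: the writer is opened once before the directory loop and closed once in a finally block, so the write lock is acquired a single time and released even when an exception is thrown. In the code above, any exception between the IndexWriter constructor and writer.close() leaves the lock file in /tmp behind, which is one plausible cause of the Lock obtain timed out error.

import java.io.File;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;

// Sketch only: one IndexWriter for the whole run.
void indexAllDumps(String[] children1) throws Exception {
    // 'false' appends to the existing ./LUCENE index, as in the code above.
    IndexWriter writer = new IndexWriter("./LUCENE", new StopStemmingAnalyzer(), false);
    try {
        for (int i = 0; i < children1.length; i++) {
            if (!children1[i].startsWith("oai_citeseer")) {
                continue;
            }
            File dir = new File(children1[i]);
            String[] children = dir.list();
            if (children == null) {
                continue; // not a directory
            }
            Document doc = new Document();
            // ... build the Title/Description fields exactly as above ...
            writer.addDocument(doc);
        }
        writer.optimize(); // once, after all 70 directories are indexed
    } finally {
        writer.close(); // always releases the write lock
    }
}

Note also that if an earlier run died without closing its writer, the stale /tmp/lucene-*-write.lock file has to be removed (or cleared with IndexReader.unlock on the directory) before any new writer can be opened.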
>
> On Mar 18, 2007, at 12:31 PM, Lokeya wrote:
>
>>
>> Erick Erickson wrote:
>>>
>>> Grant:
>>>
>>> I think that "Parsing 70 files totally takes 80 minutes" really
>>> means parsing 70 metadata files containing 10,000 XML files each.....
>>>
>>> One Metadata File is split into 10,000 XML files which looks as below:
>>>
>>> <root>
>>>   <record>
>>>     <header>
>>>       <identifier>oai:CiteSeerPSU:1</identifier>
>>>       <datestamp>1993-08-11</datestamp>
>>>       <setSpec>CiteSeerPSUset</setSpec>
>>>     </header>
>>>     <metadata>
>>>       <oai_citeseer:oai_citeseer
>>>           xmlns:oai_citeseer="http://copper.ist.psu.edu/oai/oai_citeseer/"
>>>           xmlns:dc="http://purl.org/dc/elements/1.1/"
>>>           xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
>>>           xsi:schemaLocation="http://copper.ist.psu.edu/oai/oai_citeseer/ http://copper.ist.psu.edu/oai/oai_citeseer.xsd ">
>>>         <dc:title>36 Problems for Semantic Interpretation</dc:title>
>>>         <oai_citeseer:author name="Gabriele Scheler">
>>>           <address>80290 Munchen , Germany</address>
>>>           <affiliation>Institut fur Informatik; Technische Universitat Munchen</affiliation>
>>>         </oai_citeseer:author>
>>>         <dc:subject>Gabriele Scheler 36 Problems for Semantic Interpretation</dc:subject>
>>>         <dc:description>This paper presents a collection of problems for natural
>>>           language analysis derived mainly from theoretical linguistics. Most of
>>>           these problems present major obstacles for computational systems of
>>>           language interpretation. The set of given sentences can easily be scaled up
>>>           by introducing more examples per problem. The construction of computational
>>>           systems could benefit from such a collection, either using it directly for
>>>           training and testing or as a set of benchmarks to qualify the performance
>>>           of a NLP system. 1 Introduction The main part of this paper consists of a
>>>           collection of problems for semantic analysis of natural language. The
>>>           problems are arranged in the following way: example sentences; concise
>>>           description of the problem; keyword for the type of problem. The sources
>>>           (first appearance in print) of the sentences have been left out, because
>>>           they are sometimes hard to track and will usually not be of much use, as
>>>           they indicate a starting-point of discussion only. The keywords howeve...
>>>         </dc:description>
>>>         <dc:contributor>The Pennsylvania State University CiteSeer Archives</dc:contributor>
>>>         <dc:publisher>unknown</dc:publisher>
>>>         <dc:date>1993-08-11</dc:date>
>>>         <dc:format>ps</dc:format>
>>>         <dc:identifier>http://citeseer.ist.psu.edu/1.html</dc:identifier>
>>>         <dc:source>ftp://flop.informatik.tu-muenchen.de/pub/fki/fki-179-93.ps.gz</dc:source>
>>>         <dc:language>en</dc:language>
>>>         <dc:rights>unrestricted</dc:rights>
>>>       </oai_citeseer:oai_citeseer>
>>>     </metadata>
>>>   </record>
>>> </root>
>>>
>>> From the above I will extract the Title and the Description tags to index.
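Since the goal is only the Title and Description text from this fixed record format, Grant's SAX suggestion fits well. A hedged sketch of a handler (the class name and field handling are illustrative, not from the thread): it collects the character data of dc:title and dc:description in one streaming pass, without building a DOM for each of the 10,000 small files.

import java.io.File;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Illustrative handler: grabs the text of dc:title and dc:description.
public class CiteseerHandler extends DefaultHandler {
    private StringBuffer buf = new StringBuffer();
    private boolean capture = false;
    public String title;
    public String description;

    public void startElement(String uri, String localName, String qName,
                             Attributes attrs) {
        if ("dc:title".equals(qName) || "dc:description".equals(qName)) {
            capture = true;
            buf.setLength(0); // start collecting fresh text
        }
    }

    public void characters(char[] ch, int start, int length) {
        if (capture) {
            buf.append(ch, start, length);
        }
    }

    public void endElement(String uri, String localName, String qName) {
        if ("dc:title".equals(qName)) {
            title = buf.toString();
            capture = false;
        } else if ("dc:description".equals(qName)) {
            description = buf.toString();
            capture = false;
        }
    }
}

// Usage sketch, per file:
//   SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
//   CiteseerHandler h = new CiteseerHandler();
//   parser.parse(new File(dirName + "/" + file), h);
//   // then add h.title / h.description as Fields, as in the code above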
>>>
>>> Code to do this:
>>>
>>> 1. I have 70 directories with names like oai_citeseerXYZ/
>>> 2. Under each of the above directories, I have 10,000 xml files, each having the above xml data.
>>> 3. The program does the following:
>>>
>>> File dir = new File(dirName);
>>> String[] children = dir.list();
>>> if (children == null) {
>>>     // Either dir does not exist or is not a directory
>>> } else {
>>>     for (int ii = 0; ii < children.length; ii++) {
>>>         // Get filename of file or directory
>>>         String file = children[ii];
>>>         //System.out.println("The name of file parsed now ==> " + file);
>>>         nl = ReadDump.getNodeList(filename + "/" + file, "metadata");
>>>         if (nl == null) {
>>>             //System.out.println("Error shouldn't be thrown ...");
>>>         }
>>>         // Get the metadata element tags from the xml file
>>>         ReadDump rd = new ReadDump();
>>>
>>>         // Get the extracted tags: Title, Identifier and Description
>>>         ArrayList alist_Title = rd.getElements(nl, "dc:title");
>>>         ArrayList alist_Descr = rd.getElements(nl, "dc:description");
>>>
>>>         // Create an index under DIR
>>>         IndexWriter writer = new IndexWriter("./FINAL/", new StopStemmingAnalyzer(), false);
>>>         Document doc = new Document();
>>>
>>>         // Get the ArrayList elements and add them as fields to the doc
>>>         for (int k = 0; k < alist_Title.size(); k++) {
>>>             doc.add(new Field("Title", alist_Title.get(k).toString(),
>>>                     Field.Store.YES, Field.Index.UN_TOKENIZED));
>>>         }
>>>         for (int k = 0; k < alist_Descr.size(); k++) {
>>>             doc.add(new Field("Description", alist_Descr.get(k).toString(),
>>>                     Field.Store.YES, Field.Index.UN_TOKENIZED));
>>>         }
>>>
>>>         // Add the document created out of those fields to the IndexWriter,
>>>         // which will create an index
>>>         writer.addDocument(doc);
>>>         writer.optimize();
>>>         writer.close();
>>>     }
>>> }
>>>
>>> This is the main file which does the indexing.
>>>
>>> Hope this will give you an idea.
>>>
>>>
>>> Lokeya: Can you confirm my supposition? And I'd still post the code
>>> Grant requested if you can.....
>>>
>>> So, you're talking about indexing 10,000 xml files in 2-3 hours,
>>> 8 minutes or so of which is spent reading/parsing, right? It'll be
>>> important to know how much data you're indexing and how, so
>>> the code snippet is doubly important....
>>>
>>> Erick
>>>
>>> On 3/18/07, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
>>>>
>>>> Can you post the relevant indexing code? Are you doing things like
>>>> optimizing after every file? Both the parsing and the indexing sound
>>>> really long. How big are these files?
>>>>
>>>> Also, I assume your machine is at least somewhat current, right?
>>>>
>>>> On Mar 18, 2007, at 1:00 AM, Lokeya wrote:
>>>>
>>>>>
>>>>> Thanks for your reply. I tried to measure the I/O and parsing time
>>>>> separately from the indexing time. I observed that I/O and parsing
>>>>> of 70 files altogether take 80 minutes, whereas when I combine this
>>>>> with indexing, a single metadata file takes nearly 2 to 3 hours. So
>>>>> it looks like the IndexWriter takes the time, especially when we
>>>>> are appending to the index file.
>>>>>
>>>>> So what is the best approach to handle this?
>>>>>
>>>>> Thanks in Advance.
>>>>>
>>>>>
>>>>> Erick Erickson wrote:
>>>>>>
>>>>>> See below...
>>>>>>
>>>>>> On 3/17/07, Lokeya <[EMAIL PROTECTED]> wrote:
>>>>>>>
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I am trying to index the content from XML files which are
>>>>>>> basically the metadata collected from a website which has a huge
>>>>>>> collection of documents. This metadata xml has control characters
>>>>>>> which cause errors while trying to parse using the DOM parser.
>>>>>>> I tried to use encoding = UTF-8, but it looks like it doesn't
>>>>>>> cover all the Unicode characters and I get an error. Also, when I
>>>>>>> tried to use UTF-16, I got "Prolog content not allowed here". So
>>>>>>> my guess is that there is no encoding which is going to cover
>>>>>>> almost all Unicode characters. So I tried to split my metadata
>>>>>>> files into small files, processing only the records which don't
>>>>>>> throw a parsing error.
>>>>>>>
>>>>>>> But by breaking each metadata file into smaller files I get 10,000
>>>>>>> xml files per metadata file. I have 70 metadata files, so
>>>>>>> altogether it becomes 700,000 files. Processing them individually
>>>>>>> takes a really long time using Lucene; my guess is that the I/O is
>>>>>>> time consuming: opening every small xml file, loading it into the
>>>>>>> DOM, extracting the required data and processing it.
>>>>>>
>>>>>> So why don't you measure and find out before trying to make the
>>>>>> indexing step more efficient? You simply cannot optimize without
>>>>>> knowing where you're spending your time. I can't tell you how often
>>>>>> I've been wrong about "why my program was slow" <G>.
>>>>>>
>>>>>> In this case, it should be really simple. Just comment out the part
>>>>>> where you index the data and run, say, one of your metadata files.
>>>>>> I suspect that Cheolgoo Kang's response is cogent, and you indeed
>>>>>> are spending your time parsing the XML. I further suspect that the
>>>>>> problem is not disk IO, but the time spent parsing. But until you
>>>>>> measure, you have no clue whether you should mess around with the
>>>>>> Lucene parameters, or find another parser, or just live with it.
>>>>>> Assuming that you comment out Lucene and things are still slow, the
>>>>>> next step would be to just read in each file and NOT parse it, to
>>>>>> figure out whether it's the IO or the parsing.
>>>>>>
>>>>>> Then you can worry about how to fix it.
>>>>>>
>>>>>> Best
>>>>>> Erick
>>>>>>
>>>>>>
>>>>>>> Qn 1: Any suggestion to get this indexing time reduced? It would
>>>>>>> be really great.
>>>>>>>
>>>>>>> Qn 2: Am I overlooking something in Lucene with respect to
>>>>>>> indexing?
>>>>>>>
>>>>>>> Right now 12 metadata files take nearly 10 hrs, which is really a
>>>>>>> long time.
>>>>>>>
>>>>>>> Help Appreciated.
>>>>>>>
>>>>>>> Much Thanks.
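Erick's experiment is easy to mechanize. A rough sketch (parseDom() and indexRecord() are hypothetical stand-ins for the DOM-parsing and Lucene code quoted above): run the same corpus through each phase in isolation and compare the elapsed times.

import java.io.File;
import java.io.FileInputStream;

// Sketch: phase 1 = raw I/O only, phase 2 = I/O + parsing,
// phase 3 = the full pipeline including indexing.
static long timePhase(File[] files, int phase) throws Exception {
    long start = System.currentTimeMillis();
    for (int i = 0; i < files.length; i++) {
        byte[] raw = new byte[(int) files[i].length()];
        FileInputStream in = new FileInputStream(files[i]);
        in.read(raw); // a single read is fine for a rough timing
        in.close();
        if (phase >= 2) parseDom(raw);    // add XML parsing
        if (phase >= 3) indexRecord(raw); // add indexing
    }
    return System.currentTimeMillis() - start;
}

// Stubs standing in for the real code quoted above.
static void parseDom(byte[] raw) { /* DOM parsing goes here */ }
static void indexRecord(byte[] raw) { /* Lucene indexing goes here */ }

If phase 1 and phase 2 come out close but phase 3 dominates, the writer-per-document pattern is the culprit; if phase 2 already eats the 80 minutes, the DOM parser is, and the SAX suggestion above applies.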
>>>>
>>>> --------------------------
>>>> Grant Ingersoll
>>>> Center for Natural Language Processing
>>>> http://www.cnlp.org
>>>>
>>>> Read the Lucene Java FAQ at http://wiki.apache.org/jakarta-lucene/LuceneFAQ
>>>>
>
> ------------------------------------------------------
> Grant Ingersoll
> http://www.grantingersoll.com/
> http://lucene.grantingersoll.com
> http://www.paperoftheweek.com/
>

--
View this message in context: http://www.nabble.com/Issue-while-parsing-XML-files-due-to-control-characters%2C-help-appreciated.-tf3418085.html#a9542471
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]