Yep, I did that, and now my code looks as follows.
The time taken for indexing one file is now => Elapsed Time in Minutes ::
0.3531, which is really great, but after processing 4 dump files (which
means 40,000 small XMLs), I get:

caught a class java.io.IOException
40114 with message: Lock obtain timed out:
Lock@/tmp/lucene-e36d478e46e88f594d57a03c10ee0b3b-write.lock

This is the new issue now. What could be the reason? I am surprised,
because I am only ever writing to the index under ./LUCENE/ and not doing
anything else with it (of course, to avoid exactly such synchronization
issues!).
for (int i = 0; i < children1.length; i++)
{
    // Get filename of file or directory
    String filename = children1[i];
    if (!filename.startsWith("oai_citeseer"))
    {
        continue;
    }
    String dirName = filename;
    System.out.println("File => " + filename);
    NodeList nl = null;
    // Testing the creation of the index
    try
    {
        File dir = new File(dirName);
        String[] children = dir.list();
        if (children == null) {
            // Either dir does not exist or is not a directory
        }
        else
        {
            ArrayList alist_Title = new ArrayList(children.length);
            ArrayList alist_Descr = new ArrayList(children.length);
            System.out.println("Size of Array List : " + alist_Title.size());
            String title = null;
            String descr = null;
            long startTime = System.currentTimeMillis();
            System.out.println(dir + " start time ==> " + System.currentTimeMillis());
            for (int ii = 0; ii < children.length; ii++)
            {
                // Get filename of file or directory
                String file = children[ii];
                System.out.println("Name of Dir : Record ~~~~~~~ " + dirName + " : " + file);
                //System.out.println("The name of file parsed now ==> " + file);
                // Get the value of the record's metadata tag
                nl = ValueGetter.getNodeList(filename + "/" + file, "metadata");
                if (nl == null)
                {
                    System.out.println("Error shouldn't be thrown ...");
                    alist_Title.add(ii, "title");
                    alist_Descr.add(ii, "descr");
                    continue;
                }
                // Get the metadata element (title and description tags) from the dump file
                ValueGetter vg = new ValueGetter();
                // Get the extracted tags Title, Identifier and Description
                try
                {
                    title = vg.getNodeVal(nl, "dc:title");
                    alist_Title.add(ii, title);
                    descr = vg.getNodeVal(nl, "dc:description");
                    alist_Descr.add(ii, descr);
                }
                catch (Exception ex)
                {
                    ex.printStackTrace();
                    System.out.println("Excep ==> " + ex);
                }
            } // End of for
            // Create an index under LUCENE
            IndexWriter writer = new IndexWriter("./LUCENE", new StopStemmingAnalyzer(), false);
            Document doc = new Document();
            // Get ArrayList elements and add them as fields to doc
            for (int k = 0; k < alist_Title.size(); k++)
            {
                doc.add(new Field("Title", alist_Title.get(k).toString(),
                        Field.Store.YES, Field.Index.UN_TOKENIZED));
            }
            for (int k = 0; k < alist_Descr.size(); k++)
            {
                doc.add(new Field("Description", alist_Descr.get(k).toString(),
                        Field.Store.YES, Field.Index.UN_TOKENIZED));
            }
            // Add the document created out of those fields to the IndexWriter,
            // which will create the index
            writer.addDocument(doc);
            writer.optimize();
            writer.close();
            long elapsedTimeMillis = System.currentTimeMillis() - startTime;
            System.out.println("Elapsed Time for " + dirName + " :: " + elapsedTimeMillis);
            float elapsedTimeMin = elapsedTimeMillis / (60 * 1000F);
            System.out.println("Elapsed Time in Minutes :: " + elapsedTimeMin);
        } // End of else
    } // End of try
    catch (Exception ee)
    {
        ee.printStackTrace();
        System.out.println("caught a " + ee.getClass() + "\n with message: " + ee.getMessage());
    }
    System.out.println("Total Record ==> " + total_records);
} // End of for
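
For reference, a minimal sketch of the lifecycle Grant suggests below:
the IndexWriter is created once before the loop and closed in a finally
block, so the write.lock is always released even if one directory fails.
This assumes the same Lucene 2.x constructor and StopStemmingAnalyzer as
above; the create flag is set to true on the assumption that each run
builds a fresh index:

IndexWriter writer = new IndexWriter("./LUCENE", new StopStemmingAnalyzer(), true);
try
{
    for (int i = 0; i < children1.length; i++)
    {
        // ... parse one dump directory and build its Document, as above ...
        Document doc = new Document();
        // doc.add(new Field(...)) calls go here
        writer.addDocument(doc);
    }
    // Optimize once at the end of the whole run, not once per directory.
    writer.optimize();
}
finally
{
    // Closing in finally prevents a stale /tmp/lucene-*-write.lock
    // when an exception escapes the loop.
    writer.close();
}
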
Grant Ingersoll wrote:
>
> Move index writer creation, optimization and closure outside of your
> loop. I would also use a SAX parser. Take a look at the demo code
> to see an example of indexing.
>
> Cheers,
> Grant
>
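
A rough illustration of the SAX approach Grant mentions, in case it is
useful: a handler that streams each file once and collects only the
dc:title and dc:description text, instead of building a whole DOM per
file. The class name and wiring below are illustrative, not taken from
the Lucene demo:

import java.io.File;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Illustrative handler: remembers the text of dc:title and dc:description.
public class CiteSeerHandler extends DefaultHandler {
    private final StringBuffer buf = new StringBuffer();
    private boolean capture = false;
    public String title = "";
    public String descr = "";

    public void startElement(String uri, String local, String qName, Attributes atts) {
        capture = qName.equals("dc:title") || qName.equals("dc:description");
        buf.setLength(0);
    }

    public void characters(char[] ch, int start, int length) {
        if (capture) buf.append(ch, start, length);
    }

    public void endElement(String uri, String local, String qName) {
        if (qName.equals("dc:title")) title = buf.toString();
        else if (qName.equals("dc:description")) descr = buf.toString();
        capture = false;
    }
}

// Usage, per file:
//   SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
//   CiteSeerHandler handler = new CiteSeerHandler();
//   parser.parse(new File(dirName, file), handler);
//   ... then add handler.title and handler.descr to the Document ...

Since SAX streams the file in one pass, there is no per-file DOM tree to
build and garbage-collect, which is usually where per-file DOM parsing
loses its time.
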
> On Mar 18, 2007, at 12:31 PM, Lokeya wrote:
>
>>
>>
>> Erick Erickson wrote:
>>>
>>> Grant:
>>>
>>> I think that "Parsing 70 files totally takes 80 minutes" really
>>> means parsing 70 metadata files containing 10,000 XML
>>> files each.....
>>>
>>> One metadata file is split into 10,000 XML files, each of which
>>> looks as below:
>>>
>>> <root>
>>>  <record>
>>>   <header>
>>>    <identifier>oai:CiteSeerPSU:1</identifier>
>>>    <datestamp>1993-08-11</datestamp>
>>>    <setSpec>CiteSeerPSUset</setSpec>
>>>   </header>
>>>   <metadata>
>>>    <oai_citeseer:oai_citeseer
>>>      xmlns:oai_citeseer="http://copper.ist.psu.edu/oai/oai_citeseer/"
>>>      xmlns:dc="http://purl.org/dc/elements/1.1/"
>>>      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
>>>      xsi:schemaLocation="http://copper.ist.psu.edu/oai/oai_citeseer/
>>>      http://copper.ist.psu.edu/oai/oai_citeseer.xsd ">
>>>     <dc:title>36 Problems for Semantic Interpretation</dc:title>
>>>     <oai_citeseer:author name="Gabriele Scheler">
>>>      <address>80290 Munchen , Germany</address>
>>>      <affiliation>Institut fur Informatik; Technische Universitat
>>>      Munchen</affiliation>
>>>     </oai_citeseer:author>
>>>     <dc:subject>Gabriele Scheler 36 Problems for Semantic
>>>     Interpretation</dc:subject>
>>>     <dc:description>This paper presents a collection of problems for
>>>     natural language analysisderived mainly from theoretical
>>>     linguistics. Most of these problemspresent major obstacles for
>>>     computational systems of language interpretation.The set of given
>>>     sentences can easily be scaled up by introducing moreexamples per
>>>     problem. The construction of computational systems couldbenefit
>>>     from such a collection, either using it directly for training
>>>     andtesting or as a set of benchmarks to qualify the performance
>>>     of a NLPsystem.1 IntroductionThe main part of this paper consists
>>>     of a collection of problems for semanticanalysis of natural
>>>     language. The problems are arranged in the following way:example
>>>     sentencesconcise description of the problemkeyword for the type
>>>     of problemThe sources (first appearance in print) of the
>>>     sentences have been left out,because they are sometimes hard to
>>>     track and will usually not be of much use,as they indicate a
>>>     starting-point of discussion only. The keywords howeve...
>>>     </dc:description>
>>>     <dc:contributor>The Pennsylvania State University CiteSeer
>>>     Archives</dc:contributor>
>>>     <dc:publisher>unknown</dc:publisher>
>>>     <dc:date>1993-08-11</dc:date>
>>>     <dc:format>ps</dc:format>
>>>     <dc:identifier>http://citeseer.ist.psu.edu/1.html</dc:identifier>
>>>     <dc:source>ftp://flop.informatik.tu-muenchen.de/pub/fki/fki-179-93.ps.gz</dc:source>
>>>     <dc:language>en</dc:language>
>>>     <dc:rights>unrestricted</dc:rights>
>>>    </oai_citeseer:oai_citeseer>
>>>   </metadata>
>>>  </record>
>>> </root>
>>>
>>>
>>> From the above I will extract the Title and the Description tags
>>> to index.
>>>
>>> Code to do this:
>>>
>>> 1. I have 70 directories with names like oai_citeseerXYZ/
>>> 2. Under each of the above directories, I have 10,000 XML files,
>>> each containing the above XML data.
>>> 3. The program does the following:
>>>
>>> File dir = new File(dirName);
>>> String[] children = dir.list();
>>> if (children == null) {
>>>     // Either dir does not exist or is not a directory
>>> }
>>> else
>>> {
>>>     for (int ii = 0; ii < children.length; ii++)
>>>     {
>>>         // Get filename of file or directory
>>>         String file = children[ii];
>>>         //System.out.println("The name of file parsed now ==> " + file);
>>>         nl = ReadDump.getNodeList(filename + "/" + file, "metadata");
>>>         if (nl == null)
>>>         {
>>>             //System.out.println("Error shouldn't be thrown ...");
>>>         }
>>>         // Get the metadata element tags from the xml file
>>>         ReadDump rd = new ReadDump();
>>>
>>>         // Get the extracted tags Title, Identifier and Description
>>>         ArrayList alist_Title = rd.getElements(nl, "dc:title");
>>>         ArrayList alist_Descr = rd.getElements(nl, "dc:description");
>>>
>>>         // Create an index under DIR
>>>         IndexWriter writer = new IndexWriter("./FINAL/", new StopStemmingAnalyzer(), false);
>>>         Document doc = new Document();
>>>
>>>         // Get ArrayList elements and add them as fields to doc
>>>         for (int k = 0; k < alist_Title.size(); k++)
>>>         {
>>>             doc.add(new Field("Title", alist_Title.get(k).toString(),
>>>                     Field.Store.YES, Field.Index.UN_TOKENIZED));
>>>         }
>>>
>>>         for (int k = 0; k < alist_Descr.size(); k++)
>>>         {
>>>             doc.add(new Field("Description", alist_Descr.get(k).toString(),
>>>                     Field.Store.YES, Field.Index.UN_TOKENIZED));
>>>         }
>>>
>>>         // Add the document created out of those fields to the
>>>         // IndexWriter, which will create the index
>>>         writer.addDocument(doc);
>>>         writer.optimize();
>>>         writer.close();
>>>     }
>>>
>>>
>>> This is the main file which does indexing.
>>>
>>> Hope this will give you an idea.
>>>
>>>
>>> Lokeya:
>>> Can you confirm my supposition? And I'd still post the code
>>> Grant requested if you can.....
>>>
>>> So, you're talking about indexing 10,000 xml files in 2-3 hours, of
>>> which 8 minutes or so is spent reading/parsing, right? It'll be
>>> important to know how much data you're indexing and how, so the
>>> code snippet is doubly important....
>>>
>>> Erick
>>>
>>> On 3/18/07, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
>>>>
>>>> Can you post the relevant indexing code? Are you doing things like
>>>> optimizing after every file? Both the parsing and the indexing
>>>> sound
>>>> really long. How big are these files?
>>>>
>>>> Also, I assume your machine is at least somewhat current, right?
>>>>
>>>> On Mar 18, 2007, at 1:00 AM, Lokeya wrote:
>>>>
>>>>>
>>>>> Thanks for your reply. I tried to measure the I/O and parsing time
>>>>> separately from the indexing time. I observed that I/O and parsing
>>>>> for all 70 files takes 80 minutes in total, whereas when I combine
>>>>> this with indexing, a single metadata file takes nearly 2 to 3
>>>>> hours. So it looks like the IndexWriter is what takes the time,
>>>>> and this happens especially when appending to an existing index.
>>>>>
>>>>> So what is the best approach to handle this?
>>>>>
>>>>> Thanks in Advance.
>>>>>
>>>>>
>>>>> Erick Erickson wrote:
>>>>>>
>>>>>> See below...
>>>>>>
>>>>>> On 3/17/07, Lokeya <[EMAIL PROTECTED]> wrote:
>>>>>>>
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I am trying to index the content of XML files which are
>>>>>>> basically the metadata collected from a website that has a huge
>>>>>>> collection of documents. This metadata XML has control
>>>>>>> characters which cause errors when trying to parse it with the
>>>>>>> DOM parser. I tried to use encoding = UTF-8, but it looks like
>>>>>>> it doesn't cover all the Unicode characters and I get an error.
>>>>>>> Also, when I tried to use UTF-16, I got "Prolog content not
>>>>>>> allowed here". So my guess is there is no encoding which is
>>>>>>> going to cover almost all Unicode characters. So I tried to
>>>>>>> split my metadata files into small files and to process only the
>>>>>>> records which don't throw a parsing error.
>>>>>>>
>>>>>>> But by breaking each metadata file into smaller files I get
>>>>>>> 10,000 XML files per metadata file. I have 70 metadata files, so
>>>>>>> altogether it becomes 700,000 files. Processing them
>>>>>>> individually takes a really long time using Lucene; my guess is
>>>>>>> that the I/O is time consuming: opening every small XML file,
>>>>>>> loading it into a DOM, extracting the required data and
>>>>>>> processing it.
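
As an aside, one way to sidestep the control-character problem without
splitting the dumps is to filter out the characters that XML 1.0 forbids
before handing the text to the parser; no choice of encoding will make
raw control characters legal XML. A rough sketch (the method name is
illustrative; supplementary characters outside the BMP are ignored for
brevity):

// Keep only characters the XML 1.0 spec allows:
// #x9, #xA, #xD, #x20-#xD7FF, #xE000-#xFFFD.
public static String stripInvalidXmlChars(String in) {
    StringBuffer out = new StringBuffer(in.length());
    for (int i = 0; i < in.length(); i++) {
        char c = in.charAt(i);
        if (c == 0x9 || c == 0xA || c == 0xD
                || (c >= 0x20 && c <= 0xD7FF)
                || (c >= 0xE000 && c <= 0xFFFD)) {
            out.append(c);
        }
    }
    return out.toString();
}
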
>>>>>>
>>>>>>
>>>>>>
>>>>>> So why don't you measure and find out before trying to make the
>>>>>> indexing
>>>>>> step more efficient? You simply cannot optimize without knowing
>>>>>> where
>>>>>> you're spending your time. I can't tell you how often I've been
>>>>>> wrong
>>>>>> about
>>>>>> "why my program was slow" <G>.
>>>>>>
>>>>>> In this case, it should be really simple. Just comment out the
>>>>>> part where
>>>>>> you index the data and run, say, one of your metadata files. I
>>>>>> suspect
>>>>>> that
>>>>>> Cheolgoo Kang's response is cogent, and you indeed are spending
>>>>>> your
>>>>>> time parsing the XML. I further suspect that the problem is not
>>>>>> disk IO,
>>>>>> but the time spent parsing. But until you measure, you have no
>>>>>> clue
>>>>>> whether you should mess around with the Lucene parameters, or find
>>>>>> another parser, or just live with it. Assuming that you
>>>>>> comment out
>>>>>> Lucene and things are still slow, the next step would be to just
>>>>>> read in
>>>>>> each file and NOT parse it to figure out whether it's the IO or
>>>>>> the
>>>>>> parsing.
>>>>>>
>>>>>> Then you can worry about how to fix it.
>>>>>>
>>>>>> Best
>>>>>> Erick
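
A concrete way to take that measurement, for what it's worth: time the
read, parse, and index phases separately and compare the totals. A
sketch using only standard APIs (assume it runs in a method declared to
throw Exception; the field-extraction step is elided):

long readMs = 0, parseMs = 0, indexMs = 0;
javax.xml.parsers.DocumentBuilder db =
    javax.xml.parsers.DocumentBuilderFactory.newInstance().newDocumentBuilder();
for (int ii = 0; ii < children.length; ii++) {
    java.io.File f = new java.io.File(dirName, children[ii]);

    long t0 = System.currentTimeMillis();
    byte[] raw = new byte[(int) f.length()];             // raw I/O only
    java.io.FileInputStream in = new java.io.FileInputStream(f);
    int off = 0;
    while (off < raw.length) {                           // read the whole file
        int n = in.read(raw, off, raw.length - off);
        if (n < 0) break;
        off += n;
    }
    in.close();
    long t1 = System.currentTimeMillis();

    org.w3c.dom.Document xml =
        db.parse(new java.io.ByteArrayInputStream(raw)); // DOM parse only
    long t2 = System.currentTimeMillis();

    // ... extract fields from xml and writer.addDocument(doc) ...  // Lucene only
    long t3 = System.currentTimeMillis();

    readMs += t1 - t0; parseMs += t2 - t1; indexMs += t3 - t2;
}
System.out.println("read=" + readMs + "ms, parse=" + parseMs
    + "ms, index=" + indexMs + "ms");
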
>>>>>>
>>>>>>
>>>>>>> Qn 1: Any suggestion to get this indexing time reduced? It
>>>>>>> would be really great.
>>>>>>>
>>>>>>> Qn 2: Am I overlooking something in Lucene with respect to
>>>>>>> indexing?
>>>>>>>
>>>>>>> Right now 12 metadata files take nearly 10 hrs, which is really
>>>>>>> a long time.
>>>>>>>
>>>>>>> Help Appreciated.
>>>>>>>
>>>>>>> Much Thanks.
>>>>>>
>>>>>>
>>>>>
>>>>
>>>> --------------------------
>>>> Grant Ingersoll
>>>> Center for Natural Language Processing
>>>> http://www.cnlp.org
>>>>
>>>> Read the Lucene Java FAQ at http://wiki.apache.org/jakarta-lucene/
>>>> LuceneFAQ
>>>>
>>>>
>>>>
>>>
>>>
>>
>
> ------------------------------------------------------
> Grant Ingersoll
> http://www.grantingersoll.com/
> http://lucene.grantingersoll.com
> http://www.paperoftheweek.com/
>