Yep, I did that, and now my code looks as follows. The time taken for indexing one file is now => Elapsed Time in Minutes :: 0.3531, which is really great. But after processing 4 dump files (which means 40,000 small XMLs), I get:
caught a class java.io.IOException 40114
with message: Lock obtain timed out: Lock@/tmp/lucene-e36d478e46e88f594d57a03c10ee0b3b-write.lock

This is the new issue now. What could be the reason? I am surprised, because every time I only write to the index under ./LUCENE/ and do nothing else with the index (of course, to avoid exactly such synchronization issues!).

for (int i = 0; i < children1.length; i++) {
    // Get filename of file or directory
    String filename = children1[i];
    if (!filename.startsWith("oai_citeseer")) {
        continue;
    }
    String dirName = filename;
    System.out.println("File => " + filename);
    NodeList nl = null;

    // Testing the creation of the index
    try {
        File dir = new File(dirName);
        String[] children = dir.list();
        if (children == null) {
            // Either dir does not exist or is not a directory
        } else {
            ArrayList alist_Title = new ArrayList(children.length);
            ArrayList alist_Descr = new ArrayList(children.length);
            System.out.println("Size of Array List : " + alist_Title.size());
            String title = null;
            String descr = null;
            long startTime = System.currentTimeMillis();
            System.out.println(dir + " start time ==> " + System.currentTimeMillis());

            for (int ii = 0; ii < children.length; ii++) {
                // Get filename of file or directory
                String file = children[ii];
                System.out.println(" Name of Dir : Record ~~~~~~~ " + dirName + " : " + file);
                //System.out.println("The name of file parsed now ==> " + file);

                // Get the value of the record's metadata tag
                nl = ValueGetter.getNodeList(filename + "/" + file, "metadata");
                if (nl == null) {
                    System.out.println("Error shouldn't be thrown ...");
                    alist_Title.add(ii, "title");
                    alist_Descr.add(ii, "descr");
                    continue;
                }

                // Get the metadata elements (title and description tags) from the dump file
                ValueGetter vg = new ValueGetter();

                // Get the extracted tags: Title, Identifier and Description
                try {
                    title = vg.getNodeVal(nl, "dc:title");
                    alist_Title.add(ii, title);
                    descr = vg.getNodeVal(nl, "dc:description");
                    alist_Descr.add(ii, descr);
                } catch (Exception ex) {
                    ex.printStackTrace();
                    System.out.println("Excep ==> " + ex);
                }
            } // End of for

            // Create an index under LUCENE
            IndexWriter writer = new IndexWriter("./LUCENE", new StopStemmingAnalyzer(), false);
            Document doc = new Document();

            // Get the ArrayList elements and add them as fields to the doc
            for (int k = 0; k < alist_Title.size(); k++) {
                doc.add(new Field("Title", alist_Title.get(k).toString(),
                        Field.Store.YES, Field.Index.UN_TOKENIZED));
            }
            for (int k = 0; k < alist_Descr.size(); k++) {
                doc.add(new Field("Description", alist_Descr.get(k).toString(),
                        Field.Store.YES, Field.Index.UN_TOKENIZED));
            }

            // Add the document created out of those fields to the IndexWriter,
            // which will create an index
            writer.addDocument(doc);
            writer.optimize();
            writer.close();

            long elapsedTimeMillis = System.currentTimeMillis() - startTime;
            System.out.println("Elapsed Time for " + dirName + " :: " + elapsedTimeMillis);
            float elapsedTimeMin = elapsedTimeMillis / (60 * 1000F);
            System.out.println("Elapsed Time in Minutes :: " + elapsedTimeMin);
        } // End of else
    } // End of try
    catch (Exception ee) {
        ee.printStackTrace();
        System.out.println("caught a " + ee.getClass() + "\n with message: " + ee.getMessage());
    }
    System.out.println("Total Record ==> " + total_records);
} // End of for


Grant Ingersoll-5 wrote:
>
> Move index writer creation, optimization and closure outside of your
> loop. I would also use a SAX parser. Take a look at the demo code
> to see an example of indexing.
>
> Cheers,
> Grant
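A minimal sketch of that restructuring, assuming the same StopStemmingAnalyzer and the document-building code from the post above: the writer is opened once before the directory loop and closed once in a finally block, so the write lock is acquired a single time and released even when an exception is thrown. In the code above, any exception between the IndexWriter constructor and writer.close() leaves the lock file in /tmp behind, which is one plausible cause of the Lock obtain timed out error.

import java.io.File;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;

// Sketch only: one IndexWriter for the whole run.
void indexAllDumps(String[] children1) throws Exception {
    // 'false' appends to the existing ./LUCENE index, as in the code above.
    IndexWriter writer = new IndexWriter("./LUCENE", new StopStemmingAnalyzer(), false);
    try {
        for (int i = 0; i < children1.length; i++) {
            if (!children1[i].startsWith("oai_citeseer")) {
                continue;
            }
            File dir = new File(children1[i]);
            String[] children = dir.list();
            if (children == null) {
                continue; // not a directory
            }
            Document doc = new Document();
            // ... build the Title/Description fields exactly as above ...
            writer.addDocument(doc);
        }
        writer.optimize(); // once, after all 70 directories are indexed
    } finally {
        writer.close(); // always releases the write lock
    }
}

Note also that if an earlier run died without closing its writer, the stale /tmp/lucene-*-write.lock file has to be removed (or cleared with IndexReader.unlock on the directory) before any new writer can be opened.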
>
> On Mar 18, 2007, at 12:31 PM, Lokeya wrote:
>
>>
>> Erick Erickson wrote:
>>>
>>> Grant:
>>>
>>> I think that "Parsing 70 files totally takes 80 minutes" really
>>> means parsing 70 metadata files containing 10,000 XML files each.....
>>>
>>> One Metadata File is split into 10,000 XML files which looks as below:
>>>
>>> <root>
>>>   <record>
>>>     <header>
>>>       <identifier>oai:CiteSeerPSU:1</identifier>
>>>       <datestamp>1993-08-11</datestamp>
>>>       <setSpec>CiteSeerPSUset</setSpec>
>>>     </header>
>>>     <metadata>
>>>       <oai_citeseer:oai_citeseer
>>>           xmlns:oai_citeseer="http://copper.ist.psu.edu/oai/oai_citeseer/"
>>>           xmlns:dc="http://purl.org/dc/elements/1.1/"
>>>           xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
>>>           xsi:schemaLocation="http://copper.ist.psu.edu/oai/oai_citeseer/ http://copper.ist.psu.edu/oai/oai_citeseer.xsd ">
>>>         <dc:title>36 Problems for Semantic Interpretation</dc:title>
>>>         <oai_citeseer:author name="Gabriele Scheler">
>>>           <address>80290 Munchen , Germany</address>
>>>           <affiliation>Institut fur Informatik; Technische Universitat Munchen</affiliation>
>>>         </oai_citeseer:author>
>>>         <dc:subject>Gabriele Scheler 36 Problems for Semantic Interpretation</dc:subject>
>>>         <dc:description>This paper presents a collection of problems for natural
>>>           language analysis derived mainly from theoretical linguistics. Most of
>>>           these problems present major obstacles for computational systems of
>>>           language interpretation. The set of given sentences can easily be scaled up
>>>           by introducing more examples per problem. The construction of computational
>>>           systems could benefit from such a collection, either using it directly for
>>>           training and testing or as a set of benchmarks to qualify the performance
>>>           of a NLP system. 1 Introduction The main part of this paper consists of a
>>>           collection of problems for semantic analysis of natural language. The
>>>           problems are arranged in the following way: example sentences; concise
>>>           description of the problem; keyword for the type of problem. The sources
>>>           (first appearance in print) of the sentences have been left out, because
>>>           they are sometimes hard to track and will usually not be of much use, as
>>>           they indicate a starting-point of discussion only. The keywords howeve...
>>>         </dc:description>
>>>         <dc:contributor>The Pennsylvania State University CiteSeer Archives</dc:contributor>
>>>         <dc:publisher>unknown</dc:publisher>
>>>         <dc:date>1993-08-11</dc:date>
>>>         <dc:format>ps</dc:format>
>>>         <dc:identifier>http://citeseer.ist.psu.edu/1.html</dc:identifier>
>>>         <dc:source>ftp://flop.informatik.tu-muenchen.de/pub/fki/fki-179-93.ps.gz</dc:source>
>>>         <dc:language>en</dc:language>
>>>         <dc:rights>unrestricted</dc:rights>
>>>       </oai_citeseer:oai_citeseer>
>>>     </metadata>
>>>   </record>
>>> </root>
>>>
>>> From the above I will extract the Title and the Description tags to index.
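Since the goal is only the Title and Description text from this fixed record format, Grant's SAX suggestion fits well. A hedged sketch of a handler (the class name and field handling are illustrative, not from the thread): it collects the character data of dc:title and dc:description in one streaming pass, without building a DOM for each of the 10,000 small files.

import java.io.File;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Illustrative handler: grabs the text of dc:title and dc:description.
public class CiteseerHandler extends DefaultHandler {
    private StringBuffer buf = new StringBuffer();
    private boolean capture = false;
    public String title;
    public String description;

    public void startElement(String uri, String localName, String qName,
                             Attributes attrs) {
        if ("dc:title".equals(qName) || "dc:description".equals(qName)) {
            capture = true;
            buf.setLength(0); // start collecting fresh text
        }
    }

    public void characters(char[] ch, int start, int length) {
        if (capture) {
            buf.append(ch, start, length);
        }
    }

    public void endElement(String uri, String localName, String qName) {
        if ("dc:title".equals(qName)) {
            title = buf.toString();
            capture = false;
        } else if ("dc:description".equals(qName)) {
            description = buf.toString();
            capture = false;
        }
    }
}

// Usage sketch, per file:
//   SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
//   CiteseerHandler h = new CiteseerHandler();
//   parser.parse(new File(dirName + "/" + file), h);
//   // then add h.title / h.description as Fields, as in the code above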
>>>
>>> Code to do this:
>>>
>>> 1. I have 70 directories with names like oai_citeseerXYZ/
>>> 2. Under each of the above directories, I have 10,000 xml files, each having the above xml data.
>>> 3. The program does the following:
>>>
>>> File dir = new File(dirName);
>>> String[] children = dir.list();
>>> if (children == null) {
>>>     // Either dir does not exist or is not a directory
>>> } else {
>>>     for (int ii = 0; ii < children.length; ii++) {
>>>         // Get filename of file or directory
>>>         String file = children[ii];
>>>         //System.out.println("The name of file parsed now ==> " + file);
>>>         nl = ReadDump.getNodeList(filename + "/" + file, "metadata");
>>>         if (nl == null) {
>>>             //System.out.println("Error shouldn't be thrown ...");
>>>         }
>>>         // Get the metadata element tags from the xml file
>>>         ReadDump rd = new ReadDump();
>>>
>>>         // Get the extracted tags: Title, Identifier and Description
>>>         ArrayList alist_Title = rd.getElements(nl, "dc:title");
>>>         ArrayList alist_Descr = rd.getElements(nl, "dc:description");
>>>
>>>         // Create an index under DIR
>>>         IndexWriter writer = new IndexWriter("./FINAL/", new StopStemmingAnalyzer(), false);
>>>         Document doc = new Document();
>>>
>>>         // Get the ArrayList elements and add them as fields to the doc
>>>         for (int k = 0; k < alist_Title.size(); k++) {
>>>             doc.add(new Field("Title", alist_Title.get(k).toString(),
>>>                     Field.Store.YES, Field.Index.UN_TOKENIZED));
>>>         }
>>>         for (int k = 0; k < alist_Descr.size(); k++) {
>>>             doc.add(new Field("Description", alist_Descr.get(k).toString(),
>>>                     Field.Store.YES, Field.Index.UN_TOKENIZED));
>>>         }
>>>
>>>         // Add the document created out of those fields to the IndexWriter,
>>>         // which will create an index
>>>         writer.addDocument(doc);
>>>         writer.optimize();
>>>         writer.close();
>>>     }
>>> }
>>>
>>> This is the main file which does the indexing.
>>>
>>> Hope this will give you an idea.
>>>
>>>
>>> Lokeya: Can you confirm my supposition? And I'd still post the code
>>> Grant requested if you can.....
>>>
>>> So, you're talking about indexing 10,000 xml files in 2-3 hours,
>>> 8 minutes or so of which is spent reading/parsing, right? It'll be
>>> important to know how much data you're indexing and how, so
>>> the code snippet is doubly important....
>>>
>>> Erick
>>>
>>> On 3/18/07, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
>>>>
>>>> Can you post the relevant indexing code? Are you doing things like
>>>> optimizing after every file? Both the parsing and the indexing sound
>>>> really long. How big are these files?
>>>>
>>>> Also, I assume your machine is at least somewhat current, right?
>>>>
>>>> On Mar 18, 2007, at 1:00 AM, Lokeya wrote:
>>>>
>>>>>
>>>>> Thanks for your reply. I tried to measure the I/O and parsing time
>>>>> separately from the indexing time. I observed that I/O and parsing
>>>>> of 70 files altogether take 80 minutes, whereas when I combine this
>>>>> with indexing, a single metadata file takes nearly 2 to 3 hours. So
>>>>> it looks like the IndexWriter takes the time, especially when we
>>>>> are appending to the index file.
>>>>>
>>>>> So what is the best approach to handle this?
>>>>>
>>>>> Thanks in Advance.
>>>>>
>>>>>
>>>>> Erick Erickson wrote:
>>>>>>
>>>>>> See below...
>>>>>>
>>>>>> On 3/17/07, Lokeya <[EMAIL PROTECTED]> wrote:
>>>>>>>
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I am trying to index the content from XML files which are
>>>>>>> basically the metadata collected from a website which has a huge
>>>>>>> collection of documents. This metadata xml has control characters
>>>>>>> which cause errors while trying to parse using the DOM parser.
>>>>>>> I tried to use encoding = UTF-8, but it looks like it doesn't
>>>>>>> cover all the Unicode characters and I get an error. Also, when I
>>>>>>> tried to use UTF-16, I got "Prolog content not allowed here". So
>>>>>>> my guess is that there is no encoding which is going to cover
>>>>>>> almost all Unicode characters. So I tried to split my metadata
>>>>>>> files into small files, processing only the records which don't
>>>>>>> throw a parsing error.
>>>>>>>
>>>>>>> But by breaking each metadata file into smaller files I get 10,000
>>>>>>> xml files per metadata file. I have 70 metadata files, so
>>>>>>> altogether it becomes 700,000 files. Processing them individually
>>>>>>> takes a really long time using Lucene; my guess is that the I/O is
>>>>>>> time consuming: opening every small xml file, loading it into the
>>>>>>> DOM, extracting the required data and processing it.
>>>>>>
>>>>>> So why don't you measure and find out before trying to make the
>>>>>> indexing step more efficient? You simply cannot optimize without
>>>>>> knowing where you're spending your time. I can't tell you how often
>>>>>> I've been wrong about "why my program was slow" <G>.
>>>>>>
>>>>>> In this case, it should be really simple. Just comment out the part
>>>>>> where you index the data and run, say, one of your metadata files.
>>>>>> I suspect that Cheolgoo Kang's response is cogent, and you indeed
>>>>>> are spending your time parsing the XML. I further suspect that the
>>>>>> problem is not disk IO, but the time spent parsing. But until you
>>>>>> measure, you have no clue whether you should mess around with the
>>>>>> Lucene parameters, or find another parser, or just live with it.
>>>>>> Assuming that you comment out Lucene and things are still slow, the
>>>>>> next step would be to just read in each file and NOT parse it, to
>>>>>> figure out whether it's the IO or the parsing.
>>>>>>
>>>>>> Then you can worry about how to fix it.
>>>>>>
>>>>>> Best
>>>>>> Erick
>>>>>>
>>>>>>
>>>>>>> Qn 1: Any suggestion to get this indexing time reduced? It would
>>>>>>> be really great.
>>>>>>>
>>>>>>> Qn 2: Am I overlooking something in Lucene with respect to
>>>>>>> indexing?
>>>>>>>
>>>>>>> Right now 12 metadata files take nearly 10 hrs, which is really a
>>>>>>> long time.
>>>>>>>
>>>>>>> Help Appreciated.
>>>>>>>
>>>>>>> Much Thanks.
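Erick's experiment is easy to mechanize. A rough sketch (parseDom() and indexRecord() are hypothetical stand-ins for the DOM-parsing and Lucene code quoted above): run the same corpus through each phase in isolation and compare the elapsed times.

import java.io.File;
import java.io.FileInputStream;

// Sketch: phase 1 = raw I/O only, phase 2 = I/O + parsing,
// phase 3 = the full pipeline including indexing.
static long timePhase(File[] files, int phase) throws Exception {
    long start = System.currentTimeMillis();
    for (int i = 0; i < files.length; i++) {
        byte[] raw = new byte[(int) files[i].length()];
        FileInputStream in = new FileInputStream(files[i]);
        in.read(raw); // a single read is fine for a rough timing
        in.close();
        if (phase >= 2) parseDom(raw);    // add XML parsing
        if (phase >= 3) indexRecord(raw); // add indexing
    }
    return System.currentTimeMillis() - start;
}

// Stubs standing in for the real code quoted above.
static void parseDom(byte[] raw) { /* DOM parsing goes here */ }
static void indexRecord(byte[] raw) { /* Lucene indexing goes here */ }

If phase 1 and phase 2 come out close but phase 3 dominates, the writer-per-document pattern is the culprit; if phase 2 already eats the 80 minutes, the DOM parser is, and the SAX suggestion above applies.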
>>>>
>>>> --------------------------
>>>> Grant Ingersoll
>>>> Center for Natural Language Processing
>>>> http://www.cnlp.org
>>>>
>>>> Read the Lucene Java FAQ at http://wiki.apache.org/jakarta-lucene/LuceneFAQ
>>>>
>
> ------------------------------------------------------
> Grant Ingersoll
> http://www.grantingersoll.com/
> http://lucene.grantingersoll.com
> http://www.paperoftheweek.com/
>

--
View this message in context: http://www.nabble.com/Issue-while-parsing-XML-files-due-to-control-characters%2C-help-appreciated.-tf3418085.html#a9542471
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]