Michael Glavassevich wrote:
Perhaps this behaviour could be affected by my use of org.apache.xerces.util.XMLCatalogResolver?

How are you using it?
       XMLReader r = factory.newSAXParser().getXMLReader();
       r.setEntityResolver(entityResolver);

with catalog.xml containing:

<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
 <!-- US applications use "us-sequence-listing.dtd"
               grants use "us-sequence-listing-2004-03-09.dtd"
I've only found the latter at the USPTO, so we make the former refer to the latter.
 -->
<system systemId="us-sequence-listing.dtd" uri="dtd/us-sequence-listing-2004-03-09.dtd"/>

 <!-- works with apache xerces XMLCatalogResolver -->
<rewriteSystem systemIdStartString="c:\pap\dtds\entities\" rewritePrefix="dtd/entities/"/>
 <rewriteSystem systemIdStartString="c:\pap\dtds\" rewritePrefix="dtd/"/>
<rewriteSystem systemIdStartString=".\entities\" rewritePrefix="dtd/entities/"/>
 <rewriteSystem systemIdStartString=".\" rewritePrefix="dtd/"/>
 <rewriteSystem systemIdStartString="" rewritePrefix="dtd/"/>

</catalog>
I'm processing US patent application data from the USPTO using their DTDs:

    * us-patent-application-v41-2005-08-25.dtd
    * us-patent-application-v40-2004-12-02.dtd
    * us-sequence-listing-2004-03-09.dtd
    * pap-v16-2002-01-01.dtd
    * pap-v15-2001-01-31.dtd

From a quick perusal, these DTDs (including the external entities they reference) look very large. It's not just the entity declarations: just about everything in these DTDs that matches the Name production from the XML spec gets added to the SymbolTable. I assume each document you parse references only one of them. Perhaps it's the sum of the unique names from all of the DTDs that leads to your app running out of memory.
Yes, they are quite large; however, I still think there is a problem because:
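To get a rough feel for how many distinct names a DTD contributes, one could collect the declared names through the SAX DeclHandler callbacks. This is only a sketch: the class name DeclaredNames is made up for illustration, it runs against whatever JAXP parser is the default, and the set it builds only approximates what ends up in the Xerces SymbolTable (names that occur in document content are not counted here).

```java
import java.io.StringReader;
import java.util.HashSet;
import java.util.Set;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

// Hypothetical helper: gathers element, attribute and entity names declared in a DTD.
class DeclaredNames extends DefaultHandler2 {
    final Set<String> names = new HashSet<>();

    @Override
    public void elementDecl(String name, String model) {
        names.add(name);
    }

    @Override
    public void attributeDecl(String eName, String aName, String type, String mode, String value) {
        names.add(eName);
        names.add(aName);
    }

    @Override
    public void internalEntityDecl(String name, String value) {
        names.add(name);
    }

    @Override
    public void externalEntityDecl(String name, String publicId, String systemId) {
        names.add(name);
    }

    // Parses one document (with its DTD) and returns the declared names it saw.
    static Set<String> collect(String xml) throws Exception {
        DeclaredNames handler = new DeclaredNames();
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        // Standard SAX2 property for receiving DTD declaration events.
        reader.setProperty("http://xml.org/sax/properties/declaration-handler", handler);
        reader.setContentHandler(handler);
        reader.parse(new InputSource(new StringReader(xml)));
        return handler.names;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<!DOCTYPE a [<!ELEMENT a (b)><!ELEMENT b EMPTY>"
                + "<!ATTLIST b id CDATA #IMPLIED><!ENTITY e \"x\">]><a><b/></a>";
        System.out.println(collect(xml));
    }
}
```

Merging the sets returned for one sample document per DTD would give an estimate of the total number of unique declared names across all five DTDs.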

1) even when using "java -Xmx7000M" (that's 7 salesman's gigabytes) it falls over (whereas 300 MB is enough if I use a new parser for each doc);

2) profiling shows that symbol table entries exist with a continuously growing number of different garbage collection generations (new entries are continuously being added without the old ones being cleaned up). If the cache were working, new entries would not be created once each DTD had been read once.
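The "new parser for each doc" workaround from point 1 can be sketched roughly as follows. The class and method names are hypothetical, the documents are inline strings rather than USPTO files, and the sketch assumes that each parser instance owns its own symbol table, so that everything it interned becomes collectable once the reader goes out of scope.

```java
import java.io.StringReader;
import java.util.Arrays;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: one short-lived XMLReader per document instead of one shared reader.
class FreshParserPerDocument {

    // Parses every document with a brand-new reader and returns how many were parsed.
    static int parseAll(SAXParserFactory factory, List<String> documents) throws Exception {
        int parsed = 0;
        for (String xml : documents) {
            // A new reader per document; the previous one is unreachable after each iteration.
            XMLReader reader = factory.newSAXParser().getXMLReader();
            reader.setContentHandler(new DefaultHandler());
            // An entity resolver (e.g. a catalog resolver) would be installed here:
            // reader.setEntityResolver(entityResolver);
            reader.parse(new InputSource(new StringReader(xml)));
            parsed++;
        }
        return parsed;
    }

    public static void main(String[] args) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        List<String> docs = Arrays.asList("<a/>", "<b><c/></b>");
        System.out.println(parseAll(factory, docs));
    }
}
```

The trade-off is that every DTD is re-read and re-interned for each document, which costs time but keeps the memory footprint bounded by the largest single document plus its DTD.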

Is it possible that I'm messing things up by having xercesImpl-2.8.0 in the classpath without pointing to it with -Djava.endorsed.dirs?
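One way to check which parser implementation is actually being picked up is to print the class of the XMLReader and where it was loaded from. A minimal sketch, with a made-up class name; the exact class names and locations printed depend on the JDK and classpath in use.

```java
import java.security.CodeSource;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.XMLReader;

// Hypothetical diagnostic: reports the implementing class of an object and its origin.
class WhichParser {

    static String describe(Object o) {
        Class<?> c = o.getClass();
        CodeSource src = c.getProtectionDomain().getCodeSource();
        // Classes bundled with the JDK typically have no code source location.
        String where = (src == null || src.getLocation() == null)
                ? "the JDK itself (no code source)"
                : src.getLocation().toString();
        return c.getName() + " loaded from " + where;
    }

    public static void main(String[] args) throws Exception {
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        System.out.println(describe(reader));
    }
}
```

If the printed class lives in an org.apache.xerces package and the location points at the xercesImpl jar, the jar on the classpath is the one in use; a com.sun.org.apache.xerces.internal class would indicate the JDK's bundled parser instead.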

Cheers,
  Neil.