Delay during parsing

John Jones Sat, 02 Aug 2014 07:18:22 -0700

Hi,
I am using Java 1.7 with Xerces 2.11.0 in Eclipse. I am inexperienced with
Xerces and XML. I have a simple program (which seems to work as intended)
that picks out some required details from an XML document.


The problem I have is that there is a delay of around 20 seconds before the
parsing completes. There is no significant network or CPU activity during
the parse, which suggests a network time-out. I have checked that the HTTP
references all exist.
Consulting the web FAQs suggested ensuring that validation is turned off.
I have tried to do this, but it leads to an impossible cast (the
ClassCastException shown at the end of this message).

So I now have three questions:
1. The traffic on this mailing list seems quite light. Is this question
   appropriate for this mailing list? Can you think of a more appropriate
   forum?
2. Can you suggest other likely reasons why there is a delay in processing a
   short XML document?
3. Can you explain where I am going wrong in turning off validation?

The document is an HTML bank statement with just two transactions. It is
suitable for input to Excel, and there is no delay when opening it with
Excel.
I added the first line of the XML manually after Xerces reported a bad
UTF-8 character. It doesn't appear to affect Excel.
The start of the document is:

<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="es"
xml:lang="es"><head><STYLE type="text/css">
 #CabeceraCuerpo {font-size:10pt;font-family:arial;}
#CabeceraTitulo {font-size:14.0pt;font-family:Arial;text-align:left;}
 #CabeceraFechaDescarga {font-size:9pt;font-family:arial}
#CabeceraCuerpoNoCuenta {font-size:12pt;font-family:arial;}
 #CabeceraSubTitulo {font-size:11pt;font-family:arial;}
#CuerpoDetalleTitulo {font-size:9pt;font-family:arial;}
 
#CuerpoDetalleDoble{width:60pt;font-size:10pt;font-family:arial;text-align:left;border-bottom:.5pt
hairline silver}
#CuerpoDetalle
{font-size:10pt;font-family:arial;text-align:center;border-bottom:.5pt
hairline silver}
 #TDNoWrappedLeft {text-align:left;white-space:nowrap;width:20pt}
#TDNoWrappedLeftDoble {text-align:left;white-space:nowrap;width:100pt}
 #TDNoWrappedRight {text-align:right;white-space:nowrap}
#TDNoWrappedTituloBorderBottom
{text-align:left;white-space:nowrap;border-bottom:1pt solid windowtext}
 #TDNoWrappedColCuerpoTituloBorderBottom
{text-align:center;white-space:nowrap;border-bottom:.5pt solid windowtext}
#TDNoWrappedSubTituloBorderBottom
{text-align:left;white-space:nowrap;border-bottom:.5pt solid windowtext}
 #TDSeparadorInicial {width:10pt}
#TDSeparadorDoble {width:100pt; border-bottom:.5pt hairline silver}
#TDTituloSeparadorBorderBotton {width:10pt;border-bottom:1pt solid
windowtext}
 #TDTituloBorderBotton {border-bottom:1pt solid windowtext}
#TDSubTituloSeparadorBorderBotton {width:10pt;border-bottom:.5pt solid
windowtext}
 #TDSubTituloBorderBotton {border-bottom:.5pt solid windowtext}
#TDListadoBorderBottom {border-bottom:.5pt hairline silver}
</STYLE><meta content="text/html; charset=iso-8859-1"
http-equiv="Content-Type" /></head><body><table><tr><td /><td
style="TDSeparadorDoble">Transactions</td></tr><tr
style="vertical-align:middle"><td id="TDSeparadorInicial" /><td
id="TDNoWrappedLeftDoble"><font id="CabeceraCuerpo">XXXX XXXX XXXX 4271:
 </font></td><td /><td><font id="CuerpoDetalle">01/02/2014
to
 01/08/2014</font></td>


I obtained the start of the Java for opening and parsing the document from
the web several years ago. I don't fully understand what it is doing, but
it works with a different program. It is:

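import java.util.HashMap;

import org.w3c.dom.DOMConfiguration;
import org.w3c.dom.DOMErrorHandler;
import org.w3c.dom.Document;
import org.w3c.dom.bootstrap.DOMImplementationRegistry;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSParser;
import org.w3c.dom.ls.LSParserFilter;

public class BaseXML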
{

    /** Default namespaces support (true). */
    protected static final boolean DEFAULT_NAMESPACES = true;

    /** Default validation support (false). */
    protected static final boolean DEFAULT_VALIDATION = false;

    /** Default Schema validation support (false). */
    protected static final boolean DEFAULT_SCHEMA_VALIDATION = false;

    // Set false in the first constructor called. N.B. not synchronised etc., so can be fooled.
    static boolean firstCaller=true;

    LSParser builder=null;
    DOMImplementationRegistry registry = null;
    DOMImplementationLS impl = null;
    DOMConfiguration config = null;
    DOMErrorHandler errorHandler = null;
    LSParserFilter filter = null;

    HashMap<Object, Object> bookShelf;


  public BaseXML()
  {     if( ! firstCaller )
        { System.err.println(" XML work class already initialised. (Fatal)");
          System.exit(1);
        }
        firstCaller=false;
        try {
            // get DOM Implementation using DOM Registry

            System.setProperty(DOMImplementationRegistry.PROPERTY,"org.apache.xerces.dom.DOMXSImplementationSourceImpl");
            // System.setProperty(DOMImplementationRegistry.PROPERTY,"org.apache.xerces.dom.DOMImplementationSourceImpl");
            System.out.println("DOM Impl");
            System.out.print(System.getProperty(DOMImplementationRegistry.PROPERTY ));

            registry = DOMImplementationRegistry.newInstance();

            impl = (DOMImplementationLS)registry.getDOMImplementation("LS");

            // create DOMBuilder
            builder =
impl.createLSParser(DOMImplementationLS.MODE_SYNCHRONOUS, null);

            config = builder.getDomConfig();

            // create Error Handler
            errorHandler = new Handlers();
            // create filter
            filter = new Handlers();

            builder.setFilter(filter);
            try { ((org.apache.xerces.parsers.SAXParser)
builder).setFeature("http://xml.org/sax/features/validation", false);
            } catch (org.xml.sax.SAXException e)
            {
            System.out.println("error in setting up parser feature");
            e.toString();
            }
            // set error handler
            config.setParameter("error-handler", errorHandler);

            // set validation feature
            config.setParameter("validate",Boolean.TRUE);

            // set schema language
            config.setParameter("schema-type", "http://www.w3.org/2001/XMLSchema");

        } catch ( Exception ex ) {
            ex.printStackTrace();
        }
        bookShelf=new HashMap<Object, Object>();
  }


A bit later on is the routine I call to parse the input document:

  public boolean openXMLFile(String bookName, String FileName)
  {
    Document doc=null;
    try
    { doc = builder.parseURI(FileName);
    }catch (Exception e)
    { System.err.println(" Exception during parse of "+bookName+":"+e.getMessage());
    };
    if( doc != null )
    { bookShelf.put(bookName,doc);
      return true;
    }else
    { return false;
    }
  }


The delay is during the call to parseURI.
The exception that arises when trying to turn off validation as suggested
on the web is :

org.apache.xerces.dom.DOMXSImplementationSourceImpljava.lang.ClassCastException:
org.apache.xerces.parsers.DOMParserImpl cannot be cast to
org.apache.xerces.parsers.SAXParser
 at BaseXML.<init>(BaseXML.java:80)
at SanPost.<clinit>(SanPost.java:45)


Regards

John
