Re: Apache POI works but it uses too RAM: is it a limit of your library?

Mark Beardsley Sat, 13 Apr 2013 02:38:22 -0700

Like HTML, XML is a markup language that defines a set of rules. These rules
render a document both human and machine readable. Unlike HTML however, the
rules are defined almost completely by the designer of the document.


An XML document consists of, crudely, a series of elements. Typically, each
element has a start tag and a matching end tag, although an empty element
can simply consist of an empty element tag. The elements content will be
found between the tags and it will consist of other markup such as child
elements or character data.

Elements may have attributes which consist of a name/value pair. The name
and value will be separated by the equals sign (=) and the value will be
enclosed by speech marks (").

Knowing this, consider the following fragment of markup;

<row r="1" spans="1:1">
     <c r="A1" t="s">
          <v>30</v>
     </c>
</row>

It represents a single row on an Excel worksheet. The row has an opening tag
with the name 'row' and a matching closing tag. It contains one child
element which defines the markup for a single cell. Note that the name of
the cell tag is 'c' and that there are two attributes defined within the
opening tag, r and t. The r attribute has the value A1 and this tells us
that the markup relates to cell A1 on the worksheet while the 't' attribute
has the value 's' and this actually tells us that the cell contains a
string. The c (cell) element is a child element of the row element and it
has child elements of its own. In this case, the c element has a child
element with the name 'v' and this element simply contains the value that
will appear in the cell.

A parser analyses the contents of a file, in this case, and converts it into
a form that can be further manipulated. In order to make the process as
flexible as possible, it is possible to create handlers that define how the
conversion should take place. Using a technique referred to as a callback,
the parser will reference methods defined within the handler to process the
contents of the file.

Referring to the event example, the SheetHandler class contains the methods
the parser will call as it reads the contents of the Excel workbook. When
the parser encounters a start element tag, it will call the startElement()
method and pass to it information such as the name of the tag, the
document's namespace and any attributes that were defined in the start
element tag. When the parser encounters an end element tag, it will call the
endElement() method of the handler and pass a similar collection of
parameters to it - with the exception of the list of attributes. If an
element contains character data, then the parser will call the handler's
characters() method and pass it it an arry containing the elements character
data along with the index of the first character in the array and the length
of the arrays contents.

Knowing this, it is easy to see how the fragment of XML above could be
parsed and the sequence of calls that are made to the handler;

1. A call will be made to the startElement() and while values will be passed
to all four parameters, we are only interetsted in the name and attributes
parameters values here. As the parser has found a row element, ''row" will
be passed to the name parameter and the 'r' and 'spans' attributes to the
'attributes' parameter. The code we write in the startElement() method will
need to react to this information and, in this case, we will create a new
ArrayList to hold the values recovered from the cells on the row.
2. Another call will be made to the startElement() method and, this time,
the value 'c' will be passed to the name parameter and the attributes r and
to to the attributes parameter. Again, we must write code to interpret and
react to this and the first thing to do is to check the type of the cell and
this is done, in part, because strings may be handled in a special way, they
may be written into the SharedStringsTable (more of that later).
3. Another call is made to the startElement() method and the value 'v' will
be passed to the name parameter, informing us that this element holds the
cells value. But we do not yet have access to that value, we must wait until
the parser has read it from the element and this will be accomplished by a
call to the characters() method. The parser will make this call and pass an
array - of type char - that simply contains the characters '3' and '0'.
4. A call will now be made to the endElement() method and the value 'v' will
be passed to the name parameter. By checking the value of the name
parameter, and because we know this elements parent is a cell - c - element,
that we have the cells value to hand. We must though make one further check
because the true contents of the cell could be held in the Shared Strings
Table if we are dealing with a cell of type 's'. If we are dealing with a
string then we use the integer value the cell held - int this case 30 - to
index into the Shared String Table and retrieve the actual value displayed
in the cell. Once we have the cells value, we can store that away into the
ArrayList.
5. A call is made to the endElement() method and the value 'c' is passed to
the name parameter. We do not need to react to the closing cell element tag
in this case.
6. A call is made to the endElement() method and the value 'row' is passed
to the name parameter and we could choose to react to this. When we
encounter an end element with the name of 'row', we know that we have
recovered the contents from every cell on a single row. Now would be the
ideal time to convert the ArrayList into an array of type Object and to
store that into the sheetData ArrayList.

To make life a little more straightforward for you, let's take the
SheetHandler class out of the ExampleEventUserModel class and make it into a
stand-alone class, a little like this;

import org.apache.poi.xssf.usermodel.XSSFRichTextString;
import org.apache.poi.xssf.model.SharedStringsTable;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import java.util.List;
import java.util.ArrayList;

class SheetHandler extends DefaultHandler {

    private SharedStringsTable sst;
    private String lastContents;
    private boolean nextIsString;
    private Object[][] sheetData = null;
    private List<String> rowData = null;
    private List<Object[]> tempSheetData = null;

    /**
     * Create an instance of the SheetHandler class using the following
     * parameters
     * 
     * @param sst An instance of the SharedStringsTable class that 
     *        encapsulates the workbook's shared strings table.
     */
    public SheetHandler(SharedStringsTable sst) {
       this.sst = sst;
       this.rowData = new ArrayList<String>();
       this.tempSheetData = new ArrayList<Object[]>();
    }

    /**
     * Called whenever the parser encounters the start tag of an element
     * within the xml markup.
     * 
     * @param uri An instance of the String class that will encapsulate the
     *        name of the Namespace. It will be empty if namespaces are not
     *        used.
     * @param localName An instance of the String class that will 
     *        encapsulatethe local name for the element.
     * @param name An instance of the String class that will encapsulate the
     *        qualified name for the element.
     * @param attributes An instance of the Attributes class that will 
     *        encapsulate any of the element's attributes. Empty if there 
     *        are none.
     * @throws SAXException Thrown if a problem is encountered parsing the
     *         xml markup.
     */
    public void startElement(String uri, String localName, String name,
            Attributes attributes) throws SAXException {

        // If the element marks the start of the markup for a cell...
        if (name.equals("c")) {
            // Figure out if the value is an index in the SST set a flag
            // to indicate that the Shared Strings Table should be 
            // interrogated for the cells contents.
            String cellType = attributes.getValue("t");
            if (cellType != null && cellType.equals("s")) {
                nextIsString = true;
            } else {
                nextIsString = false;
            }
        }
        // Clear contents cache
        lastContents = "";
    }

    /**
     * Called when the parser encounters the end element tag within the
     * xml markup.
     * 
     * @param uri An instance of the String class that will encapsulate the
     *        name of the Namespace. It will be empty if namespaces are not
     *        used.
     * @param localName An instance of the String class that will 
     *        encapsulatethe local name for the element.
     * @param name An instance of the String class that will encapsulate the
     *        qualified name for the element.
     * @throws SAXException Thrown if a problem is encountered parsing the
     *         xml markup.
     */
    public void endElement(String uri, String localName, String name)
            throws SAXException {
        
        // This is the end of the row so convert the ArrayList into an array
of
        // type Object
        if(name.equals("row")) {
            this.tempSheetData.add(this.rowData.toArray());
            this.rowData = new ArrayList<String>();
        }
        else if(name.equals("v")) {
            // This is the closing tag of a cells contents so handle that by
            // getting the cells value from the SST if the cell held a
String
            // and then store that value into the ArrayList
            if (nextIsString) {
                int idx = Integer.parseInt(lastContents);
                lastContents = new
XSSFRichTextString(sst.getEntryAt(idx)).toString();
                nextIsString = false;
            }
            // Add the cells contents to the row ArrayList.
            this.rowData.add(lastContents);
        }
    }

    /**
     * Called to process the elements character data.
     * 
     * @param ch An array of type character that contains the character data
     *        recovered from the element.
     * @param start A primitive int whose value indicates the start position
     *        of the character data within the ch array.
     * @param length A primitive int whose value indicates how many 
     *        characters were read.
     * @throws SAXException Thrown if a problem is encountered whilst 
     *         parsing the xml markup.
     */
    public void characters(char[] ch, int start, int length)
            throws SAXException {
        lastContents += new String(ch, start, length);
    }
    
    /**
     * Called by the parser once processing of the document has been
completed,
     * this method will simply create a multi-dimensional array of type
object
     * which holds the contents of the sheet's cells. It will copy a
reference 
     * to that object into the sheetData variable thus making it available
to
     * the class that created the handler.
     */
    public void endDocument() {
        this.sheetData = new Object[this.tempSheetData.size()][];
        for(int i = 0; i < this.tempSheetData.size(); i++) {
            this.sheetData[i] = this.tempSheetData.get(i);
        }
    }
    
    /**
     * Get the sheet's data.
     * 
     * @return A multi-dimensional array of type Object whose elements 
     *         encapsulate the contents of the sheet.
     */
    public Object[][] getSheetData() {
        return(this.sheetData);
    }
}

Next, create another class that uses this handler to parse the xml markup
for a specific sheet in the workbook. It could look a little like this;

public class NewArrayExample {
    
    /**
     * Recover the contents the cells from a specific sheet from an Excel
workbook
     * as elements in a multi-dimensional array.
     * 
     * @param filename An instance of the String class that encapsulates the
name
     *        of and path to a valid Excel OOXML (.xlsx) file.
     * @param sheetIndex A primitive int whose value indicates which sheet
in
     *        the workbook should be processed. Just as with XSSFWorkbook,
sheet
     *        indices begin at zero. This is to say that the first sheet in
the
     *        workbook will be at index 0, the second sheet at index 1, etc.
     * @return A multi-dimensional array of type Object which should
encapsulate
     *         the contents of the cells found on the sheet. Each entry in
the
     *         array will itself by an array of type Object and each entry
in
     *         this array will be an Object that encapsulates the cells
value,
     *         as a String.
     * @throws Exception 
     */
    public static Object[][] xlsxRead(String filename, int sheetIndex)
throws Exception {
        String sheetName = null;
        OPCPackage pkg = null;
        XSSFReader reader = null;
        SharedStringsTable sst = null;
        XMLReader parser = null;
        SheetHandler contentHandler = null;
        InputStream sheet = null;
        InputSource sheetSource = null;
        
        pkg = OPCPackage.open(filename);
        reader = new XSSFReader(pkg);
        sst = reader.getSharedStringsTable();
        contentHandler = new SheetHandler(sst);
        parser = fetchSheetParser(contentHandler);
        
        // The name of the sheet can be either 'rIDn' or 'rSheetn' where n
is the
        // number of the sheet. Test, firstly to see if the sheets name is
        // 'rIdn' here. If that fails....
        sheetName = ("rId" + (sheetIndex + 1));
        sheet = reader.getSheet(sheetName);
        if(sheet != null) {
            processSheet(sheet, parser);
        }
        else {
            // ...then check to see if the sheets name is 'rSheetn'. If this
fails
            // throw an exception. If it suceeds, process the sheet.
            sheetName = ("rSheet" + (sheetIndex + 1));
            if(sheet != null) {
                processSheet(sheet, parser);
            }
            else {
                throw new IllegalArgumentException("No sheet exists in the "
                        + "workbook with the index specified: " +
sheetIndex);
            }
        }

        // Get the sheets contents and return to the caller.
        return(contentHandler.getSheetData());
    }
    
    /**
     * Process (parse) the sheet recovered from the Excel workbook.
     * 
     * @param sheet An instance of the InputStream class 'attached' to the
     *        sheet's xml markup and through which the parser can access it
     *        contents.
     * @param parser An instance of the XMLReader class that encapsulates
the
     *        necessary logic to both parse the xml markup and interpret
what
     *        the parser finds (the SheetHandler).
     * @throws IOException
     * @throws SAXException 
     */
    private static void processSheet(InputStream sheet, XMLReader parser) 
            throws IOException, SAXException {
        InputSource sheetSource = new InputSource(sheet);
        parser.parse(sheetSource);
        sheet.close();
    }
    
    /**
     * Get an XMLReader to handle parsing of the workbook's xml markup.
     * 
     * @param contentHandler An instance of the SheetHandler class that 
     *        encapsulates the logic to interpret the sheet's xml markup.
     * @throws SAXException 
     */
    private static XMLReader fetchSheetParser(ContentHandler contentHandler)
            throws SAXException {
        XMLReader parser = XMLReaderFactory.createXMLReader(
                "org.apache.xerces.parsers.SAXParser");
        parser.setContentHandler(contentHandler);
        return parser;
    }
}

and it would be called like this;

Object[][] sheetData = NewArrayExample.xlsxRead("C:/temp/Book2.xlsx", 1);

Obviously, you would change the values appsed to the parameters of the
xlsxRead() method to suit your own needs. Remember that the sheet number is
zero based. Passing 0 will process the first sheet in the workbook, passing
1 will process the second sheet, etc.

This really is far more than you should expect from a list which has a very
specific purpose, to assist people in using the POI api. You really should
look into parsers, callbacks and the structure of the files markup yourself
- none of these are specific nor unique to POI. This is the last time I can
offer help to you I am afraid because writing so much code and explaining
all of these details is very time consuming.





--
View this message in context: 
http://apache-poi.1045710.n5.nabble.com/Apache-POI-works-but-it-uses-too-RAM-is-it-a-limit-of-your-library-tp5712425p5712543.html
Sent from the POI - User mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: Apache POI works but it uses too RAM: is it a limit of your library?

Reply via email to