Hi Nick,

Thanks for the steer on XLSX files.
I have tried the ReadMsOfficeFiles program
<http://codezrule.wordpress.com/2012/01/05/extract-text-from-ms-office-2007-files-docx-pptx-xlsx/>
and I think I may have found the cause of my particular issue, i.e. text
extraction of doubles gives very long, scary-looking numbers
("9.2999999999999999E-2" instead of "0.093").

In XSSFExcelExtractor.getText():

...

// Rows and cells
for (Object rawR : sheet) {
    Row row = (Row)rawR;
    for(Iterator<Cell> ri = row.cellIterator(); ri.hasNext();) {
        Cell cell = ri.next();

        // Is it a formula one?
        if(cell.getCellType() == Cell.CELL_TYPE_FORMULA && formulasNotResults) {
            text.append(cell.getCellFormula());
        } else if(cell.getCellType() == Cell.CELL_TYPE_STRING) {
            text.append(cell.getRichStringCellValue().getString());
        } else {
            XSSFCell xc = (XSSFCell)cell;
            // shouldn't this just be text.append(cell.toString()); ?
            text.append(xc.getRawValue());
        }

        // Output the comment, if requested and exists
        Comment comment = cell.getCellComment();
        if(includeCellComments && comment != null) {
            // Replace any newlines with spaces, otherwise it
            //  breaks the output
            String commentText = comment.getString().getString().replace('\n', ' ');
            text.append(" Comment by ").append(comment.getAuthor()).append(": ").append(commentText);
        }

        if(ri.hasNext())
            text.append("\t");
    }
    text.append("\n");
}

...

The getRawValue() line (the one carrying my comment above) spits out the raw
stored double in all its glory rather than just the text equivalent.
As this class is designed to produce text, it seems reasonable to me that
toString() would be sufficient. What do you think?
I have a spreadsheet which exhibits the problem. Would you like me to send it?
If so, how?
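
To illustrate the difference, here is a minimal, POI-free sketch. It assumes the raw value is simply the 17-significant-digit form of the double as stored in the sheet XML, and that parsing it and re-rendering with Double.toString() is close to what a toString()-style fix would yield. RawValueDemo and toDisplayText are hypothetical names for illustration only, not part of POI.

```java
// Hypothetical helper, not part of POI: shows how a raw 17-digit value
// from the sheet XML maps back to the short decimal form.
class RawValueDemo {

    /**
     * Parses a raw numeric string and re-renders it with Double.toString(),
     * which emits the shortest decimal that uniquely identifies the double.
     * Null becomes an empty string; non-numeric input is returned unchanged.
     */
    static String toDisplayText(String raw) {
        if (raw == null) {
            return "";
        }
        try {
            double value = Double.parseDouble(raw);
            return Double.toString(value);
        } catch (NumberFormatException e) {
            // Not a number (e.g. a boolean or error cell): leave as-is
            return raw;
        }
    }

    public static void main(String[] args) {
        String raw = "9.2999999999999999E-2";
        // prints: 9.2999999999999999E-2 -> 0.093
        System.out.println(raw + " -> " + toDisplayText(raw));
    }
}
```

Note that Double.toString() is not the same as Excel's displayed value: whole numbers come out as "93.0", and very small or very large magnitudes switch to scientific notation. Matching the cell's number format would need more than this sketch.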

Thanks,

- Chris

On 2 Nov 2012, at 15:04, Nick Burch wrote:

On Fri, 2 Nov 2012, Chris Bamford wrote:
The XLS extraction is going great.  For XLSX can I use the same mechanism?

Similar. The low-level file formats are very different, but there's an
analogous extractor that uses SAX XML events rather than record events.

Nick


Chris Bamford
Senior Developer
