[ https://issues.apache.org/jira/browse/TIKA-3544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17411814#comment-17411814 ]
Nick Burch commented on TIKA-3544:
----------------------------------

Apache POI provides the DataFormatter class which attempts to turn the number into a string similar to the one shown in Excel, based on the formatting rules applied to the cell. That ought to be being used by Tika. Doesn't help completely if Excel has thrown away the last few digits though...

> Extraction of long sequences of digits from Excel spreadsheets using Tika
> 1.20 doesn’t yield the expected results
> -----------------------------------------------------------------------------------------------------------------
>
>                 Key: TIKA-3544
>                 URL: https://issues.apache.org/jira/browse/TIKA-3544
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.20
>            Reporter: Jitin Jindal
>            Priority: Major
>         Attachments: Credit Card Numbers.xlsx
>
>
> If an Excel spreadsheet contains a long sequence of digits, such as a credit
> card number, Tika 1.13 will emit the said sequence in scientific notation.
> For example, the credit card number “6011799905775830” is extracted from the
> attached spreadsheet as 6.480195344642784E15, which clearly is not the
> desired output.
> I think the impact of this issue is significant. There’s plenty of
> information that can no longer be reliably extracted from spreadsheets. Think
> credit card numbers, telephone numbers and product identifiers to name a few.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
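
As a rough illustration of the behaviour described in the report: Excel stores numeric cells as IEEE 754 doubles, and Java's Double.toString switches to scientific notation for magnitudes of 10^7 and above, which is consistent with the kind of output quoted in the issue. The sketch below uses only the JDK and is not Tika's or POI's code. The class and method names (LongDigitDemo, scientific, plain) are made up for the example, and BigDecimal is only a stand-in for the cell-format-aware rendering that the comment attributes to POI's DataFormatter.

```java
import java.math.BigDecimal;

// Hypothetical demo class; not part of Tika or Apache POI.
final class LongDigitDemo {

    // Default Java rendering of a double: values of 10^7 and above
    // are printed in scientific notation (mantissa plus "E" exponent).
    static String scientific(double cellValue) {
        return Double.toString(cellValue);
    }

    // Plain-digit rendering of the same double, without an exponent.
    // Suitable for whole numbers; fractional values would print their
    // full binary expansion.
    static String plain(double cellValue) {
        return new BigDecimal(cellValue).toPlainString();
    }

    public static void main(String[] args) {
        // The 16-digit number quoted in the report, as a numeric cell value.
        double cellValue = 6011799905775830d;
        System.out.println(scientific(cellValue));
        System.out.println(plain(cellValue));
    }
}
```

Neither rendering can bring back digits that were already lost when the value was stored as a double, which is the limitation the comment alludes to.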