[ 
https://issues.apache.org/jira/browse/NIFI-14426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17943521#comment-17943521
 ] 

Piotr Zalas commented on NIFI-14426:
------------------------------------

I have created a 
[commit|https://github.com/apache/nifi/commit/5cae9b89e9413815d49eb25a9b0e0171d3bae5ab]
 with the implementation. I'm not able to open a PR against the NiFi repository, 
because it fails with the error "Pull request creation failed. Validation 
failed: must be a collaborator".

Implementing the file type detection turned out to be more complicated than I 
expected. Unencrypted XLSX files are stored as OOXML. Legacy XLS files are 
stored as OLE2 files. The hard part is encrypted XLSX files, because they 
are wrapped in an OLE2 file that contains the encrypted OOXML package. 
Apache POI distinguishes these files by the presence of specific root entries 
(as visible in the WorkbookFactory#create(InputStream, String) implementation). It 
seems that to open OLE2 files (both encrypted XLSX files and legacy XLS files), 
the whole file content must be loaded into memory by the POIFSFileSystem class. 
That class is already used by StreamingReader to decrypt XLSX files (see 
StreamingWorkbookReader#init(InputStream)).
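
To illustrate the first step of the detection described above, here is a minimal sketch that only looks at the leading bytes of the stream. It is not the code from the commit and it does not use Apache POI. The names SpreadsheetSniffer, Container and sniff are hypothetical. It assumes the well-known signatures: OLE2 files start with D0 CF 11 E0 A1 B1 1A E1, and OOXML packages are ZIP archives starting with "PK" 03 04. An OLE2 result is still ambiguous (legacy XLS or encrypted XLSX), so the root entries would have to be inspected afterwards, as POI does.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.PushbackInputStream;
import java.util.Arrays;

// Hypothetical helper, for illustration only.
final class SpreadsheetSniffer {

    // OLE2 covers both legacy XLS and encrypted XLSX; telling them apart
    // requires reading the root entries of the OLE2 file.
    enum Container { OOXML_ZIP, OLE2, UNKNOWN }

    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };
    private static final byte[] ZIP_MAGIC = {0x50, 0x4B, 0x03, 0x04};

    // Classifies a file by the first bytes of its content.
    static Container sniff(byte[] header) {
        if (startsWith(header, OLE2_MAGIC)) {
            return Container.OLE2;
        }
        if (startsWith(header, ZIP_MAGIC)) {
            return Container.OOXML_ZIP;
        }
        return Container.UNKNOWN;
    }

    // Peeks at up to 8 bytes and pushes them back, so the caller can still
    // read the stream from the start. The PushbackInputStream must have been
    // created with a pushback buffer of at least 8 bytes.
    static Container sniff(PushbackInputStream in) throws IOException {
        byte[] header = new byte[OLE2_MAGIC.length];
        int read = in.readNBytes(header, 0, header.length);
        if (read > 0) {
            in.unread(header, 0, read);
        }
        return sniff(Arrays.copyOf(header, read));
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        return data.length >= prefix.length
            && Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }

    public static void main(String[] args) throws IOException {
        byte[] zipLike = {0x50, 0x4B, 0x03, 0x04, 0x14, 0x00};
        PushbackInputStream in =
            new PushbackInputStream(new ByteArrayInputStream(zipLike), 8);
        System.out.println(sniff(in));
    }
}
```

Peeking with a pushback buffer keeps the unencrypted XLSX path streamable; only the OLE2 branch would then need the full in-memory load mentioned above.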

[~dstiegli1], I haven't yet tested the memory usage of the implementation. Can 
we agree on some yes/no acceptance criteria? For example, can we simply say 
that reader performance is acceptable when "ExcelReader can load files of 20 MB 
in size", or are additional conditions needed (e.g. a measured memory usage, 
and if so, how to measure it, or testing on a NiFi instance with specific 
settings, configuration, etc.)?

> Add support for HSSF format in ExcelReader processor
> ----------------------------------------------------
>
>                 Key: NIFI-14426
>                 URL: https://issues.apache.org/jira/browse/NIFI-14426
>             Project: Apache NiFi
>          Issue Type: Improvement
>          Components: Extensions
>            Reporter: Piotr Zalas
>            Assignee: Piotr Zalas
>            Priority: Major
>
> Currently ExcelReader processor supports only files in new XSSF (.xlsx) 
> format. Add support for legacy HSSF (.xls) format.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
