[ https://issues.apache.org/jira/browse/NIFI-14426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17943521#comment-17943521 ]
Piotr Zalas commented on NIFI-14426:
------------------------------------

I have created a [commit|https://github.com/apache/nifi/commit/5cae9b89e9413815d49eb25a9b0e0171d3bae5ab] with the implementation. I am not able to open a PR against the NiFi repository, as it fails with the error "Pull request creation failed. Validation failed: must be a collaborator".

The logic for detecting the file type is more complicated than I expected. Unencrypted XLSX files are stored as OOXML. Legacy XLS files are stored as OLE2 files. The hard part is encrypted XLSX files, because they are wrapped in an OLE2 file that contains the encrypted OOXML. Apache POI differentiates these files by the presence of specific root entries (as visible in the WorkbookFactory#create(InputStream, String) implementation).

It seems that to open OLE2 files (encrypted XLSX files and legacy XLS files), the whole content of the file must be loaded into memory by the POIFSFileSystem class. This class is already used by StreamingReader to decrypt XLSX files (see StreamingWorkbookReader#init(InputStream)).

[~dstiegli1], I haven't yet tested the memory usage of the implementation. Can we come up with some yes/no acceptance criteria? E.g. can we simply say that reader performance is acceptable when "ExcelReader can load files of 20 MB in size", or are additional conditions needed (e.g. measured memory usage, and if so, how to measure it, or testing with a NiFi instance that has specific settings, configuration, etc.)?

> Add support for HSSF format in ExcelReader processor
> ----------------------------------------------------
>
>                 Key: NIFI-14426
>                 URL: https://issues.apache.org/jira/browse/NIFI-14426
>             Project: Apache NiFi
>          Issue Type: Improvement
>          Components: Extensions
>            Reporter: Piotr Zalas
>            Assignee: Piotr Zalas
>            Priority: Major
>
> Currently ExcelReader processor supports only files in new XSSF (.xlsx)
> format. Add support for legacy HSSF (.xls) format.
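
For illustration only, the container-level part of the detection described in the comment can be sketched with plain Java and no POI dependency. This is not the code from the linked commit; the class name ContainerSniffer and its methods are hypothetical. It only checks the leading magic bytes: an OLE2 compound file starts with D0 CF 11 E0 A1 B1 1A E1, and a ZIP-based package such as OOXML starts with 50 4B 03 04. It does not tell a legacy XLS apart from an encrypted XLSX, since both are OLE2; that step still needs the OLE2 root entries to be inspected, as the comment notes.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

// Hypothetical helper: classifies a stream by its leading magic bytes only.
public final class ContainerSniffer {

    public enum Kind { OLE2, ZIP_BASED, UNKNOWN }

    // Signature of an OLE2 compound file (legacy XLS, or the wrapper of an encrypted XLSX).
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // Local file header signature of a ZIP archive (unencrypted XLSX is a ZIP package).
    private static final byte[] ZIP_MAGIC = { 0x50, 0x4B, 0x03, 0x04 };

    private ContainerSniffer() { }

    // Classifies a buffer holding the first bytes of a file.
    public static Kind detect(byte[] header) {
        if (startsWith(header, OLE2_MAGIC)) {
            return Kind.OLE2;
        }
        if (startsWith(header, ZIP_MAGIC)) {
            return Kind.ZIP_BASED;
        }
        return Kind.UNKNOWN;
    }

    // Peeks at the first 8 bytes and rewinds, so the caller can still read the full stream.
    public static Kind peek(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(OLE2_MAGIC.length);
        byte[] header = in.readNBytes(OLE2_MAGIC.length);
        in.reset();
        return detect(header);
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        return data.length >= prefix.length
                && Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }
}
```

A reader could wrap its input in a BufferedInputStream, call peek once, and take the streaming path only for ZIP_BASED input, so that the in-memory OLE2 path is reserved for the files that require it.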
-- This message was sent by Atlassian Jira (v8.20.10#820010)