tl;dr: I'm still searching for a way to open "Document" streams for PPT/DOC files, but I have leads I need to check!
Hey Javen,

Thanks for the link! I was looking at one "[MS-xxx].pdf" file at a time, through their online versions, but wasn't aware a zip was available.

DocBleach already handles embedded files for multiple formats, including PDF/OOXML. Thus, adding support for OLE2 embedded files is as easy as calling a method.
ActiveX controls are/should be removed too. I'm not sure I handle this correctly, and I don't have a proper test suite, but these are supposed to work.
I haven't checked hyperlinks yet, and didn't even think they could be a threat. Thanks!

--

I'll check the VBAMacroReader class again; maybe it contains the answers I'm looking for.

> Feel free to continue this discussion over on d...@poi.apache.org,

I'll stick here for now; I may migrate when I start working on POI code.

Thanks for your cycles!

> On 3 May 2017, at 8:31 AM, Javen O'Neal <one...@apache.org> wrote:
>
> If you haven't discovered it already, Microsoft has published the
> specifications for the Microsoft Office OLE2 binary file formats.
>
> https://msdn.microsoft.com/library/cc313118.aspx
> Scroll down to Download "Office File Formats PDF .zip file"
> At the very least, you'll want to skim through [MS-OVBA] and [MS-XLS].
>
> One other thing to think about: you can embed almost any kind of file
> inside of an office document. That embedded file could be a
> standalone executable, a script, or a malicious macro-enabled
> document. Are you planning on recursively searching through embedded
> documents and removing these threats? Are you also scanning for
> hyperlinks and ActiveX controls or other components that may not
> require VBA code but perform some potentially dangerous underlying
> action? Something as benign as loading an external resource (loading
> content from an external file or sending an http request that could
> send data encoded in the URL to some malicious server).
> Or something as benign as asking the OS to use the system's default
> media player to play an audio clip or video could cause problems with
> a cleverly crafted payload and a security hole in that application.
> DocBleach would be helpful for disabling/removing anything that
> Microsoft Office or other target office applications haven't
> sufficiently mitigated against.
>
> On Tue, May 2, 2017 at 10:59 PM, Javen O'Neal <one...@apache.org> wrote:
>> Hey, PunKeel, this is great!
>>
>> If your software is based on POI and you'd like to upstream some of your
>> changes to POI to make your library more straightforward, send us a
>> pull request and we'll review it, give feedback, and commit it.
>>
>> I have barely dabbled in how VBA projects are saved in the OLE2
>> formats (VBAMacroReader class), but perhaps others have some ideas and
>> a few free cycles to spare (being a volunteer project and all).
>>
>> Keep in mind that PPT files save macros in a different part of the
>> OLE2 file structure than XLS and DOC. A reminder that these binary
>> formats weren't simultaneously developed by the same software team at
>> the same time.
>>
>> The following bugs might be of interest to you:
>> https://bz.apache.org/bugzilla/buglist.cgi?bug_id=52949%2C59302%2C60273%2C59830%2C59858%2C60158&list_id=159842
>>
>> Feel free to continue this discussion over on d...@poi.apache.org,
>> where there might be a better technical audience who could point out
>> some of the POI internals. Most of the POI devs monitor both mailing
>> lists, so which mailing list probably doesn't matter too much.
>>
>> Javen
>>
>> On Tue, May 2, 2017 at 6:19 PM, PunKeel <punk...@me.com> wrote:
>>> Hello,
>>>
>>> I've found out the solution for Excel files: removing the "ObProj" record of
>>> the Workbook.
>>>
>>> The code I'm using is available here:
>>> https://gist.github.com/PunKeel/0e72ccde78cb0150383a9ced094c2bce
>>> I don't think this is the cleanest way to achieve it, so I'm open to
>>> suggestions.
>>>
>>> It also doesn't work for Word/PPT files. If I understand correctly (let's
>>> hope I do), the (Current User, PowerPoint Document) and
>>> (WordDocument, 0Table, 1Table) streams need to be edited by hand
>>> because Apache POI is lacking the APIs for these formats.
>>> ~> How? Are they like "DocumentStreams"? May I use the "RecordFactory"?
>>>
>>> Am I right? I am more than open to suggestions/help, please!
>>>
>>> Best regards,
>>>
>>>
>>> On 1 May 2017, at 6:15 PM, PunKeel <punk...@me.com> wrote:
>>>
>>> Hello,
>>>
>>> I am currently building an open-source software to disarm Office files,
>>> named DocBleach [1],
>>> but I am stuck with some specificities of the OLE2 format.
>>>
>>> First of all, I would like to thank you for the great library that is Apache
>>> POI!
>>>
>>> When opening disarmed OLE2 files in Office Excel/Word on Windows 10 (haven't
>>> checked other versions),
>>> an alert is displayed, depending on the editor:
>>>
>>> - Excel says that the file is corrupted and needs to be repaired. The error:
>>> "Lost Visual Basic project".
>>> ~> Once repaired, the Macro Viewer is unable to tell us the name of the
>>> Macros
>>>
>>> - Word tells us that the file is unsafe because it contains Macros.
>>> ~> The "Macro Viewer" is able to tell us the name of the Macros
>>>
>>> ----
>>>
>>> As you know, OLE2 files are "file systems in a file".
>>> In order to remove the Macros of a document, I remove the Macros
>>> "directory".
>>>
>>> Sample log, for the record. (The process being the same for Word/Excel, I'll
>>> only give one).
>>> Relevant code is available at [2]
>>> Sample files, with their sanitised form (named "-free")
>>> ~> https://www.mediafire.com/folder/yh122tgbyzadw/Sample_2017_May_1
>>>
>>> #####
>>>
>>> $ java -jar docbleach.jar -vv -in ../Goodware/macro.doc -out - > macro-free.doc
>>> [main] DEBUG xyz.docbleach.cli.Main - Log Level: TRACE
>>> [main] TRACE xyz.docbleach.api.bleach.CompositeBleach - Using bleach: OLE2 Bleach
>>> [main] DEBUG xyz.docbleach.module.ole2.OLE2Bleach - Entries before: [CompObj, 1Table, SummaryInformation, WordDocument, DocumentSummaryInformation, Macros]
>>> [main] DEBUG xyz.docbleach.module.ole2.OLE2Bleach - Root ClassID: {00020906-0000-0000-C000-000000000046}
>>>
>>> [main] TRACE xyz.docbleach.module.ole2.OLE2Bleach - copyNodesRecursively: SummaryInformation, parent: org.apache.poi.poifs.filesystem.DirectoryNode@2e5c649
>>> [main] TRACE xyz.docbleach.module.ole2.OLE2Bleach - copyNodesRecursively: DocumentSummaryInformation, parent: org.apache.poi.poifs.filesystem.DirectoryNode@2e5c649
>>> [main] TRACE xyz.docbleach.module.ole2.OLE2Bleach - copyNodesRecursively: WordDocument, parent: org.apache.poi.poifs.filesystem.DirectoryNode@2e5c649
>>> [main] TRACE xyz.docbleach.module.ole2.OLE2Bleach - copyNodesRecursively: 1Table, parent: org.apache.poi.poifs.filesystem.DirectoryNode@2e5c649
>>>
>>> [main] DEBUG xyz.docbleach.module.ole2.OLE2Bleach - Entries after: [1Table, SummaryInformation, WordDocument, DocumentSummaryInformation]
>>>
>>> #####
>>>
>>> The CompObj and Macros entries are removed (not copied), so the Macros
>>> *can't* work.
>>>
>>> I've been trying a lot of things, especially with Excel files (they only
>>> contain a Workbook, SummaryInformation and DocumentSummaryInformation)
>>> and I've found out the Workbook was at fault: the two summaries did not
>>> contain the "macro reference", and I recreate the file from scratch, so
>>> it has to be in an entry.
>>>
>>> If I understand correctly, there are "entries" in the Workbook/WordDocument
>>> holding the Macros.
>>> I found "stwUser" in the Word documentation [3], and I assume that I need to
>>> remove it, but couldn't find a unified API to achieve it for
>>> Word/Excel/PowerPoint documents.
>>>
>>> My question: is there an "easy" API to interact with these entries, removing
>>> parts of them?
>>> If so, could you please give me some leads/examples on how to do it?
>>> If not, do you have tips on how to achieve something similar?
>>>
>>> I could iterate over the Workbook/Document to copy it over manually, without
>>> the Macros…
>>> but if the API allowing it is not unified, I would have to do it for
>>> XLS/Word/PPT files, right?
>>> Doesn't seem like the easy path! :-(
>>>
>>> Thanks in advance!
>>>
>>> - PunKeel
>>>
>>> [1]: https://github.com/docbleach/DocBleach
>>> [2]: https://github.com/docbleach/DocBleach/blob/master/module/module-office/src/main/java/xyz/docbleach/module/ole2/OLE2Bleach.java
>>> [3]: https://msdn.microsoft.com/en-us/library/dd923194(v=office.12).aspx
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: user-unsubscr...@poi.apache.org
>>> For additional commands, e-mail: user-h...@poi.apache.org
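
For the record, here is a minimal sketch of the record-dropping idea mentioned in the thread for Excel files (removing "ObProj" from the Workbook stream). It is not the code from the gist and does not use POI: `BiffRecordFilter` and `dropRecords` are hypothetical names, and the sketch only assumes the usual BIFF record layout (2-byte little-endian type, 2-byte little-endian length, then the data) and the ObProj type 0x00D3 from [MS-XLS].

```java
import java.io.ByteArrayOutputStream;

/** Hypothetical helper for illustration; not part of Apache POI or DocBleach. */
final class BiffRecordFilter {
    /** Record type of ObProj according to [MS-XLS]. */
    static final int OB_PROJ = 0x00D3;

    private BiffRecordFilter() {}

    /**
     * Copies a raw BIFF stream, skipping every record whose type equals sidToDrop.
     * Each record is: type (2 bytes LE), length (2 bytes LE), data (length bytes).
     */
    static byte[] dropRecords(byte[] stream, int sidToDrop) {
        ByteArrayOutputStream out = new ByteArrayOutputStream(stream.length);
        int pos = 0;
        while (pos + 4 <= stream.length) {
            int sid = (stream[pos] & 0xFF) | ((stream[pos + 1] & 0xFF) << 8);
            int len = (stream[pos + 2] & 0xFF) | ((stream[pos + 3] & 0xFF) << 8);
            int end = pos + 4 + len;
            if (end > stream.length) {
                throw new IllegalArgumentException("Truncated record at offset " + pos);
            }
            if (sid != sidToDrop) {
                out.write(stream, pos, end - pos); // keep header and data untouched
            }
            pos = end;
        }
        // Fewer than 4 bytes left cannot form a record header; copy them as-is.
        out.write(stream, pos, stream.length - pos);
        return out.toByteArray();
    }
}
```

Calling `BiffRecordFilter.dropRecords(workbookBytes, BiffRecordFilter.OB_PROJ)` returns a copy without the ObProj record. On its own this is not enough for a real file: absolute offsets stored elsewhere in the stream (for example the sheet positions in BoundSheet8 records) would have to be recomputed after bytes are removed, which is why going through a library that re-serializes the workbook is probably the safer route.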