If you haven't discovered it already, Microsoft has published the specifications for the Microsoft Office OLE2 binary file formats.
https://msdn.microsoft.com/library/cc313118.aspx
Scroll down to Download "Office File Formats PDF .zip file".

At the very least, you'll want to skim through [MS-OVBA] and [MS-XLS].

One other thing to think about: you can embed almost any kind of file
inside of an Office document. That embedded file could be a standalone
executable, a script, or a malicious macro-enabled document. Are you
planning on recursively searching through embedded documents and removing
these threats?

Are you also scanning for hyperlinks and ActiveX controls or other
components that may not require VBA code but perform some potentially
dangerous underlying action? Something as seemingly benign as loading an
external resource (loading content from an external file, or sending an
HTTP request that could send data encoded in the URL to some malicious
server), or asking the OS to use the system's default media player to play
an audio clip or video, could cause problems given a cleverly crafted
payload and a security hole in that application.

DocBleach would be helpful for disabling/removing anything that Microsoft
Office or other target office applications haven't sufficiently mitigated
against.

On Tue, May 2, 2017 at 10:59 PM, Javen O'Neal <one...@apache.org> wrote:
> Hey, PunKeel, this is great!
>
> If your software is based on POI and you'd like to upstream some of your
> changes to POI to make your library more straightforward, send us a
> pull request and we'll review it, give feedback, and commit it.
>
> I have barely dabbled in how VBA projects are saved in the OLE2
> formats (VBAMacroReader class), but perhaps others have some ideas and
> a few free cycles to spare (being a volunteer project and all).
>
> Keep in mind that PPT files save macros in a different part of the
> OLE2 file structure than XLS and DOC. A reminder that these binary
> formats weren't developed by the same software team at the same time.
>
> The following bugs might be of interest to you:
> https://bz.apache.org/bugzilla/buglist.cgi?bug_id=52949%2C59302%2C60273%2C59830%2C59858%2C60158&list_id=159842
>
> Feel free to continue this discussion over on d...@poi.apache.org,
> where there might be a better technical audience who could point out
> some of the POI internals. Most of the POI devs monitor both mailing
> lists, so which mailing list you use probably doesn't matter too much.
>
> Javen
>
> On Tue, May 2, 2017 at 6:19 PM, PunKeel <punk...@me.com> wrote:
>> Hello,
>>
>> I've found the solution for Excel files: removing the "ObProj" record
>> of the Workbook.
>>
>> The code I'm using is available here:
>> https://gist.github.com/PunKeel/0e72ccde78cb0150383a9ced094c2bce
>> I don't think this is the cleanest way to achieve it, so I'm open to
>> suggestions.
>>
>> It also doesn't work for Word/PPT files. If I understand correctly
>> (let's hope I do), the (Current User, PowerPoint Document) and
>> (WordDocument, 0Table, 1Table) streams need to be edited by hand
>> because Apache POI is lacking the APIs for these formats.
>> ~> How? Are they like "DocumentStreams"? May I use the "RecordFactory"?
>>
>> Am I right? I am more than open to suggestions/help, please!
>>
>> Best regards,
>>
>>
>> On 1 May 2017, at 6:15 PM, PunKeel <punk...@me.com> wrote:
>>
>> Hello,
>>
>> I am currently building open-source software to disarm Office files,
>> named DocBleach [1], but I am stuck on some specificities of the OLE2
>> format.
>>
>> First of all, I would like to thank you for the great library that is
>> Apache POI!
>>
>> When opening disarmed OLE2 files in Office Excel/Word on Windows 10
>> (haven't checked other versions), an alert is displayed, depending on
>> the editor:
>>
>> - Excel says that the file is corrupted and needs to be repaired. The error:
>> "Lost Visual Basic project".
>> ~> Once repaired, the Macro Viewer is unable to tell us the name of the
>> Macros
>>
>> - Word tells us that the file is unsafe because it contains Macros.
>> ~> The "Macro Viewer" is able to tell us the name of the Macros
>>
>> ----
>>
>> As you know, OLE2 files are "file systems in a file".
>> In order to remove the Macros of a document, I remove the Macros
>> "directory".
>>
>> Sample log, for the record. (The process being the same for Word/Excel,
>> I'll only give one.)
>> Relevant code is available at [2]
>> Sample files, with their sanitised form (named "-free")
>> ~> https://www.mediafire.com/folder/yh122tgbyzadw/Sample_2017_May_1
>>
>> #####
>>
>> $ java -jar docbleach.jar -vv -in ../Goodware/macro.doc -out - > macro-free.doc
>> [main] DEBUG xyz.docbleach.cli.Main - Log Level: TRACE
>> [main] TRACE xyz.docbleach.api.bleach.CompositeBleach - Using bleach: OLE2 Bleach
>> [main] DEBUG xyz.docbleach.module.ole2.OLE2Bleach - Entries before: [CompObj, 1Table, SummaryInformation, WordDocument, DocumentSummaryInformation, Macros]
>> [main] DEBUG xyz.docbleach.module.ole2.OLE2Bleach - Root ClassID: {00020906-0000-0000-C000-000000000046}
>>
>> [main] TRACE xyz.docbleach.module.ole2.OLE2Bleach - copyNodesRecursively: SummaryInformation, parent: org.apache.poi.poifs.filesystem.DirectoryNode@2e5c649
>> [main] TRACE xyz.docbleach.module.ole2.OLE2Bleach - copyNodesRecursively: DocumentSummaryInformation, parent: org.apache.poi.poifs.filesystem.DirectoryNode@2e5c649
>> [main] TRACE xyz.docbleach.module.ole2.OLE2Bleach - copyNodesRecursively: WordDocument, parent: org.apache.poi.poifs.filesystem.DirectoryNode@2e5c649
>> [main] TRACE xyz.docbleach.module.ole2.OLE2Bleach - copyNodesRecursively: 1Table, parent: org.apache.poi.poifs.filesystem.DirectoryNode@2e5c649
>>
>> [main] DEBUG xyz.docbleach.module.ole2.OLE2Bleach - Entries after: [1Table, SummaryInformation, WordDocument, DocumentSummaryInformation]
>>
>> #####
>>
>> The CompObj and Macros entries are removed (not copied), so the Macros
>> *can't* work.
>>
>> I've been trying a lot of things, especially with Excel files (they only
>> contain a Workbook, SummaryInformation and DocumentSummaryInformation)
>> and I've found out the Workbook was at fault: the two summaries did not
>> contain the "macro reference", and I recreate the file from scratch, so
>> it has to be in an entry.
>>
>> If I understand correctly, there are "entries" in the
>> Workbook/WordDocument holding the Macros.
>> I found "stwUser" in the Word documentation [3], and I assume that I
>> need to remove it, but couldn't find a unified API to achieve it for
>> Word/Excel/PowerPoint documents.
>>
>> My question: is there an "easy" API to interact with these entries,
>> removing parts of them?
>> If so, could you please give me some leads/examples on how to do it?
>> If not, do you have tips on how to achieve something similar?
>>
>> I could iterate over the Workbook/Document to copy it over manually,
>> without the Macros… but if the API allowing it is not unified, I would
>> have to do it for XLS/Word/PPT files, right?
>> Doesn't seem like the easy path! :-(
>>
>> Thanks in advance!
>>
>> - PunKeel
>>
>> [1]: https://github.com/docbleach/DocBleach
>> [2]:
>> https://github.com/docbleach/DocBleach/blob/master/module/module-office/src/main/java/xyz/docbleach/module/ole2/OLE2Bleach.java
>> [3]: https://msdn.microsoft.com/en-us/library/dd923194(v=office.12).aspx
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: user-unsubscr...@poi.apache.org
>> For additional commands, e-mail: user-h...@poi.apache.org
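
For illustration, the "copy every entry except the Macros directory" step
described in the original message can be modelled without POI at all. The
sketch below is plain Java and is not DocBleach's or POI's code: the class
EntryFilterSketch, the method copyWithout, and the nested-map stand-in for
an OLE2 directory tree are all assumptions made up for this example. Only
the entry names come from the log in the thread.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical model, not the POI API: a "directory" is a Map whose values
// are either byte[] (a stream) or another Map (a sub-directory).
class EntryFilterSketch {

    // Returns a deep copy of dir, skipping every entry whose name is blocked.
    static Map<String, Object> copyWithout(Map<String, Object> dir, Set<String> blocked) {
        Map<String, Object> out = new LinkedHashMap<>();
        for (Map.Entry<String, Object> e : dir.entrySet()) {
            if (blocked.contains(e.getKey())) {
                continue; // not copied, so it is absent from the result
            }
            Object value = e.getValue();
            if (value instanceof Map) {
                @SuppressWarnings("unchecked")
                Map<String, Object> sub = (Map<String, Object>) value;
                out.put(e.getKey(), copyWithout(sub, blocked)); // recurse into sub-directory
            } else {
                out.put(e.getKey(), ((byte[]) value).clone()); // copy stream bytes
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Entry names in the order shown by the "Entries before" log line.
        Map<String, Object> root = new LinkedHashMap<>();
        root.put("CompObj", new byte[0]);
        root.put("1Table", new byte[0]);
        root.put("SummaryInformation", new byte[0]);
        root.put("WordDocument", new byte[0]);
        root.put("DocumentSummaryInformation", new byte[0]);
        root.put("Macros", new LinkedHashMap<String, Object>());

        Map<String, Object> clean = copyWithout(root, Set.of("CompObj", "Macros"));
        // prints [1Table, SummaryInformation, WordDocument, DocumentSummaryInformation]
        System.out.println(clean.keySet());
    }
}
```

As the thread found, dropping the directory entry alone is not enough: the
Workbook stream still refers to the VBA project (the "ObProj" record), so
Excel reports a lost Visual Basic project until that record is removed too.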