tl;dr: I'm still searching for a way to open "Document" streams for
PPT/DOC files, but I have leads I need to check!


Hey Javen,

Thanks for the link! I was looking at one "[MS-xxx].pdf" file at a time,
through their online versions, but wasn't aware a zip was available.

DocBleach already handles embedded files for multiple formats,
including PDF/OOXML. Thus, adding support for OLE2 embedded
files is as easy as calling a method.
ActiveX controls are/should be removed too.
I'm not sure I handle this correctly, and I don't have a proper
test suite, but these are supposed to work.
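
To make the "copy everything except the unwanted entries" idea from my
first mail concrete, here is a minimal sketch in plain Java. It does not
use POI at all: Node, copyWithout, names and sample are made-up names,
and the "VBA" and "EmbeddedDoc" entries are hypothetical, added only to
show the recursion into nested directories. The real code works on POI's
DirectoryNode instead, so treat this as an illustration of the approach,
not as the actual implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class OleTreeSketch {

    /** Toy stand-in for an OLE2 entry (NOT a POI class). */
    public static final class Node {
        public final String name;
        public final List<Node> children = new ArrayList<>();

        public Node(String name) {
            this.name = name;
        }

        public Node add(Node child) {
            children.add(child);
            return this;
        }
    }

    /** Copies the tree, skipping any entry whose name is blocked, at any depth. */
    public static Node copyWithout(Node source, Set<String> blocked) {
        Node copy = new Node(source.name);
        for (Node child : source.children) {
            if (blocked.contains(child.name)) {
                continue; // not copied, so it is absent from the output
            }
            copy.add(copyWithout(child, blocked));
        }
        return copy;
    }

    /** Names of the direct children, in insertion order. */
    public static List<String> names(Node dir) {
        List<String> out = new ArrayList<>();
        for (Node child : dir.children) {
            out.add(child.name);
        }
        return out;
    }

    /** Entry names taken from the log below, plus two hypothetical ones. */
    public static Node sample() {
        return new Node("Root Entry")
            .add(new Node("CompObj"))
            .add(new Node("1Table"))
            .add(new Node("SummaryInformation"))
            .add(new Node("WordDocument"))
            .add(new Node("DocumentSummaryInformation"))
            .add(new Node("Macros").add(new Node("VBA")))   // "VBA": hypothetical child
            .add(new Node("EmbeddedDoc")                    // hypothetical embedded file
                .add(new Node("WordDocument"))
                .add(new Node("Macros")));
    }

    public static void main(String[] args) {
        Node clean = copyWithout(sample(), Set.of("CompObj", "Macros"));
        System.out.println("Entries after: " + names(clean));
    }
}
```

Copying into a fresh tree rather than deleting in place means anything
not explicitly copied never reaches the output. As the rest of this
thread shows, this alone is not enough: the streams that are kept
(Workbook, WordDocument, ...) can still reference the removed project.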

I haven't checked hyperlinks yet, and didn't even think they could
be a threat. Thanks!

--

I'll check the VBAMacroReader class again; maybe it contains
the answers I'm looking for.

> Feel free to continue this discussion over on d...@poi.apache.org,
I'll stick here for now; I may migrate when I start working on POI code.


Thanks for your cycles!


> On 3 May 2017, at 8:31 AM, Javen O'Neal <one...@apache.org> wrote:
> 
> If you haven't discovered it already, Microsoft has published the
> specifications for the Microsoft Office OLE2 binary file formats.
> 
> https://msdn.microsoft.com/library/cc313118.aspx
> Scroll down to Download "Office File Formats PDF .zip file"
> At the very least, you'll want to skim through [MS-OVBA] and [MS-XLS].
> 
> One other thing to think about: you can embed almost any kind of file
> inside of an office document. That embedded file could be a
> standalone executable, a script, or a malicious macro-enabled
> document. Are you planning on recursively searching through embedded
> documents and removing these threats? Are you also scanning for
> hyperlinks and ActiveX controls or other components that may not
> require VBA code but perform some potentially dangerous underlying
> action? Something as benign as loading an external resource (loading
> content from an external file or sending an http request that could
> send data encoded in the URL to some malicious server). Or something
> as benign as asking the OS to use the system's default media player to
> play an audio clip or video could cause problems with a cleverly
> crafted payload and a security hole in that application. DocBleach
> would be helpful for disabling/removing anything that Microsoft Office
> or other target office applications haven't sufficiently mitigated
> against.
> 
> On Tue, May 2, 2017 at 10:59 PM, Javen O'Neal <one...@apache.org> wrote:
>> Hey, PunKeel, this is great!
>> 
>> If your software is based on POI and you'd like to upstream some of your
>> changes to POI to make your library more straightforward, send us a
>> pull request and we'll review it, give feedback, and commit it.
>> 
>> I have barely dabbled in how VBA projects are saved in the OLE2
>> formats (VBAMacroReader class), but perhaps others have some ideas and
>> a few free cycles to spare (being a volunteer project and all).
>> 
>> Keep in mind that PPT files save macros in a different part of the
>> OLE2 file structure than XLS and DOC. A reminder that these binary
>> formats weren't simultaneously developed by the same software team at
>> the same time.
>> 
>> The following bugs might be of interest to you:
>> https://bz.apache.org/bugzilla/buglist.cgi?bug_id=52949%2C59302%2C60273%2C59830%2C59858%2C60158&list_id=159842
>> 
>> Feel free to continue this discussion over on d...@poi.apache.org,
>> where there might be a better technical audience who could point out
>> some of the POI internals. Most of the POI devs monitor both mailing
>> lists, so which mailing list probably doesn't matter too much.
>> 
>> Javen
>> 
>> On Tue, May 2, 2017 at 6:19 PM, PunKeel <punk...@me.com> wrote:
>>> Hello,
>>> 
>>> I've found out the solution for Excel files: removing the "ObProj" record of
>>> the Workbook.
>>> 
>>> The code I'm using is available here:
>>> https://gist.github.com/PunKeel/0e72ccde78cb0150383a9ced094c2bce
>>> I don't think this is the cleanest way to achieve it, so I'm open to
>>> suggestions.
>>> 
>>> It also doesn't work for Word/PPT files. If I understand correctly (let's
>>> hope I do), the
>>> (Current User, PowerPoint Document) and (WordDocument, 0Table, 1Table)
>>> streams
>>> need to be edited by hand because Apache POI is lacking the APIs for these
>>> formats.
>>> ~> How? Are they like "DocumentStreams"? May I use the "RecordFactory"?
>>> 
>>> Am I right? I am more than open to suggestions/help, please!
>>> 
>>> Best regards,
>>> 
>>> 
>>> On 1 May 2017, at 6:15 PM, PunKeel <punk...@me.com> wrote:
>>> 
>>> Hello,
>>> 
>>> I am currently building an open-source software to disarm Office files,
>>> named DocBleach [1]
>>> but I am stuck with some specificities of the OLE2 format.
>>> 
>>> First of all, I would like to thank you for the great library that is Apache
>>> POI!
>>> 
>>> When opening disarmed OLE2 files in Office Excel/Word on Windows 10 (haven't
>>> checked other versions),
>>> an alert is displayed, depending on the editor:
>>> 
>>> - Excel says that the file is corrupted and needs to be repaired. The error:
>>> "Lost Visual Basic project".
>>> ~> Once repaired, the Macro Viewer is unable to tell us the name of the
>>> Macros
>>> 
>>> - Word tells us that the file is unsafe because it contains Macros.
>>> ~> The "Macro Viewer" is able to tell us the name of the Macros
>>> 
>>> ----
>>> 
>>> As you know, OLE2 files are "file systems in a file".
>>> In order to remove the Macros of a document, I remove the Macros
>>> "directory".
>>> 
>>> Sample log, for the record. (The process being the same for Word/Excel, I'll
>>> only give one).
>>> Relevant code is available at [2]
>>> Sample files, with their sanitised form (named "-free")
>>> ~> https://www.mediafire.com/folder/yh122tgbyzadw/Sample_2017_May_1
>>> 
>>> #####
>>> 
>>> $ java -jar docbleach.jar -vv -in ../Goodware/macro.doc -out - >
>>> macro-free.doc
>>> [main] DEBUG xyz.docbleach.cli.Main - Log Level: TRACE
>>> [main] TRACE xyz.docbleach.api.bleach.CompositeBleach - Using bleach: OLE2
>>> Bleach
>>> [main] DEBUG xyz.docbleach.module.ole2.OLE2Bleach - Entries before:
>>> [CompObj, 1Table, SummaryInformation, WordDocument,
>>> DocumentSummaryInformation, Macros]
>>> [main] DEBUG xyz.docbleach.module.ole2.OLE2Bleach - Root ClassID:
>>> {00020906-0000-0000-C000-000000000046}
>>> 
>>> [main] TRACE xyz.docbleach.module.ole2.OLE2Bleach - copyNodesRecursively:
>>> SummaryInformation, parent:
>>> org.apache.poi.poifs.filesystem.DirectoryNode@2e5c649
>>> [main] TRACE xyz.docbleach.module.ole2.OLE2Bleach - copyNodesRecursively:
>>> DocumentSummaryInformation, parent:
>>> org.apache.poi.poifs.filesystem.DirectoryNode@2e5c649
>>> [main] TRACE xyz.docbleach.module.ole2.OLE2Bleach - copyNodesRecursively:
>>> WordDocument, parent: org.apache.poi.poifs.filesystem.DirectoryNode@2e5c649
>>> [main] TRACE xyz.docbleach.module.ole2.OLE2Bleach - copyNodesRecursively:
>>> 1Table, parent: org.apache.poi.poifs.filesystem.DirectoryNode@2e5c649
>>> 
>>> [main] DEBUG xyz.docbleach.module.ole2.OLE2Bleach - Entries after: [1Table,
>>> SummaryInformation, WordDocument, DocumentSummaryInformation]
>>> 
>>> #####
>>> 
>>> The CompObj and Macros entries are removed (not copied), so the Macros
>>> *can't* work.
>>> 
>>> I've been trying a lot of things, especially with Excel files (they only
>>> contain a Workbook,
>>> SummaryInformation and DocumentSummaryInformation) and I've found out the
>>> Workbook
>>> was at fault: the two summaries did not contain the "macro reference", and I
>>> recreate the file
>>> from scratch so it has to be in an entry.
>>> 
>>> If I understand correctly, there are "entries" in the Workbook/WordDocument
>>> holding the Macros.
>>> I found "stwUser" in the Word documentation [3], and I assume that I need to
>>> remove it, but couldn't
>>> find a unified API to achieve it for Word/Excel/PowerPoint documents.
>>> 
>>> My question: is there an "easy" API to interact with these entries, removing
>>> parts of it?
>>> If so, could you please give me some leads/examples on how to do it?
>>> If not, do you have tips on how to achieve something similar?
>>> 
>>> I could iterate over the Workbook/Document to copy it over manually, without
>>> the Macros…
>>> but if the API allowing it is not unified, I would have to do it for
>>> XLS/Word/PPT files, right?
>>> Doesn't seem like the easy path! :-(
>>> 
>>> Thanks in advance!
>>> 
>>> - PunKeel
>>> 
>>> [1]: https://github.com/docbleach/DocBleach
>>> [2]:
>>> https://github.com/docbleach/DocBleach/blob/master/module/module-office/src/main/java/xyz/docbleach/module/ole2/OLE2Bleach.java
>>> [3]: https://msdn.microsoft.com/en-us/library/dd923194(v=office.12).aspx
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: user-unsubscr...@poi.apache.org
>>> For additional commands, e-mail: user-h...@poi.apache.org
>>> 
>>> 
> 
