Re: List of Slides for my China talk this coming weekend

Yegor Kozlov Mon, 15 Oct 2018 02:01:18 -0700

Hi Dave,

> (2) POI


>         When it started in Jakarta the simple use case.
>         End of Jakarta
>

It's worth mentioning how hard it was to develop the APIs for the binary
formats with little or no documentation. You can say that the letter
'H' in HSSF stands for 'horrible' and that the early work involved a lot
of guessing and reverse engineering.



> (3) OOXML and the Microsoft Open Specification Promise
>         The OSP
>         The flame war
>         OpenXML4J -
> http://incubator.apache.org/ip-clearance/openxml4j.html
>         XSSF, XSLF, and SS
> (4) Tika and OOXML lite
>         ApacheCon Oakland 2009 - Jukka asked Nick, Yegor and me during
> BarCamp if we could do something about the 13MB ooxml jar. Yegor came up with
> a solution in a day.
>         Unit Test and your Beans are included
>         —> Anyone: anything to add? XMLBeans impacts?
> (5) Graphics2D
>         Discuss output techniques developed.
>         —> Yegor - is there some sample code you might share.
>

We have a good collection of examples at
http://svn.apache.org/repos/asf/poi/trunk/src/examples/src/org/apache/poi



> (6) Tika Text Extraction
>         —> Could use pointers to the basic tutorial.
>

Say that Tika is the de facto standard for extracting text in the Java world.
When a Java project extracts text from an MS Office file, it usually does so
through Tika and POI. Solr, Jackrabbit and Nutch are examples.


> (7) Common Crawl - 1TB of samples
>         Common Crawl - commoncrawl.org
>         Common Crawl Download - centic9
>         Regression sets for POI, Tika and PDFBox
>         —> Are there other Apache projects that use these documents?
>
> (8) The POI Toolbox
>         A table of the various formats with input, output, and remarks.
>

Give a quick overview of the supported features. Excel, PowerPoint and Word
are the "big three" and the most mature.
To manipulate the formats we provide DOM-like APIs that construct a tree
of objects in memory.
To extract data we provide single-pass, SAX-like parsers, which have a lower
memory footprint.
Show the how-to code snippets from the POI site.
Mention that POI can evaluate Excel formulas.
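
A minimal sketch of that DOM-versus-SAX trade-off, in case it helps for a
slide. It uses only the JDK's built-in XML parsers as a stand-in, not POI's
own classes; the class name, method names and sample XML are made up for
illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.Document;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

class DomVsSax {
    // Hypothetical sample input; not a real spreadsheet part.
    static final String XML =
        "<sheet><row><c>1</c><c>2</c></row><row><c>3</c></row></sheet>";

    // DOM style: build the whole tree in memory, then query it.
    static int countCellsDom(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        return doc.getElementsByTagName("c").getLength();
    }

    // SAX style: a single pass over the input, keeping only a counter.
    static int countCellsSax(String xml) throws Exception {
        final int[] count = {0};
        SAXParserFactory.newInstance().newSAXParser().parse(
            new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)),
            new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName,
                                         String qName, Attributes attrs) {
                    if ("c".equals(qName)) {
                        count[0]++;
                    }
                }
            });
        return count[0];
    }

    public static void main(String[] args) throws Exception {
        System.out.println("DOM count: " + countCellsDom(XML));
        System.out.println("SAX count: " + countCellsSax(XML));
    }
}
```

Both methods return the same count; the difference is that the DOM version
holds every node in memory at once, while the SAX version sees each element
once and discards it.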

> (9) XMLBeans 3
>         Bringing the product out of the attic.
>         —> Any reasons besides better control of Entity Expansion attacks?
> (10) Contributing to POI and Tika Will Improve Your Solr Search Results
>         How Solr and similar architectures depend on Tika and Tika depends
> on POI
>         Example is Headers and Footers choices on Word documents on the
> Tika List this past week.
>
>

It might be worth mentioning the Panama Papers story: the information
in the leaked documents was extracted using Tika. If Tika and POI didn't
exist, it would have taken years to process these files. With Tika it was a
matter of hours.

Yegor

>
>
