Re: Writer and .docx

Damjan Jovanovic Fri, 16 Oct 2020 08:36:44 -0700

On Fri, Oct 16, 2020 at 4:24 PM Carl Marcum <cmar...@apache.org> wrote:


> Hi Damjan,
>
> On 10/16/20 9:23 AM, Damjan Jovanovic wrote:
> > On Fri, Oct 16, 2020 at 2:05 PM Dave Fisher <wave4d...@comcast.net> wrote:
> >
> >> Hi -
> >>
> >> Sent from my iPhone
> >>
> >>> On Oct 16, 2020, at 4:04 AM, Mechtilde <o...@mechtilde.de> wrote:
> >>>
> >>> Hello Joost,
> >>>
> >>> I'm very happy to read from you.
> >>>
> >>>> Am 16.10.20 um 12:50 schrieb Joost Andrae:
> >>>> Hi Simon,
> >>>>
> >>>> It's an honor for me to see a sign of life from you here. Welcome!
> >>>>
> >>>> Instead of user picking here to get users to leave AOO for LO, a
> >>>> developer could create a Java based OOo/LO extension that uses Apache
> >>>> POI to export OpenDocument type documents to MSXML formats, using the
> >>>> binary MSO export as an intermediate step. Or maybe it's possible to
> >>>> transform this document format with XSL by using OpenOffice together
> >>>> with Apache POI. Using XSL scripts (in the AOO menu item "XML Filter
> >>>> Settings") to make document conversions is possible within OOo.
> >>> I offer my help to test the implementation. Sorry, but I'm not a
> >>> programmer. So we as the project need help from Java programmers to
> >>> work on it and contribute it.
> >> I’m a PMC Member of Apache POI for over 12 years. My team donated the
> >> initial PowerPoint support and were involved in the initial support for
> >> OOXML.
> >>
> >> POI is embedded into Apache Solr and Tika along with commercial
> >> products. The project took over the dormant XMLBeans project and is
> >> releasing a 4.0 that supports modern Java.
> >>
> >> An OSGi bundle of POI will be available in the next release if you build
> >> from source.
> >>
> >> The Tika, POI, and PDFBox projects maintain a large regression corpus
> >> scraped from the internet using CommonCrawl. I’m sure that this could be
> >> shared in one way or another.
> >>
> >> Regards,
> >> Dave
> >>
> >>
> > Hi
> >
> > I did start writing a POI-based OOXML export filter for AOO some years
> > ago (search the dev mailing list), and got it to the point of being able
> > to save very basic spreadsheets (no formulas, no formatting, just text
> > and numbers).
> >
> > There were several major problems with using POI.
> >
> > Firstly, the code in POI is at various stages of completeness. The legacy
> > XLS filter is very good, supports SAX parsing, etc. The DOC filter is
> > minimal and unmaintained. What we would need, the OOXML filter for at
> > least XLSX, is somewhere in between. AFAIK it only supports DOM parsing,
> > meaning everything needs to be in memory before it can be written to
> > disk, so a big spreadsheet could consume gigabytes of RAM during saving,
> > and if you don't have enough free memory, you can't save!
> >
> > Also, I do use POI at work, and it's outstanding for parsing spreadsheets
> > (it can even parse some that AOO can't), but it's very memory hungry. A
> > spreadsheet with 100,000 rows consumed 6 GB of RAM, compared to 200 MB in
> > LO (30 times less). That isn't really POI's fault: Java has too much
> > per-object overhead, and there are a great many objects in a spreadsheet
> > that big. So DOM + Java really do not add up to efficient memory usage.
> > By comparison, our current OOXML reading is not only SAX-based, but also
> > converts XML tags to integers for faster comparisons and lower memory
> > usage.
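
To illustrate the tag-to-integer idea mentioned above, here is a minimal sketch in plain Java. It is not AOO's actual tokenizer: the class name TagTokens, the token constants and their values are all hypothetical. The tag names "row", "c" and "v" are the SpreadsheetML element names for rows, cells and cell values.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: map each XML tag name to a small int once,
// so later code compares ints instead of strings.
class TagTokens {
    static final int UNKNOWN = 0;
    static final int ROW = 1;
    static final int CELL = 2;
    static final int VALUE = 3;

    private static final Map<String, Integer> TOKENS = new HashMap<>();
    static {
        TOKENS.put("row", ROW);
        TOKENS.put("c", CELL);
        TOKENS.put("v", VALUE);
    }

    // One string lookup per element; unknown tags map to UNKNOWN.
    static int tokenOf(String tag) {
        return TOKENS.getOrDefault(tag, UNKNOWN);
    }

    // Downstream code only sees ints, so a comparison is a single integer test.
    static int countCells(String[] tags) {
        int cells = 0;
        for (String tag : tags) {
            if (tokenOf(tag) == CELL) {
                cells++;
            }
        }
        return cells;
    }
}
```

A SAX handler would call tokenOf once per start-element event and pass the int along, so nothing beyond that point has to hold or compare tag strings.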
> >
> > Finally, AOO itself had limitations that made developing a filter in Java
> > difficult. Each sheet in a spreadsheet has 1 billion cells. Obviously
> > only a minority of these contain data; most are empty. In C++ there are
> > special iterators that can be used to access only the non-empty cells,
> > but these are not exposed to UNO or, through it, to Java. The only way
> > to tell which cells are in use is to iterate over all 1 billion cells
> > (per sheet), which is hopelessly slow.
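
As a rough illustration of why a non-empty-cell iterator matters, here is a minimal sketch in plain Java. It does not use the UNO API or AOO's C++ iterators: SparseSheet and CellVisitor are hypothetical names, and the grid size of 1,048,576 rows by 1,024 columns is an assumption chosen to match the "1 billion cells" figure.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: store only non-empty cells and iterate over those,
// instead of scanning every position in the grid.
class SparseSheet {
    static final int MAX_ROWS = 1_048_576; // assumed sheet height
    static final int MAX_COLS = 1_024;     // assumed sheet width

    interface CellVisitor {
        void visit(int row, int col, String value);
    }

    // Key is row * MAX_COLS + col, so iteration runs in row-major order.
    private final TreeMap<Long, String> cells = new TreeMap<>();

    void set(int row, int col, String value) {
        if (row < 0 || row >= MAX_ROWS || col < 0 || col >= MAX_COLS) {
            throw new IndexOutOfBoundsException("cell outside sheet");
        }
        cells.put((long) row * MAX_COLS + col, value);
    }

    // Number of grid positions a naive full scan would have to visit.
    static long totalCells() {
        return (long) MAX_ROWS * MAX_COLS;
    }

    int nonEmptyCount() {
        return cells.size();
    }

    // Visits only the cells that hold data; cost depends on the data, not the grid.
    void forEachNonEmpty(CellVisitor visitor) {
        for (Map.Entry<Long, String> e : cells.entrySet()) {
            long key = e.getKey();
            visitor.visit((int) (key / MAX_COLS), (int) (key % MAX_COLS), e.getValue());
        }
    }
}
```

With two cells set, forEachNonEmpty makes two calls, while a full scan under these assumed dimensions would touch 1,073,741,824 positions per sheet.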
> >
> > Some of these problems can be solved. We can expose the cell iterators
> > over UNO. The memory usage might not matter that much in practice, and
> > we could patch POI to do SAX parsing/saving at a later stage. But users
> > expect fonts, styles, charts, images, custom formats, OLE, pivot tables,
> > VBA macros, form controls, mathematical formulas, change tracking, etc.
> > all saved losslessly and 100% compatible with Excel, which requires work
> > not only in the filter but in the rest of AOO too, and POI probably
> > doesn't support all of those features either.
> I'm not sure if you've looked at the newer Streaming Usermodel API, SXSSF.
> It may help with memory consumption in this case.
>
>
Can SXSSF work with formulas that reference earlier cells?
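
For context on why the question matters: a streaming writer keeps only a sliding window of recent rows in memory and flushes older ones. The sketch below is plain Java and is not POI's SXSSF API; WindowedRowWriter and its methods are made up for illustration. It only shows that formula text pointing at an already-flushed row can still be written, since it is just a string. Whether a result could be computed for such a formula is not something this sketch answers.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.Consumer;

// Hypothetical sketch of a sliding-window row writer (not POI's SXSSF API).
class WindowedRowWriter {
    private final Deque<String> window = new ArrayDeque<>();
    private final int windowSize;
    private final Consumer<String> sink; // stands in for the file on disk
    private int flushed = 0;

    WindowedRowWriter(int windowSize, Consumer<String> sink) {
        if (windowSize < 1) {
            throw new IllegalArgumentException("windowSize must be >= 1");
        }
        this.windowSize = windowSize;
        this.sink = sink;
    }

    // Adds a row; when the window is full, the oldest row is flushed first
    // and is no longer reachable in memory afterwards.
    void addRow(String row) {
        if (window.size() == windowSize) {
            sink.accept(window.removeFirst());
            flushed++;
        }
        window.addLast(row);
    }

    int rowsInMemory() {
        return window.size();
    }

    int rowsFlushed() {
        return flushed;
    }

    // Flushes whatever is still held in the window.
    void finish() {
        while (!window.isEmpty()) {
            sink.accept(window.removeFirst());
            flushed++;
        }
    }
}
```

With a window of 2, adding the rows "1", "=A1+1" and "=A2+1" leaves two rows in memory and one flushed. The third row's formula text names row 2, which is still in the window, but the second row's reference to row 1 now points at data that has already left memory.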


> >
> > I might get back into this next month, especially if others want to
> > collaborate, but don't expect something generally usable, let alone
> > Excel-quality XLSX saving, any time soon.
> >
> > Regards
> > Damjan
> >
> Yes, I'm definitely interested in collaborating on this.
> Do you have a branch with your work in it?
>
>
It's been 5 years and the code is in bits and pieces, but I'll try to put
together a working branch over the weekend.


> Thanks,
> Carl
>
>
Thank you
Damjan
