Hi,

The current official guarantee that we make is described in the FAQs at
https://poi.apache.org/help/faq.html and it actually already states our
stance on the "per-sheet" level locking:

20. Can I access/modify workbooks/documents/slideshows in multiple threads?
What are the multi-threading guarantees that Apache POI makes

In short: *Handling different document-objects in different threads will
work. Accessing the same document in multiple threads will not work.*

This means the workbook/document/slideshow objects are not checked for
thread safety, but any globally held object like global caches or other
data structures are guarded against multi threaded access accordingly.

There have been discussions
<https://mail-archives.apache.org/mod_mbox/poi-user/201109.mbox/%3c1314859350817-4757295.p...@n5.nabble.com%3E>
about accessing different Workbook-sheets in different threads
concurrently. While this may work to some degree, it may lead to very hard
to track errors as multi-threading issues typically only manifest after
long runtime when many threads are active and the system is under high
load, i.e. in production use! Also it might break in future versions of
Apache POI as we do not specifically test using the library this way.


There are likely much more than 0.0x% people interested in performance when
handling large workbooks, both single-threaded or multi-threaded with
per-workbook threads. I rather think the "I need to parallelise per sheet"
a fairly uncommon use-case.

However I also don't think performance would suffer in any way if we add a
simple synchronized-lock, especially if only done for a small amount of
code during sheet creation.

But fully making it 100% safe on that level is hard to guarantee with many
caches and things kept on workbook-level..

So currently you are basically lucky if it works for you already that far!

I would be open to add some specific locks if it helps your case as long as
we consider performance impact small. So if you send a PR and none of the
active developers objects, we can see what we can do.

However I would strongly opt against officially declaring support at that
level because it would require a very thorough and specific set of tests to
uncover issues that might creep up and it also would limit us from doing
performance optimizations in the future.

Dominik.

On Thu, May 19, 2022 at 8:12 PM Andreas Reichel <
andr...@manticore-projects.com> wrote:

> Thank you for writing, Nick.
>
> On Thu, 2022-05-19 at 18:09 +0100, Nick Burch wrote:
> > On Mon, 16 May 2022, Andreas Reichel wrote:
> > > Of course I am fully aware that POI is not thread safety on
> > > workbook
> > > level and threads can only be used as long as every sheet is
> > > processed
> > > within its own thread (without altering styles).
> >
> > We only ensure that you are safe if each workbook is created in its
> > own
> > thread, we generally advise against having multiple threads working
> > on the
> > same workbook as it often goes wrong. It sounds like you've pushed
> > things
> > well beyond our normally recommended limits already...
>
> Oh! I thought we would be thread safe down to sheet level?
> Because I am able to create MANY and LARGE sheets in parallel just fine
> -- as long as actions within each thread refer to only the sheet.
>
> Styles and sheets are to be created before the threads spawn.
> Workbook formula evaluation is done after the threads joined.
>
> The only confusion was about:
>
> Sheet sheet = workbook.createSheet(p[4]);
>
> Because this returns the sheet (and so would have been expected within
> the thread), but in fact its Workbook related and must happen before
> the threads spawn.
>
> > Won't that slow down the 99.9% case, where people are just using a
> > single
> > thread for the whole workbook?
>
> Locking a Semaphore does not take (much) time.
> The 99.9% case creates some simple sheets and will not notice 100 ms
> for the locking.
>
> The remaining 0.01% who are pushing it and create 500 MB Excel files
> will save multiple minutes from parallelism.
> (My report builder reduce time from 20 mins to less than 2 minutes just
> by writing to the sheets in parallel. Its a huge benefit for us!)
>
> > And doesn't it risk people trying to do
> > things that aren't generally thread-safe because "that bit seemed
> > fine"
> > and getting stuck later?
>
> Very correct and agreed. I am still curious what the ACTUAL limitations
> are: Are we just overly defensive/careful declaring it "threadsafe as
> per Workbook" only?
> Or are there real show stoppers to work on different sheets in parallel
> (each sheet in its own thread without cross references of course).
>
> I feel like its the second one, and then we could synchronize those
> methods .
>
> Cheers
> Andreas
>

Reply via email to