Greetings,

We use PDFBox alongside Tika to support full-text search indexing and
querying. Our Windows test agents (fairly powerful AWS instances) began
timing out many tests after we upgraded PDFBox from 2.0.29 to 2.0.31. We
tracked the problem down to populating the on-disk font cache, which
takes between two and four MINUTES to complete on these instances. For
test consistency, these agents are fairly "clean" when they start up:
they have no on-disk font cache, so it is created the first time a PDF
is parsed. This has never been a problem before.

Our temporary workaround is to force the font cache to be generated in
the background at server startup. We call FontMapperImpl.getProvider()
via reflection; I wish there were a cleaner way to do this, but it gets
the job done. We are risking a race condition here, however, since the
tests could easily start indexing PDFs before the cache is written.
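
For anyone who wants to try the same workaround, here is a minimal sketch of the background warm-up with a handle that indexing code can wait on, which is one way to close the race described above. The class FontCacheWarmup and its method names are hypothetical. The reflective lookup assumes PDFBox 2.0.x is on the classpath and that FontMappers.instance() returns the default FontMapperImpl; it is not necessarily the exact code we run.

```java
import java.lang.reflect.Method;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper: runs the font cache warm-up once and lets callers wait for it. */
final class FontCacheWarmup {
    private static CompletableFuture<Long> warmup;

    private FontCacheWarmup() {}

    /**
     * Starts the task on a background thread at most once.
     * Later calls ignore their argument and return the same future,
     * which completes with the elapsed time in milliseconds.
     */
    static synchronized CompletableFuture<Long> start(Runnable task) {
        if (warmup == null) {
            warmup = CompletableFuture.supplyAsync(() -> {
                long begin = System.nanoTime();
                task.run();
                return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - begin);
            });
        }
        return warmup;
    }

    /**
     * Calls getProvider() on the default font mapper via reflection.
     * Assumption: the mapper is PDFBox 2.0.x's non-public FontMapperImpl,
     * whose getProvider() creates the default font provider on first use.
     */
    static void populatePdfBoxFontCache() {
        try {
            Class<?> mappers = Class.forName("org.apache.pdfbox.pdmodel.font.FontMappers");
            Object mapper = mappers.getMethod("instance").invoke(null);
            Method getProvider = mapper.getClass().getDeclaredMethod("getProvider");
            getProvider.setAccessible(true); // the implementing class is not public
            getProvider.invoke(mapper);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Could not warm the PDFBox font cache", e);
        }
    }
}
```

At server startup this would be called as FontCacheWarmup.start(FontCacheWarmup::populatePdfBoxFontCache), and any code about to parse its first PDF could call join() on the returned future so that it blocks until the cache has been written instead of racing it.
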

We log elapsed time for populating this cache. The last three runs show:
- Ensuring PDFBox on-disk font cache took 182.3 seconds
- Ensuring PDFBox on-disk font cache took 230.6 seconds
- Ensuring PDFBox on-disk font cache took 137.6 seconds

Significant variance, but always multiple minutes. My local (Windows)
laptop takes a fraction of a second to recreate this cache, so our
ability to debug or profile this performance problem is limited. The
problem showed up with 2.0.31, so we don't have timings from previous
PDFBox versions.

I realize there's not a lot of information to go on here, but I'm
curious whether anyone else has experienced this with 2.0.31. We're happy to
provide more information from our instances... maybe turning on
additional logging would be helpful? Count and size of fonts on these
instances?
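
If the font count and size turn out to be useful, something like the following sketch is how we would collect them. FontInventory is a hypothetical helper, and both the extension list and the directory to scan (presumably C:\Windows\Fonts on our agents) are assumptions about what is relevant, not a statement of what PDFBox actually scans.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical helper: counts font-like files under a directory and sums their sizes. */
final class FontInventory {
    private FontInventory() {}

    /** Returns {fileCount, totalBytes} for font-like files under dir, recursively. */
    static long[] summarize(Path dir) throws IOException {
        try (Stream<Path> paths = Files.walk(dir)) {
            List<Path> fonts = paths
                    .filter(Files::isRegularFile)
                    .filter(FontInventory::looksLikeFont)
                    .collect(Collectors.toList());
            long totalBytes = 0;
            for (Path font : fonts) {
                totalBytes += Files.size(font);
            }
            return new long[] { fonts.size(), totalBytes };
        }
    }

    /** Assumed set of font file extensions; adjust as needed. */
    static boolean looksLikeFont(Path path) {
        String name = path.getFileName().toString().toLowerCase(Locale.ROOT);
        return name.endsWith(".ttf") || name.endsWith(".otf")
                || name.endsWith(".ttc") || name.endsWith(".pfb");
    }
}
```

Running FontInventory.summarize(Paths.get("C:\\Windows\\Fonts")) on an agent and on my laptop would show whether the two machines differ much in the number or total size of installed fonts.
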

Thanks,
Adam