Over on Apache Tika (via PDFBox!), we report the number of characters
without Unicode mappings, and, if you add our tika-eval jar, you can also
get an "out of vocabulary" statistic that is an indicator that extracted
text is garbage. Happy to chat over on u...@tika.apache.org on either of
those topics.
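
For anyone curious what an "out of vocabulary" check looks like in principle, here is a rough Python sketch. It is not tika-eval's actual implementation: the function name `oov_ratio` and the tiny `COMMON_WORDS` set are made up for illustration, and a real check would use a large word list for the expected language.

```python
import re

# Hypothetical, deliberately tiny vocabulary (illustration only).
COMMON_WORDS = {
    "the", "of", "and", "to", "a", "in", "is", "that", "this",
    "file", "text", "pdf", "valid",
}


def oov_ratio(text, vocabulary=COMMON_WORDS):
    """Return the fraction of alphabetic tokens that are not in `vocabulary`.

    A value near 1.0 suggests the extracted text is mostly unrecognisable.
    """
    # Runs of Unicode letters only; digits, underscores and punctuation are skipped.
    tokens = [t.lower() for t in re.findall(r"[^\W\d_]+", text)]
    if not tokens:
        # No words at all: treat as fully out of vocabulary.
        return 1.0
    misses = sum(1 for t in tokens if t not in vocabulary)
    return misses / len(tokens)
```

Running this over text extracted from a suspect PDF and over text from a known-good PDF of similar content gives a crude comparison; any cut-off for "garbage" would have to be chosen empirically.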

Would be interesting to see if veraPDF also detects unmapped Unicode
chars, missing/broken fonts, etc.

On Tue, Jun 27, 2023 at 11:30 AM Susan Borda <sbo...@umich.edu> wrote:

> Thanks Tilman, exactly the info I needed.
>
> On Mon, Jun 26, 2023 at 10:21 PM Tilman Hausherr <thaush...@t-online.de>
> wrote:
>
> > Hi,
> > PDFBox preflight only checks PDF/A-1b conformance, not accessibility.
> > Maybe your PDF was deliberately made inaccessible to prevent scraping.
> > Try https://verapdf.org/
> > Tilman
> >
> > On 26.06.2023 19:36, Susan Borda wrote:
> > > Hi All-
> > > I'd like to check PDFs that have character encoding issues, does Preflight
> > > do that? I checked the accessibility of a pdf file in Adobe Pro and it gave
> > > me a "Character encoding -Failed" message. When I checked this same file in
> > > Preflight I got this:
> > >
> > > Jun 26, 2023 1:24:41 PM org.apache.pdfbox.pdmodel.graphics.color.PDICCBased ensureDisplayProfile
> > > WARNING: ICC profile is Perceptual, ignoring, treating as Display class
> > > Jun 26, 2023 1:24:41 PM org.apache.pdfbox.pdmodel.graphics.color.PDICCBased ensureDisplayProfile
> > > WARNING: ICC profile is Perceptual, ignoring, treating as Display class
> > > Jun 26, 2023 1:24:41 PM org.apache.pdfbox.pdmodel.graphics.color.PDICCBased ensureDisplayProfile
> > > WARNING: ICC profile is Perceptual, ignoring, treating as Display class
> > > The file BritishLibrary-PDF_Assessment_v1.3.pdf is a valid PDF/A-1b file
> > >
> > > When I try to copy/paste the text from this PDF it's all garbage and the
> > > CMap is missing.
> > >
> > > Any advice would be greatly appreciated.
> > > Thanks,
> > > susan
> >
>
> --
> Susan Borda
> Digital Preservation Projects Manager
> Digital Preservation Unit
> University of Michigan Libraries
> Buhr Building
> sbo...@umich.edu
> *My office phone number is temporarily disconnected while I work remotely
> due to COVID-19. Please contact me via email.*
>
