[
https://issues.apache.org/jira/browse/SOLR-10934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Hoss Man updated SOLR-10934:
----------------------------
Attachment: SOLR-10934.patch
Ok, I'm attaching a really rough and dirty patch that includes:
* A quick and dirty CheckPDFLinksAndAnchors inspired by the SO post mentioned
and the original PrintURLs.java demo from pdfbox
* a build.xml 'nocommit' target to run it against our PDF
* some "broken" changes to our ref-guide content to deliberately introduce a few
errors...
*# anchor duplicated in multiple source pages
*# links to each of the different dup anchors
*# link to an anchor that doesn't exist in the specified source doc, but does
exist in a different doc
*# links to a source doc that doesn't exist
*# links to an anchor that doesn't exist (in a source doc that does)
The results aren't promising...
# FAIL: the dup anchors cause asciidoctor to print a WARNING (even w/o any link
checking) that i'd forgotten about, but as far as i can tell from my
exploration of the {{PDDocumentCatalog}} that duplicated information is lost in
the underlying PDF (or if it does make it into the PDF, PDFBox loses it when
parsing the PDF, because the "Catalog" is just a Map)
# FAIL: the PDF Annotations for the links to each of the dup anchors both wind up
mapping to the page with the first occurrence -- again: either because the
catalog in the file can only track one location for a given anchor, or because
that's just how PDFBox deals with the precedence of dup dict keys when reading
the file
# FAIL: if an anchor doesn't exist in the specified source {{\*.adoc}} file,
but does exist somewhere else in the final PDF, then that's where asciidoctor
points the generated link -- there's nothing weird about it i can detect from
PDFBox
# GOOD: links to a source {{\*.adoc}} file that doesn't actually exist on disk
are fairly easy to detect -- asciidoctor's default behavior is to assume that
these are links to other docs that will be converted separately, so they show
up as "relative URIs" which we can treat as a failure (ie: if a link in a PDF
is to a non-absolute URI, it must be a content error)
# GOOD: links to an anchor that doesn't exist are likewise easy to identify:
the "annotation" is preserved but has no destination, which we can treat as a
failure.
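The two GOOD cases above boil down to a small classification rule once the target of each link annotation has been read out of the PDF. Here is a minimal sketch of just that rule, using only java.net.URI. The names PdfLinkRules, Verdict and classify are hypothetical and not taken from the attached patch, and the PDFBox side (walking the pages and reading each annotation) is deliberately left out:

```java
import java.net.URI;
import java.net.URISyntaxException;

/** Hypothetical helper for illustration; not part of the attached patch. */
final class PdfLinkRules {
  enum Verdict { OK, RELATIVE_URI, NO_DESTINATION, MALFORMED_URI }

  /**
   * @param uri         target of a URI-style link, or null for an internal link
   * @param hasPageDest whether an internal link resolved to a page destination
   */
  static Verdict classify(String uri, boolean hasPageDest) {
    if (uri != null) {
      try {
        // A URI without a scheme is "relative": asciidoctor emitted it for a
        // source doc it assumed would be converted separately.
        return new URI(uri).isAbsolute() ? Verdict.OK : Verdict.RELATIVE_URI;
      } catch (URISyntaxException e) {
        return Verdict.MALFORMED_URI;
      }
    }
    // Internal link: the annotation survives, but may point nowhere.
    return hasPageDest ? Verdict.OK : Verdict.NO_DESTINATION;
  }
}
```

With this rule, a target like nocommit_bogus_page.pdf#nocommit_bogus_x2 is classified as RELATIVE_URI, and an internal link with no page destination as NO_DESTINATION. Neither FAIL case is caught, since in both of them the link resolves to a real page.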
The important bits of the output w/this patch are included below...
{noformat}
-build-raw-pdf:
[asciidoctor:convert] Render SolrRefGuide-all.adoc from
/home/hossman/lucene/dev/solr/build/solr-ref-guide/content/pdf to
/home/hossman/lucene/dev/solr/build/solr-ref-guide/pdf-tmp with backend=pdf
[asciidoctor:convert] asciidoctor: ERROR: about-this-guide.adoc: line 1:
invalid part, must have at least one section (e.g., chapter, appendix, etc.)
[asciidoctor:convert] asciidoctor: ERROR: solr-glossary.adoc: line 1: invalid
part, must have at least one section (e.g., chapter, appendix, etc.)
[asciidoctor:convert] asciidoctor: WARNING: errata.adoc: line 30: id assigned
to section already in use: nocommit_dup_anchor_name
[asciidoctor:convert] asciidoctor: ERROR: SolrRefGuide-all.adoc: line 37:
invalid part, must have at least one section (e.g., chapter, appendix, etc.)
[move] Moving 1 file to
/home/hossman/lucene/dev/solr/build/solr-ref-guide/pdf-tmp
...
nocommit:
[java] Page 753:'Link to bogus page @ anchor that does not exist'=> BOGUS
URI: nocommit_bogus_page.pdf#nocommit_bogus_x2
[java] Page 753:'Link to about @ anchor that does not exist' => link with
no page dest
{noformat}
----
All in all these results are disappointing.
The "Single Page" output behavior of asciidoctor, combined with the "bugs" in
asciidoctor's handling of duplicated anchors in page includes, combined with the
underlying structure of the PDF, make it really hard to find the same types of
failures we can find when parsing the jekyll generated pages using our
white-box knowledge of "there must be no dup anchors across all pages".
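For contrast, the white-box rule that works on the per-page output amounts to grouping anchors by the page that declares them, before anything is merged into a single document. This is a minimal sketch that assumes the anchors have already been extracted per page. The class name DupAnchorCheck and the input shape are hypothetical, not taken from CheckLinksAndAnchors.java:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper for illustration; not part of the attached patch. */
final class DupAnchorCheck {
  /**
   * @param anchorsByPage page name -> anchor ids declared on that page
   * @return anchor id -> every page declaring it, restricted to ids that are
   *         declared more than once (a page repeating an id is listed twice)
   */
  static Map<String, List<String>> findDuplicates(Map<String, List<String>> anchorsByPage) {
    Map<String, List<String>> pagesByAnchor = new TreeMap<>();
    for (Map.Entry<String, List<String>> e : anchorsByPage.entrySet()) {
      for (String anchor : e.getValue()) {
        // Keep every declaring page; a plain name -> location map would
        // silently keep only one of them.
        pagesByAnchor.computeIfAbsent(anchor, k -> new ArrayList<>()).add(e.getKey());
      }
    }
    pagesByAnchor.values().removeIf(pages -> pages.size() < 2);
    return pagesByAnchor;
  }
}
```

This only works because the page that declares each anchor is still known at that point. In the single PDF, a name appears to map to just one location, so the same grouping has nothing to work with.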
> create a link+anchor checker for the ref-guide PDF using PDFBox
> ---------------------------------------------------------------
>
> Key: SOLR-10934
> URL: https://issues.apache.org/jira/browse/SOLR-10934
> Project: Solr
> Issue Type: Sub-task
> Security Level: Public(Default Security Level. Issues are Public)
> Components: documentation
> Reporter: Hoss Man
> Attachments: SOLR-10934.patch
>
>
> We currently have CheckLinksAndAnchors.java which is automatically run
> against the ref-guide HTML as part of the build to use JSoup to find bad
> links/anchors that asciidoctor doesn't complain about -- but not everyone
> does/can build the HTML version of the ref-guide since it requires
> manually installing jekyll.
> The PDF build only requires things installed by ivy (via JRuby) and we
> already have some PDFBox based code in ReducePDFSize.java that operates on
> this PDF every time it's run -- so if we can find a way to do similar checks
> using the PDFBox API we could catch these broken links faster.