On Mon, 18 Nov 2024 16:04:20 -0500 Bruce Momjian <br...@momjian.us> wrote:
> On Mon, Nov 11, 2024 at 10:02:15PM +0900, Yugo Nagata wrote:
> > On Tue, 5 Nov 2024 10:08:17 +0100
> > Peter Eisentraut <pe...@eisentraut.org> wrote:
> > 
> > > >> So you convert LATIN1 characters to HTML entities so that it's easier
> > > >> to detect non-LATIN1 characters in the SGML docs? If my understanding
> > > >> is correct, it can also be achieved by using some tools like:
> > > >> 
> > > >> 	iconv -t ISO-8859-1 -f UTF-8 release-17.sgml
> > > >> 
> > > >> If there are some non-LATIN1 characters in release-17.sgml,
> > > >> it will complain like:
> > > >> 
> > > >> 	iconv: illegal input sequence at position 175
> > > >> 
> > > >> An advantage of this is that we don't need to convert each LATIN1
> > > >> character to an HTML entity, which makes the sgml file authors' lives
> > > >> a little bit easier.
> > > 
> > > I think the iconv approach is an idea worth checking out.
> > > 
> > > It's also not necessarily true that the set of characters provided by
> > > the built-in PDF fonts is exactly the set of characters in Latin 1. It
> > > appears to be close enough, but I'm not sure, and I haven't found any
> > > authoritative information on that.
> > 
> > I found a description in the FAQ on Apache FOP [1] that explains that some
> > glyphs for the Latin1 character set are not contained in the standard text
> > fonts:
> > 
> >   The standard text fonts supplied with Acrobat Reader have mostly glyphs
> >   for characters from the ISO Latin 1 character set. For a variety of
> >   reasons, even those are not completely guaranteed to work, for example
> >   you can't use the fi ligature from the standard serif font.
> 
> So, the failure of ligatures is usually caused by not using the right
> Adobe Font Metric (AFM) file, I think.  I have seen faulty ligature
> rendering in PDFs but was always able to fix it by using the right AFM
> file.  Odds are, the failure is caused by using a standard Latin1 AFM file
> and not the AFM file that matches the font being used.
> 
> > [1] https://xmlgraphics.apache.org/fop/faq.html#pdf-characters
> > 
> > However, it seems that using iconv to detect non-Latin1 characters may
> > still be useful, because these are likely not displayed in the PDF. For
> > example, we can do this in make check, as in the attached patch 0002. It
> > cannot show the filename where one is found, though.
> 
> I was thinking something like:
> 
> 	grep -l --recursive -P '[\x80-\xFF]' . |
> 	while read FILE
> 	do	iconv -f UTF-8 -t ISO-8859-1 "$FILE" || exit 1
> 	done
> 
> This only checks files with non-ASCII characters.

Checking for non-Latin1 characters only in files that contain non-ASCII
characters seems like a good idea. I have attached an updated patch (0002)
that uses Perl instead of grep, because non-GNU grep may not support hex
escape sequences. (A rough hand-run version of the same check is sketched
below.)

> > > Another approach for a fix would be to get FOP to produce the required
> > > warnings or errors more reliably. I know it has a bunch of logging
> > > settings (ultimately via log4j), so there might be some possibilities.
> > 
> > When a character that cannot be displayed in PDF is found, a warning
> > "Glyph ... not available in font ..." is output in fop's log. We can
> > prevent such characters from being contained in the PDF by checking for
> > that message, as in the attached patch 0001. However, this is checked
> > after the PDF is generated, since I could not find a way to terminate the
> > generation immediately when such a character is detected.
> 
> So, are we sure this will be the message even for non-English users? I
> thought checking for warning message text was too fragile.
I am not sure whether fop provides messages in languages other than English;
at least, I have never seen it output Japanese messages. I wonder if we can
get consistent results by running it with LANG=C. The updated patch 0001 is
fixed in this direction.

Regards,

-- 
Yugo NAGATA <nag...@sraoss.co.jp>
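P.S. For anyone who wants to try the detection idea by hand before applying
patch 0002, a rough standalone version of the check could look like the
following. This is only an untested sketch: the file globs are merely
examples, and it assumes it is run from doc/src/sgml.

	# List files containing any non-ASCII byte (perl stands in for
	# "grep -P", which non-GNU grep lacks), then let iconv fail on
	# anything that cannot be converted to Latin1.
	perl -ne 'if (/[\x80-\xFF]/) { print "$ARGV\n"; close ARGV }' *.sgml ref/*.sgml |
	while read -r FILE
	do	iconv -f UTF-8 -t ISO-8859-1 "$FILE" >/dev/null || exit 1
	done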
>From d73024303b4bbac3d6a7e861f7b3b91b0541a5ba Mon Sep 17 00:00:00 2001
From: Yugo Nagata <nag...@sraoss.co.jp>
Date: Mon, 11 Nov 2024 19:45:18 +0900
Subject: [PATCH v2 2/2] Check non-latin1 characters in make check

---
 doc/src/sgml/Makefile | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/doc/src/sgml/Makefile b/doc/src/sgml/Makefile
index 18bf87d031..55dd2da299 100644
--- a/doc/src/sgml/Makefile
+++ b/doc/src/sgml/Makefile
@@ -36,6 +36,10 @@ ifndef FOP
 FOP = $(missing) fop
 endif
 
+ifndef ICONV
+ICONV = $(missing) iconv
+endif
+
 PANDOC = pandoc
 
 XMLINCLUDE = --path . --path $(srcdir)
@@ -160,7 +164,6 @@ XSLTPROC_FO_FLAGS += --stringparam img.src.path '$(srcdir)/'
 	  awk 'BEGIN{err=0}{print}/not available in font/{err=1}END{exit err}' 1>&2 || \
 	  (echo "Found characters that cannot be displayed in PDF" 1>&2; exit 1)
 
-
 ##
 ## EPUB
 ##
@@ -197,7 +200,7 @@ MAKEINFO = makeinfo
 ##
 
 # Quick syntax check without style processing
-check: postgres.sgml $(ALL_SGML) check-tabs check-nbsp
+check: postgres.sgml $(ALL_SGML) check-tabs check-nbsp check-non-latin1
 	$(XMLLINT) $(XMLINCLUDE) --noout --valid $<
 
 
@@ -270,6 +273,12 @@ check-nbsp:
 	  $(wildcard $(srcdir)/*.sgml $(srcdir)/ref/*.sgml $(srcdir)/*.xsl $(srcdir)/images/*.xsl) ) || \
 	  (echo "Non-breaking spaces appear in SGML/XML files" 1>&2; exit 1)
 
+# Non-Latin1 characters cannot be displayed in PDF.
+check-non-latin1:
+	@ ( $(PERL) -ne '/[\x80-\xFF]/ and `${ICONV} -t ISO-8859-1 -f UTF-8 "$$ARGV" 2>/dev/null` and print("$$ARGV:$$_"),$$n++; END {exit($$n>0)}' \
+	  $(wildcard $(srcdir)/*.sgml $(srcdir)/ref/*.sgml $(srcdir)/*.xsl $(srcdir)/images/*.xsl) ) || \
+	  (echo "Non-Latin1 characters appear in SGML/XML files" 1>&2; exit 1)
+
 ##
 ## Clean
 ##
-- 
2.34.1
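For the record, with 0002 applied the new rule is meant to run as part of the
existing docs "check" target. Assuming a configured source tree, something
like this should exercise it:

	cd doc/src/sgml
	make check              # runs check-tabs, check-nbsp and check-non-latin1
	make check-non-latin1   # or run only the new rule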
>From 3abf606f693776410dd667bd59b0d33b9b6a75f3 Mon Sep 17 00:00:00 2001
From: Yugo Nagata <nag...@sraoss.co.jp>
Date: Mon, 11 Nov 2024 19:22:02 +0900
Subject: [PATCH v2 1/2] Disallow characters that cannot be displayed in PDF

---
 doc/src/sgml/Makefile | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/doc/src/sgml/Makefile b/doc/src/sgml/Makefile
index a04c532b53..18bf87d031 100644
--- a/doc/src/sgml/Makefile
+++ b/doc/src/sgml/Makefile
@@ -156,7 +156,9 @@ XSLTPROC_FO_FLAGS += --stringparam img.src.path '$(srcdir)/'
 	$(XSLTPROC) $(XMLINCLUDE) $(XSLTPROCFLAGS) $(XSLTPROC_FO_FLAGS) --stringparam paper.type USletter -o $@ $^
 
 %.pdf: %.fo $(ALL_IMAGES)
-	$(FOP) -fo $< -pdf $@
+	LANG=C $(FOP) -fo $< -pdf $@ 2>&1 | \
+	  awk 'BEGIN{err=0}{print}/not available in font/{err=1}END{exit err}' 1>&2 || \
+	  (echo "Found characters that cannot be displayed in PDF" 1>&2; exit 1)
 
 
 ##
-- 
2.34.1
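With 0001 applied, a quick way to exercise the PDF check (assuming the usual
docs toolchain, including fop, is installed) is to temporarily add a character
outside Latin1 to some SGML file and rebuild the PDF. Whether fop localizes
its warnings at all is something I have not verified, which is why the recipe
forces LANG=C.

	cd doc/src/sgml
	make postgres-US.pdf    # should now exit non-zero and print
	                        # "Found characters that cannot be displayed in PDF"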