On Mon, 18 Nov 2024 16:04:20 -0500
Bruce Momjian <br...@momjian.us> wrote:

> On Mon, Nov 11, 2024 at 10:02:15PM +0900, Yugo Nagata wrote:
> > On Tue, 5 Nov 2024 10:08:17 +0100
> > Peter Eisentraut <pe...@eisentraut.org> wrote:
> > 
> > 
> > > >> So you convert LATIN1 characters to HTML entities so that it's easier
> > > >> to detect non-LATIN1 characters in the SGML docs? If my
> > > >> understanding is correct, it can also be achieved by using some tools
> > > >> like:
> > > >>
> > > >> iconv -t ISO-8859-1 -f UTF-8 release-17.sgml
> > > >>
> > > >> If there are some non-LATIN1 characters in release-17.sgml,
> > > >> it will complain like:
> > > >>
> > > >> iconv: illegal input sequence at position 175
> > > >>
> > > >> An advantage of this is, we don't need to convert each LATIN1
> > > >> character to an HTML entity, which makes the sgml file authors' life a
> > > >> little bit easier.
> > 
> > > I think the iconv approach is an idea worth checking out.
> > > 
> > > It's also not necessarily true that the set of characters provided by 
> > > the built-in PDF fonts is exactly the set of characters in Latin 1.  It 
> > > appears to be close enough, but I'm not sure, and I haven't found any 
> > > authoritative information on that.  
> > 
> > I found a description in FAQ on Apache FOP [1] that explains some glyphs for
> > Latin1 character set are not contained in the standard text fonts.
> > 
> >  The standard text fonts supplied with Acrobat Reader have mostly glyphs for
> >  characters from the ISO Latin 1 character set. For a variety of reasons, even
> >  those are not completely guaranteed to work, for example you can't use the fi
> >  ligature from the standard serif font.
> 
> So, the failure of ligatures is caused usually by not using the right
> Adobe Font Metric (AFM) file, I think.  I have seen faulty ligature
> rendering in PDFs but was always able to fix it by using the right AFM
> file.  Odds are, failure is caused by using a standard Latin1 AFM file
> and not the AFM file that matches the font being used.
> 
> > [1] https://xmlgraphics.apache.org/fop/faq.html#pdf-characters
> > 
> > However, it seems that using iconv to detect non-Latin1 characters may still
> > be useful because these are likely not displayed in PDF. For example, we can
> > do this in make check as in the attached patch 0002. It cannot show the
> > filename where one is found, though.
> 
> I was thinking something like:
> 
>       grep -l --recursive  -P '[\x80-\xFF]' . |
>       while read FILE
>       do  iconv -f UTF-8 -t ISO-8859-1 "$FILE" || exit 1
>       done
> 
> This only checks files with non-ASCII characters.

Checking for non-Latin1 characters only after checking for non-ASCII
characters seems like a good idea. I attached an updated patch (0002) that
uses perl instead of grep, because non-GNU grep may not support hex escape
sequences.

> 
> > > Another approach for a fix would be 
> > > to get FOP to produce the required warnings or errors more reliably.  I 
> > > know it has a bunch of logging settings (ultimately via log4j), so there 
> > > might be some possibilities.
> > 
> > When a character that cannot be displayed in PDF is found, a warning
> > "Glyph ... not available in font ...." is output in fop's log. We can
> > prevent such characters from being contained in PDF by checking for
> > the message, as in the attached patch 0001. However, this is checked after
> > the pdf is generated, since I have no idea how to terminate the
> > generation immediately when such a character is detected.
> 
> So, are we sure this will be the message even for non-English users? I
> thought checking for warning message text was too fragile.

I am not sure whether fop has non-English messages, although I have never
seen it output Japanese messages.

I wonder whether we can get uniform results if it is executed with LANG=C.
The updated patch 0001 has been changed in this direction.
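
To make the idea easier to discuss, here is the log check from patch 0001
pulled out into a small shell function. It is a sketch only: the function
name fail_on_missing_glyph is made up, and the only thing it relies on is the
substring "not available in font" from the warning quoted above.

```shell
#!/bin/sh
# Hypothetical helper: copy a fop log from stdin to stdout unchanged and
# return non-zero if any line contains the missing-glyph warning text.
fail_on_missing_glyph() {
    awk 'BEGIN { err = 0 } { print } /not available in font/ { err = 1 } END { exit err }'
}

# Intended use (file names are placeholders):
#   LANG=C fop -fo postgres.fo -pdf postgres.pdf 2>&1 | fail_on_missing_glyph
```

Note that in a plain sh pipeline the exit status is that of the last command
(awk here), so a failure of fop itself would not be caught by this check
alone.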

Regards,

-- 
Yugo NAGATA <nag...@sraoss.co.jp>
From d73024303b4bbac3d6a7e861f7b3b91b0541a5ba Mon Sep 17 00:00:00 2001
From: Yugo Nagata <nag...@sraoss.co.jp>
Date: Mon, 11 Nov 2024 19:45:18 +0900
Subject: [PATCH v2 2/2] Check non-latin1 characters in make check

---
 doc/src/sgml/Makefile | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/doc/src/sgml/Makefile b/doc/src/sgml/Makefile
index 18bf87d031..55dd2da299 100644
--- a/doc/src/sgml/Makefile
+++ b/doc/src/sgml/Makefile
@@ -36,6 +36,10 @@ ifndef FOP
 FOP = $(missing) fop
 endif
 
+ifndef ICONV
+ICONV = $(missing) iconv
+endif
+
 PANDOC = pandoc
 
 XMLINCLUDE = --path . --path $(srcdir)
@@ -160,7 +164,6 @@ XSLTPROC_FO_FLAGS += --stringparam img.src.path '$(srcdir)/'
 	awk 'BEGIN{err=0}{print}/not available in font/{err=1}END{exit err}' 1>&2  || \
 	(echo "Found characters that cannot be displayed in PDF" 1>&2;  exit 1)
 
-
 ##
 ## EPUB
 ##
@@ -197,7 +200,7 @@ MAKEINFO = makeinfo
 ##
 
 # Quick syntax check without style processing
-check: postgres.sgml $(ALL_SGML) check-tabs check-nbsp
+check: postgres.sgml $(ALL_SGML) check-tabs check-nbsp check-non-latin1
 	$(XMLLINT) $(XMLINCLUDE) --noout --valid $<
 
 
@@ -270,6 +273,12 @@ check-nbsp:
 	  $(wildcard $(srcdir)/*.sgml $(srcdir)/ref/*.sgml $(srcdir)/*.xsl $(srcdir)/images/*.xsl) ) || \
 	(echo "Non-breaking spaces appear in SGML/XML files" 1>&2;  exit 1)
 
+# Non-Latin1 characters cannot be displayed in PDF.
+check-non-latin1:
+	@ ( $(PERL) -ne '/[\x80-\xFF]/ and `${ICONV} -t ISO-8859-1 -f UTF-8 "$$ARGV" 2>/dev/null` and print("$$ARGV:$$_"),$$n++; END {exit($$n>0)}' \
+	  $(wildcard $(srcdir)/*.sgml $(srcdir)/ref/*.sgml $(srcdir)/*.xsl $(srcdir)/images/*.xsl) ) || \
+	(echo "Non-Latin1 characters appear in SGML/XML files" 1>&2;  exit 1)
+
 ##
 ## Clean
 ##
-- 
2.34.1

From 3abf606f693776410dd667bd59b0d33b9b6a75f3 Mon Sep 17 00:00:00 2001
From: Yugo Nagata <nag...@sraoss.co.jp>
Date: Mon, 11 Nov 2024 19:22:02 +0900
Subject: [PATCH v2 1/2] Disallow characters that cannot be displayed in PDF

---
 doc/src/sgml/Makefile | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/doc/src/sgml/Makefile b/doc/src/sgml/Makefile
index a04c532b53..18bf87d031 100644
--- a/doc/src/sgml/Makefile
+++ b/doc/src/sgml/Makefile
@@ -156,7 +156,9 @@ XSLTPROC_FO_FLAGS += --stringparam img.src.path '$(srcdir)/'
 	$(XSLTPROC) $(XMLINCLUDE) $(XSLTPROCFLAGS) $(XSLTPROC_FO_FLAGS) --stringparam paper.type USletter -o $@ $^
 
 %.pdf: %.fo $(ALL_IMAGES)
-	$(FOP) -fo $< -pdf $@
+	LANG=C $(FOP) -fo $< -pdf $@ 2>&1 | \
+	awk 'BEGIN{err=0}{print}/not available in font/{err=1}END{exit err}' 1>&2  || \
+	(echo "Found characters that cannot be displayed in PDF" 1>&2;  exit 1)
 
 
 ##
-- 
2.34.1
