On Tue, 5 Nov 2024 10:08:17 +0100
Peter Eisentraut <pe...@eisentraut.org> wrote:


> >> So you convert LATIN1 characters to HTML entities so that it's easier
> >> to detect non-LATIN1 characters in the SGML docs? If my
> >> understanding is correct, it can be also achieved by using some tools
> >> like:
> >>
> >> iconv -t ISO-8859-1 -f UTF-8 release-17.sgml
> >>
> >> If there are some non-LATIN1 characters in release-17.sgml,
> >> it will complain like:
> >>
> >> iconv: illegal input sequence at position 175
> >>
> >> An advantage of this is that we don't need to convert each LATIN1
> >> character to an HTML entity, which makes the SGML file authors' lives a
> >> little bit easier.

> I think the iconv approach is an idea worth checking out.
> 
> It's also not necessarily true that the set of characters provided by 
> the built-in PDF fonts is exactly the set of characters in Latin 1.  It 
> appears to be close enough, but I'm not sure, and I haven't found any 
> authoritative information on that.  

I found a passage in the Apache FOP FAQ [1] explaining that some glyphs in the
Latin1 character set are not included in the standard text fonts:

 The standard text fonts supplied with Acrobat Reader have mostly glyphs for
 characters from the ISO Latin 1 character set. For a variety of reasons, even
 those are not completely guaranteed to work, for example you can't use the fi
 ligature from the standard serif font.

[1] https://xmlgraphics.apache.org/fop/faq.html#pdf-characters

However, using iconv to detect non-Latin1 characters may still be useful,
because such characters are likely not displayed in PDF. For example, we can
do this in make check, as in the attached patch 0002. It cannot show the name
of the file in which such a character is found, though.
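
One possible way around the missing file name is to run iconv once per
file instead of once over all files. A minimal sketch, assuming an iconv
that exits non-zero on a character it cannot convert (as the error shown
in the quoted text suggests); check_latin1 is a hypothetical helper name
and is not part of patch 0002:

```shell
#!/bin/sh
# Sketch only: report each file that cannot be converted from UTF-8 to
# ISO-8859-1. check_latin1 is a hypothetical name, not from the patch.
check_latin1() {
    status=0
    for f in "$@"; do
        # Assumption: iconv exits non-zero when a character has no
        # Latin-1 equivalent. Its own output is discarded on purpose.
        if ! iconv -f UTF-8 -t ISO-8859-1 "$f" >/dev/null 2>&1; then
            echo "Non-Latin1 characters appear in $f" 1>&2
            status=1
        fi
    done
    return $status
}

# Example call (paths are illustrative):
#   check_latin1 doc/src/sgml/*.sgml doc/src/sgml/ref/*.sgml
```

This keeps checking after the first failure, so all offending files are
listed in one run, at the cost of one iconv process per file.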

> Another approach for a fix would be 
> to get FOP produce the required warnings or errors more reliably.  I 
> know it has a bunch of logging settings (ultimately via log4j), so there 
> might be some possibilities.

When a character that cannot be displayed in PDF is found, a warning
"Glyph ... not available in font ...." is written to FOP's log. We can
prevent such characters from ending up in the PDF by checking for this
message, as in the attached patch 0001. However, the check runs after
the PDF has been generated, since I could not find a way to terminate the
generation immediately when such a character is detected.
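
To try the log filter without running FOP, the awk program from patch 0001
can be fed a sample line by hand. This is only a sketch: the function name
fail_on_missing_glyph is hypothetical, and the log line is a made-up
stand-in that merely contains the matched phrase, not real FOP output.

```shell
#!/bin/sh
# Sketch only: the awk program is the one from patch 0001; the wrapper
# name fail_on_missing_glyph is hypothetical.
fail_on_missing_glyph() {
    # Print every input line unchanged, remember whether any line
    # contained the phrase, and exit 1 at end of input if one did.
    awk 'BEGIN{err=0}{print}/not available in font/{err=1}END{exit err}'
}

# Made-up stand-in for a log line; real FOP messages may look different.
printf 'WARNING: Glyph X not available in font Y\n' | fail_on_missing_glyph \
    || echo "Found characters that cannot be displayed in PDF" 1>&2
```

Note that the exit status of a shell pipeline is that of its last command,
so in this form the result reflects only awk's verdict; FOP's own exit
status is not examined.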

Regards,
Yugo Nagata

-- 
Yugo Nagata <nag...@sraoss.co.jp>
From b6bed0089fa510480dc410969ecff42a55ea7442 Mon Sep 17 00:00:00 2001
From: Yugo Nagata <nag...@sraoss.co.jp>
Date: Mon, 11 Nov 2024 19:45:18 +0900
Subject: [PATCH 2/2] Check non-latin1 characters in make check

---
 doc/src/sgml/Makefile | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/doc/src/sgml/Makefile b/doc/src/sgml/Makefile
index edc3725e5a..39822082c8 100644
--- a/doc/src/sgml/Makefile
+++ b/doc/src/sgml/Makefile
@@ -157,10 +157,9 @@ XSLTPROC_FO_FLAGS += --stringparam img.src.path '$(srcdir)/'
 
 %.pdf: %.fo $(ALL_IMAGES)
 	$(FOP) -fo $< -pdf $@ 2>&1 | \
-	awk 'BEGIN{err=0}{print}/not available in font/{err=1}END{exit err}' 1>&2 || \
+	awk 'BEGIN{err=0}{print}/not available in font/{err=1}END{exit err}' 1>&2  || \
 	(echo "Found characters that cannot be displayed in PDF" 1>&2;  exit 1)
 
-
 ##
 ## EPUB
 ##
@@ -197,7 +196,7 @@ MAKEINFO = makeinfo
 ##
 
 # Quick syntax check without style processing
-check: postgres.sgml $(ALL_SGML) check-tabs check-nbsp
+check: postgres.sgml $(ALL_SGML) check-tabs check-nbsp check-non-latin1
 	$(XMLLINT) $(XMLINCLUDE) --noout --valid $<
 
 
@@ -270,6 +269,11 @@ check-nbsp:
 	  $(wildcard $(srcdir)/*.sgml $(srcdir)/ref/*.sgml $(srcdir)/*.xsl $(srcdir)/images/*.xsl) ) || \
 	(echo "Non-breaking spaces appear in SGML/XML files" 1>&2;  exit 1)
 
+# Non-Latin1 characters cannot be displayed in PDF.
+check-non-latin1:
+	@ (iconv -t ISO-8859-1 -f UTF-8 $(wildcard $(srcdir)/*.sgml $(srcdir)/ref/*.sgml $(srcdir)/*.xsl) >/dev/null 2>&1) || \
+	(echo "Non-Latin1 characters appear in SGML/XML files" 1>&2;  exit 1)
+
 ##
 ## Clean
 ##
-- 
2.34.1

From 7e6a612c15bf65169e31906371218cdf13fcacdb Mon Sep 17 00:00:00 2001
From: Yugo Nagata <nag...@sraoss.co.jp>
Date: Mon, 11 Nov 2024 19:22:02 +0900
Subject: [PATCH 1/2] Disallow characters that cannot be displayed in PDF

---
 doc/src/sgml/Makefile | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/doc/src/sgml/Makefile b/doc/src/sgml/Makefile
index a04c532b53..edc3725e5a 100644
--- a/doc/src/sgml/Makefile
+++ b/doc/src/sgml/Makefile
@@ -156,7 +156,9 @@ XSLTPROC_FO_FLAGS += --stringparam img.src.path '$(srcdir)/'
 	$(XSLTPROC) $(XMLINCLUDE) $(XSLTPROCFLAGS) $(XSLTPROC_FO_FLAGS) --stringparam paper.type USletter -o $@ $^
 
 %.pdf: %.fo $(ALL_IMAGES)
-	$(FOP) -fo $< -pdf $@
+	$(FOP) -fo $< -pdf $@ 2>&1 | \
+	awk 'BEGIN{err=0}{print}/not available in font/{err=1}END{exit err}' 1>&2 || \
+	(echo "Found characters that cannot be displayed in PDF" 1>&2;  exit 1)
 
 
 ##
-- 
2.34.1
