Re: Missing unicode information from a font.

NH Rao Mon, 14 Apr 2025 09:53:02 -0700

Hello again,

While I agree with your reasoning, I am not really sure it applies in
this case. From an API consistency point of view, PDFont has a
bunch of abstract methods defined, and these are the only two that
are declared as protected. That raises some questions, for example: why is
getWidthFromFont public and getStandard14Width protected? The encode method
can be interesting, but all it does is take a value and return a
byte array.


I am not able to understand how making two methods public will cause more
support requests. I cannot imagine a scenario in which someone would
call encode or getStandard14Width without understanding
what they are doing.

I understand my needs are not common or standard. Right now reflection is
working for us, but I'm nervous about relying on it. Just making these two
methods public would give us some peace of mind. Of course, I am not an
expert and don't know all the use cases for the current implementation, but
I am hoping that you will understand our use case and help us here.
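
For illustration, here is a minimal sketch of the kind of wrapper and
reflective delegation I mean. It does not use PDFBox itself: BaseFont,
PlainFont and FontWrapper are hypothetical stand-ins, and BaseFont only
mimics the shape of PDFont (two protected abstract methods plus toUnicode).
The real PDFont methods have more detail, for example checked exceptions,
which this sketch leaves out.

```java
import java.lang.reflect.Method;
import java.util.Map;

// Hypothetical stand-in for a library base class such as PDFont:
// two abstract methods are protected rather than public.
abstract class BaseFont {
    protected abstract byte[] encode(int unicode);
    protected abstract float getStandard14Width(int code);
    public abstract String toUnicode(int code);
}

// Hypothetical concrete font that has no Unicode information.
class PlainFont extends BaseFont {
    @Override protected byte[] encode(int unicode) { return new byte[] { (byte) unicode }; }
    @Override protected float getStandard14Width(int code) { return 500f; }
    @Override public String toUnicode(int code) { return null; }
}

// Wrapper that supplies toUnicode itself and delegates everything else.
class FontWrapper extends BaseFont {
    private final BaseFont wrapped;
    private final Map<Integer, String> unicodeByCode;

    FontWrapper(BaseFont wrapped, Map<Integer, String> unicodeByCode) {
        this.wrapped = wrapped;
        this.unicodeByCode = unicodeByCode;
    }

    // From a different package, wrapped.encode(...) would not compile,
    // because protected access does not extend to another instance of the
    // base type. Hence the reflective call.
    @Override protected byte[] encode(int unicode) {
        return (byte[]) callProtected("encode", unicode);
    }

    @Override protected float getStandard14Width(int code) {
        return (Float) callProtected("getStandard14Width", code);
    }

    @Override public String toUnicode(int code) {
        String mapped = unicodeByCode.get(code);
        return mapped != null ? mapped : wrapped.toUnicode(code);
    }

    // Looks up the protected method on the base class and invokes it on the
    // wrapped instance; the call still dispatches to the concrete override.
    private Object callProtected(String name, int arg) {
        try {
            Method method = BaseFont.class.getDeclaredMethod(name, int.class);
            method.setAccessible(true);
            return method.invoke(wrapped, arg);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Reflective call failed: " + name, e);
        }
    }
}
```

The fragile part is callProtected: it depends on method names and
signatures that are not part of the public API, so a rename would only show
up at runtime. That is the risk I would like to avoid.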

Regards,

Niranjan

On Sun, Apr 13, 2025 at 6:49 AM Tilman Hausherr <thaush...@t-online.de>
wrote:

> Hi,
>
> We're usually reluctant to make stuff public, because this brings up new
> risks, more support requests and also prevents us from changing that API.
>
> Tilman
>
> On 12.04.2025 02:52, NH Rao wrote:
> > Greetings,
> >
> > Thank you for the reply. I managed to get it working using reflection.
> > However, I'm a bit worried that I am accessing methods that are not part
> > of the public API.
> >
> > Essentially my solution is to override the showGlyph method from
> > PDFTextStripper, create a wrapper for the font and use the base class
> > method with the wrapper. This works. The only issue is with two
> > methods, getStandard14Width and encode, of the
> > org.apache.pdfbox.pdmodel.font.PDFont class. I am forced to use
> > reflection as these methods are not public. Is it possible to make these
> > methods public?
> >
> > For all practical purposes, my wrapper just implements the toUnicode
> > method and delegates everything else to the wrapped object.
> >
> > Thank you,
> >
> > Niranjan
> >
> > On Fri, Apr 11, 2025 at 5:40 AM Tilman Hausherr <thaush...@t-online.de>
> > wrote:
> >
> >> No, there is no official solution to handle this.
> >>
> >> Here's what could be done (in addition to just forking the project, or
> >> using reflection):
> >>
> >> - do a setUnicode() on the TextPosition elements in the stripper
> >> - create the encoding and replace the fonts before extracting. For that
> >> you'd have to find out how the encoding is stored (probably in
> >> "differences").
> >> If it doesn't work, you may have to disable the cache or use your own.
> >>
> >> Tilman
> >>
> >> On 10.04.2025 23:48, NH Rao wrote:
> >>> Greetings,
> >>>
> >>> Some of the PDF files we process do not have unicode information
> >>> defined for their Type 3 fonts. I am in the process of migrating ancient
> >>> code (based on version 1.8) to the latest version. Since the characters
> >>> are limited to ASCII, we dumped the checksum of each glyph and its
> >>> character to a map. After processing enough files, we managed to get
> >>> checksums for all the characters we care about. At runtime, we get the
> >>> font glyph, compute its checksum and set the equivalent unicode using
> >>> code that looks similar to the following:
> >>>
> >>> font.getFontEncoding().addCharacterEncoding(letterChar, charName);
> >>> font.getToUnicodeCMap().addMapping(new byte[] { (byte) i }, letter);
> >>>
> >>> With these changes, the rest of the text stripper code works as
> >>> expected, as it's able to find the required information.
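
A minimal sketch of the checksum-to-character lookup described above. The
message does not say which checksum was used or how the glyph bytes were
obtained, so CRC32 over a raw byte array is an assumption here, and
GlyphChecksumMap with its learn and lookup methods is a hypothetical helper,
not part of PDFBox.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.zip.CRC32;

// Hypothetical helper: maps a checksum of a glyph's raw bytes to the
// character that glyph represents.
class GlyphChecksumMap {
    private final Map<Long, String> unicodeByChecksum = new HashMap<>();

    // CRC32 is an assumption; any stable checksum of the glyph data works.
    static long checksum(byte[] glyphBytes) {
        CRC32 crc = new CRC32();
        crc.update(glyphBytes, 0, glyphBytes.length);
        return crc.getValue();
    }

    // Records a glyph whose character is already known (the manual step).
    void learn(byte[] glyphBytes, String unicode) {
        unicodeByChecksum.put(checksum(glyphBytes), unicode);
    }

    // Returns the recorded character, or null if the glyph was never seen.
    String lookup(byte[] glyphBytes) {
        return unicodeByChecksum.get(checksum(glyphBytes));
    }
}
```

At runtime the result of lookup would be what gets fed into the font's
unicode mapping; a null result means the glyph still has to be added to the
map by hand.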
> >>>
> >>> We're trying to migrate to the latest released version of PDFBox. I
> >>> believe some of these methods are now package protected,
> >>> e.g. org.apache.pdfbox.pdmodel.font.encoding.Encoding.add(int, String).
> >>> Also, the comment on the method seems to discourage our workaround.
> >>>
> >>> I am not able to figure out which method I need to call for the unicode
> >>> mapping in the second line of the above code example.
> >>>
> >>> What would be a solution to handle this? The solution of mapping glyph
> >>> to character does work for us, even though we created the map manually.
> >>>
> >>> Regards,
> >>>
> >>> Niranjan
> >>>
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: users-unsubscr...@pdfbox.apache.org
> >> For additional commands, e-mail: users-h...@pdfbox.apache.org
> >>
> >>
