On Nov 29, 2008, at 3:16 PM, Torsten Curdt wrote:

I just assume that the actual content is hidden inside the page's
content stream(s).

Raw content, mostly, sometimes. But the draw commands are what put it all
together.

For instance, you might have a paragraph of text where there is one draw
command per line, or you might have a paragraph of text where there is one draw
command per character.
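
To make that difference concrete, here is a minimal sketch in Python. Both content-stream fragments are made up for illustration and assumed to be already decompressed; only the operators themselves (BT, Tf, Td, Tj, ET) are standard PDF text operators. The regular expression is a deliberate simplification, not a real PDF parser: it ignores escaped or nested parentheses, hex strings, and TJ arrays.

```python
import re

# Matches a literal string operand followed by the Tj (show text) operator.
# Simplification: no escaped or nested parentheses, no hex strings, no TJ arrays.
SHOW_TEXT = re.compile(r"\(([^()\\]*)\)\s*Tj")


def shown_strings(content):
    """Return the string operand of every Tj operator, in stream order."""
    return SHOW_TEXT.findall(content)


# Hypothetical fragment: one draw command for the whole line.
per_line = "BT /F1 12 Tf 72 720 Td (Hello) Tj ET"

# Hypothetical fragment: one draw command per character,
# with a Td move between characters.
per_char = (
    "BT /F1 12 Tf 72 720 Td "
    "(H) Tj 9 0 Td (e) Tj 7 0 Td (l) Tj 3 0 Td (l) Tj 3 0 Td (o) Tj ET"
)

for name, stream in (("per line", per_line), ("per character", per_char)):
    pieces = shown_strings(stream)
    print(f"{name}: {len(pieces)} draw command(s) -> {''.join(pieces)!r}")
```

The first fragment yields one draw command and the second yields five, yet both spell the same word. Anything that extracts text has to stitch the pieces back together itself.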

Getting to the individual draw commands for the text/characters would
be a first step ...and maybe even enough for what I am after. Is this
what the CGPDFOperatorTableSetCallback() is for?
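
As far as I understand it, the general idea behind an operator table is that you register a callback for an operator name, and the scanner calls it each time that operator turns up, with the operands that preceded it. The sketch below only models that idea in Python. It does not use or reproduce the Core Graphics API; `scan`, `is_operand`, and the sample stream are all hypothetical names and data, and the tokenizer ignores escaped parentheses, hex strings, arrays, and dictionaries.

```python
import re

# A token is either a literal string "(...)" or a run of non-space characters.
# Simplification: no escaped/nested parentheses, hex strings, arrays or dictionaries.
TOKEN = re.compile(r"\(([^()\\]*)\)|(\S+)")


def is_operand(word):
    """Names (/F1) and numbers are operands; anything else is an operator."""
    if word.startswith("/"):
        return True
    try:
        float(word)
        return True
    except ValueError:
        return False


def scan(content, callbacks):
    """Walk the stream and call callbacks[op](operands) for registered operators."""
    operands = []
    for match in TOKEN.finditer(content):
        text, word = match.group(1), match.group(2)
        if text is not None:
            operands.append(text)       # string operand
        elif is_operand(word):
            operands.append(word)       # name or number operand
        else:
            handler = callbacks.get(word)
            if handler is not None:
                handler(operands)
            operands = []               # operands belong to one operator only


# Hypothetical, already-decompressed content stream with two lines of text.
stream = "BT /F1 12 Tf 72 720 Td (First line) Tj 0 -14 Td (Second line) Tj ET"

moves, texts = [], []
scan(stream, {
    "Td": lambda ops: moves.append((float(ops[0]), float(ops[1]))),
    "Tj": lambda ops: texts.append(ops[-1]),
})
print(moves)
print(texts)
```

Registering only the text operators you care about means everything else in the stream (paths, images, colour changes) is skipped without any extra code.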

For an image that fills the page, you might have one
content stream and one draw command, or you might have multiple image slices
with one content stream and one draw command for each slice.

Would a PDF writer really slice the images up?

IOW, what you want is not so simple.

I see.

Well, I probably don't really need the image extraction.
Just getting the text draw commands might suffice.


At my day job, we use pdfbox (see www.pdfbox.org) in automated tests. It basically grabs raw textual data and spits out two-dimensional arrays of strings.

While it's Java-based, it may shed some light on how text extraction can be done. I do not, however, know whether their licensing model will fit your needs (i.e., whether you are even allowed to base your code on theirs).

There are some links on their site (http://www.pdfbox.org/references.html) that show how someone wrote a Cocoa app and used the Java bridge to interface with pdfbox.

___________________________________________________________
Ricky A. Sharp         mailto:[EMAIL PROTECTED]
Instant Interactive(tm)   http://www.instantinteractive.com


