Hi, this is Yan from Japan. I'm also a PDFBox user. I haven't clearly understood your problem. Do you want to process the contents inside a form?
I can give you some sample code from my project. It uses PDFStreamEngine to get the form objects in a PDF. I hope it helps.

-----Original Message-----
From: Andrea Vacondio [mailto:andrea.vacon...@gmail.com]
Sent: Thursday, December 1, 2016 6:02 PM
To: users@pdfbox.apache.org
Subject: Text extraction and clip area

Hi,
I had a couple of issues with text extraction and I tried to dig a bit into the code. As far as I can see the "current clipping area" is never used during text extraction, is this correct? My issue is with a form xobject where the bounding box clips out part of the text but that text is returned by the text stripper.
import java.io.File;
import java.io.IOException;
import java.util.List;

import org.apache.pdfbox.contentstream.PDFStreamEngine;
import org.apache.pdfbox.contentstream.operator.DrawObject;
import org.apache.pdfbox.contentstream.operator.Operator;
import org.apache.pdfbox.contentstream.operator.state.Concatenate;
import org.apache.pdfbox.contentstream.operator.state.Restore;
import org.apache.pdfbox.contentstream.operator.state.Save;
import org.apache.pdfbox.contentstream.operator.state.SetGraphicsStateParameters;
import org.apache.pdfbox.contentstream.operator.state.SetMatrix;
import org.apache.pdfbox.cos.COSBase;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDResources;
import org.apache.pdfbox.pdmodel.graphics.PDXObject;
import org.apache.pdfbox.pdmodel.graphics.color.PDColorSpace;
import org.apache.pdfbox.pdmodel.graphics.form.PDFormXObject;
import org.apache.pdfbox.pdmodel.graphics.form.PDTransparencyGroupAttributes;
import org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject;

public class GetImageColorSpace extends PDFStreamEngine
{
    public GetImageColorSpace()
    {
        addOperator(new Concatenate());
        addOperator(new DrawObject());
        addOperator(new SetGraphicsStateParameters());
        addOperator(new Save());
        addOperator(new Restore());
        addOperator(new SetMatrix());
    }

    public static void main(String[] args) throws IOException
    {
        PDDocument document = null;
        try
        {
            document = PDDocument.load(new File(args[0]));
            GetImageColorSpace printer = new GetImageColorSpace();
            int pageNum = 0;
            for (PDPage page : document.getPages())
            {
                pageNum++;
                System.out.println("Processing page: " + pageNum);
                printer.processPage(page);
            }
        }
        finally
        {
            if (document != null)
            {
                document.close();
            }
        }
    }

    /**
     * This is used to handle an operation.
     *
     * @param operator The operation to perform.
     * @param operands The list of arguments.
     *
     * @throws IOException If there is an error processing the operation.
     */
    @Override
    protected void processOperator(Operator operator, List<COSBase> operands) throws IOException
    {
        String operation = operator.getName();
        if ("Do".equals(operation))
        {
            COSName objectName = (COSName) operands.get(0);
            PDXObject xobject = getResources().getXObject(objectName);
            if (xobject instanceof PDFormXObject)
            {
                PDFormXObject form = (PDFormXObject) xobject;
                PDTransparencyGroupAttributes forGroup = form.getGroup();
                // processing form's content goes here.
            }
        }
        else
        {
            super.processOperator(operator, operands);
        }
    }
}
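
If the goal is to drop text that the form's bounding box clips away, one possible approach is to map the form's BBox into page space and test each text position against it. The following is only a minimal sketch of that geometric test using java.awt.geom, with no PDFBox calls. The class name FormClipCheck, the method isInsideClip and the sample numbers are made up for illustration. In a real engine the rectangle would presumably come from form.getBBox() and the transform from form.getMatrix() combined with the current transformation matrix, but I have not tested that wiring.

```java
import java.awt.Shape;
import java.awt.geom.AffineTransform;
import java.awt.geom.Rectangle2D;

// Hypothetical helper, not part of PDFBox.
class FormClipCheck
{
    /**
     * Returns true if the page-space point (x, y) lies inside the form's
     * bounding box after the box is mapped from form space to page space.
     */
    static boolean isInsideClip(Rectangle2D bbox, AffineTransform formToPage, double x, double y)
    {
        // The transformed box may be rotated or skewed, so keep it as a general shape.
        Shape clip = formToPage.createTransformedShape(bbox);
        return clip.contains(x, y);
    }

    public static void main(String[] args)
    {
        // Assumed sample values: a 100 x 50 form box placed at (200, 300) on the page.
        Rectangle2D bbox = new Rectangle2D.Double(0, 0, 100, 50);
        AffineTransform formToPage = AffineTransform.getTranslateInstance(200, 300);

        System.out.println(isInsideClip(bbox, formToPage, 250, 320)); // inside the box
        System.out.println(isInsideClip(bbox, formToPage, 350, 320)); // right of the box
    }
}
```

A check like this could be called where the sample above says "processing form's content goes here", skipping any text whose position falls outside the transformed box. Nested forms would need their clip shapes intersected, which this sketch does not handle.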