Hi, this is Yan from Japan.
I'm also a user of PDFBox.

I'm not sure I fully understand your problem.
Do you want to process the contents inside a form XObject?

I can share some sample code from my project.
It uses PDFStreamEngine to find the form XObjects in a PDF.
I hope it helps.

-----Original Message-----
From: Andrea Vacondio [mailto:andrea.vacon...@gmail.com] 
Sent: Thursday, December 1, 2016 6:02 PM
To: users@pdfbox.apache.org
Subject: Text extraction and clip area

Hi, I had a couple of issues with text extraction and I tried to dig a bit into 
the code. As far as I can see the "current clipping area" is never used during 
text extraction, is this correct? My issue is with a form xobject where the 
bounding box clips out part of the text but that text is returned by the text 
stripper.

import java.io.File;
import java.io.IOException;
import java.util.List;

import org.apache.pdfbox.contentstream.PDFStreamEngine;
import org.apache.pdfbox.contentstream.operator.DrawObject;
import org.apache.pdfbox.contentstream.operator.Operator;
import org.apache.pdfbox.contentstream.operator.state.Concatenate;
import org.apache.pdfbox.contentstream.operator.state.Restore;
import org.apache.pdfbox.contentstream.operator.state.Save;
import org.apache.pdfbox.contentstream.operator.state.SetGraphicsStateParameters;
import org.apache.pdfbox.contentstream.operator.state.SetMatrix;
import org.apache.pdfbox.cos.COSBase;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.graphics.PDXObject;
import org.apache.pdfbox.pdmodel.graphics.form.PDFormXObject;
import org.apache.pdfbox.pdmodel.graphics.form.PDTransparencyGroupAttributes;

public class GetImageColorSpace extends PDFStreamEngine {

    public GetImageColorSpace()
    {
        addOperator(new Concatenate());
        addOperator(new DrawObject());
        addOperator(new SetGraphicsStateParameters());
        addOperator(new Save());
        addOperator(new Restore());
        addOperator(new SetMatrix());
    }

    public static void main(String[] args) throws IOException {
        if(args.length != 1)
        {
            System.err.println("Usage: java GetImageColorSpace <input-pdf>");
            System.exit(1);
        }
        // PDDocument is Closeable, so try-with-resources closes it even on error.
        try(PDDocument document = PDDocument.load(new File(args[0])))
        {
            GetImageColorSpace printer = new GetImageColorSpace();
            int pageNum = 0;
            for(PDPage page : document.getPages())
            {
                pageNum++;
                System.out.println("Processing page: " + pageNum);
                printer.processPage(page);
            }
        }
    }

    /**
     * Intercepts the "Do" operator to look at form XObjects; every other
     * operator is passed on to the default handling.
     *
     * @param operator The operation to perform.
     * @param operands The list of arguments.
     *
     * @throws IOException If there is an error processing the operation.
     */
    @Override
    protected void processOperator(Operator operator, List<COSBase> operands)
            throws IOException
    {
        String operation = operator.getName();
        if("Do".equals(operation))
        {
            COSName objectName = (COSName) operands.get(0);
            PDXObject xobject = getResources().getXObject(objectName);
            if(xobject instanceof PDFormXObject)
            {
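                PDFormXObject form = (PDFormXObject)xobject;
                // May be null when the form has no /Group entry.
                PDTransparencyGroupAttributes forGroup = form.getGroup();

                // Processing of the form's content goes here. Because "Do" is
                // not forwarded to super, the registered DrawObject operator
                // never runs and the form's content stream is not processed
                // automatically.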
            }
        }
        else
        {
            super.processOperator(operator, operands);
        }
    }

}
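
For the clipping question itself, here is a minimal sketch of one way the
placeholder above might be filled in: map the form's /BBox into page space and
test whether a text position falls inside it. It uses only java.awt.geom, so
the PDFBox side is left out. FormClipCheck, clipInPageSpace and isInsideClip
are hypothetical names, and the BBox, form matrix and CTM are assumed to have
been read from the form XObject and the graphics state already, with the text
position expressed in the same page space. This is not how PDFBox itself
handles clipping, only an illustration of the geometry.

```java
import java.awt.Shape;
import java.awt.geom.AffineTransform;
import java.awt.geom.Rectangle2D;

// Hypothetical helper, not part of PDFBox.
final class FormClipCheck {

    /**
     * Maps a form XObject's bounding box from form space into page space.
     * The form matrix is applied first, then the CTM in effect at the "Do".
     */
    static Shape clipInPageSpace(Rectangle2D bbox,
                                 AffineTransform formMatrix,
                                 AffineTransform ctm) {
        AffineTransform toPage = new AffineTransform(ctm);
        // concatenate() makes formMatrix act before ctm on each point.
        toPage.concatenate(formMatrix);
        return toPage.createTransformedShape(bbox);
    }

    /** Returns true if the point (assumed to be in page space) is inside the clip. */
    static boolean isInsideClip(Shape clip, double x, double y) {
        return clip.contains(x, y);
    }

    public static void main(String[] args) {
        // Assumed sample values: a 100 x 50 BBox, identity form matrix,
        // and a CTM that moves the form to (200, 300) on the page.
        Rectangle2D bbox = new Rectangle2D.Double(0, 0, 100, 50);
        AffineTransform formMatrix = new AffineTransform();
        AffineTransform ctm = AffineTransform.getTranslateInstance(200, 300);

        Shape clip = clipInPageSpace(bbox, formMatrix, ctm);
        System.out.println(isInsideClip(clip, 250, 320)); // prints true
        System.out.println(isInsideClip(clip, 250, 400)); // prints false
    }
}
```

A real check would probably also need to intersect this shape with whatever
clipping path is already active, and to account for page rotation and the crop
box before comparing against text positions.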
