[ 
https://issues.apache.org/jira/browse/TIKA-4243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17853241#comment-17853241
 ] 

Tim Allison commented on TIKA-4243:
-----------------------------------

This is what the JSON currently looks like.

{code:json}
{
    "emitter": "fse",
    "fetchKey": "testPDFTwoTextBoxes.pdf",
    "fetcher": "fsf",
    "id": "myId",
    "onParseException": "emit",
    "parseContext": {
        "org.apache.tika.parser.pdf.PDFParserConfig": {
            "_class": "org.apache.tika.parser.pdf.PDFParserConfig",
            "accessChecker": {
                "_class": "org.apache.tika.parser.pdf.AccessChecker"
            },
            "averageCharTolerance": 0.3,
            "catchIntermediateIOExceptions": true,
            "detectAngles": false,
            "dropThreshold": 2.5,
            "enableAutoSpace": true,
            "extractAcroFormContent": true,
            "extractActions": false,
            "extractAnnotationText": true,
            "extractBookmarksText": true,
            "extractFontNames": false,
            "extractIncrementalUpdateInfo": false,
            "extractInlineImages": false,
            "extractMarkedContent": false,
            "extractUniqueInlineImagesOnly": true,
            "ifXFAExtractOnlyXFA": false,
            "imageGraphicsEngineFactory": {
                "_class": 
"org.apache.tika.parser.pdf.image.ImageGraphicsEngineFactory"
            },
            "imageStrategy": "NONE",
            "maxIncrementalUpdates": 10,
            "maxMainMemoryBytes": 536870912,
            "ocrDPI": 300,
            "ocrImageFormatName": "png",
            "ocrImageQuality": 1.0,
            "ocrImageType": "GRAY",
            "ocrRenderingStrategy": "ALL",
            "ocrStrategy": "AUTO",
            "parseIncrementalUpdates": false,
            "renderer": null,
            "setKCMS": false,
            "sortByPosition": true,
            "spacingTolerance": 0.5,
            "suppressDuplicateOverlappingText": false,
            "throwOnEncryptedPayload": false
        }
    }
}
{code}
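
For illustration, here is a minimal sketch of how a consumer might walk this structure. It assumes, from the example alone, that each key under "parseContext" is a fully qualified class name and that "_class" names the concrete type to instantiate. The helper name split_parse_context is hypothetical, and this is Python for brevity, not Tika's actual deserializer.

```python
import json

# Trimmed copy of the payload above; only a few PDFParserConfig fields are kept.
PAYLOAD = """
{
    "emitter": "fse",
    "fetchKey": "testPDFTwoTextBoxes.pdf",
    "fetcher": "fsf",
    "id": "myId",
    "onParseException": "emit",
    "parseContext": {
        "org.apache.tika.parser.pdf.PDFParserConfig": {
            "_class": "org.apache.tika.parser.pdf.PDFParserConfig",
            "accessChecker": {
                "_class": "org.apache.tika.parser.pdf.AccessChecker"
            },
            "ocrStrategy": "AUTO",
            "sortByPosition": true
        }
    }
}
"""


def split_parse_context(payload):
    """Return {context key: (concrete class name, remaining fields)}.

    Assumes every parseContext value is an object carrying a "_class" entry,
    as in the example. Nested objects are left untouched.
    """
    result = {}
    for key, value in payload.get("parseContext", {}).items():
        fields = dict(value)               # copy so the input is not mutated
        class_name = fields.pop("_class")  # raises KeyError if the marker is missing
        result[key] = (class_name, fields)
    return result


tuple_json = json.loads(PAYLOAD)
contexts = split_parse_context(tuple_json)
for key, (class_name, fields) in contexts.items():
    print(key, "->", class_name, sorted(fields))
```

Separating the "_class" marker from the remaining fields keeps the type lookup apart from the property binding, which is one plausible way to handle the nested "accessChecker" and "imageGraphicsEngineFactory" objects with the same routine.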


> tika configuration overhaul
> ---------------------------
>
>                 Key: TIKA-4243
>                 URL: https://issues.apache.org/jira/browse/TIKA-4243
>             Project: Tika
>          Issue Type: New Feature
>          Components: config
>    Affects Versions: 3.0.0
>            Reporter: Nicholas DiPiazza
>            Priority: Major
>             Fix For: 3.0.0
>
>
> In 3.0.0, it would greatly help to have a typed configuration schema for
> Tika.
> In 3.x, can we remove the old way of doing configs and replace it with JSON
> Schema?
> JSON Schema can be converted to POJOs using a Maven plugin:
> [https://github.com/joelittlejohn/jsonschema2pojo]
> This automatically creates a Java POJO model we can use for the configs.
> The legacy tika-config XML could then be read and converted to the new POJOs
> easily using an XML mapper, so that users don't have to use JSON
> configurations yet if they do not want to.
> When complete, configurations can be set as XML, JSON, or YAML:
> tika-config.xml
> tika-config.json
> tika-config.yaml
> Replace all uses of the Tika config annotations that rely on the old syntax
> with the POJO model deserialized from the XML/JSON/YAML.
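
To make the typed-configuration idea in the description concrete, here is a rough sketch of a typed model populated from JSON text. Everything in it is an assumption for illustration: the PdfConfig class, the load_pdf_config helper, and the default values are hypothetical, and only the three property names are borrowed from the PDFParserConfig JSON in the comment. It is written in Python and does not use JSON Schema or jsonschema2pojo; it only shows the kind of checking a typed model gives you.

```python
import json
from dataclasses import dataclass, fields


@dataclass
class PdfConfig:
    """Hypothetical typed model; the defaults are placeholders, not Tika's."""
    sort_by_position: bool = False
    ocr_dpi: int = 300
    ocr_strategy: str = "AUTO"


# Mapping from JSON property names to dataclass attribute names.
JSON_NAMES = {
    "sortByPosition": "sort_by_position",
    "ocrDPI": "ocr_dpi",
    "ocrStrategy": "ocr_strategy",
}


def load_pdf_config(text):
    """Build a PdfConfig from JSON text, rejecting unknown keys and wrong types."""
    raw = json.loads(text)
    unknown = set(raw) - set(JSON_NAMES)
    if unknown:
        raise ValueError(f"unknown config keys: {sorted(unknown)}")
    config = PdfConfig(**{JSON_NAMES[name]: value for name, value in raw.items()})
    for field in fields(config):
        value = getattr(config, field.name)
        # Exact type match, so a JSON boolean is not accepted where an int is declared.
        if type(value) is not field.type:
            raise TypeError(f"{field.name}: expected {field.type.__name__}")
    return config


config = load_pdf_config('{"sortByPosition": true, "ocrDPI": 150}')
print(config)
```

A misspelled key such as "sortByPositon" fails at load time with a ValueError instead of being silently ignored, which is the main practical benefit a schema-backed model would bring.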



