[ https://issues.apache.org/jira/browse/TIKA-4243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17853241#comment-17853241 ]
Tim Allison commented on TIKA-4243:
-----------------------------------

This is what the json currently looks like.

{code:json}
{
  "emitter": "fse",
  "fetchKey": "testPDFTwoTextBoxes.pdf",
  "fetcher": "fsf",
  "id": "myId",
  "onParseException": "emit",
  "parseContext": {
    "org.apache.tika.parser.pdf.PDFParserConfig": {
      "_class": "org.apache.tika.parser.pdf.PDFParserConfig",
      "accessChecker": {
        "_class": "org.apache.tika.parser.pdf.AccessChecker"
      },
      "averageCharTolerance": 0.3,
      "catchIntermediateIOExceptions": true,
      "detectAngles": false,
      "dropThreshold": 2.5,
      "enableAutoSpace": true,
      "extractAcroFormContent": true,
      "extractActions": false,
      "extractAnnotationText": true,
      "extractBookmarksText": true,
      "extractFontNames": false,
      "extractIncrementalUpdateInfo": false,
      "extractInlineImages": false,
      "extractMarkedContent": false,
      "extractUniqueInlineImagesOnly": true,
      "ifXFAExtractOnlyXFA": false,
      "imageGraphicsEngineFactory": {
        "_class": "org.apache.tika.parser.pdf.image.ImageGraphicsEngineFactory"
      },
      "imageStrategy": "NONE",
      "maxIncrementalUpdates": 10,
      "maxMainMemoryBytes": 536870912,
      "ocrDPI": 300,
      "ocrImageFormatName": "png",
      "ocrImageQuality": 1.0,
      "ocrImageType": "GRAY",
      "ocrRenderingStrategy": "ALL",
      "ocrStrategy": "AUTO",
      "parseIncrementalUpdates": false,
      "renderer": null,
      "setKCMS": false,
      "sortByPosition": true,
      "spacingTolerance": 0.5,
      "suppressDuplicateOverlappingText": false,
      "throwOnEncryptedPayload": false
    }
  }
}
{code}

> tika configuration overhaul
> ---------------------------
>
>                 Key: TIKA-4243
>                 URL: https://issues.apache.org/jira/browse/TIKA-4243
>             Project: Tika
>          Issue Type: New Feature
>          Components: config
>    Affects Versions: 3.0.0
>            Reporter: Nicholas DiPiazza
>            Priority: Major
>             Fix For: 3.0.0
>
>
> In 3.0.0 when dealing with Tika, it would greatly help to have a Typed
> Configuration schema.
> In 3.x can we remove the old way of doing configs and replace with Json
> Schema?
> Json Schema can be converted to Pojos using a maven plugin
> [https://github.com/joelittlejohn/jsonschema2pojo]
> This automatically creates a Java Pojo model we can use for the configs.
> This can allow for the legacy tika-config XML to be read and converted to the
> new pojos easily using an XML mapper so that users don't have to use JSON
> configurations yet if they do not want.
> When complete, configurations can be set as XML, JSON or YAML
> tika-config.xml
> tika-config.json
> tika-config.yaml
> Replace all instances of tika config annotations that used the old syntax,
> and replace with the Pojo model serialized from the xml/json/yaml.
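
As a rough illustration of how the serialized form in the comment above is laid out, the sketch below loads a trimmed copy of that JSON with Python's standard `json` module and lists the class name recorded under each `_class` key inside `parseContext`. The helper `collect_classes` is hypothetical and is not part of Tika; Tika's own deserialization is done in Java and may work quite differently. The sketch only walks nested objects, not arrays.

```python
import json

# Trimmed copy of the request JSON from the comment above; most
# PDFParserConfig fields are omitted to keep the sketch short.
REQUEST_JSON = """
{
  "emitter": "fse",
  "fetchKey": "testPDFTwoTextBoxes.pdf",
  "fetcher": "fsf",
  "id": "myId",
  "onParseException": "emit",
  "parseContext": {
    "org.apache.tika.parser.pdf.PDFParserConfig": {
      "_class": "org.apache.tika.parser.pdf.PDFParserConfig",
      "accessChecker": {
        "_class": "org.apache.tika.parser.pdf.AccessChecker"
      },
      "ocrStrategy": "AUTO",
      "sortByPosition": true
    }
  }
}
"""


def collect_classes(node, path=""):
    """Hypothetical helper: yield (path, class name) for every `_class` key."""
    if isinstance(node, dict):
        if "_class" in node:
            yield path or "/", node["_class"]
        # Recurse into nested objects; scalar values are ignored.
        for key, value in node.items():
            if key != "_class":
                yield from collect_classes(value, f"{path}/{key}")


request = json.loads(REQUEST_JSON)
classes = dict(collect_classes(request["parseContext"]))
for where, class_name in classes.items():
    print(f"{where} -> {class_name}")
```

Each `parseContext` entry is keyed by a class name and repeats that name in `_class`, and nested objects such as `accessChecker` carry their own `_class`, so a reader of this format can tell which type each nested block is meant to represent.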