[ https://issues.apache.org/jira/browse/TIKA-4243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17851727#comment-17851727 ]
Tim Allison edited comment on TIKA-4243 at 6/3/24 5:02 PM:
-----------------------------------------------------------

I spent a bit of time trying to serialize ParseContext, and I now remember/newly appreciate what a challenge that is. For one, everything has to be serializable, as we knew, with Jackson annotations or other Jackson-based methods. I suspect that someone who really understands Jackson could do a good job of this. I know the basics, but I am not a Jackson expert.

There are two main challenges: inheritance and embedded objects (as opposed to parameterizable primitives).

Inheritance is complicated with Jackson. If we want to support, for example, {{parseContext.set(Parser.class, new EmptyParser())}}, we have to store the base class name as the key and the instantiated class as the value. I think I found out how to do this with Jackson, but it is _messy_ (reference: [https://www.baeldung.com/jackson-inheritance#bd-subtype-handling-scenarios]).

We'd want to deal with embedded objects for the obvious use cases of the CompositeDetectors, etc., where we want to specify a list of detectors. And we want to be able to cover the cases of setting an object as a parameter, for example, setting some of the slightly more complex classes in the PDFParserConfig.

I'm wondering if it would be simpler to back off to a Map<String, String[]> properties kind of thing where we identify the config class and then instantiate it for the ParseContext with the "properties". We're currently doing something like this in tika-server, where we have custom serialization classes for each config we support (PDFParserConfig and TesseractOCRParserConfig) based on the HTTP headers. We'd want to extend this to handle inheritance and embedded objects...

Something along these lines in json:
{code:json}
{
  "settings" : {
    "org.apache.tika.parser.pdf.PDFParserConfig": {
      "ocrDPI": 300,
      "sortByPosition": true
    },
    "org.apache.tika.parser.Parser": {
      "_class": "org.apache.tika.parser.EmptyParser"
    }
  }
}
{code}
Then we'd have a small bit of code (I'd hope?) that would take the settings and create the config class with the map of its values:

{{PDFParserConfig pdfParserConfig = new PDFParserConfig(Map<String, String[]>)}}

*What I don't like about this is that we're back in the game of creating our own serialization framework. :(*

> tika configuration overhaul
> ---------------------------
>
>                 Key: TIKA-4243
>                 URL: https://issues.apache.org/jira/browse/TIKA-4243
>             Project: Tika
>          Issue Type: New Feature
>          Components: config
>    Affects Versions: 3.0.0
>            Reporter: Nicholas DiPiazza
>            Priority: Major
>
> In 3.0.0 when dealing with Tika, it would greatly help to have a Typed Configuration schema.
> In 3.x can we remove the old way of doing configs and replace with Json Schema?
> Json Schema can be converted to Pojos using a maven plugin [https://github.com/joelittlejohn/jsonschema2pojo]
> This automatically creates a Java Pojo model we can use for the configs.
> This can allow for the legacy tika-config XML to be read and converted to the new pojos easily using an XML mapper so that users don't have to use JSON configurations yet if they do not want.
> When complete, configurations can be set as XML, JSON or YAML
> tika-config.xml
> tika-config.json
> tika-config.yaml
> Replace all instances of tika config annotations that used the old syntax, and replace with the Pojo model serialized from the xml/json/yaml.

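For illustration only, the "small bit of code" from the comment above might be sketched roughly as follows. Everything here is an assumption rather than Tika API: {{PdfConfigSketch}}, {{ParserSketch}}, {{EmptyParserSketch}}, the {{FACTORIES}} registry and {{build}} are hypothetical stand-ins (the real PDFParserConfig has no Map-based constructor today), JSON values are assumed to arrive already stringified into String[], and the result is a plain map keyed by class name rather than a real ParseContext. The "_class" property is used to pick the concrete implementation while the entry stays keyed by the base class name.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

public class SettingsSketch {

    /** Hypothetical stand-in for PDFParserConfig, built from a properties map. */
    static class PdfConfigSketch {
        int ocrDPI = 300;               // illustrative default for this sketch only
        boolean sortByPosition = false; // illustrative default for this sketch only

        PdfConfigSketch(Map<String, String[]> props) {
            String[] dpi = props.get("ocrDPI");
            if (dpi != null && dpi.length > 0) {
                ocrDPI = Integer.parseInt(dpi[0]);
            }
            String[] sort = props.get("sortByPosition");
            if (sort != null && sort.length > 0) {
                sortByPosition = Boolean.parseBoolean(sort[0]);
            }
        }
    }

    /** Hypothetical stand-ins for Parser and EmptyParser. */
    interface ParserSketch {}
    static class EmptyParserSketch implements ParserSketch {}

    /** Hypothetical registry: class name from the settings -> factory taking the properties. */
    static final Map<String, Function<Map<String, String[]>, Object>> FACTORIES = new HashMap<>();
    static {
        FACTORIES.put("org.apache.tika.parser.pdf.PDFParserConfig", PdfConfigSketch::new);
        FACTORIES.put("org.apache.tika.parser.EmptyParser", props -> new EmptyParserSketch());
    }

    /**
     * Builds one object per settings entry. The entry key is kept as the lookup key
     * (the base class name); an optional "_class" property names the concrete class.
     */
    static Map<String, Object> build(Map<String, Map<String, String[]>> settings) {
        Map<String, Object> context = new LinkedHashMap<>();
        for (Map.Entry<String, Map<String, String[]>> entry : settings.entrySet()) {
            Map<String, String[]> props = entry.getValue();
            String[] impl = props.get("_class");
            String target = (impl != null && impl.length > 0) ? impl[0] : entry.getKey();
            Function<Map<String, String[]>, Object> factory = FACTORIES.get(target);
            if (factory == null) {
                throw new IllegalArgumentException("No factory registered for " + target);
            }
            context.put(entry.getKey(), factory.apply(props));
        }
        return context;
    }

    public static void main(String[] args) {
        // Mirrors the JSON "settings" object from the comment, with stringified values.
        Map<String, String[]> pdf = new LinkedHashMap<>();
        pdf.put("ocrDPI", new String[] {"300"});
        pdf.put("sortByPosition", new String[] {"true"});

        Map<String, String[]> parser = new LinkedHashMap<>();
        parser.put("_class", new String[] {"org.apache.tika.parser.EmptyParser"});

        Map<String, Map<String, String[]>> settings = new LinkedHashMap<>();
        settings.put("org.apache.tika.parser.pdf.PDFParserConfig", pdf);
        settings.put("org.apache.tika.parser.Parser", parser);

        Map<String, Object> context = build(settings);
        PdfConfigSketch cfg =
                (PdfConfigSketch) context.get("org.apache.tika.parser.pdf.PDFParserConfig");
        System.out.println("ocrDPI=" + cfg.ocrDPI + ", sortByPosition=" + cfg.sortByPosition);
        System.out.println(context.get("org.apache.tika.parser.Parser").getClass().getSimpleName());
    }
}
```

The sketch uses an explicit registry instead of reflection so that only known classes can be instantiated from the settings; embedded objects (for example, a list of detectors) are not covered and would need a nested structure beyond Map<String, String[]>.
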
-- This message was sent by Atlassian Jira (v8.20.10#820010)