[ 
https://issues.apache.org/jira/browse/TIKA-4243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17851727#comment-17851727
 ] 

Tim Allison edited comment on TIKA-4243 at 6/3/24 5:02 PM:
-----------------------------------------------------------

I spent a bit of time trying to serialize ParseContext, and I now 
remember/newly appreciate what a challenge that is.

For one, everything has to be serializable, as we knew, with Jackson 
annotations or other Jackson-based methods.

I suspect that someone who really understands Jackson could do a good job of 
this. I know the basics, but I am not a Jackson expert.

There are two main challenges: inheritance and embedded objects (as opposed to 
parameterizable primitives).

Inheritance is complicated with Jackson. If we want to support, for example, 
{{parseContext.set(Parser.class, new EmptyParser())}}, we have to store the 
base class name as the key and the instantiated class as the value. I think I 
found out how to do this with Jackson, but it is _messy_ (reference: 
[https://www.baeldung.com/jackson-inheritance#bd-subtype-handling-scenarios]).

We'd want to deal with embedded objects for the obvious use cases of the 
CompositeDetectors, etc. where we want to specify a list of detectors. And, we 
want to be able to cover the cases of setting an object as a parameter – for 
example, setting some of the slightly more complex classes in the 
PDFParserConfig.

I'm wondering if it would be simpler to back off to a Map<String, String[]> 
"properties" kind of thing, where we identify the config class and then 
instantiate it for the ParseContext with those properties. We're currently 
doing something like this in tika-server, where we have custom serialization 
classes for each config we support (PDFParserConfig and 
TesseractOCRParserConfig) based on the HTTP headers. We'd want to extend this 
to handle inheritance and embedded objects...

Something along these lines in JSON:
{code:json}
{
  "settings": {
    "org.apache.tika.parser.pdf.PDFParserConfig": {
      "ocrDPI": 300,
      "sortByPosition": true
    },
    "org.apache.tika.parser.Parser": {
      "_class": "org.apache.tika.parser.EmptyParser"
    }
  }
}
{code}
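To make the {{_class}} idea concrete, here is a minimal sketch of how such an entry might be resolved: the outer key names the base type to register under, and {{_class}} names the implementation to instantiate. Everything here is an assumption for illustration. {{SubtypeSettingsSketch}} and its {{Context}} class are hypothetical stand-ins (the latter for ParseContext), and the settings are assumed to have already been parsed into nested maps.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper for illustration only; none of these names exist in Tika.
class SubtypeSettingsSketch {

    // Minimal stand-in for ParseContext: one instance per registered type.
    static final class Context {
        private final Map<Class<?>, Object> entries = new HashMap<>();

        <T> void set(Class<T> key, T value) {
            entries.put(key, value);
        }

        <T> T get(Class<T> key) {
            return key.cast(entries.get(key));
        }
    }

    // Instantiates implName via its no-arg constructor and registers it under base.
    static <T> void register(Context context, Class<T> base, String implName) {
        try {
            Class<?> impl = Class.forName(implName);
            if (!base.isAssignableFrom(impl)) {
                throw new IllegalArgumentException(implName + " is not a " + base.getName());
            }
            context.set(base, base.cast(impl.getDeclaredConstructor().newInstance()));
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException("Cannot instantiate " + implName, e);
        }
    }

    // Walks the parsed "settings" map and handles only entries that carry "_class".
    static void apply(Context context, Map<String, Map<String, String>> settings) {
        for (Map.Entry<String, Map<String, String>> entry : settings.entrySet()) {
            String implName = entry.getValue().get("_class");
            if (implName == null) {
                continue; // plain config entries would be bound elsewhere
            }
            try {
                register(context, Class.forName(entry.getKey()), implName);
            } catch (ClassNotFoundException e) {
                throw new IllegalArgumentException("Unknown base type " + entry.getKey(), e);
            }
        }
    }

    public static void main(String[] args) {
        // JDK types stand in for Parser/EmptyParser so the sketch runs without Tika.
        Map<String, Map<String, String>> settings = new HashMap<>();
        settings.put("java.util.List", Map.of("_class", "java.util.ArrayList"));
        Context context = new Context();
        apply(context, settings);
        System.out.println(context.get(java.util.List.class).getClass().getName());
    }
}
```

Since this instantiates whatever class name the settings contain, a real version would probably want an allow-list of permitted implementations.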
Then we'd have a small bit of code (I'd hope?) that would take the settings and 
create the config class with the map of its values:

PDFParserConfig pdfParserConfig = new PDFParserConfig(properties); // proposed constructor; properties is a Map<String, String[]>
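
A rough sketch of what that "small bit of code" could look like, using reflection to map each property name onto a matching setter. This is only one possible approach under stated assumptions: {{SettingsBinderSketch}}, {{bind}}, and {{DemoConfig}} are hypothetical names, {{DemoConfig}} merely stands in for a config class such as PDFParserConfig, and only int, boolean, and String setters are handled.

```java
import java.lang.reflect.Method;
import java.util.HashMap;
import java.util.Map;

// Hypothetical binder for illustration only; not an existing Tika class.
class SettingsBinderSketch {

    // Stand-in for a config class with ordinary bean-style setters.
    static class DemoConfig {
        private int ocrDPI = 72;
        private boolean sortByPosition = false;

        public void setOcrDPI(int ocrDPI) { this.ocrDPI = ocrDPI; }
        public void setSortByPosition(boolean sortByPosition) { this.sortByPosition = sortByPosition; }
        public int getOcrDPI() { return ocrDPI; }
        public boolean isSortByPosition() { return sortByPosition; }
    }

    // Creates an instance of type and calls setX(...) for every property "x".
    static <T> T bind(Class<T> type, Map<String, String[]> properties) {
        try {
            T target = type.getDeclaredConstructor().newInstance();
            for (Map.Entry<String, String[]> entry : properties.entrySet()) {
                String name = entry.getKey();
                String setterName = "set" + Character.toUpperCase(name.charAt(0)) + name.substring(1);
                Method setter = findSetter(type, setterName);
                // Only the first value is used here; multi-valued properties are not handled.
                setter.invoke(target, convert(entry.getValue()[0], setter.getParameterTypes()[0]));
            }
            return target;
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException("Cannot bind " + type.getName(), e);
        }
    }

    // Finds a public single-argument method with the given name.
    static Method findSetter(Class<?> type, String setterName) {
        for (Method method : type.getMethods()) {
            if (method.getName().equals(setterName) && method.getParameterCount() == 1) {
                return method;
            }
        }
        throw new IllegalArgumentException("No setter " + setterName + " on " + type.getName());
    }

    // Converts the raw string to the setter's parameter type.
    static Object convert(String raw, Class<?> target) {
        if (target == int.class || target == Integer.class) {
            return Integer.valueOf(raw);
        }
        if (target == boolean.class || target == Boolean.class) {
            return Boolean.valueOf(raw);
        }
        if (target == String.class) {
            return raw;
        }
        throw new IllegalArgumentException("Unsupported parameter type " + target.getName());
    }

    public static void main(String[] args) {
        Map<String, String[]> properties = new HashMap<>();
        properties.put("ocrDPI", new String[] {"300"});
        properties.put("sortByPosition", new String[] {"true"});
        DemoConfig config = bind(DemoConfig.class, properties);
        System.out.println(config.getOcrDPI() + " " + config.isSortByPosition());
    }
}
```

Embedded objects are exactly where this gets harder: {{convert}} would need to recurse into nested maps, which is the start of the homegrown serialization framework.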

 

*What I don't like about this is that we're back in the game of creating our 
own serialization framework. :(*



> tika configuration overhaul
> ---------------------------
>
>                 Key: TIKA-4243
>                 URL: https://issues.apache.org/jira/browse/TIKA-4243
>             Project: Tika
>          Issue Type: New Feature
>          Components: config
>    Affects Versions: 3.0.0
>            Reporter: Nicholas DiPiazza
>            Priority: Major
>
> In 3.0.0, when dealing with Tika, it would greatly help to have a typed 
> configuration schema. 
> In 3.x, can we remove the old way of doing configs and replace it with JSON 
> Schema?
> JSON Schema can be converted to POJOs using a Maven plugin 
> [https://github.com/joelittlejohn/jsonschema2pojo]
> This automatically creates a Java POJO model we can use for the configs. 
> This can allow for the legacy tika-config XML to be read and converted to the 
> new POJOs easily using an XML mapper so that users don't have to use JSON 
> configurations yet if they do not want to.
> When complete, configurations can be set as XML, JSON or YAML:
> tika-config.xml
> tika-config.json
> tika-config.yaml
> Replace all instances of tika config annotations that used the old syntax 
> with the POJO model deserialized from the xml/json/yaml.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
