[
https://issues.apache.org/jira/browse/NIFI-7790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17224872#comment-17224872
]
Pierre Gramme commented on NIFI-7790:
-------------------------------------
Thanks for the detailed feedback!
I agree that providing a schema is definitely the most robust option. I was
actually hoping to get a first version of the schema inferred from the XML
records, which I would then refine manually.
This was just a minimal reproducible example. In my use case, I have an XSD
schema for the input XML, but no Avro schema. The XSD is quite big and complex,
involving enums, min/max values, abstract classes, etc. Converting it to an
Avro schema by hand therefore seems a bad option: time-consuming initially and
hard to maintain later.
Comments under your [blog
post|https://pierrevillard.com/2018/06/28/nifi-1-7-xml-reader-writer-and-forkrecord-processor/]
suggested that an XSD-based parser might be on its way. But after reading the
comments in NIFI-4185, I don't think it is possible to specify the input schema
as XSD, is it?
If not, I will investigate the following method, which uses JAXB to convert
XSD -> Java classes -> Avro schema (code under the Apache 2 licence):
[https://github.com/mit-ll/xml-avro-converter/blob/master/doc/tutorial.md#conversion-of-xml-schemas-and-data]
Or would you suggest some other automated way of converting the XSD to Avro?
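
For illustration only, here is a minimal sketch of what an automated XSD -> Avro
conversion could look like. It is not the JAXB route linked above and not
anything NiFi provides: it is a simplified stand-in using only the Python
standard library. The names element_to_avro, xsd_to_avro, XSD_TO_AVRO and
EXAMPLE_XSD are hypothetical, the "xs:" prefix is assumed, and only nested
xs:sequence elements with a few built-in types are handled. Enums, min/max
values and abstract types are out of scope here.

```python
import json
import xml.etree.ElementTree as ET

XS = "{http://www.w3.org/2001/XMLSchema}"

# Assumed mapping from a few XSD built-in types to Avro primitives.
XSD_TO_AVRO = {"xs:string": "string", "xs:int": "int", "xs:long": "long",
               "xs:boolean": "boolean", "xs:double": "double"}


def element_to_avro(element, path=()):
    """Convert one xs:element into an Avro type (primitive name or record dict)."""
    complex_type = element.find(XS + "complexType")
    if complex_type is None:
        # Simple element: unmapped or missing types fall back to string.
        return XSD_TO_AVRO.get(element.get("type"), "string")
    path = path + (element.get("name"),)
    fields = []
    for child in complex_type.findall(XS + "sequence/" + XS + "element"):
        avro_type = element_to_avro(child, path)
        if child.get("maxOccurs") == "unbounded":
            avro_type = {"type": "array", "items": avro_type}
        fields.append({"name": child.get("name"), "type": avro_type})
    # The record name is built from the element path, so two <item>
    # elements under different parents get distinct Avro record names.
    return {"type": "record", "name": "_".join(path), "fields": fields}


def xsd_to_avro(xsd_text):
    """Convert the first top-level xs:element of an XSD into an Avro schema dict."""
    root = ET.fromstring(xsd_text)
    return element_to_avro(root.find(XS + "element"))


# Hand-written XSD for the failing document from the issue description.
EXAMPLE_XSD = """
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
 <xs:element name="root"><xs:complexType><xs:sequence>
  <xs:element name="authors"><xs:complexType><xs:sequence>
   <xs:element name="item" maxOccurs="unbounded"><xs:complexType><xs:sequence>
    <xs:element name="name" type="xs:string"/>
   </xs:sequence></xs:complexType></xs:element>
  </xs:sequence></xs:complexType></xs:element>
  <xs:element name="editors"><xs:complexType><xs:sequence>
   <xs:element name="item" maxOccurs="unbounded"><xs:complexType><xs:sequence>
    <xs:element name="commercialName" type="xs:string"/>
   </xs:sequence></xs:complexType></xs:element>
  </xs:sequence></xs:complexType></xs:element>
 </xs:sequence></xs:complexType></xs:element>
</xs:schema>
"""

AVRO_SCHEMA_JSON = json.dumps(xsd_to_avro(EXAMPLE_XSD), indent=2)
```

Printing AVRO_SCHEMA_JSON gives a nested Avro record schema in which the two
item records are named root_authors_item and root_editors_item. Naming records
by path is a design choice of this sketch, made to avoid a name clash between
the two item types.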
> XML record reader - failure on well-formed XML
> ----------------------------------------------
>
> Key: NIFI-7790
> URL: https://issues.apache.org/jira/browse/NIFI-7790
> Project: Apache NiFi
> Issue Type: Bug
> Components: Extensions
> Affects Versions: 1.11.4
> Reporter: Pierre Gramme
> Priority: Major
> Labels: records, xml
> Attachments: bug-parse-xml.xml
>
>
> I am using ConvertRecord in order to parse XML flowfiles to Avro, with the
> Infer Schema strategy. Some input flowfiles are sent to the failure output
> queue whereas they are well-formed:
> {code:xml}
> <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
> <root>
>   <authors>
>     <item>
>       <name>Neil Gaiman</name>
>     </item>
>   </authors>
>   <editors>
>     <item>
>       <commercialName>Hachette</commercialName>
>     </item>
>   </editors>
> </root>
> {code}
> Note the use of authors/item/name on one side, and
> editors/item/commercialName on the other side.
> On the other hand, this gets correctly parsed:
> {code:xml}
> <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
> <root>
>   <authors>
>     <item>
>       <name>Neil Gaiman</name>
>     </item>
>   </authors>
>   <editors>
>     <item>
>       <name>Hachette</name>
>     </item>
>   </editors>
> </root>
> {code}
> See the attached template for a minimal reproducible example.
>
> My interpretation is that the failure in the first case is due to two
> independent XML node types that share the same name (<item> in this case) but
> differ in type and occur in parents of different types. In the second case,
> both <item> elements actually have the same node type. I didn't use any
> Schema Inference Cache, so both item types should be inferred independently.
> Since the first document is legal XML (an XSD could be written for it) and
> can also be represented in Avro, its conversion shouldn't fail.
> I'll be happy to provide more details if needed.
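
The interpretation in the quoted description can be illustrated with a small
sketch. It does not reproduce NiFi's schema inference code and makes no claim
about how NiFi identifies node types internally. It only shows, under that
assumption, what happens when element shapes are grouped by tag name alone
versus by full path. The names FAILING_DOC and fields_by are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# The failing document from the description, without the XML declaration.
FAILING_DOC = """<root>
  <authors><item><name>Neil Gaiman</name></item></authors>
  <editors><item><commercialName>Hachette</commercialName></item></editors>
</root>"""


def fields_by(doc, key):
    """Collect the child tag names of every non-leaf element, grouped by key(path)."""
    found = defaultdict(set)

    def walk(element, path):
        path = path + (element.tag,)
        for child in element:
            found[key(path)].add(child.tag)
            walk(child, path)

    walk(ET.fromstring(doc), ())
    return dict(found)


# Keyed by tag name only: both <item> shapes collapse into one entry.
by_name = fields_by(FAILING_DOC, key=lambda path: path[-1])

# Keyed by full path: each <item> keeps its own field set.
by_path = fields_by(FAILING_DOC, key=lambda path: "/".join(path))
```

With name-only grouping, by_name["item"] mixes name and commercialName, whereas
by_path keeps root/authors/item and root/editors/item separate, which matches
the expectation that both item types can be inferred independently.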