Re: [DISCUSS] SPIP: XML data source support

Franco Patano Wed, 19 Jul 2023 09:22:31 -0700

+1

Many people have struggled with incorporating this separate library into
their Spark pipelines.


On Wed, Jul 19, 2023 at 10:53 AM Burak Yavuz <brk...@gmail.com> wrote:

> +1 on adding to Spark. Community involvement will make the XML reader
> better.
>
> Best,
> Burak
>
> On Wed, Jul 19, 2023 at 3:25 AM Martin Andersson <
> martin.anders...@kambi.com> wrote:
>
>> Alright, makes sense to add it then.
>> ------------------------------
>> *From:* Hyukjin Kwon <gurwls...@apache.org>
>> *Sent:* Wednesday, July 19, 2023 11:01
>> *To:* Martin Andersson <martin.anders...@kambi.com>
>> *Cc:* Sandip Agarwala <sandip.agarw...@databricks.com>;
>> dev@spark.apache.org <dev@spark.apache.org>
>> *Subject:* Re: [DISCUSS] SPIP: XML data source support
>>
>>
>> Here are the benefits of having it as a built-in source:
>>
>>    - We can leverage the community to improve Spark XML (not within
>>    Databricks repositories).
>>    - We can share the same core for XML expressions (e.g., from_xml and
>>    to_xml like from_csv, from_json, etc.).
>>    - It embraces a commonly used data source, just like the existing
>>    built-in data sources we have.
>>    - Users wouldn't have to set the jars or Maven coordinates. For now,
>>    if they have network problems, etc., it is harder to use the package
>>    by default.
>>
>> XML is arguably more widely used than CSV, which is already a built-in source,
>> see e.g., https://insights.stackoverflow.com/trends?tags=xml%2Cjson%2Ccsv
>> and
>> https://www.reddit.com/r/programming/comments/bak5qt/a_comparison_of_serialization_formats_csv_json/
>>
>>
>> On Wed, 19 Jul 2023 at 17:51, Martin Andersson <
>> martin.anders...@kambi.com> wrote:
>>
>> How much of an effort is it to use the spark-xml library today? What's
>> the drawback to keeping this as an external library as-is?
>>
>> Best Regards, Martin
>> ------------------------------
>> *From:* Hyukjin Kwon <gurwls...@apache.org>
>> *Sent:* Wednesday, July 19, 2023 01:27
>> *To:* Sandip Agarwala <sandip.agarw...@databricks.com>
>> *Cc:* dev@spark.apache.org <dev@spark.apache.org>
>> *Subject:* Re: [DISCUSS] SPIP: XML data source support
>>
>>
>> Yeah, I support this. XML is a pretty outdated format TBH, but it is still
>> used in many legacy systems. For example, the Wikipedia dump is one case.
>>
>> Even when you look at stats for CSV vs XML vs JSON, some show that
>> XML is used more than CSV.
>>
>> On Wed, Jul 19, 2023 at 12:58 AM Sandip Agarwala <
>> sandip.agarw...@databricks.com> wrote:
>>
>> Dear Spark community,
>>
>> I would like to start a discussion on "XML data source support".
>>
>> XML is a widely used data format. An external spark-xml package (
>> https://github.com/databricks/spark-xml) is available to read and write
>> XML data in Spark. Making spark-xml built-in will provide a better user
>> experience for Spark SQL and Structured Streaming. The proposal is to
>> inline code from the spark-xml package.
>> I am collaborating with Hyukjin Kwon, who is the original author of
>> spark-xml, for this effort.
>>
>> SPIP link:
>>
>> https://docs.google.com/document/d/1ZaOBT4-YFtN58UCx2cdFhlsKbie1ugAn-Fgz_Dddz-Q/edit?usp=sharing
>>
>> JIRA:
>> https://issues.apache.org/jira/browse/SPARK-44265
>>
>> Looking forward to your feedback.
>> Thanks, Sandip
>>
>>
