Re: [DISCUSS] SPIP: XML data source support

Maciej Wed, 19 Jul 2023 10:29:59 -0700

That's a great idea, as long as we can keep additional dependencies under control.


Best regards,
Maciej Szymkiewicz


Web: https://zero323.net
PGP: A30CEF0C31A501EC

On 7/19/23 18:22, Franco Patano wrote:

+1

Many people have struggled with incorporating this separate library into their Spark pipelines.

On Wed, Jul 19, 2023 at 10:53 AM Burak Yavuz <brk...@gmail.com> wrote:

+1 on adding to Spark. Community involvement will make the XML
reader better.

Best,
Burak

On Wed, Jul 19, 2023 at 3:25 AM Martin Andersson
<martin.anders...@kambi.com> wrote:

Alright, makes sense to add it then.
------------------------------------------------------------------------
*From:* Hyukjin Kwon <gurwls...@apache.org>
*Sent:* Wednesday, July 19, 2023 11:01
*To:* Martin Andersson <martin.anders...@kambi.com>
*Cc:* Sandip Agarwala <sandip.agarw...@databricks.com>;
dev@spark.apache.org <dev@spark.apache.org>
*Subject:* Re: [DISCUSS] SPIP: XML data source support

Here are the benefits of having it as a built-in source:

* We can leverage the community to improve Spark XML
(rather than only within Databricks repositories).
* We can share the same core for XML expressions (e.g.,
from_xml and to_xml, like from_csv, from_json, etc.).
* It embraces a commonly used data source, just
like the existing built-in data sources we have.
* Users wouldn't have to set the jars or Maven coordinates.
For now, if they have network problems, etc., the
package is harder to use out of the box.

XML is arguably more widely used than CSV, which is already
a built-in source; see, e.g.,
https://insights.stackoverflow.com/trends?tags=xml%2Cjson%2Ccsv
and

https://www.reddit.com/r/programming/comments/bak5qt/a_comparison_of_serialization_formats_csv_json/

On Wed, 19 Jul 2023 at 17:51, Martin Andersson
<martin.anders...@kambi.com> wrote:

How much of an effort is it to use the spark-xml library
today? What's the drawback to keeping this as an external
library as-is?

Best Regards, Martin

------------------------------------------------------------------------
*From:* Hyukjin Kwon <gurwls...@apache.org>
*Sent:* Wednesday, July 19, 2023 01:27
*To:* Sandip Agarwala <sandip.agarw...@databricks.com>
*Cc:* dev@spark.apache.org <dev@spark.apache.org>
*Subject:* Re: [DISCUSS] SPIP: XML data source support

Yeah, I support this. XML is a pretty outdated format TBH, but
it is still used in many legacy systems. The Wikipedia
dump is one example.

Even when you look at stats for CSV vs. XML vs. JSON,
some show that XML is used more than CSV.

On Wed, Jul 19, 2023 at 12:58 AM Sandip Agarwala
<sandip.agarw...@databricks.com> wrote:

Dear Spark community,

I would like to start a discussion on "XML data source
support".

XML is a widely used data format. An external
spark-xml package
(https://github.com/databricks/spark-xml) is available
to read and write XML data in Spark. Making spark-xml
built-in will provide a better user experience for
Spark SQL and Structured Streaming. The proposal is to
inline code from the spark-xml package.
I am collaborating with Hyukjin Kwon, who is the
original author of spark-xml, for this effort.

SPIP link:

https://docs.google.com/document/d/1ZaOBT4-YFtN58UCx2cdFhlsKbie1ugAn-Fgz_Dddz-Q/edit?usp=sharing

JIRA:
https://issues.apache.org/jira/browse/SPARK-44265

Looking forward to your feedback.
Thanks, Sandip
