+1. Many people have struggled to incorporate this separate library into their Spark pipelines.

On Wed, Jul 19, 2023 at 10:53 AM Burak Yavuz <brk...@gmail.com> wrote:

> +1 on adding to Spark. Community involvement will make the XML reader
> better.
>
> Best,
> Burak
>
> On Wed, Jul 19, 2023 at 3:25 AM Martin Andersson <
> martin.anders...@kambi.com> wrote:
>
>> Alright, makes sense to add it then.
>> ------------------------------
>> *From:* Hyukjin Kwon <gurwls...@apache.org>
>> *Sent:* Wednesday, July 19, 2023 11:01
>> *To:* Martin Andersson <martin.anders...@kambi.com>
>> *Cc:* Sandip Agarwala <sandip.agarw...@databricks.com>;
>> dev@spark.apache.org <dev@spark.apache.org>
>> *Subject:* Re: [DISCUSS] SPIP: XML data source support
>>
>> Here are the benefits of having it as a built-in source:
>>
>>    - We can leverage the community to improve the Spark XML (not within
>>    Databricks repositories).
>>    - We can share the same core for XML expressions (e.g., from_xml and
>>    to_xml like from_csv, from_json, etc.).
>>    - It embraces a commonly used data source, just like the existing
>>    built-in data sources we have.
>>    - Users wouldn't have to set the jars or Maven coordinates. For now,
>>    if they have network problems, for example, it is harder to use
>>    them by default.
>>
>> XML is arguably more widely used than CSV, which is already a built-in
>> source, see e.g.,
>> https://insights.stackoverflow.com/trends?tags=xml%2Cjson%2Ccsv
>> and
>> https://www.reddit.com/r/programming/comments/bak5qt/a_comparison_of_serialization_formats_csv_json/
>>
>>
>> On Wed, 19 Jul 2023 at 17:51, Martin Andersson <
>> martin.anders...@kambi.com> wrote:
>>
>> How much of an effort is it to use the spark-xml library today? What's
>> the drawback to keeping this as an external library as-is?
>>
>> Best Regards, Martin
>> ------------------------------
>> *From:* Hyukjin Kwon <gurwls...@apache.org>
>> *Sent:* Wednesday, July 19, 2023 01:27
>> *To:* Sandip Agarwala <sandip.agarw...@databricks.com>
>> *Cc:* dev@spark.apache.org <dev@spark.apache.org>
>> *Subject:* Re: [DISCUSS] SPIP: XML data source support
>>
>> Yeah I support this. XML is a pretty outdated format TBH but still used
>> in many legacy systems. For example, the Wikipedia dump is one case.
>>
>> Even if you look at stats for CSV vs XML vs JSON, some show that
>> XML is used more than CSV.
>>
>> On Wed, Jul 19, 2023 at 12:58 AM Sandip Agarwala <
>> sandip.agarw...@databricks.com> wrote:
>>
>> Dear Spark community,
>>
>> I would like to start a discussion on "XML data source support".
>>
>> XML is a widely used data format. An external spark-xml package (
>> https://github.com/databricks/spark-xml) is available to read and write
>> XML data in Spark. Making spark-xml built-in will provide a better user
>> experience for Spark SQL and Structured Streaming. The proposal is to
>> inline code from the spark-xml package.
>> I am collaborating with Hyukjin Kwon, who is the original author of
>> spark-xml, for this effort.
>>
>> SPIP link:
>> https://docs.google.com/document/d/1ZaOBT4-YFtN58UCx2cdFhlsKbie1ugAn-Fgz_Dddz-Q/edit?usp=sharing
>>
>> JIRA:
>> https://issues.apache.org/jira/browse/SPARK-44265
>>
>> Looking forward to your feedback.
>> Thanks, Sandip