That's a great idea, as long as we can keep additional dependencies under control.

Best regards,
Maciej Szymkiewicz

Web: https://zero323.net
PGP: A30CEF0C31A501EC

On 7/19/23 18:22, Franco Patano wrote:
+1

Many people have struggled with incorporating this separate library into their Spark pipelines.

On Wed, Jul 19, 2023 at 10:53 AM Burak Yavuz <brk...@gmail.com> wrote:

    +1 on adding to Spark. Community involvement will make the XML
    reader better.

    Best,
    Burak

    On Wed, Jul 19, 2023 at 3:25 AM Martin Andersson
    <martin.anders...@kambi.com> wrote:

        Alright, makes sense to add it then.
        ------------------------------------------------------------------------
        *From:* Hyukjin Kwon <gurwls...@apache.org>
        *Sent:* Wednesday, July 19, 2023 11:01
        *To:* Martin Andersson <martin.anders...@kambi.com>
        *Cc:* Sandip Agarwala <sandip.agarw...@databricks.com>;
        dev@spark.apache.org <dev@spark.apache.org>
        *Subject:* Re: [DISCUSS] SPIP: XML data source support

        Here are the benefits of having it as a built-in source:

          * We can leverage the community to improve Spark XML
            (rather than only within Databricks repositories).
          * We can share the same core for XML expressions (e.g.,
            from_xml and to_xml, like from_csv, from_json, etc.).
          * It embraces a commonly used data source, just like the
            existing built-in data sources we have.
          * Users wouldn't have to set the jars or Maven coordinates;
            for now, if they have network problems, etc., the
            package is harder to use by default.

        XML is arguably more widely used than CSV, which is already a
        built-in source; see, e.g.,
        https://insights.stackoverflow.com/trends?tags=xml%2Cjson%2Ccsv
        and
        https://www.reddit.com/r/programming/comments/bak5qt/a_comparison_of_serialization_formats_csv_json/


        On Wed, 19 Jul 2023 at 17:51, Martin Andersson
        <martin.anders...@kambi.com> wrote:

            How much of an effort is it to use the spark-xml library
            today? What's the drawback to keeping this as an external
            library as-is?

            Best Regards, Martin
            ------------------------------------------------------------------------
            *From:* Hyukjin Kwon <gurwls...@apache.org>
            *Sent:* Wednesday, July 19, 2023 01:27
            *To:* Sandip Agarwala <sandip.agarw...@databricks.com>
            *Cc:* dev@spark.apache.org <dev@spark.apache.org>
            *Subject:* Re: [DISCUSS] SPIP: XML data source support

            Yeah, I support this. XML is a pretty outdated format TBH,
            but it is still used in many legacy systems; the Wikipedia
            dump is one example.

            Even if you look at usage stats for CSV vs. XML vs. JSON,
            some show that XML is used more than CSV.

            On Wed, Jul 19, 2023 at 12:58 AM Sandip Agarwala
            <sandip.agarw...@databricks.com> wrote:

                Dear Spark community,

                I would like to start a discussion on "XML data source
                support".

                XML is a widely used data format. An external
                spark-xml package
                (https://github.com/databricks/spark-xml) is available
                to read and write XML data in Spark. Making spark-xml
                built-in will provide a better user experience for
                Spark SQL and Structured Streaming. The proposal is to
                inline code from the spark-xml package.
                I am collaborating with Hyukjin Kwon, who is the
                original author of spark-xml, for this effort.

                SPIP link:
                https://docs.google.com/document/d/1ZaOBT4-YFtN58UCx2cdFhlsKbie1ugAn-Fgz_Dddz-Q/edit?usp=sharing

                JIRA:
                https://issues.apache.org/jira/browse/SPARK-44265

                Looking forward to your feedback.
                Thanks, Sandip

