Probable Bug in Spark 3.3.0
Hi Dev Team,

https://issues.apache.org/jira/browse/SPARK-44884

We have recently upgraded to Spark 3.3.0 on our production Dataproc cluster. We have a lot of downstream applications that rely on the _SUCCESS file. Please let me know if this is a bug, or whether I need any additional configuration to fix this in Spark 3.3.0. Happy to contribute a fix if you can suggest one.

Thanks,
Dipayan

--
With Best Regards,
Dipayan Dev
Author of *Deep Learning with Hadoop <https://www.amazon.com/Deep-Learning-Hadoop-Dipayan-Dev/dp/1787124762>*
M.Tech (AI), IISc, Bangalore
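[Editor's note: for context, the _SUCCESS marker is written by Hadoop's FileOutputCommitter, and its creation is governed by a Hadoop configuration flag. A minimal sketch of where that flag lives — not a confirmed fix for SPARK-44884; it assumes an existing SparkSession `spark` and DataFrame `df`, and the partition column and output path are illustrative:]

// A minimal sketch, not a confirmed fix for SPARK-44884. The _SUCCESS marker
// is emitted by Hadoop's FileOutputCommitter and is controlled by the flag
// below (true by default), so it is worth checking it has not been overridden.
// Assumes an existing SparkSession `spark` and DataFrame `df`.
val hadoopConf = spark.sparkContext.hadoopConfiguration
hadoopConf.setBoolean("mapreduce.fileoutputcommitter.marksuccessfuljobs", true)

df.write
  .mode("overwrite")
  .partitionBy("date")              // assumed partition column
  .parquet("gs://my-bucket/output") // assumed Dataproc/GCS output path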
Re: [VOTE] Release Apache Spark 3.5.0 (RC3)
Can we fix this bug in Spark 3.5.0? https://issues.apache.org/jira/browse/SPARK-44884

On Wed, Aug 30, 2023 at 11:51 AM Sean Owen wrote:

> It looks good, except that I'm getting errors running the Spark Connect
> tests at the end (Java 17, Scala 2.13). It looks like I missed something
> necessary to build; is anyone else getting this?
>
> [ERROR] [Error]
> /tmp/spark-3.5.0/connector/connect/server/target/generated-test-sources/protobuf/java/org/apache/spark/sql/protobuf/protos/TestProto.java:9:46:
> error: package org.sparkproject.spark_protobuf.protobuf does not exist
>
> On Tue, Aug 29, 2023 at 11:25 AM Yuanjian Li wrote:
>
>> Please vote on releasing the following candidate (RC3) as Apache Spark
>> version 3.5.0.
>>
>> The vote is open until 11:59 pm Pacific time, Aug 31st, and passes if a
>> majority of +1 PMC votes are cast, with a minimum of 3 +1 votes.
>>
>> [ ] +1 Release this package as Apache Spark 3.5.0
>> [ ] -1 Do not release this package because ...
>>
>> To learn more about Apache Spark, please see http://spark.apache.org/
>>
>> The tag to be voted on is v3.5.0-rc3 (commit
>> 9f137aa4dc43398aafa0c3e035ed3174182d7d6c):
>> https://github.com/apache/spark/tree/v3.5.0-rc3
>>
>> The release files, including signatures, digests, etc., can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v3.5.0-rc3-bin/
>>
>> Signatures used for Spark RCs can be found in this file:
>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1447
>>
>> The documentation corresponding to this release can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v3.5.0-rc3-docs/
>>
>> The list of bug fixes going into 3.5.0 can be found at the following URL:
>> https://issues.apache.org/jira/projects/SPARK/versions/12352848
>>
>> This release is using the release script of the tag v3.5.0-rc3.
>>
>> FAQ
>>
>> =========================
>> How can I help test this release?
>> =========================
>>
>> If you are a Spark user, you can help us test this release by taking
>> an existing Spark workload and running it on this release candidate, then
>> reporting any regressions.
>>
>> If you're working in PySpark, you can set up a virtual env and install
>> the current RC and see if anything important breaks. In Java/Scala,
>> you can add the staging repository to your project's resolvers and test
>> with the RC (make sure to clean up the artifact cache before/after so
>> you don't end up building with an out-of-date RC going forward).
>>
>> ===========================================
>> What should happen to JIRA tickets still targeting 3.5.0?
>> ===========================================
>>
>> The current list of open tickets targeted at 3.5.0 can be found at
>> https://issues.apache.org/jira/projects/SPARK by searching for "Target
>> Version/s" = 3.5.0.
>>
>> Committers should look at those and triage. Extremely important bug
>> fixes, documentation, and API tweaks that impact compatibility should
>> be worked on immediately. Everything else, please retarget to an
>> appropriate release.
>>
>> ==================
>> But my bug isn't fixed?
>> ==================
>>
>> In order to make timely releases, we will typically not hold the
>> release unless the bug in question is a regression from the previous
>> release. That being said, if there is something which is a regression
>> that has not been correctly targeted, please ping me or a committer to
>> help target the issue.
>>
>> Thanks,
>>
>> Yuanjian Li
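[Editor's note: for the Java/Scala testing route the FAQ above describes, a minimal hypothetical build.sbt fragment — the resolver URL comes from the vote email itself; the spark-sql module is just an illustrative choice of dependency:]

// Hypothetical build.sbt fragment for testing the 3.5.0 RC3 staging artifacts.
// The resolver URL is from the vote email; the chosen module is illustrative.
resolvers += "Apache Spark 3.5.0 RC3 staging" at
  "https://repository.apache.org/content/repositories/orgapachespark-1447"

libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.5.0" % Provided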
Feature to restart Spark job from previous failure point
Hi Team,

One of the biggest pain points we're facing is this: Spark reads upstream partitioned data, and while an action is running, the upstream gets refreshed and the application fails with a 'File does not exist' error. The job may already have spent a considerable amount of time by then, so re-running the entire application is undesirable.

I know the general solution is to fix how the upstream manages its data, but is there a way to tackle this problem from the Spark application side? One approach I was thinking of is to at least save some state of the operations the Spark job has completed up to that point, and on a retry, resume from that point.

With Best Regards,
Dipayan Dev
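[Editor's note: Spark has no built-in resume-from-failure for batch jobs, but Dataset.checkpoint() can approximate the idea by eagerly materializing an expensive intermediate result and truncating its lineage, so a retry of later stages does not re-read the (possibly refreshed) upstream files. A minimal sketch, with all paths and column names assumed:]

// A minimal sketch, not a built-in resume feature: checkpoint() eagerly
// materializes the intermediate result to the checkpoint directory and cuts
// the lineage back to the upstream files, so subsequent actions read the
// checkpointed copy. All paths and column names are illustrative.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("checkpointed-job").getOrCreate()
spark.sparkContext.setCheckpointDir("gs://my-bucket/checkpoints") // assumed

val upstream  = spark.read.parquet("gs://upstream/data")          // assumed
val expensive = upstream.groupBy("key").count()                   // stands in for the costly stages

val pinned = expensive.checkpoint() // eager by default
pinned.write.mode("overwrite").parquet("gs://my-bucket/output")   // assumed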
Re: Elasticsearch support for Spark 3.x
++ Dev

On Thu, 7 Sep 2023 at 10:22 PM, Dipayan Dev wrote:

> Hi,
>
> Can you please elaborate on your last response? I don't have any external
> dependencies added, and I just updated the Spark version as mentioned below.
>
> Can someone help me with this?
>
> On Fri, 1 Sep 2023 at 5:58 PM, Koert Kuipers wrote:
>
>> Could the provided scope be the issue?
>>
>> On Sun, Aug 27, 2023 at 2:58 PM Dipayan Dev wrote:
>>
>>> Using the following dependency for Spark 3 in the POM file (my Scala
>>> version is 2.12.14):
>>>
>>> <dependency>
>>>     <groupId>org.elasticsearch</groupId>
>>>     <artifactId>elasticsearch-spark-30_2.12</artifactId>
>>>     <version>7.12.0</version>
>>>     <scope>provided</scope>
>>> </dependency>
>>>
>>> The code throws an error at this line:
>>> df.write.format("es").mode("overwrite").options(elasticOptions).save("index_name")
>>>
>>> The same code is working with Spark 2.4.0 and the following dependency:
>>>
>>> <dependency>
>>>     <groupId>org.elasticsearch</groupId>
>>>     <artifactId>elasticsearch-spark-20_2.12</artifactId>
>>>     <version>7.12.0</version>
>>> </dependency>
>>>
>>> On Mon, 28 Aug 2023 at 12:17 AM, Holden Karau wrote:
>>>
>>>> What's the version of the ES connector you are using?
>>>>
>>>> On Sat, Aug 26, 2023 at 10:17 AM Dipayan Dev wrote:
>>>>
>>>>> Hi All,
>>>>>
>>>>> We're using Spark 2.4.x to write a dataframe into an Elasticsearch
>>>>> index. As we're upgrading to Spark 3.3.0, it is throwing this error:
>>>>>
>>>>> Caused by: java.lang.ClassNotFoundException: es.DefaultSource
>>>>> at java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:476)
>>>>> at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:589)
>>>>> at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:522)
>>>>>
>>>>> Looking at a few responses on Stack Overflow
>>>>> <https://stackoverflow.com/a/66452149>, it seems this is not yet
>>>>> supported by elasticsearch-hadoop.
>>>>>
>>>>> Does anyone have experience with this, or has anyone faced/resolved
>>>>> this issue in Spark 3?
>>>>>
>>>>> Thanks in advance!
>>>>>
>>>>> Regards,
>>>>> Dipayan
>>>>>
>>>> --
>>>> Twitter: https://twitter.com/holdenkarau
>>>> Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
>>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
Re: Elasticsearch support for Spark 3.x
Let me try that and get back. Just wondering: was there a change in the way we pass the format to the connector from Spark 2 to 3?

On Fri, 8 Sep 2023 at 12:35 PM, Alfie Davidson wrote:

> I am pretty certain you need to change the write.format from "es" to
> "org.elasticsearch.spark.sql"
>
> Sent from my iPhone
Re: Elasticsearch support for Spark 3.x
@Alfie Davidson: Awesome, it worked with "org.elasticsearch.spark.sql". But as soon as I switched back to elasticsearch-spark-20_2.12, "es" also worked.

On Fri, Sep 8, 2023 at 12:45 PM Dipayan Dev wrote:

> Let me try that and get back. Just wondering: was there a change in the
> way we pass the format to the connector from Spark 2 to 3?
>
> On Fri, 8 Sep 2023 at 12:35 PM, Alfie Davidson wrote:
>
>> I am pretty certain you need to change the write.format from "es" to
>> "org.elasticsearch.spark.sql"
>>
>> Sent from my iPhone
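[Editor's note: to summarize the thread's resolution — with elasticsearch-spark-30_2.12 the short "es" alias did not resolve, while the fully qualified data source name worked. A hedged sketch of the write the thread reports working; `elasticOptions` is an assumed placeholder map and the index name is illustrative:]

// Sketch of the write this thread reports working on Spark 3.x with
// elasticsearch-spark-30_2.12. `elasticOptions` is an assumed placeholder
// (e.g. "es.nodes", "es.port"); "index_name" is illustrative.
df.write
  .format("org.elasticsearch.spark.sql")
  .mode("overwrite")
  .options(elasticOptions)
  .save("index_name")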
[no subject]
Unsubscribe

--
With Best Regards,
Dipayan Dev
Author of *Deep Learning with Hadoop <https://www.amazon.com/Deep-Learning-Hadoop-Dipayan-Dev/dp/1787124762>*
M.Tech (AI), IISc, Bangalore