Probable Bug in Spark 3.3.0

2023-08-20 Thread Dipayan Dev
Hi Dev Team,

https://issues.apache.org/jira/browse/SPARK-44884


We have recently upgraded to Spark 3.3.0 in our production Dataproc environment.
We have a lot of downstream applications that rely on the _SUCCESS file.

Please let me know whether this is a bug, or whether I need any additional
configuration to fix this in Spark 3.3.0.

Happy to contribute a fix if you can suggest an approach.
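For anyone hitting the same symptom, a minimal sketch of the relevant knob (an assumption on my part, not a confirmed root cause of SPARK-44884): the _SUCCESS marker is written by Hadoop's FileOutputCommitter and is gated by `mapreduce.fileoutputcommitter.marksuccessfuljobs` (true by default), which Spark forwards via the `spark.hadoop.` prefix. This requires a Spark runtime to try.

```scala
// Sketch only: verify the marker flag is still enabled after the upgrade.
// Spark forwards `spark.hadoop.*` settings into the Hadoop configuration
// used by the output committer, which writes _SUCCESS on successful commit.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("success-marker-check")
  .config("spark.hadoop.mapreduce.fileoutputcommitter.marksuccessfuljobs", "true")
  .getOrCreate()
```

If the flag is set and the marker is still missing, that would support the bug hypothesis in the JIRA above rather than a configuration issue.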

Thanks
Dipayan


-- 



With Best Regards,

Dipayan Dev
Author of *Deep Learning with Hadoop
<https://www.amazon.com/Deep-Learning-Hadoop-Dipayan-Dev/dp/1787124762>*
M.Tech (AI), IISc, Bangalore


Re: [VOTE] Release Apache Spark 3.5.0 (RC3)

2023-08-29 Thread Dipayan Dev
Can we fix this bug in Spark 3.5.0?
https://issues.apache.org/jira/browse/SPARK-44884

On Wed, Aug 30, 2023 at 11:51 AM Sean Owen  wrote:

> It looks good, except that I'm getting errors running the Spark Connect
> tests at the end (Java 17, Scala 2.13). It looks like I missed something
> necessary to build; is anyone else getting this?
>
> [ERROR] [Error]
> /tmp/spark-3.5.0/connector/connect/server/target/generated-test-sources/protobuf/java/org/apache/spark/sql/protobuf/protos/TestProto.java:9:46:
>  error: package org.sparkproject.spark_protobuf.protobuf does not exist
>
> On Tue, Aug 29, 2023 at 11:25 AM Yuanjian Li 
> wrote:
>
>> Please vote on releasing the following candidate (RC3) as Apache Spark
>> version 3.5.0.
>>
>> The vote is open until 11:59pm Pacific time Aug 31st and passes if a
>> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>>
>> [ ] +1 Release this package as Apache Spark 3.5.0
>>
>> [ ] -1 Do not release this package because ...
>>
>> To learn more about Apache Spark, please see http://spark.apache.org/
>>
>> The tag to be voted on is v3.5.0-rc3 (commit
>> 9f137aa4dc43398aafa0c3e035ed3174182d7d6c):
>>
>> https://github.com/apache/spark/tree/v3.5.0-rc3
>>
>> The release files, including signatures, digests, etc. can be found at:
>>
>> https://dist.apache.org/repos/dist/dev/spark/v3.5.0-rc3-bin/
>>
>> Signatures used for Spark RCs can be found in this file:
>>
>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>
>> The staging repository for this release can be found at:
>>
>> https://repository.apache.org/content/repositories/orgapachespark-1447
>>
>> The documentation corresponding to this release can be found at:
>>
>> https://dist.apache.org/repos/dist/dev/spark/v3.5.0-rc3-docs/
>>
>> The list of bug fixes going into 3.5.0 can be found at the following URL:
>>
>> https://issues.apache.org/jira/projects/SPARK/versions/12352848
>>
>> This release is using the release script of the tag v3.5.0-rc3.
>>
>>
>> FAQ
>>
>> =
>>
>> How can I help test this release?
>>
>> =
>>
>> If you are a Spark user, you can help us test this release by taking
>> an existing Spark workload, running it on this release candidate, and
>> reporting any regressions.
>>
>> If you're working in PySpark, you can set up a virtual env, install the
>> current RC, and see if anything important breaks. In Java/Scala, you can
>> add the staging repository to your project's resolvers and test with the
>> RC (make sure to clean up the artifact cache before/after so you don't
>> end up building with an out-of-date RC going forward).
>>
>> ===
>>
>> What should happen to JIRA tickets still targeting 3.5.0?
>>
>> ===
>>
>> The current list of open tickets targeted at 3.5.0 can be found at:
>>
>> https://issues.apache.org/jira/projects/SPARK and search for "Target
>> Version/s" = 3.5.0
>>
>> Committers should look at those and triage. Extremely important bug
>>
>> fixes, documentation, and API tweaks that impact compatibility should
>>
>> be worked on immediately. Everything else please retarget to an
>>
>> appropriate release.
>>
>> ==
>>
>> But my bug isn't fixed?
>>
>> ==
>>
>> In order to make timely releases, we will typically not hold the
>>
>> release unless the bug in question is a regression from the previous
>>
>> release. That being said, if there is something which is a regression
>>
>> that has not been correctly targeted please ping me or a committer to
>>
>> help target the issue.
>>
>> Thanks,
>>
>> Yuanjian Li
>>
>
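To make the Java/Scala resolver step in the vote email above concrete, here is a minimal Maven fragment. This is a sketch: the repository `id` is arbitrary, and the URL is the staging repository listed in the email.

```xml
<!-- Add the RC staging repository to a Maven project's resolvers. -->
<repositories>
  <repository>
    <id>spark-rc-staging</id> <!-- arbitrary id, assumed -->
    <url>https://repository.apache.org/content/repositories/orgapachespark-1447</url>
  </repository>
</repositories>
```

Remember to purge the local artifact cache (e.g. the relevant `~/.m2/repository` entries) before and after testing, as the email advises.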


Feature to restart Spark job from previous failure point

2023-09-04 Thread Dipayan Dev
Hi Team,

One of the biggest pain points we're facing is that when Spark reads upstream
partition data and, during an action, the upstream gets refreshed, the
application fails with a 'File not exists' error. The job may have already
spent a considerable amount of time, and re-running the entire application
is undesirable.

I know the general solution is to control how the upstream manages its data,
but is there a way to tackle this problem from the Spark application side?
One approach I was considering is to save some state of the operations the
Spark job has completed up to that point, and on a retry, resume from there.
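The save-state-and-resume idea above can be approximated today at the application level. A sketch, with hypothetical paths and helper names (this is not a built-in Spark feature): persist each expensive stage to a stable location, and on retry skip any stage whose output already exists. It requires a Spark runtime to try.

```scala
// Manual resume pattern (sketch): each stage's output is written to a stable
// path; a retry reuses it instead of recomputing. Eager checkpoint() also
// truncates lineage, but writes to spark.checkpoint.dir rather than a
// reusable, application-chosen location.
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.{DataFrame, SparkSession}

def computeOrResume(spark: SparkSession, stagePath: String)
                   (compute: => DataFrame): DataFrame = {
  val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
  if (fs.exists(new Path(stagePath, "_SUCCESS"))) {
    spark.read.parquet(stagePath)                    // resume: reuse prior output
  } else {
    compute.write.mode("overwrite").parquet(stagePath) // save state for retries
    spark.read.parquet(stagePath)
  }
}
```

The trade-off is extra I/O on the happy path in exchange for not losing completed work when an upstream refresh kills a long-running action.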



With Best Regards,

Dipayan Dev


Re: Elasticsearch support for Spark 3.x

2023-09-07 Thread Dipayan Dev
++ Dev

On Thu, 7 Sep 2023 at 10:22 PM, Dipayan Dev  wrote:

> Hi,
>
> Can you please elaborate on your last response? I don’t have any external
> dependencies added; I just updated the Spark version as mentioned below.
>
> Can someone help me with this?
>
> On Fri, 1 Sep 2023 at 5:58 PM, Koert Kuipers  wrote:
>
>> could the provided scope be the issue?
>>
>> On Sun, Aug 27, 2023 at 2:58 PM Dipayan Dev 
>> wrote:
>>
>>> Using the following dependency for Spark 3 in the POM file (my Scala
>>> version is 2.12.14):
>>>
>>> <dependency>
>>>   <groupId>org.elasticsearch</groupId>
>>>   <artifactId>elasticsearch-spark-30_2.12</artifactId>
>>>   <version>7.12.0</version>
>>>   <scope>provided</scope>
>>> </dependency>
>>>
>>> The code throws an error at this line:
>>> df.write.format("es").mode("overwrite").options(elasticOptions).save("index_name")
>>>
>>> The same code works with Spark 2.4.0 and the following dependency:
>>>
>>> <dependency>
>>>   <groupId>org.elasticsearch</groupId>
>>>   <artifactId>elasticsearch-spark-20_2.12</artifactId>
>>>   <version>7.12.0</version>
>>> </dependency>
>>>
>>>
>>> On Mon, 28 Aug 2023 at 12:17 AM, Holden Karau 
>>> wrote:
>>>
>>>> What’s the version of the ES connector you are using?
>>>>
>>>> On Sat, Aug 26, 2023 at 10:17 AM Dipayan Dev 
>>>> wrote:
>>>>
>>>>> Hi All,
>>>>>
>>>>> We're using Spark 2.4.x to write dataframes into an Elasticsearch
>>>>> index.
>>>>> As we upgrade to Spark 3.3.0, it throws the error:
>>>>> Caused by: java.lang.ClassNotFoundException: es.DefaultSource
>>>>> at java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:476)
>>>>> at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:589)
>>>>> at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:522)
>>>>>
>>>>> Looking at a few responses on Stack Overflow
>>>>> <https://stackoverflow.com/a/66452149>, it seems this is not yet
>>>>> supported by elasticsearch-hadoop.
>>>>>
>>>>> Does anyone have experience with this? Or faced/resolved this issue in
>>>>> Spark 3?
>>>>>
>>>>> Thanks in advance!
>>>>>
>>>>> Regards
>>>>> Dipayan
>>>>>
>>>> --
>>>> Twitter: https://twitter.com/holdenkarau
>>>> Books (Learning Spark, High Performance Spark, etc.):
>>>> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
>>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>>
>>>
>> CONFIDENTIALITY NOTICE: This electronic communication and any files
>> transmitted with it are confidential, privileged and intended solely for
>> the use of the individual or entity to whom they are addressed. If you are
>> not the intended recipient, you are hereby notified that any disclosure,
>> copying, distribution (electronic or otherwise) or forwarding of, or the
>> taking of any action in reliance on the contents of this transmission is
>> strictly prohibited. Please notify the sender immediately by e-mail if you
>> have received this email by mistake and delete this email from your system.
>>
>> Is it necessary to print this email? If you care about the environment
>> like we do, please refrain from printing emails. It helps to keep the
>> environment forested and litter-free.
>
>


Re: Elasticsearch support for Spark 3.x

2023-09-08 Thread Dipayan Dev
Let me try that and get back. Just wondering, is there a change in the way
we pass the format to the connector between Spark 2 and 3?


On Fri, 8 Sep 2023 at 12:35 PM, Alfie Davidson 
wrote:

> I am pretty certain you need to change the write.format from “es” to
> “org.elasticsearch.spark.sql”
>
> Sent from my iPhone
>
> On 8 Sep 2023, at 03:10, Dipayan Dev  wrote:
>
> 
>
> ++ Dev
>
> On Thu, 7 Sep 2023 at 10:22 PM, Dipayan Dev 
> wrote:
>
>> Hi,
>>
>> Can you please elaborate on your last response? I don’t have any external
>> dependencies added; I just updated the Spark version as mentioned below.
>>
>> Can someone help me with this?
>>
>> On Fri, 1 Sep 2023 at 5:58 PM, Koert Kuipers  wrote:
>>
>>> could the provided scope be the issue?
>>>
>>> On Sun, Aug 27, 2023 at 2:58 PM Dipayan Dev 
>>> wrote:
>>>
>>>> Using the following dependency for Spark 3 in the POM file (my Scala
>>>> version is 2.12.14):
>>>>
>>>> <dependency>
>>>>   <groupId>org.elasticsearch</groupId>
>>>>   <artifactId>elasticsearch-spark-30_2.12</artifactId>
>>>>   <version>7.12.0</version>
>>>>   <scope>provided</scope>
>>>> </dependency>
>>>>
>>>> The code throws an error at this line:
>>>> df.write.format("es").mode("overwrite").options(elasticOptions).save("index_name")
>>>>
>>>> The same code works with Spark 2.4.0 and the following dependency:
>>>>
>>>> <dependency>
>>>>   <groupId>org.elasticsearch</groupId>
>>>>   <artifactId>elasticsearch-spark-20_2.12</artifactId>
>>>>   <version>7.12.0</version>
>>>> </dependency>
>>>>
>>>>
>>>> On Mon, 28 Aug 2023 at 12:17 AM, Holden Karau 
>>>> wrote:
>>>>
>>>>> What’s the version of the ES connector you are using?
>>>>>
>>>>> On Sat, Aug 26, 2023 at 10:17 AM Dipayan Dev 
>>>>> wrote:
>>>>>
>>>>>> Hi All,
>>>>>>
>>>>>> We're using Spark 2.4.x to write dataframes into an Elasticsearch
>>>>>> index.
>>>>>> As we upgrade to Spark 3.3.0, it throws the error:
>>>>>> Caused by: java.lang.ClassNotFoundException: es.DefaultSource
>>>>>> at java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:476)
>>>>>> at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:589)
>>>>>> at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:522)
>>>>>>
>>>>>> Looking at a few responses on Stack Overflow
>>>>>> <https://stackoverflow.com/a/66452149>, it seems this is not yet
>>>>>> supported by elasticsearch-hadoop.
>>>>>>
>>>>>> Does anyone have experience with this? Or faced/resolved this issue
>>>>>> in Spark 3?
>>>>>>
>>>>>> Thanks in advance!
>>>>>>
>>>>>> Regards
>>>>>> Dipayan
>>>>>>
>>>>> --
>>>>> Twitter: https://twitter.com/holdenkarau
>>>>> Books (Learning Spark, High Performance Spark, etc.):
>>>>> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
>>>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>>>
>>>>
>>
>>


Re: Elasticsearch support for Spark 3.x

2023-09-08 Thread Dipayan Dev
@Alfie Davidson : Awesome, it worked with "org.elasticsearch.spark.sql".
But as soon as I switched to elasticsearch-spark-20_2.12, "es" also
worked.
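For the archives, a sketch of the working write path discussed in this thread. Assumptions: `df` is an existing DataFrame, the elasticsearch-spark-30_2.12 connector is on the classpath (compile scope, not provided), and the `elasticOptions` values shown are placeholders. Requires a Spark runtime and an Elasticsearch cluster.

```scala
// Connection settings are placeholders; replace with real cluster values.
val elasticOptions = Map(
  "es.nodes" -> "localhost", // assumed host
  "es.port"  -> "9200"       // assumed port
)

df.write
  .format("org.elasticsearch.spark.sql") // fully qualified source instead of "es"
  .mode("overwrite")
  .options(elasticOptions)
  .save("index_name")
```

The fully qualified format name avoids depending on the short-name registration that apparently differs between the Spark 2 and Spark 3 connector artifacts.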


On Fri, Sep 8, 2023 at 12:45 PM Dipayan Dev  wrote:

>
> Let me try that and get back. Just wondering, is there a change in the
> way we pass the format to the connector between Spark 2 and 3?
>
>
> On Fri, 8 Sep 2023 at 12:35 PM, Alfie Davidson 
> wrote:
>
>> I am pretty certain you need to change the write.format from “es” to
>> “org.elasticsearch.spark.sql”
>>
>> Sent from my iPhone
>>
>> On 8 Sep 2023, at 03:10, Dipayan Dev  wrote:
>>
>> 
>>
>> ++ Dev
>>
>> On Thu, 7 Sep 2023 at 10:22 PM, Dipayan Dev 
>> wrote:
>>
>>> Hi,
>>>
>>> Can you please elaborate on your last response? I don’t have any external
>>> dependencies added; I just updated the Spark version as mentioned below.
>>>
>>> Can someone help me with this?
>>>
>>> On Fri, 1 Sep 2023 at 5:58 PM, Koert Kuipers  wrote:
>>>
>>>> could the provided scope be the issue?
>>>>
>>>> On Sun, Aug 27, 2023 at 2:58 PM Dipayan Dev 
>>>> wrote:
>>>>
>>>>> Using the following dependency for Spark 3 in the POM file (my Scala
>>>>> version is 2.12.14):
>>>>>
>>>>> <dependency>
>>>>>   <groupId>org.elasticsearch</groupId>
>>>>>   <artifactId>elasticsearch-spark-30_2.12</artifactId>
>>>>>   <version>7.12.0</version>
>>>>>   <scope>provided</scope>
>>>>> </dependency>
>>>>>
>>>>> The code throws an error at this line:
>>>>> df.write.format("es").mode("overwrite").options(elasticOptions).save("index_name")
>>>>>
>>>>> The same code works with Spark 2.4.0 and the following dependency:
>>>>>
>>>>> <dependency>
>>>>>   <groupId>org.elasticsearch</groupId>
>>>>>   <artifactId>elasticsearch-spark-20_2.12</artifactId>
>>>>>   <version>7.12.0</version>
>>>>> </dependency>
>>>>>
>>>>>
>>>>> On Mon, 28 Aug 2023 at 12:17 AM, Holden Karau 
>>>>> wrote:
>>>>>
>>>>>> What’s the version of the ES connector you are using?
>>>>>>
>>>>>> On Sat, Aug 26, 2023 at 10:17 AM Dipayan Dev 
>>>>>> wrote:
>>>>>>
>>>>>>> Hi All,
>>>>>>>
>>>>>>> We're using Spark 2.4.x to write dataframes into an Elasticsearch
>>>>>>> index.
>>>>>>> As we upgrade to Spark 3.3.0, it throws the error:
>>>>>>> Caused by: java.lang.ClassNotFoundException: es.DefaultSource
>>>>>>> at java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:476)
>>>>>>> at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:589)
>>>>>>> at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:522)
>>>>>>>
>>>>>>> Looking at a few responses on Stack Overflow
>>>>>>> <https://stackoverflow.com/a/66452149>, it seems this is not yet
>>>>>>> supported by elasticsearch-hadoop.
>>>>>>>
>>>>>>> Does anyone have experience with this? Or faced/resolved this issue
>>>>>>> in Spark 3?
>>>>>>>
>>>>>>> Thanks in advance!
>>>>>>>
>>>>>>> Regards
>>>>>>> Dipayan
>>>>>>>
>>>>>> --
>>>>>> Twitter: https://twitter.com/holdenkarau
>>>>>> Books (Learning Spark, High Performance Spark, etc.):
>>>>>> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
>>>>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>>>>
>>>>>
>>>
>>>


[no subject]

2025-01-15 Thread Dipayan Dev
Unsubscribe




With Best Regards,

Dipayan Dev
Author of *Deep Learning with Hadoop
<https://www.amazon.com/Deep-Learning-Hadoop-Dipayan-Dev/dp/1787124762>*
M.Tech (AI), IISc, Bangalore