Hi,
We ran into an interesting issue that I wanted to share, as well as get
thoughts on whether anything should be done about it. We run our own Hadoop
cluster and recently deployed an Observer Namenode to take some burden off
of our Active Namenode. We mostly use Delta Lake as our format, and
everyth
performant
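(Editor's note: for readers unfamiliar with observer reads, clients opt in by switching the HDFS failover proxy provider. A minimal client-side sketch, assuming a nameservice ID of `mycluster` — the ID is a placeholder, and the auto-msync property is optional:)

```xml
<!-- hdfs-site.xml (client side) sketch; "mycluster" is a placeholder nameservice ID -->
<property>
  <name>dfs.client.failover.proxy.provider.mycluster</name>
  <value>org.apache.hadoop.hdfs.server.namenode.ha.ObserverReadProxyProvider</value>
</property>
<property>
  <!-- optional: periodically msync so observer reads see recent writes -->
  <name>dfs.client.failover.observer.auto-msync-period.mycluster</name>
  <value>500ms</value>
</property>
```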
> fix that can work out-of-the-box. We can hide the calls behind reflection
> to mitigate concerns around compatibility if needed. There is interest from
> our side in pursuing this work, and certainly we would be happy to
> collaborate if there is interest from you or others as
Thanks!
On Wed, Aug 18, 2021 at 10:52 AM Adam Binford wrote:
> Ahhh we don't do any RDD checkpointing but that makes sense too. Thanks
> for the tip on setting that on the driver only, I didn't know that was
> possible but it makes a lot of sense.
>
> I couldn't tell
's trying to deal with cloud storage quirks like
> nonatomic dir rename (GCS), slow list/file rename perf (everywhere), deep
> directory delete timeouts, and other cloud storage specific issues.
>
>
> Further reading on the commit problem in general
> https://github.com/stevelou
Test visualization partitioner:

    val zoomLevel = 2
    val newDf = VizPartitioner(spark.table("pixels"), zoomLevel, "pixel",
      new Envelope(0, 1000, 0, 1000))
So the main question is, is this a feature or a bug?
--
Adam Binford
t 10:43 AM Wenchen Fan wrote:
> Hi Adam,
>
> Thanks for reporting this issue! Do you have the full stacktrace or a code
> snippet to reproduce the issue at Spark side? It looks like a bug, but it's
> not obvious to me how this bug can happen.
>
> Thanks,
> Wenchen
>
Sorry, yeah, good question. It happens on the call to spark.table("pixels").
On Mon, Nov 1, 2021 at 12:36 PM Wenchen Fan wrote:
> To confirm: Does the error happen during view creation, or when we read
> the view later?
>
> On Mon, Nov 1, 2021 at 11:28 PM Adam Binford wrote:
> >>
>>>>>>>>>> >> Hi Xiao,
>>>>>>>>>> >>
>>>>>>>>>> >> For the following list:
>>>>>>>>>> >>
>>>>>>>>>> >> #35789 [SPARK-32268][SQL] Row-level Runtime Filtering
>>>>>>>>>> >> #34659 [SPARK-34863][SQL] Support complex types for Parquet
>>>>>>>>>> >> vectorized reader
>>>>>>>>>> >> #35848 [SPARK-38548][SQL] New SQL function: try_sum
>>>>>>>>>> >>
>>>>>>>>>> >> Do you mean we should include them, or exclude them from 3.3?
>>>>>>>>>> >>
>>>>>>>>>> >> Thanks,
>>>>>>>>>> >> Chao
>>>>>>>>>> >>
>>>>>>>>>> >> On Tue, Mar 15, 2022 at 9:56 AM Dongjoon Hyun <
>>>>>>>>>> dongjoon.h...@gmail.com> wrote:
>>>>>>>>>> >> >
>>>>>>>>>> >> > The following was tested and merged a few minutes ago. So,
>>>>>>>>>> we can remove it from the list.
>>>>>>>>>> >> >
>>>>>>>>>> >> > #35819 [SPARK-38524][SPARK-38553][K8S] Bump Volcano to v1.5.1
>>>>>>>>>> >> >
>>>>>>>>>> >> > Thanks,
>>>>>>>>>> >> > Dongjoon.
>>>>>>>>>> >> >
>>>>>>>>>> >> > On Tue, Mar 15, 2022 at 9:48 AM Xiao Li <
>>>>>>>>>> gatorsm...@gmail.com> wrote:
>>>>>>>>>> >> >>
>>>>>>>>>> >> >> Let me clarify my above suggestion. Maybe we can wait 3 more
>>>>>>>>>> >> >> days to collect the list of actively developed PRs that we
>>>>>>>>>> >> >> want to merge to 3.3 after the branch cut?
>>>>>>>>>> >> >>
>>>>>>>>>> >> >> Please do not rush to merge the PRs that are not fully
>>>>>>>>>> >> >> reviewed. We can cut the branch this Friday and continue
>>>>>>>>>> >> >> merging the PRs that have been discussed in this thread. Does
>>>>>>>>>> >> >> that make sense?
>>>>>>>>>> >> >>
>>>>>>>>>> >> >> Xiao
>>>>>>>>>> >> >>
>>>>>>>>>> >> >>
>>>>>>>>>> >> >>
>>>>>>>>>> >> >> On Tue, Mar 15, 2022 at 9:10 AM Holden Karau wrote:
>>>>>>>>>> >> >>>
>>>>>>>>>> >> >>> May I suggest we push out one week (22nd) just to give
>>>>>>>>>> >> >>> everyone a bit of breathing space? Rushed software
>>>>>>>>>> >> >>> development more often results in bugs.
>>>>>>>>>> >> >>>
>>>>>>>>>> >> >>> On Tue, Mar 15, 2022 at 6:23 AM Yikun Jiang <
>>>>>>>>>> yikunk...@gmail.com> wrote:
>>>>>>>>>> >> >>>>
>>>>>>>>>> >> >>>> > To make our release time more predictable, let us collect
>>>>>>>>>> >> >>>> > the PRs and wait three more days before the branch cut?
>>>>>>>>>> >> >>>>
>>>>>>>>>> >> >>>> For SPIP: Support Customized Kubernetes Schedulers:
>>>>>>>>>> >> >>>> #35819 [SPARK-38524][SPARK-38553][K8S] Bump Volcano to
>>>>>>>>>> v1.5.1
>>>>>>>>>> >> >>>>
>>>>>>>>>> >> >>>> Three more days are OK for this from my view.
>>>>>>>>>> >> >>>>
>>>>>>>>>> >> >>>> Regards,
>>>>>>>>>> >> >>>> Yikun
>>>>>>>>>> >> >>>
>>>>>>>>>> >> >>> --
>>>>>>>>>> >> >>> Twitter: https://twitter.com/holdenkarau
>>>>>>>>>> >> >>> Books (Learning Spark, High Performance Spark, etc.):
>>>>>>>>>> https://amzn.to/2MaRAG9
>>>>>>>>>> >> >>> YouTube Live Streams:
>>>>>>>>>> https://www.youtube.com/user/holdenkarau
>>>>>>>>>>
>>>>>>>>>
--
Adam Binford
triage. Extremely important bug
>> fixes, documentation, and API tweaks that impact compatibility should
>> be worked on immediately. Everything else please retarget to an
>> appropriate release.
>>
>> ==
>> But my bug isn't fixed?
>> ==
>> In order to make timely releases, we will typically not hold the
>> release unless the bug in question is a regression from the previous
>> release. That being said, if there is something which is a regression
>> that has not been correctly targeted please ping me or a committer to
>> help target the issue.
>>
>> Maxim Gekk
>>
>> Software Engineer
>>
>> Databricks, Inc.
>>
>
--
Adam Binford
oogle.com/document/d/1gGySrLGvIK8bajKdGjTI_mDqk0-YPvHmPN64YjoWfOQ/edit?usp=sharing
>
> Please take a look and let me know if I missed any major changes or
> something.
>
> Maxim Gekk
>
> Software Engineer
>
> Databricks, Inc.
>
--
Adam Binford
browse/SPARK-37618> seems
> like a bug fix, that's why I didn't put it in the doc.
>
> Maxim Gekk
>
> Software Engineer
>
> Databricks, Inc.
>
>
> On Fri, Jun 3, 2022 at 2:20 PM Adam Binford wrote:
>
>> I don't think I see https://issues.
y. Thanks to
>> the behavior of Trigger.AvailableNow, it handles the no-data batch as well
>> before the query terminates.
>>
>> Please review and let us know if you have any feedback or concerns on the
>> proposal.
>>
>> Thanks!
>> Jungtaek Lim
>>
>> 1. https://issues.apache.org/jira/browse/SPARK-36533
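(Editor's note: a minimal sketch of the trigger under discussion — the rate source and console sink below are placeholders, not part of the proposal:)

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder().master("local[*]").getOrCreate()

val query = spark.readStream
  .format("rate")                    // placeholder source
  .load()
  .writeStream
  .format("console")                 // placeholder sink
  .trigger(Trigger.AvailableNow())   // drain all available data in multiple micro-batches, then stop
  .start()

query.awaitTermination()             // returns once the backlog is processed
```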
>>
>
--
Adam Binford
On Fri, Jul 8, 2022 at 9:16 AM Jungtaek Lim wrote:
> Thanks for the input, Adam! Replying inline.
>
> On Fri, Jul 8, 2022 at 8:48 PM Adam Binford wrote:
>
>> We use Trigger.Once a lot, usually for backfilling data for new streams.
>> I feel like I could see a continuing use case
support subexpression elimination in ProjectExec and
>> AggregateExec. We can improve the subexpression elimination framework
>> to support more physical operators.
>>
>>
>> SPIP Jira: https://issues.apache.org/jira/browse/SPARK-42551
>>
>> SPIP Doc:
>> https://docs.google.com/document/d/165cv7hRvkFvuUHlnbapWvamxcxcn9FJj/edit?usp=sharing&ouid=107277827304520252190&rtpof=true&sd=true
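(Editor's note: to illustrate the kind of duplication the SPIP targets, a hypothetical projection — `expensiveUdf` and `df` are assumed names, not from the SPIP:)

```scala
import org.apache.spark.sql.functions.col

// expensiveUdf(col("x")) appears twice; without subexpression elimination in
// ProjectExec it would be evaluated twice per row.
val out = df.select(
  (expensiveUdf(col("x")) + 1).as("a"),
  (expensiveUdf(col("x")) * 2).as("b")
)
```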
>>
>>
>> Thank you
>>
>>
--
Adam Binford
ic support.
>
> We need your thoughts on whether PySpark should support Arrow on a par
> with Pandas, or not: https://github.com/apache/spark/pull/38624
> Cheers,
> Enrico
>
>
--
Adam Binford
ctions), but if it's not the case, you can configure authenticating
> proxies for the gRPC HTTP/2 interface used by Spark Connect.
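(Editor's note: for reference, a client connects to that gRPC endpoint via a connection string; a minimal Scala sketch where the host, port, and token are placeholders:)

```scala
import org.apache.spark.sql.SparkSession

// "sc://" is the Spark Connect scheme; 15002 is the default gRPC port.
// The token parameter is only needed if the endpoint enforces auth.
val spark = SparkSession.builder()
  .remote("sc://connect.example.com:15002/;token=MY_TOKEN")
  .getOrCreate()

spark.range(5).count()
```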
>
> On Wed, Feb 5, 2025 at 8:14 PM Adam Binford wrote:
>
>> Long time Spark on YARN user with some maybe dumb questions but I'm
>>
n9cfmjfb0vwx4xnrq>,
>>>>> I'd like to start the vote for the proposal "Publish additional Spark
>>>>> distribution with Spark Connect enabled".
>>>>>
>>>>> Please vote for the next 72 hours:
>>>>>
>>>>> [ ] +1: Accept the proposal
>>>>> [ ] +0
>>>>> [ ] -1: I don’t think this is a good idea because …
>>>>>
>>>>> Best,
>>>>> Wenchen Fan
>>>>>
>>>>
--
Adam Binford
>>
> >>>>>>> Best regards,
> >>>>>>> Dongjoon
> >>>>>>>
> >>>>>>> On Mon, Feb 3, 2025 at 23:31 Wenchen Fan
> wrote:
> >>>>>>>>
> >>>>>>>> Hi all,
>> >
>> >> > On Tue, Mar 11, 2025 at 7:13 AM Andrew Melo
>> wrote:
>> >> >>
>> >> >> Hello all
>> >> >>
>> >> >> As an outsider, I don't fully understand this discussion. This
>> >> >> particular configuration option "leaked" into the open-source Spark
>> >> >> distribution, and now there is a lot of discussion about how to
>> >> >> mitigate existing workloads. But: presumably the people who are
>> >> >> depending on this configuration flag are already using a downstream
>> >> >> (vendor-specific) fork, and a future update will similarly be
>> >> >> distributed by that downstream provider.
>> >> >>
>> >> >> Which people a) made a workflow using the vendor fork and b) want to
>> >> >> resume it in the OSS version of spark?
>> >> >>
>> >> >> It seems like the people who are affected by this will already be
>> >> >> using someone else's fork, and there's no need to carry this patch
>> >> >> in the mainline Spark code.
>> >> >>
>> >> >> For that reason, I believe the code should be dropped by OSS Spark,
>> >> >> and vendors who need to mitigate it can push the appropriate changes
>> >> >> to their downstreams.
>> >> >>
>> >> >> Thanks
>> >> >> Andrew
>>
>
--
Adam Binford
3/07 09:15:39 Jungtaek Lim wrote:
> > I'll need to start a VOTE to move this forward.
> >
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>
--
Adam Binford
> > >>> > > > having to be upgraded with Spark 3.5.5+ in prior". If the vote
> > >>> > > > passes, we will help users to have a smooth upgrade from Spark
> > >>> > > > 3.5.4 to Spark 4.0.x, which would be almost 1 year.
> > >>> > > >
> > >>> > > > The (only) con of this option is having to retain the incorrect
> > >>> > > > configuration name as a string in the codebase a bit longer. The
> > >>> > > > code complexity of the migration logic is arguably trivial. (link
> > >>> > > > <https://github.com/apache/spark/blob/4231d58245251a34ae80a38ea4bbf7d720caa439/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/OffsetSeq.scala#L174-L183>)
> > >>> > > >
> > >>> > > > This VOTE is for Spark 4.0.x, but if someone supports keeping the
> > >>> > > > migration logic longer than Spark 4.0.x, please cast +1 here and
> > >>> > > > leave the desired last minor version of Spark to retain this
> > >>> > > > migration logic.
> > >>> > > >
> > >>> > > > The vote is open for the next 72 hours and passes if a majority
> > >>> > > > of +1 PMC votes are cast, with a minimum of 3 +1 votes.
> > >>> > > >
> > >>> > > > [ ] +1 Retain migration logic of incorrect `spark.databricks.*`
> > >>> > > > configuration in Spark 4.0.x
> > >>> > > > [ ] -1 Remove migration logic of incorrect `spark.databricks.*`
> > >>> > > > configuration in Spark 4.0.0 because...
> > >>> > > >
> > >>> > > > Thanks!
> > >>> > > > Jungtaek Lim (HeartSaVioR)
> > >>> > > >
> > >>> > >
> > >>> >
> > >>> >
> > >>> >
> > >>> >
> > >>>
> > >>>
> > >>>
> >
>
>
>
--
Adam Binford
>>>>> The documentation corresponding to this release can be found at:
>>>>> https://dist.apache.org/repos/dist/dev/spark/v4.0.0-rc1-docs/
>>>>>
>>>>> The list of bug fixes going into 4.0.0 can be found at the following
>>>>> URL:
>>>>> https://issues.apache.org/jira/projects/SPARK/versions/12353359
>>>>>
>>>>> This release is using the release script of the tag v4.0.0-rc1.
>>>>>
>>>>> FAQ
>>>>>
>>>>> =
>>>>> How can I help test this release?
>>>>> =
>>>>>
>>>>> If you are a Spark user, you can help us test this release by taking
>>>>> an existing Spark workload and running on this release candidate, then
>>>>> reporting any regressions.
>>>>>
>>>>> If you're working in PySpark you can set up a virtual env and install
>>>>> the current RC and see if anything important breaks; in the Java/Scala
>>>>> case you can add the staging repository to your project's resolvers and
>>>>> test with the RC (make sure to clean up the artifact cache before/after
>>>>> so you don't end up building with an out-of-date RC going forward).
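(Editor's note: for the Java/Scala case above, the resolver addition looks roughly like this in sbt — the staging repository ID `orgapachespark-XXXX` is a placeholder that comes from the vote email:)

```scala
// build.sbt sketch
resolvers += "Apache Spark staging" at
  "https://repository.apache.org/content/repositories/orgapachespark-XXXX/"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "4.0.0"
```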
>>>>>
>>>>
--
Adam Binford
>>>>> >
>>>>>> > FAQ
>>>>>> >
>>>>>> > =
>>>>>> > How can I help test this release?
>>>>>> > =
>>>>>> >
>>>>>> > If you are a Spark user, you can help us test this release by taking
>>>>>> > an existing Spark workload and running on this release candidate,
>>>>>> then
>>>>>> > reporting any regressions.
>>>>>> >
>>>>>> > If you're working in PySpark you can set up a virtual env and install
>>>>>> > the current RC and see if anything important breaks; in the Java/Scala
>>>>>> > case you can add the staging repository to your project's resolvers
>>>>>> > and test with the RC (make sure to clean up the artifact cache
>>>>>> > before/after so you don't end up building with an out-of-date RC going
>>>>>> > forward).
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>> Twitter: https://twitter.com/holdenkarau
>>>>> Fight Health Insurance: https://www.fighthealthinsurance.com/
>>>>> Books (Learning Spark, High Performance Spark, etc.):
>>>>> https://amzn.to/2MaRAG9
>>>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>>> Pronouns: she/her
>>>>>
>>>>
--
Adam Binford
Yep looks good to me.
Adam Binford
On Sat, May 17, 2025, 2:59 AM Wenchen Fan wrote:
> Thanks all! The two reported issues are both fixed, and I'll cut the next
> RC on my Monday next week.
>
> On Sat, May 17, 2025 at 4:17 AM Jungtaek Lim
> wrote:
>
>> UPDATE: The