Observer Namenode and Committer Algorithm V1

2021-08-17 Thread Adam Binford
Hi, We ran into an interesting issue that I wanted to share, as well as get thoughts on whether anything should be done about this. We run our own Hadoop cluster and recently deployed an Observer Namenode to take some burden off our Active Namenode. We mostly use Delta Lake as our format, and everyth…
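For context, "committer algorithm V1" refers to Hadoop's FileOutputCommitter algorithm version 1, which commits a job by renaming each task's committed output on the driver. A minimal sketch of how that version is selected from Spark follows; the app name and output path are placeholders, and the config key is the standard Hadoop one passed through `spark.hadoop.*`.

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch: pin the Hadoop FileOutputCommitter to algorithm version 1.
// The app name and output path are placeholders, not values from the thread.
val spark = SparkSession.builder()
  .appName("committer-v1-sketch")
  // v1: tasks rename output into a per-job pending directory, and the driver
  // renames each committed task directory into the destination at job commit.
  // v2: task commit renames files directly into the destination.
  .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "1")
  .getOrCreate()

// The job-commit renames and listings are NameNode metadata operations, which
// is where the Observer vs. Active NameNode routing discussed here comes in.
spark.range(1000).write.mode("overwrite").parquet("hdfs:///tmp/committer-v1-sketch")
```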

Re: Observer Namenode and Committer Algorithm V1

2021-08-18 Thread Adam Binford
> …performant fix that can work out-of-the-box. We can hide the calls behind
> reflection to mitigate concerns around compatibility if needed. There is
> interest from our side in pursuing this work, and certainly we would be
> happy to collaborate if there is interest from you or others as…
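The "hide the calls behind reflection" idea can be illustrated with a small sketch. This assumes the call in question is FileSystem#msync() (the HDFS client call that forces a sync with the Active NameNode, only present in newer Hadoop releases); that assumption is mine, not stated in the excerpt.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem

// Sketch only: invoke FileSystem#msync() reflectively so the code still
// compiles and runs against Hadoop versions that predate the method.
// Whether msync() is the exact call the thread refers to is an assumption.
def msyncIfAvailable(fs: FileSystem): Unit = {
  try {
    fs.getClass.getMethod("msync").invoke(fs)
  } catch {
    case _: NoSuchMethodException =>        // older Hadoop client: method absent
    case _: ReflectiveOperationException => // msync unsupported by this FileSystem
  }
}

val fs = FileSystem.get(new Configuration())
msyncIfAvailable(fs)
```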

Re: Observer Namenode and Committer Algorithm V1

2021-08-20 Thread Adam Binford
…Thanks!

On Wed, Aug 18, 2021 at 10:52 AM Adam Binford wrote:
> Ahhh, we don't do any RDD checkpointing but that makes sense too. Thanks
> for the tip on setting that on the driver only; I didn't know that was
> possible, but it makes a lot of sense.
>
> I couldn't tell…
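Since the excerpt mentions RDD checkpointing as another source of NameNode traffic, here is a minimal background sketch of what that looks like; the checkpoint directory is a placeholder.

```scala
import org.apache.spark.sql.SparkSession

// Background sketch of RDD checkpointing (mentioned in the reply); the
// checkpoint directory is a placeholder. Checkpoint files are written to and
// later read from HDFS, so this is another path that hits the NameNode.
val spark = SparkSession.builder().appName("checkpoint-sketch").getOrCreate()
spark.sparkContext.setCheckpointDir("hdfs:///tmp/spark-checkpoints")

val rdd = spark.sparkContext.parallelize(1 to 1000).map(_ * 2)
rdd.checkpoint()   // mark the RDD for checkpointing
rdd.count()        // first action materializes the RDD and writes the checkpoint
```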

Re: Observer Namenode and Committer Algorithm V1

2021-09-06 Thread Adam Binford
> …'s trying to deal with cloud storage quirks like nonatomic dir rename
> (GCS), slow list/file rename perf (everywhere), deep directory delete
> timeouts, and other cloud storage specific issues.
>
> Further reading on the commit problem in general:
> https://github.com/stevelou…
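As a rough illustration (not from the thread) of the cloud committers the quoted message alludes to, this is approximately how a job opts into the Hadoop S3A committers from Spark, assuming the spark-hadoop-cloud module is on the classpath; exact keys and defaults vary by Spark/Hadoop version, and the bucket name is a placeholder.

```scala
import org.apache.spark.sql.SparkSession

// Rough illustration of opting into the S3A committers; settings follow the
// Spark cloud-integration docs but may need adjusting per version, and the
// bucket name is a placeholder. Requires the spark-hadoop-cloud module.
val spark = SparkSession.builder()
  .appName("s3a-committer-sketch")
  .config("spark.hadoop.fs.s3a.committer.name", "magic")
  .config("spark.sql.sources.commitProtocolClass",
    "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
  .config("spark.sql.parquet.output.committer.class",
    "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")
  .getOrCreate()

spark.range(100).write.mode("overwrite").parquet("s3a://example-bucket/demo")
```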

Issue Upgrading to 3.2

2021-10-29 Thread Adam Binford
Test visualization partitioner:

    val zoomLevel = 2
    val newDf = VizPartitioner(spark.table("pixels"), zoomLevel, "pixel",
      new Envelope(0, 1000, 0, 1000))

So the main question is: is this a feature or a bug?

--
Adam Binford

Re: Issue Upgrading to 3.2

2021-11-01 Thread Adam Binford
…at 10:43 AM Wenchen Fan wrote:
> Hi Adam,
>
> Thanks for reporting this issue! Do you have the full stacktrace or a code
> snippet to reproduce the issue on the Spark side? It looks like a bug, but
> it's not obvious to me how this bug can happen.
>
> Thanks,
> Wenchen

Re: Issue Upgrading to 3.2

2021-11-01 Thread Adam Binford
Sorry, yeah, good question. It happens on the call to spark.table("pixels").

On Mon, Nov 1, 2021 at 12:36 PM Wenchen Fan wrote:
> To confirm: does the error happen during view creation, or when we read
> the view later?
>
> On Mon, Nov 1, 2021 at 11:28 PM Adam Binford wrot…
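For readers following along, a stripped-down sketch of the failing pattern (not the exact VizPartitioner reproduction): "pixels" is assumed here to be a temporary view that is then read back by name, which is the call where the error surfaced after the 3.2 upgrade.

```scala
import org.apache.spark.sql.SparkSession

// Stripped-down sketch of the failing pattern, not the exact reproduction:
// register a view named "pixels" (assumed temporary here) and read it back.
val spark = SparkSession.builder().master("local[*]").appName("pixels-sketch").getOrCreate()

spark.range(0, 1000)
  .selectExpr("id", "cast(id % 256 as int) as pixel")
  .createOrReplaceTempView("pixels")

val pixels = spark.table("pixels")   // the call where the error surfaced
pixels.show(5)
```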

Re: Apache Spark 3.3 Release

2022-03-16 Thread Adam Binford
> Hi Xiao,
>
> For the following list:
>
> #35789 [SPARK-32268][SQL] Row-level Runtime Filtering
> #34659 [SPARK-34863][SQL] Support complex types for Parquet vectorized reader
> #35848 [SPARK-38548][SQL] New SQL function: try_sum
>
> Do you mean we should include them, or exclude them from 3.3?
>
> Thanks,
> Chao
>
> On Tue, Mar 15, 2022 at 9:56 AM Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
>> The following was tested and merged a few minutes ago. So, we can remove
>> it from the list.
>>
>> #35819 [SPARK-38524][SPARK-38553][K8S] Bump Volcano to v1.5.1
>>
>> Thanks,
>> Dongjoon.
>>
>> On Tue, Mar 15, 2022 at 9:48 AM Xiao Li <gatorsm...@gmail.com> wrote:
>>> Let me clarify my above suggestion. Maybe we can wait 3 more days to
>>> collect the list of actively developed PRs that we want to merge to 3.3
>>> after the branch cut?
>>>
>>> Please do not rush to merge the PRs that are not fully reviewed. We can
>>> cut the branch this Friday and continue merging the PRs that have been
>>> discussed in this thread. Does that make sense?
>>>
>>> Xiao
>>>
>>> On Tue, Mar 15, 2022 at 09:10, Holden Karau wrote:
>>>> May I suggest we push out one week (22nd) just to give everyone a bit of
>>>> breathing space? Rushed software development more often results in bugs.
>>>>
>>>> On Tue, Mar 15, 2022 at 6:23 AM Yikun Jiang <yikunk...@gmail.com> wrote:
>>>>> > To make our release time more predictable, let us collect the PRs and
>>>>> > wait three more days before the branch cut?
>>>>>
>>>>> For SPIP: Support Customized Kubernetes Schedulers:
>>>>> #35819 [SPARK-38524][SPARK-38553][K8S] Bump Volcano to v1.5.1
>>>>>
>>>>> Three more days are OK for this from my view.
>>>>>
>>>>> Regards,
>>>>> Yikun
>>>>
>>>> --
>>>> Twitter: https://twitter.com/holdenkarau
>>>> Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
>>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau

--
Adam Binford

Re: [VOTE] Release Spark 3.3.0 (RC1)

2022-05-05 Thread Adam Binford
> …triage. Extremely important bug fixes, documentation, and API tweaks that
> impact compatibility should be worked on immediately. Everything else
> please retarget to an appropriate release.
>
> ==========
> But my bug isn't fixed?
> ==========
>
> In order to make timely releases, we will typically not hold the release
> unless the bug in question is a regression from the previous release. That
> being said, if there is something which is a regression that has not been
> correctly targeted please ping me or a committer to help target the issue.
>
> Maxim Gekk
> Software Engineer
> Databricks, Inc.

--
Adam Binford

Re: The draft of the Spark 3.3.0 release notes

2022-06-03 Thread Adam Binford
> …oogle.com/document/d/1gGySrLGvIK8bajKdGjTI_mDqk0-YPvHmPN64YjoWfOQ/edit?usp=sharing
>
> Please take a look and let me know if I missed any major changes or
> something.
>
> Maxim Gekk
> Software Engineer
> Databricks, Inc.

--
Adam Binford

Re: The draft of the Spark 3.3.0 release notes

2022-06-03 Thread Adam Binford
> …browse/SPARK-37618 seems like a bug fix, that's why I didn't put it in
> the doc.
>
> Maxim Gekk
> Software Engineer
> Databricks, Inc.
>
> On Fri, Jun 3, 2022 at 2:20 PM Adam Binford wrote:
>> I don't think I see https://issues.…

Re: [DISCUSS] Deprecate Trigger.Once and promote Trigger.AvailableNow

2022-07-08 Thread Adam Binford
> …y. Thanks to the behavior of Trigger.AvailableNow, it handles no-data
> batch as well before termination of the query.
>
> Please review and let us know if you have any feedback or concerns on the
> proposal.
>
> Thanks!
> Jungtaek Lim
>
> 1. https://issues.apache.org/jira/browse/SPARK-36533

--
Adam Binford
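To make the comparison concrete, a short sketch of the two triggers side by side using a file source; the input, output, and checkpoint paths are placeholders, not values from the thread.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

// Sketch contrasting the two triggers; all paths are placeholders.
val spark = SparkSession.builder().getOrCreate()
val input = spark.readStream.format("text").load("/tmp/stream-input")

// Trigger.Once: everything available is processed in a single micro-batch,
// ignoring source rate limits, and the query then stops.
input.writeStream
  .format("parquet")
  .option("checkpointLocation", "/tmp/ckpt-once")
  .trigger(Trigger.Once())
  .start("/tmp/out-once")

// Trigger.AvailableNow (Spark 3.3+): all available data is processed, but in
// multiple micro-batches that respect rate limits (e.g. maxFilesPerTrigger),
// and a no-data batch runs before the query terminates.
input.writeStream
  .format("parquet")
  .option("checkpointLocation", "/tmp/ckpt-available-now")
  .trigger(Trigger.AvailableNow())
  .start("/tmp/out-available-now")
```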

Re: [DISCUSS] Deprecate Trigger.Once and promote Trigger.AvailableNow

2022-07-08 Thread Adam Binford
…Fri, Jul 8, 2022 at 9:16 AM Jungtaek Lim wrote:
> Thanks for the input, Adam! Replying inline.
>
> On Fri, Jul 8, 2022 at 8:48 PM Adam Binford wrote:
>> We use Trigger.Once a lot, usually for backfilling data for new streams.
>> I feel like I could see a continuing use case…

Re: Re: [DISCUSS][SPIP] Subexpression elimination supporting more physical operators

2023-03-08 Thread Adam Binford
> …support subexpression elimination in ProjectExec and AggregateExec. We
> can improve the subexpression elimination framework to support more
> physical operators.
>
> SPIP Jira: https://issues.apache.org/jira/browse/SPARK-42551
>
> SPIP Doc:
> https://docs.google.com/document/d/165cv7hRvkFvuUHlnbapWvamxcxcn9FJj/edit?usp=sharing&ouid=107277827304520252190&rtpof=true&sd=true
>
> Thank you

--
Adam Binford
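As a rough illustration of what subexpression elimination buys (my own example, not from the SPIP): the same costly expression appears several times in one projection, and the existing ProjectExec-level elimination lets it be evaluated once per row; the SPIP proposes extending this to more physical operators.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

// Illustrative example (not from the SPIP): the same costly expression is
// used three times in one projection; subexpression elimination lets the
// ProjectExec evaluate it once per row instead of three times.
val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val costly = udf((s: String) => { Thread.sleep(1); s.length })  // stand-in for an expensive function

val df = Seq("a", "bb", "ccc").toDF("s").select(
  costly($"s").as("len"),
  (costly($"s") + 1).as("len_plus_one"),
  (costly($"s") * 2).as("len_times_two"))

df.explain()  // inspect the physical plan
df.show()
```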

Re: On adding applyInArrow to groupBy and cogroup

2023-10-28 Thread Adam Binford
> …ic support.
>
> We need your thoughts on whether PySpark should support Arrow on a par
> with Pandas, or not: https://github.com/apache/spark/pull/38624
>
> Cheers,
> Enrico

--
Adam Binford

Re: [DISCUSS] Publish additional Spark distribution with Spark Connect enabled

2025-02-05 Thread Adam Binford
> …ctions), but if it's not the case, you can configure authenticating
> proxies for the gRPC HTTP/2 interface used by Spark Connect.
>
> On Wed, Feb 5, 2025 at 8:14 PM Adam Binford wrote:
>> Long time Spark on YARN user with some maybe dumb questions but I'm…
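For readers unfamiliar with the client side being discussed, a minimal sketch of connecting to a Spark Connect server over that gRPC interface from the Scala client; the host is a placeholder and 15002 is the default Spark Connect port.

```scala
// Minimal sketch of a Spark Connect client session; requires the
// spark-connect-client-jvm artifact. The host is a placeholder and 15002 is
// the default Spark Connect port.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .remote("sc://connect-server.example.com:15002")
  .getOrCreate()

spark.range(10).count()   // executed on the remote Spark Connect server
```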

Re: [VOTE] Publish additional Spark distribution with Spark Connect enabled

2025-02-05 Thread Adam Binford
> …n9cfmjfb0vwx4xnrq>, I'd like to start the vote for the proposal "Publish
> additional Spark distribution with Spark Connect enabled".
>
> Please vote for the next 72 hours:
>
> [ ] +1: Accept the proposal
> [ ] +0
> [ ] -1: I don't think this is a good idea because …
>
> Best,
> Wenchen Fan

--
Adam Binford

Re: [DISCUSS] Publish additional Spark distribution with Spark Connect enabled

2025-02-05 Thread Adam Binford
> …
> Best regards,
> Dongjoon
>
> On Mon, Feb 3, 2025 at 23:31 Wenchen Fan wrote:
>> Hi a…

Re: [DISCUSS] Handling spark.databricks.* config being exposed in 3.5.4 in Spark 4.0.0+

2025-03-10 Thread Adam Binford
> On Tue, Mar 11, 2025 at 7:13 AM Andrew Melo wrote:
>> Hello all
>>
>> As an outsider, I don't fully understand this discussion. This particular
>> configuration option "leaked" into the open-source Spark distribution, and
>> now there is a lot of discussion about how to mitigate existing workloads.
>> But: presumably the people who are depending on this configuration flag
>> are already using a downstream (vendor-specific) fork, and a future update
>> will similarly be distributed by that downstream provider.
>>
>> Which people a) made a workflow using the vendor fork and b) want to
>> resume it in the OSS version of Spark?
>>
>> It seems like the people who are affected by this will already be using
>> someone else's fork, and there's no need to carry this patch in the
>> mainline Spark code.
>>
>> For that reason, I believe the code should be dropped by OSS Spark, and
>> vendors who need to mitigate it can push the appropriate changes to their
>> downstreams.
>>
>> Thanks
>> Andrew

--
Adam Binford

Re: [DISCUSS] Handling spark.databricks.* config being exposed in 3.5.4 in Spark 4.0.0+

2025-03-10 Thread Adam Binford
…3/07 09:15:39 Jungtaek Lim wrote:
> I'll need to start VOTE to move this forward.

--
Adam Binford

Re: [VOTE] Retain migration logic of incorrect `spark.databricks.*` configuration in Spark 4.0.x

2025-03-11 Thread Adam Binford
> …having to be upgraded with Spark 3.5.5+ in prior". If the vote passes, we
> will help users to have a smooth upgrade from Spark 3.5.4 to Spark 4.0.x,
> which would be almost 1 year.
>
> The (only) con in this option is having to retain the incorrect
> configuration name as a "string" in the codebase a bit longer. The code
> complexity of the migration logic is arguably trivial. (link:
> https://github.com/apache/spark/blob/4231d58245251a34ae80a38ea4bbf7d720caa439/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/OffsetSeq.scala#L174-L183)
>
> This VOTE is for Spark 4.0.x, but if someone supports including the
> migration logic for longer than Spark 4.0.x, please cast +1 here and leave
> the desired last minor version of Spark to retain this migration logic.
>
> The vote is open for the next 72 hours and passes if a majority of +1 PMC
> votes are cast, with a minimum of 3 +1 votes.
>
> [ ] +1 Retain migration logic of incorrect `spark.databricks.*`
>     configuration in Spark 4.0.x
> [ ] -1 Remove migration logic of incorrect `spark.databricks.*`
>     configuration in Spark 4.0.0 because...
>
> Thanks!
> Jungtaek Lim (HeartSaVioR)

--
Adam Binford

Re: [VOTE] Release Spark 4.0.0 (RC1)

2025-02-21 Thread Adam Binford
> …nding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v4.0.0-rc1-docs/
>
> The list of bug fixes going into 4.0.0 can be found at the following URL:
> https://issues.apache.org/jira/projects/SPARK/versions/12353359
>
> This release is using the release script of the tag v4.0.0-rc1.
>
> FAQ
>
> =========
> How can I help test this release?
> =========
>
> If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running it on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install the
> current RC and see if anything important breaks; in Java/Scala you can add
> the staging repository to your project's resolvers and test with the RC
> (make sure to clean up the artifact cache before/after so you don't end up
> building with an out-of-date RC going forward).

--
Adam Binford
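The "add the staging repository to your project's resolvers" step from the FAQ could look roughly like this in sbt; the staging repository id (orgapachespark-XXXX) is a placeholder that would come from the RC announcement, not from this excerpt.

```scala
// build.sbt sketch for testing against a staged RC; the repository id
// "orgapachespark-XXXX" is a placeholder from the RC announcement, and RC
// artifacts are staged under the final version number (here 4.0.0).
ThisBuild / resolvers +=
  "Apache Spark RC staging" at
    "https://repository.apache.org/content/repositories/orgapachespark-XXXX/"

libraryDependencies += "org.apache.spark" %% "spark-sql" % "4.0.0" % Provided
```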

Re: [VOTE] Release Spark 4.0.0 (RC6)

2025-05-16 Thread Adam Binford
>> FAQ
>>
>> =========
>> How can I help test this release?
>> =========
>>
>> If you are a Spark user, you can help us test this release by taking an
>> existing Spark workload and running it on this release candidate, then
>> reporting any regressions.
>>
>> If you're working in PySpark you can set up a virtual env and install the
>> current RC and see if anything important breaks; in Java/Scala you can add
>> the staging repository to your project's resolvers and test with the RC
>> (make sure to clean up the artifact cache before/after so you don't end up
>> building with an out-of-date RC going forward).
>
> --
> Twitter: https://twitter.com/holdenkarau
> Fight Health Insurance: https://www.fighthealthinsurance.com/
> Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
> Pronouns: she/her

--
Adam Binford

Re: [VOTE] Release Spark 4.0.0 (RC6)

2025-05-17 Thread Adam Binford
Yep looks good to me.

Adam Binford

On Sat, May 17, 2025, 2:59 AM Wenchen Fan wrote:
> Thanks all! The two reported issues are both fixed, and I'll cut the next
> RC on my Monday next week.
>
> On Sat, May 17, 2025 at 4:17 AM Jungtaek Lim wrote:
>> UPDATE: The…