Auto-closing PRs or How to get reviewers' attention

2021-02-18 Thread Enrico Minack
Hi Spark Developers, I have a fundamental question on the process of contributing to Apache Spark from outside the circle of committers. I have gone through a number of pull requests and I always found it hard to get feedback, especially from committers. I understand there is a very high com

Re: Auto-closing PRs or How to get reviewers' attention

2021-02-23 Thread Enrico Minack
On 18.02.21 at 16:34, Sean Owen wrote: One other aspect is that a committer is taking some degree of responsibility for merging a change, so the ask is more than just a few minutes of eyeballing. If it breaks something, the merger pretty much owns resolving it, and the whole project owns any c

Observable Metrics on Spark Datasets

2021-03-15 Thread Enrico Minack
Hi Spark-Devs, the observable metrics that have been added to the Dataset API in 3.0.0 are a great improvement over the Accumulator APIs, which seem to have much weaker guarantees. I have two questions regarding follow-up contributions: 1. Add observe to Python DataFrame. As I can see fro
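
For context, a minimal sketch of the batch-mode API this thread discusses (the metric names and sink path are illustrative):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// observe() attaches named aggregates that are computed as a side effect of
// the action, with no second pass over the data; in batch mode the results
// are delivered to QueryExecutionListeners as observed metrics.
val ds = spark.range(100)
  .observe("my_metrics", count(lit(1)).as("rows"), max($"id").as("max_id"))

ds.write.mode("overwrite").parquet("/tmp/observe-demo") // hypothetical sink
```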

Re: Observable Metrics on Spark Datasets

2021-03-16 Thread Enrico Minack
JVM and you can leverage these values from PySpark? (I see there's support for listeners with DStream in PySpark, so there might be reasons not to add the same for SQL/SS. Probably a lesson learned?) On Mon, Mar 15, 2021 at 6:59 PM Enrico Minack wrote

Re: Observable Metrics on Spark Datasets

2021-03-19 Thread Enrico Minack
…if we have a consensus on the usefulness of observable metrics on batch queries. On Tue, Mar 16, 2021 at 4:17 PM Enrico Minack wrote: I am focusing on batch mode, not streaming mode. I would argue that Dataset.observe() is equally u

Re: [SPARK-34806] Observable Metrics on Spark Datasets

2021-03-20 Thread Enrico Minack
The PR can be found here: https://github.com/apache/spark/pull/31905 On 19.03.21 at 10:55, Enrico Minack wrote: I'll sketch out a PR so we can talk code and move the discussion there. On 18.03.21 at 14:55, Wenchen Fan wrote: I think a listener-based API makes sense for streaming (

Re: [Spark Core]: Support for un-pivoting data ('melt')

2022-04-11 Thread Enrico Minack
The melt function has recently been implemented in the PySpark Pandas API (because melt is part of the Pandas API). I think the Scala/Java Dataset and Python DataFrame APIs deserve this method just as well, ideally all based on one implementation. I'd like to fuel the conversation with some code:
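
As a sketch of the semantics (not the proposed implementation), un-pivoting can be emulated today with the stack() SQL function; the column names are illustrative:

```scala
// assumes an active `spark` SparkSession
import spark.implicits._

// wide: one column per measurement
val wide = Seq((1, 2.0, 3.0), (2, 5.0, 7.0)).toDF("id", "a", "b")

// long: one row per (id, variable) pair -- the shape melt would produce
val long = wide.selectExpr("id", "stack(2, 'a', a, 'b', b) as (variable, value)")
long.show()
```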

Re: [Spark] [SQL] Updating Spark from version 3.0.1 to 3.2.1 reduced functionality for working with parquet files

2022-06-05 Thread Enrico Minack
Hi, looks like the error comes from the Parquet library, has the library version changed moving to 3.2.1? What are the parquet versions used in 3.0.1 and 3.2.1? Can you read that parquet file with the newer parquet library version natively (without Spark)? Then this might be a Parquet issue,

Cannot resolve graphx 3.4.0-SNAPSHOT

2022-06-19 Thread Enrico Minack
Hi devs, moving to 3.4.0 snapshots, Spark modules resolve perfectly fine for 3.4.0-SNAPSHOT, except for graphx: <groupId>org.apache.spark</groupId> <artifactId>spark-graphx_2.12</artifactId> <version>3.4.0-SNAPSHOT</version> <scope>provided</scope> ... Downloading from apache.snapshots: https://repository.apache.org/snapshots/org/apache/spark/spark-catalyst_2.12/3.

Re: Cannot resolve graphx 3.4.0-SNAPSHOT

2022-06-19 Thread Enrico Minack
Issue solved by explicitly adding the https://repository.apache.org/snapshots repository to my POM. Mvn resolved other packages from that repo, and this has worked for snapshots before. Thanks anyway, Enrico On 19.06.22 at 22:30, Enrico Minack wrote: Hi devs, moving to 3.4.0 snapshots
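
For reference, a sketch of the kind of POM entry this fix amounts to (the repository id is illustrative):

```xml
<repositories>
  <repository>
    <id>apache-snapshots</id>
    <url>https://repository.apache.org/snapshots</url>
    <releases><enabled>false</enabled></releases>
    <snapshots><enabled>true</enabled></snapshots>
  </repository>
</repositories>
```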

Support for spark-packages.org

2022-09-13 Thread Enrico Minack
Hi devs, I understand that spark-packages.org is not associated with Apache and Apache Spark, but hosted by Databricks. Does anyone have any pointers on how to get support? The e-mail address feedb...@spark-packages.org does not respond. I found a few "missing features" that block me from re

Does partitioned write preserve in-partition order?

2022-10-11 Thread Enrico Minack
Hi Devs, this has been raised by Swetha on the user mailing list, which also hit us recently. Here is the question again: Is it guaranteed that written files are sorted as stated in sortWithinPartitions? ds.repartition($"day").sortWithinPartitions($"day", $"id").write.partit
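
For completeness, a sketch of the full pattern under discussion; the preview cuts off, so the partitionBy call and sink are assumed:

```scala
// assumes an active `spark`, import spark.implicits._, and a Dataset `ds`
// with columns "day" and "id"
ds.repartition($"day")
  .sortWithinPartitions($"day", $"id")
  .write
  .partitionBy("day")                // assumed continuation of the snippet
  .parquet("/tmp/partitioned-write") // hypothetical sink and format
```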

Re: Does partitioned write preserve in-partition order?

2022-10-11 Thread Enrico Minack
…With spark.sql.adaptive.coalescePartitions.enabled set to false, it still fails for all versions before 3.4.0. Enrico On 11.10.22 at 12:15, Enrico Minack wrote: Hi Devs, this has been raised by Swetha on the user mailing list, which also hit us recently. Here is the question again: Is it guaranteed that written files are

Re: Time for Spark 3.4.0 release?

2023-01-04 Thread Enrico Minack
Hi All, can we get these correctness issues fixed with the 3.4 release, please? SPARK-41162 incorrect query plan for anti-join and semi-join of self-joined aggregations (since 3.1), fix in https://github.com/apache/spark/pull/39131 SPARK-40885 losing in-partition order for string type partiti

Re: Time for Spark 3.4.0 release?

2023-01-17 Thread Enrico Minack
Hi Xinrong, what about regression issue https://issues.apache.org/jira/browse/SPARK-40819 and correctness issue https://issues.apache.org/jira/browse/SPARK-40885? The latter gets fixed by either https://issues.apache.org/jira/browse/SPARK-41959 or https://issues.apache.org/jira/browse/SPARK-

Re: Time for Spark 3.4.0 release?

2023-01-17 Thread Enrico Minack
You are saying the RCs are cut from that branch at a later point? What is the estimated deadline for that? Enrico On 18.01.23 at 07:59, Hyukjin Kwon wrote: These look like we can fix it after the branch cut, so it should be fine. On Wed, 18 Jan 2023 at 15:57, Enrico Minack wrote: Hi

Re: [VOTE] Release Spark 3.3.2 (RC1)

2023-02-12 Thread Enrico Minack
RC builds and all our downstream tests are green, thanks for the release! On 11.02.23 at 06:00, L. C. Hsieh wrote: Please vote on releasing the following candidate as Apache Spark version 3.3.2. The vote is open until Feb 15th 9AM (PST) and passes if a majority +1 PMC votes are cast, with a m

Re: Spark Union performance issue

2023-02-22 Thread Enrico Minack
Plus, the number of unioned tables would be helpful, as well as which downstream operations are performed on the unioned tables. And what "performance issues" exactly do you measure? Enrico On 22.02.23 at 16:50, Mich Talebzadeh wrote: Hi, a few details will help: 1. Spark version 2. Spark SQL,
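
For illustration, the usual shape of such jobs (all names hypothetical): each union adds a node to the logical plan, so with hundreds of inputs, planning time rather than data volume can become the bottleneck.

```scala
import org.apache.spark.sql.DataFrame

// assumes an active `spark`; 500 identically-shaped inputs, illustrative
val parts: Seq[DataFrame] = (1 to 500).map(_ => spark.range(10).toDF("id"))

// reduce-based union builds a deeply nested plan tree
val all = parts.reduce(_ unionByName _)
all.count()
```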

Re: Spark Union performance issue

2023-02-22 Thread Enrico Minack
…at 11:07 AM Enrico Minack wrote: Plus, the number of unioned tables would be helpful, as well as which downstream operations are performed on the unioned tables. And what "performance issues" exactly do you measure? Enrico On 22.02.23 at 16:50, Mich Taleb

Spark 3.2.4 pom NOT FOUND on maven

2023-04-17 Thread Enrico Minack
Hi, thanks for the Spark 3.2.4 release. I have found that Maven Central does not serve the spark-parent_2.13 pom file. It is listed in the directory: https://repo1.maven.org/maven2/org/apache/spark/spark-parent_2.13/3.2.4/ but it cannot be downloaded: https://repo1.maven.org/maven2/org/apache/spark/spar

Re: Spark 3.2.4 pom NOT FOUND on maven

2023-04-17 Thread Enrico Minack
Any suggestions on how to fix or use the Spark 3.2.4 (Scala 2.13) release? Cheers, Enrico On 17.04.23 at 08:19, Enrico Minack wrote: Hi, thanks for the Spark 3.2.4 release. I have found that Maven Central does not serve the spark-parent_2.13 pom file. It is listed in the directory: https://repo1

Spark 3.4.0 with Hadoop2.7 cannot be downloaded

2023-04-19 Thread Enrico Minack
Hi, selecting Spark 3.4.0 with Hadoop2.7 at https://spark.apache.org/downloads.html leads to https://www.apache.org/dyn/closer.lua/spark/spark-3.4.0/spark-3.4.0-bin-hadoop2.tgz saying: The requested file or directory is *not* on the mirrors. The object is not in our archive https://archi

Re: Spark 3.4.0 with Hadoop2.7 cannot be downloaded

2023-04-20 Thread Enrico Minack
…recommended for all Hadoop clusters. Please see SPARK-40651 <https://issues.apache.org/jira/browse/SPARK-40651>. The option to download Spark 3.4.0 with Hadoop2.7 has been removed from the Downloads page to avoid confusion. Thanks, Xinrong Meng On Wed, Apr 19, 2023 at 11:24 PM Enrico

Re: Spark 3.2.4 pom NOT FOUND on maven

2023-04-21 Thread Enrico Minack
/spark/spark-parent_2.13/3.2.4/spark-parent_2.13-3.2.4.pom You may want to use (1) and (2) repositories temporarily while waiting for `repo1.maven.org`'s recovery. Dongjoon. On 2023/04/18 05:38:59 Enrico Minack wrote: Any suggestions on how to fix or use the Spark 3.2.4 (Scala 2.13) re

Re: [DISCUSS] Add SQL functions into Scala, Python and R API

2023-05-24 Thread Enrico Minack
+1 Functions available in SQL (more generally, in one API) should be available in all APIs. I am very much in favor of this. Enrico On 24.05.23 at 09:41, Hyukjin Kwon wrote: Hi all, I would like to discuss adding all SQL functions into the Scala, Python and R APIs. We have SQL functions that d
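
To make the gap concrete: a function that exists in SQL but not in the Scala DSL can only be reached through expr()/selectExpr() — a sketch, using parse_url as an example of such a function at the time:

```scala
import org.apache.spark.sql.functions.expr
// assumes an active `spark`
import spark.implicits._

val df = Seq("https://spark.apache.org/docs").toDF("url")

// parse_url is available in SQL, but had no dedicated Scala DSL function
// when this thread was written, so it must be called via expr():
df.select(expr("parse_url(url, 'HOST')").as("host")).show()
```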

Re: [Reminder] Spark 3.5 Branch Cut

2023-07-15 Thread Enrico Minack
Speaking of JdbcDialect, is there any interest in getting upserts for JDBC into 3.5.0? [SPARK-19335][SPARK-38200][SQL] Add upserts for writing to JDBC: https://github.com/apache/spark/pull/41518 [SPARK-19335][SPARK-38200][SQL] Add upserts for writing to JDBC using MERGE INTO with temp table: h

On adding applyInArrow to groupBy and cogroup

2023-10-26 Thread Enrico Minack
Hi devs, PySpark allows transforming a DataFrame via a Pandas and an Arrow API: df.mapInArrow(map_arrow, schema="...") df.mapInPandas(map_pandas, schema="...") For df.groupBy(...) and df.groupBy(...).cogroup(...), there is only a Pandas interface, no Arrow interface: df.groupBy("id").ap

10x to 100x faster df.groupby().applyInPandas()

2023-12-01 Thread Enrico Minack
Hi devs, I am looking for a PySpark dev who is interested in a 10x to 100x speed-up of df.groupby().applyInPandas() for small groups. A PoC and benchmark can be found at https://github.com/apache/spark/pull/37360#issuecomment-1228293766. I suppose the same approach could be taken to

ShuffleManager and Speculative Execution

2023-12-21 Thread Enrico Minack
Hi Spark devs, I have a question about the ShuffleManager: with speculative execution, one map output file is created multiple times (by multiple task attempts). If both attempts succeed, which one is read by the reduce task in the next stage? Is any map output as good as any other? Tha

[SPARK-29176][DISCUSS] Optimization should change join type to CROSS

2019-11-06 Thread Enrico Minack
Hi, I would like to discuss issue SPARK-29176 to see if this is considered a bug and, if so, to sketch out a fix. In short, the issue is that a valid inner join with a condition gets optimized so that no condition is left, but the type is still INNER. Then CheckCartesianProducts throws an excep
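
A hedged sketch of the kind of query that can hit this (not necessarily the JIRA's exact reproducer):

```scala
import org.apache.spark.sql.functions.lit
// assumes an active `spark` with spark.sql.crossJoin.enabled=false
// (the default before Spark 3.0)

val left = spark.range(10)
val right = spark.range(10)

// The optimizer constant-folds the condition away, leaving an INNER join
// with no condition; CheckCartesianProducts then throws, even though the
// original query did have a (trivial) condition.
left.join(right, lit(1) === lit(1)).count()
```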

Re: [SPARK-29176][DISCUSS] Optimization should change join type to CROSS

2019-11-06 Thread Enrico Minack
…hence the error, which you can disable. The query is not invalid in any case. It's just stopping you from doing something you may not have meant to do, and which may be expensive. However, I think we've already changed the default to enable it in Spark 3 anyway. On Wed, Nov 6, 2019 at 8:50 AM Enrico Min

[DISCUSS] Expensive deterministic UDFs

2019-11-07 Thread Enrico Minack
Hi all, running expensive deterministic UDFs that return complex types, followed by multiple references to those results, causes Spark to evaluate the UDF multiple times per row. This has been reported and discussed before: SPARK-18748, SPARK-17728. val f: Int => Array[Int] val udfF = ud
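
A sketch of the effect (the UDF body is illustrative), together with the asNondeterministic() workaround, which prevents the optimizer from duplicating the call at the cost of some optimizations:

```scala
import org.apache.spark.sql.functions.udf
// assumes an active `spark`
import spark.implicits._

// an "expensive" deterministic UDF returning a complex type
val f: Int => Array[Int] = i => { Thread.sleep(100); Array(i, i + 1) }
val udfF = udf(f)

val df = spark.range(10).select($"id".cast("int").as("id"))

// both references to $"r" can be rewritten into separate UDF invocations,
// so the UDF may be evaluated twice per row:
df.select(udfF($"id").as("r")).select($"r"(0), $"r"(1)).show()

// marking the UDF non-deterministic suppresses the duplication:
val udfN = udf(f).asNondeterministic()
```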

Re: [DISCUSS] Expensive deterministic UDFs

2019-11-08 Thread Enrico Minack
…At first look, no, I don't think this Spark-side workaround for naming for your use case is worthwhile. There are existing better solutions. On Thu, Nov 7, 2019 at 2:45 AM Enrico Minack wrote:

[SPARK-30296][SQL] Add Dataset diffing feature

2020-01-07 Thread Enrico Minack
Hi Devs, I'd like to get your thoughts on this Dataset feature proposal. Comparing datasets is a central operation when regression-testing your code changes. It would be super useful if Spark's Datasets provided this transformation natively. https://github.com/apache/spark/pull/26936 Regar
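
For context, the closest built-in approximation today is set difference — a sketch, not the API proposed in the PR:

```scala
// assumes an active `spark`; two versions of the "same" Dataset
val before = spark.range(5).toDF("id")
val after  = spark.range(1, 6).toDF("id")

val removed = before.except(after) // rows only in `before` (id = 0)
val added   = after.except(before) // rows only in `after`  (id = 5)
```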

[SPARK-30319][SQL] Add a stricter version of as[T]

2020-01-07 Thread Enrico Minack
Hi Devs, I'd like to propose a stricter version of as[T]. Given the interface def as[T](): Dataset[T], it is counter-intuitive that the schema of the returned Dataset[T] is not agnostic to the schema of the originating Dataset. The schema should always be derived only from T. I am proposing
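
To illustrate the surprise (the case class and extra column are hypothetical), including the map(identity) trick referenced in the reply below:

```scala
import org.apache.spark.sql.functions.lit
// assumes an active `spark`
import spark.implicits._

case class Id(id: Long) // hypothetical target type

// extra columns survive as[T] -- the schema is not derived from T alone:
val ds = spark.range(3).withColumn("extra", lit(1)).as[Id]
ds.printSchema() // contains both "id" and "extra"

// map(identity) materializes T, so the schema is derived from T only:
ds.map(identity).printSchema() // only "id"
```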

Re: [SPARK-30319][SQL] Add a stricter version of as[T]

2020-01-08 Thread Enrico Minack
…`map(identity)`. On Tue, Jan 7, 2020 at 4:42 PM Enrico Minack wrote: Hi Devs, I'd like to propose a stricter version of as[T]. Given the interface def as[T](): Dataset[T], it is counter-intuitive that the schema of the returned Dataset[

Fwd: dataframe null safe joins given a list of columns

2020-02-08 Thread Enrico Minack
Hi Devs, I am forwarding this from the user mailing list. I agree that the <=> version of join(Dataset[_], Seq[String]) would be useful. Does anyone on the PMC consider this useful enough to be added to the Dataset API? I'd be happy to create a PR in that case. Enrico Forwarded message
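
A sketch of what the requested variant would save users from writing by hand (DataFrames and columns illustrative):

```scala
// assumes an active `spark`
import spark.implicits._

val left  = Seq((Option(1), Option(2)), (Option.empty[Int], Option(3))).toDF("a", "b")
val right = Seq((Option(1), Option(2)), (Option.empty[Int], Option(3))).toDF("a", "b")

// today, a null-safe equi-join over a list of columns is built manually:
val cols = Seq("a", "b")
val cond = cols.map(c => left(c) <=> right(c)).reduce(_ && _)
left.join(right, cond).show() // <=> also matches the rows where "a" is null
```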

comparable and orderable CalendarInterval

2020-02-11 Thread Enrico Minack
Hi Devs, I would like to know the current roadmap for making CalendarInterval comparable and orderable again (SPARK-29679, SPARK-29385, #26337). With #27262 this got reverted, but SPARK-30551 does not mention how to go forward on this matter. I have found SPARK-28494, but this seems t

Re: comparable and orderable CalendarInterval

2020-02-11 Thread Enrico Minack
…), so comparing it with the "right kinds of intervals" should always be correct. Enrico On 11.02.20 at 17:06, Wenchen Fan wrote: What's your use case for comparing intervals? It's tricky in Spark, as there is only one interval type and you can't really compare one month with

[SPARK-30957][SQL] Null-safe variant of Dataset.join(Dataset[_], Seq[String])

2020-02-26 Thread Enrico Minack
I have created a JIRA to track this request: https://issues.apache.org/jira/browse/SPARK-30957 Enrico On 08.02.20 at 16:56, Enrico Minack wrote: Hi Devs, I am forwarding this from the user mailing list. I agree that the <=> version of join(Dataset[_], Seq[String]) would be useful.

Re: comparable and orderable CalendarInterval

2020-03-05 Thread Enrico Minack
…The length of an interval can be measured by dividing it by the length of your measuring unit, e.g. "1 hour": $"interval" / lit("1 hour").cast(CalendarIntervalType). Which brings us to CalendarInterval division: https://gith

Re: Need to order iterator values in spark dataframe

2020-03-26 Thread Enrico Minack
Abhinav, you can repartition by your key, then sortWithinPartitions, and then groupByKey. Since the data are already hash-partitioned by key, Spark should not shuffle them and hence not change the sort within each partition: ds.repartition($"key").sortWithinPartitions($"code").groupBy($"key") Enrico

Re: is there any tool to visualize the spark physical plan or spark plan

2020-05-02 Thread Enrico Minack
Kelly Zhang, you can add a SparkListener to your Spark context: sparkContext.addSparkListener(new SparkListener {}) That listener can override onTaskEnd, which provides a SparkListenerTaskEnd for each task. That instance gives you access to the metrics. See: - https://spark.apache.org/doc
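
Fleshed out, a minimal sketch of such a listener (the choice of metrics is illustrative):

```scala
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// assumes an active `spark`; prints a few metrics as each task finishes
spark.sparkContext.addSparkListener(new SparkListener {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    val m = taskEnd.taskMetrics
    if (m != null) {
      println(s"stage ${taskEnd.stageId}: runTime=${m.executorRunTime} ms, " +
        s"shuffleRead=${m.shuffleReadMetrics.totalBytesRead} B")
    }
  }
})
```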

ShuffleDataIO: where is the reading part of the API?

2024-10-30 Thread Enrico Minack
Hi devs, the docs of org.apache.spark.shuffle.api.ShuffleDataIO read: "An interface for plugging in modules for storing and reading temporary shuffle data." But the API only provides interfaces for writing shuffle data: - ShuffleExecutorComponents.createMapOutputWriter - ShuffleExecutorCom

Re: Extending Spark with a custom ExternalClusterManager

2025-02-19 Thread Enrico Minack
Hi devs, Let me pull some spark-submit developers into this discussion. @dongjoon-hyun @HyukjinKwon @cloud-fan What are your thoughts on making spark-submit fully and generically support ExternalClusterManager implementations? The current situation is that the only way to submit a Spark job vi