Hi Spark Developers,
I have a fundamental question on the process of contributing to Apache
Spark from outside the circle of committers.
I have gone through a number of pull requests and have always found it
hard to get feedback, especially from committers. I understand there is a
very high com
On 18.02.21 at 16:34, Sean Owen wrote:
One other aspect is that a committer is taking some degree of
responsibility for merging a change, so the ask is more than just a
few minutes of eyeballing. If it breaks something, the merger pretty
much owns resolving it, and the whole project owns any c
Hi Spark-Devs,
the observable metrics that have been added to the Dataset API in 3.0.0
are a great improvement over the Accumulator APIs that seem to have much
weaker guarantees. I have two questions regarding follow-up contributions:
1. Add observe to Python DataFrame
As I can see fro
JVM
and you can leverage these values from PySpark?
(I see there's support for listeners with DStream in PySpark, so there
might be reasons not to add the same for SQL/SS. Probably a lesson
learned?)
On Mon, Mar 15, 2021 at 6:59 PM Enrico Minack <m...@enrico.minack.dev> wrote:
f we have a consensus on the usefulness of observable
metrics on batch query.
On Tue, Mar 16, 2021 at 4:17 PM Enrico Minack
<m...@enrico.minack.dev> wrote:
I am focusing on batch mode, not streaming mode. I would argue
that Dataset.observe() is equally u
The PR can be found here: https://github.com/apache/spark/pull/31905
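For reference, a minimal sketch of how this already works on the Scala side in batch mode, with a QueryExecutionListener picking up the observed metrics (the metric name and the noop sink are just for illustration):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.execution.QueryExecution
    import org.apache.spark.sql.functions.{count, lit, sum}
    import org.apache.spark.sql.util.QueryExecutionListener

    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    // the listener fires once the batch query finishes and can read the observed metrics
    spark.listenerManager.register(new QueryExecutionListener {
      override def onSuccess(funcName: String, qe: QueryExecution, durationNs: Long): Unit =
        qe.observedMetrics.get("stats").foreach(row => println(s"observed: $row"))
      override def onFailure(funcName: String, qe: QueryExecution, exception: Exception): Unit = ()
    })

    Seq(1, 2, 3).toDS()
      .observe("stats", count(lit(1)).as("rows"), sum($"value").as("sum"))
      .write.format("noop").mode("overwrite").save()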
On 19.03.21 at 10:55, Enrico Minack wrote:
I'll sketch out a PR so we can talk code and move the discussion there.
On 18.03.21 at 14:55, Wenchen Fan wrote:
I think a listener-based API makes sense for streaming (
The melt function has recently been implemented in the PySpark Pandas
API (because melt is part of the Pandas API). I think the Scala/Java
Dataset and Python DataFrame APIs deserve this method just as much,
ideally all based on one implementation.
I'd like to fuel the conversation with some code:
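As a starting point, here is a rough sketch (not the actual PR) of how melt could look on the Scala DataFrame API; parameter names follow Pandas, and it assumes all value columns share a common type:

    import org.apache.spark.sql.{Column, DataFrame}
    import org.apache.spark.sql.functions.{array, col, explode, lit, struct}

    // turn the given value columns into (variable, value) rows, keeping the id columns
    def melt(df: DataFrame, ids: Seq[String], values: Seq[String],
             variableName: String = "variable", valueName: String = "value"): DataFrame = {
      val kv: Column = explode(array(
        values.map(c => struct(lit(c).as(variableName), col(c).as(valueName))): _*
      ))
      df.select(ids.map(col) :+ kv.as("kv"): _*)
        .select(ids.map(col) :+ col(s"kv.$variableName") :+ col(s"kv.$valueName"): _*)
    }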
Hi,
looks like the error comes from the Parquet library. Has the library
version changed when moving to 3.2.1? What are the Parquet versions used
in 3.0.1 and 3.2.1? Can you read that Parquet file with the newer Parquet
library version natively (without Spark)? Then this might be a Parquet
issue,
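For example, something along these lines reads the footer with plain parquet-mr, no Spark involved (the path is a placeholder; assumes parquet-hadoop is on the classpath):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path
    import org.apache.parquet.hadoop.ParquetFileReader
    import org.apache.parquet.hadoop.util.HadoopInputFile

    // open the file footer directly with the Parquet library
    val file = HadoopInputFile.fromPath(new Path("/path/to/file.parquet"), new Configuration())
    val reader = ParquetFileReader.open(file)
    try {
      println(reader.getFooter.getFileMetaData.getSchema)   // the Parquet schema
      println(s"rows: ${reader.getRecordCount}")
    } finally {
      reader.close()
    }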
Hi devs,
moving to 3.4.0 snapshots, Spark modules resolve perfectly fine for
3.4.0-SNAPSHOT, except for graphx:
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-graphx_2.12</artifactId>
      <version>3.4.0-SNAPSHOT</version>
      <scope>provided</scope>
    </dependency>
...
Downloading from apache.snapshots:
https://repository.apache.org/snapshots/org/apache/spark/spark-catalyst_2.12/3.
Issue solved by explicitly adding the
https://repository.apache.org/snapshots repository to my POM.
Maven resolved other packages from that repo, and this has worked for
snapshots before.
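For anyone hitting the same, a repository entry along these lines (the id is arbitrary) makes Maven consult the Apache snapshots repository:

    <repositories>
      <repository>
        <id>apache-snapshots</id>
        <url>https://repository.apache.org/snapshots</url>
        <snapshots>
          <enabled>true</enabled>
        </snapshots>
      </repository>
    </repositories>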
Thanks anyway,
Enrico
On 19.06.22 at 22:30, Enrico Minack wrote:
Hi devs,
moving to 3.4.0 snapshots
Hi devs,
I understand that spark-packages.org is not associated with Apache or
Apache Spark, but is hosted by Databricks. Does anyone have any pointers on
how to get support? The e-mail address feedb...@spark-packages.org does
not respond.
I found a few "missing features" that block me from re
Hi Devs,
this has been raised by Swetha on the user mailing list, which also hit
us recently.
Here is the question again:
*Is it guaranteed that written files are sorted as stated in
**sortWithinPartitions**?*
ds.repartition($"day")
  .sortWithinPartitions($"day", $"id")
  .write
  .partit
spark.sql.adaptive.coalescePartitions.enabled set to false, it still fails
for all versions before 3.4.0.
Enrico
On 11.10.22 at 12:15, Enrico Minack wrote:
Hi Devs,
this has been raised by Swetha on the user mailing list, which also
hit us recently.
Here is the question again:
*Is it guaranteed that written files are
Hi All,
can we get these correctness issues fixed with the 3.4 release, please?
SPARK-41162 incorrect query plan for anti-join and semi-join of
self-joined aggregations (since 3.1), fix in
https://github.com/apache/spark/pull/39131
SPARK-40885 losing in-partition order for string type partiti
Hi Xinrong,
what about regression issue
https://issues.apache.org/jira/browse/SPARK-40819
and correctness issue https://issues.apache.org/jira/browse/SPARK-40885?
The latter gets fixed by either
https://issues.apache.org/jira/browse/SPARK-41959 or
https://issues.apache.org/jira/browse/SPARK-
You are saying the RCs are cut from that branch at a later point? What
is the estimated deadline for that?
Enrico
On 18.01.23 at 07:59, Hyukjin Kwon wrote:
These look like something we can fix after the branch cut, so it should be fine.
On Wed, 18 Jan 2023 at 15:57, Enrico Minack
wrote:
Hi
RC builds and all our downstream tests are green, thanks for the release!
On 11.02.23 at 06:00, L. C. Hsieh wrote:
Please vote on releasing the following candidate as Apache Spark version 3.3.2.
The vote is open until Feb 15th 9AM (PST) and passes if a majority +1
PMC votes are cast, with a m
Plus, the number of unioned tables would be helpful, as well as which
downstream operations are performed on the unioned tables.
And what "performance issues" exactly do you measure?
Enrico
On 22.02.23 at 16:50, Mich Talebzadeh wrote:
Hi,
A few details will help:
1. Spark version
2. Spark SQL,
t 11:07 AM Enrico Minack
wrote:
Plus, the number of unioned tables would be helpful, as well as which
downstream operations are performed on the unioned tables.
And what "performance issues" exactly do you measure?
Enrico
On 22.02.23 at 16:50, Mich Taleb
Hi,
thanks for the Spark 3.2.4 release.
I have found that Maven does not serve the spark-parent_2.13 pom file.
It is listed in the directory:
https://repo1.maven.org/maven2/org/apache/spark/spark-parent_2.13/3.2.4/
But it cannot be downloaded:
https://repo1.maven.org/maven2/org/apache/spark/spar
Any suggestions on how to fix or use the Spark 3.2.4 (Scala 2.13) release?
Cheers,
Enrico
On 17.04.23 at 08:19, Enrico Minack wrote:
Hi,
thanks for the Spark 3.2.4 release.
I have found that Maven does not serve the spark-parent_2.13 pom file.
It is listed in the directory:
https://repo1
Hi,
selecting Spark 3.4.0 with Hadoop 2.7 at
https://spark.apache.org/downloads.html leads to
https://www.apache.org/dyn/closer.lua/spark/spark-3.4.0/spark-3.4.0-bin-hadoop2.tgz
saying:
The requested file or directory is *not* on the mirrors.
The object is not in our archive https://archi
recommended for all Hadoop clusters. Please see SPARK-40651
<https://issues.apache.org/jira/browse/SPARK-40651> .
The option to download Spark 3.4.0 with Hadoop 2.7 has been removed
from the Downloads page to avoid confusion.
Thanks,
Xinrong Meng
On Wed, Apr 19, 2023 at 11:24 PM Enrico
/spark/spark-parent_2.13/3.2.4/spark-parent_2.13-3.2.4.pom
You may want to use (1) and (2) repositories temporarily while waiting for
`repo1.maven.org`'s recovery.
Dongjoon.
On 2023/04/18 05:38:59 Enrico Minack wrote:
Any suggestions on how to fix or use the Spark 3.2.4 (Scala 2.13) re
+1
Functions available in SQL (more generally, in any one API) should be
available in all APIs. I am very much in favor of this.
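To illustrate the gap with a minimal sketch: a SQL function without a dedicated helper currently has to go through expr()/selectExpr(), which gives up compile-time checking of the function name and its arguments (percentile is, as far as I can see, one such function today):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.expr

    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    val df = Seq(1.0, 2.0, 3.0, 4.0).toDF("value")
    // no dedicated Scala/Python helper, so the SQL function goes through expr()
    df.select(expr("percentile(value, 0.95)")).show()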
Enrico
On 24.05.23 at 09:41, Hyukjin Kwon wrote:
Hi all,
I would like to discuss adding all SQL functions into Scala, Python
and R API.
We have SQL functions that d
Speaking of JdbcDialect, is there any interest in getting upserts for
JDBC into 3.5.0?
[SPARK-19335][SPARK-38200][SQL] Add upserts for writing to JDBC:
https://github.com/apache/spark/pull/41518
[SPARK-19335][SPARK-38200][SQL] Add upserts for writing to JDBC using
MERGE INTO with temp table: h
Hi devs,
PySpark allows transforming a DataFrame via both the Pandas *and* the Arrow API:
df.mapInArrow(map_arrow, schema="...")
df.mapInPandas(map_pandas, schema="...")
For df.groupBy(...) and df.groupBy(...).cogroup(...), there is
*only* a Pandas interface, no Arrow interface:
df.groupBy("id").ap
Hi devs,
I am looking for a PySpark dev who is interested in a 10x to 100x
speed-up of df.groupby().applyInPandas() for small groups.
A PoC and benchmark can be found at
https://github.com/apache/spark/pull/37360#issuecomment-1228293766.
I suppose the same approach could be taken to
Hi Spark devs,
I have a question around ShuffleManager: With speculative execution, one
map output file is being created multiple times (by multiple task
attempts). If both attempts succeed, which is to be read by the reduce
task in the next stage? Is any map output as good as any other?
Tha
Hi,
I would like to discuss issue SPARK-29176 to see if this is considered a
bug and if so, to sketch out a fix.
In short, the issue is that a valid inner join with a condition gets
optimized so that no condition is left, but the join type is still INNER.
Then CheckCartesianProducts throws an excep
, hence the error, which you can disable.
The query is not invalid in any case. It's just stopping you from
doing something you may not have meant to, and which may be expensive.
However I think we've already changed the default to enable it in
Spark 3 anyway.
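For reference, the check Sean mentions can be disabled via the cross join flag (which, as he notes, is enabled by default in Spark 3):

    // allow the optimizer-produced cartesian product instead of failing the query
    spark.conf.set("spark.sql.crossJoin.enabled", "true")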
On Wed, Nov 6, 2019 at 8:50 AM Enrico Min
Hi all,
Running expensive deterministic UDFs that return complex types, followed
by multiple references to those results, causes Spark to evaluate the UDF
multiple times per row. This has been reported and discussed before:
SPARK-18748 SPARK-17728
val f: Int => Array[Int]
val udfF = ud
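A self-contained sketch of the pattern I mean (the sleep just stands in for an expensive function):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.udf

    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    val f: Int => Array[Int] = i => { Thread.sleep(10); Array(i, i * i) }
    val udfF = udf(f)

    // projection collapsing inlines the UDF into each reference,
    // so it is evaluated once per referenced element, i.e. twice per row here
    Seq(1, 2, 3).toDF("id")
      .withColumn("arr", udfF($"id"))
      .select($"id", $"arr"(0).as("a"), $"arr"(1).as("b"))
      .explain()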
> At first look, no, I don't think this Spark-side workaround for naming
> for your use case is worthwhile. There are existing better solutions.
>
> On Thu, Nov 7, 2019 at 2:45 AM Enrico Minack <m...@enrico.minack.dev> wrote:
Hi Devs,
I'd like to get your thoughts on this Dataset feature proposal.
Comparing datasets is a central operation when regression testing your
code changes.
It would be super useful if Spark's Datasets provide this transformation
natively.
https://github.com/apache/spark/pull/26936
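To illustrate the idea (not the exact API of the PR), a diff of two Datasets on an id column could boil down to something like this, assuming non-null values for simplicity:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.when

    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    val left = Seq((1, "a"), (2, "b")).toDF("id", "value").as("l")
    val right = Seq((2, "b"), (3, "c")).toDF("id", "value").as("r")

    // label every id as inserted (I), deleted (D), changed (C) or unchanged (N)
    val diff = left.join(right, Seq("id"), "fullouter")
      .withColumn("diff",
        when($"l.value".isNull, "I")
          .when($"r.value".isNull, "D")
          .when($"l.value" === $"r.value", "N")
          .otherwise("C"))
    diff.show()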
Regar
Hi Devs,
I'd like to propose a stricter version of as[T]. Given the interface def
as[T](): Dataset[T], it is counter-intuitive that the schema of the
returned Dataset[T] still depends on the schema of the originating
Dataset. The schema should always be derived only from T.
I am proposing
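A minimal example of the current behaviour (column names made up):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    case class Person(name: String)

    val ds = Seq(("Alice", 30)).toDF("name", "age").as[Person]
    // the schema still contains "age", although Person has no such field;
    // a stricter as[T] would derive the schema from T alone, i.e. only "name"
    ds.printSchema()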
ap(identity)`.
On Tue, Jan 7, 2020 at 4:42 PM Enrico Minack <m...@enrico.minack.dev> wrote:
Hi Devs,
I'd like to propose a stricter version of as[T]. Given the
interface def as[T](): Dataset[T], it is counter-intuitive that
the schema of the returned Dataset[
Hi Devs,
I am forwarding this from the user mailing list. I agree that the <=>
version of join(Dataset[_], Seq[String]) would be useful.
Does any PMC consider this useful enough to be added to the Dataset API?
I'd be happy to create a PR in that case.
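For comparison, a minimal sketch of what you currently have to write for a null-safe equi-join, which also keeps both copies of the join columns in the result (unlike join(right, Seq("id"))):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    val left = Seq((Some(1), "a"), (None, "b")).toDF("id", "value")
    val right = Seq((Some(1), "x"), (None, "y")).toDF("id", "value")

    // null-safe join condition today: spell out <=> per column and deduplicate yourself
    left.join(right, left("id") <=> right("id")).show()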
Enrico
Forwarded message
Hi Devs,
I would like to know what is the current roadmap of making
CalendarInterval comparable and orderable again (SPARK-29679,
SPARK-29385, #26337).
With #27262, this got reverted but SPARK-30551 does not mention how to
go forward in this matter. I have found SPARK-28494, but this seems t
), so comparing it with the "right kinds of
intervals" should always be correct.
Enrico
On 11.02.20 at 17:06, Wenchen Fan wrote:
What's your use case to compare intervals? It's tricky in Spark as
there is only one interval type and you can't really compare one month
with
I have created a jira to track this request:
https://issues.apache.org/jira/browse/SPARK-30957
Enrico
On 08.02.20 at 16:56, Enrico Minack wrote:
Hi Devs,
I am forwarding this from the user mailing list. I agree that the <=>
version of join(Dataset[_], Seq[String]) would be useful.
The length of an interval can be measured by dividing it by the length
of your measuring unit, e.g. "1 hour":
    $"interval" / lit("1 hour").cast(CalendarIntervalType)
Which brings us to CalendarInterval division:
https://gith
Abhinav,
you can repartition by your key, then sortWithinPartitions, and then
groupByKey. Since the data are already hash-partitioned by key, Spark
should not shuffle the data and hence not change the sort within each partition:
ds.repartition($"key").sortWithinPartitions($"code").groupBy($"key")
Enrico
Kelly Zhang,
You can add a SparkListener to your Spark context:
    sparkContext.addSparkListener(new SparkListener {})
That listener can override onTaskEnd, which gives you a
SparkListenerTaskEnd for each task. That instance provides access to
the task metrics.
See:
-
https://spark.apache.org/doc
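A minimal sketch of such a listener, printing two input metrics (pick whichever fields of TaskMetrics you need):

    import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().master("local[*]").getOrCreate()

    // print a couple of task metrics for every finished task
    spark.sparkContext.addSparkListener(new SparkListener {
      override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
        val metrics = taskEnd.taskMetrics
        if (metrics != null) {
          println(s"stage=${taskEnd.stageId} " +
            s"recordsRead=${metrics.inputMetrics.recordsRead} " +
            s"bytesRead=${metrics.inputMetrics.bytesRead}")
        }
      }
    })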
Hi devs,
the docs of org.apache.spark.shuffle.api.ShuffleDataIO read:
An interface for plugging in modules for storing and reading
temporary shuffle data.
but the API only provides interfaces for writing shuffle data:
- ShuffleExecutorComponents.createMapOutputWriter
- ShuffleExecutorCom
Hi devs,
Let me pull some spark-submit developers into this discussion.
@dongjoon-hyun @HyukjinKwon @cloud-fan
What are your thoughts on making spark-submit fully and generically
support ExternalClusterManager implementations?
The current situation is that the only way to submit a Spark job vi