+1 on Jungtaek's point. Can we revisit this when we release Spark 3.1?
After the release of 3.0, I believe we will get more feedback about DSv2
from the community. The current design was made by just a small group of
contributors. DSv2 + catalog APIs are still evolving. It is very likely we
will mak
Just 2 cents, I haven't tracked the changes to DSv2 (though I had to deal
with them, as the changes caused confusion on my PRs...), but my bet is that
DSv2 has already changed in incompatible ways, at least for anyone who works
on a custom DataSource, forcing downstream projects to diverge their
implementations heavily
small correction: confusion -> conflict, so I had to go through and
understand parts of the changes
On Sat, Sep 21, 2019 at 1:25 PM Jungtaek Lim wrote:
> Just 2 cents, I haven't tracked the change of DSv2 (though I needed to
> deal with this as the change made confusion on my PRs...), but my bet
Do you mean you want to have a breaking API change between 3.0 and 3.1?
I believe we follow Semantic Versioning (
https://spark.apache.org/versioning-policy.html ).
> We just won’t add any breaking changes before 3.1.
Bests,
Dongjoon.
On Fri, Sep 20, 2019 at 11:48 AM Ryan Blue
wrote:
> I don’
Hi all,
I am trying to do some actions on the Driver side in Spark while an
application is running. The Driver needs to know the task progress before
making any decision. I know that task progress can be accessed within each
executor or task from the RecordReader class by calling getProgress().
The
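Aside from per-task getProgress(), progress can also be observed entirely on the driver side. A minimal sketch (the class and field names here are made up, not from the thread): count finished tasks per stage with a SparkListener, which Spark invokes on the driver.

```scala
import scala.collection.mutable
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// Hypothetical listener: tracks how many tasks have finished per stage.
class TaskProgressListener extends SparkListener {
  private val finished = mutable.Map.empty[Int, Int]

  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = synchronized {
    finished(taskEnd.stageId) = finished.getOrElse(taskEnd.stageId, 0) + 1
  }

  def finishedTasks(stageId: Int): Int = synchronized {
    finished.getOrElse(stageId, 0)
  }
}

// Driver-side registration:
//   spark.sparkContext.addSparkListener(new TaskProgressListener)
```

Alternatively, SparkContext.statusTracker exposes getStageInfo(stageId), whose numCompletedTasks and numTasks fields give the same progress information without a custom listener.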
Hi, All.
As a next step, we started JDK11 QA.
https://issues.apache.org/jira/browse/SPARK-29194
This issue mainly focuses on the following areas, but feel free to add any
sub-issues which you hit on JDK11 from now.
- Documentations
- Examples
- Performance
- Integration Tests
B
I don’t think we need to gate a 3.0 release on making a more stable version
of InternalRow
Sounds like we agree, then. We will use it for 3.0, but there are known
problems with it.
Thinking we’d have dsv2 working in both 3.x (which will change and progress
towards more stable, but will have to br
I don't think we need to gate a 3.0 release on making a more stable version of
InternalRow, but thinking we'd have dsv2 working in both 3.x (which will change
and progress towards more stable, but will have to break certain APIs) and 2.x
seems like a false premise.
To point out some problems wi
When you created the PR to make InternalRow public
This isn’t quite accurate. The change I made was to use InternalRow instead
of UnsafeRow, which is a specific implementation of InternalRow. Exposing
this API has always been a part of DSv2 and while both you and I did some
work to avoid this, we
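For context on what "use InternalRow instead of UnsafeRow" means in practice, here is a hedged sketch of a v2 reader (package names as in the 3.x connector API; the class and its data are hypothetical, not from the thread):

```scala
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.connector.read.PartitionReader

// Hypothetical reader: a DSv2 PartitionReader is declared against the
// InternalRow interface, so any InternalRow implementation works, not
// only the UnsafeRow implementation.
class LongsReader(values: Iterator[Long]) extends PartitionReader[InternalRow] {
  private var current: InternalRow = _

  override def next(): Boolean =
    if (values.hasNext) { current = InternalRow(values.next()); true }
    else false

  override def get(): InternalRow = current

  override def close(): Unit = ()
}
```

This is the sense in which the API exposes InternalRow: the interface appears in reader signatures, while the concrete row implementation is left to Spark.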
I don't know enough about DSv2 to comment on this part, but, any
theoretical 2.5 is still a ways off. Does waiting for 3.0 to 'stabilize' it
as much as is possible help?
I say that because re: Java 11, the main breaking change is probably the
Hive 2 / Hadoop 3 dependency, JPMML (minor), as well as
To push back, while I agree we should not drastically change "InternalRow",
there are a lot of changes that need to happen to make it stable. For example,
none of the publicly exposed interfaces should be in the Catalyst package or
the unsafe package. External implementations should be decoupled
I didn't realize that Java 11 would require breaking changes. What breaking
changes are required?
On Fri, Sep 20, 2019 at 11:18 AM Sean Owen wrote:
> Narrowly on Java 11: the problem is that it'll take some breaking
> changes, more than would be usually appropriate in a minor release, I
> think.
> DSv2 is far from stable right?
No, I think it is reasonably stable and very close to being ready for a
release.
> All the actual data types are unstable and you guys have completely
ignored that.
I think what you're referring to is the use of `InternalRow`. That's a
stable API and there has be
Narrowly on Java 11: the problem is that it'll take some breaking
changes, more than would be usually appropriate in a minor release, I
think. I'm still not convinced there is a burning need to use Java 11
but stay on 2.4, after 3.0 is out, and at least the wheels are in
motion there. Java 8 is sti
DSv2 is far from stable right? All the actual data types are unstable and you
guys have completely ignored that. We'd need to work on that and that will be a
breaking change. If the goal is to make DSv2 work across 3.x and 2.x, that
seems too invasive of a change to backport once you consider th
I’m not sure that DSv2 list is accurate. We discussed this in the DSv2 sync
this week (just sent out the notes) and came up with these items:
- Finish TableProvider update to avoid another API change: pass all
table config from metastore
- Catalog behavior fix: https://issues.apache.org/j
Hi everyone,
In the DSv2 sync this week, we talked about a possible Spark 2.5 release
based on the latest Spark 2.4, but with DSv2 and Java 11 support added.
A Spark 2.5 release with these two additions will help people migrate to
Spark 3.0 when it is released because they will be able to use a s
Here are my notes from this week’s DSv2 sync.
*Attendees*:
Ryan Blue
Holden Karau
Russell Spitzer
Terry Kim
Wenchen Fan
Shiv Prashant Sood
Joseph Torres
Gengliang Wang
Matt Cheah
Burak Yavuz
*Topics*:
- Driver-side Hadoop conf
- SHOW DATABASES/NAMESPACES behavior
- Review outstanding 3
Thank you for the summary, Xingbo.
I also agree with Sean; I don't think those block the 3.0.0 preview
release.
In particular, correctness issues should not be there.
Instead, could you summarize what we have as of now for 3.0.0 preview?
I believe JDK11 (SPARK-28684) and Hive 2.3.5 (SPARK-
Hi,
to the best of my knowledge, the existing FileStreamSource reads all the
files in a directory (Hive table).
However, I need to be able to specify an initial partition it should start
from (i.e. like a Kafka offset/initial warmed-up state) and then only read
data which is semantically (i.e. using a fi
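As far as I know there is no built-in "starting offset" option for the file source. One hedged workaround (the path, schema, and the dt partition column below are all hypothetical) is to filter on a partition column, so that earlier partitions are skipped:

```scala
import org.apache.spark.sql.types._
// assumes a SparkSession named `spark` and `import spark.implicits._`

// Streaming file sources require an explicit schema.
val eventSchema = new StructType()
  .add("id", LongType)
  .add("dt", StringType)   // hypothetical partition column

val events = spark.readStream
  .format("parquet")
  .schema(eventSchema)
  .load("/data/events")            // hypothetical path
  .where($"dt" >= "2019-09-01")    // acts like an initial "offset"
```

Note this only filters what is processed; Spark still lists the directory, so it is a workaround rather than a true warmed-up starting state.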
Is this a list of items that might be focused on for the final 3.0
release? At least, Scala 2.13 support shouldn't be on that list. The
others look plausible, or are already done, but there are probably
more.
As for the 3.0 preview, I wouldn't necessarily block on any particular
feature, though, y
I forgot to mention an important part: I'm issuing the same query against
both parquets, selecting only one column:
df.select(sum('amount))
BR,
Tomas
On Thu, Sep 19, 2019 at 18:10 Tomas Bartalos wrote:
> Hello,
>
> I have 2 parquets (each containing 1 file):
>
>- parquet-wide - schema has 25 top le
> New pushdown API for DataSourceV2
One correction: I want to revisit the pushdown API to make sure it works
for dynamic partition pruning and can be extended to support
limit/aggregate/... pushdown in the future. It should be a small API update
instead of a new API.
On Fri, Sep 20, 2019 at 3:46
Hi all,
Let's start a new thread to discuss the on-going features for Spark 3.0
preview release.
Below is the feature list for the Spark 3.0 preview release. The list is
collected from the previous discussions in the dev list.
- Followup of the shuffle+repartition correctness issue: support r