[Spark SQL] query nested structure data

2014-08-27 Thread wenchen
I am going to dig into this issue: https://issues.apache.org/jira/browse/SPARK-2096 However, I noticed that there is already a NestedSqlParser in sql/core/test org.apache.spark.sql.parquet. I checked this parser and it could solve the issue I mentioned before. But why the author of the parser mark

Re: [DISCUSS] SPIP: FunctionCatalog

2021-02-09 Thread Wenchen Fan
cycle of the UDF instance. > Should they use String or UTF8String? What representations are supported and how will Spark detect and produce those representations? It's the same as InternalRow. We can just copy-paste the document of InternalRow to explain the corresponding java type for each data typ

Re: [DISCUSS] SPIP: FunctionCatalog

2021-02-09 Thread Wenchen Fan
Hi Holden, As Hyukjin said, following existing designs is not the principle of DS v2 API design. We should make sure the DS v2 API makes sense. AFAIK we didn't fully follow the catalog API design from Hive and I believe Ryan also agrees with it. I think the problem here is we were discussing some

Re: [DISCUSS] SPIP: FunctionCatalog

2021-02-09 Thread Wenchen Fan
tive (Trino). On Wed, Feb 10, 2021 at 10:18 AM Wenchen Fan wrote: > Hi Holden, > > As Hyukjin said, following existing designs is not the principle of DS v2 > API design. We should make sure the DS v2 API makes sense. AFAIK we didn't > fully follow the catalog API design fr

Re: [VOTE] Release Spark 3.0.2 (RC1)

2021-02-16 Thread Wenchen Fan
+1 On Wed, Feb 17, 2021 at 1:43 PM Dongjoon Hyun wrote: > +1 > > Bests, > Dongjoon. > > > On Tue, Feb 16, 2021 at 2:27 AM Herman van Hovell > wrote: > >> +1 >> >> On Tue, Feb 16, 2021 at 11:08 AM Hyukjin Kwon >> wrote: >> >>> +1 >>> >>> On Tue, Feb 16, 2021 at 5:10 PM, Prashant Sharma wrote: >>>

Re: [DISCUSS] SPIP: FunctionCatalog

2021-02-17 Thread Wenchen Fan
pep-0020/>: > > There should be one— and preferably only one —obvious way to do it. > > If multiple approaches have the way for developers to do the (almost) same > thing, I would prefer to avoid it. > > In addition, I would prefer to focus on what Spark does by default first. &

Re: [DISCUSS] SPIP: FunctionCatalog

2021-02-17 Thread Wenchen Fan
IW, there’s the saying I like in the zen of Python >> <https://www.python.org/dev/peps/pep-0020/>: >> >> There should be one— and preferably only one —obvious way to do it. >> >> If multiple approaches have the way for developers to do the (almost) >> same thi

Re: [DISCUSS] SPIP: FunctionCatalog

2021-02-18 Thread Wenchen Fan
using InternalRow and that it isn’t a usability >> problem to include it. >> >> Oh, and one last thought is that we already have users that call >> Dataset.map and use InternalRow. This would allow converting that code >> directly to a UDF. >> >>

Re: [DISCUSS] SPIP: FunctionCatalog

2021-02-21 Thread Wenchen Fan
nd, how would > you solve the problem when implementations define methods with the wrong > types? The InternalRow approach helps implementations catch that problem > (as demonstrated above) and also provides a fallback when there is a bug > preventing the invoke optimization from working. Th

Re: [DISCUSS] SPIP: FunctionCatalog

2021-02-22 Thread Wenchen Fan
's actually struct, values: array>, so we can also allow Java beans/Scala case classes here. The general idea is to use stuff that can retain nested type information at compile-time, i.e. array, java bean, case classes. Thanks, Wenchen On Mon, Feb 22, 2021 at 3:47 PM Walaa Eldin Moustafa

Re: [DISCUSS] SPIP: FunctionCatalog

2021-02-23 Thread Wenchen Fan
osal and rationales. > > It looks like we need more discussion to reach an agreement. And the > technical details become more difficult to track because this is an email > thread. > > Although Ryan initially suggested discussing this on Apache email thread > instead of the PR, c

Re: [DISCUSS] SPIP: FunctionCatalog

2021-03-02 Thread Wenchen Fan
Yes, GenericInternalRow is safe when types mismatch, at the cost of using Object[] and boxing primitive types. And this is a runtime failure, which is absolutely worse than query-compile-time checks. Also, don't forget my previous point: users need to specify the type and index su

Re: [ANNOUNCE] Announcing Apache Spark 3.1.1

2021-03-03 Thread Wenchen Fan
Great work and congrats! On Wed, Mar 3, 2021 at 3:51 PM Kent Yao wrote: > Congrats, all! > > Bests, > *Kent Yao * > @ Data Science Center, Hangzhou Research Institute, NetEase Corp. > *a spark enthusiast* > *kyuubi is a unified multi-tenant JDBC > interface fo

Re: [DISCUSS] SPIP: FunctionCatalog

2021-03-03 Thread Wenchen Fan
und like a good plan to everyone? If so, I’ll update the SPIP > doc so we can move forward. > > On Wed, Mar 3, 2021 at 4:36 PM Dongjoon Hyun > wrote: > >> Hi, All. >> >> We shared many opinions in different perspectives. >> However, we didn't reach a conse

Re: [VOTE] SPIP: Add FunctionCatalog

2021-03-09 Thread Wenchen Fan
+1 (binding) On Tue, Mar 9, 2021 at 1:47 PM Russell Spitzer wrote: > +1 (for what it's worth) > > Thanks for making such a robust proposal, i'm excited to see the new work > coming from this > > On Mar 8, 2021, at 11:44 PM, Dongjoon Hyun > wrote: > > +1 (binding) > > Thank you, Ryan. > > Bests,

Re: Apache Spark 3.2 Expectation

2021-03-11 Thread Wenchen Fan
There are many projects going on right now, such as new DS v2 APIs, ANSI interval types, join improvement, disaggregated shuffle, etc. I don't think it's realistic to do the branch cut in April. I'm +1 to release 3.2 around July, but it doesn't mean we have to cut the branch 3 months earlier. We s

Re: [DISCUSS] Support pandas API layer on PySpark

2021-03-16 Thread Wenchen Fan
+1, it's great to have Pandas support in Spark out of the box. On Tue, Mar 16, 2021 at 10:12 PM Takeshi Yamamuro wrote: > +1; the pandas interfaces are pretty popular and supporting them in > pyspark looks promising, I think. > one question I have; what's an initial goal of the proposal? > Is th

Re: SessionCatalog lock issue

2021-03-18 Thread Wenchen Fan
The `synchronized` is needed for getting `currentDb` IIUC. So a small change is to only wrap `formatDatabaseName(name.database.getOrElse(currentDb))` with `synchronized`. On Thu, Mar 18, 2021 at 3:38 PM Chang Chen wrote: > hi all > > We met an issue which is related with SessionCatalog synchron
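The suggestion in this reply can be pictured as a lock that guards only the read of the shared current database, while the rest of the lookup runs outside the critical section. The sketch below is in Python rather than Scala and is not Spark code: `TinyCatalog`, `qualified_name`, and the lower-casing rule are hypothetical stand-ins for `SessionCatalog`, the lookup path, and `formatDatabaseName`.

```python
import threading
from typing import Optional


class TinyCatalog:
    """Toy stand-in for a session catalog; not Spark's SessionCatalog."""

    def __init__(self, current_db: str = "default") -> None:
        self._lock = threading.Lock()
        self._current_db = current_db

    def set_current_database(self, name: str) -> None:
        with self._lock:
            self._current_db = name

    def qualified_name(self, table: str, database: Optional[str] = None) -> str:
        # Hold the lock only while reading the shared current database.
        with self._lock:
            db = (database or self._current_db).lower()
        # Anything slow (for example a metastore call) would run here, lock-free.
        return f"{db}.{table.lower()}"


catalog = TinyCatalog()
catalog.set_current_database("Sales")
print(catalog.qualified_name("Orders"))
print(catalog.qualified_name("Orders", database="HR"))
```

The design point is only that the critical section shrinks to the one statement that touches shared mutable state; whether that is sufficient for the real class depends on details not shown in this snippet.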

Re: Observable Metrics on Spark Datasets

2021-03-18 Thread Wenchen Fan
I think a listener-based API makes sense for streaming (since you need to keep watching the result), but may not be very reasonable for batch queries (you only get the result once). The idea of Observation looks good, but we should define what happens if `observation.get` is called before the batch
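One possible answer to the question raised here, what `get` should do when it is called early, is to block until the result arrives, optionally with a timeout. The sketch below is plain Python and only illustrates that choice; `ToyObservation`, `complete`, and the timeout behaviour are assumptions for illustration, not the API discussed in the thread.

```python
import threading
from typing import Dict, Optional


class ToyObservation:
    """Hypothetical holder for metrics that arrive when a batch query ends."""

    def __init__(self) -> None:
        self._done = threading.Event()
        self._metrics: Optional[Dict[str, int]] = None

    def complete(self, metrics: Dict[str, int]) -> None:
        # Called once by the (simulated) query when it finishes.
        self._metrics = dict(metrics)
        self._done.set()

    def get(self, timeout: Optional[float] = None) -> Dict[str, int]:
        # Block until the metrics exist; fail loudly if the wait times out.
        if not self._done.wait(timeout):
            raise TimeoutError("metrics are not available yet")
        assert self._metrics is not None
        return self._metrics


observation = ToyObservation()
# Simulate a query that finishes shortly after get() starts waiting.
threading.Timer(0.05, observation.complete, args=({"rows": 3},)).start()
print(observation.get(timeout=5.0))
```

Other semantics are equally defensible (returning an empty result, or raising immediately); the point is that the behaviour has to be specified one way or another.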

Re: Welcoming six new Apache Spark committers

2021-03-28 Thread Wenchen Fan
Congrats! On Mon, Mar 29, 2021 at 12:04 PM 郑瑞峰 wrote: > Congratulations to all! > > > -- Original Message -- > *From:* "Yuanjian Li" ; > *Sent:* Monday, March 29, 2021, 10:38 AM > *To:* "Yi Wu"; > *Cc:* "Gengliang Wang";"Xiao Li";"Chao > Sun";"Mridul Muralidharan";"Dongjoon > Hyun";"P

Re: [VOTE] SPIP: Support pandas API layer on PySpark

2021-03-28 Thread Wenchen Fan
+1 On Mon, Mar 29, 2021 at 1:45 PM Holden Karau wrote: > +1 > > On Sun, Mar 28, 2021 at 10:25 PM sarutak wrote: > >> +1 (non-binding) >> >> - Kousuke >> >> > +1 (non-binding) >> > >> > On Sun, Mar 28, 2021 at 9:06 PM 郑瑞峰 >> > wrote: >> > >> >> +1 (non-binding) >> >> >> >> -- Original

Re: PR testing and flaky tests (triggering executions separately)

2021-03-29 Thread Wenchen Fan
AFAIK, the checks triggered by GitHub Actions are almost the same as SparkPullRequestBuilder, except that they include one more Scala 2.13 check. So at least we don't have to wait for both SparkPullRequestBuilder and GitHub Actions before merging a PR. On Fri, Mar 26, 2021 at 6:09 PM Attila Zsolt Piros < piros.attila.z

Re: [VOTE] Release Spark 2.4.8 (RC1)

2021-04-07 Thread Wenchen Fan
+1 On Thu, Apr 8, 2021 at 9:24 AM Sean Owen wrote: > Looks good to me testing on Java 8, Hadoop 2.7, Ubuntu, with about all > profiles enabled. > I still get an odd failure in the Hive versions suite, but I keep seeing > that in my env and think it's something odd about my setup. > +1 >

Re: Big Broadcast Hash Join with Dynamic Partition Pruning gives wrong results

2021-04-07 Thread Wenchen Fan
Hi Tomas, thanks for reporting this bug! Is it possible to share your dataset so that other people can reproduce and debug it? On Thu, Apr 8, 2021 at 7:52 AM Tomas Bartalos wrote: > when I try to do a Broadcast Hash Join on a bigger table (6Mil rows) I get > an incorrect result of 0 rows. > > v

Re: Increase the number of parallel jobs in GitHub Actions at ASF organization level

2021-04-07 Thread Wenchen Fan
> for example, having sub-groups where each group shares the resources - currently one GitHub organisation shares all resources across the projects. That's a good idea. We do need to thank GitHub for giving free resources to ASF projects, but it's better if we can make it a business: we allow individ

Re: Apache Spark 3.2 Expectation

2021-04-11 Thread Wenchen Fan
g the release dates in https://github.com/apache/spark-website/pull/331 Thanks, Wenchen On Thu, Mar 11, 2021 at 11:17 PM Dongjoon Hyun wrote: > Thank you, Xiao, Wenchen and Hyukjin. > > Bests, > Dongjoon. > > > On Thu, Mar 11, 2021 at 2:15 AM Hyukjin Kwon wrote: > >>

Re: [VOTE] Release Spark 2.4.8 (RC2)

2021-04-14 Thread Wenchen Fan
+1 (binding) On Thu, Apr 15, 2021 at 12:22 AM Maxim Gekk wrote: > +1 (non-binding) > > On Wed, Apr 14, 2021 at 6:39 PM Dongjoon Hyun > wrote: > >> +1 >> >> Bests, >> Dongjoon. >> >> On Tue, Apr 13, 2021 at 10:38 PM Kent Yao wrote: >> >>> +1 (non-binding) >>> >>> *Kent Yao * >>> @ Data Science

Re: [DISCUSS] Add error IDs

2021-04-21 Thread Wenchen Fan
I think severity makes sense for logs, but not sure about errors. +1 to the proposal to improve the error message further. On Fri, Apr 16, 2021 at 6:01 PM Yuming Wang wrote: > +1 for this proposal. > > On Fri, Apr 16, 2021 at 5:15 AM Karen wrote: > >> We could leave space in the numbering syst

Re: [VOTE] Release Spark 2.4.8 (RC3)

2021-04-28 Thread Wenchen Fan
+1 (binding) On Thu, Apr 29, 2021 at 1:05 AM DB Tsai wrote: > +1 (binding) > > > On Apr 28, 2021, at 9:26 AM, Liang-Chi Hsieh wrote: > > > > > > Please vote on releasing the following candidate as Apache Spark version > > 2.4.8. > > > > The vote is open until May 4th at 9AM PST and passes if a

Re: Bintray replacement for spark-packages.org

2021-04-28 Thread Wenchen Fan
Shall we make new releases for 3.0 and 3.1? So that people don't need to change the sbt resolver/pom files to work around Bintray sunset. It's also been a while since the last 3.0 and 3.1 releases. On Tue, Apr 27, 2021 at 9:02 AM Matthew Powers wrote: > Great job fixing this!! I just checked an

Re: [Spark Catalog API] Support for metadata Backup/Restore

2021-05-07 Thread Wenchen Fan
If a catalog implements backup/restore, it can easily expose some client APIs to the end-users (e.g. REST API), I don't see a strong reason to expose the APIs to Spark. Do you plan to add new SQL commands in Spark to backup/restore a catalog? On Tue, May 4, 2021 at 2:39 AM Tianchen Zhang wrote:

Re: [VOTE] Release Spark 2.4.8 (RC4)

2021-05-11 Thread Wenchen Fan
+1 On Tue, May 11, 2021 at 2:59 AM Holden Karau wrote: > +1 - pip install with Py 2.7 works (with the understandable warnings > regarding Python 2.7 no longer being maintained). > > On Mon, May 10, 2021 at 11:18 AM sarutak wrote: > > > > +1 (non-binding) > > > > - Kousuke > > > > > It looks lik

Re: [VOTE] Release Spark 2.4.8 (RC4)

2021-05-11 Thread Wenchen Fan
I checked the log in https://repository.apache.org/#stagingRepositories, and it seems the gpg key has not been uploaded to the public keyserver. Liang-Chi, can you take a look? On Tue, May 11, 2021 at 3:47 PM Wenchen Fan wrote: > +1 > > On Tue, May 11, 2021 at 2:59 AM Holden Kar

Re: [Spark Catalog API] Support for metadata Backup/Restore

2021-05-11 Thread Wenchen Fan
ng client APIs to the end users >> in this approach? The users can only call backup or restore, right? >> >> Thanks, >> Tianchen >> >> On Fri, May 7, 2021 at 12:27 PM Wenchen Fan wrote: >> >>> If a catalog implements backup/restore, it can easily expose so

Re: Apache Spark 3.1.2 Release?

2021-05-18 Thread Wenchen Fan
+1, thanks! On Tue, May 18, 2021 at 1:37 PM Xiao Li wrote: > +1 Thanks, Dongjoon! > > Xiao > > > > On Mon, May 17, 2021 at 8:45 PM Kent Yao wrote: > >> +1. thanks Dongjoon >> >> *Kent Yao * >> @ Data Science Center, Hangzhou Research Institute, NetEase Corp. >> *a spark enthusiast* >> *kyuubi <

Re: [ANNOUNCE] Apache Spark 2.4.8 released

2021-05-18 Thread Wenchen Fan
Thank you, Liang-Chi! On Tue, May 18, 2021 at 1:32 PM Dongjoon Hyun wrote: > Finally! Thank you, Liang-Chi. > > Bests, > Dongjoon. > > > On Mon, May 17, 2021 at 10:14 PM Takeshi Yamamuro > wrote: > >> Thank you for the release work, Liang-Chi~ >> >> On Tue, May 18, 2021 at 2:11 PM Hyukjin Kwon

Re: Resolves too old JIRAs as incomplete

2021-05-20 Thread Wenchen Fan
+1 On Thu, May 20, 2021 at 11:59 AM Dongjoon Hyun wrote: > +1. > > Thank you, Takeshi. > > On Wed, May 19, 2021 at 7:49 PM Hyukjin Kwon wrote: > >> Yeah, I wanted to discuss this. I agree since 2.4.x became EOL >> >> On Thu, May 20, 2021 at 10:54 AM, Sean Owen wrote: >> >>> I agree. Such old JIRAs are

Re: Bridging gap between Spark UI and Code

2021-05-24 Thread Wenchen Fan
I believe you can already see each plan change Spark did to your query plan in the debug-level logs. I think it's hard to do in the web UI as keeping all these historical query plans is expensive. Mapping the query plan to your application code is nearly impossible, as so many optimizations can ha

Re: [Spark Core]: Adding support for size based partition coalescing

2021-05-24 Thread Wenchen Fan
Ideally this should be handled by the underlying data source to produce a reasonably partitioned RDD as the input data. However if we already have a poorly partitioned RDD at hand and want to repartition it properly, I think an extra shuffle is required so that we can know the partition size first.
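Once partition sizes are known (the reply argues that this is what the extra shuffle buys), merging neighbours up to a target size is a simple greedy pass. The following is a minimal Python sketch of that idea under stated assumptions: `coalesce_by_size`, the byte counts, and the target are all hypothetical, and this is not Spark's coalescing code.

```python
from typing import List


def coalesce_by_size(sizes: List[int], target: int) -> List[List[int]]:
    """Group adjacent partition indices so each group stays within `target`.

    A single partition larger than `target` still gets a group of its own.
    """
    groups: List[List[int]] = []
    current: List[int] = []
    current_size = 0
    for index, size in enumerate(sizes):
        # Close the current group when adding this partition would overflow it.
        if current and current_size + size > target:
            groups.append(current)
            current, current_size = [], 0
        current.append(index)
        current_size += size
    if current:
        groups.append(current)
    return groups


# Seven input partitions with made-up sizes, coalesced towards 100 units each.
print(coalesce_by_size([10, 20, 70, 5, 5, 90, 30], target=100))
# → [[0, 1, 2], [3, 4, 5], [6]]
```

Only adjacent partitions are merged here, which keeps the pass linear and preserves ordering; splitting oversized partitions is deliberately left out.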

Re: Secrets store for DSv2

2021-05-24 Thread Wenchen Fan
You can take a look at PartitionReaderFactory. It's created at the driver side, serialized and sent to the executor side. For the write side, there is a similar channel: DataWriterFactory On Wed, May 19, 2021 at 4:37 AM Andrew Melo wrote: > Hello, > > When implementing a DSv2 datasource, where

Re: Purpose of OffsetHolder as a LeafNode?

2021-05-24 Thread Wenchen Fan
It's just an immediate placeholder to update the query plan in each micro-batch. On Sat, May 15, 2021 at 10:29 PM Jacek Laskowski wrote: > Hi, > > Just stumbled upon OffsetHolder [1] and am curious why it's a LeafNode? > What logical plan could it be part of? > > [1] > https://github.com/apache

Re: About Spark executs sqlscript

2021-05-24 Thread Wenchen Fan
It's not possible to load everything into memory. We should use a BigQuery connector (one should exist already?) and register tables B and C as temp views in Spark. On Fri, May 14, 2021 at 8:50 AM bo zhao wrote: > Hi Team, > > I've followed Spark community for several years. This is my first

Re: SPIP: Catalog API for view metadata

2021-05-24 Thread Wenchen Fan
much cleaner when it happens well >> before table resolution. And, View and Table are very different objects; >> returning Object from this API doesn't make much sense. >> >> One extra RPC is not unreasonable, and the choice should be left to >> sources. That's

Re: Bridging gap between Spark UI and Code

2021-05-25 Thread Wenchen Fan
You can see the SQL plan node name in the DAG visualization. Please refer to https://spark.apache.org/docs/latest/web-ui.html for more details. If you still have any confusion, please let us know and we will keep improving the document. On Tue, May 25, 2021 at 4:41 AM mhawes wrote: > @Wenc

Re: [Spark Core]: Adding support for size based partition coalescing

2021-05-25 Thread Wenchen Fan
so what does a > repartition() call do if AQE is not enabled? this is essentially a new api > so would repartitionBySize or something be less confusing to users who > already use repartition(num_partitions). > > Tom > > On Monday, May 24, 2021, 12:30:20 PM CDT, Wenchen Fan > wrote:

Re: [VOTE] SPIP: Catalog API for view metadata

2021-05-26 Thread Wenchen Fan
OK, then I'd vote for TableViewCatalog, because 1. This is how Hive catalog works, and we need to migrate Hive catalog to the v2 API sooner or later. 2. Because of 1, TableViewCatalog is easy to support in the current table/view resolution framework. 3. It's better to avoid name conflicts between t

Re: [SQL] s.s.a.coalescePartitions.parallelismFirst true but recommends false

2021-09-06 Thread Wenchen Fan
This is correct. It's true by default so that AQE doesn't cause a performance regression. If you run a benchmark, larger parallelism usually means better performance. However, it's recommended to set it to false, so that AQE can give better resource utilization, which is good for a busy Spark cluster.

Re: [SQL] When SQLConf vals gets own accessor defs?

2021-09-06 Thread Wenchen Fan
I think SQLConf doesn't need defs anymore. In the beginning, SQLConf lived in sql/core, so we had to add defs if the code in sql/catalyst needed to access configs. Now that SQLConf is in sql/catalyst (this was done a few years ago), defs are only needed if we have some special logic that is not just rea

Re: [VOTE] Release Spark 3.2.0 (RC7)

2021-10-10 Thread Wenchen Fan
+1 On Sat, Oct 9, 2021 at 2:36 PM angers zhu wrote: > +1 (non-binding) > > On Sat, Oct 9, 2021 at 2:06 PM, Cheng Pan wrote: > >> +1 (non-binding) >> >> Integration test passed[1] with my project[2]. >> >> [1] >> https://github.com/housepower/spark-clickhouse-connector/runs/3834335017 >> [2] https://github.com

Re: [ANNOUNCE] Apache Spark 3.2.0

2021-10-19 Thread Wenchen Fan
Yea the file naming is a bit confusing, we can fix it in the next release. 3.2 actually means 3.2 or higher, so not a big deal I think. Congrats and thanks! On Wed, Oct 20, 2021 at 3:44 AM Jungtaek Lim wrote: > Thanks to Gengliang for driving this huge release! > > On Wed, Oct 20, 2021 at 1:50

Re: [DISCUSS] SPIP: Storage Partitioned Join for Data Source V2

2021-10-26 Thread Wenchen Fan
+1 to this SPIP and nice writeup of the design doc! Can we open comment permission in the doc so that we can discuss details there? On Tue, Oct 26, 2021 at 8:29 PM Hyukjin Kwon wrote: > Seems making sense to me. > > Would be great to have some feedback from people such as @We

Re: [DISCUSS] SPIP: Storage Partitioned Join for Data Source V2

2021-10-27 Thread Wenchen Fan
f the table has many partitions? Shall we apply certain join algorithms in the phase 1 split-wise join as well? Or even launch a Spark job to do so? Thanks, Wenchen On Wed, Oct 27, 2021 at 3:08 AM Chao Sun wrote: > Thanks Cheng for the comments. > > > Is migrating Hive table read path

Re: [DISCUSS] SPIP: Storage Partitioned Join for Data Source V2

2021-10-27 Thread Wenchen Fan
to get started because it fills > an existing gap. More complex use cases can be supported over time. > > Ryan > > On Wed, Oct 27, 2021 at 9:08 AM Wenchen Fan wrote: > >> IIUC, the general idea is to let each input split report its partition >> value, and Spark can

Re: [DISCUSS] SPIP: Storage Partitioned Join for Data Source V2

2021-10-28 Thread Wenchen Fan
`BoundFunction` directly. That is easier than defining a way for Spark to > query the function catalog. > > In any case, I'm sure it's easy to understand how this works once you get > a concrete implementation. > > On Wed, Oct 27, 2021 at 9:35 AM Wenchen Fan wrote:

Re: [VOTE] SPIP: Storage Partitioned Join for Data Source V2

2021-10-31 Thread Wenchen Fan
+1 On Sat, Oct 30, 2021 at 8:58 AM Cheng Su wrote: > +1 > > > > Thanks, > > Cheng Su > > > > *From: *Holden Karau > *Date: *Friday, October 29, 2021 at 12:41 PM > *To: *DB Tsai > *Cc: *Dongjoon Hyun , Ryan Blue , > dev , huaxin gao > *Subject: *Re: [VOTE] SPIP: Storage Partitioned Join for Da

Re: Issue Upgrading to 3.2

2021-11-01 Thread Wenchen Fan
Hi Adam, Thanks for reporting this issue! Do you have the full stacktrace or a code snippet to reproduce the issue on the Spark side? It looks like a bug, but it's not obvious to me how this bug can happen. Thanks, Wenchen On Sat, Oct 30, 2021 at 1:08 AM Adam Binford wrote: > Hi devs,

Re: [DISCUSS] SPIP: Row-level operations in Data Source V2

2021-11-01 Thread Wenchen Fan
The general idea looks great. This is indeed a complicated API and we probably need more time to evaluate the API design. It's better to commit this work earlier so that we have more time to verify it before the 3.3 release. Maybe we can commit the group-based API first, then the delta-based one, a

Re: Issue Upgrading to 3.2

2021-11-01 Thread Wenchen Fan
utExpressions > } > > Function registration: > Catalog.expressions.foreach(f => { > val functionIdentifier = > FunctionIdentifier(f.getClass.getSimpleName.dropRight(1)) > val expressionInfo = new ExpressionInfo( > f.getClass.getCanonicalName, > function

Re: [VOTE] SPIP: Row-level operations in Data Source V2

2021-11-16 Thread Wenchen Fan
+1 On Mon, Nov 15, 2021 at 2:54 AM John Zhuge wrote: > +1 (non-binding) > > On Sun, Nov 14, 2021 at 10:33 AM Chao Sun wrote: > >> +1 (non-binding). Thanks Anton for the work! >> >> On Sun, Nov 14, 2021 at 10:01 AM Ryan Blue wrote: >> >>> +1 >>> >>> Thanks to Anton for all this great work! >>>

Re: [FYI] Build and run tests on Java 17 for Apache Spark 3.3

2021-11-16 Thread Wenchen Fan
Great job! On Sat, Nov 13, 2021 at 11:18 AM Hyukjin Kwon wrote: > Awesome! > > On Sat, Nov 13, 2021 at 12:04 PM Xiao Li wrote: > >> Thank you! Great job! >> >> Xiao >> >> >> On Fri, Nov 12, 2021 at 7:02 PM Mridul Muralidharan >> wrote: >> >>> >>> Nice job ! >>> There are some nice API's which

Re: Supports Dynamic Table Options for Spark SQL

2021-11-16 Thread Wenchen Fan
It's useful to have a SQL API to specify table options, similar to the DataFrameReader API. However, I share the same concern from @Hyukjin Kwon and am not very comfortable with using hints to do it. In the PR, someone mentioned TVF. I think it's better than hints, but still has problems. For exa

Re: [Apache Spark Jenkins] build system shutting down Dec 23th, 2021

2021-12-06 Thread Wenchen Fan
Thanks, Shane! Really appreciate it! Wenchen On Tue, Dec 7, 2021 at 12:38 PM Xiao Li wrote: > Hi, Shane, > > Thank you for your work on it! > > Xiao > > > > > On Mon, Dec 6, 2021 at 6:20 PM L. C. Hsieh wrote: > >> Thank you, Shane. >> >>

Re: Time for Spark 3.2.1?

2021-12-06 Thread Wenchen Fan
+1 to make new maintenance releases for all 3.x branches. On Tue, Dec 7, 2021 at 8:57 AM Sean Owen wrote: > Always fine by me if someone wants to roll a release. > > It's been ~6 months since the last 3.0.x and 3.1.x releases, too; a new > release of those wouldn't hurt either, if any of our rel

Re: Difference in behavior for Spark 3.0 vs Spark 3.1 "create database "

2022-01-11 Thread Wenchen Fan
Hopefully, this StackOverflow answer can solve your problem: https://stackoverflow.com/questions/47523037/how-do-i-configure-pyspark-to-write-to-hdfs-by-default Spark doesn't control the behavior of qualifying paths. It's decided by certain configs and/or config files. On Tue, Jan 11, 2022 at 3:0

Re: [VOTE] Release Spark 3.2.1 (RC2)

2022-01-24 Thread Wenchen Fan
+1 On Tue, Jan 25, 2022 at 10:13 AM Ruifeng Zheng wrote: > +1 (non-binding) > > > -- Original Message -- > *From:* "Kent Yao" ; > *Sent:* Tuesday, January 25, 2022, 10:09 AM > *To:* "John Zhuge"; > *Cc:* "dev"; > *Subject:* Re: [VOTE] Release Spark 3.2.1 (RC2) > > +1, non-binding > > John Zhu

Re: [VOTE] SPIP: Catalog API for view metadata

2022-02-07 Thread Wenchen Fan
+1 (binding) On Sun, Feb 6, 2022 at 10:27 AM Jacky Lee wrote: > +1 (non-binding). Thanks John! > It's great to see ViewCatalog moving on, it's a nice feature. > > On Sat, Feb 5, 2022 at 11:57, Terry Kim wrote: > >> +1 (non-binding). Thanks John! >> >> Terry >> >> On Fri, Feb 4, 2022 at 4:13 PM Yufei Gu wrote

Re: [VOTE] Spark 3.1.3 RC3

2022-02-07 Thread Wenchen Fan
Shall we use the release scripts of branch 3.1 to release 3.1? On Fri, Feb 4, 2022 at 4:57 AM Holden Karau wrote: > Good catch Dongjoon :) > > This release candidate fails, but feel free to keep testing for any other > potential blockers. > > I’ll roll RC4 next week with the older release script

Re: [VOTE] Spark 3.1.3 RC4

2022-02-15 Thread Wenchen Fan
+1 On Tue, Feb 15, 2022 at 3:59 PM Yuming Wang wrote: > +1 (non-binding). > > On Tue, Feb 15, 2022 at 10:22 AM Ruifeng Zheng > wrote: > >> +1 (non-binding) >> >> checked the release script issue Dongjoon mentioned: >> >> curl -s >> https://dist.apache.org/repos/dist/dev/spark/v3.1.3-rc4-bin/spa

Re: Data correctness issue with Repartition + FetchFailure

2022-03-14 Thread Wenchen Fan
We fixed the repartition correctness bug before, by sorting the data before doing round-robin partitioning. But the issue is that we need to propagate the isDeterministic property through SQL operators. On Tue, Mar 15, 2022 at 1:50 AM Jason Xu wrote: > Hi Reynold, do you suggest removing RoundRo
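The fix described here rests on a simple property: round-robin assignment depends on arrival order, so sorting first makes the assignment a function of the data alone. The Python sketch below illustrates that property on plain lists; `round_robin` and `sorted_round_robin` are hypothetical helpers, not Spark operators, and the two attempts stand in for a task and its retry after a fetch failure.

```python
from typing import Iterable, List


def round_robin(rows: Iterable[int], num_partitions: int) -> List[List[int]]:
    """Assign rows to partitions purely by their arrival position."""
    parts: List[List[int]] = [[] for _ in range(num_partitions)]
    for position, row in enumerate(rows):
        parts[position % num_partitions].append(row)
    return parts


def sorted_round_robin(rows: Iterable[int], num_partitions: int) -> List[List[int]]:
    """Sort first, so the assignment no longer depends on arrival order."""
    return round_robin(sorted(rows), num_partitions)


first_attempt = [3, 1, 2, 4]
retry_attempt = [4, 2, 3, 1]  # same rows, different arrival order

# Plain round-robin places rows differently on the retry.
print(round_robin(first_attempt, 2) == round_robin(retry_attempt, 2))  # → False
# With the sort, both attempts produce identical partitions.
print(sorted_round_robin(first_attempt, 2) == sorted_round_robin(retry_attempt, 2))  # → True
```

The sort only helps when the rows themselves are reproducible; if an upstream expression yields different values on each run, the sorted order differs too, which is why the reply goes on to mention propagating determinism through operators.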

Re: Data correctness issue with Repartition + FetchFailure

2022-03-16 Thread Wenchen Fan
It's great if you can help with it! Basically, we need to propagate the column-level deterministic information and sort the inputs if the partition key lineage has a nondeterministic part. On Wed, Mar 16, 2022 at 5:28 AM Jason Xu wrote: > Hi Wenchen, thanks for the insight. Agree, the

Re: Apache Spark 3.3 Release

2022-03-16 Thread Wenchen Fan
+1 to defining an allowlist of features that we want to backport to branch 3.3. I also have a few in mind: complex type support in the vectorized Parquet reader (https://github.com/apache/spark/pull/34659), refining the DS v2 filter API for JDBC v2 (https://github.com/apache/spark/pull/35768), a few new SQL

Re: Apache Spark 3.3 Release

2022-03-20 Thread Wenchen Fan
Shall we revisit this list after a week? Ideally, they should be either merged or rejected for 3.3, so that we can cut rc1. We can still discuss them case by case at that time if there are exceptions. On Sat, Mar 19, 2022 at 5:27 AM Dongjoon Hyun wrote: > Thank you for your summarization. > > I

Re: Apache Spark 3.3 Release

2022-03-20 Thread Wenchen Fan
Just checked the release calendar; the planned RC cut date is April. Let's revisit after 2 weeks then? On Mon, Mar 21, 2022 at 2:47 PM Wenchen Fan wrote: > Shall we revisit this list after a week? Ideally, they should be either > merged or rejected for 3.3, so that

Re: bazel and external/

2022-03-21 Thread Wenchen Fan
How about renaming it to `connectors` if docker is the only exception and will be moved out? On Sat, Mar 19, 2022 at 6:18 PM Alkis Evlogimenos wrote: > It looks like renaming the directory and moving components can be separate > steps. If there is consensus that connectors will move out, should

Re: PR builder not working now

2022-04-19 Thread Wenchen Fan
Thank you, Hyukjin! On Wed, Apr 20, 2022 at 7:48 AM Dongjoon Hyun wrote: > It's great! Thank you. :) > > On Tue, Apr 19, 2022 at 4:42 PM Hyukjin Kwon wrote: > >> It's fixed now. >> >> On Tue, 19 Apr 2022 at 08:33, Hyukjin Kwon wrote: >> >>> It's still persistent. I will send an email to GitHub

Re: [VOTE] Release Spark 3.3.0 (RC1)

2022-05-10 Thread Wenchen Fan
I'd like to see an RC2 as well. There is a kind of correctness bug that was fixed after RC1 was cut: https://github.com/apache/spark/pull/36468 Users may hit this bug much more frequently if they enable ANSI mode. It's not a regression so I'd vote -0. On Wed, May 11, 2022 at 5:24 AM Thomas graves wrote: >

Re: SIGMOD System Award for Apache Spark

2022-05-13 Thread Wenchen Fan
Great! Congratulations to everyone! On Fri, May 13, 2022 at 10:38 AM Gengliang Wang wrote: > Congratulations to the whole spark community! > > On Fri, May 13, 2022 at 10:14 AM Jungtaek Lim < > kabhwan.opensou...@gmail.com> wrote: > >> Congrats Spark community! >> >> On Fri, May 13, 2022 at 10:40

Re: Unable to create view due to up cast error when migrating from Hive to Spark

2022-05-18 Thread Wenchen Fan
A view is essentially a SQL query. It's fragile to share views between Spark and Hive because different systems have different SQL dialects. They may interpret the view SQL query differently and introduce unexpected behaviors. In this case, Spark returns decimal type for gender * 0.3 - 0.1 but Hiv

Re: [VOTE] Release Spark 3.3.0 (RC2)

2022-05-19 Thread Wenchen Fan
I think it should have been fixed by https://github.com/apache/spark/commit/0fdb6757946e2a0991256a3b73c0c09d6e764eed . Maybe the fix is not completed... On Thu, May 19, 2022 at 2:16 PM Kent Yao wrote: > Thanks, Maxim. > > Leave my -1 for this release candidate. > > Unfortunately, I don't know w

Re: [VOTE] Release Spark 3.3.0 (RC6)

2022-06-13 Thread Wenchen Fan
+1, tests are all green and there are no more blocker issues AFAIK. On Fri, Jun 10, 2022 at 12:27 PM Maxim Gekk wrote: > Please vote on releasing the following candidate as > Apache Spark version 3.3.0. > > The vote is open until 11:59pm Pacific time June 14th and passes if a > majority +1 PMC v

Re: [VOTE][SPIP] Spark Connect

2022-06-14 Thread Wenchen Fan
+1 On Tue, Jun 14, 2022 at 9:38 AM Ruifeng Zheng wrote: > +1 > > > -- Original Message -- > *From:* "huaxin gao" ; > *Sent:* Tuesday, June 14, 2022, 8:47 AM > *To:* "L. C. Hsieh"; > *Cc:* "Spark dev list"; > *Subject:* Re: [VOTE][SPIP] Spark Connect > > +1 > > On Mon, Jun 13, 2022 at 5:42

Re: Apache Spark 3.2.2 Release?

2022-07-06 Thread Wenchen Fan
+1 On Thu, Jul 7, 2022 at 10:41 AM Xinrong Meng wrote: > +1 > > Thanks! > > > Xinrong Meng > > Software Engineer > > Databricks > > > On Wed, Jul 6, 2022 at 7:25 PM Xiao Li wrote: > >> +1 >> >> Xiao >> >> On Wed, Jul 6, 2022 at 19:16, Cheng Su wrote: >> >>> +1 (non-binding) >>> >>> Thanks, >>> Cheng Su >>> >

Re: [DISCUSS][Catalog API] Deprecate 4 Catalog API that takes two parameters which are (dbName, tableName/functionName)

2022-07-08 Thread Wenchen Fan
h is exactly the case here. We should remove these 4 APIs when most users have moved away. Thanks, Wenchen On Fri, Jul 8, 2022 at 2:49 PM Dongjoon Hyun wrote: > Thank you for starting the official discussion, Rui. > > 'Unneeded API' doesn't sound like a good frame for this

Re: [VOTE] Release Spark 3.2.2 (RC1)

2022-07-14 Thread Wenchen Fan
+1 On Wed, Jul 13, 2022 at 7:29 PM Yikun Jiang wrote: > +1 (non-binding) > > Checked out tag and built from source on Linux aarch64 and ran some basic > test. > > > Regards, > Yikun > > > On Wed, Jul 13, 2022 at 5:54 AM Mridul Muralidharan > wrote: > >> >> +1 >> >> Signatures, digests, etc chec

Re: Non-deterministic function duplicated in final Spark plan

2022-08-01 Thread Wenchen Fan
This is a hard one. Spark duplicates the join child plan if it's a self-join because Spark does not support diamond-shaped query plans. Seems the only option here is to write the join child plan to a parquet table (or use a shuffle) and read it back. On Mon, Aug 1, 2022 at 4:46 PM Enrico Minack
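The effect described in this reply can be reproduced without Spark: if a child that contains a non-deterministic expression is evaluated once per join side, the two sides disagree, whereas materializing it once gives both sides the same rows. The Python sketch below is a toy model under that assumption; `child_plan` and `self_join` are hypothetical, and the in-memory list plays the role of the table that is written and read back.

```python
import random
from typing import List, Tuple

Row = Tuple[int, float]


def child_plan() -> List[Row]:
    """Hypothetical join child: tags each id with a fresh random value."""
    return [(i, random.random()) for i in range(3)]


def self_join(left: List[Row], right: List[Row]) -> List[Row]:
    """Equi-join on both columns, so a row matches only if the values agree."""
    right_rows = set(right)
    return [row for row in left if row in right_rows]


# Child evaluated once per join side: the random values differ between sides.
duplicated = self_join(child_plan(), child_plan())

# Child materialized once and reused: both sides see identical rows.
materialized = child_plan()
stable = self_join(materialized, materialized)

print(len(duplicated), len(stable))
```

With the duplicated evaluation the join usually finds no matches at all, while the materialized version always returns every row.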

Re: [DISCUSS] SPIP: Support Docker Official Image for Spark

2022-09-19 Thread Wenchen Fan
+1 On Mon, Sep 19, 2022 at 2:59 PM Yang,Jie(INF) wrote: > +1 (non-binding) > > > > Yang Jie > -- > *From:* Yikun Jiang > *Sent:* September 19, 2022, 14:23:14 > *To:* Denny Lee > *Cc:* bo zhaobo; Yuming Wang; Kent Yao; Gengliang Wang; Hyukjin Kwon; > dev; zrf > *Subject:* Re: [DISC

Re: [VOTE] Release Spark 3.3.1 (RC4)

2022-10-18 Thread Wenchen Fan
+1 On Wed, Oct 19, 2022 at 4:59 AM Chao Sun wrote: > +1. Thanks Yuming! > > Chao > > On Tue, Oct 18, 2022 at 1:18 PM Thomas graves wrote: > > > > +1. Ran internal test suite. > > > > Tom > > > > On Sun, Oct 16, 2022 at 9:14 PM Yuming Wang wrote: > > > > > > Please vote on releasing the followi

Re: [DISCUSS] SPIP: Better Spark UI scalability and Driver stability for large applications

2022-11-15 Thread Wenchen Fan
This looks great! UI stability/scalability has been a pain point for a long time. On Sat, Nov 12, 2022 at 5:24 AM Gengliang Wang wrote: > Hi Everyone, > > I want to discuss the "Better Spark UI scalability and Driver stability > for large applications" proposal. Please find the links below: > >

Re: [VOTE] Release Spark 3.2.3 (RC1)

2022-11-16 Thread Wenchen Fan
+1 On Thu, Nov 17, 2022 at 10:20 AM Yang,Jie(INF) wrote: > +1, non-binding > > > > The test combination of Java 11 + Scala 2.12 and Java 11 + Scala 2.13 has > passed. > > > > Yang Jie > > > > *From:* Chris Nauroth > *Date:* Thursday, November 17, 2022 04:27 > *To:* Yuming Wang > *Cc:* "Yang,Jie(INF)"

Re: [VOTE][SPIP] Better Spark UI scalability and Driver stability for large applications

2022-11-16 Thread Wenchen Fan
+1, I'm looking forward to it! On Thu, Nov 17, 2022 at 9:44 AM Ye Zhou wrote: > +1 (non-binding) > Thanks for proposing this improvement to SHS, it resolves the main > performance issue within SHS. > > On Wed, Nov 16, 2022 at 1:15 PM Jungtaek Lim > wrote: > >> +1 >> >> Nice to see the chance fo

Re: [ANNOUNCE] Apache Spark 3.2.3 released

2022-11-30 Thread Wenchen Fan
Thanks, Chao! On Wed, Nov 30, 2022 at 1:33 AM Chao Sun wrote: > We are happy to announce the availability of Apache Spark 3.2.3! > > Spark 3.2.3 is a maintenance release containing stability fixes. This > release is based on the branch-3.2 maintenance branch of Spark. We strongly > recommend all

Re: [DISCUSSION] SPIP: Asynchronous Offset Management in Structured Streaming

2022-11-30 Thread Wenchen Fan
+1 to improve the widely used micro-batch mode first. On Thu, Dec 1, 2022 at 8:49 AM Hyukjin Kwon wrote: > +1 > > On Thu, 1 Dec 2022 at 08:10, Shixiong Zhu wrote: > >> +1 >> >> This is exciting. I agree with Jerry that this SPIP and continuous >> processing are orthogonal. This SPIP itself woul

Re: [VOTE][SPIP] Asynchronous Offset Management in Structured Streaming

2022-12-01 Thread Wenchen Fan
+1 On Thu, Dec 1, 2022 at 12:31 PM Shixiong Zhu wrote: > +1 > > > On Wed, Nov 30, 2022 at 8:04 PM Hyukjin Kwon wrote: > >> +1 >> >> On Thu, 1 Dec 2022 at 12:39, Mridul Muralidharan >> wrote: >> >>> >>> +1 >>> >>> Regards, >>> Mridul >>> >>> On Wed, Nov 30, 2022 at 8:55 PM Xingbo Jiang >>> wro

Re: Time for release v3.3.2

2023-01-31 Thread Wenchen Fan
+1, thanks! On Tue, Jan 31, 2023 at 3:17 PM Maxim Gekk wrote: > +1 > > On Tue, Jan 31, 2023 at 10:12 AM John Zhuge wrote: > >> +1 Thanks Liang-Chi for driving the release! >> >> On Mon, Jan 30, 2023 at 10:26 PM Yuming Wang wrote: >> >>> +1 >>> >>> On Tue, Jan 31, 2023 at 12:18 PM yangjie01 wr

Re: [VOTE] Release Apache Spark 3.4.0 (RC5)

2023-04-03 Thread Wenchen Fan
/pull/40623 We found some usability issues with a new API and need to change the API to fix them. If people have concerns we can also remove the new API entirely. Thus I'm -1 to this RC. I'll merge these 2 PRs today if there are no objections. Thanks, Wenchen On Tue, Apr 4, 2023 at 3:47 AM L. C. Hs

Re: [VOTE] Release Apache Spark 3.2.4 (RC1)

2023-04-10 Thread Wenchen Fan
+1 On Tue, Apr 11, 2023 at 10:09 AM Hyukjin Kwon wrote: > +1 > > On Tue, 11 Apr 2023 at 11:04, Ruifeng Zheng wrote: > >> +1 (non-binding) >> >> Thank you for driving this release! >> >> -- >> Ruifeng Zheng >> ruife...@foxmail.com >> >>

Re: [VOTE] Release Apache Spark 3.4.0 (RC7)

2023-04-10 Thread Wenchen Fan
+1 On Tue, Apr 11, 2023 at 9:57 AM Yuming Wang wrote: > +1. > > On Tue, Apr 11, 2023 at 9:14 AM Yikun Jiang wrote: > >> +1 (non-binding) >> >> Also ran the docker image related test (signatures/standalone/k8s) with >> rc7: https://github.com/apache/spark-docker/pull/32 >> >> Regards, >> Yikun >

Re: Apache Spark 3.4.1 Release?

2023-06-09 Thread Wenchen Fan
+1 On Fri, Jun 9, 2023 at 8:52 PM Xinrong Meng wrote: > +1. Thank you Dongjoon! > > Thanks, > > Xinrong Meng > > On Fri, Jun 9, 2023 at 5:22 AM Mridul Muralidharan wrote: > >> >> +1, thanks Dongjoon ! >> >> Regards, >> Mridul >> >> On Thu, Jun 8, 2023 at 7:16 PM Jia Fan >> wrote: >> >>> +1 >>> >>> ___

Re: [Feature Request] create *permanent* Spark View from DataFrame via PySpark

2023-06-09 Thread Wenchen Fan
A DataFrame view stores the logical plan, while a SQL view stores SQL text. I don't think we can support this feature until we have a reliable way to materialize logical plans. On Sun, Jun 4, 2023 at 10:31 PM Mich Talebzadeh wrote: > Try sending it to dev@spark.apache.org (and join that group) > > Y

Re: [DISCUSS] SPIP: Python Data Source API

2023-06-20 Thread Wenchen Fan
In an ideal world, every data source you want to connect to already has a Spark data source implementation (either v1 or v2), and in that case this Python API is useless. But I feel it's common that people want to do quick data exploration, and the target data system is not popular enough to have an existing S
