Hyperspace v0.1 is now open-sourced!

2020-07-02 Thread Terry Kim
-indexing-subsystem-for-apache-spark - Docs: https://aka.ms/hyperspace This project would not have been possible without the outstanding work from the Apache Spark™ community. Thank you everyone and we look forward to collaborating with the community towards evolving Hyperspace. Thanks, Terry Kim on

Announcing .NET for Apache Spark™ 0.12

2020-07-02 Thread Terry Kim
4.6 (3.0 support is on the way!) - SparkSession.CreateDataFrame, Broadcast variable - Preliminary support for MLlib (TF-IDF, Word2Vec, Bucketizer, etc.) - Support for .NET Core 3.1 We would like to thank all those who contributed to this release. Thanks, Terry Kim on behalf of the .NET for Apache Spark™ team

Re: Future timeout

2020-07-20 Thread Terry Kim
"spark.sql.broadcastTimeout" is the config you can use: https://github.com/apache/spark/blob/fe07521c9efd9ce0913eee0d42b0ffd98b1225ec/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L863 Thanks, Terry On Mon, Jul 20, 2020 at 11:20 AM Amit Sharma wrote: > Please help on t
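The config key is confirmed by the linked SQLConf source; everything else below (the session setup, the 600-second value) is an illustrative sketch of raising the timeout before a broadcast-heavy job:

```scala
import org.apache.spark.sql.SparkSession

// In spark-shell a SparkSession named `spark` already exists;
// the builder here is only to keep the example self-contained.
val spark = SparkSession.builder()
  .appName("broadcast-timeout-demo")
  .master("local[*]")
  .getOrCreate()

// The default is 300 seconds; raise it if broadcast exchanges time out
// on a loaded cluster (600 is an illustrative value, not a recommendation).
spark.conf.set("spark.sql.broadcastTimeout", "600")
```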

Re: Renaming a DataFrame column makes Spark lose partitioning information

2020-08-04 Thread Terry Kim
This is fixed in Spark 3.0 by https://github.com/apache/spark/pull/26943:

scala> :paste
// Entering paste mode (ctrl-D to finish)
Seq((1, 2))
  .toDF("a", "b")
  .repartition($"b")
  .withColumnRenamed("b", "c")
  .repartition($"c")
  .explain()
// Exiting paste mode, now int

Announcing .NET for Apache Spark™ 1.0

2020-11-06 Thread Terry Kim
- Support for all the complex types in Spark SQL - Support for Delta Lake <https://github.com/delta-io/delta> v0.7 and Hyperspace <https://github.com/microsoft/hyperspace> v0.2 We would like to thank the community for the great feedback and all those who contributed to this release

Announcing Hyperspace v0.3.0 - an indexing subsystem for Apache Spark™

2020-11-17 Thread Terry Kim
and all those who contributed to this release. Thanks, Terry Kim on behalf of the Hyperspace team

Re: [Spark SQL]HiveQL and Spark SQL producing different results

2021-01-12 Thread Terry Kim
Ying, Can you share a query that produces different results? Thanks, Terry On Sun, Jan 10, 2021 at 1:48 PM Ying Zhou wrote: > Hi, > > I run some SQL using both Hive and Spark. Usually we get the same results. > However when a window function is in the script Hive and Spark can produce > differe
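One frequent source of such Hive-vs-Spark divergence (an illustrative guess, not necessarily Ying's case) is the implicit window frame: with an ORDER BY but no explicit frame, the default is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, so ties in the ordering column can make the two engines appear to disagree. Pinning the frame explicitly makes the comparison well-defined; the table and column names below are hypothetical:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("window-frame-demo")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Hypothetical data standing in for a real Hive table.
Seq(("g1", 1, 10), ("g1", 2, 20), ("g2", 1, 5))
  .toDF("grp", "id", "v")
  .createOrReplaceTempView("t")

// Spelling out ROWS BETWEEN ... removes the implicit-frame ambiguity,
// so Hive and Spark results can be compared apples-to-apples.
val result = spark.sql("""
  SELECT grp, id, v,
         SUM(v) OVER (PARTITION BY grp ORDER BY id
                      ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_sum
  FROM t
  ORDER BY grp, id
""")
result.show()
```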

Announcing Hyperspace v0.4.0 - an indexing subsystem for Apache Spark™

2021-02-08 Thread Terry Kim
PR to support Iceberg tables. We would like to thank the community for the great feedback and all those who contributed to this release. Thanks, Terry Kim on behalf of the Hyperspace team

[ANNOUNCE] .NET for Apache Spark™ 2.1 released

2022-02-02 Thread Terry Kim
ents of this release. Here are some of the highlights: - Support for Apache Spark 3.2 - Exposing new SQL function APIs introduced in Spark 3.2 We would like to thank the community for the great feedback and all those who contributed to this release. Thanks, Terry Kim on behalf of the .NE

Announcing .NET for Apache Spark 0.4.0

2019-07-31 Thread Terry Kim
We are thrilled to announce that .NET for Apache Spark 0.4.0 has just been released! Some of the highlights of this release include: - Apache Arrow backed UDFs (Vector UDF, Grouped Map UDF) - Robust UDF-related assembly loading - Lo

Re: Release Apache Spark 2.4.4

2019-08-13 Thread Terry Kim
Can the following be included? [SPARK-27234][SS][PYTHON] Use InheritableThreadLocal for current epoch in EpochTracker (to support Python UDFs) Thanks, Terry On Tue, Aug 13, 2019 at 10:24 PM Wenchen Fan wrote: > +1 > > On Wed, Aug 14, 2019 at 12:52 P

Announcing .NET for Apache Spark 0.5.0

2019-09-30 Thread Terry Kim
We are thrilled to announce that .NET for Apache Spark 0.5.0 has just been released! Some of the highlights of this release include: - Delta Lake's *DeltaTable* APIs - UDF improvements - Support f

Re: [Spark SQL]: Is a namespace name always needed in a query for tables from a user-defined catalog plugin

2019-12-01 Thread Terry Kim
Hi Xufei, I also noticed the same while looking into relation resolution behavior (see Appendix A in this doc). I created SPARK-30094 and will follo

Re: Using existing distribution for join when subset of keys

2020-05-31 Thread Terry Kim
You can use bucketBy to avoid shuffling in your scenario. This test suite has some examples: https://github.com/apache/spark/blob/45cf5e99503b00a6bd83ea94d6d92761db1a00ab/sql/core/src/test/scala/org/apache/spark/sql/sources/BucketedReadSuite.scala#L343 Thanks, Terry On Sun, May 31, 2020 at 7:43 A
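A minimal sketch of the bucketBy approach, along the lines of the linked BucketedReadSuite; the table names, bucket count, and tiny DataFrames are illustrative only:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("bucket-join-demo")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Hypothetical data; in practice these would be your large tables.
val df1 = Seq((1, "a"), (2, "b")).toDF("key", "v1")
val df2 = Seq((1, "x"), (2, "y")).toDF("key", "v2")

// Write both sides bucketed (and sorted) by the join key.
df1.write.bucketBy(8, "key").sortBy("key").mode("overwrite").saveAsTable("t1")
df2.write.bucketBy(8, "key").sortBy("key").mode("overwrite").saveAsTable("t2")

// Disable broadcast joins so the plan is easy to inspect; joining on the
// bucketing key can then proceed without a shuffle: explain() shows a
// SortMergeJoin with no Exchange on either side.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
val joined = spark.table("t1").join(spark.table("t2"), "key")
joined.explain()
```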

Re: Using existing distribution for join when subset of keys

2020-05-31 Thread Terry Kim
true, Format: Parquet, Location: > InMemoryFileIndex[file:/home/pwoody/tm/spark-2.4.5-bin-hadoop2.7/spark-warehouse/bx], > PartitionFilters: [], PushedFilters: [IsNotNull(x), IsNotNull(y)], > ReadSchema: struct, SelectedBucketsCount: 200 out of 200 > > Best, > Pat >