Spark 2.4.2

2019-04-16 Thread Michael Armbrust
Hello All, I know we just released Spark 2.4.1, but in light of fixing SPARK-27453 I was wondering if it might make sense to follow up quickly with 2.4.2. Without this fix it's very hard to build a datasource that correctly handles partitioning w

Re: Spark 2.4.2

2019-04-16 Thread Michael Armbrust
this behavior. Do you have a different proposal about how this should be handled? On Tue, Apr 16, 2019 at 4:23 PM Ryan Blue wrote: > Is this a bug fix? It looks like a new feature to me. > > On Tue, Apr 16, 2019 at 4:13 PM Michael Armbrust > wrote: > >> Hello All, >>

Re: Spark 2.4.2

2019-04-16 Thread Michael Armbrust
is is a small change and looks safe enough to me. I was just a > little surprised since I was expecting a correctness issue if this is > prompting a release. I'm definitely on the side of case-by-case judgments > on what to allow in patch releases and this looks fine. > > On

Re: [VOTE] Release Apache Spark 2.4.2

2019-04-19 Thread Michael Armbrust
+1 (binding), we've tested this and it LGTM. On Thu, Apr 18, 2019 at 7:51 PM Wenchen Fan wrote: > Please vote on releasing the following candidate as Apache Spark version > 2.4.2. > > The vote is open until April 23 PST and passes if a majority +1 PMC votes > are cast, with > a minimum of 3 +1 vot

Re: [DISCUSSION] Esoteric Spark function `TRIM/LTRIM/RTRIM`

2020-02-21 Thread Michael Armbrust
This plan for evolving the TRIM function to be more standards compliant sounds much better to me than the original change to just switch the order. It pushes users in the right direction and cleans up our tech debt without silently breaking existing workloads. It means that programs won't return di

[Proposal] Modification to Spark's Semantic Versioning Policy

2020-02-24 Thread Michael Armbrust
Hello Everyone, As more users have started upgrading to Spark 3.0 preview (including myself), there have been many discussions around APIs that have been broken compared with Spark 2.x. In many of these discussions, one of the rationales for breaking an API seems to be "Spark follows semantic vers

Re: Clarification on the commit protocol

2020-02-27 Thread Michael Armbrust
No, it is not. Although the commit protocol has mostly been superseded by Delta Lake , which is available as a separate open source project that works natively with Apache Spark. In contrast to the commit protocol, Delta can guarantee full ACID (rather than just partition level at

Re: [Proposal] Modification to Spark's Semantic Versioning Policy

2020-02-27 Thread Michael Armbrust
Thanks for the discussion! A few responses: The decision needs to happen at api/config change time, otherwise the > deprecated warning has no purpose if we are never going to remove them. > Even if we never remove an API, I think deprecation warnings (when done right) can still serve a purpose. F

[VOTE] Amend Spark's Semantic Versioning Policy

2020-03-06 Thread Michael Armbrust
I propose to add the following text to Spark's Semantic Versioning policy and adopt it as the rubric that should be used when deciding to break APIs (even at major versions such as 3.0). I'll leave the vote open until Tuesday, March 10th at 2pm. A

Re: [VOTE] Amend Spark's Semantic Versioning Policy

2020-03-06 Thread Michael Armbrust
I'll start off the vote with a strong +1 (binding). On Fri, Mar 6, 2020 at 1:01 PM Michael Armbrust wrote: > I propose to add the following text to Spark's Semantic Versioning policy > <https://spark.apache.org/versioning-policy.html> and adopt it as the > rubric

Re: [VOTE] Amend Spark's Semantic Versioning Policy

2020-03-11 Thread Michael Armbrust
Thank you for the discussion everyone! This vote passes. I'll work to get this posted on the website. +1 Michael Armbrust Sean Owen Jules Damji 大啊 Ismaël Mejía Wenchen Fan Matei Zaharia Gengliang Wang Takeshi Yamamuro Denny Lee Xiao Li Xingbo Jiang Takuya Ueshin Michael Heuer John Zhuge Reynol

Re: FYI: The evolution on `CHAR` type behavior

2020-03-17 Thread Michael Armbrust
> > What I'd oppose is to just ban char for the native data sources, and do > not have a plan to address this problem systematically. > +1 > Just forget about padding, like what Snowflake and MySQL have done. > Document that char(x) is just an alias for string. And then move on. Almost > no work

Re: [vote] Apache Spark 3.0 RC3

2020-06-08 Thread Michael Armbrust
+1 (binding) On Mon, Jun 8, 2020 at 1:22 PM DB Tsai wrote: > +1 (binding) > > Sincerely, > > DB Tsai > -- > Web: https://www.dbtsai.com > PGP Key ID: 42E5B25A8F7A82C1 > > On Mon, Jun 8, 2020 at 1:03 PM Dongjoon Hyun > wrote: > > > > +1 > >

Re: Structured Stream equivalent of reduceByKey

2017-10-26 Thread Michael Armbrust
- dev I think you should be able to write an Aggregator . You probably want to run in update mode if you are looking for it to output any group that has changed in the batch. On Wed, Oct 25, 201
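A minimal sketch of the `Aggregator` approach suggested above, assuming a streaming `Dataset[(String, Int)]` named `ds` (the key/value shape and names are illustrative, not from the original thread):

```scala
import org.apache.spark.sql.expressions.Aggregator
import org.apache.spark.sql.{Encoder, Encoders}

// A typed sum Aggregator: reduceByKey-style semantics over grouped data.
object SumAgg extends Aggregator[Int, Long, Long] {
  def zero: Long = 0L                              // initial buffer
  def reduce(b: Long, a: Int): Long = b + a        // fold one element in
  def merge(b1: Long, b2: Long): Long = b1 + b2    // combine partial buffers
  def finish(reduction: Long): Long = reduction    // final output
  def bufferEncoder: Encoder[Long] = Encoders.scalaLong
  def outputEncoder: Encoder[Long] = Encoders.scalaLong
}

// Usage sketch: ds.groupByKey(_._1).agg(SumAgg.toColumn)
// Run the streaming query in update mode to emit only changed groups.
```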

Re: [Vote] SPIP: Continuous Processing Mode for Structured Streaming

2017-11-06 Thread Michael Armbrust
+1 On Sat, Nov 4, 2017 at 11:02 AM, Xiao Li wrote: > +1 > > 2017-11-04 11:00 GMT-07:00 Burak Yavuz : > >> +1 >> >> On Fri, Nov 3, 2017 at 10:02 PM, vaquar khan >> wrote: >> >>> +1 >>> >>> On Fri, Nov 3, 2017 at 8:14 PM, Weichen Xu >>> wrote: >>> +1. On Sat, Nov 4, 2017 at 8:04 A

Timeline for Spark 2.3

2017-11-09 Thread Michael Armbrust
According to the timeline posted on the website, we are nearing branch cut for Spark 2.3. I'd like to propose pushing this out towards mid to late December for a couple of reasons and would like to hear what people think. 1. I've done release management during the Thanksgiving / Christmas time be

Re: queryable state & streaming

2017-12-08 Thread Michael Armbrust
https://issues.apache.org/jira/browse/SPARK-16738 I don't believe anyone is working on it yet. I think the most useful thing is to start enumerating requirements and use cases and then we can talk about how to build it. On Fri, Dec 8, 2017 at 10:47 AM, Stavros Kontopoulos < st.kontopou...@gmail.

Re: Timeline for Spark 2.3

2017-12-19 Thread Michael Armbrust
Do people really need to be around for the branch cut (modulo the person cutting the branch)? 1st or 2nd doesn't really matter to me, but I am +1 kicking this off as soon as we enter the new year :) Michael On Tue, Dec 19, 2017 at 4:39 PM, Holden Karau wrote: > Sounds reasonable, although I'd

Re: Spark error while trying to spark.read.json()

2017-12-19 Thread Michael Armbrust
- dev java.lang.AbstractMethodError almost always means that you have different libraries on the classpath than at compilation time. In this case I would check to make sure you have the correct version of Scala (and only have one version of scala) on the classpath. On Tue, Dec 19, 2017 at 5:42 P

Re: Max number of streams supported ?

2018-01-31 Thread Michael Armbrust
-dev +user > Similarly for structured streaming, Would there be any limit on number of > of streaming sources I can have ? > There is no fundamental limit, but each stream will have a thread on the driver that is doing coordination of execution. We comfortably run 20+ streams on a single cluste

Re: SQL logical plans and DataSourceV2 (was: data source v2 online meetup)

2018-02-02 Thread Michael Armbrust
> > So here are my recommendations for moving forward, with DataSourceV2 as a > starting point: > >1. Use well-defined logical plan nodes for all high-level operations: >insert, create, CTAS, overwrite table, etc. >2. Use rules that match on these high-level plan nodes, so that it >

Re: DataSourceV2: support for named tables

2018-02-02 Thread Michael Armbrust
I am definitely in favor of first-class / consistent support for tables and data sources. One thing that is not clear to me from this proposal is exactly what the interfaces are between: - Spark - A (The?) metastore - A data source If we pass in the table identifier to the data source then res

Re: [VOTE] Spark 2.3.0 (RC4)

2018-02-21 Thread Michael Armbrust
I'm -1 on any changes that aren't fixing major regressions from 2.2 at this point. Also in any cases where it's possible we should be flipping new features off if they are still regressing, rather than continuing to attempt to fix them. Since it's experimental, I would support backporting the DataSo

Re: [VOTE] Spark 2.3.0 (RC5)

2018-02-26 Thread Michael Armbrust
+1 all our pipelines have been running the RC for several days now. On Mon, Feb 26, 2018 at 10:33 AM, Dongjoon Hyun wrote: > +1 (non-binding). > > Bests, > Dongjoon. > > > > On Mon, Feb 26, 2018 at 9:14 AM, Ryan Blue > wrote: > >> +1 (non-binding) >> >> On Sat, Feb 24, 2018 at 4:17 PM, Xiao Li

Re: Sorting on a streaming dataframe

2018-04-26 Thread Michael Armbrust
The basic tenet of structured streaming is that a query should return the same answer in streaming or batch mode. We support sorting in complete mode because we have all the data and can sort it correctly and return the full answer. In update or append mode, sorting would only return a correct ans

Re: Sorting on a streaming dataframe

2018-04-30 Thread Michael Armbrust
performance as compared to implementing this > functionality inside the applications. > > Hemant > > On Thu, Apr 26, 2018 at 11:59 PM, Michael Armbrust > wrote: > >> The basic tenet of structured streaming is that a query should return the >> same answer in streami

Re: Plan on Structured Streaming in next major/minor release?

2018-10-30 Thread Michael Armbrust
Thanks for bringing up some possible future directions for streaming. Here are some thoughts: - I personally view all of the activity on Spark SQL also as activity on Structured Streaming. The great thing about building streaming on catalyst / tungsten is that continued improvement to these compon

Re: Plan on Structured Streaming in next major/minor release?

2018-10-30 Thread Michael Armbrust
> > Agree. Just curious, could you explain what do you mean by "negation"? > Does it mean applying retraction on aggregated? > Yeah exactly. Our current streaming aggregation assumes that the input is in append-mode and multiple aggregations break this.

Re: Structured Streaming Sink in 2.0 collect/foreach restrictions added in SPARK-16020

2016-06-28 Thread Michael Armbrust
This is not too broadly worded, and in general I would caution that any interface in org.apache.spark.sql.catalyst or org.apache.spark.sql.execution is considered internal and likely to change in between releases. We do plan to open a stable source/sink API in a future release. The problem here i

Re: Structured Streaming Sink in 2.0 collect/foreach restrictions added in SPARK-16020

2016-06-28 Thread Michael Armbrust
ustom Sink and then doing your operations on that be a reasonable work > around? > > > On Tuesday, June 28, 2016, Michael Armbrust > wrote: > >> This is not too broadly worded, and in general I would caution that any >> interface in org.apache.spark.sql.catalyst or >

Re: transtition SQLContext to SparkSession

2016-07-18 Thread Michael Armbrust
+ dev, reynold Yeah, thats a good point. I wonder if SparkSession.sqlContext should be public/deprecated? On Mon, Jul 18, 2016 at 8:37 AM, Koert Kuipers wrote: > in my codebase i would like to gradually transition to SparkSession, so > while i start using SparkSession i also want a SQLContext
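A sketch of the gradual-migration pattern discussed above (app name is illustrative):

```scala
import org.apache.spark.sql.SparkSession

// During migration, the legacy SQLContext is still reachable from a SparkSession,
// so new code can build the session while old code keeps taking a SQLContext.
val spark = SparkSession.builder().appName("migration-example").getOrCreate()
val sqlContext = spark.sqlContext // hand this to code not yet ported
```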

Re: [VOTE] Release Apache Spark 2.0.0 (RC5)

2016-07-22 Thread Michael Armbrust
+1 On Fri, Jul 22, 2016 at 2:42 PM, Holden Karau wrote: > +1 (non-binding) > > Built locally on Ubuntu 14.04, basic pyspark sanity checking & tested with > a simple structured streaming project (spark-structured-streaming-ml) & > spark-testing-base & high-performance-spark-examples (minor change

Re: Outer Explode needed

2016-07-25 Thread Michael Armbrust
I don't think this would be hard to implement. The physical explode operator supports it (for our HiveQL compatibility). Perhaps comment on this JIRA? https://issues.apache.org/jira/browse/SPARK-13721 It could probably just be another argument to explode() Michael On Mon, Jul 25, 2016 at 6:12

Re: Source API requires unbounded distributed storage?

2016-08-04 Thread Michael Armbrust
Yeah, this API is in the private execution package because we are planning to continue to iterate on it. Today, we will only ever go back one batch, though that might change in the future if we do async checkpointing of internal state. You are totally right that we should relay this info back to

Re: Sorting within partitions is not maintained in parquet?

2016-08-11 Thread Michael Armbrust
This is an optimization to avoid overloading the scheduler with many small tasks. It bin-packs data into tasks based on the file size. You can disable it by setting spark.sql.files.openCostInBytes very high (higher than spark.sql.files.maxPartitionBytes). On Thu, Aug 11, 2016 at 4:27 AM, Hyukjin
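The workaround described above can be sketched as follows; the exact values are illustrative, the point is only that the open cost must exceed the max partition size:

```scala
// Bin-packing packs small files into shared tasks based on estimated cost.
// Raising the per-file open cost above spark.sql.files.maxPartitionBytes
// effectively gives each file its own task, preserving per-file ordering.
spark.conf.set("spark.sql.files.maxPartitionBytes", 128 * 1024 * 1024) // 128 MB (the default)
spark.conf.set("spark.sql.files.openCostInBytes", 512 * 1024 * 1024)   // higher than the above
```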

Spark SQL JSON Column Support

2016-09-28 Thread Michael Armbrust
Spark SQL has great support for reading text files that contain JSON data. However, in many cases the JSON data is just one column amongst others. This is particularly true when reading from sources such as Kafka. This PR adds a new function from_json t
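A usage sketch of `from_json`, assuming a DataFrame `df` with a JSON string column named `value` (as from Kafka) and `import spark.implicits._`; all column and field names are illustrative:

```scala
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types._

// Declare the schema of the embedded JSON, then parse it into a struct column.
val schema = new StructType()
  .add("user", StringType)
  .add("count", IntegerType)

val parsed = df
  .withColumn("parsed", from_json($"value", schema))
  .select($"parsed.user", $"parsed.count") // pull fields out of the struct
```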

Re: Spark SQL JSON Column Support

2016-09-28 Thread Michael Armbrust
Wed, Sep 28, 2016 at 11:02 AM, Nathan Lande > wrote: > >> We are currently pulling out the JSON columns, passing them through >> read.json, and then joining them back onto the initial DF so something like >> from_json would be a nice quality of life improvement for us. >

Re: [VOTE] Release Apache Spark 2.0.1 (RC4)

2016-09-29 Thread Michael Armbrust
+1 On Thu, Sep 29, 2016 at 11:51 AM, Mridul Muralidharan wrote: > +1 > > Regards, > Mridul > > On Wed, Sep 28, 2016 at 7:14 PM, Reynold Xin wrote: > > Please vote on releasing the following candidate as Apache Spark version > > 2.0.1. The vote is open until Sat, Oct 1, 2016 at 20:00 PDT and pas

Re: Spark SQL JSON Column Support

2016-09-29 Thread Michael Armbrust
> > Will this be able to handle projection pushdown if a given job doesn't > utilize all the columns in the schema? Or should people have a per-job schema? > As currently written, we will do a little bit of extra work to pull out fields that aren't needed. I think it would be pretty straight fo

Re: Questions about DataFrame's filter()

2016-09-29 Thread Michael Armbrust
-dev +user It surprises me as `filter()` takes a Column, not a `Row => Boolean`. There are several overloaded versions of Dataset.filter(...) def filter(func: FilterFunction[T]): Dataset[T] def filter(func: (T) ⇒ Boolean): Dataset[T] def filter(conditionExpr: String): Dataset[T] def filter(cond
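The overloads above in use, on a hypothetical `Dataset[Person]` named `ds` where `Person` has an `age` field (assumes `import spark.implicits._`):

```scala
// Column expression: analyzed and optimizable by Catalyst.
ds.filter($"age" > 21)

// Typed lambda (T => Boolean): runs as a black-box function on objects.
ds.filter((p: Person) => p.age > 21)

// SQL condition string: parsed into the same expression as the Column form.
ds.filter("age > 21")
```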

Re: Catalyst - ObjectType for Encoders

2016-09-30 Thread Michael Armbrust
I'd be okay removing that modifier, with one caveat. The code in org.apache.spark.sql.catalyst.* is purposefully excluded from published documentation and does not have the same compatibility guarantees as the rest of Spark's public APIs. We leave most of it not "private" so that advanced use

Re: StructuredStreaming Custom Sinks (motivated by Structured Streaming Machine Learning)

2016-10-04 Thread Michael Armbrust
> > I don't quite understand why exposing it indirectly through a typed > interface should be delayed before finalizing the API. > Spark has a long history of maintaining binary compatibility in its public APIs. I strongly believe this is on

Re: StructuredStreaming Custom Sinks (motivated by Structured Streaming Machine Learning)

2016-10-06 Thread Michael Armbrust
me idea of the current plans and progress? I get asked a lot about > when Structured Streaming will be a viable replacement for Spark Streaming, > and I like to be able to give accurate advice. > > Fred > > On Tue, Oct 4, 2016 at 3:02 PM, Michael Armbrust > wrote: > >> I

Kafaka 0.8, 0.9 in Structured Streaming

2016-10-07 Thread Michael Armbrust
We recently merged support for Kafka 0.10.0 in Structured Streaming, but I've been hearing a few people tell me that they are stuck on an older version of Kafka and cannot upgrade. I'm considering revisiting SPARK-17344 , but it would be good to h

Re: Kafaka 0.8, 0.9 in Structured Streaming

2016-10-07 Thread Michael Armbrust
> > The implementation is totally and completely different however, in ways > that leak to the end user. Can you elaborate? Especially in the context of the interface provided by structured streaming.

Re: Kafaka 0.8, 0.9 in Structured Streaming

2016-10-07 Thread Michael Armbrust
> 0.10 consumers won't work on an earlier broker. > Earlier consumers will (should?) work on a 0.10 broker. > This lines up with my testing. Is there a page I'm missing that describes this? Like does a 0.9 client work with 0.8 broker? Is it always old clients can talk to new brokers but not vi

Re: Kafaka 0.8, 0.9 in Structured Streaming

2016-10-07 Thread Michael Armbrust
> > Without a hell of a lot more work, Assign would be the only strategy > usable. How would the current "subscribe" break?

Re: StructuredStreaming Custom Sinks (motivated by Structured Streaming Machine Learning)

2016-10-11 Thread Michael Armbrust
This is super helpful, thanks for writing it up! > *Delivering low latency, high throughput, and stability simultaneously:* Right > now, our own tests indicate you can get at most two of these > characteristics out of Spark Streaming at the same time. I know of two > parties that have abandoned S

Re: DataFrameReader Schema Supersedes Schema Provided by Encoder, Renders Fields Nullable

2016-10-13 Thread Michael Armbrust
There is a lot of confusion around nullable

Re: DataFrameReader Schema Supersedes Schema Provided by Encoder, Renders Fields Nullable

2016-10-14 Thread Michael Armbrust
> > Additionally, shall I go ahead and open a ticket pointing out the missing > call to .asNullable in the streaming reader? > Yes please! This probably affects correctness.

Re: StructuredStreaming status

2016-10-19 Thread Michael Armbrust
Anything that is actively being designed should be in JIRA, and it seems like you found most of it. In general, release windows can be found on the wiki. 2.1 has a lot of stability fixes as well as the Kafka support you mentioned.

Re: StructuredStreaming status

2016-10-19 Thread Michael Armbrust
016 at 2:45 PM, Michael Armbrust > wrote: > > Anything that is actively being designed should be in JIRA, and it seems > > like you found most of it. In general, release windows can be found on > the > > wiki. > > > > 2.1 has a lot of stability fixes as well as

Re: Why the json file used by sparkSession.read.json must be a valid json object per line

2016-10-19 Thread Michael Armbrust
On Sun, Oct 16, 2016 at 3:50 AM, wrote: > Think of it as jsonl instead of a json file. > Point people at this if they need an official looking spec: > http://jsonlines.org/ > That link is awesome. I think it would be great if someone could open a PR to add this to our documentation. I'd also b
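The line-delimited format in question, sketched (path is illustrative):

```scala
// spark.read.json expects one complete JSON object per line (JSON Lines),
// not a single pretty-printed JSON document. For example, a file containing:
//   {"name": "alice", "age": 30}
//   {"name": "bob", "age": 25}
val df = spark.read.json("/path/to/people.jsonl")
df.printSchema() // schema is inferred from the records
```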

Re: StructuredStreaming status

2016-10-20 Thread Michael Armbrust
> > let’s say we would have implemented distinct count by saving a map with > the key being the distinct value and the value being the last time we saw > this value. This would mean that we wouldn’t really need to save all the > steps in the middle and copy the data, we could only save the last por

Re: StructuredStreaming status

2016-10-20 Thread Michael Armbrust
> > On a personal note, I'm quite surprised that this is all the progress in > Structured Streaming over the last three months since 2.0 was released. I > was under the impression that this was one of the biggest things that the > Spark community actively works on, but that is clearly not the case,

Re: Ran in to a bug in Broadcast Hash Join

2016-10-22 Thread Michael Armbrust
2.0.0 or 2.0.1? There are several correctness fixes in the latter. On Oct 22, 2016 2:14 PM, "Jeremy Davis" wrote: > > Hello, I ran in to a bug with Broadcast Hash Join in Spark 2.0. (Running > on EMR) > If I just toggle spark.sql.autoBroadcastJoinThreshold=-1 then the join > works, if I leave i
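The toggle mentioned in the report, as a config sketch:

```scala
// Setting the auto-broadcast threshold to -1 disables broadcast hash joins
// entirely, forcing a shuffle-based join instead (the reporter's workaround).
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
```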

Re: LIMIT issue of SparkSQL

2016-10-23 Thread Michael Armbrust
- dev + user Can you give more info about the query? Maybe a full explain()? Are you using a datasource like JDBC? The API does not currently push down limits, but the documentation talks about how you can use a query instead of a table if that is what you are looking to do. On Mon, Oct 24, 20
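The query-instead-of-table workaround described above, sketched for the JDBC source (URL, table, and limit are illustrative):

```scala
// Since the API does not push down LIMIT, embed it in the query handed to
// the JDBC source so the database applies it before data reaches Spark.
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://host:5432/db")
  .option("dbtable", "(SELECT * FROM events LIMIT 100) AS t")
  .load()
```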

Re: getting encoder implicits to be more accurate

2016-10-26 Thread Michael Armbrust
Hmm, that is unfortunate. Maybe the best solution is to add support for sets? I don't think that would be super hard. On Tue, Oct 25, 2016 at 8:52 PM, Koert Kuipers wrote: > i am trying to use encoders as a typeclass where if it fails to find an > ExpressionEncoder it falls back to KryoEncoder

Re: Straw poll: dropping support for things like Scala 2.10

2016-10-26 Thread Michael Armbrust
+1 On Wed, Oct 26, 2016 at 11:26 AM, Reynold Xin wrote: > We can do the following concrete proposal: > > 1. Plan to remove support for Java 7 / Scala 2.10 in Spark 2.2.0 (Mar/Apr > 2017). > > 2. In Spark 2.1.0 release, aggressively and explicitly announce the > deprecation of Java 7 / Scala 2.10

Re: getting encoder implicits to be more accurate

2016-10-26 Thread Michael Armbrust
> > On Wed, Oct 26, 2016 at 2:33 PM, Ryan Blue wrote: > >> Isn't the problem that Option is a Product and the class it contains >> isn't checked? Adding support for Set fixes the example, but the problem >> would happen with any class there isn't a

Re: getting encoder implicits to be more accurate

2016-10-26 Thread Michael Armbrust
implicit for Seq[(Int, Seq[(String, >> Int)])] will create a new ExpressionEncoder(), while an implicit for >> Seq[(Int, Set[(String, Int)])] produces a Encoders.kryoEncoder() >> >> On Wed, Oct 26, 2016 at 3:50 PM, Michael Armbrust > > wrote: >> >>> Sorry

Re: getting encoder implicits to be more accurate

2016-10-26 Thread Michael Armbrust
Oct 26, 2016 at 5:10 PM, Koert Kuipers wrote: > >> i use kryo for the whole thing currently >> >> it would be better to use it for the subtree >> >> On Wed, Oct 26, 2016 at 5:06 PM, Michael Armbrust > > wrote: >> >>> You use kryo encoder f

Re: Watermarking in Structured Streaming to drop late data

2016-10-26 Thread Michael Armbrust
And the JIRA: https://issues.apache.org/jira/browse/SPARK-18124 On Wed, Oct 26, 2016 at 4:56 PM, Tathagata Das wrote: > Hey all, > > We are planning implement watermarking in Structured Streaming that would > allow us handle late, out-of-order data better. Specially, when we are > aggregating ov
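For reference, the API this work eventually introduced (`withWatermark`, shipped in Spark 2.1) looks like the following sketch, on a hypothetical stream `events` with an `eventTime` timestamp column:

```scala
import org.apache.spark.sql.functions.window

// Declare how late data may arrive; state for windows older than the
// watermark can be dropped instead of kept indefinitely.
val counts = events
  .withWatermark("eventTime", "10 minutes")
  .groupBy(window($"eventTime", "5 minutes"))
  .count()
```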

Re: [VOTE] Release Apache Spark 2.0.2 (RC1)

2016-10-27 Thread Michael Armbrust
+1 On Oct 27, 2016 12:19 AM, "Reynold Xin" wrote: > Greetings from Spark Summit Europe at Brussels. > > Please vote on releasing the following candidate as Apache Spark version > 2.0.2. The vote is open until Sun, Oct 30, 2016 at 00:30 PDT and passes if > a majority of at least 3+1 PMC votes are

Re: encoders for more complex types

2016-10-27 Thread Michael Armbrust
I would categorize these as bugs. We should (but probably don't fully yet) support arbitrary nesting as long as you use basic collections / case classes / primitives. Please do open JIRAs as you find problems. On Thu, Oct 27, 2016 at 1:05 PM, Koert Kuipers wrote: > well i was using Aggregators

JIRA Components for Streaming

2016-10-31 Thread Michael Armbrust
I'm planning to do a little maintenance on JIRA to hopefully improve the visibility into the progress / gaps in Structured Streaming. In particular, while we share a lot of optimization / execution logic with SQL, the set of desired features and bugs is fairly different. Proposal: - Structured

Re: JIRA Components for Streaming

2016-11-01 Thread Michael Armbrust
I did this <https://issues.apache.org/jira/browse/SPARK/component/12331043/?selectedTab=com.atlassian.jira.jira-projects-plugin:component-issues-panel>. Please help me correct any issues I may have missed. On Mon, Oct 31, 2016 at 11:37 AM, Michael Armbrust wrote: > I'm planning

Re: Structured streaming aggregation - update mode

2016-11-02 Thread Michael Armbrust
Yeah, agreed. As mentioned here , its near the top of my list. I just opened SPARK-18234 to track. On Wed, Nov 2, 2016 at 3:24 PM, Cristian Opris wrote: > Hi, > > I've been looking at planned jiras f

Re: [VOTE] Release Apache Spark 1.6.3 (RC2)

2016-11-03 Thread Michael Armbrust
+1 On Wed, Nov 2, 2016 at 5:40 PM, Reynold Xin wrote: > Please vote on releasing the following candidate as Apache Spark version > 1.6.3. The vote is open until Sat, Nov 5, 2016 at 18:00 PDT and passes if a > majority of at least 3+1 PMC votes are cast. > > [ ] +1 Release this package as Apache

Re: [VOTE] Release Apache Spark 2.0.2 (RC2)

2016-11-04 Thread Michael Armbrust
+1 On Tue, Nov 1, 2016 at 9:51 PM, Reynold Xin wrote: > Please vote on releasing the following candidate as Apache Spark version > 2.0.2. The vote is open until Fri, Nov 4, 2016 at 22:00 PDT and passes if a > majority of at least 3+1 PMC votes are cast. > > [ ] +1 Release this package as Apache

Re: [VOTE] Release Apache Spark 2.0.2 (RC3)

2016-11-08 Thread Michael Armbrust
+1 On Tue, Nov 8, 2016 at 1:17 PM, Sean Owen wrote: > +1 binding > > (See comments on last vote; same results, except, the regression we > identified is fixed now.) > > > On Tue, Nov 8, 2016 at 6:10 AM Reynold Xin wrote: > >> Please vote on releasing the following candidate as Apache Spark vers

Re: getting encoder implicits to be more accurate

2016-11-14 Thread Michael Armbrust
;> wrote: >> >> You don't need compiler time macros for this, you can do it quite easily >> using shapeless. I've been playing with a project which borrows ideas from >> spray-json and spray-json-shapeless to implement Row marshalling for >> arbitrary c

Re: How do I convert json_encoded_blob_column into a data frame? (This may be a feature request)

2016-11-16 Thread Michael Armbrust
On Wed, Nov 16, 2016 at 2:49 AM, Hyukjin Kwon wrote: > Maybe it sounds like you are looking for from_json/to_json functions after > en/decoding properly. > Which are new built-in functions that will be released with Spark 2.1.

Re: Multiple streaming aggregations in structured streaming

2016-11-18 Thread Michael Armbrust
Doing this generally is pretty hard. We will likely support algebraic aggregate eventually, but this is not currently slotted for 2.2. Instead I think we will add something like mapWithState that lets users compute arbitrary stateful things. What is your use case? On Wed, Nov 16, 2016 at 6:58

Re: Analyzing and reusing cached Datasets

2016-11-19 Thread Michael Armbrust
You are hitting a weird optimization in withColumn. Specifically, to avoid building up huge trees with chained calls to this method, we collapse projections eagerly (instead of waiting for the optimizer). Typically we look for cached data in between analysis and optimization, so that optimization

Re: [VOTE] Apache Spark 2.1.0 (RC1)

2016-11-30 Thread Michael Armbrust
Unfortunately the FileFormat APIs are not stable yet, so if you are using spark-avro, we are going to need to update it for this release. On Wed, Nov 30, 2016 at 2:56 PM, Koert Kuipers wrote: > running our inhouse unit-tests (that work with spark 2.0.2) against spark > 2.1.0-rc1 i see the follow

Re: Flink event session window in Spark

2016-12-02 Thread Michael Armbrust
Here is the JIRA for adding this feature: https://issues.apache.org/jira/browse/SPARK-10816 On Fri, Dec 2, 2016 at 11:20 AM, Fritz Budiyanto wrote: > Hi All, > > I need help on how to implement Flink event session window in Spark. Is > this possible? > > For instance, I wanted to create a sessio

Re: ability to provide custom serializers

2016-12-02 Thread Michael Armbrust
I would love to see something like this. The closest related ticket is probably https://issues.apache.org/jira/browse/SPARK-7768 (though maybe there are enough people using UDTs in their current form that we should just make a new ticket) A few thoughts: - even if you can do implicit search, we

Re: ability to provide custom serializers

2016-12-05 Thread Michael Armbrust
Let's start with a new ticket, link them and we can merge if the solution ends up working out for both cases. On Sun, Dec 4, 2016 at 5:39 PM, Erik LaBianca wrote: > Thanks Michael! > > On Dec 2, 2016, at 7:29 PM, Michael Armbrust > wrote: > > I would love to see somethi

Re: Expand the Spark SQL programming guide?

2016-12-15 Thread Michael Armbrust
Pull requests would be welcome for any major missing features in the guide: https://github.com/apache/spark/blob/master/docs/sql-programming-guide.md On Thu, Dec 15, 2016 at 11:48 AM, Jim Hughes wrote: > Hi Anton, > > I'd like to see this as well. I've been working on implementing > geospatial

Re: What is mainly different from a UDT and a spark internal type that ExpressionEncoder recognized?

2016-12-27 Thread Michael Armbrust
An encoder uses reflection to generate expressions that can extract data out of an object (by calling methods on the object) and encode its contents directly into the tungst

Re: StateStoreSaveExec / StateStoreRestoreExec

2017-01-03 Thread Michael Armbrust
I think we should add something similar to mapWithState in 2.2. It would be great if you could add the description of your problem to this ticket: https://issues.apache.org/jira/browse/SPARK-19067 On Mon, Jan 2, 2017 at 2:05 PM, Jeremy Smith wrote: > I have a question about state tracking in St

Re: StateStoreSaveExec / StateStoreRestoreExec

2017-01-03 Thread Michael Armbrust
You might also be interested in this: https://issues.apache.org/jira/browse/SPARK-19031 On Tue, Jan 3, 2017 at 3:36 PM, Michael Armbrust wrote: > I think we should add something similar to mapWithState in 2.2. It would > be great if you could add the description of your problem to this

Re: [SQL][SPARK-14160] Maximum interval for o.a.s.sql.functions.window

2017-01-18 Thread Michael Armbrust
+1, we should just fix the error to explain why months aren't allowed and suggest that you manually specify some number of days. On Wed, Jan 18, 2017 at 9:52 AM, Maciej Szymkiewicz wrote: > Thanks for the response Burak, > > As any sane person I try to steer away from the objects which have both

Re: specifing schema on dataframe

2017-02-05 Thread Michael Armbrust
-dev You can use withColumn to change the type after the data has been loaded . On Sat, Feb 4, 2017 at 6:22 AM, Sam Elamin wrote: > Hi
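The suggestion above, sketched on a hypothetical DataFrame `df` with a string column `amount` (assumes `import spark.implicits._`):

```scala
// Replace the column with a cast version after load; the name stays the same.
val fixed = df.withColumn("amount", $"amount".cast("double"))
```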

Re: Structured Streaming. Dropping Duplicates

2017-02-07 Thread Michael Armbrust
Here a JIRA: https://issues.apache.org/jira/browse/SPARK-19497 We should add this soon. On Tue, Feb 7, 2017 at 8:35 AM, Sam Elamin wrote: > Hi All > > When trying to read a stream off S3 and I try and drop duplicates I get > the following error: > > Exception in thread "main" org.apache.spark.s

Re: Structured Streaming. Dropping Duplicates

2017-02-07 Thread Michael Armbrust
rk job using Ctrl+C. > When I rerun the stream it picks up "update 2" again > > Is this normal? isnt ctrl+c a failure? > > I would expect checkpointing to know that update 2 was already processed > > Regards > Sam > > On Tue, Feb 7, 2017 at 4:58 PM, Sam

Re: Structured Streaming. Dropping Duplicates

2017-02-07 Thread Michael Armbrust
cause I can see in the log > that its now polling for new changes, the latest offset is the right one > > After I kill it and relaunch it picks up that same file? > > > Sorry if I misunderstood you > > On Tue, Feb 7, 2017 at 5:20 PM, Michael Armbrust > wrote: > >

Re: Structured Streaming. Dropping Duplicates

2017-02-07 Thread Michael Armbrust
e case then how would I go about ensuring no duplicates? > > > Thanks again for the awesome support! > > Regards > Sam > On Tue, 7 Feb 2017 at 18:05, Michael Armbrust > wrote: > >> Sorry, I think I was a little unclear. There are two things at play here. >> &

Re: Structured Streaming. Dropping Duplicates

2017-02-07 Thread Michael Armbrust
nk",tableName2) > .option("checkpointLocation","checkpoint") > .start() > > > On Tue, Feb 7, 2017 at 7:24 PM, Michael Armbrust > wrote: > >> Read the JSON log of files that is in `/your/path/_spark_metadata` and >> only read files that

Re: benefits of code gen

2017-02-10 Thread Michael Armbrust
Function1 is specialized, but nullSafeEval is Any => Any, so that's still going to box in the non-codegened execution path. On Fri, Feb 10, 2017 at 1:32 PM, Koert Kuipers wrote: > based on that i take it that math functions would be primary beneficiaries > since they work on primitives. > > so i

Re: Structured Streaming Spark Summit Demo - Databricks people

2017-02-16 Thread Michael Armbrust
Thanks for your interest in Apache Spark Structured Streaming! There is nothing secret in that demo, though I did make some configuration changes in order to get the timing right (gotta have some dramatic effect :) ). Also I think the visualizations based on metrics output by the StreamingQueryLi

Re: Should we consider a Spark 2.1.1 release?

2017-03-15 Thread Michael Armbrust
Hey Holden, Thanks for bringing this up! I think we usually cut patch releases when there are enough fixes to justify it. Sometimes just a few weeks after the release. I guess if we are at 3 months Spark 2.1.0 was a pretty good release :) That said, it is probably time. I was about to start th

Spark 2.2 Code-freeze - 3/20

2017-03-15 Thread Michael Armbrust
Hey Everyone, Just a quick announcement that I'm planning to cut the branch for Spark 2.2 this coming Monday (3/20). Please try and get things merged before then and also please begin retargeting of any issues that you don't think will make the release. Michael

Re: Outstanding Spark 2.1.1 issues

2017-03-21 Thread Michael Armbrust
Please speak up if I'm wrong, but none of these seem like critical regressions from 2.1. As such I'll start the RC process later today. On Mon, Mar 20, 2017 at 9:52 PM, Holden Karau wrote: > I'm not super sure it should be a blocker for 2.1.1 -- is it a regression? > Maybe we can get TDs input

Re: Outstanding Spark 2.1.1 issues

2017-03-22 Thread Michael Armbrust
good to start the RC >> process. >> >> On Tue, Mar 21, 2017 at 1:41 PM, Michael Armbrust > > wrote: >> >> Please speak up if I'm wrong, but none of these seem like critical >> regressions from 2.1. As such I'll start the RC process later today. >

Re: Outstanding Spark 2.1.1 issues

2017-03-28 Thread Michael Armbrust
Asher Krim > Senior Software Engineer > > On Wed, Mar 22, 2017 at 7:44 PM, Michael Armbrust > wrote: > >> An update: I cut the tag for RC1 last night. Currently fighting with the >> release process. Will post RC1 once I get it working. >> >> On Tue, Mar 21, 2

[VOTE] Apache Spark 2.1.1 (RC2)

2017-03-30 Thread Michael Armbrust
Please vote on releasing the following candidate as Apache Spark version 2.1.1. The vote is open until Sun, April 2nd, 2017 at 16:30 PST and passes if a majority of at least 3 +1 PMC votes are cast. [ ] +1 Release this package as Apache Spark 2.1.1 [ ] -1 Do not release this package because ...

Re: [VOTE] Apache Spark 2.1.1 (RC2)

2017-04-04 Thread Michael Armbrust
1:16 PM, Felix Cheung wrote: > -1 > sorry, found an issue with SparkR CRAN check. > Opened SPARK-20197 and working on fix. > > -- > *From:* holden.ka...@gmail.com on behalf of > Holden Karau > *Sent:* Friday, March 31, 2017 6:25:20 PM > *

Re: 2.2 branch

2017-04-13 Thread Michael Armbrust
Yeah, I was delaying until 2.1.1 was out and some of the hive questions were resolved. I'll make progress on that by the end of the week. Let's aim for 2.2 branch cut next week. On Thu, Apr 13, 2017 at 8:56 AM, Koert Kuipers wrote: > i see there is no 2.2 branch yet for spark. has this been pus
