Small correction: "timeout" in map/flatMapGroupsWithState would not work
the same way as state TTL when event time and a watermark are set. The timeout
in map/flatMapGroupsWithState is there to guarantee removal of state once the
state will no longer be used, similar to what we do with streaming aggregation,
whereas
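
To illustrate the point above, here is a minimal sketch of a processing-time
timeout in mapGroupsWithState that guarantees state removal once a key goes
quiet (the Event/RunningSum types and the 10-minute duration are made up for
the example):

    import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout}

    case class Event(key: String, value: Long)
    case class RunningSum(key: String, sum: Long)

    def updateState(key: String, events: Iterator[Event],
                    state: GroupState[Long]): RunningSum = {
      if (state.hasTimedOut) {
        // The timeout fired: this key's state is no longer needed, drop it.
        val last = state.get
        state.remove()
        RunningSum(key, last)
      } else {
        val newSum = state.getOption.getOrElse(0L) + events.map(_.value).sum
        state.update(newSum)
        // Re-arm the timeout; without it the state would never be cleaned up.
        state.setTimeoutDuration("10 minutes")
        RunningSum(key, newSum)
      }
    }

    // events.groupByKey(_.key)
    //   .mapGroupsWithState(GroupStateTimeout.ProcessingTimeTimeout)(updateState)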
Sorry to resurrect this old and long thread: we have been struggling with
Kafka end-to-end exactly-once support, and couldn't find any approach that
achieves both properties, transactional and scalable.
If we give up scalability, we can let writers write to a staging topic
within individual transact
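
For reference, a sketch of what "write within an individual transaction"
could look like with Kafka's transactional producer (the broker address,
topic name, and transactional.id are placeholders):

    import java.util.Properties
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

    val props = new Properties()
    props.put("bootstrap.servers", "broker:9092")
    props.put("transactional.id", "writer-task-0") // must be unique per writer
    props.put("key.serializer",
      "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer",
      "org.apache.kafka.common.serialization.StringSerializer")

    val producer = new KafkaProducer[String, String](props)
    producer.initTransactions()
    producer.beginTransaction()
    try {
      producer.send(new ProducerRecord("staging-topic", "key", "value"))
      producer.commitTransaction()
    } catch {
      case e: Exception =>
        // Abort so partial writes to the staging topic are never visible
        // to read-committed consumers.
        producer.abortTransaction()
        throw e
    }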
Hi devs,
While Spark 2.4.0 is still going through release votes, I'm seeing some
non-SS pull requests being reviewed and merged into the master branch,
so I guess a discussion about the next release is OK.
It looks like there's a major TODO left on Structured Streaming: allowing
stateful operations in
I would be very interested in the initial question here:
is there a production-level implementation of memory-only shuffle,
configurable similarly to the storage levels (MEMORY_ONLY,
MEMORY_AND_DISK), as mentioned in this ticket:
https://github.com/apache/spark/pull/5403 ?
It would be
Hi,
I actually encountered this corner case, and I think it is not that uncommon.
In my case, I was writing a write-only source that used a library to write
to a database. I didn't want to have to write a reader; however, even if I
had written one, it wouldn't have worked. I wouldn't h
I want to bring back the discussion of the data source v2 abstraction.
There is a problem Hyukjin discovered recently: a write-only data source
may accept any input and does not itself have a schema, so the table
abstraction doesn't fit it, as a table must provide a schema.
Personally I t
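
As a concrete illustration of the mismatch (using the DSv2 interfaces as
they later landed in Spark 3.x; take this as a sketch of the problem, not
the proposal itself): a write-only table is still forced to implement
schema(), and the best it can do is return something artificial:

    import java.util
    import org.apache.spark.sql.connector.catalog.{SupportsWrite, TableCapability}
    import org.apache.spark.sql.connector.write.{LogicalWriteInfo, WriteBuilder}
    import org.apache.spark.sql.types.StructType

    class WriteOnlyTable extends SupportsWrite {
      override def name(): String = "write_only_sink"
      // No real schema to report; an empty StructType is the only option.
      override def schema(): StructType = new StructType()
      override def capabilities(): util.Set[TableCapability] =
        util.EnumSet.of(TableCapability.BATCH_WRITE)
      override def newWriteBuilder(info: LogicalWriteInfo): WriteBuilder = ???
    }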
Which version of Spark are you using?
You can attach a streaming query listener and print out the
QueryProgressEvent to track the watermark. The watermark value is
updated per batch, and the next batch will use that value.
If the watermark exceeds the last timestamp but the value is still
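
A minimal sketch of such a listener, printing only the watermark entry
from the progress event's eventTime map:

    import org.apache.spark.sql.streaming.StreamingQueryListener
    import org.apache.spark.sql.streaming.StreamingQueryListener._

    spark.streams.addListener(new StreamingQueryListener {
      override def onQueryStarted(event: QueryStartedEvent): Unit = ()
      override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()
      override def onQueryProgress(event: QueryProgressEvent): Unit = {
        // eventTime carries "watermark" (plus "min"/"max"/"avg") when a
        // watermark is defined on the query.
        println(s"watermark = ${event.progress.eventTime.get("watermark")}")
      }
    })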
Now I've added the same aggregation query as below, but it still didn't filter:

import org.apache.spark.sql.functions.split
import spark.implicits._ // for the 'value column syntax

val lines_stream = spark.readStream.
  format("kafka").
  option("kafka.bootstrap.servers", "vm3:21005,vm2:21005").
  option("subscribe", "s1").
  load().
  // Kafka's value column is binary; cast it to string before splitting.
  withColumn("tokens", split('value.cast("string"), ",")).
Hi Sandeep,
Watermarks are used in aggregation queries to ensure correctness and to clean
up state. They don't let you drop records in map-only scenarios, which is
what you have in your example. If you run a test with `groupBy().count()`,
you will see that the count doesn't increase with the last
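
For completeness, a sketch of the aggregation variant (assuming the
lines_stream above and the Kafka source's built-in timestamp column;
the thresholds are arbitrary):

    import org.apache.spark.sql.functions.window

    val counts = lines_stream
      .withWatermark("timestamp", "10 minutes")
      .groupBy(window($"timestamp", "5 minutes"))
      .count()
    // Records arriving more than 10 minutes behind the watermark are
    // dropped and no longer update their window's count.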