Re: Asking for reviewing PRs regarding structured streaming

2018-07-05 Thread Jungtaek Lim
Ted Yu suggested posting the improved numbers to this thread and I think it's good idea, so also posting here, but I also think explaining rationalization of my issues would help understanding why I'm submitting couple of patches, so I'll explain it first. (Sorry to post a wall of text). tl;dr. SP

Re: Asking for reviewing PRs regarding structured streaming

2018-07-05 Thread Jungtaek Lim
Bump. I have been having hard time working on making additional PRs since some of these rely on non-merged PRs, so spending additional time to decouple these things if possible. https://github.com/apache/spark/pulls/HeartSaVioR Pending 5 PRs so far, and may add more sooner or later. Thanks, Jung

[Streaming][Kinesis][SPARK-20168] Could I get some reviews of the patch that resolves kinesis timestamp resume

2018-07-05 Thread Yash Sharma
Hi Team, Could I get some review at the patch here. Would love to hear suggestions here on the patch. I had to reopen SPARK-20168 because of this bug. https://github.com/apache/spark/pull/21541 https://issues.apache.org/jira/browse/SPARK-20168 Cheers, Yash

[SS] Trigger.Once not working with watermarked event time

2018-07-05 Thread Christopher Horn
Does anyone who worked on SS or is familiar with it have an opinion on this?: https://issues.apache.org/jira/browse/SPARK-24699 https://github.com/apache/spark/pull/21676 I don't like the idea of making Trigger.Once execute two batches, but it is the simplest way to get the desired batch state c

Unsubscribe

2018-07-05 Thread xu han

loading and release of shuffle blocks, what does the TODO in comment mean?

2018-07-05 Thread 吴晓菊
By looking into the code of ShuffleBlockFetcherIterator, seems it will load all shuffle blocks needed by one task at one time and release all of them when task finished. Please correct me if I'm wrong. If my understanding is correct, does it mean those shuffle blocks will keep in memory even thou

what about put broadcast to offheap?

2018-07-05 Thread 吴晓菊
Now all broadcast data are put on heap since the StorageLevel is hard-coded to MEMORY_AND_DISK_SER in TorrentBroadcast. So it's highly possible the broadcast data will goto oldgen and then may have impact on full-gc. So what about put broadcast to offheap? Chrysan Wu Phone:+86 17717640807

Re: Run Python User Defined Functions / code in Spark with Scala Codebase

2018-07-05 Thread Chetan Khatri
Prem sure, Thanks for suggestion. On Wed, Jul 4, 2018 at 8:38 PM, Prem Sure wrote: > try .pipe(.py) on RDD > > Thanks, > Prem > > On Wed, Jul 4, 2018 at 7:59 PM, Chetan Khatri > wrote: > >> Can someone please suggest me , thanks >> >> On Tue 3 Jul, 2018, 5:28 PM Chetan Khatri, >> wrote: >> >>>