Re: [DISCUSS] "latestFirst" option and metadata growing issue in File stream source

2020-07-30 Thread vikram agrawal
If we compare the file stream source with other streaming sources such as Kafka, the current behavior is indeed incomplete. Starting the stream from a custom offset or a particular point in time is something that is missing. Typically, file stream sources don't have auto-deletion of older data/files.

[OSS DIGEST] The major changes of Apache Spark from June 17 to June 30

2020-07-30 Thread Hyukjin Kwon
Hi all, This is the bi-weekly Apache Spark digest from the Databricks OSS team. For each API/configuration/behavior change, an *[API]* tag is added in the title. CORE

[VOTE] Update the committer guidelines to clarify when to commit changes.

2020-07-30 Thread Holden Karau
Hi Spark Developers, After the discussion of the proposal to amend Spark committer guidelines, it appears folks are generally in agreement on policy clarifications. (See https://lists.apache.org/thread.html/r6706e977fda2c474a7f24775c933c2f46ea19afbfafb03c90f6972ba%40%3Cdev.spark.apache.org%3E, as

Re: [DISCUSS] "latestFirst" option and metadata growing issue in File stream source

2020-07-30 Thread Jungtaek Lim
(I'd like to keep the discussion thread focused on the specific topic - let's initiate other discussion threads on different topics.) Thanks for the input. I'd like to emphasize that the point under discussion is the "latestFirst" option - the rationalization starts from growing metadata log issue

Re: [VOTE] Update the committer guidelines to clarify when to commit changes.

2020-07-30 Thread Jungtaek Lim
+1 (non-binding, I guess) Thanks for raising the issue and sorting it out! On Fri, Jul 31, 2020 at 6:47 AM Holden Karau wrote: > Hi Spark Developers, > > After the discussion of the proposal to amend Spark committer guidelines, > it appears folks are generally in agreement on policy clarificati

Re: [VOTE] Update the committer guidelines to clarify when to commit changes.

2020-07-30 Thread Holden Karau
+1 from myself :) On Thu, Jul 30, 2020 at 2:53 PM Jungtaek Lim wrote: > +1 (non-binding, I guess) > > Thanks for raising the issue and sorting it out! > > On Fri, Jul 31, 2020 at 6:47 AM Holden Karau wrote: > >> Hi Spark Developers, >> >> After the discussion of the proposal to amend Spark comm

Re: [VOTE] Update the committer guidelines to clarify when to commit changes.

2020-07-30 Thread Wenchen Fan
+1, thanks for driving it, Holden! On Fri, Jul 31, 2020 at 10:24 AM Holden Karau wrote: > +1 from myself :) > > On Thu, Jul 30, 2020 at 2:53 PM Jungtaek Lim > wrote: > >> +1 (non-binding, I guess) >> >> Thanks for raising the issue and sorting it out! >> >> On Fri, Jul 31, 2020 at 6:47 AM Holde

Re: [DISCUSS] "latestFirst" option and metadata growing issue in File stream source

2020-07-30 Thread German Schiavon
Hi Jungtaek, I have a question: aren't both approaches compatible? The way I see it, it would be interesting to have a retention period to delete old files and/or the possibility of indicating an offset (timestamp). It would be very "similar" to how we do it with Kafka. WDYT? On Thu, 30 Jul

Re: [DISCUSS] "latestFirst" option and metadata growing issue in File stream source

2020-07-30 Thread Jungtaek Lim
Hi German, option 1 isn't about "deleting" the old files, as your input directory may be accessed by multiple queries. Kafka centralizes the maintenance of input data, hence it is possible to apply retention without problems. Option 1 is more about "hiding" the old files being read, so that end users "may