Hi Folks,

Great discussion! I will take rate-limiting into account and make it configurable for the HTTP request as well as all
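To make that concrete, here is a minimal sketch of what a configurable, throttled batch poster behind a foreachBatch REST sink could look like. Everything here is illustrative (the `RateLimiter` and `post_batch` names and the endpoint URL are mine, not an existing Spark API), and a real implementation would also need retries and idempotent writes to get anywhere near exactly-once:

```python
import json
import time
import urllib.request


class RateLimiter:
    """Fixed-interval limiter: allow at most `max_per_sec` calls per second."""

    def __init__(self, max_per_sec: float):
        self.min_interval = 1.0 / max_per_sec
        self._last = 0.0

    def acquire(self) -> None:
        # Sleep just long enough to keep the configured request rate.
        wait = self.min_interval - (time.monotonic() - self._last)
        if wait > 0:
            time.sleep(wait)
        self._last = time.monotonic()


def post_batch(rows, endpoint, limiter):
    """POST each row as JSON to `endpoint`, throttled by `limiter`.

    Returns the number of rows sent. No retry or dedup handling here,
    so this is at-most/at-least-once depending on failure timing.
    """
    sent = 0
    for row in rows:
        limiter.acquire()
        req = urllib.request.Request(
            endpoint,
            data=json.dumps(row).encode("utf-8"),
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        urllib.request.urlopen(req)
        sent += 1
    return sent


# Inside a Structured Streaming job it would be wired up roughly like
# (hypothetical endpoint, sketch only):
#
#   limiter = RateLimiter(max_per_sec=10)
#
#   def sink(batch_df, batch_id):
#       post_batch((r.asDict() for r in batch_df.toLocalIterator()),
#                  "https://example.com/ingest", limiter)
#
#   query = df.writeStream.foreachBatch(sink).start()
```

Note this throttles per driver/executor process; a cluster-wide rate limit would need coordination (or dividing the budget by the number of tasks).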
I was wondering if there is anything I might have missed that would make this technically impossible, or at least difficult enough not to warrant the effort. Is there anything I might have overlooked? Also, would this be useful to people? From a business perspective, why make people wait for the next scheduled batch run for data that is already available from an API? You could run a job every minute or hour, but that in itself sounds like a streaming use case.

Thoughts?

Regards
Sam

On Thu, Jul 2, 2020 at 3:31 AM Burak Yavuz <brk...@gmail.com> wrote:

> Well, the difference is, a technical user writes the UDF and a
> non-technical user may use this built-in thing (misconfigure it) and shoot
> themselves in the foot.
>
> On Wed, Jul 1, 2020, 6:40 PM Andrew Melo <andrew.m...@gmail.com> wrote:
>
>> On Wed, Jul 1, 2020 at 8:13 PM Burak Yavuz <brk...@gmail.com> wrote:
>> >
>> > I'm not sure having a built-in sink that allows you to DDoS servers is
>> > the best idea either. foreachWriter is typically used for such use cases,
>> > not foreachBatch. It's also pretty hard to guarantee exactly-once, rate
>> > limiting, etc.
>>
>> If you control the machines and can run arbitrary code, you can DDoS
>> whatever you want. What's the difference between this proposal and
>> writing a UDF that opens 1,000 connections to a target machine?
>>
>> > Best,
>> > Burak
>> >
>> > On Wed, Jul 1, 2020 at 5:54 PM Holden Karau <hol...@pigscanfly.ca> wrote:
>> >>
>> >> I think adding something like this (if it doesn't already exist) could
>> >> help make Structured Streaming easier to use; foreachBatch is not the
>> >> best API.
>> >>
>> >> On Wed, Jul 1, 2020 at 2:21 PM Jungtaek Lim
>> >> <kabhwan.opensou...@gmail.com> wrote:
>> >>>
>> >>> I guess the method, query parameters, headers, and the payload would
>> >>> be different for almost every use case - that makes it hard to
>> >>> generalize, and requires the implementation to be quite complicated
>> >>> to be flexible enough.
>> >>>
>> >>> I'm not aware of any custom sink implementing REST, so your best bet
>> >>> would be simply implementing your own with foreachBatch, but someone
>> >>> might jump in and provide a pointer if there is something in the
>> >>> Spark ecosystem.
>> >>>
>> >>> Thanks,
>> >>> Jungtaek Lim (HeartSaVioR)
>> >>>
>> >>> On Thu, Jul 2, 2020 at 3:21 AM Sam Elamin <hussam.ela...@gmail.com>
>> >>> wrote:
>> >>>>
>> >>>> Hi All,
>> >>>>
>> >>>> We ingest a lot of RESTful APIs into our lake, and I'm wondering if
>> >>>> it is at all possible to create a REST sink in Structured Streaming?
>> >>>>
>> >>>> For now I'm only focusing on RESTful services that have an
>> >>>> incremental ID, so my sink can just poll for new data and then ingest.
>> >>>>
>> >>>> I can't seem to find a connector that does this, and my gut instinct
>> >>>> tells me it's probably because it isn't possible due to something
>> >>>> completely obvious that I am missing.
>> >>>>
>> >>>> I know some RESTful APIs obfuscate the IDs to a hash of strings, and
>> >>>> that could be a problem, but since I'm planning on focusing on just
>> >>>> numerical IDs that get incremented, I think I won't be facing that
>> >>>> issue.
>> >>>>
>> >>>> Can anyone let me know if this sounds like a daft idea? Will I need
>> >>>> something like Kafka or Kinesis as a buffer and redundancy, or am I
>> >>>> overthinking this?
>> >>>>
>> >>>> I would love to bounce ideas with people who run Structured
>> >>>> Streaming jobs in production.
>> >>>>
>> >>>> Kind regards
>> >>>> Sam
>> >>>>
>> >>
>> >> --
>> >> Twitter: https://twitter.com/holdenkarau
>> >> Books (Learning Spark, High Performance Spark, etc.):
>> >> https://amzn.to/2MaRAG9
>> >> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>