Hi Folks,

Great discussion! I will take rate-limiting into account and make it
configurable for the HTTP requests.
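
To make that concrete, here is a rough sketch of the kind of configurable
throttle I have in mind - the class name and numbers are placeholders, not
a final design:

// Placeholder throttle: enforces a minimum interval between HTTP calls.
class Throttle(maxRequestsPerSecond: Int) {
  private val minIntervalMillis = 1000L / maxRequestsPerSecond
  private var lastCallMillis = 0L

  def acquire(): Unit = synchronized {
    val waitMillis = lastCallMillis + minIntervalMillis - System.currentTimeMillis()
    if (waitMillis > 0) Thread.sleep(waitMillis)
    lastCallMillis = System.currentTimeMillis()
  }
}

// The sink would call throttle.acquire() before every outbound request.
val throttle = new Throttle(maxRequestsPerSecond = 10)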

I was wondering if there is anything I might have missed that would make
this technically impossible, or at least difficult enough not to warrant
the effort. Also, would this be useful to people?

My thinking, from a business perspective: why make people wait until the
next scheduled batch run for data that is already available from an API?
You could run a batch job every minute or hour, but that in itself sounds
like a streaming use case.

Thoughts?

Regards
Sam

On Thu, Jul 2, 2020 at 3:31 AM Burak Yavuz <brk...@gmail.com> wrote:

> Well, the difference is that a technical user writes the UDF, while a
> non-technical user may use this built-in thing, misconfigure it, and
> shoot themselves in the foot.
>
> On Wed, Jul 1, 2020, 6:40 PM Andrew Melo <andrew.m...@gmail.com> wrote:
>
>> On Wed, Jul 1, 2020 at 8:13 PM Burak Yavuz <brk...@gmail.com> wrote:
>> >
>> > I'm not sure having a built-in sink that allows you to DDoS servers is
>> > the best idea either. foreachWriter is typically used for such use
>> > cases, not foreachBatch. It's also pretty hard to guarantee
>> > exactly-once, rate limiting, etc.
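>> >
>> > For reference, the foreachWriter shape I mean looks roughly like this
>> > - a minimal sketch, given a streaming DataFrame df, with a placeholder
>> > endpoint and naive row serialization:
>> >
>> > import java.net.URI
>> > import java.net.http.{HttpClient, HttpRequest, HttpResponse}
>> > import org.apache.spark.sql.{ForeachWriter, Row}
>> >
>> > val restWriter = new ForeachWriter[Row] {
>> >   @transient private var client: HttpClient = _
>> >
>> >   override def open(partitionId: Long, epochId: Long): Boolean = {
>> >     client = HttpClient.newHttpClient()
>> >     true // returning false would skip this partition/epoch
>> >   }
>> >
>> >   override def process(row: Row): Unit = {
>> >     val request = HttpRequest.newBuilder()
>> >       .uri(URI.create("https://example.com/ingest")) // placeholder URL
>> >       .POST(HttpRequest.BodyPublishers.ofString(row.mkString(","))) // naive
>> >       .build()
>> >     client.send(request, HttpResponse.BodyHandlers.ofString())
>> >   }
>> >
>> >   override def close(errorOrNull: Throwable): Unit = () // nothing to release
>> > }
>> >
>> > df.writeStream.foreach(restWriter).start()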
>>
>> If you control the machines and can run arbitrary code, you can DDoS
>> whatever you want. What's the difference between this proposal and
>> writing a UDF that opens 1,000 connections to a target machine?
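>>
>> Something along these lines already works today - a sketch, with a
>> placeholder URL:
>>
>> import java.net.URI
>> import java.net.http.{HttpClient, HttpRequest, HttpResponse}
>> import org.apache.spark.sql.functions.udf
>>
>> // A plain UDF like this lets a user hammer any endpoint, one request
>> // per row, with no built-in sink involved at all.
>> val httpGet = udf { (id: Long) =>
>>   val client = HttpClient.newHttpClient() // new client per call: the foot-gun
>>   val request = HttpRequest.newBuilder()
>>     .uri(URI.create(s"https://example.com/items/$id")) // placeholder
>>     .build()
>>   client.send(request, HttpResponse.BodyHandlers.ofString()).body()
>> }
>>
>> // df.withColumn("payload", httpGet(col("id")))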
>>
>> > Best,
>> > Burak
>> >
>> > On Wed, Jul 1, 2020 at 5:54 PM Holden Karau <hol...@pigscanfly.ca> wrote:
>> >>
>> >> I think adding something like this (if it doesn't already exist) could
>> >> help make Structured Streaming easier to use; foreachBatch is not the
>> >> best API.
>> >>
>> >> On Wed, Jul 1, 2020 at 2:21 PM Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote:
>> >>>
>> >>> I guess the method, query parameters, headers, and payload would all
>> >>> be different for almost every use case - that makes it hard to
>> >>> generalize, and the implementation would need to be quite complicated
>> >>> to be flexible enough.
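>> >>>
>> >>> Just to illustrate the surface area such a sink would need to expose
>> >>> (all of these names are hypothetical):
>> >>>
>> >>> // Hypothetical knobs a generic REST sink would have to cover.
>> >>> case class RestSinkConfig(
>> >>>   url: String,
>> >>>   method: String = "POST",                  // GET, POST, PUT, PATCH, ...
>> >>>   headers: Map[String, String] = Map.empty, // auth tokens, content type
>> >>>   queryParams: Map[String, String] = Map.empty,
>> >>>   maxRequestsPerSecond: Option[Int] = None  // rate limiting
>> >>> )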
>> >>>
>> >>> I'm not aware of any custom sink implementing REST, so your best bet
>> >>> would be simply implementing your own with foreachBatch - but someone
>> >>> might jump in and provide a pointer if there is something in the Spark
>> >>> ecosystem.
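>> >>>
>> >>> A minimal sketch of that foreachBatch approach, assuming a streaming
>> >>> DataFrame df and a placeholder endpoint:
>> >>>
>> >>> val postBatch = (batch: org.apache.spark.sql.DataFrame, batchId: Long) => {
>> >>>   val client = java.net.http.HttpClient.newHttpClient()
>> >>>   // collect() is only reasonable for small micro-batches; a real
>> >>>   // implementation would write per partition and handle retries.
>> >>>   batch.toJSON.collect().foreach { json =>
>> >>>     val request = java.net.http.HttpRequest.newBuilder()
>> >>>       .uri(java.net.URI.create("https://example.com/ingest")) // placeholder
>> >>>       .header("Content-Type", "application/json")
>> >>>       .POST(java.net.http.HttpRequest.BodyPublishers.ofString(json))
>> >>>       .build()
>> >>>     client.send(request,
>> >>>       java.net.http.HttpResponse.BodyHandlers.ofString())
>> >>>   }
>> >>> }
>> >>>
>> >>> df.writeStream.foreachBatch(postBatch).start()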
>> >>>
>> >>> Thanks,
>> >>> Jungtaek Lim (HeartSaVioR)
>> >>>
>> >>> On Thu, Jul 2, 2020 at 3:21 AM Sam Elamin <hussam.ela...@gmail.com> wrote:
>> >>>>
>> >>>> Hi All,
>> >>>>
>> >>>>
>> >>>> We ingest data from a lot of RESTful APIs into our lake, and I'm
>> >>>> wondering if it is at all possible to create a REST sink in
>> >>>> Structured Streaming?
>> >>>>
>> >>>> For now I'm only focusing on RESTful services that have an
>> >>>> incremental ID, so my sink can just poll for new data and then ingest it.
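>> >>>>
>> >>>> Roughly this shape, as a sketch - the endpoint, the since_id query
>> >>>> parameter, and the extractMaxId helper are all hypothetical:
>> >>>>
>> >>>> val client = java.net.http.HttpClient.newHttpClient()
>> >>>> var lastSeenId = 0L // a real source would checkpoint this cursor
>> >>>>
>> >>>> while (true) {
>> >>>>   val request = java.net.http.HttpRequest.newBuilder()
>> >>>>     .uri(java.net.URI.create(
>> >>>>       s"https://example.com/records?since_id=$lastSeenId")) // placeholder
>> >>>>     .build()
>> >>>>   val body = client.send(request,
>> >>>>     java.net.http.HttpResponse.BodyHandlers.ofString()).body()
>> >>>>   // Parse the records, hand them downstream, and advance the cursor.
>> >>>>   lastSeenId = extractMaxId(body, lastSeenId) // hypothetical helper
>> >>>>   Thread.sleep(60 * 1000) // poll interval
>> >>>> }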
>> >>>>
>> >>>> I can't seem to find a connector that does this, and my gut instinct
>> >>>> tells me it's probably because it isn't possible due to something
>> >>>> completely obvious that I am missing.
>> >>>>
>> >>>> I know some RESTful APIs obfuscate the IDs as hashed strings, and
>> >>>> that could be a problem, but since I'm planning to focus on just
>> >>>> numerical IDs that get incremented, I don't think I will face that issue.
>> >>>>
>> >>>>
>> >>>> Can anyone let me know if this sounds like a daft idea? Will I need
>> >>>> something like Kafka or Kinesis as a buffer and for redundancy, or am
>> >>>> I overthinking this?
>> >>>>
>> >>>>
>> >>>> I would love to bounce ideas around with people who run Structured
>> >>>> Streaming jobs in production.
>> >>>>
>> >>>>
>> >>>> Kind regards
>> >>>> Sam
>> >>>>
>> >>>>
>> >>
>> >>
>> >> --
>> >> Twitter: https://twitter.com/holdenkarau
>> >> Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
>> >> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>
>
