Unrelated to the actual question, but IIRC Dataflow workers have iptables
rules that drop all inbound traffic (other than a few exceptions).

In any case, do you actually need the server part to be "inside" the
pipeline?  Could you just use a JvmInitializer to launch the HTTP server and
do the Pub/Sub publishing there?
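
For what it's worth, here's a minimal sketch of that approach, using Beam's
JvmInitializer (registered through ServiceLoader, hence @AutoService) and the
JDK's built-in HTTP server. The port, path, and publishToPubsub helper are
all made up for illustration:

  import com.google.auto.service.AutoService;
  import com.sun.net.httpserver.HttpServer;
  import java.io.IOException;
  import java.net.InetSocketAddress;
  import org.apache.beam.sdk.harness.JvmInitializer;
  import org.apache.beam.sdk.options.PipelineOptions;

  // Starts one HTTP server per worker JVM, outside of any DoFn.
  @AutoService(JvmInitializer.class)
  public class IngestServerInitializer implements JvmInitializer {
    @Override
    public void beforeProcessing(PipelineOptions options) {
      try {
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        server.createContext("/ingest", exchange -> {
          byte[] body = exchange.getRequestBody().readAllBytes();
          publishToPubsub(body); // hypothetical helper; block until published
          exchange.sendResponseHeaders(200, -1); // ack only after publish
          exchange.close();
        });
        server.setExecutor(null); // use the default executor
        server.start();
      } catch (IOException e) {
        throw new RuntimeException("Could not start ingest server", e);
      }
    }

    private static void publishToPubsub(byte[] body) {
      // Stub: would wrap com.google.cloud.pubsub.v1.Publisher and block on
      // the publish future so the client ack implies durability in Pubsub.
    }
  }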

On Thu, Oct 7, 2021 at 12:53 PM Luke Cwik <[email protected]> wrote:

> I would suggest that you instead write the requests received within the
> splittable DoFn directly to a queue-based sink and, in another part of the
> pipeline, read from that queue. For example, if you were using Pubsub for the
> queue, your pipeline would look like:
> Create(LB config + pubsub topic A) -> ParDo(SDF get request from client
> and write to pubsub and then ack client)
> Pubsub(Read from A) -> ?Deduplicate? -> ... downstream processing ...
> Since the SDF will write to Pubsub before it acknowledges the message, you
> may write data that is not acked, and once the client retries you'll publish
> a duplicate. If downstream processing is not resilient to duplicates, then
> you'll want to have some unique piece of information to deduplicate on.
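>
> A minimal sketch of the read side, assuming the SDF stamps each published
> message with a unique "request-id" attribute (the attribute name and topic
> are made up here) that the Deduplicate transform can key on:
>
>   import org.apache.beam.sdk.Pipeline;
>   import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
>   import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessage;
>   import org.apache.beam.sdk.options.PipelineOptionsFactory;
>   import org.apache.beam.sdk.transforms.Deduplicate;
>   import org.apache.beam.sdk.values.TypeDescriptors;
>
>   public class ReadSide {
>     public static void main(String[] args) {
>       Pipeline pipeline =
>           Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
>       pipeline
>           // Read from topic A, keeping attributes for deduplication.
>           .apply("ReadFromA", PubsubIO.readMessagesWithAttributes()
>               .fromTopic("projects/my-project/topics/A")) // hypothetical topic
>           // Drop retry-induced duplicates by the unique request id.
>           .apply("Dedup", Deduplicate
>               .<PubsubMessage, String>withRepresentativeValueFn(
>                   m -> m.getAttribute("request-id")) // hypothetical attribute
>               .withRepresentativeType(TypeDescriptors.strings()));
>       // ... downstream processing hangs off the deduplicated collection ...
>       pipeline.run();
>     }
>   }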
>
> Does the order in which you get requests from a client, or across clients,
> matter?
> If yes, then be aware that parallel processing will impact the order in
> which you see things, and you might need to have the data sorted/ordered
> within the pipeline.
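>
> If ordering does matter, one option is to key by client and sort each group
> explicitly; a rough sketch, where Request, clientId, and sequenceNumber are
> all hypothetical:
>
>   import java.io.Serializable;
>   import java.util.ArrayList;
>   import java.util.Comparator;
>   import java.util.List;
>   import org.apache.beam.sdk.transforms.DoFn;
>   import org.apache.beam.sdk.values.KV;
>
>   // Hypothetical request type carrying a per-client sequence number.
>   class Request implements Serializable {
>     String clientId;
>     long sequenceNumber;
>   }
>
>   // After a windowed GroupByKey on clientId, emit each client's requests
>   // in sequence order, since parallel bundles don't preserve arrival order.
>   class SortPerClientFn extends DoFn<KV<String, Iterable<Request>>, Request> {
>     @ProcessElement
>     public void process(@Element KV<String, Iterable<Request>> element,
>                         OutputReceiver<Request> out) {
>       List<Request> sorted = new ArrayList<>();
>       element.getValue().forEach(sorted::add);
>       sorted.sort(Comparator.comparingLong(r -> r.sequenceNumber));
>       sorted.forEach(out::output);
>     }
>   }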
>
>
>
> On Wed, Oct 6, 2021 at 3:56 PM Daniel Collins <[email protected]>
> wrote:
>
>> Hi all,
>>
>> Bear with me, this is a bit of a weird one. I've been toying around with
>> an idea to do HTTP ingestion using a Beam (specifically Dataflow) pipeline.
>> The concept would be that you spin up an HTTP server on each running task
>> with a well-known port, as a static member of some class in the JAR (or upon
>> initialization of an SDF the first time), then accept requests, but don't
>> acknowledge them back to the client until the bundle finalizer
>> <https://javadoc.io/static/org.apache.beam/beam-sdks-java-core/2.29.0/org/apache/beam/sdk/transforms/DoFn.BundleFinalizer.html>
>> runs, so you know they're persisted / have moved down the pipeline. You
>> could then use a load balancer pointed at the instance group created by
>> Dataflow as the target for incoming requests, and create a PCollection
>> from incoming user requests.
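>>
>> To make the ack-after-commit part concrete, here's a rough sketch,
>> simplified to a plain DoFn; PendingRequest and the per-JVM queue the HTTP
>> handler would feed are hypothetical:
>>
>>   import java.util.ArrayList;
>>   import java.util.List;
>>   import java.util.concurrent.BlockingQueue;
>>   import java.util.concurrent.LinkedBlockingQueue;
>>   import org.apache.beam.sdk.transforms.DoFn;
>>   import org.joda.time.Duration;
>>   import org.joda.time.Instant;
>>
>>   class AckAfterCommitFn extends DoFn<Long, String> {
>>     // Hypothetical per-JVM handoff point: the HTTP handler enqueues each
>>     // request body together with a callback that completes the response.
>>     static final BlockingQueue<PendingRequest> QUEUE =
>>         new LinkedBlockingQueue<>();
>>
>>     static class PendingRequest {
>>       final String body;
>>       final Runnable ack; // completes the client's HTTP response
>>       PendingRequest(String body, Runnable ack) {
>>         this.body = body;
>>         this.ack = ack;
>>       }
>>     }
>>
>>     @ProcessElement
>>     public void process(OutputReceiver<String> out,
>>                         BundleFinalizer finalizer) {
>>       List<PendingRequest> batch = new ArrayList<>();
>>       QUEUE.drainTo(batch, 100);
>>       for (PendingRequest r : batch) {
>>         out.output(r.body);
>>       }
>>       // Ack clients only after the runner durably commits this bundle.
>>       finalizer.afterBundleCommit(
>>           Instant.now().plus(Duration.standardMinutes(5)),
>>           () -> batch.forEach(r -> r.ack.run()));
>>     }
>>   }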
>>
>> The only part of this I don't think would work is preventing user
>> requests from being stranded on a server that, due to load-balancing
>> constraints, will never run the SDF that would complete them. So my
>> question is: is there a way to force an SDF to be run on every task where
>> the JAR is loaded?
>>
>> Thanks!
>>
>> -Dan
>>
>
