Thanks all. This discussion was very helpful; I will test all the ideas out.
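In case it helps anyone who finds this thread later, below is a rough sketch of how I plan to wire up Luke's first (non-splittable) suggestion in the Java SDK. The https://api.example.com endpoint, the response format, and the GenerateResourceIdsFn / ReadResourceFn names are just placeholders for our own REST API (there is no built-in connector behind them), and Create.of with a single dummy element stands in for Impulse. The HTTP calls are still synchronous inside each DoFn, but the Reshuffle lets the runner spread them across workers, as Damian and Luke described.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.Reshuffle;

public class RestIngestPipeline {

  // Calls the (placeholder) listing endpoint once and emits one resource id per element.
  static class GenerateResourceIdsFn extends DoFn<String, String> {
    private transient HttpClient client;

    @Setup
    public void setup() {
      client = HttpClient.newHttpClient();
    }

    @ProcessElement
    public void process(OutputReceiver<String> out) throws Exception {
      HttpRequest request =
          HttpRequest.newBuilder(URI.create("https://api.example.com/resources")).GET().build();
      HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
      // Placeholder response format: one resource id per line.
      for (String id : response.body().split("\n")) {
        if (!id.isBlank()) {
          out.output(id.trim());
        }
      }
    }
  }

  // Fetches one resource synchronously; the upstream Reshuffle lets the runner
  // rebalance these calls across workers.
  static class ReadResourceFn extends DoFn<String, String> {
    private transient HttpClient client;

    @Setup
    public void setup() {
      client = HttpClient.newHttpClient();
    }

    @ProcessElement
    public void process(@Element String id, OutputReceiver<String> out) throws Exception {
      HttpRequest request =
          HttpRequest.newBuilder(URI.create("https://api.example.com/resources/" + id))
              .GET()
              .build();
      HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
      out.output(response.body());
    }
  }

  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    p.apply("Seed", Create.of("seed"))  // single dummy element, standing in for Impulse
        .apply("GenerateResourceIds", ParDo.of(new GenerateResourceIdsFn()))
        .apply("Reshuffle", Reshuffle.viaRandomKey())
        .apply("ReadResourceIds", ParDo.of(new ReadResourceFn()));
        // ... parse, transform, and write the payloads downstream.

    p.run();
  }
}

If generating the ids or reading a single resource turns out to be too heavy for one worker, either DoFn could later be swapped for a splittable DoFn per the links Luke shared.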
On Wed, Jul 20, 2022 at 3:59 PM Chamikara Jayalath via user <user@beam.apache.org> wrote:

> On Wed, Jul 20, 2022 at 12:57 PM Chamikara Jayalath <chamik...@google.com> wrote:
>
>> I don't think it's an antipattern per se. You can implement arbitrary operations in a DoFn or an SDF to read data.
>>
>> But if a single resource ID maps to a large amount of data, Beam runners (including Dataflow) will be able to parallelize reading, hence your solution may have suboptimal performance compared to reading from a Beam source that can be fully parallelized.
>
> *will not be able to*
>
>> Thanks,
>> Cham
>>
>> On Wed, Jul 20, 2022 at 11:53 AM Shree Tanna <shree.ta...@gmail.com> wrote:
>>
>>> Thank you! I will try this out.
>>>
>>> One more question on this: is it considered an anti-pattern to do HTTP ingestion on GCP Dataflow, for the reasons I mentioned in my original message? I ask because I am getting that indication from some of my co-workers and also from Google Cloud support. Not sure if this is the right place to ask this question; happy to move this conversation somewhere else if not.
>>>
>>> On Tue, Jul 19, 2022 at 5:18 PM Luke Cwik via user <user@beam.apache.org> wrote:
>>>
>>>> Even if you don't have the resource ids ahead of time, you can have a pipeline like:
>>>> Impulse -> ParDo(GenerateResourceIds) -> Reshuffle -> ParDo(ReadResourceIds) -> ...
>>>>
>>>> You could also compose these as splittable DoFns [1, 2, 3]:
>>>> ParDo(SplittableGenerateResourceIds) -> ParDo(SplittableReadResourceIds)
>>>>
>>>> The first approach is the simplest: the reshuffle will rebalance the reading of each resource id across worker nodes, but it is limited to generating the resource ids on one worker. Making the generation a splittable DoFn means you can increase the parallelism of generation, which is important if there are so many ids that producing them could crash a worker or fail to have the output committed (these kinds of failures depend on how well the runner handles single bundles with large outputs). Making the reading splittable allows you to handle a large resource (imagine a large file) so that it can be read and processed in parallel (with similar failures if the runner can't handle single bundles with large outputs).
>>>>
>>>> You can always start with the first solution and swap either piece to a splittable DoFn depending on your performance requirements and how well the simple solution works.
>>>>
>>>> 1: https://beam.apache.org/blog/splittable-do-fn/
>>>> 2: https://beam.apache.org/blog/splittable-do-fn-is-available/
>>>> 3: https://beam.apache.org/documentation/programming-guide/#splittable-dofns
>>>>
>>>> On Tue, Jul 19, 2022 at 10:05 AM Damian Akpan <damianakpan2...@gmail.com> wrote:
>>>>
>>>>> Provided you have all the resource ids ahead of fetching, Beam will spread the fetches across its workers. Each fetch will still be synchronous, but only within its own worker.
>>>>>
>>>>> On Tue, Jul 19, 2022 at 5:40 PM Shree Tanna <shree.ta...@gmail.com> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> I'm planning to use Apache Beam for the extract and load parts of an ETL pipeline and to run the jobs on Dataflow. I will have to do REST API ingestion on our platform. I can opt to make synchronous API calls from a DoFn, but with that the pipeline will stall while REST requests are made over the network.
>>>>>>
>>>>>> Is it best practice to run the REST ingestion job on Dataflow? Is there any best practice I can follow to accomplish this? Just as a reference, I'm adding this StackOverflow thread <https://stackoverflow.com/questions/50335521/best-practices-in-http-calls-in-cloud-dataflow-java> here too. Also, I notice that a REST I/O transform built-in connector <https://beam.apache.org/documentation/io/built-in/> is in progress for Java.
>>>>>>
>>>>>> Let me know if this is the right group to ask this question. I can also ask d...@beam.apache.org if needed.
>>>>>>
>>>>>> --
>>>>>> Thanks,
>>>>>> Shree
>>>
>>> --
>>> Best,
>>> Shree

--
Best,
Shree