Any further thoughts here? Or tips on profiling Beam DirectRunner?

Thanks,
Evan
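(On the profiling question: one crude, generic JVM technique, not something confirmed anywhere in this thread, is to periodically sample thread stacks from inside the process while the pipeline runs. A rough sketch:)

    import java.util.Map;

    public class StackSampler {
      // Start a daemon thread that dumps all thread stacks every
      // intervalMillis, so recurring frames show where time is spent.
      public static Thread start(long intervalMillis) {
        Thread t = new Thread(() -> {
          try {
            while (!Thread.currentThread().isInterrupted()) {
              for (Map.Entry<Thread, StackTraceElement[]> e
                  : Thread.getAllStackTraces().entrySet()) {
                System.err.println("== " + e.getKey().getName());
                for (StackTraceElement frame : e.getValue()) {
                  System.err.println("    at " + frame);
                }
              }
              Thread.sleep(intervalMillis);
            }
          } catch (InterruptedException ignored) {
            // Stop sampling when interrupted.
          }
        }, "stack-sampler");
        t.setDaemon(true);
        t.start();
        return t;
      }
    }

(Calling StackSampler.start(5_000) just before p.run() and watching which frames recur across samples shows what the runner is churning on; jstack <pid> or Java Flight Recorder give the same information with less code.)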
On Wed, May 12, 2021 at 6:22 PM Evan Galpin <evan.gal...@gmail.com> wrote:

Ok gotcha. In my tests, all SDK versions 2.25.0 and higher exhibit slow behaviour regardless of use_deprecated_read. Not sure if that points to something different, then.

Thanks,
Evan

On Wed, May 12, 2021 at 18:16 Steve Niemitz <sniem...@apache.org> wrote:

I think it was only broken in 2.29.

On Wed, May 12, 2021 at 5:53 PM Evan Galpin <evan.gal...@gmail.com> wrote:

Ah ok, thanks for that. Do you mean use_deprecated_read is broken specifically in 2.29.0 (a regression), or broken in all versions up to and including 2.29.0 (i.e. never worked)?

Thanks,
Evan

On Wed, May 12, 2021 at 17:12 Steve Niemitz <sniem...@apache.org> wrote:

Yeah, sorry, my email was confusing. use_deprecated_read is broken on the DirectRunner in 2.29.

The behavior you describe is exactly the behavior I ran into as well when reading from pubsub with the new read method. I believe the default is soon being reverted back to the old read method (not using SDFs), which will fix your performance issue.

On Wed, May 12, 2021 at 4:40 PM Boyuan Zhang <boyu...@google.com> wrote:

Hi Evan,

It seems like the slow step is not the read that use_deprecated_read targets. Would you like to share your pipeline code if possible?

On Wed, May 12, 2021 at 1:35 PM Evan Galpin <evan.gal...@gmail.com> wrote:

I just tried with v2.29.0 and use_deprecated_read, but unfortunately I observed slow behavior again. Is it possible that use_deprecated_read is broken in 2.29.0 as well?

Thanks,
Evan

On Wed, May 12, 2021 at 3:21 PM Steve Niemitz <sniem...@apache.org> wrote:

Oops, sorry, I was off by 10... I meant 2.29, not 2.19.

On Wed, May 12, 2021 at 2:55 PM Evan Galpin <evan.gal...@gmail.com> wrote:

Thanks for the link/info. v2.19.0 and v2.21.0 did exhibit the "faster" behavior, as did v2.23.0. But that "fast" behavior stopped at v2.25.0 (for my use case at least) regardless of the use_deprecated_read setting.

Thanks,
Evan

On Wed, May 12, 2021 at 2:47 PM Steve Niemitz <sniem...@apache.org> wrote:

use_deprecated_read was broken in 2.19 on the direct runner and didn't do anything. [1] I don't think the fix is in 2.20 either, but it will be in 2.21.

[1] https://github.com/apache/beam/pull/14469

On Wed, May 12, 2021 at 1:41 PM Evan Galpin <evan.gal...@gmail.com> wrote:

I forgot to also mention that in all tests I was setting --experiments=use_deprecated_read

Thanks,
Evan

On Wed, May 12, 2021 at 1:39 PM Evan Galpin <evan.gal...@gmail.com> wrote:

Hmm, I think I spoke too soon. I'm still seeing an issue of overall DirectRunner slowness, not just pubsub. I have a pipeline like so:

Read pubsub | extract GCS glob patterns | FileIO.matchAll() | FileIO.readMatches() | Read file contents | etc

I have temporarily set up a transform between each step to log what's going on and illustrate timing issues.
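(The actual pipeline code was never shared in the thread; a minimal sketch consistent with the shape above, with a hypothetical subscription, bucket, and glob-derivation step, and one of the temporary logging transforms included, might look like this:)

    import java.io.IOException;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.FileIO;
    import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.MapElements;
    import org.apache.beam.sdk.transforms.ParDo;
    import org.apache.beam.sdk.values.TypeDescriptors;

    public class PatternFanOutPipeline {
      public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        p.apply("ReadPubsub", PubsubIO.readStrings()
                .fromSubscription("projects/my-project/subscriptions/my-sub"))
            // Hypothetical: derive a GCS glob pattern from each message.
            .apply("ExtractGlob", MapElements.into(TypeDescriptors.strings())
                .via((String msg) -> "gs://my-bucket/my-dir/" + msg + "/**"))
            // A temporary pass-through logging transform, as described above.
            .apply("LogPattern", ParDo.of(new DoFn<String, String>() {
              @ProcessElement
              public void processElement(@Element String pattern, OutputReceiver<String> out) {
                System.out.println("Got file pattern: " + pattern);
                out.output(pattern);
              }
            }))
            .apply("MatchAll", FileIO.matchAll())
            .apply("ReadMatches", FileIO.readMatches())
            // Read each matched file's contents as a string.
            .apply("ReadContents", MapElements.into(TypeDescriptors.strings())
                .via((FileIO.ReadableFile f) -> {
                  try {
                    return f.readFullyAsUTF8String();
                  } catch (IOException e) {
                    throw new RuntimeException(e);
                  }
                }));
        // ... etc ...

        p.run().waitUntilFinish();
      }
    }

(With a logging step between transforms like this, the gap between the "Got file pattern" and "Matched N files" timestamps localizes the slowdown to a specific stage, which is exactly what the logs below show.)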
I ran a series of tests changing only the SDK version each time, since I hadn't noticed this performance issue with 2.19.0 (effectively git-bisect). Before each test, I seeded the pubsub subscription with the exact same contents.

SDK version 2.25.0 (I had a build issue with 2.24.0 that I couldn't seem to resolve) and onward show a significant slowdown.

Here is a snippet of logging from v2.25.0:

*May 12, 2021 11:11:52 A.M.* com.myOrg.myPipeline.PipelineLeg$1 processElement
INFO: Got file pattern: gs://my-bucket/my-dir/5004728247517184/**
May 12, 2021 11:16:59 A.M. org.apache.beam.sdk.io.FileIO$MatchAll$MatchFn process
INFO: Matched 2 files for pattern gs://my-bucket/my-dir/5004728247517184/**
May 12, 2021 11:23:32 A.M. com.myOrg.myPipeline.PipelineLeg$3 processElement
INFO: Got ReadableFile: my-file1.json
May 12, 2021 11:23:32 A.M. com.myOrg.myPipeline.PipelineLeg$3 processElement
INFO: Got ReadableFile: my-file2.json
May 12, 2021 11:24:35 A.M. com.myOrg.myPipeline.PipelineLeg$4 processElement
INFO: Got file contents for document_id my-file1.json
*May 12, 2021 11:24:35 A.M.* com.myOrg.myPipeline.PipelineLeg$4 processElement
INFO: Got file contents for document_id my-file2.json

Note that end-to-end, these steps took about *13 minutes*. With SDK 2.23.0 and identical user code, the same section of the pipeline took *2 seconds*:

*May 12, 2021 11:03:39 A.M.* com.myOrg.myPipeline.PipelineLeg$1 processElement
INFO: Got file pattern: gs://my-bucket/my-dir/5004728247517184/**
May 12, 2021 11:03:40 A.M. org.apache.beam.sdk.io.FileIO$MatchAll$MatchFn process
INFO: Matched 2 files for pattern gs://my-bucket/my-dir/5004728247517184/**
May 12, 2021 11:03:40 A.M. com.myOrg.myPipeline.PipelineLeg$3 processElement
INFO: Got ReadableFile: my-file1.json
May 12, 2021 11:03:40 A.M. com.myOrg.myPipeline.PipelineLeg$3 processElement
INFO: Got ReadableFile: my-file2.json
May 12, 2021 11:03:41 A.M. com.myOrg.myPipeline.PipelineLeg$4 processElement
INFO: Got file contents for document_id my-file1.json
*May 12, 2021 11:03:41 A.M.* com.myOrg.myPipeline.PipelineLeg$4 processElement
INFO: Got file contents for document_id my-file2.json

Any thoughts on what could be causing this?

Thanks,
Evan

On Wed, May 12, 2021 at 9:53 AM Evan Galpin <evan.gal...@gmail.com> wrote:

> On Mon, May 10, 2021 at 2:09 PM Boyuan Zhang <boyu...@google.com> wrote:
>
> Hi Evan,
>
> What do you mean by startup delay? Is it the time from when you start the pipeline to the time that you notice the first output record from PubSub?
Yes, that's what I meant: the seemingly idle system waiting for pubsub output despite data being in the subscription at pipeline start time.

> On Sat, May 8, 2021 at 12:50 AM Ismaël Mejía <ieme...@gmail.com> wrote:
>
> Can you try running the direct runner with the option `--experiments=use_deprecated_read`?

This seems to work for me, thanks for this! 👍

> Seems like an instance of https://issues.apache.org/jira/browse/BEAM-10670?focusedCommentId=17316858&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17316858 also reported in https://lists.apache.org/thread.html/re6b0941a8b4951293a0327ce9b25e607cafd6e45b69783f65290edee%40%3Cdev.beam.apache.org%3E
>
> We should roll back using the SDF wrapper by default because of the usability and performance issues reported.

On Sat, May 8, 2021 at 12:57 AM Evan Galpin <evan.gal...@gmail.com> wrote:

Hi all,

I'm experiencing very slow performance and startup delay when testing a pipeline locally. I'm reading data from a Google PubSub subscription as the data source, and before each pipeline execution I ensure that data is present in the subscription (readable from the GCP console).

I'm seeing startup delay on the order of minutes with DirectRunner (5-10 min). Is that expected? I did find a Jira ticket [1] that at first seemed related, but I think it has more to do with BQ than DirectRunner.

I've run the pipeline with a debugger connected and confirmed that it's minutes before the first DoFn in my pipeline receives any data. Is there a way I can profile the direct runner to see what it's churning on?

Thanks,
Evan

[1] https://issues.apache.org/jira/plugins/servlet/mobile#issue/BEAM-4548
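(For reference, the `use_deprecated_read` experiment Ismaël suggests is an ordinary Beam pipeline option; a minimal sketch of setting it in code rather than on the command line, with pipeline construction elided:)

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.options.PipelineOptions;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;

    public class DeprecatedReadExample {
      public static void main(String[] args) {
        // Same effect as passing --experiments=use_deprecated_read on the
        // command line; DirectRunner is already the default runner and is
        // spelled out here only for explicitness.
        PipelineOptions options = PipelineOptionsFactory.fromArgs(
                "--runner=DirectRunner",
                "--experiments=use_deprecated_read")
            .create();
        Pipeline p = Pipeline.create(options);
        // ... build the pipeline, then run:
        p.run().waitUntilFinish();
      }
    }

(Per the exchange above, note that the flag is reported to be a no-op on the DirectRunner in some releases, at least 2.29.0, so results may vary by SDK version.)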