Any further thoughts here? Or tips on profiling Beam DirectRunner?
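In the meantime, one low-tech option for seeing what the DirectRunner is
churning on is to periodically dump thread stacks from inside the same JVM;
below is a rough sketch (StackDumper is just a throwaway helper, not part of
Beam):

import java.util.Map;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class StackDumper {
  // Periodically print every thread's stack so an apparently idle pipeline
  // can be inspected without attaching an external profiler.
  public static void start(long periodSeconds) {
    ScheduledExecutorService exec =
        Executors.newSingleThreadScheduledExecutor(r -> {
          Thread t = new Thread(r, "stack-dumper");
          t.setDaemon(true); // don't keep the JVM alive after the pipeline ends
          return t;
        });
    exec.scheduleAtFixedRate(() -> {
      System.err.println("==== stack dump ====");
      for (Map.Entry<Thread, StackTraceElement[]> entry :
          Thread.getAllStackTraces().entrySet()) {
        System.err.println(entry.getKey());
        for (StackTraceElement frame : entry.getValue()) {
          System.err.println("    at " + frame);
        }
      }
    }, periodSeconds, periodSeconds, TimeUnit.SECONDS);
  }
}

Calling StackDumper.start(30) right before pipeline.run() prints a dump every
30 seconds, which at least shows which transforms or IO threads are busy.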

Thanks,
Evan

On Wed, May 12, 2021 at 6:22 PM Evan Galpin <evan.gal...@gmail.com> wrote:

> Ok, gotcha. In my tests, all SDK versions 2.25.0 and higher exhibit slow
> behaviour regardless of use_deprecated_read. Not sure if that points to
> something different, then.
>
> Thanks,
> Evan
>
> On Wed, May 12, 2021 at 18:16 Steve Niemitz <sniem...@apache.org> wrote:
>
>> I think it was only broken in 2.29.
>>
>> On Wed, May 12, 2021 at 5:53 PM Evan Galpin <evan.gal...@gmail.com>
>> wrote:
>>
>>> Ah, ok, thanks for that. Do you mean use_deprecated_read is broken
>>> specifically in 2.29.0 (a regression), or broken in all versions up to and
>>> including 2.29.0 (i.e. it never worked)?
>>>
>>> Thanks,
>>> Evan
>>>
>>> On Wed, May 12, 2021 at 17:12 Steve Niemitz <sniem...@apache.org> wrote:
>>>
>>>> Yeah, sorry, my email was confusing. use_deprecated_read is broken on
>>>> the DirectRunner in 2.29.
>>>>
>>>> The behavior you describe is exactly what I ran into as well when
>>>> reading from Pub/Sub with the new read method. I believe the default will
>>>> soon be reverted to the old read method (not using SDFs), which should
>>>> fix your performance issue.
>>>>
>>>> On Wed, May 12, 2021 at 4:40 PM Boyuan Zhang <boyu...@google.com>
>>>> wrote:
>>>>
>>>>> Hi Evan,
>>>>>
>>>>> It seems like the slow step is not the read that use_deprecated_read
>>>>> targets. Would you like to share your pipeline code, if possible?
>>>>>
>>>>> On Wed, May 12, 2021 at 1:35 PM Evan Galpin <evan.gal...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> I just tried with v2.29.0 and use_deprecated_read, but unfortunately I
>>>>>> observed the slow behavior again. Is it possible that
>>>>>> use_deprecated_read is broken in 2.29.0 as well?
>>>>>>
>>>>>> Thanks,
>>>>>> Evan
>>>>>>
>>>>>> On Wed, May 12, 2021 at 3:21 PM Steve Niemitz <sniem...@apache.org>
>>>>>> wrote:
>>>>>>
>>>>>>> Oops, sorry, I was off by 10... I meant 2.29, not 2.19.
>>>>>>>
>>>>>>> On Wed, May 12, 2021 at 2:55 PM Evan Galpin <evan.gal...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Thanks for the link/info. v2.19.0 and v2.21.0 did exhibit the
>>>>>>>> "faster" behavior, as did v2.23.0. But that "fast" behavior stopped at
>>>>>>>> v2.25.0 (for my use case at least), regardless of the
>>>>>>>> use_deprecated_read setting.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Evan
>>>>>>>>
>>>>>>>>
>>>>>>>> On Wed, May 12, 2021 at 2:47 PM Steve Niemitz <sniem...@apache.org>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> use_deprecated_read was broken in 2.19 on the direct runner and
>>>>>>>>> didn't do anything. [1] I don't think the fix is in 2.20 either, but
>>>>>>>>> will be in 2.21.
>>>>>>>>>
>>>>>>>>> [1] https://github.com/apache/beam/pull/14469
>>>>>>>>>
>>>>>>>>> On Wed, May 12, 2021 at 1:41 PM Evan Galpin <evan.gal...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> I forgot to also mention that in all tests I was setting
>>>>>>>>>> --experiments=use_deprecated_read
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Evan
>>>>>>>>>>
>>>>>>>>>> On Wed, May 12, 2021 at 1:39 PM Evan Galpin <
>>>>>>>>>> evan.gal...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hmm, I think I spoke too soon. I'm still seeing an issue of
>>>>>>>>>>> overall DirectRunner slowness, not just with Pub/Sub. I have a
>>>>>>>>>>> pipeline like so:
>>>>>>>>>>>
>>>>>>>>>>> Read pubsub  |  extract GCS glob patterns  |  FileIO.matchAll()
>>>>>>>>>>> |  FileIO.readMatches()  |  Read file contents  |  etc
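>>>>>>>>>>>
>>>>>>>>>>> Roughly, in Java, that section looks like the sketch below
>>>>>>>>>>> (ExtractGlobsFn and ReadFileContentsFn are placeholder names for my
>>>>>>>>>>> own DoFns, and "subscription" is the subscription path):
>>>>>>>>>>>
>>>>>>>>>>> pipeline
>>>>>>>>>>>     .apply("ReadPubsub",
>>>>>>>>>>>         PubsubIO.readStrings().fromSubscription(subscription))
>>>>>>>>>>>     .apply("ExtractGlobs", ParDo.of(new ExtractGlobsFn()))
>>>>>>>>>>>     .apply("MatchAll", FileIO.matchAll())
>>>>>>>>>>>     .apply("ReadMatches", FileIO.readMatches())
>>>>>>>>>>>     .apply("ReadContents", ParDo.of(new ReadFileContentsFn()))
>>>>>>>>>>>     // ... etc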
>>>>>>>>>>>
>>>>>>>>>>> I have temporarily set up a transform between each step to log
>>>>>>>>>>> what's going on and illustrate the timing issues. I ran a series of
>>>>>>>>>>> tests, changing only the SDK version each time (effectively a
>>>>>>>>>>> manual git-bisect), since I hadn't noticed this performance issue
>>>>>>>>>>> with 2.19.0. Before each test, I seeded the Pub/Sub subscription
>>>>>>>>>>> with the exact same contents.
>>>>>>>>>>>
>>>>>>>>>>> SDK versions 2.25.0 and onward (I had a build issue with 2.24.0
>>>>>>>>>>> that I couldn't resolve) show a significant slowdown.
>>>>>>>>>>>
>>>>>>>>>>> Here is a snippet of logging from v2.25.0:
>>>>>>>>>>>
>>>>>>>>>>> *May 12, 2021 11:11:52 A.M.* com.myOrg.myPipeline.PipelineLeg$1
>>>>>>>>>>> processElement
>>>>>>>>>>> INFO: Got file pattern: gs://my-bucket/my-dir/5004728247517184/**
>>>>>>>>>>> May 12, 2021 11:16:59 A.M.
>>>>>>>>>>> org.apache.beam.sdk.io.FileIO$MatchAll$MatchFn process
>>>>>>>>>>> INFO: Matched 2 files for pattern
>>>>>>>>>>> gs://my-bucket/my-dir/5004728247517184/**
>>>>>>>>>>> May 12, 2021 11:23:32 A.M. com.myOrg.myPipeline.PipelineLeg$3
>>>>>>>>>>> processElement
>>>>>>>>>>> INFO: Got ReadableFile: my-file1.json
>>>>>>>>>>> May 12, 2021 11:23:32 A.M. com.myOrg.myPipeline.PipelineLeg$3
>>>>>>>>>>> processElement
>>>>>>>>>>> INFO: Got ReadableFile: my-file2.json
>>>>>>>>>>> May 12, 2021 11:24:35 A.M. com.myOrg.myPipeline.PipelineLeg$4
>>>>>>>>>>> processElement
>>>>>>>>>>> INFO: Got file contents for document_id my-file1.json
>>>>>>>>>>> *May 12, 2021 11:24:35 A.M*. com.myOrg.myPipeline.PipelineLeg$4
>>>>>>>>>>> processElement
>>>>>>>>>>> INFO: Got file contents for document_id my-file2.json
>>>>>>>>>>>
>>>>>>>>>>> Note that end-to-end, these steps took about *13 minutes*. With
>>>>>>>>>>> SDK 2.23.0 and identical user code, the same section of the 
>>>>>>>>>>> pipeline took *2
>>>>>>>>>>> seconds*:
>>>>>>>>>>>
>>>>>>>>>>> *May 12, 2021 11:03:39 A.M.* com.myOrg.myPipeline.PipelineLeg$1
>>>>>>>>>>> processElement
>>>>>>>>>>> INFO: Got file pattern: gs://my-bucket/my-dir/5004728247517184/**
>>>>>>>>>>> May 12, 2021 11:03:40 A.M.
>>>>>>>>>>> org.apache.beam.sdk.io.FileIO$MatchAll$MatchFn process
>>>>>>>>>>> INFO: Matched 2 files for pattern
>>>>>>>>>>> gs://my-bucket/my-dir/5004728247517184/**
>>>>>>>>>>> May 12, 2021 11:03:40 A.M. com.myOrg.myPipeline.PipelineLeg$3
>>>>>>>>>>> processElement
>>>>>>>>>>> INFO: Got ReadableFile: my-file1.json
>>>>>>>>>>> May 12, 2021 11:03:40 A.M. com.myOrg.myPipeline.PipelineLeg$3
>>>>>>>>>>> processElement
>>>>>>>>>>> INFO: Got ReadableFile: my-file2.json
>>>>>>>>>>> May 12, 2021 11:03:41 A.M. com.myOrg.myPipeline.PipelineLeg$4
>>>>>>>>>>> processElement
>>>>>>>>>>> INFO: Got file contents for document_id my-file1.json
>>>>>>>>>>> *May 12, 2021 11:03:41 A.M.* com.myOrg.myPipeline.PipelineLeg$4
>>>>>>>>>>> processElement
>>>>>>>>>>> INFO: Got file contents for document_id my-file2.json
>>>>>>>>>>>
>>>>>>>>>>> Any thoughts on what could be causing this?
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Evan
>>>>>>>>>>>
>>>>>>>>>>> On Wed, May 12, 2021 at 9:53 AM Evan Galpin <
>>>>>>>>>>> evan.gal...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, May 10, 2021 at 2:09 PM Boyuan Zhang <
>>>>>>>>>>>> boyu...@google.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Evan,
>>>>>>>>>>>>>
>>>>>>>>>>>>> What do you mean by startup delay? Is it the time from when you
>>>>>>>>>>>>> start the pipeline to the time that you notice the first output
>>>>>>>>>>>>> record from PubSub?
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Yes, that's what I meant: the system appears idle while waiting
>>>>>>>>>>>> for Pub/Sub output, despite data being in the subscription at
>>>>>>>>>>>> pipeline start time.
>>>>>>>>>>>>
>>>>>>>>>>>> On Sat, May 8, 2021 at 12:50 AM Ismaël Mejía <ieme...@gmail.com>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Can you try running direct runner with the option
>>>>>>>>>>>>>> `--experiments=use_deprecated_read`
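>>>>>>>>>>>>>>
>>>>>>>>>>>>>> For example (just a sketch of the wiring), the experiment is
>>>>>>>>>>>>>> picked up from the normal program args that
>>>>>>>>>>>>>> PipelineOptionsFactory already parses:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> // launched with: --runner=DirectRunner --experiments=use_deprecated_read
>>>>>>>>>>>>>> PipelineOptions options =
>>>>>>>>>>>>>>     PipelineOptionsFactory.fromArgs(args).withValidation().create();
>>>>>>>>>>>>>> Pipeline pipeline = Pipeline.create(options);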
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>> This seems to work for me, thanks for this! 👍
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>>> Seems like an instance of
>>>>>>>>>>>>>> https://issues.apache.org/jira/browse/BEAM-10670?focusedCommentId=17316858&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17316858
>>>>>>>>>>>>>> also reported in
>>>>>>>>>>>>>> https://lists.apache.org/thread.html/re6b0941a8b4951293a0327ce9b25e607cafd6e45b69783f65290edee%40%3Cdev.beam.apache.org%3E
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> We should roll back the use of the SDF wrapper by default
>>>>>>>>>>>>>> because of the usability and performance issues reported.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Sat, May 8, 2021 at 12:57 AM Evan Galpin <
>>>>>>>>>>>>>> evan.gal...@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I’m experiencing very slow performance and startup delay when
>>>>>>>>>>>>>>> testing a pipeline locally. I’m reading data from a Google
>>>>>>>>>>>>>>> Pub/Sub subscription as the data source, and before each
>>>>>>>>>>>>>>> pipeline execution I ensure that data is present in the
>>>>>>>>>>>>>>> subscription (visible in the GCP console).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I’m seeing startup delay on the order of minutes with
>>>>>>>>>>>>>>> DirectRunner (5-10 min). Is that expected? I did find a Jira 
>>>>>>>>>>>>>>> ticket[1] that
>>>>>>>>>>>>>>> at first seemed related, but I think it has more to do with BQ 
>>>>>>>>>>>>>>> than
>>>>>>>>>>>>>>> DirectRunner.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I’ve run the pipeline with a debugger connected and
>>>>>>>>>>>>>>> confirmed that it’s minutes before the first DoFn in my 
>>>>>>>>>>>>>>> pipeline receives
>>>>>>>>>>>>>>> any data. Is there a way I can profile the direct runner to see 
>>>>>>>>>>>>>>> what it’s
>>>>>>>>>>>>>>> churning on?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>> Evan
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> https://issues.apache.org/jira/plugins/servlet/mobile#issue/BEAM-4548
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
