I just tried with v2.29.0 and use_deprecated_read, but unfortunately I observed the slow behavior again. Is it possible that use_deprecated_read is broken in 2.29.0 as well?
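For reference, here is roughly how the flag is wired up on my end (a minimal sketch, assuming the standard PipelineOptionsFactory flow; everything other than the flag itself is illustrative):

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class DeprecatedReadCheck {
  public static void main(String[] args) {
    // Invoked with: --experiments=use_deprecated_read
    // (no --runner flag, so DirectRunner is used by default)
    PipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().create();
    Pipeline pipeline = Pipeline.create(options);
    // ... transforms elided ...
    pipeline.run().waitUntilFinish();
  }
}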
Thanks,
Evan

On Wed, May 12, 2021 at 3:21 PM Steve Niemitz <sniem...@apache.org> wrote:

> Oops, sorry, I was off by 10... I meant 2.29, not 2.19.
>
> On Wed, May 12, 2021 at 2:55 PM Evan Galpin <evan.gal...@gmail.com> wrote:
>
>> Thanks for the link/info. v2.19.0 and v2.21.0 did exhibit the "faster"
>> behavior, as did v2.23.0. But that "fast" behavior stopped at v2.25.0
>> (for my use case at least) regardless of the use_deprecated_read setting.
>>
>> Thanks,
>> Evan
>>
>> On Wed, May 12, 2021 at 2:47 PM Steve Niemitz <sniem...@apache.org> wrote:
>>
>>> use_deprecated_read was broken in 2.19 on the direct runner and didn't
>>> do anything. [1] I don't think the fix is in 2.20 either, but it will
>>> be in 2.21.
>>>
>>> [1] https://github.com/apache/beam/pull/14469
>>>
>>> On Wed, May 12, 2021 at 1:41 PM Evan Galpin <evan.gal...@gmail.com> wrote:
>>>
>>>> I forgot to also mention that in all tests I was setting
>>>> --experiments=use_deprecated_read
>>>>
>>>> Thanks,
>>>> Evan
>>>>
>>>> On Wed, May 12, 2021 at 1:39 PM Evan Galpin <evan.gal...@gmail.com> wrote:
>>>>
>>>>> Hmm, I think I spoke too soon. I'm still seeing an issue of overall
>>>>> DirectRunner slowness, not just pubsub. I have a pipeline like so:
>>>>>
>>>>> Read pubsub | extract GCS glob patterns | FileIO.matchAll() |
>>>>> FileIO.readMatches() | Read file contents | etc.
>>>>>
>>>>> I have temporarily set up a transform between each step to log what's
>>>>> going on and illustrate the timing issues; see the sketch below.
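>>>>> In code, that section looks roughly like the following (a sketch
>>>>> only; ExtractPatternsFn, ReadContentsFn, and LogFn stand in for my
>>>>> real transforms, and the SLF4J logger setup is elided):
>>>>>
>>>>> // Passthrough step spliced between stages to timestamp arrivals.
>>>>> static class LogFn extends DoFn<String, String> {
>>>>>   @ProcessElement
>>>>>   public void processElement(@Element String e, OutputReceiver<String> out) {
>>>>>     LOG.info("Got element: {}", e);  // LOG is an SLF4J logger
>>>>>     out.output(e);
>>>>>   }
>>>>> }
>>>>>
>>>>> pipeline
>>>>>     // subscription is the full "projects/.../subscriptions/..." path
>>>>>     .apply("Read pubsub", PubsubIO.readStrings().fromSubscription(subscription))
>>>>>     .apply("Extract GCS glob patterns", ParDo.of(new ExtractPatternsFn()))
>>>>>     .apply("Log patterns", ParDo.of(new LogFn()))
>>>>>     .apply(FileIO.matchAll())
>>>>>     .apply(FileIO.readMatches())
>>>>>     .apply("Read file contents", ParDo.of(new ReadContentsFn()));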
>>>>> I ran a series of tests, changing only the SDK version each time
>>>>> (effectively a manual bisect), since I hadn't noticed this performance
>>>>> issue with 2.19.0. Before each test, I seeded the pubsub subscription
>>>>> with the exact same contents.
>>>>>
>>>>> SDK version 2.25.0 (I had a build issue with 2.24.0 that I couldn't
>>>>> seem to resolve) and onward show a significant slowdown.
>>>>>
>>>>> Here is a snippet of logging from v2.25.0:
>>>>>
>>>>> *May 12, 2021 11:11:52 A.M.* com.myOrg.myPipeline.PipelineLeg$1 processElement
>>>>> INFO: Got file pattern: gs://my-bucket/my-dir/5004728247517184/**
>>>>> May 12, 2021 11:16:59 A.M. org.apache.beam.sdk.io.FileIO$MatchAll$MatchFn process
>>>>> INFO: Matched 2 files for pattern gs://my-bucket/my-dir/5004728247517184/**
>>>>> May 12, 2021 11:23:32 A.M. com.myOrg.myPipeline.PipelineLeg$3 processElement
>>>>> INFO: Got ReadableFile: my-file1.json
>>>>> May 12, 2021 11:23:32 A.M. com.myOrg.myPipeline.PipelineLeg$3 processElement
>>>>> INFO: Got ReadableFile: my-file2.json
>>>>> May 12, 2021 11:24:35 A.M. com.myOrg.myPipeline.PipelineLeg$4 processElement
>>>>> INFO: Got file contents for document_id my-file1.json
>>>>> *May 12, 2021 11:24:35 A.M.* com.myOrg.myPipeline.PipelineLeg$4 processElement
>>>>> INFO: Got file contents for document_id my-file2.json
>>>>>
>>>>> Note that end-to-end, these steps took about *13 minutes*. With SDK
>>>>> 2.23.0 and identical user code, the same section of the pipeline took
>>>>> *2 seconds*:
>>>>>
>>>>> *May 12, 2021 11:03:39 A.M.* com.myOrg.myPipeline.PipelineLeg$1 processElement
>>>>> INFO: Got file pattern: gs://my-bucket/my-dir/5004728247517184/**
>>>>> May 12, 2021 11:03:40 A.M. org.apache.beam.sdk.io.FileIO$MatchAll$MatchFn process
>>>>> INFO: Matched 2 files for pattern gs://my-bucket/my-dir/5004728247517184/**
>>>>> May 12, 2021 11:03:40 A.M. com.myOrg.myPipeline.PipelineLeg$3 processElement
>>>>> INFO: Got ReadableFile: my-file1.json
>>>>> May 12, 2021 11:03:40 A.M. com.myOrg.myPipeline.PipelineLeg$3 processElement
>>>>> INFO: Got ReadableFile: my-file2.json
>>>>> May 12, 2021 11:03:41 A.M. com.myOrg.myPipeline.PipelineLeg$4 processElement
>>>>> INFO: Got file contents for document_id my-file1.json
>>>>> *May 12, 2021 11:03:41 A.M.* com.myOrg.myPipeline.PipelineLeg$4 processElement
>>>>> INFO: Got file contents for document_id my-file2.json
>>>>>
>>>>> Any thoughts on what could be causing this?
>>>>>
>>>>> Thanks,
>>>>> Evan
>>>>>
>>>>> On Wed, May 12, 2021 at 9:53 AM Evan Galpin <evan.gal...@gmail.com> wrote:
>>>>>
>>>>>> On Mon, May 10, 2021 at 2:09 PM Boyuan Zhang <boyu...@google.com> wrote:
>>>>>>
>>>>>>> Hi Evan,
>>>>>>>
>>>>>>> What do you mean by startup delay? Is it the time from when you
>>>>>>> start the pipeline to when you notice the first output record from
>>>>>>> PubSub?
>>>>>>
>>>>>> Yes, that's what I meant: the system seemingly sits idle waiting for
>>>>>> pubsub output despite data being in the subscription at pipeline
>>>>>> start time.
>>>>>>
>>>>>>> On Sat, May 8, 2021 at 12:50 AM Ismaël Mejía <ieme...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Can you try running the direct runner with the option
>>>>>>>> `--experiments=use_deprecated_read`?
>>>>>>
>>>>>> This seems to work for me, thanks for this! 👍
>>>>>>
>>>>>>>> Seems like an instance of
>>>>>>>> https://issues.apache.org/jira/browse/BEAM-10670?focusedCommentId=17316858&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17316858
>>>>>>>> also reported in
>>>>>>>> https://lists.apache.org/thread.html/re6b0941a8b4951293a0327ce9b25e607cafd6e45b69783f65290edee%40%3Cdev.beam.apache.org%3E
>>>>>>>>
>>>>>>>> We should roll back using the SDF wrapper by default because of the
>>>>>>>> usability and performance issues reported.
>>>>>>>>
>>>>>>>> On Sat, May 8, 2021 at 12:57 AM Evan Galpin <evan.gal...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi all,
>>>>>>>>>
>>>>>>>>> I'm experiencing very slow performance and startup delay when
>>>>>>>>> testing a pipeline locally. I'm reading data from a Google PubSub
>>>>>>>>> subscription as the data source, and before each pipeline execution
>>>>>>>>> I ensure that data is present in the subscription (readable from
>>>>>>>>> the GCP console).
>>>>>>>>>
>>>>>>>>> I'm seeing startup delay on the order of minutes with DirectRunner
>>>>>>>>> (5-10 min). Is that expected? I did find a Jira ticket [1] that at
>>>>>>>>> first seemed related, but I think it has more to do with BQ than
>>>>>>>>> DirectRunner.
>>>>>>>>>
>>>>>>>>> I've run the pipeline with a debugger connected and confirmed that
>>>>>>>>> it's minutes before the first DoFn in my pipeline receives any
>>>>>>>>> data. Is there a way I can profile the direct runner to see what
>>>>>>>>> it's churning on?
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Evan
>>>>>>>>>
>>>>>>>>> [1] https://issues.apache.org/jira/plugins/servlet/mobile#issue/BEAM-4548
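>>>>>>>>> P.S. In case it's relevant, the read at the head of the pipeline
>>>>>>>>> is essentially just the following (a sketch; the project and
>>>>>>>>> subscription names are placeholders, not my real ones):
>>>>>>>>>
>>>>>>>>> PCollection<String> messages = pipeline.apply("Read pubsub",
>>>>>>>>>     PubsubIO.readStrings()
>>>>>>>>>         .fromSubscription("projects/my-project/subscriptions/my-sub"));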