I think you need to do something like:

Read (200)  -> GroupByKey (200) -> UnGroup(200) [Not this can be on 200
different workers] -> combine (20M) -> clean (20M) -> filter (20M) -> insert

On Mon, Jun 5, 2017 at 1:14 PM Morand, Sebastien <
sebastien.mor...@veolia.com> wrote:

> Yes fusion looks like my problem. A job ID to look at:
> 2017-06-05_10_14_25-5856213199384263626.
>
> The point is in your link:
> <<
> For example, one case in which fusion can limit Dataflow's ability to
> optimize worker usage is a "high fan-out" ParDo. In such an operation, you
> might have an input collection with relatively few elements, but the ParDo
> produces an output with hundreds or thousands of times as many elements,
> followed by another ParDo
> >>
>
> This is exactly what I'm doing in the step transform-combine-7d5ad942 in
> the above job id.
>
> As fas as I understand, I should create a GroupByKey after the
> transform-combine-7d5ad942 on a unique field and then ungroup the data?
> (meaning I add two operations in the pipeline to help the worker?
>
> Right now:
> Read (200) -> combine (20M) -> clean (20M) -> filter (20M) -> insert
>
> Will become:
> Read (200) -> combine (20M) -> GroupByKey (20M) -> ungroup (20M) -> clean
> (20M) -> filter (20M) -> insert
>
> It this the right way?
>
>
>
>
> *Sébastien MORAND*
> Team Lead Solution Architect
> Technology & Operations / Digital Factory
> Veolia - Group Information Systems & Technology (IS&T)
> Cell.: +33 7 52 66 20 81 / Direct: +33 1 85 57 71 08
> <+33%201%2085%2057%2071%2008>
> Bureau 0144C (Ouest)
> 30, rue Madeleine-Vionnet - 93300 Aubervilliers, France
> *www.veolia.com <http://www.veolia.com>*
> <http://www.veolia.com>
> <https://www.facebook.com/veoliaenvironment/>
> <https://www.youtube.com/user/veoliaenvironnement>
> <https://www.linkedin.com/company/veolia-environnement>
> <https://twitter.com/veolia>
>
> On 5 June 2017 at 21:42, Eugene Kirpichov <kirpic...@google.com> wrote:
>
>> Do you have a Dataflow job ID to look at?
>> It might be due to fusion
>> https://cloud.google.com/dataflow/service/dataflow-service-desc#preventing-fusion
>>
>> On Mon, Jun 5, 2017 at 12:13 PM Prabeesh K. <prabsma...@gmail.com> wrote:
>>
>>> Please try using *--worker_machine_type* n1-standard-4 or n1-standard-8
>>>
>>> On 5 June 2017 at 23:08, Morand, Sebastien <sebastien.mor...@veolia.com>
>>> wrote:
>>>
>>>> I do have a problem with my tries to test scaling on dataflow.
>>>>
>>>> My dataflow is pretty simple: I get a list of files from pubsub, so the
>>>> number of files I'm going to use to feed the flow is well known at the
>>>> begining. Here are my steps:
>>>> Let's say I have 200 files containing about 20,000,000 of records
>>>>
>>>>    - *First Step:* Read file contents from storage: files are .tar.gz
>>>>    containing each 4 files (CSV). I return the file content as the whole 
>>>> in a
>>>>    record
>>>>    *OUT:* 200 records (one for each file containing the data of all 4
>>>>    files). Bascillacy it's a dict : {file1: content_of_file1, file2:
>>>>    content_of_file2, etc...}
>>>>
>>>>    - *Second step:*  Joining the data of the 4 files in one record
>>>>    (the main file contains foreign key to get information from the other 
>>>> files)
>>>>    *OUT:* 20,000,000 records each for every line in the files. Each
>>>>    record is a list of string
>>>>
>>>>    - *Third step:* cleaning data (convert to prepare integration in
>>>>    bigquery) and set them as a dict where keys are bigquery column name.
>>>>    *OUT:* 20,000,000 records as dict for each record
>>>>
>>>>    - *Fourth step:* insert into bigquery
>>>>
>>>> So the first step return 200 records, but I have 20,000,000 records to
>>>> insert.
>>>> This takes about 1 hour and half and always use 1 worker ...
>>>>
>>>> If I manually set the number of workers, it's not really faster. So for
>>>> an unknow reason, it doesn't scale, any ideas how to do it?
>>>>
>>>> Thanks for any help.
>>>>
>>>> *Sébastien MORAND*
>>>> Team Lead Solution Architect
>>>> Technology & Operations / Digital Factory
>>>> Veolia - Group Information Systems & Technology (IS&T)
>>>> Cell.: +33 7 52 66 20 81 / Direct: +33 1 85 57 71 08
>>>> <+33%201%2085%2057%2071%2008>
>>>> Bureau 0144C (Ouest)
>>>> 30, rue Madeleine-Vionnet - 93300 Aubervilliers, France
>>>> *www.veolia.com <http://www.veolia.com>*
>>>> <http://www.veolia.com>
>>>> <https://www.facebook.com/veoliaenvironment/>
>>>> <https://www.youtube.com/user/veoliaenvironnement>
>>>> <https://www.linkedin.com/company/veolia-environnement>
>>>> <https://twitter.com/veolia>
>>>>
>>>>
>>>>
>>>> --------------------------------------------------------------------------------------------
>>>> This e-mail transmission (message and any attached files) may contain
>>>> information that is proprietary, privileged and/or confidential to Veolia
>>>> Environnement and/or its affiliates and is intended exclusively for the
>>>> person(s) to whom it is addressed. If you are not the intended recipient,
>>>> please notify the sender by return e-mail and delete all copies of this
>>>> e-mail, including all attachments. Unless expressly authorized, any use,
>>>> disclosure, publication, retransmission or dissemination of this e-mail
>>>> and/or of its attachments is strictly prohibited.
>>>>
>>>> Ce message electronique et ses fichiers attaches sont strictement
>>>> confidentiels et peuvent contenir des elements dont Veolia Environnement
>>>> et/ou l'une de ses entites affiliees sont proprietaires. Ils sont donc
>>>> destines a l'usage de leurs seuls destinataires. Si vous avez recu ce
>>>> message par erreur, merci de le retourner a son emetteur et de le detruire
>>>> ainsi que toutes les pieces attachees. L'utilisation, la divulgation, la
>>>> publication, la distribution, ou la reproduction non expressement
>>>> autorisees de ce message et de ses pieces attachees sont interdites.
>>>>
>>>> --------------------------------------------------------------------------------------------
>>>>
>>>
>>>
>
>
>
> --------------------------------------------------------------------------------------------
> This e-mail transmission (message and any attached files) may contain
> information that is proprietary, privileged and/or confidential to Veolia
> Environnement and/or its affiliates and is intended exclusively for the
> person(s) to whom it is addressed. If you are not the intended recipient,
> please notify the sender by return e-mail and delete all copies of this
> e-mail, including all attachments. Unless expressly authorized, any use,
> disclosure, publication, retransmission or dissemination of this e-mail
> and/or of its attachments is strictly prohibited.
>
> Ce message electronique et ses fichiers attaches sont strictement
> confidentiels et peuvent contenir des elements dont Veolia Environnement
> et/ou l'une de ses entites affiliees sont proprietaires. Ils sont donc
> destines a l'usage de leurs seuls destinataires. Si vous avez recu ce
> message par erreur, merci de le retourner a son emetteur et de le detruire
> ainsi que toutes les pieces attachees. L'utilisation, la divulgation, la
> publication, la distribution, ou la reproduction non expressement
> autorisees de ce message et de ses pieces attachees sont interdites.
>
> --------------------------------------------------------------------------------------------
>

Reply via email to