Yes, you're correct.

On Mon, Jun 5, 2017 at 1:23 PM Morand, Sebastien <sebastien.mor...@veolia.com> wrote:
> In parentheses after each step, I meant the number of records in the output.
>
> When I ungroup, do I send the 200 records again, not the 20M?
>
> Shouldn't I do this instead:
> Read (200) -> GroupByKey (200) -> UnGroup (20M) -> combine (20M) -> clean (20M) -> filter (20M) -> insert
>
> *Sébastien MORAND*
> Team Lead Solution Architect
> Technology & Operations / Digital Factory
> Veolia - Group Information Systems & Technology (IS&T)
> Cell.: +33 7 52 66 20 81 / Direct: +33 1 85 57 71 08
> Bureau 0144C (Ouest)
> 30, rue Madeleine-Vionnet - 93300 Aubervilliers, France
> www.veolia.com <http://www.veolia.com>
>
> On 5 June 2017 at 22:17, Sourabh Bajaj <sourabhba...@google.com> wrote:
>
>> I think you need to do something like:
>>
>> Read (200) -> GroupByKey (200) -> UnGroup (200) [Note: this can be on 200 different workers] -> combine (20M) -> clean (20M) -> filter (20M) -> insert
>>
>> On Mon, Jun 5, 2017 at 1:14 PM Morand, Sebastien <sebastien.mor...@veolia.com> wrote:
>>
>>> Yes, fusion looks like my problem. A job ID to look at: 2017-06-05_10_14_25-5856213199384263626.
>>>
>>> The point is in your link:
>>>
>>> << For example, one case in which fusion can limit Dataflow's ability to optimize worker usage is a "high fan-out" ParDo. In such an operation, you might have an input collection with relatively few elements, but the ParDo produces an output with hundreds or thousands of times as many elements, followed by another ParDo. >>
>>>
>>> This is exactly what I'm doing in the step transform-combine-7d5ad942 in the above job ID.
>>>
>>> As far as I understand, I should create a GroupByKey after transform-combine-7d5ad942 on a unique field and then ungroup the data?
>>> (Meaning I add two operations to the pipeline to help the workers?)
>>>
>>> Right now:
>>> Read (200) -> combine (20M) -> clean (20M) -> filter (20M) -> insert
>>>
>>> will become:
>>> Read (200) -> combine (20M) -> GroupByKey (20M) -> ungroup (20M) -> clean (20M) -> filter (20M) -> insert
>>>
>>> Is this the right way?
>>>
>>> On 5 June 2017 at 21:42, Eugene Kirpichov <kirpic...@google.com> wrote:
>>>
>>>> Do you have a Dataflow job ID to look at?
>>>> It might be due to fusion:
>>>> https://cloud.google.com/dataflow/service/dataflow-service-desc#preventing-fusion
>>>>
>>>> On Mon, Jun 5, 2017 at 12:13 PM Prabeesh K. <prabsma...@gmail.com> wrote:
>>>>
>>>>> Please try using *--worker_machine_type* n1-standard-4 or n1-standard-8.
>>>>>
>>>>> On 5 June 2017 at 23:08, Morand, Sebastien <sebastien.mor...@veolia.com> wrote:
>>>>>
>>>>>> I have a problem with my attempts to test scaling on Dataflow.
>>>>>>
>>>>>> My dataflow is pretty simple: I get a list of files from Pub/Sub, so the number of files I'm going to use to feed the flow is well known at the beginning. Let's say I have 200 files containing about 20,000,000 records in total. Here are my steps:
>>>>>>
>>>>>> - *First step:* Read the file contents from storage. The files are .tar.gz archives, each containing 4 CSV files.
>>>>>> I return the whole file content in a single record.
>>>>>> *OUT:* 200 records (one per archive, containing the data of all 4 files). Basically it's a dict: {file1: content_of_file1, file2: content_of_file2, etc.}
>>>>>>
>>>>>> - *Second step:* Join the data of the 4 files into one record (the main file contains foreign keys used to look up information in the other files).
>>>>>> *OUT:* 20,000,000 records, one for every line in the files. Each record is a list of strings.
>>>>>>
>>>>>> - *Third step:* Clean the data (convert it to prepare for loading into BigQuery) and turn each record into a dict whose keys are BigQuery column names.
>>>>>> *OUT:* 20,000,000 records, one dict per record.
>>>>>>
>>>>>> - *Fourth step:* Insert into BigQuery.
>>>>>>
>>>>>> So the first step returns 200 records, but I have 20,000,000 records to insert.
>>>>>> This takes about an hour and a half and always uses 1 worker...
>>>>>>
>>>>>> If I manually set the number of workers, it's not really faster. So for an unknown reason, it doesn't scale. Any ideas how to fix it?
>>>>>>
>>>>>> Thanks for any help.
>>>>>>
>>>>>> --------------------------------------------------------------------------------------------
>>>>>> This e-mail transmission (message and any attached files) may contain information that is proprietary, privileged and/or confidential to Veolia Environnement and/or its affiliates and is intended exclusively for the person(s) to whom it is addressed. If you are not the intended recipient, please notify the sender by return e-mail and delete all copies of this e-mail, including all attachments. Unless expressly authorized, any use, disclosure, publication, retransmission or dissemination of this e-mail and/or of its attachments is strictly prohibited.
>>>>>> --------------------------------------------------------------------------------------------