Oof none of the documentation around windowing I have read has said anything about it not working in a batch job. Where did you find that info at if you remember? ________________________________ From: Vincent Marquez <vincent.marq...@gmail.com> Sent: Wednesday, July 21, 2021 12:21 PM To: user <user@beam.apache.org> Subject: Re: [2.28.0] [Java] [Dataflow] ParquetIO writeDynamic stuck in Garbage Collection when writing ~125K files to dynamic destinations
Windowing doesn't work with Batch jobs. You could dump your BQ data to pubsub and then use a streaming job to window. ~Vincent On Wed, Jul 21, 2021 at 10:13 AM Andrew Kettmann <akettm...@evolve24.com<mailto:akettm...@evolve24.com>> wrote: Worker machines are n1-standard-2s (2 cpus and 7.5GB of RAM) Pipeline is simple, but large amounts of end files, ~125K temp files written in one case at least 1. Scan Bigtable (NoSQL DB) 2. Transform with business logic 3. Convert to GenericRecord 4. WriteDynamic to a google bucket as Parquet files partitioned by 15 minute intervals. (gs://bucket/root_dir/CATEGORY/YEAR/MONTH/DAY/HOUR/MINUTE_FLOOR_15/FILENAME.parquet) Everything does fine until I get to the writeDynamic. When it does the groupByKey (FileIO.Write/WriteFiles/GatherTempFileResults/Reshuffle.ViaRandomKey/Reshuffle/GroupByKey) the stackdriver logs show a ton of allocation failure triggered GC that then frees up essentially zero space and never progresses, ends up with a "The worker lost contact with the service." error four times and then fails. Also worth noting that Dataflow sizes down to a single worker during this time, so it is trying to do it all at once. What are my options for splitting Likely I am not hitting GC alerts because I am using a snippet I pulled from a GCP Dataflow template that queries Bigtable that looks to disable the GCThrashing monitoring, due to Bigtable creating at least 5 objects per row scanned. DataflowPipelineDebugOptions debugOptions = options.as<https://nam12.safelinks.protection.outlook.com/?url=http%3A%2F%2Foptions.as%2F&data=04%7C01%7Cakettmann%40evolve24.com%7C831d21fd39d3455d7f9908d94c6bf430%7Ce36287f1b44849498093fe543a560976%7C0%7C0%7C637624848837573722%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=PZFFQZZuliKFQbeXoaCT1%2BY0CQXbwhKVPf%2FCReHhwWU%3D&reserved=0>(DataflowPipelineDebugOptions.class); debugOptions.setGCThrashingPercentagePerPeriod(100.00); What are my options for splitting this up so that it can process this in smaller chunks? I tried adding windowing but it didn't seem to help, or I needed to do something else other than just the windowing, but I don't really have a key to group it by here. [https://storage.googleapis.com/e24-email-images/e24logonotag.png]<https://nam12.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.evolve24.com%2F&data=04%7C01%7Cakettmann%40evolve24.com%7C831d21fd39d3455d7f9908d94c6bf430%7Ce36287f1b44849498093fe543a560976%7C0%7C0%7C637624848837583677%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=BoMmvr3A89bqBfMYA13ah1F3b85xNgi46AYAPXcHtgE%3D&reserved=0> Andrew Kettmann DevOps Engineer P: 1.314.596.2836 [LinkedIn]<https://nam12.safelinks.protection.outlook.com/?url=https%3A%2F%2Flinkedin.com%2Fcompany%2Fevolve24&data=04%7C01%7Cakettmann%40evolve24.com%7C831d21fd39d3455d7f9908d94c6bf430%7Ce36287f1b44849498093fe543a560976%7C0%7C0%7C637624848837583677%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=4wgNfvLwhtIVKf27QR9%2BE%2FPu15lJCdRPuZj5G6y9WXs%3D&reserved=0> [Twitter] <https://nam12.safelinks.protection.outlook.com/?url=https%3A%2F%2Ftwitter.com%2Fevolve24&data=04%7C01%7Cakettmann%40evolve24.com%7C831d21fd39d3455d7f9908d94c6bf430%7Ce36287f1b44849498093fe543a560976%7C0%7C0%7C637624848837593625%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=plB3o0gY5OsVvPw1Lbyfh5KREfF3KjD19TCRrcxsFaI%3D&reserved=0> [Instagram] <https://nam12.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.instagram.com%2Fevolve_24&data=04%7C01%7Cakettmann%40evolve24.com%7C831d21fd39d3455d7f9908d94c6bf430%7Ce36287f1b44849498093fe543a560976%7C0%7C0%7C637624848837593625%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=Exs%2BVbsUtOmUpzyc%2BAbsejfQIxm6bTB%2ByuMHBaiNTtU%3D&reserved=0> evolve24 Confidential & Proprietary Statement: This email and any attachments are confidential and may contain information that is privileged, confidential or exempt from disclosure under applicable law. It is intended for the use of the recipients. If you are not the intended recipient, or believe that you have received this communication in error, please do not read, print, copy, retransmit, disseminate, or otherwise use the information. Please delete this email and attachments, without reading, printing, copying, forwarding or saving them, and notify the Sender immediately by reply email. No confidentiality or privilege is waived or lost by any transmission in error.