Oof, none of the documentation I have read about windowing says anything 
about it not working in a batch job. Do you remember where you found that 
info?
________________________________
From: Vincent Marquez <vincent.marq...@gmail.com>
Sent: Wednesday, July 21, 2021 12:21 PM
To: user <user@beam.apache.org>
Subject: Re: [2.28.0] [Java] [Dataflow] ParquetIO writeDynamic stuck in Garbage 
Collection when writing ~125K files to dynamic destinations

Windowing doesn't work with Batch jobs.  You could dump your BQ data to pubsub 
and then use a streaming job to window.
~Vincent


On Wed, Jul 21, 2021 at 10:13 AM Andrew Kettmann 
<akettm...@evolve24.com> wrote:
Worker machines are n1-standard-2s (2 vCPUs and 7.5 GB of RAM).

The pipeline is simple, but it produces a large number of output files: ~125K 
temp files written in at least one case.

  1.  Scan Bigtable (NoSQL DB)
  2.  Transform with business logic
  3.  Convert to GenericRecord
  4.  WriteDynamic to a google bucket as Parquet files partitioned by 15 minute 
intervals. 
(gs://bucket/root_dir/CATEGORY/YEAR/MONTH/DAY/HOUR/MINUTE_FLOOR_15/FILENAME.parquet)
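
For illustration, the partition directory in step 4 could be derived from a 
record timestamp roughly as below. This is only a sketch: the class and method 
names (PartitionPath, partitionFor) are hypothetical, and the use of UTC and 
zero-padded fields is an assumption, not a detail of the actual pipeline.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.util.Locale;

public class PartitionPath {
    // Hypothetical helper: builds CATEGORY/YEAR/MONTH/DAY/HOUR/MINUTE_FLOOR_15
    // for a record. Assumes timestamps are interpreted in UTC.
    static String partitionFor(String category, Instant timestamp) {
        ZonedDateTime t = timestamp.atZone(ZoneOffset.UTC);
        // Integer division floors the minute to 0, 15, 30, or 45.
        int minuteFloor = (t.getMinute() / 15) * 15;
        return String.format(Locale.ROOT, "%s/%04d/%02d/%02d/%02d/%02d",
                category, t.getYear(), t.getMonthValue(),
                t.getDayOfMonth(), t.getHour(), minuteFloor);
    }
}
```

A string like this would be the destination key handed to writeDynamic, so 
every distinct 15 minute bucket per category becomes its own destination.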

Everything works fine until I get to the writeDynamic. When it reaches the 
GroupByKey 
(FileIO.Write/WriteFiles/GatherTempFileResults/Reshuffle.ViaRandomKey/Reshuffle/GroupByKey), 
the Stackdriver logs show a ton of GC runs triggered by allocation failures 
that free up essentially zero space, and the job never progresses. It ends up 
with a "The worker lost contact with the service." error four times and then 
fails. It is also worth noting that Dataflow scales down to a single worker 
during this time, so it is trying to do it all at once.

I am probably not hitting GC alerts because I am using a snippet from a GCP 
Dataflow template that queries Bigtable. It appears to disable the GC 
thrashing monitoring because Bigtable creates at least 5 objects per row 
scanned.

    DataflowPipelineDebugOptions debugOptions =
        options.as(DataflowPipelineDebugOptions.class);
    debugOptions.setGCThrashingPercentagePerPeriod(100.00);

What are my options for splitting this up so that it can be processed in 
smaller chunks? I tried adding windowing, but it didn't seem to help. Maybe I 
needed to do something beyond just the windowing, but I don't really have a 
key to group by here.

 Andrew Kettmann
DevOps Engineer
P: 1.314.596.2836