Thank you, everyone, for your responses. I am not getting any errors as of now. I am just trying to choose the right tool for my task, which is loading data from an external source into S3 via Glue/EMR.
I think a Glue job would be the best fit for me because I can calculate the DPUs needed (maybe keeping some extra buffer), so I just wanted to check whether there are any edge cases I should consider. I have pasted a rough sketch of the job I have in mind below the quoted thread.

On Tue, May 28, 2024 at 5:39 AM Russell Jurney <russell.jur...@gmail.com>
wrote:

> If you’re using EMR and Spark, you need to choose nodes with enough RAM
> to accommodate any given partition in your data or you can get an OOM
> error. Not sure if this job involves a reduce, but I would choose a
> single 128GB+ memory-optimized instance and then adjust parallelism per
> the Spark docs, using pyspark.sql.DataFrame.repartition(n) at the start
> of your job.
>
> Thanks,
> Russell Jurney @rjurney <http://twitter.com/rjurney>
> russell.jur...@gmail.com LI <http://linkedin.com/in/russelljurney> FB
> <http://facebook.com/jurney> datasyndrome.com
>
> On Mon, May 27, 2024 at 9:15 AM Perez <flinkbyhe...@gmail.com> wrote:
>
>> Hi Team,
>>
>> I want to extract the data from a DB and just dump it into S3. I don't
>> have to perform any transformations on the data yet. My data size would
>> be ~100 GB (historical load).
>>
>> Choosing the right DPUs (Glue jobs) should solve this problem, right?
>> Or should I move to EMR?
>>
>> I don't feel the need to move to EMR but wanted expert suggestions.
>>
>> TIA.
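To make this concrete, here is a minimal sketch of the JDBC-to-S3 dump I am planning. The connection URL, table name, partition column/bounds, and bucket are all placeholders, not my real values:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("db-to-s3-historical-load").getOrCreate()

df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://db-host:5432/mydb")  # placeholder URL
    .option("dbtable", "public.big_table")                 # placeholder table
    .option("user", "reader")
    .option("password", "...")
    # Partitioned reads keep any single task from pulling the whole table,
    # which is the OOM risk Russell mentioned:
    .option("partitionColumn", "id")   # assumes a numeric, roughly uniform key
    .option("lowerBound", "1")
    .option("upperBound", "100000000")
    .option("numPartitions", "64")
    .load()
)

# Rebalance before the write, per the repartition(n) suggestion:
df = df.repartition(64)

df.write.mode("overwrite").parquet("s3://my-bucket/historical-load/")

With ~100 GB split across 64 partitions, each read slice is roughly 1.5 GB, which should fit comfortably in a G.1X worker's 16 GB of memory. Does this look reasonable, or are there Glue-specific gotchas I am missing?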