Thank you, everyone, for your responses.

I am not getting any errors as of now. I am just trying to choose the right
tool for my task, which is loading data from an external source into S3 via
Glue/EMR.

I think a Glue job would be the best fit for me because I can calculate the
DPUs needed (perhaps keeping some extra buffer), so I just wanted to check
whether there are any edge cases I need to consider.
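
For reference, here is a minimal sketch of what I have in mind for the Glue
job, a straight JDBC-to-S3 dump. The connection URL, table name, credentials,
partition bounds, and bucket path are all placeholders, not actual values
from my setup:

    import sys
    from awsglue.utils import getResolvedOptions
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from pyspark.context import SparkContext

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glueContext = GlueContext(SparkContext())
    spark = glueContext.spark_session
    job = Job(glueContext)
    job.init(args["JOB_NAME"], args)

    # Parallel JDBC read; partitionColumn assumes a numeric, indexed column.
    df = (
        spark.read.format("jdbc")
        .option("url", "jdbc:postgresql://db-host:5432/sourcedb")  # placeholder
        .option("dbtable", "public.big_table")                     # placeholder
        .option("user", "etl_user")                                # placeholder
        .option("password", "etl_password")                        # placeholder
        .option("numPartitions", 32)
        .option("partitionColumn", "id")
        .option("lowerBound", 1)
        .option("upperBound", 100_000_000)
        .load()
    )

    # No transformations yet; dump straight to S3 as Parquet.
    df.write.mode("overwrite").parquet("s3://my-bucket/raw/big_table/")

    job.commit()

My understanding is that without numPartitions/partitionColumn, Spark pulls
the whole table through a single JDBC connection, which would likely be the
first bottleneck (and OOM risk) for a ~100 GB load.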


On Tue, May 28, 2024 at 5:39 AM Russell Jurney <russell.jur...@gmail.com>
wrote:

> If you’re using EMR and Spark, you need to choose nodes with enough RAM to
> accommodate any given partition in your data, or you can get an OOM error.
> Not sure if this job involves a reduce, but I would choose a single 128GB+
> memory-optimized instance and then adjust parallelism as needed per the
> Spark docs, using pyspark.sql.DataFrame.repartition(n) at the start of your
> job.
>
> Thanks,
> Russell Jurney @rjurney <http://twitter.com/rjurney>
> russell.jur...@gmail.com LI <http://linkedin.com/in/russelljurney> FB
> <http://facebook.com/jurney> datasyndrome.com
>
>
> On Mon, May 27, 2024 at 9:15 AM Perez <flinkbyhe...@gmail.com> wrote:
>
>> Hi Team,
>>
>> I want to extract the data from a DB and just dump it into S3. I don't
>> have to perform any transformations on the data yet. My data size would
>> be ~100 GB (historical load).
>>
>> Choosing the right DPUs (Glue jobs) should solve this problem, right? Or
>> should I move to EMR?
>>
>> I don't feel the need to move to EMR, but I wanted expert suggestions.
>>
>> TIA.
>>
>
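
As a footnote on Russell's repartition suggestion above, this is roughly
how I would pick n, continuing from the df in my sketch earlier. The
~128 MB-per-partition target is a common rule of thumb, not something from
this thread:

    # Rough sizing: ~100 GB at ~128 MB per partition -> about 800 partitions.
    total_gb = 100
    target_mb = 128
    n = total_gb * 1024 // target_mb  # 800

    # Redistribute at the start of the job so no single partition
    # outgrows executor memory during the S3 write.
    df = df.repartition(n)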
