Russell mentioned some of these issues before, so in short, your mileage
may vary. For a 100 GB data transfer, the speed difference between Glue and
EMR might not be significant, especially considering the benefits of Glue's
managed service aspects. However, for much larger datasets or scenarios
where speed is critical, EMR's customization options might provide a slight
edge.

My recommendations:

- Test and compare: if speed is a concern, consider running a test job with
both Glue and EMR (if feasible) on a smaller subset of your data to compare
transfer times and costs in your specific environment (a rough sketch of
such a test job is below).
- Focus on benefits: if the speed difference with Glue is minimal but it
offers significant benefits in terms of management and cost for your use
case, Glue might still be the preferable option.
- Check bandwidth: ensure your network bandwidth between the database and S3
is sufficient to handle the data transfer rate, regardless of the service
you choose.
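
For what it is worth, below is a minimal PySpark sketch of the kind of test
job I mean: it reads one table over JDBC and writes it to S3 as Parquet,
timing the run. The JDBC URL, credentials, table name, bucket path and
partition bounds are all placeholders you would replace for your own
environment; the same script can be submitted as a Glue job script or as a
spark-submit step on EMR, so the comparison stays reasonably like-for-like.

import time
from pyspark.sql import SparkSession

# Minimal transfer test: copy one table from a JDBC source into S3 as Parquet.
# Everything below (host, table, credentials, bucket, bounds) is a placeholder.
spark = SparkSession.builder.appName("glue-vs-emr-transfer-test").getOrCreate()

jdbc_url = "jdbc:postgresql://my-db-host:5432/mydb"   # placeholder source DB
target_path = "s3://my-bucket/raw/my_table/"          # placeholder S3 target

start = time.time()

df = (spark.read.format("jdbc")
      .option("url", jdbc_url)
      .option("dbtable", "public.my_table")           # placeholder table
      .option("user", "my_user")                      # prefer Secrets Manager
      .option("password", "my_password")
      # Split the read on a numeric key so the copy is not bottlenecked
      # on a single JDBC connection.
      .option("partitionColumn", "id")
      .option("lowerBound", "1")
      .option("upperBound", "100000000")
      .option("numPartitions", "32")
      .load())

df.write.mode("overwrite").parquet(target_path)

print("Transfer took %.0f seconds" % (time.time() - start))
spark.stop()

Running the same script with a couple of different numPartitions and DPU (or
instance) settings on a subset of the table should give you a feel for where
the bottleneck is: the source database, the network, or the Spark side.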


HTH
Mich Talebzadeh,
Technologist | Architect | Data Engineer | Generative AI | FinCrime
PhD <https://en.wikipedia.org/wiki/Doctor_of_Philosophy> Imperial College
London <https://en.wikipedia.org/wiki/Imperial_College_London>
London, United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed. It is essential to note
that, as with any advice, "one test result is worth one-thousand
expert opinions" (Wernher von Braun
<https://en.wikipedia.org/wiki/Wernher_von_Braun>).


On Tue, 28 May 2024 at 16:40, Perez <flinkbyhe...@gmail.com> wrote:

> Thanks Mich.
>
> Yes, I agree on the costing part, but how would the data transfer speed be
> impacted? Is it because Glue takes some time to initialize underlying
> resources and then process the data?
>
>
> On Tue, May 28, 2024 at 2:23 PM Mich Talebzadeh <mich.talebza...@gmail.com>
> wrote:
>
>> Your mileage varies as usual
>>
>> Glue with DPUs seems like a strong contender for your data transfer needs
>> based on the simplicity, scalability, and managed service aspects. However,
>> if data transfer speed is critical or costs become a concern after testing,
>> consider EMR as an alternative.
>>
>> HTH
>>
>> On Tue, 28 May 2024 at 09:04, Perez <flinkbyhe...@gmail.com> wrote:
>>
>>> Thank you everyone for your response.
>>>
>>> I am not getting any errors as of now. I am just trying to choose the
>>> right tool for my task, which is data loading from an external source into
>>> S3 via Glue/EMR.
>>>
>>> I think a Glue job would be the best fit for me because I can calculate
>>> DPUs needed (maybe keeping some extra buffer) so just wanted to check if
>>> there are any edge cases I need to consider.
>>>
>>>
>>> On Tue, May 28, 2024 at 5:39 AM Russell Jurney <russell.jur...@gmail.com>
>>> wrote:
>>>
>>>> If you’re using EMR and Spark, you need to choose nodes with enough RAM
>>>> to accommodate any given partition in your data or you can get an OOM
>>>> error. Not sure if this job involves a reduce, but I would choose a single
>>>> 128GB+ memory-optimized instance and then adjust parallelism as per the
>>>> Spark docs, using pyspark.sql.DataFrame.repartition(n) at the start of your
>>>> job.
>>>>
>>>> Thanks,
>>>> Russell Jurney @rjurney <http://twitter.com/rjurney>
>>>> russell.jur...@gmail.com LI <http://linkedin.com/in/russelljurney> FB
>>>> <http://facebook.com/jurney> datasyndrome.com
>>>>
>>>>
>>>> On Mon, May 27, 2024 at 9:15 AM Perez <flinkbyhe...@gmail.com> wrote:
>>>>
>>>>> Hi Team,
>>>>>
>>>>> I want to extract the data from DB and just dump it into S3. I
>>>>> don't have to perform any transformations on the data yet. My data size
>>>>> would be ~100 GB (historical load).
>>>>>
>>>>> Choosing the right DPUs (Glue jobs) should solve this problem, right? Or
>>>>> should I move to EMR?
>>>>>
>>>>> I don't feel the need to move to EMR but wanted some expert
>>>>> suggestions.
>>>>>
>>>>> TIA.
>>>>>
>>>>
