Thanks Mich for the detailed explanation.
On Tue, May 28, 2024 at 9:53 PM Mich Talebzadeh wrote:
> Russell mentioned some of these issues before. So in short your mileage
> varies. For a 100 GB data transfer, the speed difference between Glue and
> EMR might not be significant, especially considering ...
If Glue lets you take a configuration-based approach and you don't have to
operate any servers as you would with EMR, use Glue. Try EMR if that proves
troublesome.
Russ
On Tue, May 28, 2024 at 9:23 AM Mich Talebzadeh wrote:
> Russell mentioned some of these issues before. So in short your mileage
> varies ...
Russell mentioned some of these issues before. So in short your mileage
varies. For a 100 GB data transfer, the speed difference between Glue and
EMR might not be significant, especially considering the benefits of Glue's
managed service aspects. However, for much larger datasets or scenarios
where ...
Thanks Mich.
Yes, I agree on the costing part, but how would the data transfer speed be
impacted? Is it because Glue takes some time to initialize the underlying
resources and then process the data?
On Tue, May 28, 2024 at 2:23 PM Mich Talebzadeh wrote:
> Your mileage varies as usual
>
> Glue with DPUs ...
Your mileage varies as usual
Glue with DPUs seems like a strong contender for your data transfer needs
based on the simplicity, scalability, and managed service aspects. However,
if data transfer speed is critical or costs become a concern after testing,
consider EMR as an alternative.
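As a rough sketch, a minimal Glue job for this kind of DB-to-S3 copy might
look like the following; the connection name, table, and bucket are
illustrative placeholders, not details from this thread:

    import sys
    from pyspark.context import SparkContext
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.utils import getResolvedOptions

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glue_context = GlueContext(SparkContext.getOrCreate())
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # Read from the source database through a pre-defined Glue connection.
    # "my_jdbc_conn" and "mydb.mytable" are hypothetical names.
    src = glue_context.create_dynamic_frame.from_options(
        connection_type="mysql",
        connection_options={
            "useConnectionProperties": "true",
            "connectionName": "my_jdbc_conn",
            "dbtable": "mydb.mytable",
        },
    )

    # Dump straight to S3 as Parquet; no transformations applied.
    glue_context.write_dynamic_frame.from_options(
        frame=src,
        connection_type="s3",
        connection_options={"path": "s3://my-bucket/raw/mytable/"},
        format="parquet",
    )

    job.commit()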
HTH
Mich
Thank you everyone for your response.
I am not getting any errors as of now. I am just trying to choose the
right tool for my task, which is loading data from an external source into
S3 via Glue/EMR.
I think a Glue job would be the best fit for me because I can calculate
the DPUs needed (maybe keeping s ...
If you’re using EMR and Spark, you need to choose nodes with enough RAM to
accommodate any given partition in your data, or you can get an OOM error.
Not sure if this job involves a reduce, but I would choose a single 128GB+
memory-optimized instance and then adjust parallelism per the Spark docs.
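To make that concrete, here is a rough PySpark sketch of capping partition
sizes and tuning parallelism; the numbers are illustrative and assume a
DataFrame df already loaded from the source:

    # ~800 partitions for ~100 GB keeps each partition near 128 MB,
    # small enough to fit comfortably in executor memory.
    df = df.repartition(800)

    # Match shuffle parallelism for any reduce-style stages
    # (the default of 200 is often too low at this data size).
    spark.conf.set("spark.sql.shuffle.partitions", "800")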
What exactly is the error? Is it erroring out while reading the data from
the DB? How are you partitioning the data?
How much memory do you currently have? What is the network timeout?
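For reference, a partitioned JDBC read in Spark looks roughly like the
sketch below; the URL, table, partition column, and bounds are illustrative
placeholders, not details from this thread:

    df = (
        spark.read.format("jdbc")
        .option("url", "jdbc:postgresql://host:5432/mydb")
        .option("dbtable", "mytable")
        .option("user", "user")
        .option("password", "password")
        # Split the read into parallel range scans on a numeric column.
        .option("partitionColumn", "id")
        .option("lowerBound", "1")
        .option("upperBound", "10000000")
        .option("numPartitions", "32")
        # Rows fetched per round trip; tune alongside network timeouts.
        .option("fetchsize", "10000")
        .load()
    )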
Regards,
Meena
On Mon, May 27, 2024 at 4:22 PM Perez wrote:
> Hi Team,
>
> I want to extract the data from DB and ...
Hi Team,
I want to extract the data from a DB and just dump it into S3. I don't
have to perform any transformations on the data yet. My data size would be
~100 GB (historical load).
Choosing the right DPUs (Glue jobs) should solve this problem, right? Or
should I move to EMR?
I don't feel the need t ...
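For the EMR/plain-Spark alternative, the write side of such a dump could be
as simple as the sketch below; the bucket and path are placeholders, and df
is assumed to be the DataFrame read from the DB:

    # Write to S3 as Parquet with no transformations. Coalescing
    # controls the number (and rough size) of output files.
    (
        df.coalesce(200)
        .write.mode("overwrite")
        .parquet("s3://my-bucket/raw/historical_load/")
    )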