1. If you join on some specific field (e.g. user id, account id, or
whatever), you can try partitioning the Parquet files by that field; the
join will then be more efficient (a minimal sketch follows this list).
2. You need to look at the Spark metrics to see how that particular join
performs: how many partitions there are and what the shuffle size is
(see the second sketch below).
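For point 1, a minimal sketch of the partitioned-write idea in Scala, assuming a hypothetical "user_id" join key and made-up paths:

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().appName("partitioned-join").getOrCreate()

  // Write both sides partitioned by the (hypothetical) join key, so rows
  // with the same key land under the same directory on both sides.
  spark.read.parquet("/data/users")
    .write.partitionBy("user_id").parquet("/data/users_by_uid")
  spark.read.parquet("/data/events")
    .write.partitionBy("user_id").parquet("/data/events_by_uid")

  // Join the partitioned copies on the partition column.
  val joined = spark.read.parquet("/data/users_by_uid")
    .join(spark.read.parquet("/data/events_by_uid"), "user_id")

Note that directory partitioning only pays off for reasonably low-cardinality keys; for millions of distinct user ids, bucketBy on a saved table is the usual alternative.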
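For point 2, besides the Spark UI (the Stages tab shows shuffle read/write sizes per stage), you can inspect a lot from code; a sketch, reusing the joined DataFrame from the sketch above:

  // How many shuffle partitions Spark SQL uses for joins/aggregations (default 200).
  println(spark.conf.get("spark.sql.shuffle.partitions"))

  // Physical plan: look for SortMergeJoin vs BroadcastHashJoin, and for the
  // Exchange operators, which mark the shuffles.
  joined.explain()

  // How many partitions the join result actually ended up with.
  println(joined.rdd.getNumPartitions)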
Can you share your approximate data size? These all sound like valid use
cases for Spark; I'm wondering whether you are providing enough resources.
Also, do you have any expectations in terms of performance? What does "slow
down" mean concretely?
For this use case I would personally favor Parquet over a DB, and the SQL/DataFrame API.
Hello all community members,
I need the opinion of people who have used Spark before and can share their
experience, to help me select a technical approach.
I have a project in the Proof of Concept phase, where we are evaluating the
possibility of using Spark for our use case.
Here is a brief task description.