I think a lot will depend on what the scripts do. I've seen legacy Hive scripts written in awkward ways (e.g. lots of subqueries, nested explodes) because, pre-Spark, that was the only way to express certain logic. For fairly straightforward operations I'd expect Catalyst to reduce both approaches to similar plans. One quick way to check is to compare the plans directly, as in the sketch below.
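A minimal PySpark sketch (the "events" table and its columns are hypothetical) showing the same aggregation written as HQL-style SQL and as DataFrame API calls, with explain() to compare the physical plans Catalyst produces for each:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# enableHiveSupport() is needed if the tables live in the Hive metastore.
spark = (
    SparkSession.builder
    .appName("plan-comparison")
    .enableHiveSupport()
    .getOrCreate()
)

# The same aggregation, written as HQL-style SQL...
sql_df = spark.sql("""
    SELECT user_id, COUNT(*) AS n
    FROM events
    WHERE event_date >= '2020-01-01'
    GROUP BY user_id
""")

# ...and as DataFrame API calls.
api_df = (
    spark.table("events")
    .where(F.col("event_date") >= "2020-01-01")
    .groupBy("user_id")
    .agg(F.count("*").alias("n"))
)

# Both go through the same Catalyst optimizer; for simple operations
# like this, the physical plans are typically identical.
sql_df.explain()
api_df.explain()

If the two plans match, the HQL-on-Spark route shouldn't cost you anything for that query; where they diverge is where rewriting in the DataFrame API is likely to pay off.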
On Tue, Oct 6, 2020 at 12:07 PM Manu Jacob <manu.ja...@sas.com.invalid> wrote:

> Hi All,
>
> Not sure if I need to ask this question on the Spark community or the
> Hive community.
>
> We have a set of Hive scripts that run on EMR (Tez engine). We would
> like to experiment by moving some of them onto Spark. We are planning
> to experiment with two options.
>
> 1. Use the current code based on HQL, with the engine set to Spark.
> 2. Write pure Spark code in Scala/Python using Spark SQL and Hive
>    integration.
>
> The first approach helps us transition to Spark quickly, but I am not
> sure if it is the best approach in terms of performance. I could not
> find any reasonable comparisons of these two approaches. It looks like
> writing pure Spark code gives us more control to add logic and also to
> control some of the performance features, for example caching/evicting.
>
> Any advice on this is much appreciated.
>
> Thanks,
> -Manu

--
Patrick McCarthy
Senior Data Scientist, Machine Learning Engineering
Dstillery
470 Park Ave South, 17th Floor, NYC 10016