As long as you don't use Python lambdas in your Spark job, there should be almost no difference between the Scala and Python DataFrame code. Once you introduce Python lambdas you will hit significant serialization penalties, and the actual work has to run in Python worker processes. Without lambdas, everything operates on Catalyst-compiled Java code, so there won't be a big difference between Python and Scala.
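
For what it's worth, here is a minimal PySpark sketch of that distinction (local SparkSession and made-up toy data, just for illustration; the column names and values are assumptions, not from any real job):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("catalyst-vs-udf").getOrCreate()

# Toy data purely for illustration.
df = spark.createDataFrame(
    [("alice", 34), ("bob", 45)],
    ["name", "age"],
)

# 1) Built-in column expressions: Catalyst compiles these to JVM code,
#    so individual rows never pass through Python.
fast = (df.withColumn("name_upper", F.upper(F.col("name")))
          .filter(F.col("age") > 40))

# 2) A Python lambda wrapped as a UDF: every row is serialized out to a
#    Python worker, processed there, and serialized back to the JVM.
upper_udf = F.udf(lambda s: s.upper(), StringType())
slow = (df.withColumn("name_upper", upper_udf(F.col("name")))
          .filter(F.col("age") > 40))

fast.explain()  # plan contains only JVM expressions
slow.explain()  # plan contains a BatchEvalPython step for the UDF

Comparing the two explain() outputs makes the difference visible: the UDF version adds a Python evaluation step to the physical plan, which is where the serialization cost comes from.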
On Fri, Oct 9, 2020 at 3:57 PM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
> I have come across occasions when the teams use Python with Spark for ETL,
> for example processing data from S3 buckets into Snowflake with Spark.
>
> The only reason I think they are choosing Python as opposed to Scala is
> because they are more familiar with Python. Since Spark is written in
> Scala, that itself is an indication of why I think Scala has an edge.
>
> I have not done a one-to-one comparison of Spark with Scala vs Spark with
> Python. I understand that for data science purposes most libraries like
> TensorFlow etc. are written in Python, but I am at a loss to understand the
> validity of using Python with Spark for ETL purposes.
>
> These are my understandings but they are not facts, so I would like to get
> some informed views on this if I can?
>
> Many thanks,
>
> Mich
>
> LinkedIn:
> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw