What is the use case? Unless you have unlimited funding and time to waste, you would usually start with that.
Regards,
Gourav

On Fri, Oct 9, 2020 at 10:29 PM Russell Spitzer <russell.spit...@gmail.com> wrote:

> Spark in Scala (or Java) is much more performant if you are using RDDs:
> those operations basically force you to pass lambdas, hit serialization
> between Java and Python types and, yes, hit the Global Interpreter Lock. But
> none of those things apply to DataFrames, which will generate Java code
> regardless of what language you use to describe the DataFrame operations, as
> long as you don't use Python lambdas. A DataFrame operation without Python
> lambdas should not require any remote Python code execution.
>
> TL;DR: if you are using DataFrames it doesn't matter whether you use Scala,
> Java, Python, R or SQL; the planning and work will all happen in the JVM.
>
> As for a REPL, you can run PySpark, which will start up a REPL. There are
> also a slew of notebooks which provide interactive Python environments.
>
> On Fri, Oct 9, 2020 at 4:19 PM Mich Talebzadeh <mich.talebza...@gmail.com>
> wrote:
>
>> Thanks
>>
>> So, ignoring Python lambdas, is an individual's familiarity with the
>> language the most important factor? Also, I have noticed that the Spark
>> documentation has switched from Scala to Python for the first example.
>> However, some code, for example JDBC calls, is the same for Scala and
>> Python.
>>
>> Some sources, like this website
>> <https://www.kdnuggets.com/2018/05/apache-spark-python-scala.html#:~:text=Scala%20is%20frequently%20over%2010,languages%20are%20faster%20than%20interpreted.>
>> claim that Scala performance is an order of magnitude better than Python
>> and that Scala is also the better choice when it comes to concurrency.
>> Maybe it is pretty old (2018)?
>>
>> Also (and this may be my ignorance, as I have not researched it), does
>> Spark offer a REPL in the form of spark-shell with Python?
>>
>> Regards,
>>
>> Mich
>>
>> LinkedIn
>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>> On Fri, 9 Oct 2020 at 21:59, Russell Spitzer <russell.spit...@gmail.com>
>> wrote:
>>
>>> As long as you don't use Python lambdas in your Spark job there should
>>> be almost no difference between the Scala and Python DataFrame code. Once
>>> you introduce Python lambdas you will hit some significant serialization
>>> penalties and will also have to run the actual work code in Python. As
>>> long as no lambdas are used, everything will operate with Catalyst-compiled
>>> Java code, so there won't be a big difference between Python and Scala.
>>>
>>> On Fri, Oct 9, 2020 at 3:57 PM Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
>>>> I have come across occasions when teams use Python with Spark for
>>>> ETL, for example processing data from S3 buckets into Snowflake with Spark.
>>>>
>>>> The only reason I think they are choosing Python as opposed to Scala is
>>>> that they are more familiar with Python. The fact that Spark itself is
>>>> written in Scala is one reason I think Scala has an edge.
>>>>
>>>> I have not done a one-to-one comparison of Spark with Scala vs Spark with
>>>> Python. I understand that for data science purposes most libraries, like
>>>> TensorFlow etc., are written in Python, but I am at a loss to understand
>>>> the validity of using Python with Spark for ETL purposes.
>>>>
>>>> This is my understanding rather than established fact, so I would like
>>>> to get some informed views on this if I can.
>>>>
>>>> Many thanks,
>>>>
>>>> Mich
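
Russell's distinction between DataFrame operations and Python lambdas can be sketched in PySpark roughly as follows. This is only an illustration, assuming a local PySpark install; the names `shout` and `compare_plans` and the sample data are hypothetical and do not come from the thread. The first query uses a built-in column function, so it should be planned and run entirely in the JVM; the second wraps a plain Python function in a UDF, which is the case where rows have to be serialized out to a Python worker and back.

```python
def shout(value):
    """Plain Python function (hypothetical example). Wrapping it in a UDF
    forces Spark to ship rows to a Python worker to evaluate it."""
    return None if value is None else value.upper()


def compare_plans():
    # Imported here so that shout() can be used without a Spark install.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType

    spark = (
        SparkSession.builder.master("local[1]")
        .appName("builtin-vs-python-udf")
        .getOrCreate()
    )
    df = spark.createDataFrame([("alice",), ("bob",)], ["name"])

    # Built-in column expression: no Python code runs per row.
    builtin = df.select(F.upper(F.col("name")).alias("name_upper"))

    # Python UDF: same result, but each batch of rows is serialized to a
    # Python worker process, evaluated there, and sent back to the JVM.
    shout_udf = F.udf(shout, StringType())
    with_udf = df.select(shout_udf(F.col("name")).alias("name_upper"))

    # Print both physical plans; the UDF plan typically contains an extra
    # Python evaluation step (shown as BatchEvalPython) that the built-in
    # plan does not have.
    builtin.explain()
    with_udf.explain()

    rows = (builtin.collect(), with_udf.collect())
    spark.stop()
    return rows
```

Calling `compare_plans()` prints the two plans side by side. Both queries return the same rows; the difference Russell describes shows up only in how the work is executed.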