A sensible default strategy is to use the language the system itself was developed in, or a highly compatible one. For Spark that would be Scala, but I assume you don't currently know Scala as well as Python, or at all. In that case, the decision should also weigh your personal/team productivity and your project constraints. If you have the time, or you need bleeding-edge features and maximum performance, then learning or strengthening your Scala is worth it and you should use the Scala API. If you're already very productive in Python, have tighter time constraints, don't need the bleeding-edge features, and maximum performance isn't a high priority, then I'd recommend the Python API.
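To make Jon's Pandas point below concrete: the conversion really is a one-liner in PySpark. A minimal sketch (the data and app name here are made up for illustration; the one caveat is that toPandas() collects the entire result to the driver, so it only makes sense for output that fits in driver memory):

    from pyspark.sql import SparkSession

    # Hypothetical app name; any SparkSession works the same way.
    spark = (SparkSession.builder
             .master("local[*]")
             .appName("pandas-interop-sketch")
             .getOrCreate())

    # Small made-up dataset, just for illustration.
    df = spark.createDataFrame(
        [("alice", 34), ("bob", 45), ("carol", 29)],
        ["name", "age"],
    )

    # Do the heavy filtering on the Spark side first, distributed...
    over_40 = df.where(df.age > 40)

    # ...then convert: toPandas() pulls the (now small) result to the
    # driver as an ordinary pandas.DataFrame.
    pdf = over_40.toPandas()
    print(pdf.describe())

    spark.stop()

The general pattern is to push as much of the work as possible into Spark and only convert the reduced result, so the driver never has to hold the full dataset.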
On Mon, 21 Nov 2016 at 11:58 Jon Gregg <jonrgr...@gmail.com> wrote:

> Spark is written in Scala, so yes it's still the strongest option. You
> also get the Dataset type with Scala (compile time type-safety), and that's
> not an available feature with Python.
>
> That said, I think the Python API is a viable candidate if you use Pandas
> for Data Science. There are similarities between the DataFrame and Pandas
> APIs, and you can convert a Spark DataFrame to a Pandas DataFrame.
>
> On Mon, Nov 21, 2016 at 1:51 PM, Brandon White <bwwintheho...@gmail.com>
> wrote:
>
> Hello all,
>
> I will be starting a new Spark codebase and I would like to get opinions
> on using Python over Scala. Historically, the Scala API has always been the
> strongest interface to Spark. Is this still true? Are there still many
> benefits and additional features in the Scala API that are not available in
> the Python API? Are there any performance concerns using the Python API
> that do not exist when using the Scala API? Anything else I should know
> about?
>
> I appreciate any insight you have on using the Scala API over the Python
> API.
>
> Brandon