Re: Scala vs Python for ETL with Spark

Mich Talebzadeh Sun, 11 Oct 2020 12:47:36 -0700

Hi,

With regard to your statement below:

"...technology choices are agnostic to use cases according to you..."

If I may say, I do not think that was the message implied. What was said
was that, in addition to "best technology fit", there are other "equally
important" factors that need to be considered when a company makes a
decision on a given product use case.

As others have stated, the technology stack you choose may not be the
best available technology but something that provides an adequate solution
at a reasonable TCO. Case in point: if Scala is the best fit for a given
use case but comes at a higher TCO (labour cost), then you may opt for
Python or another language because you have those resources available
in-house at lower cost and your Data Scientists are eager to invest in
Python. Companies these days are very careful about where they spend their
technology dollars, or they simply cancel projects altogether. From my
experience, the following are crucial in deciding what to invest in:


   - Total Cost of Ownership
   - Internal Supportability & Operability, thus avoiding a single point of
   failure
   - Maximum leverage, strategic as opposed to tactical (for example, is
   Python or Scala considered the more strategic product?)
   - Agile and DevOps compatible
   - Cloud-ready, flexible, scale-out
   - Vendor support
   - Documentation
   - Minimal footprint
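
The TCO trade-off described above can be sketched with some simple
arithmetic. This is only a toy illustration: the Option class, the helper
functions and every figure below are hypothetical and not taken from any
real project. It assumes both stacks are adequate for the use case and
that the Scala jobs run somewhat cheaper on the platform but need costlier
skills.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Option:
    """One candidate stack; every figure used with it here is hypothetical."""
    name: str
    annual_labour_cost: int    # salaries, hiring and training per year
    annual_platform_cost: int  # compute and licences per year
    adequate: bool             # does it provide an adequate solution?


def total_cost_of_ownership(option: Option, years: int) -> int:
    """Sum labour and platform costs over the planning horizon."""
    return years * (option.annual_labour_cost + option.annual_platform_cost)


def cheapest_adequate(options: Sequence[Option], years: int) -> Option:
    """Pick the lowest-TCO option among those that are adequate."""
    candidates = [o for o in options if o.adequate]
    if not candidates:
        raise ValueError("no option provides an adequate solution")
    return min(candidates, key=lambda o: total_cost_of_ownership(o, years))


options = [
    # Assumed: lower platform cost, but scarcer and pricier skills.
    Option("Scala", annual_labour_cost=500_000,
           annual_platform_cost=80_000, adequate=True),
    # Assumed: higher platform cost, but skills already in-house.
    Option("Python", annual_labour_cost=350_000,
           annual_platform_cost=100_000, adequate=True),
]

choice = cheapest_adequate(options, years=3)
print(choice.name, total_cost_of_ownership(choice, years=3))
```

With these made-up numbers the labour saving outweighs the extra platform
cost, so Python wins on TCO even if Scala is the better technical fit.
Different figures could easily reverse the outcome.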

I trust this answers your point.


Mich


LinkedIn:
https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw





*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Sun, 11 Oct 2020 at 17:39, Gourav Sengupta <gourav.sengu...@gmail.com>
wrote:

> So Mich and rest,
>
> technology choices are agnostic to use cases according to you? This is
> interesting, really interesting. Perhaps I stand corrected.
>
> Regards,
> Gourav
>
> On Sun, Oct 11, 2020 at 5:00 PM Mich Talebzadeh <mich.talebza...@gmail.com>
> wrote:
>
>> If we take Spark and its massively parallel processing and in-memory
>> cache away, then one can argue that anything can do the "ETL" job: just
>> write some Java/Scala/SQL/Perl/Python to read data from one DB and write
>> it to another, often using JDBC connections. However, we all concur that
>> this may not be good enough with Big Data volumes. Generally speaking,
>> there are two ways of making a process faster:
>>
>>
>>    1. Do more intelligent work by creating indexes, cubes, etc., thus
>>    reducing the processing time
>>    2. Throw hardware and memory at it, using something like Spark
>>    multi-cluster with a fully managed cloud service like Google Dataproc
>>
>>
>> In general, one would see an order-of-magnitude performance gain.
>>
>>
>> HTH,
>>
>>
>> Mich
>>
>>
>>
>>
>>
>>
>>
>> On Sun, 11 Oct 2020 at 13:33, ayan guha <guha.a...@gmail.com> wrote:
>>
>>> But when you have a fairly large volume of data, that is where Spark
>>> comes to the party. And I assume the requirement to use Spark is already
>>> established in the original question, and the discussion is about using
>>> Python vs Scala/Java.
>>>
>>> On Sun, 11 Oct 2020 at 10:51 pm, Sasha Kacanski <skacan...@gmail.com>
>>> wrote:
>>>
>>>> If an org has folks who can do Python seriously, why use Spark in the
>>>> first place? You can do the workflow on your own, streaming or batch or
>>>> whatever you want.
>>>> I would not do anything else aside from Python, but that is me.
>>>>
>>>> On Sat, Oct 10, 2020, 9:42 PM ayan guha <guha.a...@gmail.com> wrote:
>>>>
>>>>> I have one observation: is "Python UDF is slow due to deserialization
>>>>> penalty" still relevant, even now that Arrow is used for in-memory data
>>>>> management and the Spark dev community has invested so heavily in making
>>>>> pandas a first-class citizen, including UDFs?
>>>>>
>>>>> As I work with multiple clients, my experience is that org culture and
>>>>> available people are the most important drivers of this choice,
>>>>> regardless of the use case. The use case is relevant only when there is
>>>>> a feature disparity.
>>>>>
>>>>> On Sun, 11 Oct 2020 at 7:39 am, Gourav Sengupta <
>>>>> gourav.sengu...@gmail.com> wrote:
>>>>>
>>>>>> Not quite sure how meaningful this discussion is, but in case someone
>>>>>> is really faced with this query, the question still is 'what is the use
>>>>>> case'?
>>>>>> I am just a bit confused by the one-size-fits-all deterministic
>>>>>> approach here; I thought those days were over almost 10 years ago.
>>>>>> Regards
>>>>>> Gourav
>>>>>>
>>>>>> On Sat, 10 Oct 2020, 21:24 Stephen Boesch, <java...@gmail.com> wrote:
>>>>>>
>>>>>>> I agree with Wim's assessment of data engineering / ETL vs Data
>>>>>>> Science. I wrote pipelines/frameworks for large companies and Scala
>>>>>>> was a much better choice. But for ad-hoc work interfacing directly
>>>>>>> with data science experiments, PySpark presents less friction.
>>>>>>>
>>>>>>> On Sat, 10 Oct 2020 at 13:03, Mich Talebzadeh <
>>>>>>> mich.talebza...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Many thanks everyone for their valuable contribution.
>>>>>>>>
>>>>>>>> We all started with Spark a few years ago, when Scala was the talk
>>>>>>>> of the town. I agree with the note that as long as Spark stayed niche
>>>>>>>> and elite, someone with Scala knowledge attracted a premium. In
>>>>>>>> fairness, in 2014-2015 there was not much talk of Data Science input
>>>>>>>> (I may be wrong). But the world has moved on, so to speak. Python
>>>>>>>> itself has been around a long time (long being relative here). Most
>>>>>>>> people knew UNIX Shell, C, Python or Perl, or a combination of
>>>>>>>> these. I recall we had a director a few years ago who asked our
>>>>>>>> Hadoop admin for the root password to log in to the edge node. Later
>>>>>>>> he became head of machine learning somewhere else, and he loved C and
>>>>>>>> Python. So Python was a blessing in disguise. I think Python appeals
>>>>>>>> to those who are very familiar with the CLI and shell programming
>>>>>>>> (not GUI fans). As some members alluded to, there are more people
>>>>>>>> around with Python knowledge. Most managers choose Python as the
>>>>>>>> unifying development tool because they feel comfortable with it.
>>>>>>>> Frankly, I have not seen a manager who feels at home with Scala. So,
>>>>>>>> in summary, it is a bit disappointing to abandon Scala and switch to
>>>>>>>> Python just for the sake of it.
>>>>>>>>
>>>>>>>> Disclaimer: These are opinions and not facts so to speak :)
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>>
>>>>>>>>
>>>>>>>> Mich
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Fri, 9 Oct 2020 at 21:56, Mich Talebzadeh <
>>>>>>>> mich.talebza...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> I have come across occasions when teams use Python with Spark for
>>>>>>>>> ETL, for example processing data from S3 buckets into Snowflake
>>>>>>>>> with Spark.
>>>>>>>>>
>>>>>>>>> The only reason I think they are choosing Python as opposed to
>>>>>>>>> Scala is that they are more familiar with Python. The fact that
>>>>>>>>> Spark itself is written in Scala is an indication of why I think
>>>>>>>>> Scala has an edge.
>>>>>>>>>
>>>>>>>>> I have not done a one-to-one comparison of Spark with Scala vs
>>>>>>>>> Spark with Python. I understand that for data science purposes most
>>>>>>>>> libraries like TensorFlow etc. are written in Python, but I am at a
>>>>>>>>> loss to understand the validity of using Python with Spark for ETL
>>>>>>>>> purposes.
>>>>>>>>>
>>>>>>>>> This is my understanding rather than established fact, so I would
>>>>>>>>> like to get some informed views on this if I can.
>>>>>>>>>
>>>>>>>>> Many thanks,
>>>>>>>>>
>>>>>>>>> Mich
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>> --
>>>>> Best Regards,
>>>>> Ayan Guha
>>>>>
>>>> --
>>> Best Regards,
>>> Ayan Guha
>>>
>>
