Gourav, with all respect, I really don't want to start a conversation
about your political correctness. I don't think my comments offend
anyone in this group (including you) except big corporations. Again, I
am looking for concrete answers to my questions that can help me get
my project started, not some C-level talk. If you don't know the
answers, I'd appreciate it if you just ignored my posts...
-- ND
On 7/30/21 12:15 PM, Gourav Sengupta wrote:
Hi Artemis,
no one, and I repeat, no one, is monopolising the data science market.
In fact, almost all algorithms, code, and papers are available for
free, with the largest open-source contributions coming from Amazon,
Google, and Azure, the very companies you say are trying to monopolise
the market.
I think we owe a lot to these large corporations, which spend billions
and then open-source their products. On this mailing list, of which I
am one of the oldest members, you will receive responses from Matei
Zaharia, Reynold Xin, Burak, TD, Michael Armbrust, and so on.
I personally feel fortunate to be part of this kind of group. They are
also the founders of Databricks, which is a profit-making company, but
the innovations from Databricks are eventually given away for free
through projects led by Databricks employees. Let us please be
grateful and acknowledge their kindness if possible.
I am sure we will all find the help that we seek, but that help will
most likely also come from people who are paid and supported by the
very companies towards whom you are being so unkind.
Regards,
Gourav Sengupta
On Fri, Jul 30, 2021 at 4:02 PM Artemis User <arte...@dtechspace.com> wrote:
Thanks, Gourav, for the info. Actually, I am looking for concrete
experiences and detailed best practices from people who have built
their own GPU-powered environment instead of relying on big cloud
providers who are dominating and trying to monopolize the data
science market....
-- ND
On 7/30/21 4:37 AM, Gourav Sengupta wrote:
Hi,
there are no real cons to using SPARK with GPUs; you just have to be
careful about GPU memory and a few other details.
I have sometimes seen a 10x improvement over general SPARK 3.x
performance, and sometimes around 30x.
Not all queries will be performant with GPUs, and it is up to you to
test out scenarios specific to your workload. I use EMR for this
option, and it is really impressive what the NVIDIA folks have done.
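As a rough illustration, a minimal PySpark setup for the RAPIDS plugin
could look like the sketch below; the memory fraction, GPU counts, and
app name are illustrative assumptions, not settings anyone in this
thread has reported, so please check the spark-rapids documentation
for your release:

from pyspark.sql import SparkSession

# Sketch of enabling the RAPIDS Accelerator (spark-rapids). The
# rapids-4-spark (and, on older releases, cudf) jars must also be on
# the classpath, e.g. via --jars or spark.jars.
spark = (
    SparkSession.builder
    .appName("rapids-sketch")
    # Load the RAPIDS SQL plugin so supported operators run on the GPU
    .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
    # One GPU per executor; a fractional task amount lets tasks share it
    .config("spark.executor.resource.gpu.amount", "1")
    .config("spark.task.resource.gpu.amount", "0.25")
    # The "careful about GPU memory" part: cap the pool RAPIDS allocates
    # and limit how many tasks hit the GPU concurrently
    .config("spark.rapids.memory.gpu.allocFraction", "0.8")
    .config("spark.rapids.sql.concurrentGpuTasks", "2")
    .getOrCreate()
)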
I think there was an initial promise with the SPARK 3.x release that
SPARK dataframes could be transferred directly, through native
integration, to tensorflow and other frameworks, which would be a
brilliant way forward for SPARK, but I think the SPARK project leaders
are yet to prioritise it.
Also, Ray, another project out of Berkeley, is trying to make SPARK
dataframes transferable to tensorflow. Clearly, if SPARK users adopt
Ray to transfer SPARK dataframes to tensorflow and other frameworks,
then Ray will see massive adoption.
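As a very rough sketch of what that path could look like (this assumes
Ray with the RayDP integration installed, and that ray.data.from_spark()
is the bridging call; treat the exact API and arguments as assumptions
to verify against the Ray documentation for your version):

import ray
import raydp

ray.init()

# Spark runs on Ray via RayDP; the resource numbers are placeholders.
spark = raydp.init_spark(
    app_name="spark-to-tf-sketch",
    num_executors=2,
    executor_cores=2,
    executor_memory="4GB",
)

spark_df = spark.range(0, 1000)       # any Spark DataFrame
ds = ray.data.from_spark(spark_df)    # Spark DataFrame -> Ray Dataset
for batch in ds.iter_batches(batch_size=256):
    pass  # feed each batch into tf.data / a training loop here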
Personally, I think the SPARK community could have built this
integration with other frameworks natively, given the fantastic
contributions by NVIDIA to SPARK and such a large, active development
community; but surely Ray has to win as well, and there is nothing
better than riding on the success of SPARK. I may be wrong, though,
and the SPARK community may still be developing those integrations.
Regards,
Gourav Sengupta
On Fri, Jul 30, 2021 at 2:46 AM Artemis User <arte...@dtechspace.com> wrote:
Has anyone had any experience with running Spark-Rapids on a
GPU-powered cluster (https://github.com/NVIDIA/spark-rapids)? I am
very interested in knowing:
1. What is the hardware/software platform and the type of
Spark cluster you are using to run Spark-Rapids?
2. How easy was the installation process?
3. Are you running Scala or PySpark or both with Spark-Rapids?
4. How does the performance you've seen compare with running a
CPU-only cluster?
5. Any pros/cons of using Spark-Rapids?
Thanks a lot in advance!
-- ND