Re: GraphFrames Hackathon on Friday, February 21

2025-02-01 Thread Holden Karau
Can you stop double sending to Spark mailing lists & the group you’ve created? I get email errors responding to the Google group. Twitter: https://twitter.com/holdenkarau Fight Health Insurance: https://www.fighthealthinsurance.com/ Books (L

Re: Drop Python 2 support from GraphFrames?

2025-01-31 Thread Holden Karau
We no longer support Python 2 in Spark.

Re: LLM based data pre-processing

2025-01-03 Thread Holden Karau
it relevant to his query? > > Thanks, > Russell > > On Fri, Jan 3, 2025 at 9:03 AM Holden Karau > wrote: > >> So I've been working on similar LLM pre-processing of data and I would >> say one of the questions worth answering is do you want/need your models t

Re: LLM based data pre-processing

2025-01-03 Thread Holden Karau
So I've been working on similar LLM pre-processing of data and I would say one of the questions worth answering is do you want/need your models to be collocated? If you're running on prem in a GPU rich env there are a lot of benefits, but even with a custom model, if you're using 3rd party inference or

Re: [DISCUSS] Deprecate GraphX OR Find new maintainers interested in GraphX OR leave it as is?

2024-11-12 Thread Holden Karau
So it’s deprecated but I will review some basic GraphX PRs as I would like us to bring GraphX back to life, but under our current release structure we need to deprecate now if we want to be able to remove it in the next few years.

Re: Spark 4.0 - Expected Stable Release Date

2024-10-22 Thread Holden Karau
But also this is software development and release dates may slip.

Re: [DISCUSS] Deprecate GraphX OR Find new maintainers interested in GraphX OR leave it as is?

2024-10-07 Thread Holden Karau
just blog all about the motif matching in GraphFrames: > > https://blog.graphlet.ai/financial-crime-and-corruption-network-motifs-4cf2e8e10eb5 > > Russ > > On Mon, Oct 7, 2024 at 5:38 PM Holden Karau > wrote: > >> So this discuss thread and the vote thread to deprecat

Re: [DISCUSS] Deprecate GraphX OR Find new maintainers interested in GraphX OR leave it as is?

2024-10-07 Thread Holden Karau
way to raise visibility here? > > On Mon, Oct 7, 2024 at 4:24 PM Holden Karau > wrote: > >> There are no specific tickets associated with the lack of maintaince or >> this as the component has not been maintained for a sufficiently long time. >> If your interested in

Re: [DISCUSS] Deprecate GraphX OR Find new maintainers interested in GraphX OR leave it as is?

2024-10-07 Thread Holden Karau

Re: [DISCUSS] Deprecate GraphX OR Find new maintainers interested in GraphX OR leave it as is?

2024-10-05 Thread Holden Karau
their code. > > I think that’s probably the way to go. > > El dom, 6 oct 2024 a las 6:09, Holden Karau () > escribió: > >> So removing GraphX from Spark would not prevent GraphFrames from >> continuing, they could pick up the GraphX source and incorporate it into >

Re: [DISCUSS] Deprecate GraphX OR Find new maintainers interested in GraphX OR leave it as is?

2024-10-05 Thread Holden Karau
; >> > >>>> >> > On Fri, Oct 4, 2024 at 4:56 PM Mark Hamstra >>>> wrote: >>>> >> >> >>>> >> >> I'm -1(*) because, while it technically means "might be removed >>>> in the >>>> >> >> future&

Re: issue forwarding SPARK_CONF_DIR to start workers

2024-07-20 Thread Holden Karau
This might be a good discussion for the dev@ list; I don’t know much about SLURM deployments personally.

Re: Spark 4.0 Query Analyzer Bug Report

2024-02-20 Thread Holden Karau
Do you mean Spark 3.4? 4.0 is very much not released yet. Also it would help if you could share your query & more of the logs leading up to the error. On Tue, Feb 20, 2024 at 3:07 PM Sharma, Anup wrote: > Hi Spark team, > > > > We ran into a dataframe issue after upgrading from spark 3.1 to 4.

Re: Introducing Comet, a plugin to accelerate Spark execution via DataFusion and Arrow

2024-02-13 Thread Holden Karau
This looks really cool :) Out of interest what are the differences in the approach between this and Gluten? On Tue, Feb 13, 2024 at 12:42 PM Chao Sun wrote: > Hi all, > > We are very happy to announce that Project Comet, a plugin to > accelerate Spark query execution via leveraging DataFusion a

Re: Spark-Connect: Param `--packages` does not take effect for executors.

2023-12-04 Thread Holden Karau
So I think this sounds like a bug to me, in the help options for both regular spark-submit and ./sbin/start-connect-server.sh we say: " --packages Comma-separated list of maven coordinates of jars to include on the driver and executor classpaths. Will

Re: Classpath isolation per SparkSession without Spark Connect

2023-11-27 Thread Holden Karau
So I don’t think we make any particular guarantees around class path isolation there, so even if it does work it’s something you’d need to pay attention to on upgrades. Class path isolation is tricky to get right. On Mon, Nov 27, 2023 at 2:58 PM Faiz Halde wrote: > Hello, > > We are using spark

Re: Write Spark Connection client application in Go

2023-09-12 Thread Holden Karau
That’s so cool! Great work y’all :) On Tue, Sep 12, 2023 at 8:14 PM bo yang wrote: > Hi Spark Friends, > > Anyone interested in using Golang to write Spark application? We created a > Spark > Connect Go Client library . > Would love to hear feedback/t

Re: Elasticsearch support for Spark 3.x

2023-08-27 Thread Holden Karau
What’s the version of the ES connector you are using? On Sat, Aug 26, 2023 at 10:17 AM Dipayan Dev wrote: > Hi All, > > We're using Spark 2.4.x to write dataframe into the Elasticsearch index. > As we're upgrading to Spark 3.3.0, it throwing out error > Caused by: java.lang.ClassNotFoundExceptio

Re: Dynamic allocation does not deallocate executors

2023-08-08 Thread Holden Karau
for any > loss, damage or destruction of data or any other property which may arise > from relying on this email's technical content is explicitly disclaimed. > The author will in no case be liable for any monetary damages arising from > such lo

Re: Dynamic allocation does not deallocate executors

2023-08-07 Thread Holden Karau
I think you need to change "spark.dynamicAllocation.shuffleTracking.enabled=true" to false. On Mon, Aug 7, 2023 at 2:50 AM Mich Talebzadeh wrote: > Yes I have seen cases where the driver gone but a couple of executors > hanging on. Sounds like a code issue. > > HTH > > Mich Talebzadeh, > Solutions

Re: SPIP: Shutting down spark structured streaming when the streaming process completed current process

2023-02-18 Thread Holden Karau
Is there someone focused on streaming work these days who would want to shepherd this? On Sat, Feb 18, 2023 at 5:02 PM Dongjoon Hyun wrote: > Thank you for considering me, but may I ask what makes you think to put me > there, Mich? I'm curious about your reason. > > > I have put dongjoon.hyun as

Re: [PySpark] Reader/Writer for bgzipped data

2022-12-06 Thread Holden Karau
> > On Tue, Dec 6, 2022 at 9:22 AM Holden Karau wrote: > >> There is the splittable gzip Hadoop input format, maybe someone could >> extend that to use support bgzip? >> >> On Tue, Dec 6, 2022 at 1:43 PM Oliver Ruebenacker < >> oliv...@broadinstitute.org>

Re: [PySpark] Reader/Writer for bgzipped data

2022-12-06 Thread Holden Karau
There is the splittable gzip Hadoop input format, maybe someone could extend that to use support bgzip? On Tue, Dec 6, 2022 at 1:43 PM Oliver Ruebenacker < oliv...@broadinstitute.org> wrote: > > Hello Chris, > > Yes, you can use gunzip/gzip to uncompress a file created by bgzip, but > to s

Re: Dataproc serverless for Spark

2022-11-28 Thread Holden Karau
This sounds like a great question for the Google Dataproc folks (I know there was some interesting work being done around it but I left before it was finished so I don't want to provide a possibly incorrect answer). If you're a GCP customer try reaching out to their support for details. On Mon, Nov

Re: Dynamic Scaling without Kubernetes

2022-10-26 Thread Holden Karau
So Spark can dynamically scale on YARN, but standalone mode becomes a bit complicated — where do you envision Spark gets the extra resources from? On Wed, Oct 26, 2022 at 12:18 PM Artemis User wrote: > Has anyone tried to make a Spark cluster dynamically scalable, i.e., > adding a new worker nod

Re: Jupyter notebook on Dataproc versus GKE

2022-09-06 Thread Holden Karau
rise from relying on this email's technical content is explicitly >>> disclaimed. The author will in no case be liable for any monetary damages >>> arising from such loss, damage or destruction. >>> >>> >>> >>> >>> On Mon, 5 Sept 2022 at 20

Re: Jupyter notebook on Dataproc versus GKE

2022-09-05 Thread Holden Karau
ion of data or any other property which may arise > from relying on this email's technical content is explicitly disclaimed. > The author will in no case be liable for any monetary damages arising from > such loss, damage or destruction. > > > > > On Mon, 5 Sept 2022 at 1

Re: Jupyter notebook on Dataproc versus GKE

2022-09-05 Thread Holden Karau
I’ve run Jupyter w/Spark on K8s, haven’t tried it with Dataproc personally. The Spark K8s pod scheduler is now more pluggable, so YuniKorn and Volcano can be used with less effort. On Mon, Sep 5, 2022 at 7:44 AM Mich Talebzadeh wrote: > > Hi, > > > Has anyone got experience of running Jupyter o

Re: Reverse proxy for Spark UI on Kubernetes

2022-05-17 Thread Holden Karau
Could we make it do the same sort of history server fallback approach? On Tue, May 17, 2022 at 10:41 PM bo yang wrote: > It is like Web Application Proxy in YARN ( > https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/WebApplicationProxy.html), > to provide easy access for Spark U

Re: Reverse proxy for Spark UI on Kubernetes

2022-05-16 Thread Holden Karau
Oh that’s rad 😊 On Tue, May 17, 2022 at 7:47 AM bo yang wrote: > Hi Spark Folks, > > I built a web reverse proxy to access Spark UI on Kubernetes (working > together with https://github.com/GoogleCloudPlatform/spark-on-k8s-operator). > Want to share here in case other people have similar need. >

Re: Unable to access Google buckets using spark-submit

2022-02-12 Thread Holden Karau
You can also put the GS access jar with your Spark jars — that’s what the class not found exception is pointing you towards. On Fri, Feb 11, 2022 at 11:58 PM Mich Talebzadeh wrote: > BTW I also answered you in in stackoverflow : > > > https://stackoverflow.com/questions/71088934/unable-to-access

Re: Spark 3.1.2 full thread dumps

2022-02-04 Thread Holden Karau
We don’t block scaling up after node failure in classic Spark if that’s the question. On Fri, Feb 4, 2022 at 6:30 PM Mich Talebzadeh wrote: > From what I can see in auto scaling setup, you will always need a min of > two worker nodes as primary. It also states and I quote "Scaling primary > work

Re: Log4j 1.2.17 spark CVE

2021-12-12 Thread Holden Karau
My understanding is it only applies to log4j 2+ so we don’t need to do anything. On Sun, Dec 12, 2021 at 8:46 PM Pralabh Kumar wrote: > Hi developers, users > > Spark is built using log4j 1.2.17 . Is there a plan to upgrade based on > recent CVE detected ? > > > Regards > Pralabh kumar > -- Tw

Re: Choice of IDE for Spark

2021-10-01 Thread Holden Karau
Personally I like Jupyter notebooks for my interactive work and then once I’ve done my exploration I switch back to emacs with either scala-metals or Python mode. I think the main takeaway is: do what feels best for you, there is no one true way to develop in Spark. On Fri, Oct 1, 2021 at 1:28 AM

Drop-In Virtual Office Hour round 2 :)

2021-09-28 Thread Holden Karau
Hi Folks, I'm going to do another drop-in virtual office hour and I've made a public google calendar to track them so hopefully it's easier for folks to add events https://calendar.google.com/calendar/?cid=cXBubTY3Z2VzcmNjbnEzOWIzb3RyOWI1am9AZ3JvdXAuY2FsZW5kYXIuZ29vZ2xlLmNvbQ or ics feed at https:

Re: Drop-In Virtual Office Half-Hour

2021-09-20 Thread Holden Karau
Hey folks I'm doing my drop-in half-hour now - http://meet.google.com/ccd-mkbd-gfv :) On Mon, Sep 13, 2021 at 4:12 PM Holden Karau wrote: > Hi Folks, > > I'm going to experiment with a drop-in virtual half-hour office hour type > thing next Monday, if you've got any b

Re: Drop-In Virtual Office Half-Hour

2021-09-17 Thread Holden Karau
Meet joining info Video call link: https://meet.google.com/ccd-mkbd-gfv On Mon, Sep 13, 2021 at 4:12 PM Holden Karau wrote: > Hi Folks, > > I'm going to experiment with a drop-in virtual half-hour office hour type > thing next Monday, if you've got any burning Spark or general O

Re: Drop-In Virtual Office Half-Hour

2021-09-13 Thread Holden Karau
, Sep 13, 2021 at 5:11 PM Holden Karau wrote: > Ah thanks for pointing that out. I changed the visibility on it to public > so it should work now. > > On Mon, Sep 13, 2021 at 4:26 PM Gourav Sengupta > wrote: > >> Hi Holden, >> >> This is such a wonderful op

Re: Drop-In Virtual Office Half-Hour

2021-09-13 Thread Holden Karau
s, > Gourav > > On Tue, Sep 14, 2021 at 12:13 AM Holden Karau > wrote: > >> Hi Folks, >> >> I'm going to experiment with a drop-in virtual half-hour office hour type >> thing next Monday, if you've got any burning Spark or general OSS questions >&g

Drop-In Virtual Office Half-Hour

2021-09-13 Thread Holden Karau
Hi Folks, I'm going to experiment with a drop-in virtual half-hour office hour type thing next Monday, if you've got any burning Spark or general OSS questions you haven't had the time to ask anyone else I hope you'll swing by and join me. If no one comes with questions I'll tour some of the Spark

Re: Spark on Kubernetes scheduler variety

2021-07-08 Thread Holden Karau
2021 at 8:56 AM Holden Karau wrote: > That's awesome, I'm just starting to get context around Volcano but maybe > we can schedule an initial meeting for all of us interested in pursuing > this to get on the same page. > > On Wed, Jun 23, 2021 at 6:54 PM Klaus Ma wrote: &g

Re: Spark on Kubernetes scheduler variety

2021-06-24 Thread Holden Karau
sclaimer:* Use it at your own risk. Any and all responsibility for >> any loss, damage or destruction of data or any other property which may >> arise from relying on this email's technical content is explicitly >> disclaimed. The author will in no case be liable for any monet

Re: CVEs

2021-06-21 Thread Holden Karau
If you get to a point where you find something you think is highly likely a valid vulnerability the best path forward is likely reaching out to private@ to figure out how to do a security release. On Mon, Jun 21, 2021 at 4:42 PM Eric Richardson wrote: > Thanks for the quick reply. Yes, since it

Re: Scala vs Python for ETL with Spark

2020-10-17 Thread Holden Karau
Scala and Python have their advantages and disadvantages with Spark. In my experience, when performance is super important you’ll end up needing to do some of your work in the JVM, but in many situations what matters more is what your team and company are familiar with and the ecosystem of tooling

Re: REST Structured Steaming Sink

2020-07-01 Thread Holden Karau
xplicitly tune. > . foreachWriter is typically used for such use cases, not foreachBatch. > It's also pretty hard to guarantee exactly-once, rate limiting, etc. > > Best, > Burak > > On Wed, Jul 1, 2020 at 5:54 PM Holden Karau wrote: > >> I think adding someth

Re: REST Structured Steaming Sink

2020-07-01 Thread Holden Karau
I think adding something like this (if it doesn't already exist) could help make structured streaming easier to use, foreachBatch is not the best API. On Wed, Jul 1, 2020 at 2:21 PM Jungtaek Lim wrote: > I guess the method, query parameter, header, and the payload would be all > different for al

[ANNOUNCE] Apache Spark 2.4.6 released

2020-06-10 Thread Holden Karau
We are happy to announce the availability of Spark 2.4.6! Spark 2.4.6 is a maintenance release containing stability, correctness, and security fixes. This release is based on the branch-2.4 maintenance branch of Spark. We strongly recommend all 2.4 users to upgrade to this stable release. To down

Re: Spark API and immutability

2020-05-25 Thread Holden Karau
So even on RDDs cache/persist mutate the RDD object. The important thing for Spark is that the data represented/in the RDD/Dataframe isn’t mutated. On Mon, May 25, 2020 at 10:56 AM Chris Thomas wrote: > > The cache() method on the DataFrame API caught me out. > > Having learnt that DataFrames a

Re: Watch "Airbus makes more of the sky with Spark - Jesse Anderson & Hassene Ben Salem" on YouTube

2020-04-25 Thread Holden Karau
Also it’s ok if Spark and Flink evolve in different directions, we're both part of the same open source foundation. Sometimes being everything to everyone isn’t as important as being the best at what you need. I like to think of our relationship with other Apache projects as less competitive and mo

Re: Copyright Infringment

2020-04-25 Thread Holden Karau
ache >>> foundation's free licence agreement ? >>> >>> >>> >>> On Sat, 25 Apr 2020, 16:18 Sean Owen, wrote: >>> >>>> You'll want to ask the authors directly ; the book is not produced by >>>> the project

Re: Copyright Infringment

2020-04-25 Thread Holden Karau
ion because I do not want to commit an unlawful act. >>> Can you please clarify if I would be infringing copyright due to this >>> text. >>> *Book: High Performance Spark * >>> *authors: holden Karau Rachel Warren.* >>> *page xii:* >>> >>> *

Re: Going it alone.

2020-04-16 Thread Holden Karau
I want to be clear I believe the language in janethrope1s email is unacceptable for the mailing list and possibly a violation of the Apache code of conduct. I’m glad we don’t see messages like this often. I know this is a stressful time for many of us, but let’s try and do our best to not take it

Re: SPARK Suitable IDE

2020-03-04 Thread Holden Karau
I work in emacs with ensime. I think really any IDE is ok, so go with the one you feel most at home in. On Wed, Mar 4, 2020 at 5:49 PM tianlangstudio wrote: > We use IntelliJ IDEA,Whether it's Java, Scala or Python > > >

Re: PySpark Pandas UDF

2019-11-12 Thread Holden Karau
Thanks for sharing that. I think we should maybe add some checks around this so it’s easier to debug. I’m CCing Bryan who might have some thoughts. On Tue, Nov 12, 2019 at 7:42 AM gal.benshlomo wrote: > SOLVED! > thanks for the help - I found the issue. it was the version of pyarrow > (0.15.1) w

Re: PySpark Pandas UDF

2019-11-10 Thread Holden Karau
Can you switch the write for a count just so we can isolate if it’s the write or the count? Also what’s the output path you're using? On Sun, Nov 10, 2019 at 7:31 AM Gal Benshlomo wrote: > > > Hi, > > > > I’m using pandas_udf and not able to run it from cluster mode, even though > the same code wo

Re: Why Spark generates Java code and not Scala?

2019-11-10 Thread Holden Karau
If you look inside of the generation we generate java code and compile it with Janino. For interested folks the conversation moved over to the dev@ list On Sat, Nov 9, 2019 at 10:37 AM Marcin Tustin wrote: > What do you mean by this? Spark is written in a combination of Scala and > Java, and the

Re: pyspark - memory leak leading to OOM after submitting 100 jobs?

2019-11-01 Thread Holden Karau
On Thu, Oct 31, 2019 at 10:04 PM Nicolas Paris wrote: > have you deactivated the spark.ui ? > I have read several thread explaining the ui can lead to OOM because it > stores 1000 dags by default > > > On Sun, Oct 20, 2019 at 03:18:20AM -0700, Paul Wais wrote: > > Dear List, > > > > I've observed

Re: Loop through Dataframes

2019-10-06 Thread Holden Karau
So if you want to process the contents of a dataframe locally but not pull all of the data back at once, toLocalIterator is probably what you're looking for. It's still not great though, so maybe you can share the root problem which you're trying to solve and folks might have some suggestions there. O

Re: Announcing .NET for Apache Spark 0.5.0

2019-09-30 Thread Holden Karau
Congratulations on the release :) On Mon, Sep 30, 2019 at 9:38 AM Terry Kim wrote: > We are thrilled to announce that .NET for Apache Spark 0.5.0 has been just > released ! > > > > Some of the highlights of this release include: > >- Delta

Re: Release Apache Spark 2.4.4

2019-08-14 Thread Holden Karau
>> [SPARK-27234][SS][PYTHON] Use InheritableThreadLocal for current epoch in >> EpochTracker (to support Python UDFs) >> <https://github.com/apache/spark/pull/24946> >> >> Thanks, >> Terry >> >> On Tue, Aug 13, 2019 at 10:24 PM Wenchen Fan wrote: >> >

Re: Release Apache Spark 2.4.4

2019-08-13 Thread Holden Karau
+1 Does anyone have any critical fixes they’d like to see in 2.4.4? On Tue, Aug 13, 2019 at 5:22 PM Sean Owen wrote: > Seems fine to me if there are enough valuable fixes to justify another > release. If there are any other important fixes imminent, it's fine to > wait for those. > > > On Tue, A

Re: Should python-2 be supported in Spark 3.0?

2019-05-31 Thread Holden Karau
+1 On Fri, May 31, 2019 at 5:41 PM Bryan Cutler wrote: > +1 and the draft sounds good > > On Thu, May 30, 2019, 11:32 AM Xiangrui Meng wrote: > >> Here is the draft announcement: >> >> === >> Plan for dropping Python 2 support >> >> As many of you already knew, Python core development team and

Re: How to preserve event order per key in Structured Streaming Repartitioning By Key?

2018-12-11 Thread Holden Karau
So it's been a while since I poked at the streaming code base, but I don't think we make any promises about stable sort during repartition, and there are notes in there about how some of these components should be re-written into core so even if we did have stable sort I wouldn't depend on it unless it

Re: [Spark Shell on AWS K8s Cluster]: Is there more documentation regarding how to run spark-shell on k8s cluster?

2018-11-15 Thread Holden Karau
If folks are interested, while it's not on Amazon, I've got a live stream of getting client mode with Jupyter notebook to work on GCP/GKE: https://www.youtube.com/watch?v=eMj0Pv1-Nfo&index=3&list=PLRLebp9QyZtZflexn4Yf9xsocrR_aSryx On Wed, Oct 31, 2018 at 5:55 PM Zhang, Yuqi wrote: > Hi Li, > > >

Re: Is there any Spark source in Java

2018-11-03 Thread Holden Karau
Parts of it are indeed written in Java. You probably want to reach out to the developers list to talk about changing Spark. On Sat, Nov 3, 2018, 11:42 AM Soheil Pourbafrani Hi, I want to customize some part of Spark. I was wondering if there any > Spark source is written in Java language, or all

Code review and Coding livestreams today

2018-10-12 Thread Holden Karau
I’ll be doing my regular weekly code review at 10am Pacific today - https://youtu.be/IlH-EGiWXK8 with a look at the current RC, and in the afternoon at 3pm Pacific I’ll be doing some live coding around the WIP graceful decommissioning PR - https://youtu.be/4FKuYk2sbQ8

Re: Live Streamed Code Review today at 11am Pacific

2018-09-20 Thread Holden Karau
order batches) is my current plan to start with :) On Thu, Jul 19, 2018 at 11:38 PM Holden Karau wrote: > Heads up tomorrows Friday review is going to be at 8:30 am instead of 9:30 > am because I had to move some flights around. > > On Fri, Jul 13, 2018 at 12:03 PM, Holden Karau &

Re: Use Arrow instead of Pickle without pandas_udf

2018-07-25 Thread Holden Karau
Not currently. What's the problem with pandas_udf for your use case? On Wed, Jul 25, 2018 at 1:27 PM, Hichame El Khalfi wrote: > Hi There, > > > Is there a way to use Arrow format instead of Pickle but without using > pandas_udf ? > > > Thank for your help, > > > Hichame > -- Twitter: https:

Live Code Reviews, Coding, and Dev Tools

2018-07-24 Thread Holden Karau
Tomorrow afternoon @ 3pm pacific I'll be doing some dev tools poking for Beam and Spark - https://www.youtube.com/watch?v=6cTmC_fP9B0 for mention-bot. On Friday I'll be doing my normal code reviews - https://www.youtube.com/watch?v=O4rRx-3PTiM On Monday July 30th @ 9:30am I'll be doing some more

Re: Live Streamed Code Review today at 11am Pacific

2018-07-19 Thread Holden Karau
Heads up, tomorrow's Friday review is going to be at 8:30 am instead of 9:30 am because I had to move some flights around. On Fri, Jul 13, 2018 at 12:03 PM, Holden Karau wrote: > This afternoon @ 3pm pacific I'll be looking at review tooling for Spark & > Beam https://www.yout

Re: Pyspark access to scala/java libraries

2018-07-15 Thread Holden Karau
If you want to see some examples, this library shows a way to do it - https://github.com/sparklingpandas/sparklingml and High Performance Spark also talks about it. On Sun, Jul 15, 2018, 11:57 AM <0xf0f...@protonmail.com.invalid> wrote: > Check > https://stackoverflow.com/questions/31684842/callin

Re: Live Streamed Code Review today at 11am Pacific

2018-07-13 Thread Holden Karau
ySpark and working on Sparkling ML - https://www.youtube.com/watch?v=kCnBDpNce9A&list=PLRLebp9QyZtYF46jlSnIu2x1NDBkKa2uw&index=32 On Wed, Jun 27, 2018 at 10:44 AM, Holden Karau wrote: > Today @ 1:30pm pacific I'll be looking at the current Spark 2.1.3 RC and > see how we validate S

[ANNOUNCE] Apache Spark 2.1.3

2018-07-01 Thread Holden Karau
We are happy to announce the availability of Spark 2.1.3! Apache Spark 2.1.3 is a maintenance release, based on the branch-2.1 maintenance branch of Spark. We strongly recommend all 2.1.x users to upgrade to this stable release. The release notes are available at http://spark.apache.org/releases/s

Re: Live Streamed Code Review today at 11am Pacific

2018-06-27 Thread Holden Karau
.com/user/holdenkarau & https://www.twitch.tv/holdenkarau/events . Hopefully this can encourage more folks to help with RC validation & PR reviews :) On Thu, Jun 14, 2018 at 6:07 AM, Holden Karau wrote: > Next week is pride in San Francisco but I'm still going to do two quick > sess

Re: Live Streamed Code Review today at 11am Pacific

2018-06-14 Thread Holden Karau
d the other will be the regular Friday code review ( https://www.youtube.com/watch?v=IAWm4OLRoyY / https://www.twitch.tv/events/v0qzXxnNQ_K7a8JYFsIiKQ ) also at 9am. On Thu, Jun 7, 2018 at 9:10 PM, Holden Karau wrote: > I'll be doing another one tomorrow morning at 9am pacific focused on &

Re: Live Streamed Code Review today at 11am Pacific

2018-06-07 Thread Holden Karau
I'll be doing another one tomorrow morning at 9am pacific focused on Python + K8s support & improved JSON support - https://www.youtube.com/watch?v=Z7ZEkvNwneU & https://www.twitch.tv/events/xU90q9RGRGSOgp2LoNsf6A :) On Fri, Mar 9, 2018 at 3:54 PM, Holden Karau wrote: > If anyon

Spark ML online serving

2018-06-06 Thread Holden Karau
At Spark Summit some folks were talking about model serving and we wanted to collect requirements from the community.

Re: Dataframe from 1.5G json (non JSONL)

2018-06-05 Thread Holden Karau
If it’s one 33mb file which decompressed to 1.5g then there is also a chance you need to split the inputs since gzip is a non-splittable compression format. On Tue, Jun 5, 2018 at 11:55 AM Anastasios Zouzias wrote: > Are you sure that your JSON file has the right format? > > spark.read.json(...)
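A minimal sketch of why a single gzip file cannot be split across tasks, using only the Python standard library rather than Spark. The payload and the helper name `can_decompress` are made up for illustration; the point is only that a gzip stream decodes from its start and not from an arbitrary offset, which is why one .gz file generally ends up being read by one task.

```python
import gzip
import zlib

# Hypothetical payload standing in for a large JSON file.
payload = b'{"id": 1, "value": "example"}\n' * 10_000
compressed = gzip.compress(payload)


def can_decompress(chunk: bytes) -> bool:
    """Return True if the chunk decodes on its own as a complete gzip stream."""
    try:
        gzip.decompress(chunk)
        return True
    except (OSError, EOFError, zlib.error):
        # A chunk cut from the middle has no gzip header and no usable
        # decompressor state, so decoding fails.
        return False


# The whole file decodes, but a "split" starting halfway through does not.
whole_ok = can_decompress(compressed)
second_half_ok = can_decompress(compressed[len(compressed) // 2:])
print(whole_ok, second_half_ok)
```

If the data can be re-exported, writing it as several smaller files or in a splittable format is one way around this; repartitioning after the read is another, at the cost of a single task doing the initial decompression.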

Re: testing frameworks

2018-05-30 Thread Holden Karau
So Jesse has an excellent blog post on how to use it with Java applications - http://www.jesse-anderson.com/2016/04/unit-testing-spark-with-java/ On Wed, May 30, 2018 at 4:14 AM Spico Florin wrote: > Hello! > I'm also looking for unit testing spark Java application. I've seen the > great work

Re: testing frameworks

2018-05-21 Thread Holden Karau
So I’m biased as the author of spark-testing-base but I think it’s pretty ok. Are you looking for unit or integration or something else? On Mon, May 21, 2018 at 5:24 AM Steve Pruitt wrote: > Hi, > > > > Can anyone recommend testing frameworks suitable for Spark jobs. > Something that can be inte

Re: [Spark on Google Kubernetes Engine] Properties File Error

2018-04-30 Thread Holden Karau
So, while it's not perfect, I have a guide focused on running custom Spark on GKE https://cloud.google.com/blog/big-data/2018/03/testing-future-apache-spark-releases-and-changes-on-google-kubernetes-engine-and-cloud-dataproc and if you want to run pre-built Spark on GKE there is a solutions article

Re: Live Stream Code Reviews :)

2018-04-13 Thread Holden Karau
> zone I guess. > > Regards, > Gourav Sengupta > > On Thu, Apr 12, 2018 at 8:23 PM, Holden Karau > wrote: > >> Hi Y'all, >> >> If your interested in learning more about how the development process in >> Apache Spark works I've been doin

Re: Live Stream Code Reviews :)

2018-04-12 Thread Holden Karau
ezone? >> >> Il gio 12 apr 2018, 21:23 Holden Karau ha scritto: >> >>> Hi Y'all, >>> >>> If your interested in learning more about how the development process in >>> Apache Spark works I've been doing a weekly live streamed code re

Live Stream Code Reviews :)

2018-04-12 Thread Holden Karau
Hi Y'all, If you're interested in learning more about how the development process in Apache Spark works I've been doing a weekly live streamed code review most Fridays at 11am. This week's will be on twitch/youtube ( https://www.twitch.tv/holdenkarau / https://www.youtube.com/watch?v=vGVSa9KnD80 ). I

Re: Open sourcing Sparklens: Qubole's Spark Tuning Tool

2018-03-21 Thread Holden Karau
Super exciting! I look forward to digging through it this weekend. On Wed, Mar 21, 2018 at 9:33 PM ☼ R Nair (रविशंकर नायर) < ravishankar.n...@gmail.com> wrote: > Excellent. You filled a missing link. > > Best, > Passion > > On Wed, Mar 21, 2018 at 11:36 PM, Rohit Karlupia > wrote: > >> Hi, >> >>

Re: Live Streamed Code Review today at 11am Pacific

2018-03-09 Thread Holden Karau
If anyone wants to watch the recording: https://www.youtube.com/watch?v=lugG_2QU6YU I'll do one next week as well - March 16th @ 11am - https://www.youtube.com/watch?v=pXzVtEUjrLc On Fri, Mar 9, 2018 at 9:28 AM, Holden Karau wrote: > Hi folks, > > If your curious in learning

Live Streamed Code Review today at 11am Pacific

2018-03-09 Thread Holden Karau
Hi folks, If you're curious about how Spark is developed, I’m going to experiment with doing a live code review where folks can watch and see how that part of our process works. I have two volunteers already for having their PRs looked at live, and if you have a Spark PR you're working on

Re: Spark not releasing shuffle files in time (with very large heap)

2018-02-23 Thread Holden Karau
You can also look at the shuffle file cleanup tricks we do inside of the ALS algorithm in Spark. On Fri, Feb 23, 2018 at 6:20 PM, vijay.bvp wrote: > have you looked at > http://apache-spark-user-list.1001560.n3.nabble.com/Limit- > Spark-Shuffle-Disk-Usage-td23279.html > > and the post mentioned

Re: Can spark handle this scenario?

2018-02-16 Thread Holden Karau
I'm not sure what you mean by it could be hard to serialize complex operations? Regardless I think the question is do you want to parallelize this on multiple machines or just one? On Feb 17, 2018 4:20 PM, "Lian Jiang" wrote: > Thanks Ayan. RDD may support map better than Dataset/DataFrame. How

Re: pyspark+spacy throwing pickling exception

2018-02-15 Thread Holden Karau
So you left out the exception. On one hand I’m also not sure how well spacy serializes, so to debug this I would start off by moving the nlp = inside of my function and see if it still fails. On Thu, Feb 15, 2018 at 9:08 PM Selvam Raman wrote: > import spacy > > nlp = spacy.load('en') > > > > de
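A rough sketch of the debugging idea above, in plain Python with no Spark or spaCy involved. `FakeModel`, `get_model`, and `tokenize` are hypothetical names, and the lock inside `FakeModel` is just a stand-in for whatever makes a real model hard to serialize. It contrasts an object that fails to pickle when created up front with a loader that builds it lazily inside the function that uses it, so nothing heavy has to be shipped with the function.

```python
import functools
import pickle
import threading


class FakeModel:
    """Hypothetical stand-in for a heavy NLP model; the lock makes it unpicklable."""

    def __init__(self):
        self._lock = threading.Lock()

    def tokens(self, text):
        return text.split()


# Creating the model up front and trying to serialize it fails.
try:
    pickle.dumps(FakeModel())
    eager_model_picklable = True
except Exception:
    eager_model_picklable = False

load_calls = []  # records how many times the model was actually built


@functools.lru_cache(maxsize=1)
def get_model():
    """Build the model on first use in the current process, then reuse it."""
    load_calls.append(1)
    return FakeModel()


def tokenize(text):
    # The model is looked up inside the function instead of being captured
    # from the enclosing scope when the function is defined.
    return get_model().tokens(text)


results = [tokenize(t) for t in ["hello spark", "pickling is hard"]]
print(eager_model_picklable, len(load_calls), results)
```

Whether this resolves a given pickling error depends on what the real exception says, which is why seeing it first still matters.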

FOSDEM mini-office hour?

2018-01-31 Thread Holden Karau
Hi Spark Friends, If any folks are around for FOSDEM this year I was planning on doing a coffee office hour on the last day after my talks. Maybe like 6pm? I'm also going to see if any Beam folks are around and interested :) Cheers, Holden

Re: Spark Tuning Tool

2018-01-22 Thread Holden Karau
That's very interesting, and might also get some interest on the dev@ list if it was open source. On Tue, Jan 23, 2018 at 4:02 PM, Roger Marin wrote: > I'd be very interested. > > On 23 Jan. 2018 4:01 pm, "Rohit Karlupia" wrote: > >> Hi, >> >> I have been working on making the performance tunin

Re: Access to Applications metrics

2017-12-05 Thread Holden Karau
I've done a SparkListener to record metrics for validation (it's a bit out of date). Are you just looking to have graphing/alerting set up on the Spark metrics? On Tue, Dec 5, 2017 at 1:53 PM, Thakrar, Jayesh < jthak...@conversantmedia.com> wrote: > You can also get the metrics from the Spark app

Re: Recommended way to serialize Hadoop Writables' in Spark

2017-12-03 Thread Holden Karau
So is there a reason you want to shuffle Hadoop types rather than the Java types? As for your specific question, for Kryo you also need to register your serializers, did you do that? On Sun, Dec 3, 2017 at 10:02 AM pradeepbaji wrote: > Hi, > > Is there any recommended way of serializing Hadoop

Re: Is Databricks REST API open source ?

2017-12-02 Thread Holden Karau
That API is not open source. There are some other options as separate projects you can check out (like Livy,spark-jobserver, etc). On Sat, Dec 2, 2017 at 8:30 PM kant kodali wrote: > HI All, > > Is REST API (https://docs.databricks.com/api/index.html) open source? > where I can submit spark jobs

Re: NLTK with Spark Streaming

2017-11-26 Thread Holden Karau
So it’s certainly doable (it’s not super easy mind you), but until the arrow udf release goes out it will be rather slow. On Sun, Nov 26, 2017 at 8:01 AM ashish rawat wrote: > Hi, > > Has someone tried running NLTK (python) with Spark Streaming (scala)? I > was wondering if this is a good idea a

What do you pay attention to when validating Spark jobs?

2017-11-21 Thread Holden Karau
Hi Folks, I'm working on updating a talk and I was wondering if any folks in the community wanted to share their best practices for validating your Spark jobs? Are there any counters folks have found useful for monitoring/validating your Spark jobs? Cheers, Holden :)

Re: PySpark 2.2.0, Kafka 0.10 DataFrames

2017-11-20 Thread Holden Karau
What command did you use to launch your Spark application? The https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html#deploying documentation suggests using spark-submit with the `--packages` flag to include the required Kafka package. e.g. ./bin/spark-submit --packages o

Re: Use of Accumulators

2017-11-14 Thread Holden Karau
t toggle it saying there is some change while > processing the data. > > > > Please let me know if we can runtime do this. > > > > > > Thanks! > > *~Kedar Dixit* > > Bigdata Analytics at Persistent Systems Ltd. > > > > *From:* Holden Karau [via

Re: Use of Accumulators

2017-11-13 Thread Holden Karau
So you want to set an accumulator to 1 after a transformation has fully completed? Or what exactly do you want to do? On Mon, Nov 13, 2017 at 9:47 PM vaquar khan wrote: > Confirmed ,you can use Accumulators :) > > Regards, > Vaquar khan > > On Mon, Nov 13, 2017 at 10:58 AM, Kedarnath Dixit < > k
