Re: REST Structured Streaming Sink

2020-07-01 Thread Holden Karau
I think adding something like this (if it doesn't already exist) could help make structured streaming easier to use, foreachBatch is not the best API. On Wed, Jul 1, 2020 at 2:21 PM Jungtaek Lim wrote: > I guess the method, query parameter, header, and the payload would be all > different for al

Re: REST Structured Streaming Sink

2020-07-01 Thread Holden Karau
xplicitly tune. > . foreachWriter is typically used for such use cases, not foreachBatch. > It's also pretty hard to guarantee exactly-once, rate limiting, etc. > > Best, > Burak > > On Wed, Jul 1, 2020 at 5:54 PM Holden Karau wrote: > >> I think adding someth

Re: Scala vs Python for ETL with Spark

2020-10-17 Thread Holden Karau
Scala and Python have their advantages and disadvantages with Spark. In my experience, when performance is super important you’ll end up needing to do some of your work in the JVM, but in many situations what matters more is what your team and company are familiar with and the ecosystem of tooling

Re: CVEs

2021-06-21 Thread Holden Karau
If you get to a point where you find something you think is highly likely a valid vulnerability the best path forward is likely reaching out to private@ to figure out how to do a security release. On Mon, Jun 21, 2021 at 4:42 PM Eric Richardson wrote: > Thanks for the quick reply. Yes, since it

Re: Spark on Kubernetes scheduler variety

2021-06-24 Thread Holden Karau
sclaimer:* Use it at your own risk. Any and all responsibility for >> any loss, damage or destruction of data or any other property which may >> arise from relying on this email's technical content is explicitly >> disclaimed. The author will in no case be liable for any monet

Re: Spark on Kubernetes scheduler variety

2021-07-08 Thread Holden Karau
2021 at 8:56 AM Holden Karau wrote: > That's awesome, I'm just starting to get context around Volcano but maybe > we can schedule an initial meeting for all of us interested in pursuing > this to get on the same page. > > On Wed, Jun 23, 2021 at 6:54 PM Klaus Ma wrote: &g

Drop-In Virtual Office Half-Hour

2021-09-13 Thread Holden Karau
Hi Folks, I'm going to experiment with a drop-in virtual half-hour office hour type thing next Monday, if you've got any burning Spark or general OSS questions you haven't had the time to ask anyone else I hope you'll swing by and join me. If no one comes with questions I'll tour some of the Spark

Re: Drop-In Virtual Office Half-Hour

2021-09-13 Thread Holden Karau
s, > Gourav > > On Tue, Sep 14, 2021 at 12:13 AM Holden Karau > wrote: > >> Hi Folks, >> >> I'm going to experiment with a drop-in virtual half-hour office hour type >> thing next Monday, if you've got any burning Spark or general OSS questions >&g

Re: Drop-In Virtual Office Half-Hour

2021-09-13 Thread Holden Karau
, Sep 13, 2021 at 5:11 PM Holden Karau wrote: > Ah thanks for pointing that out. I changed the visibility on it to public > so it should work now. > > On Mon, Sep 13, 2021 at 4:26 PM Gourav Sengupta > wrote: > >> Hi Holden, >> >> This is such a wonderful op

Re: Drop-In Virtual Office Half-Hour

2021-09-17 Thread Holden Karau
Meet joining info Video call link: https://meet.google.com/ccd-mkbd-gfv On Mon, Sep 13, 2021 at 4:12 PM Holden Karau wrote: > Hi Folks, > > I'm going to experiment with a drop-in virtual half-hour office hour type > thing next Monday, if you've got any burning Spark or general O

Re: Drop-In Virtual Office Half-Hour

2021-09-20 Thread Holden Karau
Hey folks I'm doing my drop-in half-hour now - http://meet.google.com/ccd-mkbd-gfv :) On Mon, Sep 13, 2021 at 4:12 PM Holden Karau wrote: > Hi Folks, > > I'm going to experiment with a drop-in virtual half-hour office hour type > thing next Monday, if you've got any b

Drop-In Virtual Office Hour round 2 :)

2021-09-28 Thread Holden Karau
Hi Folks, I'm going to do another drop-in virtual office hour and I've made a public google calendar to track them so hopefully it's easier for folks to add events https://calendar.google.com/calendar/?cid=cXBubTY3Z2VzcmNjbnEzOWIzb3RyOWI1am9AZ3JvdXAuY2FsZW5kYXIuZ29vZ2xlLmNvbQ or ics feed at https:

Re: Choice of IDE for Spark

2021-10-01 Thread Holden Karau
Personally I like Jupyter notebooks for my interactive work and then once I’ve done my exploration I switch back to emacs with either scala-metals or Python mode. I think the main takeaway is: do what feels best for you, there is no one true way to develop in Spark. On Fri, Oct 1, 2021 at 1:28 AM

Re: Log4j 1.2.17 spark CVE

2021-12-12 Thread Holden Karau
My understanding is it only applies to log4j 2+ so we don’t need to do anything. On Sun, Dec 12, 2021 at 8:46 PM Pralabh Kumar wrote: > Hi developers, users > > Spark is built using log4j 1.2.17 . Is there a plan to upgrade based on > recent CVE detected ? > > > Regards > Pralabh kumar > -- Tw

Re: Spark 3.1.2 full thread dumps

2022-02-04 Thread Holden Karau
We don’t block scaling up after node failure in classic Spark if that’s the question. On Fri, Feb 4, 2022 at 6:30 PM Mich Talebzadeh wrote: > From what I can see in auto scaling setup, you will always need a min of > two worker nodes as primary. It also states and I quote "Scaling primary > work

Re: Unable to access Google buckets using spark-submit

2022-02-12 Thread Holden Karau
You can also put the GS access jar with your Spark jars — that’s what the class not found exception is pointing you towards. On Fri, Feb 11, 2022 at 11:58 PM Mich Talebzadeh wrote: > BTW I also answered you in in stackoverflow : > > > https://stackoverflow.com/questions/71088934/unable-to-access

Re: Reverse proxy for Spark UI on Kubernetes

2022-05-16 Thread Holden Karau
Oh that’s rad 😊 On Tue, May 17, 2022 at 7:47 AM bo yang wrote: > Hi Spark Folks, > > I built a web reverse proxy to access Spark UI on Kubernetes (working > together with https://github.com/GoogleCloudPlatform/spark-on-k8s-operator). > Want to share here in case other people have similar need. >

Re: Reverse proxy for Spark UI on Kubernetes

2022-05-17 Thread Holden Karau
Could we make it do the same sort of history server fallback approach? On Tue, May 17, 2022 at 10:41 PM bo yang wrote: > It is like Web Application Proxy in YARN ( > https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/WebApplicationProxy.html), > to provide easy access for Spark U

Re: Jupyter notebook on Dataproc versus GKE

2022-09-05 Thread Holden Karau
I’ve run Jupyter w/Spark on K8s, haven’t tried it with Dataproc personally. The Spark K8s pod scheduler is now more pluggable, so Yunikorn and Volcano can be used with less effort. On Mon, Sep 5, 2022 at 7:44 AM Mich Talebzadeh wrote: > > Hi, > > > Has anyone got experience of running Jupyter o

Re: Jupyter notebook on Dataproc versus GKE

2022-09-05 Thread Holden Karau
ion of data or any other property which may arise > from relying on this email's technical content is explicitly disclaimed. > The author will in no case be liable for any monetary damages arising from > such loss, damage or destruction. > > > > > On Mon, 5 Sept 2022 at 1

Re: Jupyter notebook on Dataproc versus GKE

2022-09-06 Thread Holden Karau
rise from relying on this email's technical content is explicitly >>> disclaimed. The author will in no case be liable for any monetary damages >>> arising from such loss, damage or destruction. >>> >>> >>> >>> >>> On Mon, 5 Sept 2022 at 20

Re: Dynamic Scaling without Kubernetes

2022-10-26 Thread Holden Karau
So Spark can dynamically scale on YARN, but standalone mode becomes a bit complicated — where do you envision Spark gets the extra resources from? On Wed, Oct 26, 2022 at 12:18 PM Artemis User wrote: > Has anyone tried to make a Spark cluster dynamically scalable, i.e., > adding a new worker nod

Re: Dataproc serverless for Spark

2022-11-28 Thread Holden Karau
This sounds like a great question for the Google DataProc folks (I know there was some interesting work being done around it but I left before it was finished so I don't want to provide a possibly incorrect answer). If you're a GCP customer, try reaching out to their support for details. On Mon, Nov

Re: [PySpark] Reader/Writer for bgzipped data

2022-12-06 Thread Holden Karau
There is the splittable gzip Hadoop input format, maybe someone could extend that to support bgzip? On Tue, Dec 6, 2022 at 1:43 PM Oliver Ruebenacker < oliv...@broadinstitute.org> wrote: > > Hello Chris, > > Yes, you can use gunzip/gzip to uncompress a file created by bgzip, but > to s

Re: [PySpark] Reader/Writer for bgzipped data

2022-12-06 Thread Holden Karau
> > On Tue, Dec 6, 2022 at 9:22 AM Holden Karau wrote: > >> There is the splittable gzip Hadoop input format, maybe someone could >> extend that to use support bgzip? >> >> On Tue, Dec 6, 2022 at 1:43 PM Oliver Ruebenacker < >> oliv...@broadinstitute.org>

Re: SPIP: Shutting down spark structured streaming when the streaming process completed current process

2023-02-18 Thread Holden Karau
Is there someone focused on streaming work these days who would want to shepherd this? On Sat, Feb 18, 2023 at 5:02 PM Dongjoon Hyun wrote: > Thank you for considering me, but may I ask what makes you think to put me > there, Mich? I'm curious about your reason. > > > I have put dongjoon.hyun as

Re: Dynamic allocation does not deallocate executors

2023-08-07 Thread Holden Karau
I think you need to set "spark.dynamicAllocation.shuffleTracking.enabled" to false. On Mon, Aug 7, 2023 at 2:50 AM Mich Talebzadeh wrote: > Yes I have seen cases where the driver is gone but a couple of executors are > hanging on. Sounds like a code issue. > > HTH > > Mich Talebzadeh, > Solutions
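A minimal sketch of the config being discussed, assuming dynamic allocation is already on (the app name and timeout value are placeholders, not from the thread):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("dynamic-allocation-example")                          // hypothetical app name
      .config("spark.dynamicAllocation.enabled", "true")
      .config("spark.dynamicAllocation.shuffleTracking.enabled", "false") // let idle executors be released
      .config("spark.dynamicAllocation.executorIdleTimeout", "60s")       // example timeout, tune as needed
      .getOrCreate()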

Re: Dynamic allocation does not deallocate executors

2023-08-08 Thread Holden Karau
for any > loss, damage or destruction of data or any other property which may arise > from relying on this email's technical content is explicitly disclaimed. > The author will in no case be liable for any monetary damages arising from > such lo

Re: Elasticsearch support for Spark 3.x

2023-08-27 Thread Holden Karau
What’s the version of the ES connector you are using? On Sat, Aug 26, 2023 at 10:17 AM Dipayan Dev wrote: > Hi All, > > We're using Spark 2.4.x to write dataframe into the Elasticsearch index. > As we're upgrading to Spark 3.3.0, it throwing out error > Caused by: java.lang.ClassNotFoundExceptio

Re: Write Spark Connection client application in Go

2023-09-12 Thread Holden Karau
That’s so cool! Great work y’all :) On Tue, Sep 12, 2023 at 8:14 PM bo yang wrote: > Hi Spark Friends, > > Anyone interested in using Golang to write Spark application? We created a > Spark > Connect Go Client library . > Would love to hear feedback/t

Re: Classpath isolation per SparkSession without Spark Connect

2023-11-27 Thread Holden Karau
So I don’t think we make any particular guarantees around class path isolation there, so even if it does work it’s something you’d need to pay attention to on upgrades. Class path isolation is tricky to get right. On Mon, Nov 27, 2023 at 2:58 PM Faiz Halde wrote: > Hello, > > We are using spark

Re: Spark-Connect: Param `--packages` does not take effect for executors.

2023-12-04 Thread Holden Karau
So I think this sounds like a bug to me, in the help options for both regular spark-submit and ./sbin/start-connect-server.sh we say: " --packages Comma-separated list of maven coordinates of jars to include on the driver and executor classpaths. Will

Re: Introducing Comet, a plugin to accelerate Spark execution via DataFusion and Arrow

2024-02-13 Thread Holden Karau
This looks really cool :) Out of interest what are the differences in the approach between this and Gluten? On Tue, Feb 13, 2024 at 12:42 PM Chao Sun wrote: > Hi all, > > We are very happy to announce that Project Comet, a plugin to > accelerate Spark query execution via leveraging DataFusion a

Re: Spark 4.0 Query Analyzer Bug Report

2024-02-20 Thread Holden Karau
Do you mean Spark 3.4? 4.0 is very much not released yet. Also it would help if you could share your query & more of the logs leading up to the error. On Tue, Feb 20, 2024 at 3:07 PM Sharma, Anup wrote: > Hi Spark team, > > > > We ran into a dataframe issue after upgrading from spark 3.1 to 4.

Re: issue forwarding SPARK_CONF_DIR to start workers

2024-07-20 Thread Holden Karau
This might be a good discussion for the dev@ list, I don’t know much about SLURM deployments personally. Twitter: https://twitter.com/holdenkarau Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9 YouTube Live Streams: https://www.youtube.com/user

Re: Help needed with Py4J

2015-05-20 Thread Holden Karau
Are your jars included in both the driver and worker class paths? On Wednesday, May 20, 2015, Addanki, Santosh Kumar < santosh.kumar.adda...@sap.com> wrote: > Hi Colleagues > > > > We need to call a Scala Class from pySpark in Ipython notebook. > > > > We tried something like below : > > > > fro

Re: Help needed with Py4J

2015-05-20 Thread Holden Karau
from the same JAR > > Regards > Santosh > > > On May 20, 2015, at 7:26 PM, Holden Karau wrote: > > Are your jars included in both the driver and worker class paths? > > On Wednesday, May 20, 2015, Addanki, Santosh Kumar < > santosh.kumar.adda...@sap.com> wrote: >

Re: Query a Dataframe in rdd.map()

2015-05-21 Thread Holden Karau
So DataFrames, like RDDs, can only be accessed from the driver. If your IP frequency table is small enough you could collect it and distribute it as a hashmap with broadcast, or you could also join your RDD with the IP frequency table. Hope that helps :) On Thursday, May 21, 2015, ping yan wrote:
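A rough sketch of the two options, assuming a small lookup DataFrame ipFreq(ip, freq) and an RDD[String] of IPs (both names are made up for illustration):

    // Option 1: collect the small table on the driver and broadcast it as a map.
    val freqMap = ipFreq.collect().map(r => r.getString(0) -> r.getLong(1)).toMap
    val freqBroadcast = sc.broadcast(freqMap)
    val enriched = ipRdd.map(ip => (ip, freqBroadcast.value.getOrElse(ip, 0L)))

    // Option 2: if the table is too big to collect, join the two instead.
    // ipRdd.map(ip => (ip, ())).join(ipFreq.rdd.map(r => (r.getString(0), r.getLong(1))))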

Re: Compute Median in Spark Dataframe

2015-06-02 Thread Holden Karau
Not super easily, the GroupedData class uses a strToExpr function which has a pretty limited set of functions, so we can't pass in the name of an arbitrary Hive UDAF (unless I'm missing something). We can instead construct a column with the expression you want and then pass it in to agg() that way (
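A sketch of the "construct a column and pass it to agg()" approach, assuming a Hive-enabled context so the Hive UDAF resolves; percentile_approx stands in for a median and the column/table names are made up:

    import org.apache.spark.sql.functions.expr

    val medians = df.groupBy("group_col")
      .agg(expr("percentile_approx(value, 0.5)").as("approx_median"))  // arbitrary Hive UDAF via expr()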

Re: Compute Median in Spark Dataframe

2015-06-02 Thread Holden Karau
lection.ReflectionEngine.getMethod(ReflectionEngine.java:333) > > Any idea ? > > Olivier. > Le mar. 2 juin 2015 à 18:02, Holden Karau > a écrit : > >> Not super easily, the GroupedData class uses a strToExpr function which >> has a pretty limited set of functions so we can

Re: Standard Scaler taking 1.5hrs

2015-06-04 Thread Holden Karau
take(5) will only evaluate enough partitions to provide 5 elements (sometimes a few more but you get the idea), so it won't trigger a full evaluation of all partitions unlike count(). On Thursday, June 4, 2015, Piero Cinquegrana wrote: > Hi DB, > > > > Yes I am running count() operations on the

Re: Compute Median in Spark Dataframe

2015-06-04 Thread Holden Karau
his but it does terrible things to access Spark internals. > I also need to call a Hive UDAF in a dataframe agg function. Are there any > examples of what Column expects? > > Deenar > > On 2 June 2015 at 21:13, Holden Karau wrote: > >> So for column you need to pass in a

Re: map V mapPartitions

2015-06-23 Thread Holden Karau
I think one of the primary cases where mapPartitions is useful if you are going to be doing any setup work that can be re-used between processing each element, this way the setup work only needs to be done once per partition (for example creating an instance of jodatime). Both map and mapPartition
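A small sketch of the setup-once-per-partition pattern, assuming joda-time is on the classpath and an RDD[String] of date strings (names made up):

    val parsed = linesRdd.mapPartitions { iter =>
      // Expensive setup happens once per partition, not once per element.
      val fmt = org.joda.time.format.DateTimeFormat.forPattern("yyyy-MM-dd")
      iter.map(line => fmt.parseDateTime(line))
    }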

Re: DataFrame Filter Inside Another Data Frame Map

2015-07-01 Thread Holden Karau
Collecting it as a regular (Java/Scala/Python) map. You can also broadcast the map if you're going to use it multiple times. On Wednesday, July 1, 2015, Ashish Soni wrote: > Thanks , So if i load some static data from database and then i need to > use than in my map function to filter records what

Re: Unit tests of spark application

2015-07-10 Thread Holden Karau
Somewhat biased of course, but you can also use spark-testing-base from spark-packages.org as a basis for your unittests. On Fri, Jul 10, 2015 at 12:03 PM, Daniel Siegmann < daniel.siegm...@teamaol.com> wrote: > On Fri, Jul 10, 2015 at 1:41 PM, Naveen Madhire > wrote: > >> I want to write junit
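A minimal sketch of a unit test built on spark-testing-base (com.holdenkarau:spark-testing-base); the suite name and data are made up, and the exact FunSuite import depends on your ScalaTest version:

    import com.holdenkarau.spark.testing.SharedSparkContext
    import org.scalatest.FunSuite

    class WordCountSpec extends FunSuite with SharedSparkContext {
      test("counts words") {
        // SharedSparkContext provides `sc`, shared across tests in the suite.
        val counts = sc.parallelize(Seq("a", "b", "a"))
          .map(w => (w, 1))
          .reduceByKey(_ + _)
          .collectAsMap()
        assert(counts("a") === 2)
      }
    }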

Re: SFTP Compressed CSV into Dataframe

2016-03-02 Thread Holden Karau
So doing a quick look through the README & code for spark-sftp it seems that the way this connector works is by downloading the file locally on the driver program and this is not configurable - so you would probably need to find a different connector (and you probably shouldn't use spark-sftp for l

Re: Spark reduce serialization question

2016-03-06 Thread Holden Karau
You might want to try treeAggregate On Sunday, March 6, 2016, Takeshi Yamamuro wrote: > Hi, > > I'm not exactly sure what's your codes like though, ISTM this is a correct > behaviour. > If the size of data that a driver fetches exceeds the limit, the driver > throws this exception. > (See > http
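A quick sketch of treeAggregate, which merges partial results in a tree so the driver only receives a few already-combined values instead of one per partition (the RDD name is made up):

    val total = numbersRdd.treeAggregate(0.0)(
      (acc, x) => acc + x,   // fold an element into a partition-local accumulator
      (a, b) => a + b,       // merge two accumulators
      depth = 2)             // tree depth; the default is 2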

Re: Saving Spark generated table into underlying Hive table using Functional programming

2016-03-07 Thread Holden Karau
So what about if you just start with a hive context, and create your DF using the HiveContext? On Monday, March 7, 2016, Mich Talebzadeh wrote: > Hi, > > I have done this Spark-shell and Hive itself so it works. > > I am exploring whether I can do it programmatically. The problem I > encounter w
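A sketch of that suggestion against the Spark 1.x API, assuming a Hive-enabled build; the path and table name are placeholders:

    // Build the DataFrame with a HiveContext so saveAsTable writes to the Hive metastore.
    val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
    val df = hiveContext.read.json("/tmp/events.json")
    df.write.mode("append").saveAsTable("reporting.events")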

Re: Partitioning to speed up processing?

2016-03-10 Thread Holden Karau
Are they entire data set aggregates or is there some grouping applied? On Thursday, March 10, 2016, Gerhard Fiedler wrote: > I have a number of queries that result in a sequence Filter > Project > > Aggregate. I wonder whether partitioning the input table makes sense. > > > > Does Aggregate bene

Re: Python unit tests - Unable to ru it with Python 2.6 or 2.7

2016-03-11 Thread Holden Karau
So the run-tests command allows you to specify the Python version to test against - maybe specify python2.7 On Friday, March 11, 2016, Gayathri Murali wrote: > I do have 2.7 installed and unittest2 package available. I still see this > error : > > Please install unittest2 to test with Python 2.6 o

Re: Reading Back a Cached RDD

2016-03-24 Thread Holden Karau
Even checkpoint() is maybe not exactly what you want, since if reference tracking is turned on it will get cleaned up once the original RDD is out of scope and GC is triggered. If you want to share persisted RDDs right now one way to do this is sharing the same spark context (using something like t

Re: python support of mapWithState

2016-03-24 Thread Holden Karau
In general the Python API lags behind the Scala & Java APIs. The Scala & Java APIs tend to be easier to keep in sync since they are both in the JVM and a bit more work is needed to expose the same functionality from the JVM in Python (or re-implement the Scala code in Python where appropriate). On

Re: since spark can not parallelize/serialize functions, how to distribute algorithms on the same data?

2016-03-28 Thread Holden Karau
You probably want to look at the map transformation, and the many more defined on RDDs. The function you pass in to map is serialized and the computation is distributed. On Monday, March 28, 2016, charles li wrote: > > use case: have a dataset, and want to use different algorithms on that, > and

Re: Scala: Perform Unit Testing in spark

2016-04-01 Thread Holden Karau
You can also look at spark-testing-base, which works with both ScalaTest and JUnit, and see if that works for your use case. On Friday, April 1, 2016, Ted Yu wrote: > Assuming your code is written in Scala, I would suggest using ScalaTest. > > Please take a look at the XXSuite.scala files under mlli

Re: Switch RDD-based MLlib APIs to maintenance mode in Spark 2.0

2016-04-05 Thread Holden Karau
I'm very much in favor of this, the less porting work there is the better :) On Tue, Apr 5, 2016 at 5:32 PM, Joseph Bradley wrote: > +1 By the way, the JIRA for tracking (Scala) API parity is: > https://issues.apache.org/jira/browse/SPARK-4591 > > On Tue, Apr 5, 2016 at 4:58 PM, Matei Zaharia

Re: About nested RDD

2016-04-08 Thread Holden Karau
It seems like the union function on RDDs might be what you are looking for, or was there something else you were trying to achieve? On Thursday, April 7, 2016, Tenghuan He wrote: > Hi all, > > I know that nested RDDs are not possible, like rdd1.map(x => x + > rdd2.count()) > I tried to crea
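A small sketch of the two workarounds for the nested-RDD example from the thread (rdd1 and rdd2 are the poster's names):

    // Compute the "inner" value up front on the driver, then use it in the outer transformation.
    val n = rdd2.count()
    val shifted = rdd1.map(x => x + n)

    // Or, if the real goal is to combine the two datasets, use union instead of nesting.
    val combined = rdd1.union(rdd2)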

Re: JSON Usage

2016-04-14 Thread Holden Karau
You could certainly use RDDs for that, you might also find using Dataset selecting the fields you need to construct the URL to fetch and then using the map function to be easier. On Thu, Apr 14, 2016 at 12:01 PM, Benjamin Kim wrote: > I was wonder what would be the best way to use JSON in Spark/

Re: error "Py4JJavaError: An error occurred while calling z:org.apache.spark.sql.execution.EvaluatePython.takeAndServe."

2016-04-14 Thread Holden Karau
The org.apache.spark.sql.execution.EvaluatePython.takeAndServe exception can happen in a lot of places; it might be easier to figure out if you have a code snippet you can share where this is occurring? On Wed, Apr 13, 2016 at 2:27 PM, AlexModestov wrote: > I get this error. > Who knows what does

Re: how to write pyspark interface to scala code?

2016-04-14 Thread Holden Karau
It's a bit tricky - if the user's data is represented in a DataFrame or Dataset then it's much easier. Assuming that the function is going to be called from the driver program (e.g. not inside of a transformation or action) then you can use the Py4J context to make the calls. You might find looking at

Re: Calling Python code from Scala

2016-04-18 Thread Holden Karau
So if there are just a few Python functions you're interested in accessing, you can also use the pipe interface (you'll have to manually serialize your data on both ends in ways that Python and Scala can respectively parse) - but it's a very generic approach and can work with many different languages.
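A sketch of the pipe approach: pipe() forks the command once per partition and streams records as lines of text over stdin/stdout, so both sides need to agree on the serialization ("score.py" is a made-up script):

    val scored = inputRdd
      .map(_.toString)              // serialize: one line per record on the script's stdin
      .pipe("python score.py")      // the script writes one result line per input line to stdout
      .map(_.toDouble)              // parse the results back on the Scala side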

Re: Confused - returning RDDs from functions

2016-05-12 Thread Holden Karau
This is not the expected behavior, can you maybe post the code where you are running into this? On Thursday, May 12, 2016, Dood@ODDO wrote: > Hello all, > > I have been programming for years but this has me baffled. > > I have an RDD[(String,Int)] that I return from a function after extensive >

Re: ImportError: No module named numpy

2016-06-01 Thread Holden Karau
Generally this means numpy isn't installed on the system or your PYTHONPATH has somehow gotten pointed somewhere odd. On Wed, Jun 1, 2016 at 8:31 AM, Bhupendra Mishra wrote: > If any one please can help me with following error. > > File > "/opt/mapr/spark/spark-1.6.1/python/lib/pyspark.zip/pysp

Re: --driver-cores for Standalone and YARN only?! What about Mesos?

2016-06-02 Thread Holden Karau
Also seems like this might be better suited for dev@ On Thursday, June 2, 2016, Sun Rui wrote: > yes, I think you can fire a JIRA issue for this. > But why removing the default value. Seems the default core is 1 according > to > https://github.com/apache/spark/blob/master/core/src/main/scala/org

Re: JIRA SPARK-2984

2016-06-09 Thread Holden Karau
I think your error could possibly be different - looking at the original JIRA the issue was happening on HDFS and you seem to be experiencing the issue on s3n, and while I don't have full view of the problem I could see this being s3 specific (read-after-write on s3 is trickier than read-after-writ

Re: JIRA SPARK-2984

2016-06-09 Thread Holden Karau
I'd do some searching and see if there is a JIRA related to this problem on s3 and if you don't find one go ahead and make one. Even if it is an intrinsic problem with s3 (and I'm not super sure since I'm just reading this on mobile) - it would maybe be a good thing for us to document. On Thursday

Re: Spark Installation to work on Spark Streaming and MLlib

2016-06-10 Thread Holden Karau
Hi Ram, Not super certain what you are looking to do. Are you looking to add a new algorithm to Spark MLlib for streaming or use Spark MLlib on streaming data? Cheers, Holden On Friday, June 10, 2016, Ram Krishna wrote: > Hi All, > > I am new to this field, I want to implement new ML alg

Re: Spark Installation to work on Spark Streaming and MLlib

2016-06-10 Thread Holden Karau
So that's a bit complicated - you might want to start with reading the code for the existing algorithms and go from there. If your goal is to contribute the algorithm to Spark you should probably take a look at the JIRA as well as the contributing to Spark guide on the wiki. Also we have a separate

Re: Creating a python port for a Scala Spark Projeect

2016-06-22 Thread Holden Karau
PySpark RDDs are (on the Java side) are essentially RDD of pickled objects and mostly (but not entirely) opaque to the JVM. It is possible (by using some internals) to pass a PySpark DataFrame to a Scala library (you may or may not find the talk I gave at Spark Summit useful https://www.youtube.com

Re: Call Scala API from PySpark

2016-06-30 Thread Holden Karau
So I'm a little biased - I think the best bridge between the two is using DataFrames. I've got some examples in my talk and on the high performance spark GitHub https://github.com/high-performance-spark/high-performance-spark-examples/blob/master/high_performance_pyspark/simple_perf_test.py calls som

Working of Streaming Kmeans

2016-07-05 Thread Holden Karau
Hi Biplob, The current Streaming KMeans code only updates data which comes in through training (e.g. trainOn), predictOn does not update the model. Cheers, Holden :) P.S. Traffic on the list might be have been bit slower right now because of Canada Day and 4th of July weekend respectively. On
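A short sketch of the behaviour described above: only trainOn updates the streaming model, predictOn just scores. The trainingStream and testStream inputs are assumed DStream[Vector]s, and the parameter values are placeholders:

    import org.apache.spark.mllib.clustering.StreamingKMeans

    val model = new StreamingKMeans()
      .setK(3)
      .setDecayFactor(1.0)
      .setRandomCenters(dim = 2, weight = 0.0)

    model.trainOn(trainingStream)                 // updates cluster centers as batches arrive
    val predictions = model.predictOn(testStream) // scores only, the model is not updated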

Re: Bootstrap Action to Install Spark 2.0 on EMR?

2016-07-05 Thread Holden Karau
Just to be clear Spark 2.0 isn't released yet, there is a preview version for developers to explore and test compatibility with. That being said Roy Hasson has a blog post discussing using Spark 2.0-preview with EMR - https://medium.com/@royhasson/running-spark-2-0-preview-on-emr-635081e01341#.r1tz

Re: pyspark 1.5 0 save model ?

2016-07-18 Thread Holden Karau
If you used RandomForestClassifier from mllib you can use the save method described in http://spark.apache.org/docs/1.5.0/api/python/pyspark.mllib.html#module-pyspark.mllib.classification which will write out some JSON metadata as well as parquet for the actual model. For the newer ml pipeline one

Re: which one spark ml or spark mllib

2016-07-19 Thread Holden Karau
So Spark ML is going to be the actively developed machine learning library going forward; however, back in Spark 1.5 it was still relatively new and an experimental component, so not all of the save/load support was implemented for the same models. That being said, for 2.0 ML doesn't have PMML export

Re: spark single PROCESS_LOCAL task

2016-07-19 Thread Holden Karau
So it's possible that you have a lot of data in one of the partitions which is local to that process; maybe you could cache & count the upstream RDD and see what the input partitions look like? On the other hand - using groupByKey is often a bad sign to begin with - can you rewrite your code to avoid
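A one-line sketch of the usual groupByKey rewrite (pairsRdd is a made-up RDD[(K, Int)]): reduceByKey combines values map-side, so a single hot key doesn't force all of its values into one task the way groupByKey does.

    val counts = pairsRdd.reduceByKey(_ + _)
    // roughly equivalent to, but much cheaper than:
    // pairsRdd.groupByKey().mapValues(_.sum)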

Re: Spark 7736

2016-07-19 Thread Holden Karau
Indeed there is, sign up for an Apache JIRA account, then when you visit the JIRA page logged in you should see a "reopen issue" button. For issues like this (reopening a JIRA) - you might find the dev list to be more useful. On Wed, Jul 13, 2016 at 4:47 AM, ayan guha wrote: > Hi > > I am fac

Re: Should it be safe to embed Spark in Local Mode?

2016-07-19 Thread Holden Karau
That's interesting and might be better suited to the dev list. I know in some cases System.exit(-1) calls were added so the task would be marked as a failure. On Tuesday, July 19, 2016, Brett Randall wrote: > This question is regarding > https://issues.apache.org/jira/browse/SPARK-15685 (StackOverflo

Re: Spark Beginner Question

2016-07-26 Thread Holden Karau
So you will need to convert your input DataFrame into something with vectors and labels to train on - the Spark ML documentation has examples http://spark.apache.org/docs/latest/ml-guide.html (although the website seems to be having some issues mid update to Spark 2.0 so if you want to read it righ

Re: Plans for improved Spark DataFrame/Dataset unit testing?

2016-08-01 Thread Holden Karau
That's a good point - there is an open issue for spark-testing-base to support this shared SparkSession approach - but I haven't had the time ( https://github.com/holdenk/spark-testing-base/issues/123 ). I'll try and include this in the next release :) On Mon, Aug 1, 2016 at 9:22 AM, Koert Kuipers

Re: Spark 2.0.0 - Broadcast variable - What is ClassTag?

2016-08-07 Thread Holden Karau
Classtag is Scala concept (see http://docs.scala-lang.org/overviews/reflection/typetags-manifests.html) - although this should not be explicitly required - looking at http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.SparkContext we can see that in Scala the classtag tag is
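A small sketch of the point above: in Scala the ClassTag is filled in implicitly, while from Java the JavaSparkContext supplies it for you (the map contents here are placeholders):

    // Scala: no explicit ClassTag needed, the compiler provides it.
    val lookup = spark.sparkContext.broadcast(Map("a" -> 1, "b" -> 2))

    // Java side (shown as a comment; `javaMap` is hypothetical):
    //   Broadcast<Map<String, Integer>> b =
    //       new JavaSparkContext(spark.sparkContext()).broadcast(javaMap);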

Re: Spark 2.0.0 - Broadcast variable - What is ClassTag?

2016-08-08 Thread Holden Karau
tag > > sparkSession.sparkContext().broadcast > > On Mon, Aug 8, 2016 at 12:09 PM, Holden Karau > wrote: > >> Classtag is Scala concept (see http://docs.scala-lang.or >> g/overviews/reflection/typetags-manifests.html) - although this should >> not be explicitly required - look

Re: Spark2 SBT Assembly

2016-08-10 Thread Holden Karau
What are you looking to use the assembly jar for - maybe we can think of a workaround :) On Wednesday, August 10, 2016, Efe Selcuk wrote: > Sorry, I should have specified that I'm specifically looking for that fat > assembly behavior. Is it no longer possible? > > On Wed, Aug 10, 2016 at 10:46 A

Re: groupByKey() compile error after upgrading from 1.6.2 to 2.0.0

2016-08-10 Thread Holden Karau
So it looks like (despite the name) pair_rdd is actually a Dataset - my guess is you might have a map on a dataset up above which used to return an RDD but now returns another dataset or an unexpected implicit conversion. Just add rdd() before the groupByKey call to push it into an RDD. That being
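A sketch of the fix, assuming pair_rdd is really a Dataset[(K, V)] and spark.implicits._ is in scope; the Dataset-native alternative is shown as well:

    val grouped = pairDs.rdd.groupByKey()       // drop back to an RDD for the RDD-style groupByKey
    val typedGrouped = pairDs.groupByKey(_._1)  // or use the Dataset's own groupByKey(keyFunc) in 2.0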

Re: Is there a reduceByKey functionality in DataFrame API?

2016-08-10 Thread Holden Karau
Hi Luis, You might want to consider upgrading to Spark 2.0 - but in Spark 1.6.2 you can do groupBy followed by a reduce on the GroupedDataset ( http://spark.apache.org/docs/1.6.2/api/scala/index.html#org.apache.spark.sql.GroupedDataset ) - this works on a per-key basis despite the different name.
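A sketch against the 1.6.x Dataset API, where a typed groupBy returns a GroupedDataset with a per-key reduce (the Record case class and values are made up):

    case class Record(key: String, value: Long)
    import sqlContext.implicits._

    val ds = sc.parallelize(Seq(Record("a", 1L), Record("a", 2L), Record("b", 5L))).toDS()
    // reduce is applied per key, much like reduceByKey on a pair RDD.
    val reduced = ds.groupBy(_.key).reduce((a, b) => Record(a.key, a.value + b.value))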

Re: Getting Co-oefficients of a logistic regression model for a pipelinemodel Spark ML library

2016-01-21 Thread Holden Karau
Hi Vinayaka, You can access the different stages in your pipeline through the stages array on your pipeline model ( http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.PipelineModel ) and then cast it to the correct stage (if working in Scala or if in Python just access the
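A small sketch of that, assuming the logistic regression happens to be the last stage of the fitted pipeline (adjust the index for your pipeline):

    import org.apache.spark.ml.classification.LogisticRegressionModel

    val lr = pipelineModel.stages.last.asInstanceOf[LogisticRegressionModel]
    println(lr.coefficients)  // the fitted coefficients
    println(lr.intercept)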

Re: TaskCommitDenied (Driver denied task commit)

2016-01-21 Thread Holden Karau
Can you post more of your log? How big are the partitions? What is the action you are performing? On Thu, Jan 21, 2016 at 2:02 PM, Arun Luthra wrote: > Example warning: > > 16/01/21 21:57:57 WARN TaskSetManager: Lost task 2168.0 in stage 1.0 (TID > 4436, XXX): TaskCommitDenied (Driver denied

Re: TaskCommitDenied (Driver denied task commit)

2016-01-21 Thread Holden Karau
tor.extraJavaOptions=-verbose:gc > -XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \ > > my.jar > > > There are 2262 input files totaling just 98.6G. The DAG is basically > textFile().map().filter().groupByKey().saveAsTextFile(). > > On Thu, Jan 21, 2016 at 2:14 PM, H

Re: TaskCommitDenied (Driver denied task commit)

2016-01-21 Thread Holden Karau
"--.MyRegistrator") > .set("spark.kryo.registrationRequired", "true") > .set("spark.yarn.executor.memoryOverhead","600") > > On Thu, Jan 21, 2016 at 2:50 PM, Josh Rosen > wrote: > >> Is speculation enabled? T

Re: local class incompatible: stream classdesc serialVersionUID

2016-02-01 Thread Holden Karau
So I'm a little confused as to exactly how this might have happened - but one quick guess is that maybe you've built an assembly jar with Spark core; can you mark it as provided and/or post your build file? On Fri, Jan 29, 2016 at 7:35 AM, Ted Yu wrote: > I logged SPARK-13084 > > For the moment,

Re: local class incompatible: stream classdesc serialVersionUID

2016-02-01 Thread Holden Karau
on, Feb 1, 2016 at 2:08 PM, Holden Karau wrote: > >> So I'm a little confused to exactly how this might have happened - but >> one quick guess is that maybe you've built an assembly jar with Spark core, >> can you mark it is a provided and or post your build file?

Re: Using accumulator to push custom logs to driver

2016-02-01 Thread Holden Karau
I wouldn't use accumulators for things which could get large; they can become kind of a bottleneck. Do you have a lot of string messages you want to bring back or only a few? On Mon, Feb 1, 2016 at 3:24 PM, Utkarsh Sengar wrote: > I am trying to debug code executed in executors by logging. Even

Re: Using accumulator to push custom logs to driver

2016-02-01 Thread Holden Karau
> info about the dataset etc. > I would assume the strings will vary from 100-200lines max, that would be > about 50-100KB if they are really long lines. > > -Utkarsh > > On Mon, Feb 1, 2016 at 3:40 PM, Holden Karau wrote: > >> I wouldn't use accumulators f

Re: Unit test with sqlContext

2016-02-04 Thread Holden Karau
Thanks for recommending spark-testing-base :) Just wanted to add if anyone has feature requests for Spark testing please get in touch (or add an issue on the github) :) On Thu, Feb 4, 2016 at 8:25 PM, Silvio Fiorito < silvio.fior...@granturing.com> wrote: > Hi Steve, > > Have you looked at the s

Re: install databricks csv package for spark

2016-02-19 Thread Holden Karau
So with --packages to spark-shell and spark-submit Spark will automatically fetch the requirements from maven. If you want to use an explicit local jar you can do that with the --jars syntax. You might find http://spark.apache.org/docs/latest/submitting-applications.html useful. On Fri, Feb 19, 20

Re: Spark stream job is take up /TMP with 100%

2016-02-19 Thread Holden Karau
That's a good question, you can find most of what you are looking for in the configuration guide at http://spark.apache.org/docs/latest/configuration.html - you probably want to change the spark.local.dir to point to your scratch directory. Out of interest what problems have you been seeing with YAR

Re: Submitting Jobs Programmatically

2016-02-19 Thread Holden Karau
How are you trying to launch your application? Do you have the Spark jars on your class path? On Friday, February 19, 2016, Arko Provo Mukherjee < arkoprovomukher...@gmail.com> wrote: > Hello, > > I am trying to submit a spark job via a program. > > When I run it, I receive the following error: >
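A sketch of one common way to launch a job from another program, using the SparkLauncher API that ships with Spark 1.6+; the jar path, main class and master below are placeholders:

    import org.apache.spark.launcher.SparkLauncher

    val handle = new SparkLauncher()
      .setAppResource("/path/to/my-spark-app.jar")   // hypothetical application jar
      .setMainClass("com.example.MySparkJob")        // hypothetical main class
      .setMaster("local[*]")
      .startApplication()                            // returns a SparkAppHandle to monitor the job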

SF Spark Office Hours Experiment - Friday Afternoon

2015-10-20 Thread Holden Karau
Hi SF based folks, I'm going to try doing some simple office hours this Friday afternoon outside of Paramo Coffee. If no one comes by I'll just be drinking coffee hacking on some Spark PRs so if you just want to hangout and hack on Spark as a group come by too. (See https://twitter.com/holdenkarau

Re: Spark-Testing-Base Q/A

2015-10-21 Thread Holden Karau
On Wednesday, October 21, 2015, Mark Vervuurt wrote: > Hi Everyone, > > I am busy trying out ‘Spark-Testing-Base > ’. I have the following > questions? > > >- Can you test Spark Streaming Jobs using Java? > > The current base class for testing st

Re: SF Spark Office Hours Experiment - Friday Afternoon

2015-10-21 Thread Holden Karau
> -- > Jacek Laskowski | http://blog.japila.pl | http://blog.jaceklaskowski.pl > Follow me at https://twitter.com/jaceklaskowski > Upvote at http://stackoverflow.com/users/1305344/jacek-laskowski > > > On Wed, Oct 21, 2015 at 12:55 AM, Holden Karau > wrote: > > Hi SF based f

Re: Spark-Testing-Base Q/A

2015-10-21 Thread Holden Karau
ython. > Sounds reasonable, I'll add it this week. > > If i am not wrong it’s 4:00 AM for you in California ;) > > Yup, I'm not great a regular schedules but I make up for it by doing stuff when I've had too much coffee to sleep :p > Regards, > Mark > >

Re: Spark with business rules

2015-10-26 Thread Holden Karau
Spark SQL seems like it might be the best interface if your users are already familiar with SQL. On Mon, Oct 26, 2015 at 3:12 PM, danilo wrote: > Hi All, I want to create a monitoring tool using my sensor data. I receive > the events every second and I need to create a report using node.js. Rig
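A sketch of expressing a monitoring rule as SQL (written against the Spark 2.x API for brevity; the table, columns and threshold are made up):

    sensorDF.createOrReplaceTempView("sensor_events")
    val alerts = spark.sql(
      """SELECT device_id, avg(temperature) AS avg_temp
        |FROM sensor_events
        |GROUP BY device_id
        |HAVING avg(temperature) > 80""".stripMargin)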

  1   2   3   >