Re: [DISCUSS][SPARK SQL] SPARK-51710: Using Dataframe.dropDuplicates with an empty array as argument behaves "unexpectedly"

2025-05-17 Thread David Kunzmann
> ...a reasonable change, as the previous behavior, which always returns the first row, doesn't make sense. For safety, we can add a legacy config for fallback and mention it in the migration guide. > On Wed, May 14, 2025 at 9:21 AM David Kunzmann wrote: >> Hi James, ...

Re: [DISCUSS][SPARK SQL] SPARK-51710: Using Dataframe.dropDuplicates with an empty array as argument behaves "unexpectedly"

2025-05-14 Thread David Kunzmann
...Willis wrote: > This seems like the correct behavior to me. Every value of the empty set of columns will match between any pair of Rows. > On Thu, May 8, 2025 at 11:37 AM David Kunzmann wrote: >> Hello everyone, following the creation of t...

[DISCUSS][SPARK SQL] SPARK-51710: Using Dataframe.dropDuplicates with an empty array as argument behaves "unexpectedly"

2025-05-08 Thread David Kunzmann
...duplicates and remove them. This behavior is the same on the Scala side, where df.dropDuplicates(Seq.empty) returns the first row. Would it make sense to change the behavior of df.dropDuplicates(Seq.empty) to match df.dropDuplicates()? Cheers, David
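For reference, a minimal Scala sketch of the behavior under discussion (illustrative only; output depends on your Spark version, and the proposed change would make the empty-list case behave like the no-argument case):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().master("local[*]").appName("dedup-demo").getOrCreate()
    import spark.implicits._

    val df = Seq((1, "a"), (1, "a"), (2, "b")).toDF("id", "value")

    // Considers all columns: keeps one row per distinct (id, value) pair, so 2 rows remain.
    df.dropDuplicates().show()

    // With an empty column list every row is "equal" on zero columns,
    // so only the first row survives -- the surprise reported in SPARK-51710.
    df.dropDuplicates(Seq.empty[String]).show()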

Re: [DISCUSS] Ongoing projects for Spark 4.0

2025-01-22 Thread David Milicevic
CASE, WHILE, REPEAT, LOOP, FOR, LEAVE, ITERATE, etc. SQL Scripting still doesn't work with Spark Connect. Thanks, David On Wed, Jan 22, 2025 at 12:25 PM Stefan Kandic wrote: > Hi, I am working on adding collation support (https://issues.apache.org/jira/projects/SPARK/is...

Re: Why spark-submit works with package not with jar

2024-05-06 Thread David Rabinowitz
Hi, it seems this library is several years old. Have you considered using the Google-provided connector? You can find it at https://github.com/GoogleCloudDataproc/spark-bigquery-connector Regards, David Rabinowitz On Sun, May 5, 2024 at 6:07 PM Jeff Zhang wrote: > Are you s...

Re: JDK version support policy?

2023-06-13 Thread David Li
>>>> ...supported in OpenJDK until 2026, I'm not sure we're going to see enough folks moving to JRE 17 by the Spark 4 release. Unless we have a strong benefit from dropping 11 support, I'd be inclined to keep it. >>>> On Tue, Jun 6, 2023

JDK version support policy?

2023-06-06 Thread David Li
Hello Spark developers, I'm from the Apache Arrow project. We've discussed Java version support [1], and crucially, whether to continue supporting Java 8 or not. As Spark is a big user of Arrow in Java, I was curious what Spark's policy here was. If Spark intends to stay on Java 8, for instance

SPARK-22256

2020-12-11 Thread David McWhorter
Hello, my name is David McWhorter and I created a new pull request to address the SPARK-22256 ticket at https://github.com/apache/spark/pull/30739. This change adds a memory overhead setting for the Spark driver running on Mesos. This is a reopening of a prior pull request that was never merged...
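As a hedged illustration of the kind of setting being discussed (the exact key introduced by this PR is not shown in the excerpt; the names below follow the existing Mesos config pattern and should be checked against the PR and docs):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      // Existing executor-side overhead setting on Mesos.
      .set("spark.mesos.executor.memoryOverhead", "1024")
      // Assumed driver-side counterpart added by SPARK-22256; verify the exact key.
      .set("spark.mesos.driver.memoryOverhead", "512")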

Unsubscribe

2020-12-08 Thread David Zhou
Unsubscribe

[DISCUSS] Reducing memory usage of toPandas with Arrow "self_destruct" option

2020-09-10 Thread David Li
...filers/memory_profiler [3]: https://github.com/pandas-dev/pandas/issues/35530 [*] See my comment in https://issues.apache.org/jira/browse/ARROW-9878. Thanks, David

Spark 3.0 and ORC 1.6

2020-01-28 Thread David Christle
dependence on Hadoop 2.9 is not required). Again, these may be non-issues, but I wanted to kindle discussion around whether this can make the cut for 3.0, since I imagine it’s a major upgrade many users will focus on migrating to once released. Kind regards, David Christle

Re: [Kubernetes] Resource requests and limits for Driver and Executor Pods

2018-03-30 Thread David Vogelbacher
...default for spark.kubernetes.executor.cores would be. Seeing that I wanted more than 1 and Yinan wants less, leaving it at 1 might be best. Thanks, David From: Kimoon Kim Date: Friday, March 30, 2018 at 4:28 PM To: Yinan Li Cc: David Vogelbacher, "dev@spark.apache.org" Subject: Re: [K...
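For context, a sketch of how the request/limit split looks from the configuration side (the keys below are the ones I understand later Spark releases expose; treat them as assumptions and confirm against the running-on-kubernetes docs for your version):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.executor.cores", "1")                        // what the task scheduler sees
      .set("spark.kubernetes.executor.request.cores", "500m")  // pod CPU request
      .set("spark.kubernetes.executor.limit.cores", "2")       // pod CPU limit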

[Kubernetes] Resource requests and limits for Driver and Executor Pods

2018-03-29 Thread David Vogelbacher
..., David

RE: Launching multiple spark jobs within a main spark job.

2016-12-21 Thread David Hodeffi
I am not aware of any problem with that. Anyway, if you run a Spark application you will have multiple jobs, so it makes sense that this is not a problem. Thanks, David. From: Naveen [mailto:hadoopst...@gmail.com] Sent: Wednesday, December 21, 2016 9:18 AM To: dev@spark.apache.org; u...

Re: SPARK-13843 and future of streaming backends

2016-03-25 Thread David Nalley
> As far as group/artifact name compatibility, at least in the case of Kafka we need different artifact names anyway, and people are going to have to make changes to their build files for Spark 2.0 anyway. As far as keeping the actual classes in org.apache.spark to not break code despi...

Re: [ANNOUNCE] New SAMBA Package = Spark + AWS Lambda

2016-02-02 Thread David Russell
...the vision is to get rid of all cluster management when using Spark. You might find one of the hosted Spark platforms that handle cluster management for you, such as Databricks or Amazon EMR, a good place to start. At least in my experience, they got me...

[ANNOUNCE] New SAMBA Package = Spark + AWS Lambda

2016-02-01 Thread David Russell
> ROSE Spark Package: https://github.com/onetapbeyond/opencpu-spark-executor Questions, suggestions, feedback welcome. David -- "All that is gold does not glitter, Not all those who wander are lost."

Re: ROSE: Spark + R on the JVM.

2016-01-13 Thread David Russell
...APIs in Java, JavaScript and .NET that can easily support your use case. The outputs of your DeployR integration could then become inputs to your data processing system. David "All that is gold does not glitter, Not all those who wander are lost."

Re: ROSE: Spark + R on the JVM.

2016-01-12 Thread David Russell
...weight as ROSE, and it is not designed to work in a clustered environment. ROSE, on the other hand, is designed for scale. David "All that is gold does not glitter, Not all those who wander are lost."

Re: Eigenvalue solver

2016-01-12 Thread David Hall
...https://github.com/thomasnat1/cdcNewsRanker/blob/71b0ff3989d5191dc6a78c40c4a7a9967cbb0e49/venv/lib/python2.7/site-packages/scipy/sparse/linalg/eigen/arpack/arpack.py#L1049) I'm happy to help more if you decide to go this route, here, on the scala-breeze Google group, or on GitHub. -- David On Tue, Jan 12, 2016 at 10:28 AM, L...
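For small dense symmetric problems there is also a direct eigendecomposition in Breeze; a tiny sketch (this is not the ARPACK route discussed above, just a baseline for comparison):

    import breeze.linalg.{DenseMatrix, eigSym}

    val a = DenseMatrix((2.0, 1.0), (1.0, 2.0))  // small symmetric matrix
    val es = eigSym(a)
    println(es.eigenvalues)    // DenseVector(1.0, 3.0)
    println(es.eigenvectors)   // columns are the corresponding eigenvectors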

Re: ROSE: Spark + R on the JVM.

2016-01-12 Thread David Russell
Hi Corey, > Would you mind providing a link to the GitHub? Sure, here is the GitHub link you're looking for: https://github.com/onetapbeyond/opencpu-spark-executor David "All that is gold does not glitter, Not all those who wander are lost."

ROSE: Spark + R on the JVM.

2016-01-12 Thread David
...you to take a look: https://github.com/onetapbeyond/opencpu-spark-executor. Any feedback, questions, etc. very welcome. David "All that is gold does not glitter, Not all those who wander are lost."

Re: [discuss] dropping Python 2.6 support

2016-01-11 Thread David Chin
...on 2.7. Some libraries that Spark depends on stopped supporting 2.6. We can still convince the library maintainers to support 2.6, but it will be extra work. I'm curious if anybody still uses Python 2.6 to run Spark. Thanks. -- David Chin, Ph.D. david.c...@drexel.edu Sr. Systems Administrator, URCF, Drexel U. http://www.drexel.edu/research/urcf/ https://linuxfollies.blogspot.com/ +1.215.221.4747 (mobile) https://github.com/prehensilecode

Differing performance in self joins

2015-08-26 Thread David Smith
...but the following is equally slow: df.join(laggard, (df("series") === laggard("p_series")) && (df("eday") - laggard("p_eday")).between(1,7)).count Any advice about the general principle at work here would be welcome. T...
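One general workaround for slow range joins like this (not from the thread, just a common technique; it assumes df has series/eday and laggard has p_series/p_eday) is to add a coarse bucket column so the join gets an extra equi-join key while the exact range predicate still filters precisely:

    import org.apache.spark.sql.functions.{array, col, explode, floor}

    // Bucket each row into 7-day blocks; duplicate each laggard row into its own
    // block and the next one so the 1..7-day window is always covered by the
    // equi-join on (series, bucket), shrinking the per-key comparison space.
    val bucketedDf = df.withColumn("bucket", floor(col("eday") / 7))
    val bucketedLag = laggard.withColumn(
      "p_bucket", explode(array(floor(col("p_eday") / 7), floor(col("p_eday") / 7) + 1)))

    bucketedDf.join(
      bucketedLag,
      col("series") === col("p_series") &&
        col("bucket") === col("p_bucket") &&
        (col("eday") - col("p_eday")).between(1, 7)
    ).count()

Whether this helps depends on the data and on Spark's join selection, but it usually turns a near-quadratic per-series comparison into something much smaller.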

Re: [mllib] Is there any bugs to divide a Breeze sparse vectors at Spark v1.3.0-rc3?

2015-03-18 Thread David Hall
Sure. On Wed, Mar 18, 2015 at 12:19 AM, Debasish Das wrote: > Hi David, we are stress testing breeze.optimize.proximal and nnls... if you are cutting a release now, we will need another release soon once we get the runtime optimizations in place and merged to Breeze.

Re: [mllib] Is there any bugs to divide a Breeze sparse vectors at Spark v1.3.0-rc3?

2015-03-17 Thread David Hall
Ping? On Sun, Mar 15, 2015 at 9:38 PM, David Hall wrote: > Snapshot is pushed. If you verify, I'll publish the new artifacts. > On Sun, Mar 15, 2015 at 1:14 AM, Yu Ishikawa wrote: >> David Hall, the creator of Breeze, told me that it's a bug. So, I made...

Re: [mllib] Is there any bugs to divide a Breeze sparse vectors at Spark v1.3.0-rc3?

2015-03-15 Thread David Hall
Snapshot is pushed. If you verify, I'll publish the new artifacts. On Sun, Mar 15, 2015 at 1:14 AM, Yu Ishikawa wrote: > David Hall, the creator of Breeze, told me that it's a bug. So, I made a JIRA ticket about this issue. We need to upgrade Breeze from 0.11.1 to 0.11.2...

GSoC 2015

2015-03-08 Thread David J. Manglano
n the goals for Spark with GSoC (it is my understanding that Manoj Kumar is the mentor), though I may be incorrect. I have been reading the Spark codebase on GitHub and think I may be able to help develop Spark's Python API. To get involved, what next steps should I take? Thanks! David J. Mang

Re: Implementing TinkerPop on top of GraphX

2015-01-15 Thread David Robinson
I am new to Spark and GraphX; however, I use TinkerPop-backed graphs, and I think using TinkerPop as the API for GraphX is a great idea and hope you are still headed in that direction. I noticed that TinkerPop 3 is moving into the Apache family: http://wiki.apache.org/incubator/TinkerPopP...

Re: spark-yarn_2.10 1.2.0 artifacts

2014-12-22 Thread David McWhorter
Thank you, Sean; using spark-network-yarn seems to do the trick. On 12/19/2014 12:13 PM, Sean Owen wrote: I believe spark-yarn does not exist from 1.2 onwards. Have a look at spark-network-yarn for where some of that went, I believe. On Fri, Dec 19, 2014 at 5:09 PM, David McWhorter wrote: Hi...

spark-yarn_2.10 1.2.0 artifacts

2014-12-19 Thread David McWhorter
...Any help or insights into how to use spark-yarn_2.10 1.2.0 in a Maven build would be appreciated. David -- David McWhorter, Software Engineer, Commonwealth Computer Research, Inc., 1422 Sachem Place, Unit #1, Charlottesville, VA 22901, mcwhor...@ccri.com | 4...

Re: EC2 clusters ready in launch time + 30 seconds

2014-10-06 Thread David Rowe
...an existing issue -- SPARK-3314 (https://issues.apache.org/jira/browse/SPARK-3314) -- about scripting the creation of Spark AMIs. With Packer, it looks like we may be able to script the creat...

Re: Breeze Library usage in Spark

2014-10-03 Thread David Hall
Yeah, breeze.storage.Zero was introduced in either 0.8 or 0.9. On Fri, Oct 3, 2014 at 9:45 AM, Xiangrui Meng wrote: > Did you add a different version of Breeze to the classpath? In Spark 1.0, we use Breeze 0.7, and in Spark 1.1 we use 0.9. If the Breeze version you used is different from the...

Re: EC2 clusters ready in launch time + 30 seconds

2014-10-02 Thread David Rowe
I think this is exactly what Packer is for. See e.g. http://www.packer.io/intro/getting-started/build-image.html On a related note, the current AMI for HVM systems (e.g. m3.*, r3.*) has a bad package for httpd, which causes Ganglia not to start. For some reason I can't get access to the raw AMI to...

Re: BlockManager issues

2014-09-22 Thread David Rowe
I've run into this with large shuffles - I assumed that there was contention between the shuffle output files and the JVM for memory. Whenever we start getting these fetch failures, it corresponds with high load on the machines the blocks are being fetched from, and in some cases complete unrespons

Source code for mining big data with Spark

2014-09-14 Thread David Tung
Hi all, I watched an impressive Spark demo video by Reynold Xin and Aaron Davidson on YouTube (https://www.youtube.com/watch?v=FjhRkfAuU7I). Can someone let me know where I can find the source code for the demo? I can't see the source code clearly in the video. Thanks in advance.

Re: Is breeze thread safe in Spark?

2014-09-03 Thread David Hall
Mutating operations are not thread safe. Operations that don't mutate should be thread safe. I can't speak to what Evan said, but I would guess that the way they're using += should be safe. On Wed, Sep 3, 2014 at 11:58 AM, RJ Nowling wrote: > David, can you confi...

Re: Is breeze thread safe in Spark?

2014-09-03 Thread David Hall
In Breeze we generally allocate separate work arrays for each call to LAPACK, so it should be fine. Concurrent modification isn't thread safe, of course, but things that "ought" to be thread safe really should be. On Wed, Sep 3, 2014 at 10:41 AM, RJ Nowling wrote: > No, it's not in...
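A toy sketch of the distinction being made (assumes Scala 2.11/2.12-style parallel collections; the point is only that non-mutating Breeze ops are safe per thread, while concurrent in-place mutation of one shared vector is not):

    import breeze.linalg.DenseVector

    val dim = 10
    val data = (0 until 1000).map(i => DenseVector.fill(dim)(i.toDouble))

    // Fine: each task builds its own result with non-mutating ops (+ allocates a new vector).
    val doubled = data.par.map(v => v + v)

    // Risky: many threads mutating one shared accumulator in place is a data race.
    val shared = DenseVector.zeros[Double](dim)
    // data.par.foreach(v => shared += v)   // don't do this

    // Safer: reduce with the non-mutating +.
    val total = data.reduce(_ + _)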

Re: Linear CG solver

2014-06-27 Thread David Hall
I have no idea about benchmarks, but Breeze has a CG solver: https://github.com/scalanlp/breeze/tree/master/math/src/main/scala/breeze/optimize/linear/ConjugateGradient.scala https://github.com/scalanlp/breeze/blob/e2adad3b885736baf890b306806a56abc77a3ed3/math/src/test/scala/breeze/optimize/linear/C...
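The linked class is the one to use in practice; purely as an illustration of what a CG solve looks like on Breeze vectors, here is a hand-rolled sketch (not the library API):

    import breeze.linalg.{DenseMatrix, DenseVector}

    // Plain conjugate gradient for A x = b, with A symmetric positive definite.
    def cg(a: DenseMatrix[Double], b: DenseVector[Double],
           tol: Double = 1e-8, maxIter: Int = 1000): DenseVector[Double] = {
      var x = DenseVector.zeros[Double](b.length)
      var r = b - a * x          // residual
      var p = r.copy             // search direction
      var rsOld = r dot r
      var iter = 0
      while (math.sqrt(rsOld) > tol && iter < maxIter) {
        val ap = a * p
        val alpha = rsOld / (p dot ap)
        x = x + p * alpha
        r = r - ap * alpha
        val rsNew = r dot r
        p = r + p * (rsNew / rsOld)
        rsOld = rsNew
        iter += 1
      }
      x
    }

    val a = DenseMatrix((4.0, 1.0), (1.0, 3.0))
    val b = DenseVector(1.0, 2.0)
    println(cg(a, b))   // approximately DenseVector(0.0909, 0.6364)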

Re: mllib vector templates

2014-05-05 Thread David Hall
On Mon, May 5, 2014 at 3:40 PM, DB Tsai wrote: > David, could we use Int, Long, Float as the data feature spaces, and Double for the optimizer? Yes. Breeze doesn't allow operations on mixed types, so you'd need to convert the Double vectors to Floats if you wanted,...

Re: mllib vector templates

2014-05-05 Thread David Hall
...memory available... On Mon, May 5, 2014 at 3:06 PM, David Hall wrote: > LBFGS and other optimizers would not work immediately, as they require vector spaces over Double. Otherwise it should work. On May 5, 2014 3:03 PM, "DB Tsai" wrote:

Re: mllib vector templates

2014-05-05 Thread David Hall
LBFGS and other optimizers would not work immediately, as they require vector spaces over Double. Otherwise it should work. On May 5, 2014 3:03 PM, "DB Tsai" wrote: > Breeze could take any type (Int, Long, Double, and Float) in the matrix template. Sincerely, DB Tsai
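A small sketch of the split being discussed: keep features in Float to save memory and convert at the optimizer boundary, since the optimizers want a vector space over Double (availability of convert depends on your Breeze version):

    import breeze.linalg.{DenseVector, convert}

    // Features stored as Float to halve memory.
    val features: DenseVector[Float] = DenseVector(1.0f, 2.5f, -0.5f)

    // Converted to Double where the optimizer needs it.
    val asDouble: DenseVector[Double] = convert(features, Double)

    // Element-wise alternative.
    val asDouble2: DenseVector[Double] = features.map(_.toDouble)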

Re: MLlib - logistic regression with GD vs LBFGS, sparse vs dense benchmark result

2014-04-29 Thread David Hall
...in this online BFGS: http://jmlr.org/proceedings/papers/v2/schraudolph07a/schraudolph07a.pdf -- David On Tue, Apr 29, 2014 at 3:30 PM, DB Tsai wrote: > Have a quick hack to understand the behavior of SLBFGS (Stochastic LBFGS) by overwriting the Breeze iterations method to get the...

Re: MLlib - logistic regression with GD vs LBFGS, sparse vs dense benchmark result

2014-04-28 Thread David Hall
That's right. FWIW, caching should be automatic now, but it might be that the version of Breeze you're using doesn't do that yet. Also, in breeze.util._ there's an implicit that adds a tee method to Iterator, and also a last method. Both are useful for things like this. -- D...

Re: MLlib - logistic regression with GD vs LBFGS, sparse vs dense benchmark result

2014-04-25 Thread David Hall
...asing objective value. If you're regularizing, are you including the regularizer in the objective value computation? GD is almost never worth your time. -- David On Fri, Apr 25, 2014 at 2:57 PM, DB Tsai wrote: > Another interesting benchmark. *News20 dataset - 0.14M rows, 1,355...
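A minimal Breeze LBFGS sketch of the point above, with the regularizer included in both the objective value and the gradient so the reported objective matches what is actually being minimized (toy problem, arbitrary parameters):

    import breeze.linalg.DenseVector
    import breeze.optimize.{DiffFunction, LBFGS}

    // Ridge-style toy objective: 0.5*||x - c||^2 + 0.5*lambda*||x||^2.
    val c = DenseVector(1.0, 2.0, 3.0)
    val lambda = 0.1

    val f = new DiffFunction[DenseVector[Double]] {
      def calculate(x: DenseVector[Double]): (Double, DenseVector[Double]) = {
        val diff = x - c
        val value = 0.5 * (diff dot diff) + 0.5 * lambda * (x dot x)
        val grad = diff + x * lambda   // regularizer appears in value AND gradient
        (value, grad)
      }
    }

    val lbfgs = new LBFGS[DenseVector[Double]](maxIter = 100, m = 7)
    val xStar = lbfgs.minimize(f, DenseVector.zeros[Double](3))
    println(xStar)   // approximately c / (1 + lambda)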

Re: MLlib - logistic regression with GD vs LBFGS, sparse vs dense benchmark result

2014-04-23 Thread David Hall
...ture is sparse. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Wed, Apr 23, 2014 at...

Re: MLlib - logistic regression with GD vs LBFGS, sparse vs dense benchmark result

2014-04-23 Thread David Hall
Was the weight vector sparse? The gradients? Or just the feature vectors? On Wed, Apr 23, 2014 at 10:08 PM, DB Tsai wrote: > The figure showing the log-likelihood vs. time can be found here: https://github.com/dbtsai/spark-lbfgs-benchmark/raw/fd703303fb1c16ef5714901739154728550becf4/result...

Re: MLlib - logistic regression with GD vs LBFGS, sparse vs dense benchmark result

2014-04-23 Thread David Hall
Any chance you remember what the problems were? I'm sure it could be better, but it's good to know where improvements need to happen. -- David > On Apr 23, 2014, at 9:21 PM, DB Tsai wrote: >> Hi all, I'm benchmarking Logistic Regression...

Re: RFC: varargs in Logging.scala?

2014-04-11 Thread David Hall
Another usage that's nice is: logDebug { val timeS = timeMillis / 1000.0; s"Time: $timeS" } which can be useful for more complicated expressions. On Thu, Apr 10, 2014 at 5:55 PM, Michael Armbrust wrote: > BTW... you can do calculations in string interpolation: s"Time: ${timeMillis / 1...
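The block form works because the log methods take the message by name, so the body only runs when the level is enabled. A hedged sketch of the pattern (not the actual Spark Logging trait, just the idea):

    import org.slf4j.{Logger, LoggerFactory}

    trait Logging {
      private lazy val log: Logger = LoggerFactory.getLogger(getClass)

      // The by-name parameter (=> String) delays building the message until we
      // know debug logging is enabled, so expensive interpolation is skipped.
      protected def logDebug(msg: => String): Unit =
        if (log.isDebugEnabled) log.debug(msg)
    }

    class Worker extends Logging {
      def run(timeMillis: Long): Unit = logDebug {
        val timeS = timeMillis / 1000.0
        s"Time: $timeS"
      }
    }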

Re: MLLib - Thoughts about refactoring Updater for LBFGS?

2014-03-31 Thread David Hall
...tic constraints... I wonder how our general projected gradient solver would do? Clearly having dedicated QP support is better, but in terms of just getting it working, it might be enough. -- David On Sun, Mar 30, 2014 at 4:40 PM, David Hall wrote: > On Sun,...

Re: MLLib - Thoughts about refactoring Updater for LBFGS?

2014-03-30 Thread David Hall
On Sun, Mar 30, 2014 at 2:01 PM, Debasish Das wrote: > Hi David, I have started to experiment with BFGS solvers for Spark GLM over large-scale data... I am also looking to add a good QP solver in Breeze that can be used in Spark ALS for constraint solves... More

Re: Making RDDs Covariant

2014-03-22 Thread David Hall
//issues.scala-lang.org/browse/SI-2509 -- David

Re: graphx samples in Java

2014-03-22 Thread David Soroko
Is there a time frame for adding a Java API? -- David > On 22 Mar 2014, at 05:11, "Reynold Xin" wrote: > There is no Java API yet. >> On Fri, Mar 21, 2014 at 3:18 AM, David Soroko wrote: >> Hi, where ca...

graphx samples in Java

2014-03-21 Thread David Soroko
, "student")), (7L, ("jgonzal", "postdoc")), (5L, ("franklin", "prof")), (2L, ("istoica", "prof" thanks --david

Code documentation

2014-03-15 Thread David Thomas
Is there any documentation available that explains the code architecture that can help a new Spark framework developer?

Re: MLLib - Thoughts about refactoring Updater for LBFGS?

2014-03-06 Thread David Hall
On Thu, Mar 6, 2014 at 4:21 PM, DB Tsai wrote: > Hi David, I can converge to the same result with your Breeze LBFGS and Fortran implementations now. Probably I made some mistakes when I tried Breeze before. I apologize that I claimed it's not stable.

Re: MLLib - Thoughts about refactoring Updater for LBFGS?

2014-03-05 Thread David Hall
I did not. They would be nice to have. On Wed, Mar 5, 2014 at 5:21 PM, Debasish Das wrote: > David, there used to be standard BFGS test cases in Professor Nocedal's package... did you stress test the solver with them? If not, I will shoot him an email for...

Re: MLLib - Thoughts about refactoring Updater for LBFGS?

2014-03-05 Thread David Hall
On Wed, Mar 5, 2014 at 1:57 PM, DB Tsai wrote: > Hi David, > On Tue, Mar 4, 2014 at 8:13 PM, dlwh wrote: >> I'm happy to help fix any problems. I've verified at points that the implementation gives the exact same sequence of iterates for a few differen...

Re: MLLib - Thoughts about refactoring Updater for LBFGS?

2014-03-05 Thread David Hall
On Wed, Mar 5, 2014 at 8:50 AM, Debasish Das wrote: > Hi David, a few questions on Breeze solvers: 1. I feel the right place to add useful things from RISO LBFGS (based on Professor Nocedal's Fortran code) will be Breeze. It will involve stress testing br...