Re: When is a Bigint a long and when is a long a long

2020-06-28 Thread Sean Owen
'bigint' is a long, not a Java BigInteger. On Sun, Jun 28, 2020 at 5:52 AM Anwar AliKhan wrote: > > I wish to draw your attention for your consideration to this approach > where the BigInt data type maps to Long without drawing an error. > > https://stackoverflow.com/questions/31011797/bug-in
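
A quick way to see the mapping (a minimal PySpark sketch, not from the thread):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.sql("SELECT CAST(1 AS BIGINT) AS x")
df.printSchema()  # x: long -- SQL 'bigint' is a 64-bit long, not a java.math.BigInteger
```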

Re: XmlReader not Parsing the Nested elements in XML properly

2020-06-30 Thread Sean Owen
This is more a question about spark-xml, which is not part of Spark. You can ask at https://github.com/databricks/spark-xml/ but if you do please show some example of the XML input and schema and output. On Tue, Jun 30, 2020 at 11:39 AM mars76 wrote: > > Hi, > > I am trying to read XML data fro

Re: Is it possible to use Hadoop 3.x and Hive 3.x using spark 2.4?

2020-07-06 Thread Sean Owen
2.4 works with Hadoop 3 (optionally) and Hive 1. I doubt it will work connecting to Hadoop 3 / Hive 3; it's possible in a few cases. It's also possible some vendor distributions support this combination. On Mon, Jul 6, 2020 at 7:51 AM Teja wrote: > > We use spark 2.4.0 to connect to Hadoop 2.7 cl

Re: When does SparkContext.defaultParallelism have the correct value?

2020-07-07 Thread Sean Owen
If not set explicitly with spark.default.parallelism, it will default to the number of cores currently available (minimum 2). At the very start, some executors haven't completed registering, which I think explains why it goes up after a short time. (In the case of dynamic allocation it will change
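
If a stable value matters more than tracking the cluster, it can be pinned at session startup; a minimal sketch (200 is an arbitrary choice):

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.default.parallelism", "200")  # fixed; no longer follows registered cores
         .getOrCreate())
print(spark.sparkContext.defaultParallelism)  # 200 from startup onward
```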

Re: com.fasterxml.jackson.databind.JsonMappingException: Scala module 2.9.6 requires Jackson Databind version >= 2.9.0 and < 2.10.0

2020-07-09 Thread Sean Owen
You have a Jackson version conflict somewhere. It might be from other libraries you include in your application. I am not sure Spark 2.3 works with Hadoop 3.1, so this may be the issue. Make sure you match these to Spark, and/or use the latest versions. On Thu, Jul 9, 2020 at 8:23 AM Julian Jiang

Re: [Spark 3.0 Kubernetes] Does Spark 3.0 support production deployment

2020-07-09 Thread Sean Owen
I haven't used the K8S scheduler personally, but, just based on that comment I wouldn't worry too much. It's been around for several versions and AFAIK works fine in general. We sometimes aren't so great about removing "experimental" labels. That said I know there are still some things that could b

Re: Strange WholeStageCodegen UI values

2020-07-09 Thread Sean Owen
It sounds like you have huge data skew? On Thu, Jul 9, 2020 at 4:15 PM Bobby Evans wrote: > > Sadly there isn't a lot you can do to fix this. All of the operations take > iterators of rows as input and produce iterators of rows as output. For > efficiency reasons, the timing is not done for e

Re: Issue in parallelization of CNN model using spark

2020-07-13 Thread Sean Owen
There is a multilayer perceptron implementation in Spark ML, but that's not what you're looking for. To parallelize model training developed using standard libraries like Keras, use Horovod from Uber. https://horovod.readthedocs.io/en/stable/spark_include.html On Mon, Jul 13, 2020 at 6:59 AM Mukht

Re: scala RDD[MyCaseClass] to Dataset[MyCaseClass] perfomance

2020-07-13 Thread Sean Owen
Wouldn't toDS() do this without conversion? On Mon, Jul 13, 2020 at 5:25 PM Ivan Petrov wrote: > > Hi! > I'm trying to understand the cost of RDD to Dataset conversion > It takes me 60 minutes to create RDD [MyCaseClass] with 500.000.000.000 > records > It takes around 15 minutes to convert them

Re: Issue in parallelization of CNN model using spark

2020-07-14 Thread Sean Owen
It is still copyrighted material, no matter its state of editing. Yes, you should not be sharing this on the internet. On Tue, Jul 14, 2020 at 9:46 AM Anwar AliKhan wrote: > > Please note It is freely available because it is an early unedited raw > edition. > It is not 100% complete , it is not

Re: download of spark

2020-07-15 Thread Sean Owen
Works for me - do you have JavaScript disabled? It will be necessary. On Wed, Jul 15, 2020 at 11:52 AM Ming Liao wrote: > To whom it may concern, > > Hope this email finds you well. > I am trying to download spark but I was not able to select the release and > package type. Could you please help

Re: Using pyspark with Spark 2.4.3 a MultiLayerPerceptron model gives inconsistent outputs if a large amount of data is fed into it and at least one of the model outputs is fed to a Python UDF.

2020-07-17 Thread Sean Owen
I can't reproduce it (on Databricks / Spark 2.4), but as you say, sounds really specific to some way of executing it. I can't off the top of my head imagine why that would be. As you say, no matter the model, it should be the same result. I don't recall a bug being fixed around there, but neverthel

Re: Spark DataFrame Creation

2020-07-22 Thread Sean Owen
You'd probably do best to ask that project, but scanning the source code, that looks like it's how it's meant to work. It downloads to a temp file on the driver then copies to distributed storage then returns a DataFrame for that. I can't see how it would be implemented directly over sftp as there

Re: [Spark ML] existence of Matrix Factorization ALS algorithm's log version

2020-07-29 Thread Sean Owen
No there isn't a log version. You could probably copy and hack the implementation easily if necessary. On Wed, Jul 29, 2020 at 11:05 AM jyuan1986 wrote: > > Hi Team, > > I'm looking for information regarding MF_ALS algorithm's log version if > implemented. In original Hu et al.'s paper "Collabora

Re: Tab delimited csv import and empty columns

2020-07-31 Thread Sean Owen
Try setting nullValue to anything besides the empty string. Because its default is the empty string, empty strings become null by default. On Fri, Jul 31, 2020 at 3:20 AM Stephen Coy wrote: > That does not work. > > This is Spark 3.0 by the way. > > I have been looking at the Spark unit tests an
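
A sketch of that workaround (the path and the null token are hypothetical):

```python
df = (spark.read
      .option("sep", "\t")
      .option("header", "true")
      .option("nullValue", "\\N")  # any non-empty token stops "" being read as null
      .csv("/data/input.tsv"))
```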

Re: CVE-2020-9480: Apache Spark RCE vulnerability in auth-enabled standalone master

2020-08-03 Thread Sean Owen
3.0.0+. For those using vendor distros, you may want to check with your vendor about whether the relevant patch has been applied. Sean On Mon, Jun 22, 2020 at 4:49 PM Sean Owen wrote: > > Severity: Important > > Vendor: The Apache Software Foundation > > Versions Affected: &

Re: Comments conventions in Spark distribution official examples

2020-08-05 Thread Sean Owen
These only matter to our documentation, which includes the source of these examples inline in the docs. For brevity, the examples don't need to show all the imports that are otherwise necessary for the source file. You can ignore them like the compiler does as comments if you are using the example

Re: [SPARK-SQL] How to return GenericInternalRow from spark udf

2020-08-06 Thread Sean Owen
The UDF should return the result value you want, not a whole Row. In Scala it figures out the schema of the UDF's result from the signature. On Thu, Aug 6, 2020 at 7:56 AM Amit Joshi wrote: > > Hi, > > I have a spark udf written in scala that takes couuple of columns and apply > some logic and o

Re: Spark Streaming with Kafka and Python

2020-08-12 Thread Sean Owen
What supports Python in (Kafka?) 0.8? I don't think Spark ever had a specific Python-Kafka integration. But you have always been able to use it to read DataFrames as in Structured Streaming. Kafka 0.8 support is deprecated (gone in 3.0) but 0.10 means 0.10+ - works with the latest 2.x. What is the

Re: How can I use pyspark to upsert one row without replacing entire table

2020-08-12 Thread Sean Owen
It's not so much Spark but the data format, whether it supports upserts. Parquet, CSV, JSON, etc would not. That is what Delta, Hudi et al are for, and yes you can upsert them in Spark. On Wed, Aug 12, 2020 at 9:57 AM Siavash Namvar wrote: > > Hi, > > I have a use case, and read data from a db ta
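
With Delta Lake, for instance, the upsert is a MERGE; a sketch assuming a hypothetical table path, key column, and an updates_df of changed rows:

```python
from delta.tables import DeltaTable

target = DeltaTable.forPath(spark, "/data/accounts")
(target.alias("t")
 .merge(updates_df.alias("s"), "t.id = s.id")
 .whenMatchedUpdateAll()      # update rows whose key already exists
 .whenNotMatchedInsertAll()   # insert rows with new keys
 .execute())
```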

Re: Spark - Scala-Java interoperablity

2020-08-16 Thread Sean Owen
That should be fine. The JVM doesn't care how the bytecode it is executing was produced. As long as you were able to compile it together - which sometimes means using plugins like scala-maven-plugin for mixed compilation - the result should be fine. On Sun, Aug 16, 2020 at 4:28 PM Ramesh Mathikuma

Re: Referencing a scala/java PipelineStage from pyspark - constructor issues with HasInputCol

2020-08-17 Thread Sean Owen
Looks like you are building vs Spark 3 and running on Spark 2, or something along those lines. On Mon, Aug 17, 2020 at 4:02 AM Aviad Klein wrote: > Hi, I've referenced the same problem on stack overflow and can't seem to > find answers. > > I have custom spark pipelinestages written in scala tha

Re: Referencing a scala/java PipelineStage from pyspark - constructor issues with HasInputCol

2020-08-17 Thread Sean Owen
Hm, next guess: you need a no-arg constructor this() on FooTransformer? also consider extending UnaryTransformer. On Mon, Aug 17, 2020 at 9:08 AM Aviad Klein wrote: > Hi Owen, it's omitted from what I pasted but I'm using spark 2.4.4 on both. > > On Mon, Aug 17, 2020 at 4:37

Re: Ability to have CountVectorizerModel vocab as empty

2020-08-19 Thread Sean Owen
I think that's true. You're welcome to open a pull request / JIRA to remove that requirement. On Wed, Aug 19, 2020 at 3:21 AM Jatin Puri wrote: > > Hello, > > This is wrt > https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/CountVectorizer.scala#L244 > >

Re: Referencing a scala/java PipelineStage from pyspark - constructor issues with HasInputCol

2020-08-25 Thread Sean Owen
That looks roughly right, though you will want to mark Spark dependencies as provided. Do you need netlib directly? Pyspark won't matter here if you're in Scala; what's installed with pip would not matter in any event. On Tue, Aug 25, 2020 at 3:30 AM Aviad Klein wrote: > > Hey Chris and Sean, tha

Re: Iterating all columns in a pyspark dataframe

2020-09-04 Thread Sean Owen
Do you need to iterate anything? you can always write a function of all columns, df.columns. You can operate on a whole Row at a time too. On Fri, Sep 4, 2020 at 2:11 AM Devi P.V wrote: > > Hi all, > What is the best approach for iterating all columns in a pyspark dataframe?I > want to apply som
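
A minimal sketch of the df.columns approach (trim stands in for whatever per-column logic is needed):

```python
from pyspark.sql import functions as F

df2 = df.select([F.trim(F.col(c)).alias(c) for c in df.columns])
```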

Re: Missing / Duplicate Data when Spark retries

2020-09-10 Thread Sean Owen
It's more likely a subtle issue with your code or data, but hard to say without knowing more. The lineage is fine and deterministic, but your data or operations might not be. On Thu, Sep 10, 2020 at 12:03 AM Ruijing Li wrote: > > Hi all, > > I am on Spark 2.4.4 using Mesos as the task resource sc

Re: [DISCUSS] Spark cannot identify the problem executor

2020-09-11 Thread Sean Owen
-dev, +user Executors do not communicate directly, so I don't think that's quite what you are seeing. You'd have to clarify. On Fri, Sep 11, 2020 at 12:08 AM 陈晓宇 wrote: > > Hello all, > > We've been using spark 2.3 with blacklist enabled and often meet the problem > that when executor A has som

Re: 【Spark ML】How to get access of the MLlib's LogisticRegressionWithSGD after 3.0.0?

2020-09-22 Thread Sean Owen
-dev See the migration guide: https://spark.apache.org/docs/3.0.0/ml-migration-guide.html Use ml.LogisticRegression, which should still let you use SGD On Tue, Sep 22, 2020 at 12:54 AM Lyx <1181245...@qq.com> wrote: > > Hi, > I have updated my Spark to the version of 3.0.0, > and it seems th

Re: Is RDD.persist honoured if multiple actions are executed in parallel

2020-09-23 Thread Sean Owen
It is but it happens asynchronously. If you access the same block twice quickly, the cached block may not yet be available the second time. On Wed, Sep 23, 2020, 7:17 AM Arya Ketan wrote: > Hi, > I have a spark streaming use-case ( spark 2.2.1 ). And in my spark job, I > have multiple action

Re: A simple example that demonstrates that a Spark distributed cluster is faster than Spark Local Standalone

2020-09-24 Thread Sean Owen
If you have the same amount of resource (cores, memory, etc) on one machine, that is pretty much always going to be faster than using those same resources split across several machines. Even if you have somewhat more resource available on a cluster, the distributed version could be slower if you, f

Re: A simple example that demonstrates that a Spark distributed cluster is faster than Spark Local Standalone

2020-09-25 Thread Sean Owen
ne could just point > me at an example with some quick code and a large public data set and say > this runs faster on a cluster than standalone. I'd be happy to make a post > myself for any new people interested in Spark. > > Thanks > > > > > > > >

Re: A simple example that demonstrates that a Spark distributed cluster is faster than Spark Local Standalone

2020-09-25 Thread Sean Owen
optimising someone else's code that has no material value to me; I'm > interested in seeing a simple example of something working that I can then > carry across to my own datasets with a view to adopting the platform. > > Thx > > > > On Fri, Sep 25, 2020 at 2:29

Re: Apache Spark Bogotá Meetup

2020-09-30 Thread Sean Owen
Sure, we just ask people to open a pull request against https://github.com/apache/spark-website to update the page and we can merge it. On Wed, Sep 30, 2020 at 7:30 AM Miguel Angel Díaz Rodríguez < madiaz...@gmail.com> wrote: > Hello > > I am Co-organizer of Apache Spark Bogotá Meetup from Colomb

Re: [Spark SQL] does pyspark udf support spark.sql inside def

2020-09-30 Thread Sean Owen
No, you can't use the SparkSession from within a function executed by Spark tasks. On Wed, Sep 30, 2020 at 7:29 AM Lakshmi Nivedita wrote: > Here is a spark udf structure as an example > > Def sampl_fn(x): >Spark.sql(“select count(Id) from sample Where Id = x ”) > > > Spark.udf.regis
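
The usual rewrite is to express the per-row lookup as a join instead; a sketch reusing the question's table and Id column (input_df is a hypothetical DataFrame holding the x values):

```python
counts = spark.sql("SELECT Id, COUNT(Id) AS cnt FROM sample GROUP BY Id")
result = input_df.join(counts, on="Id", how="left")  # one lookup per row, with no SQL inside a UDF
```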

Re: Exception handling in Spark throws recursive value for DF needs type error

2020-10-01 Thread Sean Owen
You are reusing HiveDF for two vars and it ends up ambiguous. Just rename one. On Thu, Oct 1, 2020, 5:02 PM Mich Talebzadeh wrote: > Hi, > > > Spark version 2.3.3 on Google Dataproc > > > I am trying to use databricks to other databases > > > https://spark.apache.org/docs/latest/sql-data-sources

Re: Exception handling in Spark throws recursive value for DF needs type error

2020-10-02 Thread Sean Owen
It would be quite trivial. None of that affects any of the Spark execution. It doesn't seem like it helps though - you are just swallowing the cause. Just let it fly? On Fri, Oct 2, 2020 at 9:34 AM Mich Talebzadeh wrote: > As a side question consider the following read JDBC read > > > val lowerB

Re: Why spark-submit works with package not with jar

2020-10-20 Thread Sean Owen
Probably because your JAR file requires other JARs which you didn't supply. If you specify a package, it reads metadata like a pom.xml file to understand what other dependent JARs also need to be loaded. On Tue, Oct 20, 2020 at 10:50 AM Mich Talebzadeh wrote: > Hi, > > I have a scenario that I u

Re: Why spark-submit works with package not with jar

2020-10-20 Thread Sean Owen
From the looks of it, it's the com.google.http-client ones. But there may be more. You should not have to reason about this. That's why you let Maven / Ivy resolution figure it out. It is not true that everything in .ivy2 is on the classpath. On Tue, Oct 20, 2020 at 3:48 PM Mich Talebzadeh wrote

Re: Why spark-submit works with package not with jar

2020-10-20 Thread Sean Owen
Rather, let --packages (via Ivy) worry about them, because they tell Ivy what they need. There's no 100% guarantee that conflicting dependencies are resolved in a way that works in every single case, which you run into sometimes when using incompatible libraries, but yes this is the point of --pack

Re: Why spark-submit works with package not with jar

2020-10-21 Thread Sean Owen
Yes, it's reasonable to build an uber-jar in development, using Maven/Ivy to resolve dependencies (and of course excluding 'provided' dependencies like Spark), and push that to production. That gives you a static artifact to run that does not depend on external repo access in production. On Wed, O

Re: Scala vs Python for ETL with Spark

2020-10-22 Thread Sean Owen
I don't find this trolling; I agree with the observation that 'the skills you have' are a valid and important determiner of what tools you pick. I disagree that you just have to pick the optimal tool for everything. Sounds good until that comes in contact with the real world. For Spark, Python vs S

Re: Ask about Pyspark ML interaction

2020-11-09 Thread Sean Owen
I think you have this flipped around - you want to one-hot encode, then compute interactions. As it is you are treating the product of {0,1,2,3,4} x {0,1,2,3,4} as if it's a categorical index. That doesn't have nearly 25 possible values and probably is not what you intend. On Mon, Nov 9, 2020 at 7
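
A sketch of that ordering, assuming Spark 3.x where OneHotEncoder takes multiple columns and Interaction is available in pyspark.ml.feature (column names hypothetical):

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import Interaction, OneHotEncoder

encoder = OneHotEncoder(inputCols=["a_idx", "b_idx"],
                        outputCols=["a_vec", "b_vec"])  # one-hot encode first
interaction = Interaction(inputCols=["a_vec", "b_vec"],
                          outputCol="a_x_b")            # then dummy-coded interaction terms
pipeline = Pipeline(stages=[encoder, interaction])
```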

Re: Spark 2.4 lifetime

2020-11-11 Thread Sean Owen
I don't think there's an official EOL for Spark 2.4.x, but would expect another maintenance release in the first half of 2021 at least. I'd also guess it wouldn't be maintained by 2022. On Wed, Nov 11, 2020 at 12:24 AM Netanel Malka wrote: > Hi folks, > Do you know about how long Spark will cont

Re: Spark Dataset withColumn issue

2020-11-12 Thread Sean Owen
You can still simply select the columns by name in order, after .withColumn() On Thu, Nov 12, 2020 at 9:49 AM Vikas Garg wrote: > I am deriving the col2 using withColumn which is why I can't use it like > you told me > > On Thu, Nov 12, 2020, 20:11 German Schiavon > wrote: > >> ds.select("Col1"

Re: Purpose of type in pandas_udf

2020-11-12 Thread Sean Owen
It's the return value On Thu, Nov 12, 2020 at 5:20 PM Daniel Stojanov wrote: > Hi, > > > Note "double" in the function decorator. Is this specifying the type of > the data that goes into pandas_mean, or the type returned by that function? > > > Regards, > > > > > @pandas_udf("double", PandasUDFT
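
In other words, "double" types what pandas_mean returns; a minimal sketch in the decorator style the question used (assuming a grouped-aggregate UDF, which the truncated decorator suggests):

```python
from pyspark.sql.functions import pandas_udf, PandasUDFType

@pandas_udf("double", PandasUDFType.GROUPED_AGG)
def pandas_mean(v):
    return v.mean()  # input type comes from the bound column; "double" is the result type
```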

Re: Running the driver on a laptop but data is on the Spark server

2020-11-25 Thread Sean Owen
NFS is a simple option for this kind of usage, yes. But --files is making N copies of the data - you may not want to do that for large data, or for data that you need to mutate. On Wed, Nov 25, 2020 at 9:16 PM Artemis User wrote: > Ah, I almost forgot that there is an even easier solution for yo

Re: Remove subsets from FP Growth output

2020-12-02 Thread Sean Owen
-dev Increase the threshold? Just filter the rules as desired after they are generated? It's not clear what your criteria are. On Wed, Dec 2, 2020 at 7:30 AM Aditya Addepalli wrote: > Hi, > > Is there a good way to remove all the subsets of patterns from the output > given by FP Growth? > > For

Re: Regexp_extract not giving correct output

2020-12-02 Thread Sean Owen
As in Java/Scala, in Python you'll need to escape the backslashes with \\. "\[" means just "[" in a string. I think you could also prefix the string literal with 'r' to disable Python's handling of escapes. On Wed, Dec 2, 2020 at 9:34 AM Sachit Murarka wrote: > Hi All, > > I am using Pyspark to
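
Both fixes side by side, with the pattern shortened from the question's for illustration:

```python
from pyspark.sql import functions as F

df.select(F.regexp_extract("value", "\\[UniqueID:\\s([a-zA-Z0-9]*)\\]", 1))  # escaped backslashes
df.select(F.regexp_extract("value", r"\[UniqueID:\s([a-zA-Z0-9]*)\]", 1))    # raw string literal
```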

Re: Regexp_extract not giving correct output

2020-12-02 Thread Sean Owen
nyid").show() > > and as I mentioned when I am using 2 backslashes it is giving an exception > as follows: > : java.util.regex.PatternSyntaxException: Unknown inline modifier near > index 21 > > (^\[OrderID:\s)?(?(1).*\]\s\[UniqueID:\s([a-z0-9A-Z]*)\].*|\[.*\]\s\[([a-z0-9A-Z]

Re: Spark ML / ALS question

2020-12-02 Thread Sean Owen
There is only a fit() method in spark.ml's ALS http://spark.apache.org/docs/latest/api/scala/org/apache/spark/ml/recommendation/ALS.html The older spark.mllib interface has a train() method. You'd generally use the spark.ml version. On Wed, Dec 2, 2020 at 2:13 PM Steve Pruitt wrote: > I am havi

Re: Caching

2020-12-07 Thread Sean Owen
No, it's not true that one action means every DF is evaluated once. This is a good counterexample. On Mon, Dec 7, 2020 at 11:47 AM Amit Sharma wrote: > Thanks for the information. I am using spark 2.3.3 There are few more > questions > > 1. Yes I am using DF1 two times but at the end action is
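
A sketch of the cache-and-reuse pattern under discussion (the data is made up):

```python
from pyspark.sql import functions as F

df1 = spark.range(10**6).withColumn("x", F.col("id") % 7)
df1.cache()
df1.count()                       # first action materializes the cache
a = df1.filter("x > 3").count()   # reuses cached df1
b = df1.filter("x <= 3").count()  # reuses it again rather than recomputing the lineage
```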

Re: Using Lambda function to generate random data in PySpark throws not defined error

2020-12-11 Thread Sean Owen
Looks like a simple Python error - you haven't shown the code that produces it. Indeed, I suspect you'll find there is no such symbol. On Fri, Dec 11, 2020 at 9:09 AM Mich Talebzadeh wrote: > Hi, > > This used to work but not anymore. > > I have UsedFunctions.py file that has these functions > >

Re: Using Lambda function to generate random data in PySpark throws not defined error

2020-12-13 Thread Sean Owen
clustered(x, numRows)),[1,2,3,4])) >>>>> >>>>> If it does, i'd look in what's inside your Range and what you get out >>>>> of it. I suspect something wrong in there >>>>> >>>>> If there was something with the cl

Re: Convert Seq[Any] to Seq[String]

2020-12-18 Thread Sean Owen
It's not really a Spark question. .toDF() takes column names. atrb.head.toSeq.map(_.toString)? But it's not clear what you intend the column names to be. On Fri, Dec 18, 2020 at 8:37 AM Vikas Garg wrote: > Hi, > > Can someone please help me how to convert Seq[Any] to Seq[String] > > For line > val df

Re: No matter how many instances and cores configured for spark on k8s, only one executor is reading file

2020-12-21 Thread Sean Owen
Pass more partitions to the second argument of parallelize()? On Mon, Dec 21, 2020 at 7:39 AM 沈俊 wrote: > Hi > > I am now trying to use spark to do tcpdump pcap file analysis. The first > step is to read the file and parse the content to dataframe according to > analysis requirements. > > I've
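
A sketch (the parser and the partition count are placeholders):

```python
records = parse_pcap("/data/capture.pcap")          # hypothetical parser returning a Python list
rdd = spark.sparkContext.parallelize(records, 200)  # 200 partitions instead of the default
```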

Re: Using UDF based on Numpy functions in Spark SQL

2020-12-23 Thread Sean Owen
Why do you want to use this function instead of the built-in stddev function? On Wed, Dec 23, 2020 at 2:52 PM Mich Talebzadeh wrote: > Hi, > > > This is a shot in the dark so to speak. > > > I would like to use the standard deviation std offered by numpy in > PySpark. I am using SQL for now > >

Re: Using UDF based on Numpy functions in Spark SQL

2020-12-24 Thread Sean Owen
Just wanted to see what numpy would come back with > > Thanks > > > *Disclaimer:* Use it at your own risk. Any and all responsibility for any > loss, damage or destruction of data or any other property which may arise > from relying on this email's technical content is

Re: Using UDF based on Numpy functions in Spark SQL

2020-12-24 Thread Sean Owen
Why not just use STDDEV_SAMP? it's probably more accurate than the differences-of-squares calculation. You can write an aggregate UDF that calls numpy and register it for SQL, but, it is already a built-in. On Thu, Dec 24, 2020 at 8:12 AM Mich Talebzadeh wrote: > Thanks for the feedback. > > I h
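
The built-in, for comparison (table and column names hypothetical):

```python
from pyspark.sql import functions as F

spark.sql("SELECT STDDEV_SAMP(amount) AS sd FROM payments").show()
df.select(F.stddev_samp("amount")).show()  # same aggregate via the DataFrame API
```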

Re: Correctness bug on Shuffle+Repartition scenario

2020-12-29 Thread Sean Owen
Total guess here, but your key is a case class. It does define hashCode and equals for you, but, you have an array as one of the members. Array equality is by reference, so, two arrays of the same elements are not equal. You may have to define hashCode and equals manually to make them correct. On

Re: Correctness bug on Shuffle+Repartition scenario

2020-12-29 Thread Sean Owen
"reduce by key") and some "pkey" missing. > Since it only happens when executors being preempted, I believe this is a > bug (nondeterministic shuffle) that SPARK-23207 trying to solve. > > Thanks, > > Shiao-An Yuan > > On Tue, Dec 29, 2020 at 10:53 PM Sean Owe

Re: How Spark Framework works a Compiler

2021-01-03 Thread Sean Owen
No it's much simpler than that. Spark is just a bunch of APIs that user applications call into to cause it to form a DAG and execute it. There's no need for reflection or transpiling or anything. The user app is just calling the framework directly, not the other way around. On Sun, Jan 3, 2021 at 4

Re: A question on extrapolation of a nonlinear curve fit beyond x value

2021-01-05 Thread Sean Owen
If your data set is 11 points, surely this is not a distributed problem? or are you asking how to build tens of thousands of those projections in parallel? On Tue, Jan 5, 2021 at 6:04 AM Mich Talebzadeh wrote: > Hi, > > I am not sure Spark forum is the correct avenue for this question. > > I am

Re: A question on extrapolation of a nonlinear curve fit beyond x value

2021-01-05 Thread Sean Owen
37428353 +/- 0.45979189 (5.49%) (init = 3.5) > > fwhm: 16.7485671 +/- 0.91958379 (5.49%) == '2.000*sigma' > > height: 1182407.88 +/- 15681.8211 (1.33%) == > '0.3183099*amplitude/max(2.220446049250313e-16, sigma)' > > [[Correlations]] (unr

Re: A question on extrapolation of a nonlinear curve fit beyond x value

2021-01-05 Thread Sean Owen
sk. Any and all responsibility for any > loss, damage or destruction of data or any other property which may arise > from relying on this email's technical content is explicitly disclaimed. > The author will in no case be liable for any monetary damages arising from > such loss, damag

Re: Extending GraphFrames without running into serialization issues

2021-01-05 Thread Sean Owen
It's because this calls the no-arg superclass constructor that sets _vertices and _edges in the actual GraphFrame class to null. That yields the error. Normally you'd just show you want to call the two-arg superclass constructor with "extends GraphFrame(_vertices, _edges)" but that constructor is p

Re: Does Spark dynamic allocation work with more than one workers?

2021-01-07 Thread Sean Owen
Yes it does. It controls how many executors are allocated on workers, and isn't related to the number of workers. Something else is wrong with your setup. You would not typically, by the way, run multiple workers per machine at that scale. On Thu, Jan 7, 2021 at 7:15 AM Varun kumar wrote: > Hi,

Re: PyCharm, Running spark-submit calling jars and a package at run time

2021-01-08 Thread Sean Owen
I don't see anywhere that you provide 'sparkstuff'? how would the Spark app have this code otherwise? On Fri, Jan 8, 2021 at 10:20 AM Mich Talebzadeh wrote: > Thanks Riccardo. > > I am well aware of the submission form > > However, my question relates to doing submission within PyCharm itself. >

Re: PyCharm, Running spark-submit calling jars and a package at run time

2021-01-08 Thread Sean Owen
m relying on this email's technical content is explicitly disclaimed. > The author will in no case be liable for any monetary damages arising from > such loss, damage or destruction. > > > > > On Fri, 8 Jan 2021 at 16:38, Riccardo Ferrari wrote: > >> I think spark

Re: Customizing K-Means for Anomaly Detection

2021-01-12 Thread Sean Owen
You could fit the k-means pipeline, get the cluster centers, create a Transformer using that info, then create a new PipelineModel including all the original elements and the new Transformer. Does that work? It's not out of the question to expose a new parameter in KMeansModel that lets you also ad

Re: Spark 3.0.1 giving warning while running with Java 11

2021-01-14 Thread Sean Owen
You can ignore that. Spark 3.x works with Java 11 but it will generate some warnings that are safe to disregard. On Thu, Jan 14, 2021 at 11:26 PM Sachit Murarka wrote: > Hi All, > > Getting warning while running spark3.0.1 with Java11 . > > > WARNING: An illegal reflective access operation has o

Re: Correctness bug on Shuffle+Repartition scenario

2021-01-17 Thread Sean Owen
Hm, FWIW I can't reproduce that on Spark 3.0.1. What version are you using? On Sun, Jan 17, 2021 at 6:22 AM Shiao-An Yuan wrote: > Hi folks, > > I finally found the root cause of this issue. > It can be easily reproduced by the following code. > We ran it on a standalone mode 4 cores * 4 instanc

Re: subscribe user@spark.apache.org

2021-01-19 Thread Sean Owen
You have to sign up by sending an email - see http://spark.apache.org/community.html for what to send where. On Tue, Jan 19, 2021 at 12:25 PM Peter Podlovics < peter.d.podlov...@gmail.com> wrote: > Hello, > > I would like to subscribe to the above mailing list. I already tried > subscribing throu

Re: RDD filter in for loop gave strange results

2021-01-20 Thread Sean Owen
That looks very odd indeed. Things like this work as expected:

    rdd = spark.sparkContext.parallelize([0, 1, 2])

    def my_filter(data, i):
        return data.filter(lambda x: x != i)

    for i in range(3):
        rdd = my_filter(rdd, i)

    rdd.collect()

... as does unrolling the loop. But your example behaves as if

Re: Spark RDD + HBase: adoption trend

2021-01-20 Thread Sean Owen
RDDs are still relevant in a few ways - there is no Dataset in Python for example, so RDD is still the 'typed' API. They still underpin DataFrames. And of course it's still there because there's probably still a lot of code out there that uses it. Occasionally it's still useful to drop into that AP

Re: RDD filter in for loop gave strange results

2021-01-20 Thread Sean Owen
No, because the final rdd is really the result of chaining 3 filter operations. They should all execute. It _should_ work like "rdd.filter(...).filter(...).filter(...)" On Wed, Jan 20, 2021 at 9:46 AM Zhu Jingnan wrote: > I thought that was right result. > > As rdd runs on a lazy basis. so every

Re: RDD filter in for loop gave strange results

2021-01-20 Thread Sean Owen
Heh that could make sense, but that definitely was not my mental model of how python binds variables! Definitely is not how Scala works. On Wed, Jan 20, 2021 at 10:00 AM Marco Wong wrote: > Hmm, I think I got what Jingnan means. The lambda function is x != i and i > is not evaluated when the lam
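
For reference, the classic workaround when the lambda is written directly in the loop is to freeze the loop variable with a default argument; a sketch:

```python
rdd = spark.sparkContext.parallelize([0, 1, 2])
for i in range(3):
    rdd = rdd.filter(lambda x, i=i: x != i)  # i=i binds the current value of i, not the final one
rdd.collect()  # [] -- without i=i, all three lambdas would see i == 2 at collect() time
```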

Re: Only one Active task in Spark Structured Streaming application

2021-01-21 Thread Sean Owen
Is your app accumulating a lot of streaming state? that's one reason something could slow down after a long time. Some memory leak in your app putting GC/memory pressure on the JVM, etc too. On Thu, Jan 21, 2021 at 5:13 AM Eric Beabes wrote: > Hello, > > My Spark Structured Streaming application

Re: Pyspark How to groupBy -> fit

2021-01-21 Thread Sean Owen
If you mean you want to train N models in parallel, you wouldn't be able to do that with a groupBy first. You apply logic to the result of groupBy with Spark, but can't use Spark within Spark. You can run N Spark jobs in parallel on the driver but you'd have to have each read the subset of data tha

Re: Pyspark How to groupBy -> fit

2021-01-21 Thread Sean Owen
ach model. I was hoping to find a more elegant approach. > > > > On Thu, Jan 21, 2021 at 5:28 PM Sean Owen wrote: > >> If you mean you want to train N models in parallel, you wouldn't be able >> to do that with a groupBy first. You apply logic to the result of groupB

Re: Using same rdd from two threads

2021-01-22 Thread Sean Owen
RDDs are immutable, and Spark itself is thread-safe. This should be fine. Something else is going on in your code. On Fri, Jan 22, 2021 at 7:59 AM jelmer wrote: > HI, > > I have a piece of code in which an rdd is created from a main method. > It then does work on this rdd from 2 different thread

Re: Apache Spark

2021-01-26 Thread Sean Owen
To clarify: Apache projects and the ASF do not provide paid support. However there are many vendors who provide distributions of Apache Spark who will provide technical support - not nearly just Databricks but Cloudera, etc. There are also plenty of consultancies and individuals who can provide pro

Re: Java/Spark

2021-02-01 Thread Sean Owen
The Spark distro does not include Java. That has to be present in the environment where the Spark cluster is run. It works with Java 8, and 11 in 3.x (Oracle and OpenJDK AFAIK). It seems to 99% work on 14+ even. On Mon, Feb 1, 2021 at 9:11 AM wrote: > Hello, > > > > I am looking for information

Re: Java/Spark

2021-02-01 Thread Sean Owen
use is strictly prohibited and subject to prosecution to the > fullest extent of the law! If you are not the intended recipient, please > delete this electronic message and DO NOT ACT UPON, FORWARD, COPY OR > OTHERWISE DISSEMINATE IT OR ITS CONTENTS." > > > > *From:* Sean

Re: Exception on Avro Schema Object Serialization

2021-02-02 Thread Sean Owen
Your function is somehow capturing the actual Avro schema object, which won't serialize. Try rewriting it to ensure that it isn't used in the function. On Tue, Feb 2, 2021 at 2:32 PM Artemis User wrote: > We tried to standardize the SQL data source management using the Avro > schema, but encount

Re: Poor performance caused by coalesce to 1

2021-02-03 Thread Sean Owen
Probably could also be because that coalesce can cause some upstream transformations to also have parallelism of 1. I think (?) an OK solution is to cache the result, then coalesce and write. Or combine the files after the fact. or do what Silvio said. On Wed, Feb 3, 2021 at 12:55 PM James Yu wro
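
A sketch of the cache-then-coalesce idea (source, transformation, and sink are hypothetical):

```python
df = spark.read.parquet("/data/in").groupBy("key").count()  # stand-in for the upstream work
df.cache()
df.count()                                 # materialize with full parallelism
df.coalesce(1).write.parquet("/data/out")  # only the final write runs as a single task
```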

Re: vm.swappiness value for Spark on Kubernetes

2021-02-16 Thread Sean Owen
You probably don't want swapping in any environment. Some tasks will grind to a halt under mem pressure rather than just fail quickly. You would want to simply provision more memory. On Tue, Feb 16, 2021, 7:57 AM Jahar Tyagi wrote: > Hi, > > We have recently migrated from Spark 2.4.4 to Spark 3.

Re: Using Custom Scala Spark ML Estimator in PySpark

2021-02-16 Thread Sean Owen
You won't be able to use it in python if it is implemented in Java - needs a python wrapper too. On Mon, Feb 15, 2021, 11:29 PM HARSH TAKKAR wrote: > Hi , > > I have created a custom Estimator in scala, which i can use successfully > by creating a pipeline model in Java and scala, But when i try

Re: spark 3.1.1 release date?

2021-02-20 Thread Sean Owen
Another RC is starting imminently, which looks pretty good. If it succeeds, probably next week. It will support Scala 2.12, but I believe a Scala 2.13 build is only coming in 3.2.0. On Sat, Feb 20, 2021 at 1:54 PM Bulldog20630405 wrote: > > what is the expected ballpark release date of spark 3.1

Re: A serious bug in the fitting of a binary logistic regression.

2021-02-22 Thread Sean Owen
I'll take a look. At a glance - is it converging? might turn down the tolerance to check. Also what does scikit learn say on the same data? we can continue on the JIRA. On Mon, Feb 22, 2021 at 5:42 PM Yakov Kerzhner wrote: > I have written up a JIRA, and there is a gist attached that has code th

Re: Issue after change to 3.0.2

2021-02-26 Thread Sean Owen
That looks to me like you have two different versions of Spark in use somewhere here. Like the cluster and driver versions aren't quite the same. Check your classpaths? On Fri, Feb 26, 2021 at 2:53 AM Bode, Meikel, NMA-CFD < meikel.b...@bertelsmann.de> wrote: > Hi All, > > > > After changing to 3

Re: Spark closures behavior in local mode in IDEs

2021-02-26 Thread Sean Owen
Yeah this is a good question. It is certainly to do with executing within the same JVM, but even I'd have to dig into the code to explain why the spark-sql version operates differently, as that also appears to be local. To be clear this 'shouldn't' work, just happens to not fail in local execution.

Re: Please update this notification on Spark download Site

2021-03-02 Thread Sean Owen
That statement is still accurate - it is saying the release will be 3.1.1, not 3.1.0. In any event, 3.1.1 is rolling out as we speak - already in Maven and binaries are up and the website changes are being merged. On Tue, Mar 2, 2021 at 9:10 AM Mich Talebzadeh wrote: > > Can someone please updat

Re: [Spark SQL, intermediate+] possible bug or weird behavior of insertInto

2021-03-03 Thread Sean Owen
I don't have any good answer here, but, I seem to recall that this is because of SQL semantics, which follows column ordering not naming when performing operations like this. It may well be as intended. On Tue, Mar 2, 2021 at 6:10 AM Oldrich Vlasic < oldrich.vla...@datasentics.com> wrote: > Hi, >

Re: Possible upgrade path from Spark 3.1.1-RC2 to Spark 3.1.1 GA

2021-03-04 Thread Sean Owen
I think you're still asking about GCP and Dataproc, and that's really nothing to do with Spark itself. Whatever issues you are having concern Dataproc and how it's run and possibly customizations in Dataproc. 3.1.1-RC2 is not a release, but, also nothing meaningfully changed between it and the fina

Re: com.esotericsoftware.kryo.KryoException: java.io.IOException: No space left on device\n\t

2021-03-08 Thread Sean Owen
It's there in the error: "No space left on device". You ran out of disk space (local disk) on one of your machines. On Mon, Mar 8, 2021 at 2:02 AM Sachit Murarka wrote: > Hi All, > > I am getting the following error in my spark job. > > Can someone please have a look ? > > org.apache.spark.SparkExc

Re: Creating spark context outside of the driver throws error

2021-03-08 Thread Sean Owen
Yep, you can never use Spark inside Spark. You could run N jobs in parallel from the driver using Spark, however. On Mon, Mar 8, 2021 at 3:14 PM Mich Talebzadeh wrote: > > In structured streaming with pySpark, I need to do some work on the row > *foreach(process_row)* > > below > > > *def proces
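
That driver-side pattern can be as simple as a thread pool, since SparkSession is thread-safe on the driver; a sketch with hypothetical table names:

```python
from concurrent.futures import ThreadPoolExecutor

def run_job(name):
    return spark.table(name).count()  # ordinary Spark code, launched from a driver thread

with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(run_job, ["t1", "t2", "t3", "t4"]))
```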

Re: Spark Streaming - Routing rdd to Executor based on Key

2021-03-09 Thread Sean Owen
You can also group by the key in the transformation on each batch. But yes that's faster/easier if it's already partitioned that way. On Tue, Mar 9, 2021 at 7:30 AM Ali Gouta wrote: > Do not know Kinesis, but it looks like it works like Kafka. Your producer > should implement a partitioner that

Re: Sounds like Structured streaming with foreach, can only run on one executor

2021-03-09 Thread Sean Owen
That should not be the case. See https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#using-foreach-and-foreachbatch Maybe you are calling .foreach on some Scala object inadvertently. On Tue, Mar 9, 2021 at 4:41 PM Mich Talebzadeh wrote: > Hi, > > When I use *foreachB
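
From the linked guide, foreachBatch hands each micro-batch to a function as an ordinary DataFrame, so the work runs across the cluster as usual; a minimal sketch (stream_df and the sink path are hypothetical):

```python
def write_batch(batch_df, batch_id):
    batch_df.write.mode("append").parquet("/data/out")

query = (stream_df.writeStream
         .foreachBatch(write_batch)
         .start())
```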

Re: Submitting insert query from beeline failing on executor server with java 11

2021-03-16 Thread Sean Owen
That looks like you didn't compile with Java 11 actually. How did you try to do so? On Tue, Mar 16, 2021, 7:50 AM kaki mahesh raja wrote: > HI All, > > We have compiled spark with java 11 ("11.0.9.1") and when testing the > thrift > server we are seeing that insert query from operator using beel
