Inserting column to DataFrame

2016-02-11 Thread Zsolt Tóth
Hi, I'd like to append a column of one DataFrame to another DF (using Spark 1.5.2): DataFrame outputDF = unlabelledDF.withColumn("predicted_label", predictedDF.col("predicted")); I get the following exception: java.lang.IllegalArgumentException: requirement failed: DataFrame must have the same sc

Re: Inserting column to DataFrame

2016-02-12 Thread Zsolt Tóth
edDF.join(predictedDF.select("id","predicted"),"id") > > On 11 February 2016 at 10:12, Zsolt Tóth wrote: > >> Hi, >> >> I'd like to append a column of a dataframe to another DF (using Spark >> 1.5.2): >> >> DataFrame outputDF = unlabelledDF.withColumn("pre
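
Written out in full, the join-based workaround suggested here might look like the sketch below (assuming both DataFrames carry a unique "id" key column, as the reply implies; variable and column names are taken from the thread):

    import org.apache.spark.sql.DataFrame;

    // join on a shared key instead of calling withColumn() with a column
    // that belongs to a different DataFrame ("id" is assumed to be a
    // unique key present in both frames)
    DataFrame outputDF = unlabelledDF.join(
        predictedDF.select("id", "predicted"), "id");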

Re: Inserting column to DataFrame

2016-02-12 Thread Zsolt Tóth
org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:304) Regards, Zsolt 2016-02-12 13:11 GMT+01:00 Ted Yu : > Can you pastebin the full error with all column types ? > > There should be a difference between some column(s). > > Cheers > > > On Feb 11, 2016, at 2:12 AM, Zsolt Tóth > wrote:

ClassNotFound for exception class in Spark 1.5.x

2015-11-19 Thread Zsolt Tóth
Hi, I'm trying to throw an exception of my own exception class (MyException extends SparkException) on one of the executors. This works fine on Spark 1.3.x and 1.4.x but throws a deserialization/ClassNotFound exception on Spark 1.5.x. This happens only when I throw it on an executor; on the driver it succ
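
A minimal sketch of the setup described (class name and usage are illustrative, not the poster's actual code):

    import org.apache.spark.SparkException;

    // custom exception class shipped in the application jar
    public class MyException extends SparkException {
        public MyException(String message) {
            super(message);
        }
    }

    // thrown from executor-side code, e.g. inside a transformation:
    // rdd.map(x -> { throw new MyException("failed on executor"); });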

Re: Re: driver ClassNotFoundException when MySQL JDBC exceptions are thrown on executor

2015-11-19 Thread Zsolt Tóth
Hi, this is exactly the same as my issue; it seems to be a bug in 1.5.x (see my thread for details). 2015-11-19 11:20 GMT+01:00 Jeff Zhang : > Seems your jdbc url is not correct. Should be jdbc:mysql:// > 192.168.41.229:3306 > > On Thu, Nov 19, 2015 at 6:03 PM, wrote: > >> hi guy, >> >>I a

Re: ClassNotFound for exception class in Spark 1.5.x

2015-11-19 Thread Zsolt Tóth
Hi Tamás, the exception class is in the application jar; I'm using the spark-submit script. 2015-11-19 11:54 GMT+01:00 Tamas Szuromi : > Hi Zsolt, > > How do you load the jar and how do you prepend it to the classpath? > > Tamas > > > > > On 19 November 2015 a

create DataFrame from RDD

2015-12-02 Thread Zsolt Tóth
Hi, I have a Spark job with many transformations (sequence of maps and mapPartitions) and only one action in the end (DataFrame.write()). The transformations return an RDD, so I need to create a DataFrame. To be able to use sqlContext.createDataFrame() I need to know the schema of the Row but for
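
When the schema is only known at runtime, it can be assembled programmatically before the createDataFrame() call. A sketch with made-up field names and types (rowRDD stands for the JavaRDD<Row> produced by the transformation chain, and the output path is hypothetical):

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.spark.sql.DataFrame;
    import org.apache.spark.sql.types.DataTypes;
    import org.apache.spark.sql.types.StructField;
    import org.apache.spark.sql.types.StructType;

    // build the schema dynamically; the fields here are hypothetical
    List<StructField> fields = new ArrayList<>();
    fields.add(DataTypes.createStructField("id", DataTypes.LongType, false));
    fields.add(DataTypes.createStructField("value", DataTypes.DoubleType, true));
    StructType schema = DataTypes.createStructType(fields);

    // rowRDD is the JavaRDD<Row> coming out of the map/mapPartitions chain
    DataFrame df = sqlContext.createDataFrame(rowRDD, schema);
    df.write().parquet("hdfs:///tmp/output");   // hypothetical output path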

Re: OutOfMemory error with Spark ML 1.5 logreg example

2015-09-07 Thread Zsolt Tóth
Hi, I ran your example on Spark-1.4.1 and 1.5.0-rc3. It succeeds on 1.4.1 but throws the OOM on 1.5.0. Do any of you know which PR introduced this issue? Zsolt 2015-09-07 16:33 GMT+02:00 Zoltán Zvara : > Hey, I'd try to debug, profile ResolvedDataSource. As far as I know, your > write will b

SparkR 2.0 glm prediction confidences

2016-10-05 Thread Zsolt Tóth
Hi, in Spark 1.6, glm's predict() method returned a DataFrame with 0/1 prediction values. In 2.0, however, the same code returns confidence-like values, e.g. 0.5320209312. Can anyone tell me what caused the change here? Is it possible to get the old, binary values with Spark 2.0? Regards, Zsol

Delegation Token renewal in yarn-cluster

2016-11-03 Thread Zsolt Tóth
Hi, I ran some tests regarding Spark's Delegation Token renewal mechanism. As I understand it, the concept here is simple: if I give my keytab file and client principal to Spark, it starts a token renewal thread and renews the NameNode delegation tokens after some time. This works fine. Then I tried to ru
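
For reference, the keytab/principal hand-off described here is configured roughly like this (the path and principal are placeholders; the same pair can also be passed as --keytab/--principal to spark-submit):

    import org.apache.spark.SparkConf;

    // placeholder keytab path and principal
    SparkConf conf = new SparkConf()
        .set("spark.yarn.keytab", "/etc/security/keytabs/user.keytab")
        .set("spark.yarn.principal", "user@EXAMPLE.COM");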

Re: Delegation Token renewal in yarn-cluster

2016-11-03 Thread Zsolt Tóth
Any ideas about this one? Am I missing something here? 2016-11-03 15:22 GMT+01:00 Zsolt Tóth : > Hi, > > I ran some tests regarding Spark's Delegation Token renewal mechanism. As > I see, the concept here is simple: if I give my keytab file and client > principal to Spar

Re: Delegation Token renewal in yarn-cluster

2016-11-03 Thread Zsolt Tóth
e definitely have run into it. So > if you're not hitting it, it's most definitely an issue with your test > configuration. > > On Thu, Nov 3, 2016 at 7:22 AM, Zsolt Tóth > wrote: > > Hi, > > > > I ran some tests regarding Spark's Delegation Token r

Re: Delegation Token renewal in yarn-cluster

2016-11-03 Thread Zsolt Tóth
extend its lifetime. The feature you're talking about is for > creating *new* delegation tokens after the old ones expire and cannot > be renewed anymore (i.e. the max-lifetime configuration). > > On Thu, Nov 3, 2016 at 2:02 PM, Zsolt Tóth > wrote: > > Yes, I did change dfs.

Re: Delegation Token renewal in yarn-cluster

2016-11-04 Thread Zsolt Tóth
based on the renew-interval instead of the max-lifetime? 2016-11-04 2:37 GMT+01:00 Marcelo Vanzin : > On Thu, Nov 3, 2016 at 3:47 PM, Zsolt Tóth > wrote: > > What is the purpose of the delegation token renewal (the one that is done > > automatically by Hadoop libraries, afte

Map and MapPartitions with partition-local variable

2016-11-16 Thread Zsolt Tóth
Hi, I need to run a map() and a mapPartitions() on my input DF. As a side effect of the map(), a partition-local variable should be updated, which is then used in the mapPartitions() afterwards. I can't use a Broadcast variable, because it's shared between partitions on the same executor. Where can I def
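
One way to get this effect (a sketch assuming Spark 2.x, where FlatMapFunction returns an Iterator) is to fold both steps into a single mapPartitions(), so the state lives in the per-partition closure. Names and per-row logic below are illustrative, and note the whole partition is buffered in memory:

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.spark.api.java.JavaRDD;

    // input: a JavaRDD<String> derived from the DF, e.g. via toJavaRDD()
    JavaRDD<String> result = input.mapPartitions(rows -> {
        long localState = 0;                  // partition-local variable
        List<String> buffered = new ArrayList<>();
        while (rows.hasNext()) {
            String row = rows.next();
            localState++;                     // side effect of the "map" step
            buffered.add(row);
        }
        // the "mapPartitions" step can now read the final localState
        List<String> out = new ArrayList<>();
        for (String row : buffered) {
            out.add(row + "#" + localState);
        }
        return out.iterator();
    });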

Map and MapPartitions with partition-local variable

2016-11-17 Thread Zsolt Tóth
Any comment on this one? On 16 Nov 2016 at 12:59 PM, "Zsolt Tóth" wrote: > Hi, > > I need to run a map() and a mapPartitions() on my input DF. As a > side-effect of the map(), a partition-local variable should be updated, > that is used in the mapPartitions()

withColumn on nested schema

2018-06-13 Thread Zsolt Tóth
Hi, I'm trying to replace values in a nested column in a JSON-based DataFrame using withColumn(). This syntax works for select, filter, etc., giving only the nested "country" column: df.select('body.payload.country') but if I do this, it will create a new column with the name "body.payload.countr
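
Before Spark had first-class support for this, the usual workaround was to rebuild the enclosing struct, re-listing every sibling field by hand. A sketch in Java, assuming (hypothetically) that body contains only a payload struct with the two fields country and city:

    import static org.apache.spark.sql.functions.col;
    import static org.apache.spark.sql.functions.lit;
    import static org.apache.spark.sql.functions.struct;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;

    // rebuild body.payload so the replacement lands inside the struct
    // instead of becoming a top-level column named "body.payload.country"
    Dataset<Row> fixed = df.withColumn("body",
        struct(
            struct(
                lit("HU").alias("country"),            // hypothetical new value
                col("body.payload.city").alias("city") // sibling re-listed by hand
            ).alias("payload")));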

[mllib] Decision Tree - prediction probabilites of label classes

2015-01-21 Thread Zsolt Tóth
Hi, I use DecisionTree for multi-class classification. I can get the probability of the predicted label for every node in the decision tree from node.predict().prob(). Is it possible to retrieve or count the probability of every possible label class in the node? To be more clear: say in Node A the

Resource allocation in yarn-cluster mode

2015-02-10 Thread Zsolt Tóth
Hi, I'm using Spark in yarn-cluster mode and submit the jobs programmatically from the client in Java. I ran into a few issues when I tried to set the resource allocation properties. 1. It looks like setting spark.executor.memory, spark.executor.cores and spark.executor.instances has no effect bec
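
For reference, the three properties in question (values below are arbitrary). In yarn-cluster mode they have to reach the YARN client before the ApplicationMaster is launched, e.g. on the SparkConf used for submission; setting them once the application is running has no effect:

    import org.apache.spark.SparkConf;

    // must be in place before the ApplicationMaster is launched
    SparkConf conf = new SparkConf()
        .set("spark.executor.memory", "4g")
        .set("spark.executor.cores", "2")
        .set("spark.executor.instances", "8");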

Re: Resource allocation in yarn-cluster mode

2015-02-10 Thread Zsolt Tóth
One more question: is there a reason why Spark throws an error when requesting too much memory, instead of capping it to the maximum value (as YARN would do by default)? Thanks! 2015-02-10 17:32 GMT+01:00 Zsolt Tóth : > Hi, > > I'm using Spark in yarn-cluster mode and s

Using 1.3.0 client jars with 1.2.1 assembly in yarn-cluster mode

2015-03-06 Thread Zsolt Tóth
Hi, I submit Spark jobs in yarn-cluster mode remotely from Java code by calling Client.submitApplication(). For some reason I want to use 1.3.0 jars on the client side (e.g. spark-yarn_2.10-1.3.0.jar) but I have spark-assembly-1.2.1* on the cluster. The problem is that the ApplicationMaster can't f
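
One knob relevant to this mixed-version setup (a sketch; the HDFS path is hypothetical): spark.yarn.jar tells the YARN client which assembly jar the ApplicationMaster should run against.

    import org.apache.spark.SparkConf;

    // point the ApplicationMaster at a specific assembly on HDFS
    // (the path below is hypothetical)
    SparkConf conf = new SparkConf()
        .set("spark.yarn.jar",
             "hdfs:///apps/spark/spark-assembly-1.2.1-hadoop2.4.0.jar");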

Using ORC input for mllib algorithms

2015-03-25 Thread Zsolt Tóth
Hi, I use sc.hadoopFile(directory, OrcInputFormat.class, NullWritable.class, OrcStruct.class) to read data in ORC format as an RDD. I did some benchmarking of ORC input vs. text input for MLlib and ran into a few issues with ORC. Setup: yarn-cluster mode, 11 executors, 4 cores, 9g executor memory
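
The read described above, spelled out as a sketch (Java 8 lambda syntax for brevity; "directory" is the input path from the thread):

    import org.apache.hadoop.hive.ql.io.orc.OrcInputFormat;
    import org.apache.hadoop.hive.ql.io.orc.OrcStruct;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;

    // ORC files as an RDD of (NullWritable, OrcStruct) pairs
    JavaPairRDD<NullWritable, OrcStruct> orcRdd = sc.hadoopFile(
        directory, OrcInputFormat.class, NullWritable.class, OrcStruct.class);

    // real field extraction would normally go through an ObjectInspector;
    // toString() here is only for illustration
    JavaRDD<String> rows = orcRdd.map(pair -> pair._2().toString());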

RDD collect hangs on large input data

2015-03-27 Thread Zsolt Tóth
Hi, I have a simple Spark application: it creates an input RDD with sc.textFile, and calls flatMapToPair, reduceByKey and map on it. The output RDD is small, a few MBs. Then I call collect() on the output. If the text file is ~50GB, it finishes in a few minutes. However, if it's larger (~100GB

Re: Using ORC input for mllib algorithms

2015-03-30 Thread Zsolt Tóth
the SQL data source API: > https://github.com/apache/spark/pull/3753. You can try pulling that PR > and help test it. -Xiangrui > > On Wed, Mar 25, 2015 at 5:03 AM, Zsolt Tóth > wrote: > > Hi, > > > > I use sc.hadoopFile(directory, OrcInputFormat.class, NullWrita

Re: RDD collect hangs on large input data

2015-03-30 Thread Zsolt Tóth
size huge, you can simply do a count() to > trigger the execution. > > Can you paste your exception stack trace so that we'll know whats > happening? > > Thanks > Best Regards > > On Fri, Mar 27, 2015 at 9:18 PM, Zsolt Tóth > wrote: > >> Hi, >> >
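
The suggestion above, in code: when the action is only needed to trigger execution, count() keeps the data on the executors, unlike collect(), which ships the whole result to the driver ("outputRdd" stands for the small result RDD from the thread):

    // trigger the job without pulling the result to the driver
    long n = outputRdd.count();
    System.out.println("output rows: " + n);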

Re: RDD collect hangs on large input data

2015-04-08 Thread Zsolt Tóth
I use EMR 3.3.1, which comes with Java 7. Do you think this may cause the issue? Did you test it with Java 8?

Re: RDD collect hangs on large input data

2015-04-17 Thread Zsolt Tóth
me, or it might point to spark internals. > > On Wed, Apr 8, 2015 at 3:45 AM, Zsolt Tóth > wrote: > >> I use EMR 3.3.1 which comes with Java 7. Do you think that this may cause >> the issue? Did you test it with Java 8? >> > >

Spark-1.2.2-bin-hadoop2.4.tgz missing

2015-04-20 Thread Zsolt Tóth
Hi all, it looks like the 1.2.2 pre-built version for Hadoop 2.4 is not available on the mirror sites. Am I missing something? Regards, Zsolt