Re: Spark ML Random Forest output.

2015-12-05 Thread Benjamin Fradet
Hi, To get back the original labels after indexing them with StringIndexer, I usually use IndexToString, like so: val labelIndexer = new StringIndexer() .setInp
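
A minimal sketch of the pattern described above, assuming column names ("label", "prediction", etc.) and a DataFrame df that are not in the original message:

    import org.apache.spark.ml.feature.{IndexToString, StringIndexer}

    // Index the string labels and keep the fitted model so the learned
    // label ordering can be reused later.
    val labelIndexer = new StringIndexer()
      .setInputCol("label")
      .setOutputCol("indexedLabel")
      .fit(df)

    // After prediction, map the numeric predictions back to the original
    // string labels using the labels learned by the indexer.
    val labelConverter = new IndexToString()
      .setInputCol("prediction")
      .setOutputCol("predictedLabel")
      .setLabels(labelIndexer.labels)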

Re: Grid search with Random Forest

2015-12-01 Thread Benjamin Fradet
> Hi Benjamin, > > Thanks, the documentation you sent is clear. > Is there any other way to perform a Grid Search with GBT? > > > Ndjido > On Tue, 1 Dec 2015 at 08:32, Benjamin Fradet > wrote: > >> Hi Ndjido, >> >> This is because GBTClassifier do

Re: Grid search with Random Forest

2015-11-30 Thread Benjamin Fradet
Hi Ndjido, This is because GBTClassifier doesn't yet have a rawPredictionCol like the RandomForestClassifier has. Cf: http://spark.apache.org/docs/latest/ml-ensembles.html#output-columns-predictions-1 On 1 Dec 2015 3:57 a.m., "Ndjido Ardo BAR" wrote: > Hi Joseph, > > Yes Random Forest support G
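
For context, a minimal grid-search sketch with a RandomForestClassifier (which does expose rawPredictionCol), assuming column names, parameter values, and a trainingDF that are not in the original thread:

    import org.apache.spark.ml.classification.RandomForestClassifier
    import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
    import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

    val rf = new RandomForestClassifier()
      .setLabelCol("label")          // assumed column names
      .setFeaturesCol("features")

    // Parameter values are illustrative only.
    val paramGrid = new ParamGridBuilder()
      .addGrid(rf.numTrees, Array(20, 50, 100))
      .addGrid(rf.maxDepth, Array(4, 6, 8))
      .build()

    // BinaryClassificationEvaluator reads rawPredictionCol, which is why
    // the same setup did not work with GBTClassifier at the time.
    val cv = new CrossValidator()
      .setEstimator(rf)
      .setEvaluator(new BinaryClassificationEvaluator())
      .setEstimatorParamMaps(paramGrid)
      .setNumFolds(3)

    val cvModel = cv.fit(trainingDF)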

Re: Retrieve best parameters from CrossValidator

2015-11-29 Thread Benjamin Fradet
Hi Yanbo, Thanks for your answer, I'm looking forward to 1.6 then. On Sun, Nov 29, 2015 at 3:44 PM, Yanbo Liang wrote: > Hi Ben, > > We can get the best model from CrossValidatorModel.BestModel, further > more we can use the write function of CrossValidatorModel >
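
A short sketch of reading the winning model back out of a fitted CrossValidatorModel, assuming the estimator was a RandomForestClassifier and that cvModel comes from a fit like the one sketched above:

    import org.apache.spark.ml.classification.RandomForestClassificationModel

    // bestModel is typed as Model[_], so a cast to the concrete model is needed.
    val best = cvModel.bestModel.asInstanceOf[RandomForestClassificationModel]

    // Print the parameters the best model was trained with.
    println(best.extractParamMap())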

Drop multiple columns in the DataFrame API

2015-11-19 Thread Benjamin Fradet
Hi everyone, I was wondering if there is a better way to drop multiple columns from a dataframe or why there is no drop(cols: Column*) method in the dataframe API. Indeed, I tend to write code like this: val filteredDF = df.drop("colA") .drop("colB") .drop("colC") //etc which is a bit
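
A small workaround sketch for the pre-1.6 API, folding over a list of column names instead of chaining drop calls; colsToDrop and df are illustrative:

    val colsToDrop = Seq("colA", "colB", "colC")

    // drop(colName: String) removes one column at a time, so fold the list
    // through the DataFrame.
    val filteredDF = colsToDrop.foldLeft(df)((acc, col) => acc.drop(col))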

Re: Setting JVM heap start and max sizes, -Xms and -Xmx, for executors

2015-07-02 Thread Benjamin Fradet
Hi, You can set those parameters through the spark.executor.extraJavaOptions setting, which is documented in the configuration guide: spark.apache.org/docs/latest/configuration.html On 2 Jul 2015 9:06 pm, "Mulugeta Mammo" wrote: > Hi, > > I'm running Spark 1.4.0, I want to specify the start and max siz
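
A sketch of setting executor JVM options programmatically, with illustrative values; note that the executor's maximum heap is normally driven by spark.executor.memory rather than an explicit -Xmx:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.executor.memory", "4g")                // controls the executor max heap
      .set("spark.executor.extraJavaOptions", "-Xms2g")  // extra JVM options, e.g. initial heap size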

Re: Kafka Direct Stream - Custom Serialization and Deserialization

2015-06-26 Thread Benjamin Fradet
There is one for the key of your Kafka message and one for its value. On 26 Jun 2015 4:21 pm, "Ashish Soni" wrote: > my question is why there are similar two parameter String.Class and > StringDecoder.class what is the difference each of them ? > > Ashish > > On Fri, Jun 26, 2015 at 8:53 AM, Akhi
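
A minimal sketch of the Spark 1.x direct stream call showing where each of the four type parameters goes; the broker address, topic name, and batch interval are illustrative:

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val conf = new SparkConf().setAppName("kafka-direct-sketch")
    val ssc = new StreamingContext(conf, Seconds(5))

    // Type parameters: key type, value type, key decoder, value decoder.
    val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("events"))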

Re: [Spark Streaming] Null Pointer Exception when accessing broadcast variable to store a hashmap in Java

2015-06-23 Thread Benjamin Fradet
Are you using checkpointing? I had a similar issue when recreating a streaming context from a checkpoint, as broadcast variables are not checkpointed. On 23 Jun 2015 5:01 pm, "Nipun Arora" wrote: > Hi, > > I have a spark streaming application where I need to access a model saved > in a HashMap. > I
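
A sketch of the lazily instantiated singleton pattern commonly used so the broadcast variable gets rebuilt after recovery from a checkpoint; the map of model values here is a hypothetical stand-in:

    import org.apache.spark.SparkContext
    import org.apache.spark.broadcast.Broadcast

    object ModelBroadcast {
      @volatile private var instance: Broadcast[Map[String, Double]] = null

      def getInstance(sc: SparkContext): Broadcast[Map[String, Double]] = {
        if (instance == null) {
          synchronized {
            if (instance == null) {
              // Rebuild (or reload) the model here; it is not restored from checkpoint.
              instance = sc.broadcast(Map("someKey" -> 1.0))
            }
          }
        }
        instance
      }
    }

    // Inside foreachRDD, fetch it through the RDD's SparkContext, e.g.:
    // stream.foreachRDD { rdd => val model = ModelBroadcast.getInstance(rdd.sparkContext); ... }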

Re: [Spark Streaming 1.4.0] SPARK-5063, Checkpointing and queuestream

2015-06-22 Thread Benjamin Fradet
d sc.parallelize to generate the RDDs > for the queue, but perhaps on resume, it doesn't recreate the context > needed? > > > > -- > > Shaanan Cohney > PhD Student > University of Pennsylvania > > > shaan...@gmail.com > > On Mon, Jun 22, 2015 at 9:27 P

Re: [Spark Streaming 1.4.0] SPARK-5063, Checkpointing and queuestream

2015-06-22 Thread Benjamin Fradet
curr_rdd.collect(): > fnames, count = c > counts.append(count) > out_files |= fnames > except Py4JJavaError as e: > print("EXCEPTION: %s" % str(e)) > > -- > > Shaanan Cohney > PhD Student > University of Pennsylvania

Re: [Spark Streaming 1.4.0] SPARK-5063, Checkpointing and queuestream

2015-06-22 Thread Benjamin Fradet
What does "counts" refer to? Could you also paste the code of your "update_state" function? On 22 Jun 2015 12:48 pm, "Shaanan Cohney" wrote: > I'm receiving the SPARK-5063 error (RDD transformations and actions can > only be invoked by the driver, not inside of other transformations) > whenever
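
For reference, a minimal Scala sketch of what SPARK-5063 complains about and one common way around it; rdd1 and rdd2 are hypothetical RDDs:

    // Triggers SPARK-5063: an RDD action is invoked inside a transformation,
    // i.e. on the executors, where rdd2 is not available.
    val bad = rdd1.map { x => rdd2.count() }

    // One workaround: materialize the small RDD on the driver and broadcast it.
    val lookup = rdd1.sparkContext.broadcast(rdd2.collect().toSet)
    val good = rdd1.filter(x => lookup.value.contains(x))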

Re: [SparkStreaming 1.3.0] Broadcast failure after setting "spark.cleaner.ttl"

2015-06-09 Thread Benjamin Fradet
Hi, Are you restarting your Spark streaming context through getOrCreate? On 9 Jun 2015 09:30, "Haopu Wang" wrote: > When I ran a spark streaming application longer, I noticed the local > directory's size was kept increasing. > > I set "spark.cleaner.ttl" to 1800 seconds in order clean the metada
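
A minimal sketch of recreating the streaming context through getOrCreate, with an illustrative checkpoint directory and batch interval:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val checkpointDir = "/tmp/streaming-checkpoint"

    def createContext(): StreamingContext = {
      val conf = new SparkConf().setAppName("streaming-app")
      val newSsc = new StreamingContext(conf, Seconds(10))
      newSsc.checkpoint(checkpointDir)
      // define the input streams and transformations here
      newSsc
    }

    // Reuses the checkpointed context if one exists, otherwise builds a fresh one.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()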