This may be related to: https://issues.apache.org/jira/browse/SPARK-13773
Regards,
James
On 11 May 2016 at 15:49, Ted Yu wrote:
> In master branch, behavior is the same.
>
> Suggest opening a JIRA if you haven't done so.
>
> On Wed, May 11, 2016 at 6:55 AM, Tony Jin wrote:
>
>> Hi guys,
>>
>>
On 3 May 2016 at 17:22, Gourav Sengupta wrote:
> Hi,
>
> The best thing to do is start the EMR clusters with proper permissions in
> the roles that way you do not need to worry about the keys at all.
>
> Another thing, why are we using s3a:// instead of s3:// ?
>
Probably because of what's said a
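Following up on the s3a:// question above: a minimal sketch of wiring up s3a on a Spark 1.6-era cluster. As Gourav notes, with proper IAM roles on EMR the key settings are unnecessary; they are shown only for clusters without instance profiles. The bucket, path, and app name are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("s3a-example"))
val hadoopConf = sc.hadoopConfiguration

// Tell Hadoop which filesystem implementation backs the s3a:// scheme.
hadoopConf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")

// Only needed when NOT relying on IAM roles / instance profiles:
hadoopConf.set("fs.s3a.access.key", sys.env.getOrElse("AWS_ACCESS_KEY_ID", ""))
hadoopConf.set("fs.s3a.secret.key", sys.env.getOrElse("AWS_SECRET_ACCESS_KEY", ""))

val lines = sc.textFile("s3a://some-bucket/some/path") // hypothetical bucket/path
```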
sifier = null,
> categoricalFeatures: Map[Int, Int], numClasses: Int, numFeatures: Int
> = -1): RandomForestClassificationModel = {
> RandomForestClassificationModel.fromOld(oldModel, parent,
> categoricalFeatures, numClasses, numFeatures)
> }
>
>
> def toOld(newModel: RandomForestClassificationModel):
> OldRandomForestModel = {
>
> newModel.toOld
>
> }
>
> }
>
>
Regards,
James
On 11 April 2016 at 10:36, James Hammerton wrote:
> There are met
There are methods for converting the dataframe based random forest models
to the old RDD based models and vice versa. Perhaps using these will help
given that the old models can be saved and loaded?
In order to use them, however, you will need to write code in the
org.apache.spark.ml package.
I've
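A hedged sketch of the workaround described above, using the `fromOld`/`toOld` converters quoted earlier in this thread. Because those converters are package-private, the file must live under `org.apache.spark.ml`; the object name is hypothetical, and the caller is assumed to know the `categoricalFeatures` and `numClasses` used at training time.

```scala
package org.apache.spark.ml.classification

import org.apache.spark.SparkContext
import org.apache.spark.mllib.tree.model.{RandomForestModel => OldRandomForestModel}

// Hypothetical helper: persist an ml RandomForestClassificationModel by
// round-tripping through the old RDD-based model, which supports save/load.
object RandomForestPersistence {

  // Save by converting to the old mllib model (toOld as quoted above).
  def save(sc: SparkContext,
           model: RandomForestClassificationModel,
           path: String): Unit =
    model.toOld.save(sc, path)

  // Load the old model and convert back to the new API.
  def load(sc: SparkContext, path: String,
           categoricalFeatures: Map[Int, Int],
           numClasses: Int): RandomForestClassificationModel = {
    val old = OldRandomForestModel.load(sc, path)
    RandomForestClassificationModel.fromOld(
      old, parent = null, categoricalFeatures, numClasses)
  }
}
```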
Hi,
On a particular .csv data set - which I can use in WEKA's logistic
regression implementation without any trouble - I'm getting errors like the
following:
16/04/01 18:04:18 ERROR LBFGS: Failure! Resetting history:
> breeze.optimize.FirstOrderException: Line search failed
These errors cause the
On 22 March 2016 at 10:57, Mich Talebzadeh
wrote:
> Thanks Silvio.
>
> The problem I have is that somehow string comparison does not work.
>
> Case in point
>
> val df =
> sqlContext.read.format("com.databricks.spark.csv").option("inferSchema",
> "true").option("header", "true").load("/data/stg/t
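The quoted code is truncated, but one common cause of string comparisons "not working" on CSV-derived columns is stray padding or whitespace in the inferred string values. A hedged sketch of comparing after trimming (the column name "Invoice Number" is hypothetical; `df` is the DataFrame from the quoted load):

```scala
import org.apache.spark.sql.functions.{col, trim}

// Trim whitespace before comparing, so padded values still match.
val matched = df.filter(trim(col("Invoice Number")) === "360")
```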
On 21 March 2016 at 17:57, Mich Talebzadeh
wrote:
>
> Hi,
>
> For test purposes I am reading a simple csv file as follows:
>
> val df =
> sqlContext.read.format("com.databricks.spark.csv").option("inferSchema",
> "true").option("header", "true").load("/data/stg/table2")
> df: org.apache.spark.sql.D
Hi,
The machine learning models in org.apache.spark.mllib have a .predict()
method that can be applied to a Vector to return a prediction.
However, this method does not appear on the new models in org.apache.spark.ml,
and you have to wrap a Vector in a DataFrame to obtain a prediction.
This tie
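A minimal sketch of what that wrapping looks like with the Spark 1.6-era API: the single Vector goes into a one-row DataFrame with a "features" column, and the prediction is read back from `transform`. The model here is assumed to be already trained; the `???` placeholders stand in for your session and model.

```scala
import org.apache.spark.ml.classification.RandomForestClassificationModel
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.sql.SQLContext

val sqlContext: SQLContext = ???                              // from your SparkContext
val model: RandomForestClassificationModel = ???              // already-trained model
import sqlContext.implicits._

// Wrap one Vector in a one-row DataFrame to score it.
val single = Seq(Tuple1(Vectors.dense(0.1, 0.2, 0.3))).toDF("features")

val prediction = model.transform(single)
  .select("prediction")
  .head()
  .getDouble(0)
```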
In the meantime there is also deeplearning4j which integrates with Spark
(for both Java and Scala): http://deeplearning4j.org/
Regards,
James
On 17 March 2016 at 02:32, Ulanov, Alexander
wrote:
> Hi Charles,
>
>
>
> There is an implementation of multilayer perceptron in Spark (since 1.5):
>
>
Hi,
If you train a
org.apache.spark.ml.classification.RandomForestClassificationModel, you
can't save it - attempts to do so yield the following error:
16/03/18 14:12:44 INFO SparkContext: Successfully stopped SparkContext
> Exception in thread "main" java.lang.UnsupportedOperationException:
> Pi
Hi,
I need to process some events in a specific order based on a timestamp, for
each user in my data.
I had implemented this by using the DataFrame sort method to sort by user
id and then by timestamp, followed by a groupBy().mapValues() to process
the events for each user.
Howe
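One hedged alternative to the approach above: since groupBy does not guarantee that a prior DataFrame sort survives the shuffle, sort each user's events inside the group instead. Column names (`userId`, `timestamp`, `payload`) are hypothetical; `events` is assumed to be the DataFrame of events.

```scala
// Key by user, then order each user's events by timestamp after grouping,
// so correctness does not depend on a global sort surviving the shuffle.
val perUser = events.rdd
  .map(row => (row.getAs[String]("userId"),
               (row.getAs[Long]("timestamp"), row.getAs[String]("payload"))))
  .groupByKey()
  .mapValues(_.toSeq.sortBy(_._1)) // sort this user's events by timestamp
```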
Hi Ted,
Finally got round to creating this:
https://issues.apache.org/jira/browse/SPARK-13773
I hope you don't mind me selecting you as the shepherd for this ticket.
Regards,
James
On 7 March 2016 at 17:50, James Hammerton wrote:
> Hi Ted,
>
> Thanks for getting back -
the Project.
>
> Cheers
>
> On Mon, Mar 7, 2016 at 2:54 AM, James Hammerton wrote:
>
>> Hi,
>>
>> So I managed to isolate the bug and I'm ready to try raising a JIRA
>> issue. I joined the Apache Jira project so I can create tickets.
>>
>> Howe
pache Infrastructure. There doesn't seem to be an option for me to
raise an issue for Spark?!
Regards,
James
On 4 March 2016 at 14:03, James Hammerton wrote:
> Sure thing, I'll see if I can isolate this.
>
> Regards.
>
> James
>
> On 4 March 2016 at 12:24, Ted Yu
Sure thing, I'll see if I can isolate this.
Regards.
James
On 4 March 2016 at 12:24, Ted Yu wrote:
> If you can reproduce the following with a unit test, I suggest you open a
> JIRA.
>
> Thanks
>
> On Mar 4, 2016, at 4:01 AM, James Hammerton wrote:
>
> Hi,
>
Hi,
I've come across some strange behaviour with Spark 1.6.0.
In the code below, the filtering by "eventName" only seems to work if I
called .cache on the resulting DataFrame.
If I don't do this, the code crashes inside the UDF because it processes an
event that the filter should get rid of.
A
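The shape of the workaround described above, as a minimal sketch: cache the filtered DataFrame before applying the UDF, so the UDF only ever sees filtered rows. The "eventName" column comes from the thread; the event value, the `payload` column, and `parseUdf` are hypothetical.

```scala
import org.apache.spark.sql.functions.col

// Caching the filtered DataFrame before the UDF runs worked around the
// issue where the UDF saw rows the filter should have removed.
val filtered = df.filter(col("eventName") === "ObjectCreated").cache()
val result   = filtered.withColumn("parsed", parseUdf(col("payload")))
```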
Hi,
Based on the behaviour I've seen using parquet, the number of partitions in
the DataFrame will determine the number of files in each parquet partition.
I.e. when you use "PARTITION BY" you're actually partitioning twice, once
via the partitions spark has created internally and then again with
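Given that behaviour, the file count per parquet partition can be controlled by repartitioning the DataFrame before writing. A hedged sketch (the partition column "date", the partition count, and the output path are hypothetical):

```scala
// The number of DataFrame partitions bounds the number of files written
// inside each partitionBy directory, so set it explicitly before writing.
df.repartition(8)
  .write
  .partitionBy("date")
  .parquet("/data/out/events") // hypothetical output path
```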
Hi,
I have been having problems processing a 3.4TB data set - uncompressed tab
separated text - containing object creation/update events from our system,
one event per line.
I decided to see what happens with a count of the number of events (=
number of lines in the text files) and a count of the
-ec2 script rather than EMR?
>
> On Thu, Feb 18, 2016 at 11:39 AM, James Hammerton wrote:
>
>> I have now... So far I think the issues I've had are not related to
>> this, but I wanted to be sure in case it should be something that needs to
>> be patched. I
Yu wrote:
> Have you seen this ?
>
> HADOOP-10988
>
> Cheers
>
> On Thu, Feb 18, 2016 at 3:39 AM, James Hammerton wrote:
>
>> Hi,
>>
>> I am seeing warnings like this in the logs when I run Spark jobs:
>>
>> OpenJDK 64-Bit Server VM warn
t using EMR to start your SPARK
> cluster?
>
>
> Regards,
> Gourav
>
> On Thu, Feb 18, 2016 at 12:23 PM, Ted Yu wrote:
>
>> Have you seen this ?
>>
>> HADOOP-10988
>>
>> Cheers
>>
>> On Thu, Feb 18, 2016 at 3:39 AM, James Hammerton
Hi,
I am seeing warnings like this in the logs when I run Spark jobs:
OpenJDK 64-Bit Server VM warning: You have loaded library
/root/ephemeral-hdfs/lib/native/libhadoop.so.1.0.0 which might have
disabled stack guard. The VM will try to fix the stack guard now.
It's highly recommended that you fi
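For reference, a hedged sketch of clearing the executable-stack flag that triggers the JVM warning above. This assumes the execstack tool is installed (part of the prelink package on many distributions); the library path is taken from the warning itself.

```shell
# Path from the warning message above.
LIB=/root/ephemeral-hdfs/lib/native/libhadoop.so.1.0.0

execstack -c "$LIB"   # clear the executable-stack flag
execstack -q "$LIB"   # query: a leading '-' means the flag is now clear
```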