Building scaladoc using "build/sbt unidoc" failure

2015-05-26 Thread Justin Yip
Hello, I am trying to build Scaladoc from the 1.4 branch, but it failed with [error] (sql/compile:compile) java.lang.AssertionError: assertion failed: List(object package$DebugNode, object package$DebugNode). I followed the instructions on GitHub

Inconsistent behavior with Dataframe Timestamp between 1.3.1 and 1.4.0

2015-05-27 Thread Justin Yip
Hello, I am trying out 1.4.0 and noticed there are some differences in behavior with Timestamp between 1.3.1 and 1.4.0. In 1.3.1, I can compare a Timestamp with a string. scala> val df = sqlContext.createDataFrame(Seq((1, Timestamp.valueOf("2015-01-01 00:00:00")), (2, Timestamp.valueOf("2014-01-01 0
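
A minimal sketch of the kind of comparison in question (the column names and the filter predicate are assumed for illustration; they are not from the original message):

    import java.sql.Timestamp
    val df = sqlContext.createDataFrame(Seq(
      (1, Timestamp.valueOf("2015-01-01 00:00:00")),
      (2, Timestamp.valueOf("2014-01-01 00:00:00"))))
    // In 1.3.1 comparing a timestamp column against a string literal worked;
    // the question is whether 1.4.0 still coerces the string the same way.
    df.filter(df("_2") > "2014-06-01 00:00:00").show()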

Why the default Params.copy doesn't work for Model.copy?

2015-06-04 Thread Justin Yip
Hello, I have a question about the Spark 1.4 ml library. In the copy function, it is stated that the default implementation of Params doesn't work for models. ( https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/Model.scala#L49 ) As a result, some feature

NullPointerException with functions.rand()

2015-06-10 Thread Justin Yip
Hello, I am using 1.4.0 and found the following weird behavior. This case works fine: scala> sc.parallelize(Seq((1,2), (3, 100))).toDF.withColumn("index", rand(30)).show() +--+---+---+ |_1| _2| index| +--+---+---+ | 1| 2| 0.6662967911724369| | 3|100|

Re: Inconsistent behavior with Dataframe Timestamp between 1.3.1 and 1.4.0

2015-06-17 Thread Justin Yip
Done. https://issues.apache.org/jira/browse/SPARK-8420 Justin On Wed, Jun 17, 2015 at 4:06 PM, Xiangrui Meng wrote: > That sounds like a bug. Could you create a JIRA and ping Yin Huai > (cc'ed). -Xiangrui > > On Wed, May 27, 2015 at 12:57 AM, Justin Yip > wrote: >

NaiveBayes for MLPipeline is absent

2015-06-18 Thread Justin Yip
Hello, currently there is no NaiveBayes implementation for the ML pipeline. I couldn't find a JIRA ticket related to it either (or maybe I missed it). Is there a plan to implement it? If no one has the bandwidth, I can work on it. Thanks. Justin

Temp directory used by spark-submit

2015-03-12 Thread Justin Yip
Hello, I noticed that when I run spark-submit, a temporary directory containing all the jars and resource files is created under /tmp (for example, /tmp/spark-fd1b77fc-50f4-4b1c-a122-5cf36969407c). Sometimes this directory gets cleaned up after the job, but sometimes it doesn't, which fills up my roo

Catching InvalidClassException in sc.objectFile

2015-03-19 Thread Justin Yip
Hello, I have persisted an RDD[T] to disk through "saveAsObjectFile". Then I changed the implementation of T. When I read the file with sc.objectFile using the new binary, I got the exception java.io.InvalidClassException, which is expected. I try to catch this error via SparkException in the d
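
A sketch of one way to detect this in the driver (whether the InvalidClassException arrives as the cause of the SparkException or only in its message depends on the Spark version, so this is an assumption to verify):

    import java.io.InvalidClassException
    import org.apache.spark.SparkException

    try {
      // Path and element type are illustrative only.
      sc.objectFile[(Int, String)]("hdfs:///data/saved-objects").count()
    } catch {
      case e: SparkException if e.getCause.isInstanceOf[InvalidClassException] =>
        println("stored objects were serialized with an incompatible class version")
    }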

StackOverflow Problem with 1.3 mllib ALS

2015-04-01 Thread Justin Yip
Hello, I have been using MLlib's ALS in 1.2 and it works quite well. I have just upgraded to 1.3 and encountered a StackOverflowError. After some digging, I realized that when the number of iterations is greater than ~35, I get the overflow problem. However, I can get at least 80 iterations with ALS in 1.2. Is there

Re: StackOverflow Problem with 1.3 mllib ALS

2015-04-02 Thread Justin Yip
ntInterval to solve this > problem, which is available in the current master and 1.3.1 (to be > released soon). Btw, do you find 80 iterations are needed for > convergence? -Xiangrui > > On Wed, Apr 1, 2015 at 11:54 PM, Justin Yip > wrote: > > Hello, > > > > I hav
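
A sketch of the suggested fix, assuming the setter is setCheckpointInterval on mllib's ALS (per the reply, available from 1.3.1): checkpointing every few iterations truncates the lineage that otherwise grows until the stack overflows.

    import org.apache.spark.mllib.recommendation.{ALS, Rating}

    sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")  // directory is illustrative
    val ratings = sc.parallelize(Seq(Rating(1, 1, 5.0), Rating(1, 2, 1.0), Rating(2, 1, 4.0)))
    val model = new ALS()
      .setRank(10)
      .setIterations(80)
      .setCheckpointInterval(10)  // checkpoint every 10 iterations
      .run(ratings)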

Re: MLlib: save models to HDFS?

2015-04-03 Thread Justin Yip
Hello Zhou, You can look at the recommendation template of PredictionIO. PredictionIO is built on top of Spark, and this template illustrates how you can save the ALS model to HDFS and then reload it later. Ju
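
For reference, a minimal sketch of the approach (assuming Spark 1.3+, where MatrixFactorizationModel supports save/load; paths are illustrative):

    import org.apache.spark.mllib.recommendation.{ALS, MatrixFactorizationModel, Rating}

    val ratings = sc.parallelize(Seq(Rating(1, 1, 5.0), Rating(2, 1, 4.0)))
    val model = ALS.train(ratings, 5 /* rank */, 10 /* iterations */)
    model.save(sc, "hdfs:///models/als")                                      // write to HDFS
    val reloaded = MatrixFactorizationModel.load(sc, "hdfs:///models/als")    // read it back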

DataFrame groupBy MapType

2015-04-04 Thread Justin Yip
Hello, I have a case class like this: case class A( m: Map[Long, Long], ... ) and constructed a DataFrame from Seq[A]. I would like to perform a groupBy on A.m("SomeKey"). I can implement a UDF, create a new Column, and then invoke a groupBy on the new Column. But is that the idiomatic way of doin

Re: DataFrame groupBy MapType

2015-04-07 Thread Justin Yip
5 PM, Michael Armbrust > wrote: > >> In HiveQL, you should be able to express this as: >> >> SELECT ... FROM table GROUP BY m['SomeKey'] >> >> On Sat, Apr 4, 2015 at 5:25 PM, Justin Yip >> wrote: >> >>> Hello, >>> >>>
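
A sketch of the equivalent through the DataFrame API, assuming a version where Column.getItem accepts a map key (the schema below is simplified for illustration):

    case class A(m: Map[String, Long], v: Long)
    val df = sqlContext.createDataFrame(Seq(
      A(Map("SomeKey" -> 1L), 10L),
      A(Map("SomeKey" -> 2L), 20L)))
    // getItem("SomeKey") extracts the map value for that key, which can then be grouped on.
    df.groupBy(df("m").getItem("SomeKey")).sum("v").show()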

Expected behavior for DataFrame.unionAll

2015-04-07 Thread Justin Yip
Hello, I am experimenting with DataFrame. I tried to construct two DataFrames with: 1. case class A(a: Int, b: String) scala> adf.printSchema() root |-- a: integer (nullable = false) |-- b: string (nullable = true) 2. case class B(a: String, c: Int) scala> bdf.printSchema() root |-- a: string

DataFrame degraded performance after DataFrame.cache

2015-04-07 Thread Justin Yip
Hello, I have a Parquet file of around 55M rows (~1G on disk). Performing a simple grouping operation is pretty efficient (I get results within 10 seconds). However, after calling DataFrame.cache, I observe a significant performance degradation: the same operation now takes 3+ minutes. My hunch is that

Re: DataFrame degraded performance after DataFrame.cache

2015-04-07 Thread Justin Yip
The schema has a StructType. Justin On Tue, Apr 7, 2015 at 6:58 PM, Yin Huai wrote: > Hi Justin, > > Does the schema of your data have any decimal, array, map, or struct type? > > Thanks, > > Yin > > On Tue, Apr 7, 2015 at 6:31 PM, Justin Yip > wrote: > >

Re: DataFrame degraded performance after DataFrame.cache

2015-04-07 Thread Justin Yip
> On Tue, Apr 7, 2015 at 6:59 PM, Justin Yip > wrote: > >> The schema has a StructType. >> >> Justin >> >> On Tue, Apr 7, 2015 at 6:58 PM, Yin Huai wrote: >> >>> Hi Justin, >>> >>> Does the schema of your data have any deci

The $ notation for DataFrame Column

2015-04-10 Thread Justin Yip
Hello, The DataFrame documentation always uses $"columnX" to denote a column, but I cannot find much information about it. Maybe I have missed something. Can anyone point me to the doc about the "$", if there is any? Thanks. Justin
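
For context, the $ syntax is the StringToColumn string interpolator brought into scope by importing the SQLContext implicits; $"x" is just a shorthand way of building a Column. A minimal sketch:

    import sqlContext.implicits._   // enables the $"..." column syntax

    val df = sqlContext.createDataFrame(Seq(("alice", 30), ("bob", 25))).toDF("name", "age")
    df.select($"name", $"age" + 1).show()   // $"age" is equivalent to df("age") or col("age")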

DataFrame column name restriction

2015-04-10 Thread Justin Yip
Hello, Are there any restrictions on column names? I tried to use ".", but sqlContext.sql cannot find the column. I would guess that "." is tricky as it affects accessing StructType fields, but are there any other restrictions on column names? scala> case class A(a: Int) defined class A scala> sqlCont
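
One workaround sometimes used, assuming sqlContext is a HiveContext (in HiveQL, backticks quote an identifier that contains a dot, so it is not parsed as struct-field access):

    val df = sqlContext.createDataFrame(Seq((1, 2))).toDF("a.x", "b")
    df.registerTempTable("t")
    sqlContext.sql("SELECT `a.x` FROM t").show()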

Re: How to use Joda Time with Spark SQL?

2015-04-12 Thread Justin Yip
Cheng, this is great info. I have a follow-up question. There are a few very common data types (e.g. Joda DateTime) that are not directly supported by Spark SQL. Do you know if there are any plans for accommodating some common data types in Spark SQL? They don't need to be first-class datatypes, but
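
Until then, one common workaround is to convert Joda types to a supported type (java.sql.Timestamp) at DataFrame-creation time; a sketch, assuming joda-time is on the classpath:

    import java.sql.Timestamp
    import org.joda.time.DateTime

    case class Event(id: Int, at: Timestamp)
    val raw = Seq((1, new DateTime(2015, 4, 12, 0, 0)), (2, new DateTime(2015, 4, 13, 0, 0)))
    // Map DateTime -> java.sql.Timestamp before handing the data to Spark SQL.
    val df = sqlContext.createDataFrame(raw.map { case (id, dt) => Event(id, new Timestamp(dt.getMillis)) })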

Fwd: Expected behavior for DataFrame.unionAll

2015-04-13 Thread Justin Yip
Hello, I am experimenting with DataFrame. I tried to construct two DataFrames with: 1. case class A(a: Int, b: String) scala> adf.printSchema() root |-- a: integer (nullable = false) |-- b: string (nullable = true) 2. case class B(a: String, c: Int) scala> bdf.printSchema() root |-- a: string

Re: Expected behavior for DataFrame.unionAll

2015-04-13 Thread Justin Yip
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/HiveTypeCoercion.scala#L144 > > > > On Tue, Apr 7, 2015 at 5:00 PM, Justin Yip > wrote: > >> Hello, >> >> I am experimenting with DataFrame. I tried to
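
A sketch of the usual workaround: since unionAll pairs columns by position rather than by name, align the order (and, if needed, the types) explicitly before the union.

    case class A(a: Int, b: String)
    case class B(a: String, c: Int)
    val adf = sqlContext.createDataFrame(Seq(A(1, "x")))
    val bdf = sqlContext.createDataFrame(Seq(B("y", 2)))

    // Reorder bdf so its columns line up with adf's (Int, String) layout.
    val aligned = bdf.select(bdf("c").as("a"), bdf("a").as("b"))
    adf.unionAll(aligned).printSchema()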

Re: How to use Joda Time with Spark SQL?

2015-04-14 Thread Justin Yip
va object. > > > > Thanks, > > Daoyuan. > > > > *From:* Cheng Lian [mailto:lian.cs@gmail.com] > *Sent:* Sunday, April 12, 2015 11:51 PM > *To:* Justin Yip > *Cc:* adamgerst; user@spark.apache.org > *Subject:* Re: How to use Joda Time with Spark SQL? > > &g

Catching executor exception from executor in driver

2015-04-14 Thread Justin Yip
Hello, I would like to know if there is a way of catching an exception thrown from an executor in the driver program. Here is an example: try { val x = sc.parallelize(Seq(1,2,3)).map(e => e / 0).collect } catch { case e: SparkException => { println(s"ERROR: $e") println(s"CAUSE:
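
A self-contained sketch of the pattern: the task failure surfaces in the driver as a SparkException, and the executor-side exception is typically only recoverable from the message or (in some versions) from getCause, so both are worth inspecting.

    import org.apache.spark.SparkException

    try {
      sc.parallelize(Seq(1, 2, 3)).map(e => e / 0).collect()
    } catch {
      case e: SparkException =>
        println(s"ERROR: $e")
        println(s"CAUSE: ${e.getCause}")   // may be null; the detail is often in the message text
    }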

Column renaming after DataFrame.groupBy

2015-04-21 Thread Justin Yip
Hello, I would like to rename a column after aggregation. In the following code, the column name is "SUM(_1#179)"; is there a way to rename it to a more friendly name? scala> val d = sqlContext.createDataFrame(Seq((1, 2), (1, 3), (2, 10))) scala> d.groupBy("_1").sum().printSchema root |-- _1: integ
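
One approach (a sketch, assuming Spark 1.3+): name the aggregate explicitly with agg and as instead of relying on the generated column name. Renaming afterwards with withColumnRenamed works too.

    import org.apache.spark.sql.functions.sum

    val d = sqlContext.createDataFrame(Seq((1, 2), (1, 3), (2, 10)))
    d.groupBy("_1").agg(sum("_2").as("total")).printSchema()   // aggregate column is now "total"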

Creating StructType with DataFrame.withColumn

2015-04-30 Thread Justin Yip
Hello, I would like to add a StructType column to a DataFrame. What would be the best way to do it? I'm not sure if it is possible using withColumn. A possible way is to convert the DataFrame into an RDD[Row], add the struct, and then convert it back to a DataFrame, but that seems like overkill. I guess I may have
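
A sketch of a more direct route, assuming a version where org.apache.spark.sql.functions.struct is available (it builds a StructType column without the round trip through RDD[Row]):

    import org.apache.spark.sql.functions.struct

    val df = sqlContext.createDataFrame(Seq((1, "a"), (2, "b"))).toDF("id", "name")
    val withStruct = df.withColumn("pair", struct(df("id"), df("name")))
    withStruct.printSchema()   // "pair" is a struct<id:int, name:string>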

casting timestamp into long fails in Spark 1.3.1

2015-04-30 Thread Justin Yip
Hello, I was able to cast a timestamp into a long using df.withColumn("millis", $"eventTime".cast("long") * 1000) in Spark 1.3.0. However, this statement fails with Spark 1.3.1. I got the following exception: Exception in thread "main" org.apache.spark.sql.types.DataTypeException: Unsu

Re: casting timestamp into long fails in Spark 1.3.1

2015-04-30 Thread Justin Yip
After some trial and error, using the DataType object directly solves the problem: df.withColumn("millis", $"eventTime".cast(org.apache.spark.sql.types.LongType) * 1000) Justin On Thu, Apr 30, 2015 at 3:41 PM, Justin Yip wrote: > Hello, > > I was able to cast a timestamp into long

Custom Aggregate Function for DataFrame

2015-05-14 Thread Justin Yip
Hello, May I know if there is a way to implement an aggregate function for grouped data in a DataFrame? I dug into the docs but didn't find anything apart from the UDF functions, which apply to a Row. Maybe I have missed something. Thanks. Justin

Re: Custom Aggregate Function for DataFrame

2015-05-15 Thread Justin Yip
to write udafs in > similar lines of sum/min etc. > > On Fri, May 15, 2015 at 5:49 AM, Justin Yip > wrote: > >> Hello, >> >> May I know if these is way to implement aggregate function for grouped >> data in DataFrame? I dug into the doc but didn't find an

Best practice to avoid ambiguous columns in DataFrame.join

2015-05-15 Thread Justin Yip
Hello, I would like to know if there are recommended ways of preventing ambiguous columns when joining DataFrames. When we join DataFrames, it often happens that we join on columns with identical names. I could rename the columns on the right DataFrame, as described in the following code. Is the

Re: Best practice to avoid ambiguous columns in DataFrame.join

2015-05-15 Thread Justin Yip
rename the columns as you suggested.* > > df.join(df2.withColumnRenamed("_1", "right_key"), $"_1" === > $"right_key").printSchema > > *4. (Spark 1.4 only) use def join(right: DataFrame, usingColumn: String): > DataFrame* > > df.join(
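
For completeness, a sketch of the last option above (Spark 1.4's join on a column name), which keeps a single copy of the join key in the result:

    val df  = sqlContext.createDataFrame(Seq((1, "a"), (2, "b"))).toDF("key", "left_val")
    val df2 = sqlContext.createDataFrame(Seq((1, "x"), (2, "y"))).toDF("key", "right_val")
    df.join(df2, "key").printSchema()   // only one "key" column appears in the output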

Getting the best parameter set back from CrossValidatorModel

2015-05-16 Thread Justin Yip
Hello, I am using MLPipeline. I would like to extract the best parameters found by CrossValidator, but I cannot find much documentation about how to do it. Can anyone give me some pointers? Thanks. Justin

Re: Getting the best parameter set back from CrossValidatorModel

2015-05-17 Thread Justin Yip
> println(hashingStage.getNumFeatures) > val lrStage = stages(2).asInstanceOf[LogisticRegressionModel] > println(lrStage.getRegParam) > > > > Ram > > On Sat, May 16, 2015 at 3:17 AM, Justin Yip > wrote: > >> Hello, >> >> I am using MLPipeline. I would like to extr
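
Putting the reply together into one sketch (cvModel is assumed to be the fitted CrossValidatorModel and the pipeline's third stage a logistic regression, as in Ram's example):

    import org.apache.spark.ml.PipelineModel
    import org.apache.spark.ml.classification.LogisticRegressionModel

    val best = cvModel.bestModel.asInstanceOf[PipelineModel]
    val lrStage = best.stages(2).asInstanceOf[LogisticRegressionModel]
    println(lrStage.getRegParam)   // the regularization value chosen by cross-validation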

Implementing custom metrics under MLPipeline's BinaryClassificationEvaluator

2015-05-17 Thread Justin Yip
Hello, I would like to use other metrics in BinaryClassificationEvaluator; I am thinking about simple ones (e.g. precisionByThreshold). From the API docs, I can't tell much about how to implement it. From the code, it seems like I will have to override this function, using most of the existing c
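
While the evaluator itself only exposes the area-under-curve metrics, the underlying mllib class already computes per-threshold metrics; a sketch of using it directly on (score, label) pairs:

    import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics

    // (score, label) pairs; in a pipeline these would come from the prediction DataFrame.
    val scoreAndLabels = sc.parallelize(Seq((0.9, 1.0), (0.2, 0.0), (0.7, 1.0), (0.1, 0.0)))
    val metrics = new BinaryClassificationMetrics(scoreAndLabels)
    metrics.precisionByThreshold().collect().foreach { case (t, p) =>
      println(s"threshold=$t precision=$p")
    }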

KryoException: Unable to find class

2014-06-05 Thread Justin Yip
Hello, I have been using Externalizer from Chill as a serialization wrapper. It appears to me that Spark has some classloader conflict with Chill. I have the following (simplified) program: import java.io._ import com.twitter.chill.Externalizer class X(val i: Int) { override

MLLib : Decision Tree with minimum points per node

2014-06-13 Thread Justin Yip
Hello, I have been playing around with MLlib's decision tree library. It is working great, thanks. I have a question regarding overfitting. It appears to me that the current implementation doesn't allow the user to specify the minimum number of samples per node. This results in some nodes only conta

MLLib sample data format

2014-06-22 Thread Justin Yip
Hello, I am looking into a couple of the MLlib data files in https://github.com/apache/spark/tree/master/data/mllib, but I cannot find any explanation for these files. Does anyone know if they are documented? Thanks. Justin

Re: MLLib sample data format

2014-06-22 Thread Justin Yip
). Thanks. Justin On Sun, Jun 22, 2014 at 3:24 PM, Justin Yip wrote: > Hi Shuo, > > Yes. I was reading the guide as well as the sample code. > > For example, in > http://spark.apache.org/docs/latest/mllib-linear-methods.html#linear-support-vector-machine-svm, > now where in the g

Re: MLLib sample data format

2014-06-22 Thread Justin Yip
). Thanks. Justin On Sun, Jun 22, 2014 at 2:40 PM, Shuo Xiang wrote: > Hi, you might find http://spark.apache.org/docs/latest/mllib-guide.html > helpful. > > > On Sun, Jun 22, 2014 at 2:35 PM, Justin Yip wrote: > >> Hello, >> >> I am looking into a couple of ML

Re: MLLib sample data format

2014-06-22 Thread Justin Yip
I see. That's good. Thanks. Justin On Sun, Jun 22, 2014 at 4:59 PM, Evan Sparks wrote: > Oh, and the movie lens one is userid::movieid::rating > > - Evan > > On Jun 22, 2014, at 3:35 PM, Justin Yip wrote: > > Hello, > > I am looking into a couple of MLLib da
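
To summarize the thread: the sample files such as sample_libsvm_data.txt are in LIBSVM format ("label index1:value1 index2:value2 ...") and load directly with MLUtils, while the MovieLens-style file uses userid::movieid::rating as noted above. A sketch:

    import org.apache.spark.mllib.util.MLUtils

    val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
    data.take(1).foreach(println)   // LabeledPoint(label, sparse feature vector)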

Spark Master Web UI showing "0 cores" in Completed Applications

2014-11-02 Thread Justin Yip
Hello, I have a question about the "Completed Applications" table on the Spark Master web UI page. The "Cores" column used to show the number of cores used by the application. However, after I added a "sparkContext.stop()" line at the end of my Spark app, it shows "0 cores". My application

Re: How to create an empty RDD with a given type?

2015-01-12 Thread Justin Yip
Xuelin, There is a function called emptyRDD on SparkContext which serves this purpose. Justin On Mon, Jan 12, 2015 at 9:50 PM, Xuelin Cao wrote: > > > Hi, > > I'd like to create a transform functio
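
A minimal sketch of that suggestion:

    val empty = sc.emptyRDD[Int]   // an RDD[Int] with no partitions and no elements
    println(empty.count())         // 0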

Accumulator value in Spark UI

2015-01-14 Thread Justin Yip
Hello, From the accumulator documentation, it says that if the accumulator is named, it will be displayed in the web UI. However, I cannot find it anywhere. Do I need to specify anything in the Spark UI config? Thanks. Justin

Re: Accumulator value in Spark UI

2015-01-14 Thread Justin Yip
Found it. Thanks Patrick. Justin On Wed, Jan 14, 2015 at 10:38 PM, Patrick Wendell wrote: > It should appear in the page for any stage in which accumulators are > updated. > > On Wed, Jan 14, 2015 at 6:46 PM, Justin Yip > wrote: > > Hello, > > > > From accumu
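
For anyone else searching: only named accumulators appear in the UI, and they show up on the detail page of each stage that updates them rather than on a global page. A sketch:

    val acc = sc.accumulator(0, "records processed")   // the name is what the UI displays
    sc.parallelize(1 to 100).foreach(_ => acc += 1)
    println(acc.value)   // 100; also visible on the stage detail page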