Re: Spark Implementation of XGBoost

2015-10-27 Thread DB Tsai
Hi Meihua, For categorical features, the ordinal issue can be solved by trying all 2^(q-1) - 1 possible partitions of the q values into two groups. However, it's computationally expensive. In Hastie's book, section 9.2.4, the trees can be trained by sorting the residuals and being learnt as if they
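
A minimal sketch of that idea (in Scala; the names and data shapes are assumptions, not code from the thread): ordering the categories by mean residual turns the exponential subset search into a linear scan over q - 1 ordered split points.

    // Sketch: order categories by mean residual, then scan the q - 1
    // ordered prefix splits instead of enumerating all 2^(q-1) - 1 subsets.
    val samples: Seq[(String, Double)] = Seq(
      ("a", 0.3), ("b", -0.1), ("a", 0.5), ("c", 0.2), ("b", 0.0))

    // Mean residual per category.
    val meanByCat = samples.groupBy(_._1)
      .mapValues(vs => vs.map(_._2).sum / vs.size)
      .toSeq
      .sortBy(_._2) // the categories now behave like an ordered feature

    // Candidate binary splits: prefixes of the sorted category list.
    val candidateSplits = (1 until meanByCat.size)
      .map(i => meanByCat.take(i).map(_._1).toSet)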

Re: [VOTE] Release Apache Spark 1.5.2 (RC1)

2015-10-27 Thread Sean Owen
Ah, good point. I also see it still reads 1.5.1. I imagine we just need another sweep to update all the version strings. On Tue, Oct 27, 2015 at 3:08 AM, Krishna Sankar wrote: > Guys, > The sc.version returns 1.5.1 in python and scala. Is anyone getting the > same results? Probably I am doin

Re: [VOTE] Release Apache Spark 1.5.2 (RC1)

2015-10-27 Thread Reynold Xin
Yup looks like I missed that. I will build a new one. On Tuesday, October 27, 2015, Sean Owen wrote: > Ah, good point. I also see it still reads 1.5.1. I imagine we just need > another sweep to update all the version strings. > > On Tue, Oct 27, 2015 at 3:08 AM, Krishna Sankar > wrote: > >> Guy

Exception when using some aggregate operators

2015-10-27 Thread Shagun Sodhani
Hi! I was trying out some aggregate functions in Spark SQL and I noticed that certain aggregate operators are not working. This includes: approxCountDistinct, countDistinct, mean, sumDistinct. For example, using countDistinct results in an error saying Exception in thread "main" org.apache.spark.sql.

Re: Exception when using some aggregate operators

2015-10-27 Thread Shagun Sodhani
Oops, seems I made a mistake. The error message is: Exception in thread "main" org.apache.spark.sql.AnalysisException: undefined function countDistinct On 27 Oct 2015 15:49, "Shagun Sodhani" wrote: > Hi! I was trying out some aggregate functions in Spark SQL and I noticed > that certain aggregate

Re: Exception when using some aggregate operators

2015-10-27 Thread Reynold Xin
Try count(distinct columnname). In SQL, distinct is not part of the function name. On Tuesday, October 27, 2015, Shagun Sodhani wrote: > Oops, seems I made a mistake. The error message is: Exception in thread > "main" org.apache.spark.sql.AnalysisException: undefined function > countDistinct > On
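
A minimal illustration of the distinction (table and column names are hypothetical; a Spark 1.5-era API is assumed):

    // In SQL, DISTINCT is a modifier inside the call, not part of the name:
    sqlContext.sql("SELECT count(DISTINCT a) FROM t")    // resolves
    // sqlContext.sql("SELECT countDistinct(a) FROM t")  // undefined function

    // In the DataFrame API, countDistinct is a function name:
    import org.apache.spark.sql.functions.countDistinct
    df.agg(countDistinct("a"))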

Re: Exception when using some aggregate operators

2015-10-27 Thread Shagun Sodhani
Will try in a while when I get back. I assume this applies to all functions other than mean. Also, countDistinct is defined along with all the other SQL functions, so I don't get the "distinct is not part of the function name" part. On 27 Oct 2015 19:58, "Reynold Xin" wrote: > Try > > count(distinct columnname

Re: Exception when using some aggregate operators

2015-10-27 Thread Shagun Sodhani
So I tried @Reynold's suggestion. I could get countDistinct and sumDistinct running, but mean and approxCountDistinct do not work (I guess I am using the wrong syntax for approxCountDistinct). For mean, I think the registry entry is missing. Can someone clarify that as well? On Tue, Oct 27, 2015 a
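
For reference, a hedged sketch of the DataFrame-side equivalents, which live in org.apache.spark.sql.functions in the 1.5 line (df and the column "a" are hypothetical):

    import org.apache.spark.sql.functions.{avg, approxCountDistinct, sumDistinct}

    // These resolve as Scala functions even if the SQL function registry
    // lacks a same-named entry (e.g. avg is registered where mean may not be).
    df.agg(avg("a"), approxCountDistinct("a"), sumDistinct("a"))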

Re: If you use Spark 1.5 and disabled Tungsten mode ...

2015-10-27 Thread Sjoerd Mulder
I have disabled it because it started generating ERRORs when upgrading from Spark 1.4 to 1.5.1:

2015-10-27T20:50:11.574+0100 ERROR TungstenSort.newOrdering() - Failed to generate ordering, fallback to interpreted
java.util.concurrent.ExecutionException: java.lang.Exception: failed to compile: o
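
For context, a sketch of how the mode can be toggled in the 1.5 line (assuming the spark.sql.tungsten.enabled flag, which governed this at the time):

    // Sketch: disable Tungsten for a SQLContext in Spark 1.5.x.
    sqlContext.setConf("spark.sql.tungsten.enabled", "false")
    // or at submit time:
    //   spark-submit --conf spark.sql.tungsten.enabled=false ...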

Re: If you use Spark 1.5 and disabled Tungsten mode ...

2015-10-27 Thread Josh Rosen
Hi Sjoerd, Did your job actually *fail* or did it just generate many spurious exceptions? While the stacktrace that you posted does indicate a bug, I don't think that it should have stopped query execution because Spark should have fallen back to an interpreted code path (note the "Failed to gener

Re: If you use Spark 1.5 and disabled Tungsten mode ...

2015-10-27 Thread Sjoerd Mulder
No, the job actually doesn't fail, but since our tests are generating all these stacktraces I have disabled the Tungsten mode just to be sure (and so we don't have a gazillion stacktraces in production). 2015-10-27 20:59 GMT+01:00 Josh Rosen : > Hi Sjoerd, > > Did your job actually *fail* or did it just gen

Pickle Spark DataFrame

2015-10-27 Thread agg212
Hi, I'd like to "pickle" a Spark DataFrame object and have tried the following:

    import pickle
    data = sparkContext.jsonFile(data_file)  # load file
    with open('out.pickle', 'wb') as handle:
        pickle.dump(data, handle)

If I convert "data" to a Pandas DataFrame (e.g.,
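
A pickled handle can't round-trip here, since a PySpark DataFrame is a thin wrapper around JVM-side state rather than a local data structure; persisting the data itself and re-reading it is the usual route. A minimal sketch of that round-trip (shown in Scala to match the other examples; the paths are hypothetical, and the same write/read API is mirrored in Python):

    // Sketch: persist the data rather than the DataFrame object.
    df.write.parquet("/tmp/out.parquet")
    val restored = sqlContext.read.parquet("/tmp/out.parquet")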

Re: Spark.Executor.Cores question

2015-10-27 Thread Richard Marscher
Hi Mark, if you know your cluster's number of workers and cores per worker you can set this up when you create a SparkContext and shouldn't need to tinker with the 'spark.executor.cores' setting. That setting is for running multiple executors per application per worker, which you are saying you do
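
A minimal sketch of that setup for a standalone cluster (the sizes and app name are hypothetical):

    import org.apache.spark.{SparkConf, SparkContext}

    // Cap the application's total cores instead of tuning per-executor cores.
    val conf = new SparkConf()
      .setAppName("example")
      .set("spark.cores.max", "16")        // total cores across the cluster
      .set("spark.executor.memory", "4g")  // memory per executor
    val sc = new SparkContext(conf)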

Re: Spark.Executor.Cores question

2015-10-27 Thread mkhaitman
Hi Richard, Thanks for the response. I should have added that the specific case where this becomes a problem is when one of the executors for that application is lost/killed prematurely, and the application attempts to spawn a new executor without consideration as to whether an executor alrea

Re: Spark.Executor.Cores question

2015-10-27 Thread Richard Marscher
Ah I see, that's a bit more complicated =). If it's possible, would using `spark.executor.memory` to set the available worker memory used by executors help alleviate the problem of running on a node that already has an executor on it? I would assume that would have a constant worst case overhead pe

Re: Exception when using some aggregate operators

2015-10-27 Thread Ted Yu
Have you tried using avg in place of mean?

    (1 to 5).foreach { i =>
      val df = (1 to 1000).map(j => (j, s"str$j"))
        .toDF("a", "b")
        .save(s"/tmp/partitioned/i=$i")
    }
    sqlContext.sql("""
      CREATE TEMPORARY TABLE partitionedParquet
      USING org.apache.spark.sql.parquet
      OPTIONS (
        path '/tm

Re: Spark Implementation of XGBoost

2015-10-27 Thread Meihua Wu
Hi DB Tsai, Thank you again for your insightful comments! 1) I agree the sorting method you suggested is a very efficient way to handle the unordered categorical variables in binary classification and regression. I propose we have a Spark ML Transformer to do the sorting and encoding, bringing th

Filter applied on merged Parquet schemas with new column fails.

2015-10-27 Thread Hyukjin Kwon
When enabling mergedSchema and predicate filtering, this fails since Parquet filters are pushed down regardless of the schema of each split (or rather, each file). Dominic Ricard reported this issue (https://issues.apache.org/jira/browse/SPARK-11103). Even though this would work okay by setting spark.sq
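
A sketch of the 1.5-era knobs involved (the path is hypothetical, and the truncated workaround above is not confirmed by this example):

    // Read with Parquet schema merging enabled.
    val df = sqlContext.read
      .option("mergeSchema", "true")
      .parquet("/data/table")

    // Predicate pushdown into Parquet is governed by this flag; disabling it
    // is one way to sidestep pushdown against mismatched per-file schemas
    // (an assumption, not the thread's confirmed fix).
    sqlContext.setConf("spark.sql.parquet.filterPushdown", "false")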

Re: Exception when using some aggregate operators

2015-10-27 Thread Shagun Sodhani
Yup, avg works fine. So we have alternate functions to use in place of the functions pointed out earlier. But my point is: are those original aggregate functions not supposed to be used, am I using them the wrong way, or is it a bug, as I asked in my first mail? On Wed, Oct 28, 2015 at 3:20

Task not serializable exception

2015-10-27 Thread Rohith Parameshwara
I am getting this "Task not serializable" exception when running spark-submit in standalone mode. I am trying to use Spark Streaming, which gets its stream from Kafka queues, but it is not able to process the mapping actions on the RDDs from the stream. The code where the serialization exception
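
The usual culprit is a closure that captures a non-serializable enclosing object. A minimal Scala sketch of the common pattern and fix (the class and field names are hypothetical, not from the original code):

    // "Task not serializable" often means the lambda drags in the
    // enclosing class via `this`.
    class Handler(prefix: String) { // not Serializable
      def tag(stream: org.apache.spark.streaming.dstream.DStream[String]) = {
        // Bad: `prefix` below really means `this.prefix`, capturing `this`:
        //   stream.map(s => prefix + s)

        // Fix: copy the needed field into a local val first.
        val p = prefix
        stream.map(s => p + s)
      }
    }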

Re: using JavaRDD in spark-redis connector

2015-10-27 Thread Rohith P
Got it, thank you!