How to get Histogram of all columns in a large CSV / RDD[Array[double]] ?

2015-10-20 Thread DEVAN M.S.
Hi all, I am trying to calculate the histogram of every column of a CSV file using Spark (Scala). I found that DoubleRDDFunctions supports histogram(), so I coded the following to get the histogram of all columns: 1. Get the column count 2. Create an RDD[Double] for each column and calculate the histogram of
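The two steps in the question can be sketched roughly as below. This is a minimal illustration, not the poster's actual code: the file path "data.csv" is hypothetical, a local SparkContext is assumed, and it uses the Spark 1.x-era RDD API discussed in the thread (histogram() comes from DoubleRDDFunctions via implicit conversion on RDD[Double]).

```scala
import org.apache.spark.{SparkConf, SparkContext}

object AllColumnHistograms {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("hist").setMaster("local[*]"))

    // Parse each CSV row into an Array[Double]; cache since we scan once per column.
    val rows = sc.textFile("data.csv").map(_.split(",").map(_.toDouble)).cache()
    val numCols = rows.first().length

    // Step 2: project each column out as an RDD[Double] and histogram it.
    val histograms = (0 until numCols).map { i =>
      val col = rows.map(_(i))   // RDD[Double] for column i
      i -> col.histogram(10)     // returns (bucket boundaries, counts per bucket)
    }

    histograms.foreach { case (i, (buckets, counts)) =>
      println(s"col $i: buckets=${buckets.mkString(",")} counts=${counts.mkString(",")}")
    }
    sc.stop()
  }
}
```

One scan per column is the simple approach; for very wide data a single aggregate() pass building all histograms at once avoids repeated scans.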

SORT BY and ORDER BY file size v/s RAM size

2015-02-28 Thread DEVAN M.S.
Hi devs, is there any connection between the input file size and RAM size for sorting using Spark SQL? I tried a 1 GB file with 8 GB RAM and 4 cores and got java.lang.OutOfMemoryError: GC overhead limit exceeded. Or could it be for some other reason? It's working for other SparkSQL operation
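For context, these are the configuration knobs usually involved in sort-related GC/OOM pressure in Spark 1.x. A hedged sketch only: the values are illustrative, not a recommendation for the poster's job.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Illustrative settings; tune to the actual cluster.
val conf = new SparkConf()
  .setAppName("sort-tuning")
  .set("spark.executor.memory", "4g")       // heap available to each executor
  .set("spark.shuffle.spill", "true")       // let sorts spill to disk (Spark 1.x flag)
  .set("spark.default.parallelism", "64")   // more, smaller partitions per sort
val sc = new SparkContext(conf)
```

A 1 GB text file often expands several-fold once deserialized into JVM objects, which is why an 8 GB heap can still hit GC overhead limits during a full sort.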

Exception in thread "main" java.lang.SecurityException: class "javax.servlet.ServletRegistration"'

2015-02-03 Thread DEVAN M.S.
Hi all, I need some help. When I try to run my Spark project it shows: "Exception in thread "main" java.lang.SecurityException: class "javax.servlet.ServletRegistration"'s signer information does not match signer information of other classes in the same package". After deleting "/home/d
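This signer-mismatch exception typically means two jars on the classpath both provide javax.servlet classes (e.g. a signed servlet-api jar pulled in transitively alongside Spark's own copy). One common fix is excluding the duplicate in the build; a build.sbt sketch, with illustrative organization names:

```scala
// Exclude conflicting servlet jars pulled in transitively.
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.2.0" excludeAll(
  ExclusionRule(organization = "javax.servlet"),
  ExclusionRule(organization = "org.eclipse.jetty.orbit")
)
```

Which organization actually carries the duplicate varies by dependency tree; `sbt dependencyTree` (or `mvn dependency:tree`) shows where the second copy comes from.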

Re: reducing number of output files

2015-01-22 Thread DEVAN M.S.
Rdd.coalesce(1) will coalesce the RDD and give only one output file; coalesce(2) will give 2, and so on. On Jan 23, 2015 4:58 AM, "Sean Owen" wrote: > One output file is produced per partition. If you want fewer, use > coalesce() before saving the RDD. > > On Thu, Jan 22, 2015 at 10:46 PM, Kane Kim
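The advice in both replies can be sketched as follows; a minimal illustration assuming an existing SparkContext `sc`, with hypothetical output paths:

```scala
// One output file is written per partition, so shrink partitions before saving.
val rdd = sc.parallelize(1 to 1000, numSlices = 8)

rdd.coalesce(1).saveAsTextFile("/tmp/one-file-out")   // part-00000 only
rdd.coalesce(2).saveAsTextFile("/tmp/two-files-out")  // part-00000 and part-00001
```

Note that coalesce(1) funnels all data through a single task, so for large outputs it can become a bottleneck; it only makes sense when the result is genuinely small.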

Re: KNN for large data set

2015-01-22 Thread DEVAN M.S.
der to compute k-nearest > neighbors locally. You can start with LSH + k-nearest in Google > scholar: http://scholar.google.com/scholar?q=lsh+k+nearest -Xiangrui > > On Tue, Jan 20, 2015 at 9:55 PM, DEVAN M.S. wrote: > > Hi all, > > > > Please help me to find out best
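Before reaching for LSH, the brute-force baseline the reply is approximating is worth making concrete. A pure-Scala sketch (names illustrative, not from the thread): exact k-NN is a sort of all points by distance to the query, which is exactly what becomes too expensive at scale and what LSH trades accuracy to avoid.

```scala
object Knn {
  // Euclidean distance between two points of equal dimension.
  def euclidean(a: Array[Double], b: Array[Double]): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  // The k points in `data` closest to `query`: O(n log n) per query.
  def knn(data: Seq[Array[Double]], query: Array[Double], k: Int): Seq[Array[Double]] =
    data.sortBy(p => euclidean(p, query)).take(k)
}
```

LSH replaces the full scan by hashing points so that nearby points likely collide, then running this same brute-force search only within the query's bucket.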

KNN for large data set

2015-01-20 Thread DEVAN M.S.
Hi all, please help me find the best way to do k-nearest neighbors using Spark for large data sets.

Re: IF statement doesn't work in Spark-SQL?

2015-01-20 Thread DEVAN M.S.
Can you share your code? Devan M.S. | Research Associate | Cyber Security | AMRITA VISHWA VIDYAPEETHAM | Amritapuri | Cell +919946535290 | On Tue, Jan 20, 2015 at 5:03 PM, Xuelin Cao wrote: > > Hi, > > Yes, this is what I'm doing. I'm using hiveContext.h

Re: IF statement doesn't work in Spark-SQL?

2015-01-20 Thread DEVAN M.S.
Add one more library: libraryDependencies += "org.apache.spark" % "spark-hive_2.10" % "1.2.0" val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc) Replace sqlContext with hiveContext. It's working for me when using HiveContext. Devan M.S. | Resea
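Put together, the suggestion in this reply looks roughly like the sketch below. The table name and columns are hypothetical; it assumes an existing SparkContext `sc`, a Hive-backed table, and the Spark 1.2-era API from the thread, where HiveContext's parser accepts IF() while the basic SQLContext parser did not.

```scala
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)

// IF(cond, thenValue, elseValue) parses under HiveContext.
val labeled = hiveContext.sql(
  "SELECT name, IF(age >= 18, 'adult', 'minor') AS category FROM people")

labeled.collect().foreach(println)
```

In later Spark versions the unified SparkSession parser supports IF() directly, so this HiveContext workaround is specific to the 1.x split between the two contexts.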

Re: IF statement doesn't work in Spark-SQL?

2015-01-20 Thread DEVAN M.S.
Which context are you using, HiveContext or SQLContext? Can you try with HiveContext? Devan M.S. | Research Associate | Cyber Security | AMRITA VISHWA VIDYAPEETHAM | Amritapuri | Cell +919946535290 | On Tue, Jan 20, 2015 at 3:49 PM, Xuelin Cao wrote: > > Hi, I'm using Spark 1

How to collect() each partition in scala ?

2014-12-30 Thread DEVAN M.S.
Hi all, I have one large data set. When I check the number of partitions it shows 43. We can't collect() the whole data set into driver memory, so I am thinking of collecting each partition separately so that each piece stays small. Any thoughts?
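Two standard ways to do what the question describes, sketched under the assumption of an existing SparkContext `sc` (the RDD here is illustrative). toLocalIterator streams one partition at a time to the driver, and runJob can pull a single partition by index (the allowLocal parameter shown is the Spark 1.x signature):

```scala
val rdd = sc.parallelize(1 to 1000000, 43)

// Option 1: iterate the whole RDD, holding only one partition
// in driver memory at a time.
rdd.toLocalIterator.foreach { x =>
  // process element x on the driver
}

// Option 2: fetch exactly one partition (index 0) to the driver.
val firstPartition: Array[Int] =
  sc.runJob(rdd, (it: Iterator[Int]) => it.toArray, Seq(0), allowLocal = false)(0)
```

Either way the driver still has to hold one partition at a time, so this only helps when individual partitions fit in memory even though the full data set does not.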