I just wonder what your CSV data structure looks like.
If my understanding is correct, the SQL type of VectorUDT is StructType,
and the CSV data source does not support ArrayType or StructType.
Anyhow, it seems CSV does not support UDTs for now.
https://github.com/apache/spark/blob/e1dc85373
Hi all!
Among other use cases, I want to use Spark as a distributed SQL engine
via the Thrift server.
I have some tables in Postgres and Cassandra; I need to expose them via
Hive for custom reporting.
The basic implementation is simple and works, but I have some concerns and open
questions:
- is there a be
Are you certain? It looks like it was correct in the release:
https://github.com/apache/spark/blob/v1.6.2/core/src/main/scala/org/apache/spark/package.scala
On Mon, Jul 25, 2016 at 12:33 AM, Ascot Moss wrote:
> Hi,
>
> I am trying to upgrade spark from 1.6.1 to 1.6.2, from 1.6.2 spark-shell, I
>
Good suggestion, Krishna.
One issue is that this doesn't work with TrainValidationSplit or
CrossValidator for parameter tuning. Hence my solution in the PR which
makes it work with the cross-validators.
On Mon, 25 Jul 2016 at 00:42, Krishna Sankar wrote:
> Thanks Nick. I also ran into this issue.
Spark MLlib's KMeansModel provides a "computeCost" function, which returns the
sum of squared distances of points to their nearest center, i.e. the k-means
cost on the given dataset.
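A minimal sketch of how you might use it (the file path and parameters here are just placeholders):

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Hypothetical RDD[Vector] of training points, one space-separated row per line
val points = sc.textFile("data/points.txt")
  .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
  .cache()

// Train with k = 3 and 20 iterations (illustrative values)
val model = KMeans.train(points, 3, 20)

// Within Set Sum of Squared Errors (WSSSE), i.e. the k-means cost
val wssse = model.computeCost(points)
println(s"WSSSE = $wssse")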
Thanks
Yanbo
2016-07-24 17:30 GMT-07:00 janardhan shetty :
> Hi,
>
> I was trying to evaluate k-means clustering predictio
You can refer to this JIRA (https://issues.apache.org/jira/browse/SPARK-14501)
for porting spark.mllib.fpm to spark.ml.
Thanks
Yanbo
2016-07-24 11:18 GMT-07:00 janardhan shetty :
> Is there any implementation of FPGrowth and Association rules in Spark
> Dataframes ?
> We have in RDD but any pointer
Hi all,
I tried to run the example org.apache.spark.examples.streaming.KafkaWordCount, and I
got this error:
Exception in thread "main" java.lang.NoClassDefFoundError:
org/apache/spark/streaming/kafka/KafkaUtils$
at
org.apache.spark.examples.streaming.KafkaWordCount$.main(KafkaWordCount.scala:57)
at
org.apache
Hi,
I am getting the error below when trying to save a dataframe using Spark-CSV:
>
> final_result_df.write.format("com.databricks.spark.csv").option("header","true").save(output_path)
java.lang.NoSuchMethodError:
> scala.Predef$.$conforms()Lscala/Predef$$less$colon$less;
> at
> com.databricks.spa
You can load the text with sc.textFile() into an RDD[String], then use .map() to
convert it into an RDD[Row]. At this point you are ready to apply a schema. Use
sqlContext.createDataFrame(rddOfRow, structType)
Here is an example of how to define the StructType (schema) that you will
combine with
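A minimal sketch along those lines (the file name, column names and types are hypothetical):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{DoubleType, StringType, StructField, StructType}

// Assume a comma-separated file with lines like "alice,1.0"
val rddOfRow = sc.textFile("people.txt")
  .map(_.split(","))
  .map(parts => Row(parts(0), parts(1).toDouble))

// The schema: one StructField per column
val structType = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("score", DoubleType, nullable = true)))

val df = sqlContext.createDataFrame(rddOfRow, structType)
df.printSchema()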
Great, thanks both of you. I was struggling with this issue as well.
-Rohit
On Mon, Jul 25, 2016 at 4:12 AM, Krishna Sankar wrote:
> Thanks Nick. I also ran into this issue.
> VG, One workaround is to drop the NaN from predictions (df.na.drop()) and
> then use the dataset for the evaluator. In
You can use the .repartition() function on the RDD or DataFrame to set
the number of partitions higher. Use .partitions.length to get the current
number of partitions (Scala API).
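For example (the target of 200 partitions is just illustrative):

// Current number of partitions
println(rdd.partitions.length)

// Spread the data over more partitions (causes a shuffle)
val repartitioned = rdd.repartition(200)
println(repartitioned.partitions.length)

// The same method exists on DataFrames
val dfRepartitioned = df.repartition(200)
println(dfRepartitioned.rdd.partitions.length)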
Andrew
> On Jul 24, 2016, at 4:30 PM, Ascot Moss wrote:
>
> the data set is the training data set for random for
We have data in Bz2 compression format. Are there any pointers on converting it
to Parquet in Spark, and any performance benchmarks or case-study materials?
Hi,
I was trying to evaluate the k-means clustering prediction, since the exact
cluster numbers were provided beforehand for each data point.
I just tried Error = predicted cluster number - given number as a brute-force
method.
What are the evaluation metrics available in Spark for k-means clustering?
Thanks Marco. This solved the order problem. I had another question related
to this.
As you can see below, ID2, ID1 and ID3 are in order and I need to maintain
this index order as well. But when we do the groupByKey
operation (rdd.distinct.groupByKey().mapValues(v => v.toArray)),
everything is ju
Hi,
I am trying to upgrade Spark from 1.6.1 to 1.6.2; from the 1.6.2 spark-shell, I
found that the version is still displayed as 1.6.1.
Is this a minor typo/bug?
Regards
###
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.6.1
      /_/
The data set is the training data set for random forest training, about
36,500 records; any idea how to further partition it?
On Sun, Jul 24, 2016 at 12:31 PM, Andrew Ehrlich
wrote:
> It may be this issue: https://issues.apache.org/jira/browse/SPARK-6235 which
> limits the size of the blocks in th
Thanks Nick. I also ran into this issue.
VG, One workaround is to drop the NaN from predictions (df.na.drop()) and
then use the dataset for the evaluator. In real life, probably detect the
NaN and recommend most popular on some window.
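A minimal sketch of that workaround, assuming "predictions" is the DataFrame returned by the ALS model's transform() and the rating column is named "rating" (adjust the column names to your data):

import org.apache.spark.ml.evaluation.RegressionEvaluator

// Drop rows where ALS produced NaN predictions (unseen users/items)
val cleaned = predictions.na.drop(Seq("prediction"))

val evaluator = new RegressionEvaluator()
  .setMetricName("rmse")
  .setLabelCol("rating")
  .setPredictionCol("prediction")

val rmse = evaluator.evaluate(cleaned)
println(s"RMSE (NaN predictions dropped) = $rmse")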
HTH.
Cheers
On Sun, Jul 24, 2016 at 12:49 PM, Nick Pentreath
It seems likely that you're running into
https://issues.apache.org/jira/browse/SPARK-14489 - this occurs when the
test dataset in the train/test split contains users or items that were not
in the training set. Hence the model doesn't have computed factors for
those ids, and ALS 'transform' currentl
Hello
Uhm, you have an array containing 3 tuples?
If all the arrays have the same length, you can just zip all of them, creating
a list of tuples;
then you can scan the list 5 by 5...?
So something like
(Array(0)._2, Array(1)._2, Array(2)._2).zipped.toList
this will give you a list of tuples of 3 eleme
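A small sketch of the idea, with made-up arrays of (id, value) pairs:

// Three hypothetical arrays of equal length
val a = Array(("ID1", 1), ("ID1", 2), ("ID1", 3))
val b = Array(("ID2", 4), ("ID2", 5), ("ID2", 6))
val c = Array(("ID3", 7), ("ID3", 8), ("ID3", 9))

// Zip the value parts into a List[(Int, Int, Int)]
val zipped = (a.map(_._2), b.map(_._2), c.map(_._2)).zipped.toList
// List((1,4,7), (2,5,8), (3,6,9))

// ...then walk the list in fixed-size groups, e.g. 5 at a time
zipped.grouped(5).foreach(println)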
I have a nested data structure (an array of structs) that I'm flattening with
the DSL df.explode() API. However, when the array is empty,
I'm not getting the rest of the row in my output, as it is skipped.
This is the intended behavior, and Hive supports a SQL "OUTER explode()" to
genera
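For reference, a sketch of the Hive-style syntax being referred to, via Spark SQL (assuming a HiveContext; the table and column names are hypothetical, and support may depend on your Spark version):

// df has a schema like: id: string, items: array<struct<name:string>>
df.registerTempTable("t")

// OUTER keeps rows whose array is empty or null, emitting NULL for the exploded column
val exploded = sqlContext.sql(
  """SELECT t.id, item.name
    |FROM t
    |LATERAL VIEW OUTER explode(t.items) x AS item""".stripMargin)
exploded.show()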
Saurav,
We have the same issue. Our application runs fine on 32 nodes with 4 cores
each and 256 partitions but gives an OOM on the driver when run on 64 nodes
with 512 partitions. Did you get to know the reason behind this behavior or
the relation between number of partitions and driver RAM usage?
Hi
What is your source data? I am guessing a DataFrame of Integers, as you
are using a UDF.
So your DataFrame is then a bunch of Row[Integer]?
Below is a sample from one of my programs to predict Eurocup winners, going from
a DataFrame of Row[Double] to an RDD of LabeledPoint.
I'm not using a UDF to con
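A minimal sketch of that kind of conversion (the column layout is hypothetical: label in the first column, numeric features after):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// df is a DataFrame of numeric columns; build an RDD[LabeledPoint] from it
val labeled = df.rdd.map { row =>
  val label = row.getDouble(0)
  val features = Vectors.dense((1 until row.length).map(row.getDouble).toArray)
  LabeledPoint(label, features)
}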
Is there any implementation of FPGrowth and Association Rules on Spark
DataFrames?
We have it for RDDs, but are there any pointers for DataFrames?
I think this is about JavaSparkContext, which implements the standard
Closeable interface for convenience. Both do exactly the same thing.
On Sun, Jul 24, 2016 at 6:27 PM, Jacek Laskowski wrote:
> Hi,
>
> I can only find stop. Where did you find close?
>
> Pozdrawiam,
> Jacek Laskowski
>
> ht
I have HDFS data in zip format which includes data, name and namesecondary
folders. The structure is pretty much like the datanode, namenode and secondary
namenode layout. How do I read the content of the data folder?
It would be great if someone could suggest tips/steps.
Thanks
Hi,
I can only find stop. Where did you find close?
Pozdrawiam,
Jacek Laskowski
https://medium.com/@jaceklaskowski/
Mastering Apache Spark http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski
On Sat, Jul 23, 2016 at 3:11 PM, Mail.com wrote:
> Hi All,
>
> Wh
Hello,
I'm working with Spark 2.0.0-rc5 on Mesos (v0.28.2) on a job with ~600
cores. Every so often, depending on the task that I've run, I'll lose an
executor to an assertion. Here's an example error:
java.lang.AssertionError: assertion failed: Block rdd_2659_0 is not locked
for reading
I've p
Array(
(ID1,Array(18159, 308703, 72636, 64544, 39244, 107937, 54477, 145272,
100079, 36318, 160992, 817, 89366, 150022, 19622, 44683, 58866, 162076,
45431, 100136)),
(ID3,Array(100079, 19622, 18159, 212064, 107937, 44683, 150022, 39244,
100136, 58866, 72636, 145272, 817, 89366, 54477, 36318, 308703
I am trying to build a simple DataFrame that can be used for ML:
SparkConf conf = new SparkConf()
    .setAppName("Simple prediction from Text File")
    .setMaster("local");
SparkContext sc = new SparkContext(conf);
SQLContext sqlContext = new SQLContext(sc);
Hi,
Here is my UDF that should build a VectorUDT. How do I actually make sure the
value ends up in the vector?
package net.jgp.labs.spark.udf;
import org.apache.spark.mllib.linalg.VectorUDT;
import org.apache.spark.sql.api.java.UDF1;
public class VectorBuilder implements UDF1 {
private sta
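In case it helps, a sketch of doing the same thing from Scala rather than a Java UDF1 (the column name "value" is hypothetical):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.sql.functions.udf

// Wrap a Double column into a dense Vector; the return type maps to the vector SQL type
val toVector = udf((x: Double) => Vectors.dense(x))

val withVec = df.withColumn("features", toVector(df("value")))
withVec.printSchema()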
Ping. Does anyone have suggestions/advice for me?
It would be really helpful.
VG
On Sun, Jul 24, 2016 at 12:19 AM, VG wrote:
> Sean,
>
> I did this just to test the model. When I do a split of my data as
> training to 80% and test to be 20%
>
> I get a Root-mean-square error = NaN
>
> So I am wo
Hi All,
I want to restart my Spark Streaming job periodically, every 15
minutes, using Java. Is it possible and, if yes, how should I proceed?
Thanks,
Prashant
Apologies, I misinterpreted. Could you post the two use cases?
Kr
On 24 Jul 2016 3:41 pm, "janardhan shetty" wrote:
> Marco,
>
> Thanks for the response. It is indexed order and not ascending or
> descending order.
> On Jul 24, 2016 7:37 AM, "Marco Mistroni" wrote:
>
>> Use map values to transfor
Marco,
Thanks for the response. It is indexed order and not ascending or
descending order.
On Jul 24, 2016 7:37 AM, "Marco Mistroni" wrote:
> Use map values to transform to an rdd where values are sorted?
> Hth
>
> On 24 Jul 2016 6:23 am, "janardhan shetty" wrote:
>
>> I have a key,value pair r
Hi, how about creating an auto-increment column in HBase?
Hth
On 24 Jul 2016 3:53 am, "yeshwanth kumar" wrote:
> Hi,
>
> i am doing bulk load to hbase using spark,
> in which i need to generate a sequential key for each record,
> the key should be sequential across all the executors.
>
> i tried z
Hi Janardhan,
Please refer the JIRA (https://issues.apache.org/jira/browse/SPARK-5992)
for the discussion about LSH.
Regards
Yanbo
2016-07-24 7:13 GMT-07:00 Karl Higley :
> Hi Janardhan,
>
> I collected some LSH papers while working on an RDD-based implementation.
> Links at the end of the READ
Hi Janardhan,
I collected some LSH papers while working on an RDD-based implementation.
Links at the end of the README here:
https://github.com/karlhigley/spark-neighbors
Keep me posted on what you come up with!
Best,
Karl
On Sun, Jul 24, 2016 at 9:54 AM janardhan shetty
wrote:
> I was lookin
I was looking into implementing locality-sensitive hashing on DataFrames.
Any pointers for reference?
Sorry for the wrong link; what you should refer to is jpmml-sparkml (
https://github.com/jpmml/jpmml-sparkml).
Thanks
Yanbo
2016-07-24 4:46 GMT-07:00 Yanbo Liang :
> Spark does not support exporting ML models to PMML currently. You can try
> the third party jpmml-spark (https://github.com/jpmml/jpm
Spark does not support exporting ML models to PMML currently. You can try
the third-party jpmml-spark (https://github.com/jpmml/jpmml-spark) package,
which supports a subset of ML models.
Thanks
Yanbo
2016-07-20 11:14 GMT-07:00 Ajinkya Kale :
> Just found Google dataproc has a preview of spark 2.0.
Hi again,
Just another strange behavior I stumbled upon. Can anybody reproduce it?
Here's the code snippet in Scala:
var df1 = spark.read.parquet(fileName)
df1 = df1.withColumn("newCol", df1.col("anyExistingCol"))
df1.printSchema() // here newCol exists
df1 = df1.flatMap(x => List(x))
df1
If you can use a DataFrame, then you could use rank plus a window function at the
expense of an extra sort. Do you have an example of zipWithIndex not
working? That seems surprising.
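A minimal sketch of both approaches, assuming a DataFrame df with a column "ts" to order by (hypothetical):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

// Window/rank approach: note that a window with no partitioning
// funnels all rows through a single partition (the extra sort/shuffle cost)
val w = Window.orderBy("ts")
val withId = df.withColumn("id", row_number().over(w))

// RDD approach: zipWithIndex assigns a stable 0-based index per element
val indexed = df.rdd.zipWithIndex()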
On Jul 23, 2016 10:24 PM, "Andrew Ehrlich" wrote:
> It’s hard to do in a distributed system. Maybe try generating a me
Hi Gourav,
I cannot reproduce your problem. The following code snippet works well on
my local machine; you can try to verify it in your environment. Or could
you provide more information so that others can reproduce your problem?
from pyspark.mllib.linalg.distributed import CoordinateMatrix, Ma
Which version of Java 8 do you use? AFAIK, it's recommended to use Java
1.8.0_66 or later.
On Fri, Jul 22, 2016 at 8:49 PM, Jacek Laskowski wrote:
> On Fri, Jul 22, 2016 at 6:43 AM, Ted Yu wrote:
> > You can use this command (assuming log aggregation is turned on):
> >
> > yarn logs --applicationId
Also, you may be interested in GATK, built on Spark, for genomics:
https://github.com/broadinstitute/gatk
On Sun, Jul 24, 2016 at 7:56 AM, Ofir Manor wrote:
> Hi James,
> BTW - if you are into analyzing DNA with Spark, you may also be interested
> in ADAM:
>https://github.com/bigdatagen