Spark SQL DataFrame resulting from an except() is unusable

2017-01-31 Thread Vinayak Joshi5
With Spark 2.x, I construct a DataFrame from a sample libsvm file:

scala> val higgsDF = spark.read.format("libsvm").load("higgs.libsvm")
higgsDF: org.apache.spark.sql.DataFrame = [label: double, features: vector]

Then I build a new DataFrame that involves an except():

scala> val train_df = higg…
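The archive cuts the snippet off; a minimal sketch of the reported pattern, assuming the training set is built by subtracting a sampled test split via except() (the sample() split and the final count() action are illustrative assumptions, not from the original message):

// Hypothetical reconstruction of the truncated session.
val higgsDF = spark.read.format("libsvm").load("higgs.libsvm")
val test_df = higgsDF.sample(withReplacement = false, fraction = 0.1, seed = 42)
val train_df = higgsDF.except(test_df) // reportedly yields an unusable DataFrame
train_df.count()                       // any action on train_df would surface the failure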

[SQL][ML] Pipeline performance regression between 1.6 and 2.x

2017-01-31 Thread Maciej Szymkiewicz
Hi everyone, while experimenting with ML pipelines I am seeing a significant performance regression when switching from 1.6.x to 2.x.

import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer, VectorAssembler}

val df = (1 to 40).foldLe…
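The benchmark is truncated at the foldLeft; a sketch of the shape it likely takes, building 40 generated string columns and one StringIndexer/OneHotEncoder pair per column (the row count, data, and assembler stage are assumptions for illustration):

import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer, VectorAssembler}
import org.apache.spark.sql.functions.rand

// Build a DataFrame with 40 low-cardinality string columns.
val df = (1 to 40).foldLeft(spark.range(1000).toDF("id")) { (d, i) =>
  d.withColumn(s"col$i", (rand(i) * 10).cast("int").cast("string"))
}

// One StringIndexer + OneHotEncoder pair per column.
val indexersAndEncoders: Array[PipelineStage] = (1 to 40).flatMap { i =>
  Seq[PipelineStage](
    new StringIndexer().setInputCol(s"col$i").setOutputCol(s"col${i}_idx"),
    new OneHotEncoder().setInputCol(s"col${i}_idx").setOutputCol(s"col${i}_vec")
  )
}.toArray

val assembler = new VectorAssembler()
  .setInputCols((1 to 40).map(i => s"col${i}_vec").toArray)
  .setOutputCol("features")

// Timing this fit is where the 1.6 vs 2.x gap would show up.
val model = new Pipeline().setStages(indexersAndEncoders :+ assembler).fit(df)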

Unique Partition Id per partition

2017-01-31 Thread Chawla,Sumit
Hi All, I have an RDD which I partition based on some key, and then run sc.runJob for each partition. Inside this function, I assign each partition a unique key using the following:

"%s_%s" % (id(part), int(round(time.time())))

This is to make sure that each partition produces separate bookkeeping st…

Re: Error Saving Dataframe to Hive with Spark 2.0.0

2017-01-31 Thread Michael Allman
That's understandable. Maybe I can help. :) What happens if you set `HIVE_TABLE_NAME = "default.employees"`? Also, does that table exist before you call `filtered_output_timestamp.write.mode("append").saveAsTable(HIVE_TABLE_NAME)`?

Cheers, Michael

> On Jan 29, 2017, at 9:52 PM, Chetan Khatri…
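For reference, the pattern under discussion looks roughly like this in Scala (the thread's snippet appears to be PySpark; the session setup and the stand-in DataFrame here are illustrative assumptions):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hive-append-check")
  .enableHiveSupport() // required for saveAsTable to hit the Hive metastore
  .getOrCreate()

// Stand-in for the thread's filtered_output_timestamp DataFrame.
val filteredOutputTimestamp = spark.range(10).toDF("id")

// Database-qualified table name, as suggested above.
val hiveTableName = "default.employees"
filteredOutputTimestamp.write.mode("append").saveAsTable(hiveTableName)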

Re: Unique Partition Id per partition

2017-01-31 Thread Michael Allman
Hi Sumit, can you use http://spark.apache.org/docs/latest/api/python/pyspark.html?highlight=rdd#pyspark.RDD.mapPartitionsWithIndex to solve your problem?

Michael

> On Jan 31, 2017,…
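A minimal sketch of that suggestion, shown in Scala for consistency with the rest of this digest even though the thread uses PySpark (the RDD contents, key format, and an ambient spark-shell `sc` are assumptions): the partition index is stable and unique across the job, unlike object identity plus a timestamp.

// Tag each element with a key derived from the partition index.
val rdd = sc.parallelize(1 to 100, numSlices = 4)
val tagged = rdd.mapPartitionsWithIndex { (idx, iter) =>
  val partitionKey = s"part_$idx" // unique and deterministic per partition
  iter.map(x => (partitionKey, x))
}
tagged.take(3) // e.g. (part_0, 1), (part_0, 2), (part_0, 3)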

Call for abstracts open for Dataworks & Hadoop Summit San Jose

2017-01-31 Thread Alan Gates
The Dataworks & Hadoop Summit will be in San Jose, June 13-15, 2017. The call for abstracts closes February 10. You can submit an abstract at http://tinyurl.com/dwsj17CFA. There are tracks for Hadoop, data processing and warehousing, governance and security, IoT and streaming, cloud and operati…

Structured Streaming Source error

2017-01-31 Thread Sam Elamin
Hi Folks, I am getting a weird error when trying to write a BigQuery Structured Streaming source.

Error: java.lang.AbstractMethodError: com.samelamin.spark.bigquery.streaming.BigQuerySource.commit(Lorg/apache/spark/sql/execution/streaming/Offset;)V
at org.apache.spark.sql.execution.stream…

Re: Structured Streaming Source error

2017-01-31 Thread Shixiong(Ryan) Zhu
You used one Spark version to compile your code but a newer version to run it. As the Source APIs are not stable, Spark doesn't guarantee that they are binary compatible.

On Tue, Jan 31, 2017 at 1:39 PM, Sam Elamin wrote:
> Hi Folks
>
> I am getting a weird error when trying to write a…
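A minimal way to guard against that mismatch in an sbt build: compile against the exact Spark version the cluster runs and mark it "provided" so the cluster's jars are what actually execute (the version numbers below are illustrative, not from the thread).

// build.sbt (sketch): keep the compile-time Spark version identical
// to the runtime one.
scalaVersion := "2.11.8"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql" % "2.1.0" % "provided"
)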

Re: Structured Streaming Source error

2017-01-31 Thread Sam Elamin
Ha Ryan, you're everywhere, JIRA and mailing list alike. I thought multitasking was a myth! Thanks for your help; I was indeed using different versions! Regards, Sam

On Tue, Jan 31, 2017 at 9:48 PM, Shixiong(Ryan) Zhu wrote:
> You used one Spark version to compile your code but a newer version
> to run it. As…

Re: MLlib mission and goals

2017-01-31 Thread Seth Hendrickson
I agree with what Sean said about not supporting arbitrarily many algorithms. I think the goal of MLlib should be to support only core algorithms for machine learning. Ideally Spark ML provides a relatively small set of algorithms that are heavily optimized, and also provides a framework that makes…