How to convert spark data-frame to datasets?

2016-08-18 Thread Minudika Malshan
Hi all, Most Spark ML algorithms require a dataset to train the model. I would like to know how to convert a Spark *DataFrame* to a *Dataset* using Java. Your support is much appreciated. Thank you! Minudika

Re: How to convert spark data-frame to datasets?

2016-08-18 Thread Oscar Batori
From the docs, DataFrame is just Dataset[Row]. There are various converters for subtypes of Product if you want, using "as[T]", where T <: Product.
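In Java terms, a minimal sketch of that conversion might look like the following. It assumes a hypothetical Person bean whose fields line up with the DataFrame's columns and a hypothetical people.json input; in Scala the equivalent would be df.as[Person] with an Encoder in scope.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class DataFrameToDataset {
    // Hypothetical bean; its fields must match the DataFrame's columns.
    public static class Person implements java.io.Serializable {
        private String name;
        private long age;
        public String getName() { return name; }
        public void setName(String name) { this.name = name; }
        public long getAge() { return age; }
        public void setAge(long age) { this.age = age; }
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("df-to-ds").getOrCreate();

        // In Java, a DataFrame is simply Dataset<Row>.
        Dataset<Row> df = spark.read().json("people.json");

        // Supply an Encoder for the target class to get a typed Dataset.
        Dataset<Person> ds = df.as(Encoders.bean(Person.class));

        ds.show();
        spark.stop();
    }
}
```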

Setting YARN executors' JAVA_HOME

2016-08-18 Thread Ryan Williams
I need to tell YARN a JAVA_HOME to use when spawning containers (to run a Java 8 app on Java 7 YARN). The only way I've found that works is setting SPARK_YARN_USER_ENV="JAVA_HOME=/path/to/java8". The code

Re: Setting YARN executors' JAVA_HOME

2016-08-18 Thread dhruve ashar
Hi Ryan, You can get more info on this in the Spark documentation. The page addresses what you need. You can look for spark.executorEnv.[EnvironmentVariableName] and set your Java home as spark.executorEnv.JAVA_HOME=. Regards, Dhruve
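For reference, a minimal sketch of setting that property programmatically (the Java 8 path below is an example, not a real location); the same setting can also be passed on the command line as --conf spark.executorEnv.JAVA_HOME=/path/to/java8.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.sql.SparkSession;

public class ExecutorJavaHome {
    public static void main(String[] args) {
        // Tell YARN executors which JAVA_HOME to use when launching the container JVM.
        SparkConf conf = new SparkConf()
                .setAppName("java8-on-yarn")
                .set("spark.executorEnv.JAVA_HOME", "/path/to/java8");

        SparkSession spark = SparkSession.builder().config(conf).getOrCreate();
        // ... application code ...
        spark.stop();
    }
}
```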

Re: Setting YARN executors' JAVA_HOME

2016-08-18 Thread Ryan Williams
Ah, I guess I missed that by only looking in the YARN config docs, but this is a more general parameter and not documented there. Thanks!

Early Draft Structured Streaming Machine Learning

2016-08-18 Thread Holden Karau
Hi everyone (who cares about structured streaming and ML), Seth and I have been giving some thought to supporting structured streaming in machine learning - we've put together an early design doc (it's been in JIRA (SPARK-16424) for a while, but inca

Parquet partitioning / appends

2016-08-18 Thread Jeremy Smith
Hi, I'm running into an issue wherein Spark (both 1.6.1 and 2.0.0) will fail with a GC overhead limit exceeded error when creating a DataFrame from a Parquet-backed partitioned Hive table with a relatively large number of Parquet files (~175 partitions, each containing many Parquet files). If I the
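A minimal sketch of the kind of read described above, assuming the Spark 2.0 API and a hypothetical table my_db.my_partitioned_table; this only illustrates where the DataFrame is created, not a fix for the GC overhead issue.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ReadPartitionedParquet {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("partitioned-parquet-read")
                .enableHiveSupport()
                .getOrCreate();

        // Reading a Parquet-backed, partitioned Hive table. With many partitions and
        // many files per partition, listing files and reading footers can get expensive.
        Dataset<Row> df = spark.table("my_db.my_partitioned_table");

        df.show();
        spark.stop();
    }
}
```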

Re: RFC: Remove "HBaseTest" from examples?

2016-08-18 Thread Ignacio Zendejas
I'm very late to this party and I get hbase-spark... what's the recommendation for pyspark + hbase? I realize this isn't necessarily a concern of the Spark project, but it'd be nice to at least document it here with a very short and sweet response, because I haven't found anything useful in the wild.