Hi Marco,
Yes, you can apply `VectorAssembler` first in the pipeline to assemble
multiple feature columns into a single vector column.
Thanks.
Hello Wei
Thanks, I should have checked the data.
My data has this format:
|col1|col2|col3|label|
so it looks like I cannot use VectorIndexer directly (it accepts a Vector
column).
I am guessing what I should do is something like this (given I have a few
categorical features), e.g. starting with:
val assembler = new VectorAssembler()
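(Completing the truncated snippet: a minimal sketch of the assemble-then-index
pipeline Wei suggested; the column names col1/col2/col3 come from the schema
above, and setMaxCategories(4) is an assumed example value.)

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{VectorAssembler, VectorIndexer}

// Assemble the raw columns into a single vector column first...
val assembler = new VectorAssembler()
  .setInputCols(Array("col1", "col2", "col3"))
  .setOutputCol("features")

// ...then index the assembled vector, treating low-cardinality
// features as categorical.
val indexer = new VectorIndexer()
  .setInputCol("features")
  .setOutputCol("indexedFeatures")
  .setMaxCategories(4) // assumed example threshold

val pipeline = new Pipeline().setStages(Array(assembler, indexer))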
Hi Jacek,
Just replied to the SO thread as well, but…
Yes, your first statement is correct. The DFs in the union are read in the same
stage, so in your example, where each DF has 8 partitions, you get a single
stage with 16 tasks to read the 2 DFs. There's no need to define the DF in a
separate thread.
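(A quick way to see this for yourself; df1/df2 and the 8-partition setup below
are my assumptions, mirroring the example:)

val df1 = spark.range(0, 100).repartition(8).toDF("id")
val df2 = spark.range(100, 200).repartition(8).toDF("id")
val unioned = df1.union(df2)
println(unioned.rdd.getNumPartitions) // 16: both inputs are scanned as tasks of one stage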
Hi, Marco,

val data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

The data now includes a feature column with the name "features".

val featureIndexer = new VectorIndexer()
  .setInputCol("features")          // <-- here, specify the "features" column to index
  .setOutputCol("indexedFeatures")
Hi,
I've been trying to find out the answer to the question about UNION ALL and
SELECTs @ https://stackoverflow.com/q/47837955/1305344
> If I have a Spark SQL statement of the form SELECT [...] UNION ALL SELECT
> [...], will the two SELECT statements be executed in parallel? In my
> specific use case t
Hi,
A join between streaming and batch/static Datasets is supported for sure -->
http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#join-operations
I'm not sure about union, but that's easy to check (and I am leaving it as
your home exercise -- see the sketch below).
You cannot have datase
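(A minimal sketch of how one could check the union question empirically; the
rate source and the shape of the static Dataset are my assumptions, not from
the original mail:)

val streaming = spark.readStream.format("rate").load() // streaming DF with (timestamp, value)
val static = spark.range(0, 10)
  .selectExpr("current_timestamp() as timestamp", "id as value")

// If union of streaming and static Datasets is unsupported, this line
// should throw an AnalysisException; otherwise it returns a streaming DF.
val combined = streaming.union(static)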
Hi All,
I am getting the following error message while applying
flatMapGroupsWithState:

Exception in thread "main" org.apache.spark.sql.AnalysisException:
flatMapGroupsWithState in update mode is not supported with aggregation on
a streaming DataFrame/Dataset;;

Following is what I am trying to do
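(The original code is cut off; below is a minimal sketch of the pattern the
error message describes -- flatMapGroupsWithState in Update mode combined with
a downstream aggregation. All names and the rate source are assumptions.)

import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}
import spark.implicits._

case class Event(key: String, value: Long)

val events = spark.readStream.format("rate").load()
  .selectExpr("cast(value % 10 as string) as key", "value")
  .as[Event]

// Keep a running count per key in state and emit it on every update.
def counter(key: String, rows: Iterator[Event], state: GroupState[Long]): Iterator[(String, Long)] = {
  val total = state.getOption.getOrElse(0L) + rows.size
  state.update(total)
  Iterator((key, total))
}

val counted = events.groupByKey(_.key)
  .flatMapGroupsWithState(OutputMode.Update, GroupStateTimeout.NoTimeout)(counter)

// Adding an aggregation on top of this in update mode is what triggers the
// AnalysisException above; per the docs, aggregations after
// flatMapGroupsWithState are only allowed in append mode.
val aggregated = counted.groupBy("_1").count()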
Hi
Does anyone have any hints or an example (code) of how to get this combination
working: Windows 10 + PySpark + IPython notebook + CSV file loading with
timestamps (time-series data) into a DataFrame or RDD?
I have already installed Windows 10 + PySpark + IPython notebook and they
seem to work, but my pyth
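(Not from the original thread: a minimal sketch of the CSV-with-timestamps
part, shown in Scala since the rest of this digest uses it; the same reader
options exist on PySpark's DataFrameReader. The file name and timestamp
pattern are assumptions.)

val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .option("timestampFormat", "yyyy-MM-dd HH:mm:ss") // assumed pattern; adjust to the file
  .csv("timeseries.csv")

df.printSchema() // the timestamp column should come back as TimestampType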
Develop your own Hadoop InputFormat and use
https://spark.apache.org/docs/2.0.2/api/java/org/apache/spark/SparkContext.html#newAPIHadoopRDD(org.apache.hadoop.conf.Configuration,%20java.lang.Class,%20java.lang.Class,%20java.lang.Class)
to load it. The Spark datasource API will be relevant for you in th
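(A minimal sketch of the newAPIHadoopRDD call with a stock InputFormat, just
to show the shape of the API; a custom InputFormat would slot in where
TextInputFormat is. The path is an assumption, and sc is the SparkContext.)

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.{FileInputFormat, TextInputFormat}

val conf = new Configuration()
conf.set(FileInputFormat.INPUT_DIR, "hdfs:///data/input") // assumed path

val rdd = sc.newAPIHadoopRDD(
  conf,
  classOf[TextInputFormat], // replace with your own InputFormat subclass
  classOf[LongWritable],    // key type produced by the InputFormat
  classOf[Text])            // value type produced by the InputFormat

rdd.map(_._2.toString).take(5).foreach(println)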