Greetings,
I am about 50/50 sure the data format is correct: if I split the data, the
classifier works properly, but not if I introduce another dataset created
identically to the one it was trained on.
However, I do have doubts about how the data itself is created, and I cannot
find any help on this subject for Dataset<Row>.
What I do is create two List<Row>:

    List<Row> dataTraining = new ArrayList<>();
    List<Row> dataTesting = new ArrayList<>();

Fill them:

    dataTraining.add(RowFactory.create(Double.parseDouble(label), Vectors.dense(v)));
    dataTesting.add(RowFactory.create(Double.parseDouble(label), Vectors.dense(v)));

Then construct two Dataset<Row>:

    StructType schemaForFrame = new StructType(new StructField[] {
        new StructField("label", DataTypes.DoubleType, false, Metadata.empty()),
        new StructField("features", new VectorUDT(), false, Metadata.empty())
    });
    Dataset<Row> training = spark.createDataFrame(dataTraining, schemaForFrame);
    Dataset<Row> testing = spark.createDataFrame(dataTesting, schemaForFrame);

So I am not sure if it is correct, but I am not using RDD.
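
One thing that might be worth ruling out before the rows are built: the snippets above do not show whether every `v` has the same length in both lists, and a test vector whose length differs from the training vectors is a plausible (but here unconfirmed) cause of a failure inside `transform`. Spark's documentation also states that feature values must be non-negative for its default multinomial Naive Bayes. The sketch below is plain Java with no Spark dependency. `FeatureCheck` and its methods are hypothetical names made up for illustration, and it assumes the raw features are still available as `double[]` arrays before they are passed to `Vectors.dense(v)`.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Spark: sanity-checks raw feature arrays
// before they are wrapped with Vectors.dense(v).
class FeatureCheck {

    // Returns the length shared by all arrays, or -1 if the list is empty
    // or the lengths differ.
    static int commonLength(List<double[]> rows) {
        if (rows.isEmpty()) {
            return -1;
        }
        int expected = rows.get(0).length;
        for (double[] row : rows) {
            if (row.length != expected) {
                return -1;
            }
        }
        return expected;
    }

    // True if every value is finite and greater than or equal to zero.
    static boolean allNonNegative(List<double[]> rows) {
        for (double[] row : rows) {
            for (double value : row) {
                if (Double.isNaN(value) || Double.isInfinite(value) || value < 0.0) {
                    return false;
                }
            }
        }
        return true;
    }

    // True if both sets share one non-zero feature length and contain
    // only finite, non-negative values.
    static boolean compatible(List<double[]> training, List<double[]> testing) {
        int trainLength = commonLength(training);
        return trainLength > 0
                && trainLength == commonLength(testing)
                && allNonNegative(training)
                && allNonNegative(testing);
    }

    public static void main(String[] args) {
        List<double[]> training = Arrays.asList(
                new double[] {1.0, 0.0, 2.0},
                new double[] {0.0, 3.0, 1.0});
        // Made-up test row that is one feature short of the training rows.
        List<double[]> testing = Arrays.asList(new double[] {2.0, 1.0});

        System.out.println(compatible(training, training)); // prints "true"
        System.out.println(compatible(training, testing));  // prints "false"
    }
}
```

If such a check fails for the testing list but passes for the training list, the way the second dataset is created would be the first place to look.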
Also, can you tell me whether you have had any problems with the mailing list? I
have been trying for weeks to get my emails accepted by the list.
Thanks
BR
MK
----------------------------------------
Michael C. Kunkel, USMC, PhD
Forschungszentrum Jülich
Nuclear Physics Institute and Juelich Center for Hadron Physics
Experimental Hadron Structure (IKP-1)
http://www.fz-juelich.de/ikp
On 11/07/2017 14:53, Riccardo Ferrari wrote:
Hi,
Are you sure you're feeding the correct data format? I found this conversation
that might be useful:
http://apache-spark-user-list.1001560.n3.nabble.com/Description-of-data-file-sample-libsvm-data-txt-td25832.html
Best,
On Tue, Jul 11, 2017 at 1:42 PM, mckunkel <[email protected]> wrote:
Greetings,
I am following the example on the Apache Spark page for Naive Bayes using Dataset<Row>:
https://spark.apache.org/docs/latest/ml-classification-regression.html#naive-bayes
I want to predict the outcome of another set of data. So instead of
splitting the data into training and testing, I have one set of training data
and one set of testing data, i.e.:

    Dataset<Row> training = spark.createDataFrame(dataTraining, schemaForFrame);
    Dataset<Row> testing = spark.createDataFrame(dataTesting, schemaForFrame);
    NaiveBayes nb = new NaiveBayes();
    NaiveBayesModel model = nb.fit(training);
    Dataset<Row> predictions = model.transform(testing);
    predictions.show();

But I get the following error:
17/07/11 13:40:38 INFO DAGScheduler: Job 2 finished: collect at NaiveBayes.scala:171, took 3.942413 s
Exception in thread "main" org.apache.spark.SparkException: Failed to execute user defined function($anonfun$1: (vector) => vector)
    at org.apache.spark.sql.catalyst.expressions.ScalaUDF.eval(ScalaUDF.scala:1075)
    at org.apache.spark.sql.catalyst.expressions.Alias.eval(namedExpressions.scala:144)
    at org.apache.spark.sql.catalyst.expressions.InterpretedProjection.apply(Projection.scala:48)
    at org.apache.spark.sql.catalyst.expressions.InterpretedProjection.apply(Projection.scala:30)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
...
...
...
How do I perform predictions on other datasets that were not created by a
split?
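
To make the intended workflow concrete, here is a rough sketch of fitting on one set and then predicting on a second set that was built separately rather than by a split. It is a toy multinomial Naive Bayes with add-one smoothing written in plain Java, not Spark's implementation. `TinyNaiveBayes`, its constructor arguments, and the sample numbers are all made up for illustration. The only point it tries to show is that prediction on a separately built set works when each test row has the same number of features as the training rows, and fails otherwise.

```java
import java.util.Arrays;

// Illustrative toy classifier; NOT Spark's NaiveBayes implementation.
class TinyNaiveBayes {
    private final double[] logPrior;        // log P(class)
    private final double[][] logLikelihood; // log P(feature | class)

    // Fits the model on non-negative count features with add-one smoothing.
    TinyNaiveBayes(double[][] features, int[] labels, int numClasses) {
        int numFeatures = features[0].length;
        double[] classCounts = new double[numClasses];
        double[][] featureSums = new double[numClasses][numFeatures];
        for (int i = 0; i < features.length; i++) {
            classCounts[labels[i]] += 1.0;
            for (int j = 0; j < numFeatures; j++) {
                featureSums[labels[i]][j] += features[i][j];
            }
        }
        logPrior = new double[numClasses];
        logLikelihood = new double[numClasses][numFeatures];
        for (int c = 0; c < numClasses; c++) {
            logPrior[c] = Math.log((classCounts[c] + 1.0) / (features.length + numClasses));
            double total = Arrays.stream(featureSums[c]).sum();
            for (int j = 0; j < numFeatures; j++) {
                logLikelihood[c][j] = Math.log((featureSums[c][j] + 1.0) / (total + numFeatures));
            }
        }
    }

    // Returns the class with the highest log score for one row.
    int predict(double[] row) {
        int expected = logLikelihood[0].length;
        if (row.length != expected) {
            throw new IllegalArgumentException(
                    "expected " + expected + " features but got " + row.length);
        }
        int best = 0;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int c = 0; c < logPrior.length; c++) {
            double score = logPrior[c];
            for (int j = 0; j < row.length; j++) {
                score += row[j] * logLikelihood[c][j];
            }
            if (score > bestScore) {
                bestScore = score;
                best = c;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Training set and test set are built independently of each other.
        double[][] trainX = {{3, 0}, {4, 1}, {0, 3}, {1, 4}};
        int[] trainY = {0, 0, 1, 1};
        double[][] testX = {{5, 0}, {0, 2}};

        TinyNaiveBayes model = new TinyNaiveBayes(trainX, trainY, 2);
        for (double[] row : testX) {
            System.out.println(model.predict(row)); // prints 0, then 1
        }
    }
}
```

Passing a row with three features to `predict` in this sketch throws an `IllegalArgumentException`, which is the kind of mismatch worth checking for in the real datasets.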