You could use .option("nullValue", "+") to tell the parser that '+'
refers to "no value":
spark.read
.option("inferSchema", "true")
.option("header", "true")
.option("nullValue", "+")
.csv("path")
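In the Java API used elsewhere in this thread, the same suggestion would read roughly as follows (a sketch, assuming an existing SparkSession named `spark`; "path" is a placeholder):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// "+" cells are parsed as null, so columns like c1 and c6 are no longer
// forced to StringType and can be inferred as their real types.
Dataset<Row> dataset = spark.read()
    .option("inferSchema", "true")
    .option("header", "true")
    .option("nullValue", "+")
    .csv("path");
```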
Enrico
On 04.06.22 18:54, marc nicole wrote:
c1   c2     c3   c4   c5   c6
1.2  true   A    Z    120  +
1.3  false  B    X    130  F
+    true   C    Y    200  G
In the above table c1 has double values except on the last row, so:
Dataset<Row> dataset =
spark.read().format("csv").option("inferSchema", "true").option("header", "true").load("path");
will yield StringType as the type for column c1, and similarly for c6.
I want to recover the true type of each column by first discarding the "+" values.
I use Dataset<String> after filtering the rows (removing "+") because I can then re-read the new dataset using the .csv() method.
Any better idea to do that?
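For reference, the filter-then-re-read idea described above could be sketched like this (a rough sketch, assuming an existing SparkSession `spark` and a placeholder "path"; spark.read().csv(Dataset&lt;String&gt;) accepts an in-memory dataset of CSV lines since Spark 2.2, and the header is dropped here for simplicity):

```java
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;

// First pass: read everything as plain strings, no type inference.
Dataset<Row> raw = spark.read()
    .option("header", "true")
    .csv("path");

// Render each row back into one CSV line, turning "+" cells into empty fields.
Dataset<String> lines = raw.map(
    (MapFunction<Row, String>) row -> {
      StringBuilder sb = new StringBuilder();
      for (int i = 0; i < row.size(); i++) {
        if (i > 0) sb.append(',');
        String v = row.isNullAt(i) ? "" : row.getString(i);
        if (!"+".equals(v)) sb.append(v);
      }
      return sb.toString();
    },
    Encoders.STRING());

// Second pass: re-read the cleaned lines and let Spark infer the real types.
Dataset<Row> typed = spark.read()
    .option("inferSchema", "true")
    .csv(lines);
```

That said, the nullValue option above achieves the same result in a single read, without the extra round trip.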
On Sat, Jun 4, 2022 at 18:40, Enrico Minack <i...@enrico.minack.dev>
wrote:
Can you provide an example string (row) and the expected inferred
schema?
Enrico
On 04.06.22 18:36, marc nicole wrote:
How to do just that? I thought we can only infer the schema when we
first read the dataset, or am I wrong?
On Sat, Jun 4, 2022 at 18:10, Sean Owen <sro...@gmail.com> wrote:
It sounds like you want to interpret the input as strings, do
some processing, then infer the schema. That has nothing to
do with construing the entire row as a string like
"Row[foo=bar, baz=1]"
On Sat, Jun 4, 2022 at 10:32 AM marc nicole
<mk1853...@gmail.com> wrote:
Hi Sean,
Thanks. Actually I have a dataset where I want to
infer the schema after discarding the specific string value
"+". I do this because the column would otherwise be
considered StringType, while if I remove that "+" value it
would be considered DoubleType, for example, or something else.
Basically I want to remove "+" from all dataset rows and
then infer the schema.
Here my idea is to filter out the rows equal to "+" for
the target columns (potentially all of them) and then use
spark.read().csv() to read the new filtered dataset with
the inferSchema option, which would then yield the correct
column types.
What do you think?
On Sat, Jun 4, 2022 at 15:56, Sean Owen <sro...@gmail.com>
wrote:
I don't think you want to do that. You get a string
representation of structured data without the
structure, at best. This is part of the reason it
doesn't work directly this way.
You can use a UDF to call .toString on the Row of
course, but again: what are you really trying to do?
On Sat, Jun 4, 2022 at 7:35 AM marc nicole
<mk1853...@gmail.com> wrote:
Hi,
How to convert a Dataset<Row> to a Dataset<String>?
What i have tried is:
List<String> list =
dataset.as(Encoders.STRING()).collectAsList();
Dataset<String> datasetSt =
spark.createDataset(list, Encoders.STRING());
// But this line raises an org.apache.spark.sql.AnalysisException:
// Try to map struct... to Tuple1, but failed as the number
// of fields does not line up
The columns are all of type String.
How to solve this?
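The Tuple1 error arises because Encoders.STRING() expects a single-column dataset, while each Row here has several fields. A minimal sketch of a workaround (assuming the goal is one comma-separated line per row, and an existing Dataset&lt;Row&gt; named `dataset`) is to map each Row to a String explicitly:

```java
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;

// Row.mkString(",") renders all fields of the row as one comma-separated
// String, so the STRING encoder only has to handle a single value per row.
Dataset<String> datasetSt = dataset.map(
    (MapFunction<Row, String>) row -> row.mkString(","),
    Encoders.STRING());
```

This avoids the collectAsList() round trip as well, since the mapping stays distributed.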