You could use .option("nullValue", "+") to tell the parser that '+' refers to "no value":

spark.read
     .option("inferSchema", "true")
     .option("header", "true")
     .option("nullValue", "+")
     .csv("path")
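For illustration, here is a small standard-library sketch (not Spark's actual inference code) of why nullValue changes the inferred type: values equal to the configured token are treated as missing and skipped, so they no longer force the column to StringType.

```python
import csv
import io

# Minimal sketch of schema inference with a nullValue token (NOT Spark's
# real implementation): values equal to null_value are skipped when
# deciding a column's type.
def infer_type(values, null_value="+"):
    non_null = [v for v in values if v != null_value]
    for cast, spark_type in ((int, "IntegerType"), (float, "DoubleType")):
        try:
            for v in non_null:
                cast(v)
            return spark_type
        except ValueError:
            continue
    if non_null and all(v in ("true", "false") for v in non_null):
        return "BooleanType"
    return "StringType"

data = "c1,c6\n1.2,+\n1.3,F\n+,G\n"
rows = list(csv.reader(io.StringIO(data)))
columns = list(zip(*rows[1:]))  # transpose data rows into columns

print(infer_type(columns[0]))  # DoubleType  ('+' no longer forces StringType)
print(infer_type(columns[1]))  # StringType
```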

Enrico


On 04.06.22 at 18:54, marc nicole wrote:

c1     c2      c3    c4    c5     c6
1.2    true    A     Z     120    +
1.3    false   B     X     130    F
+      true    C     Y     200    G

In the above table c1 has double values except on the last row, so:

Dataset<Row> dataset = spark.read().format("csv").option("inferSchema", "true").option("header", "true").load("path");

will yield StringType as the type for column c1, and similarly for c6.
I want to recover the true type of each column by first discarding the "+" values.
I use Dataset<String> after filtering the rows (removing "+") because I can re-read the new dataset using the .csv() method.
Any better idea to do that?
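The filter-then-re-read approach described above can be sketched with the standard library alone (not Spark, and assuming the data fits in memory): drop every row that contains the "+" placeholder, then let inference run on what is left.

```python
import csv
import io

# Sketch of the approach described above: drop every row containing the
# "+" placeholder before re-running type inference on the remainder.
def drop_placeholder_rows(text, placeholder="+"):
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    kept = [row for row in reader if placeholder not in row]
    return header, kept

data = "c1,c2\n1.2,true\n1.3,false\n+,true\n"
header, kept = drop_placeholder_rows(data)
print(kept)  # [['1.2', 'true'], ['1.3', 'false']]
```

Note the trade-off: dropping whole rows also discards the valid values in the other columns of those rows, which the nullValue option avoids.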

On Sat, Jun 4, 2022 at 6:40 PM, Enrico Minack <i...@enrico.minack.dev> wrote:

    Can you provide an example string (row) and the expected inferred
    schema?

    Enrico


    On 04.06.22 at 18:36, marc nicole wrote:
    How to do just that? I thought we could only inferSchema when we
    first read the dataset, or am I wrong?

    On Sat, Jun 4, 2022 at 6:10 PM, Sean Owen <sro...@gmail.com> wrote:

        It sounds like you want to interpret the input as strings, do
        some processing, then infer the schema. That has nothing to
        do with construing the entire row as a string like
        "Row[foo=bar, baz=1]"

        On Sat, Jun 4, 2022 at 10:32 AM marc nicole
        <mk1853...@gmail.com> wrote:

            Hi Sean,

            Thanks, actually I have a dataset where I want to
            inferSchema after discarding the specific String value of
            "+". I do this because the column would be considered
            StringType, while if I remove that "+" value it would be
            considered DoubleType, for example, or something else.
            Basically I want to remove "+" from all dataset rows and
            then inferSchema.
            Here my idea is to filter the rows not equal to "+" for
            the target columns (potentially all of them) and then use
            spark.read().csv() to read the new filtered dataset with
            the option inferSchema which would then yield correct
            column types.
            What do you think?

            On Sat, Jun 4, 2022 at 3:56 PM, Sean Owen <sro...@gmail.com>
            wrote:

                I don't think you want to do that. You get a string
                representation of structured data without the
                structure, at best. This is part of the reason it
                doesn't work directly this way.
                You can use a UDF to call .toString on the Row of
                course, but, again what are you really trying to do?

                On Sat, Jun 4, 2022 at 7:35 AM marc nicole
                <mk1853...@gmail.com> wrote:

                    Hi,
                    How to convert a Dataset<Row> to a Dataset<String>?
                    What I have tried is:

                    List<String> list =
                    dataset.as(Encoders.STRING()).collectAsList();
                    Dataset<String> datasetSt =
                    spark.createDataset(list, Encoders.STRING());
                    // But this line raises an
                    // org.apache.spark.sql.AnalysisException: Try to
                    // map struct... to Tuple1, but failed as the number
                    // of fields does not line up

                    All columns are of type String.
                    How to solve this?
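The AnalysisException above arises because Encoders.STRING() expects a single string column, not a multi-column struct; each Row must first be flattened into one string. A minimal sketch of that per-row flattening, with rows modeled as plain tuples as a stand-in for Spark's Row (in the Java API the analogous call would likely be dataset.map((MapFunction<Row, String>) row -> row.mkString(","), Encoders.STRING())):

```python
# Hypothetical stand-in for a Dataset<Row>: each row is a tuple of field
# values. Joining the fields of each row into one delimited string is the
# per-row mapping the string encoder cannot perform on its own.
rows = [("1.2", "true", "A"), ("1.3", "false", "B")]
as_strings = [",".join(fields) for fields in rows]
print(as_strings)  # ['1.2,true,A', '1.3,false,B']
```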

