Re: Complexity with the data

Sid Wed, 25 May 2022 13:37:15 -0700

Thank you so much for your time.

I have data like the sample below. I tried to load it by setting multiple
options while reading the file, but I am not able to keep the 9th column's
data together in a single field.


[image: image.png]

I tried the below code:

df = (
    spark.read.option("header", "true")
    .option("multiline", "true")
    .option("inferSchema", "true")
    .option("quote", '"')
    .option("delimiter", ",")
    .csv("path")
)

What else can I do?
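
For reference, the multiline option only keeps a record together when the
field containing the line breaks is enclosed in quotes, which is not the case
in the sample rows quoted further down. One possible workaround, as a minimal
sketch only: merge the physical lines back into logical records before
parsing. The helper name merge_broken_records and the rule "every record
starts with a yyyy-mm-dd date followed by a comma" are assumptions taken from
the sample rows, not something Spark provides. The sketch also assumes the
free-text field contains no commas.

```python
import re

# Assumption: every logical record begins with an ISO date and a comma.
RECORD_START = re.compile(r"^\d{4}-\d{2}-\d{2},")


def merge_broken_records(lines):
    """Join physical lines into logical records.

    A non-blank line that does not look like the start of a record is
    treated as a continuation of the previous record and appended to it
    with a single space. Blank lines are dropped.
    """
    records = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if RECORD_START.match(line) or not records:
            records.append(line)
        else:
            records[-1] = records[-1] + " " + line
    return records


# The sample rows from the original question, as physical lines.
sample = [
    "2020-12-12,abc,2000,,INR,",
    "2020-12-09,cde,3000,he is a manager,DOLLARS,nothing",
    "2020-12-09,fgh,,software_developer,I only manage the development part.",
    "",
    "Since I don't have much experience with the other domains.",
    "",
    "It is handled by the other people.,INR",
    "2020-12-12,abc,2000,,USD,",
]

merged = merge_broken_records(sample)
# Naive split; valid only under the no-commas-in-free-text assumption.
rows = [record.split(",") for record in merged]
```

If the file fits in driver memory, the merged strings could then be handed
back to Spark, for example with spark.sparkContext.parallelize(merged)
followed by spark.read.csv on the resulting RDD. For large files the merge
step would need to run per file (e.g. via wholeTextFiles), since continuation
lines must stay next to the record they belong to.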

Thanks,
Sid


On Thu, May 26, 2022 at 1:46 AM Apostolos N. Papadopoulos <
papad...@csd.auth.gr> wrote:

> Dear Sid,
>
> can you please give us more info? Is it true that every line may have a
> different number of columns? Is there any rule followed by every line of
> the file? From the information you have sent I cannot fully understand
> the "schema" of your data.
>
> Regards,
>
> Apostolos
>
>
> On 25/5/22 23:06, Sid wrote:
> > Hi Experts,
> >
> > I have the CSV data below, which is generated automatically. I can't
> > change the data manually.
> >
> > The data looks like below:
> >
> > 2020-12-12,abc,2000,,INR,
> > 2020-12-09,cde,3000,he is a manager,DOLLARS,nothing
> > 2020-12-09,fgh,,software_developer,I only manage the development part.
> >
> > Since I don't have much experience with the other domains.
> >
> > It is handled by the other people.,INR
> > 2020-12-12,abc,2000,,USD,
> >
> > The third record is the problem, since the user split the value across
> > new lines while filling in the form. So, how do I handle this?
> >
> > There are 6 columns and 4 records in total. These are the sample records.
> >
> > Should I load it as an RDD and then maybe use a regex to eliminate
> > the new lines? Or how should it be done? With ". \n"?
> >
> > Any suggestions?
> >
> > Thanks,
> > Sid
>
> --
> Apostolos N. Papadopoulos, Associate Professor
> Department of Informatics
> Aristotle University of Thessaloniki
> Thessaloniki, GREECE
> tel: ++0030312310991918
> email: papad...@csd.auth.gr
> twitter: @papadopoulos_ap
> web: http://datalab.csd.auth.gr/~apostol
>
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>
