Hi Gourav,

Please see the link below for a detailed explanation:

https://stackoverflow.com/questions/72389385/how-to-load-complex-data-using-pyspark/72391090#72391090

@Bjørn Jørgensen <bjornjorgen...@gmail.com>:

I was able to read this kind of data using the below code:

spark.read.option("header",True).option("multiline","true").option("escape","\"").csv("sample1.csv")
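
Written out more readably, with a quick check added to verify the load (a
minimal sketch; it assumes the same sample1.csv file as above and a standard
SparkSession):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = (
    spark.read
    .option("header", True)       # first line holds the column names
    .option("multiLine", True)    # quoted values may span multiple lines
    .option("escape", "\"")       # a quote inside a quoted value is escaped by another quote
    .csv("sample1.csv")
)

df.printSchema()                  # confirm all columns were inferred
df.show(truncate=False)           # a multi-line value should land in a single row (if quoted in the file)
print(df.count())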


Also, I have a question about one of my columns, which has data like the
below:


[image: image.png]


Have a look at the second record. Should I mark it as a corrupt record, or
is there any way to process such records?


Thanks,

Sid





On Thu, May 26, 2022 at 10:54 PM Gourav Sengupta <gourav.sengu...@gmail.com>
wrote:

> Hi,
> can you please give us a simple map of what the input is and what the
> output should be like? From your description, it is a bit difficult to
> figure out how exactly you want the records parsed.
>
>
> Regards,
> Gourav Sengupta
>
> On Wed, May 25, 2022 at 9:08 PM Sid <flinkbyhe...@gmail.com> wrote:
>
>> Hi Experts,
>>
>> I have the below CSV data, which is generated automatically. I can't
>> change the data manually.
>>
>> The data looks like below:
>>
>> 2020-12-12,abc,2000,,INR,
>> 2020-12-09,cde,3000,he is a manager,DOLLARS,nothing
>> 2020-12-09,fgh,,software_developer,I only manage the development part.
>>
>> Since I don't have much experience with the other domains.
>>
>> It is handled by the other people.,INR
>> 2020-12-12,abc,2000,,USD,
>>
>> The third record is a problem, since the user split the value across new
>> lines while filling up the form. So, how do I handle this?
>>
>> There are 6 columns and 4 records in total. These are the sample records.
>>
>> Should I load it as an RDD and then maybe use a regex to eliminate the
>> new lines? Or how should it be done? With ". /n"?
>>
>> Any suggestions?
>>
>> Thanks,
>> Sid
>>
>
