Hello Everyone,

I have finally posted the question, with the dataset and the column names.

PFB link:

https://stackoverflow.com/questions/72389385/how-to-load-complex-data-using-pyspark

Thanks,
Sid

On Thu, May 26, 2022 at 2:40 AM Bjørn Jørgensen <bjornjorgen...@gmail.com>
wrote:

> Sid, dump one of your files.
>
> https://sparkbyexamples.com/pyspark/pyspark-read-csv-file-into-dataframe/
>
>
>
> On Wed, May 25, 2022, 23:04 Sid <flinkbyhe...@gmail.com> wrote:
>
>> I have 10 columns, but in the dataset I observed that some records have
>> 11 columns of data (for those records the additional column is marked as
>> null). How do I handle this?
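>>
>> Would something like this work? A rough sketch only, with hypothetical
>> column names (col_1 ... col_10 and "path" are placeholders): declare an
>> 11-column schema so the occasional extra value has somewhere to land,
>> then drop it afterwards.
>>
>> from pyspark.sql.types import StructType, StructField, StringType
>>
>> # hypothetical names; replace with the 10 real column names
>> fields = [StructField(f"col_{i}", StringType(), True) for i in range(1, 11)]
>> fields.append(StructField("extra", StringType(), True))  # absorbs the 11th value
>> schema = StructType(fields)
>>
>> df = (spark.read
>>       .option("header", "true")
>>       .schema(schema)
>>       .csv("path"))
>>
>> df = df.drop("extra")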
>>
>> Thanks,
>> Sid
>>
>> On Thu, May 26, 2022 at 2:22 AM Sid <flinkbyhe...@gmail.com> wrote:
>>
>>> How can I do that? Any examples or links, please. This works well with
>>> pandas, I suppose; it's just that I would need to convert back to a Spark
>>> data frame by providing a schema. Since we are using a lower Spark version,
>>> and pandas won't work in a distributed way in the lower versions, I was
>>> wondering whether Spark could handle this in a much better way.
>>>
>>> Thanks,
>>> Sid
>>>
>>> On Thu, May 26, 2022 at 2:19 AM Gavin Ray <ray.gavi...@gmail.com> wrote:
>>>
>>>> Forgot to reply-all last message, whoops. Not very good at email.
>>>>
>>>> You need to normalize the CSV with a parser that can escape commas
>>>> inside of strings. Not sure if Spark has an option for this?
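>>>>
>>>> Maybe something like this would work, if the free-text field is actually
>>>> wrapped in quotes in the file (a rough, untested sketch; "path" is a
>>>> placeholder):
>>>>
>>>> df = (spark.read
>>>>       .option("header", "true")
>>>>       .option("quote", '"')         # character wrapping fields
>>>>       .option("escape", '"')        # escape for embedded quote characters
>>>>       .option("multiLine", "true")  # allow newlines inside quoted fields
>>>>       .csv("path"))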
>>>>
>>>>
>>>> On Wed, May 25, 2022 at 4:37 PM Sid <flinkbyhe...@gmail.com> wrote:
>>>>
>>>>> Thank you so much for your time.
>>>>>
>>>>> I have data like below, which I tried to load by setting multiple
>>>>> options while reading the file; however, I am not able to
>>>>> consolidate the 9th column's data within itself.
>>>>>
>>>>> [image: image.png]
>>>>>
>>>>> I tried the below code:
>>>>>
>>>>> df = (spark.read
>>>>>       .option("header", "true")
>>>>>       .option("multiLine", "true")
>>>>>       .option("inferSchema", "true")
>>>>>       .option("quote", '"')
>>>>>       .option("delimiter", ",")
>>>>>       .csv("path"))
>>>>>
>>>>> What else can I do?
>>>>>
>>>>> Thanks,
>>>>> Sid
>>>>>
>>>>>
>>>>> On Thu, May 26, 2022 at 1:46 AM Apostolos N. Papadopoulos <
>>>>> papad...@csd.auth.gr> wrote:
>>>>>
>>>>>> Dear Sid,
>>>>>>
>>>>>> can you please give us more info? Is it true that every line may have a
>>>>>> different number of columns? Is there any rule followed by every line of
>>>>>> the file? From the information you have sent I cannot fully understand
>>>>>> the "schema" of your data.
>>>>>>
>>>>>> Regards,
>>>>>>
>>>>>> Apostolos
>>>>>>
>>>>>>
>>>>>> On 25/5/22 23:06, Sid wrote:
>>>>>> > Hi Experts,
>>>>>> >
>>>>>> > I have the below CSV data, which is generated automatically. I can't
>>>>>> > change the data manually.
>>>>>> >
>>>>>> > The data looks like below:
>>>>>> >
>>>>>> > 2020-12-12,abc,2000,,INR,
>>>>>> > 2020-12-09,cde,3000,he is a manager,DOLLARS,nothing
>>>>>> > 2020-12-09,fgh,,software_developer,I only manage the development part.
>>>>>> >
>>>>>> > Since I don't have much experience with the other domains.
>>>>>> >
>>>>>> > It is handled by the other people.,INR
>>>>>> > 2020-12-12,abc,2000,,USD,
>>>>>> >
>>>>>> > The third record is the problem: the value was split across lines by a
>>>>>> > newline entered by the user while filling out the form. So, how do I
>>>>>> > handle this?
>>>>>> >
>>>>>> > There are 6 columns and 4 records in total. These are the sample
>>>>>> > records.
>>>>>> >
>>>>>> > Should I load it as an RDD and then maybe use a regex to eliminate
>>>>>> > the new lines? Or how should it be done? With ".\n"?
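>>>>>> >
>>>>>> > This is roughly what I had in mind for the RDD + regex idea (a rough
>>>>>> > sketch, assuming every real record starts with a yyyy-MM-dd date and
>>>>>> > the file is small enough to read whole; "path" is a placeholder):
>>>>>> >
>>>>>> > import re
>>>>>> >
>>>>>> > sc = spark.sparkContext
>>>>>> > raw = sc.wholeTextFiles("path").values().first()
>>>>>> >
>>>>>> > # split only where a newline is followed by a date, so newlines typed
>>>>>> > # inside the free-text column stay within their record
>>>>>> > records = re.split(r"\n(?=\d{4}-\d{2}-\d{2},)", raw)
>>>>>> >
>>>>>> > # collapse the embedded newlines inside each record into spaces
>>>>>> > cleaned = [" ".join(rec.split("\n")) for rec in records if rec.strip()]
>>>>>> >
>>>>>> > df = spark.read.csv(sc.parallelize(cleaned))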
>>>>>> >
>>>>>> > Any suggestions?
>>>>>> >
>>>>>> > Thanks,
>>>>>> > Sid
>>>>>>
>>>>>> --
>>>>>> Apostolos N. Papadopoulos, Associate Professor
>>>>>> Department of Informatics
>>>>>> Aristotle University of Thessaloniki
>>>>>> Thessaloniki, GREECE
>>>>>> tel: ++0030312310991918
>>>>>> email: papad...@csd.auth.gr
>>>>>> twitter: @papadopoulos_ap
>>>>>> web: http://datalab.csd.auth.gr/~apostol
>>>>>>
>>>>>>
>>>>>> ---------------------------------------------------------------------
>>>>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>>>>>
>>>>>>
