Hello everyone, I have finally posted a question with the dataset and the column names.
PFB link: https://stackoverflow.com/questions/72389385/how-to-load-complex-data-using-pyspark

Thanks,
Sid

On Thu, May 26, 2022 at 2:40 AM Bjørn Jørgensen <bjornjorgen...@gmail.com> wrote:

Sid, dump one of your files.

https://sparkbyexamples.com/pyspark/pyspark-read-csv-file-into-dataframe/

On Wed, May 25, 2022 at 23:04 Sid <flinkbyhe...@gmail.com> wrote:

I have 10 columns, but in the dataset I observed that some records have 11 columns of data (the additional column is marked as null). How do I handle this?

Thanks,
Sid

On Thu, May 26, 2022 at 2:22 AM Sid <flinkbyhe...@gmail.com> wrote:

How can I do that? Any examples or links, please. This works well with pandas, I suppose. It's just that I would then need to convert back to a Spark data frame by providing a schema, but since we are on an older Spark version, where pandas won't work in a distributed way, I was wondering if Spark could handle this in a better way.

Thanks,
Sid

On Thu, May 26, 2022 at 2:19 AM Gavin Ray <ray.gavi...@gmail.com> wrote:

Forgot to reply-all on the last message, whoops. Not very good at email.

You need to normalize the CSV with a parser that can escape commas inside of strings. Not sure if Spark has an option for this?

On Wed, May 25, 2022 at 4:37 PM Sid <flinkbyhe...@gmail.com> wrote:

Thank you so much for your time.

I have data like below, which I tried to load by setting multiple options while reading the file, but I am not able to consolidate the 9th column's data within itself.

[image: image.png]

I tried the below code:

    df = (spark.read
          .option("header", "true")
          .option("multiLine", "true")
          .option("inferSchema", "true")
          .option("quote", '"')
          .option("delimiter", ",")
          .csv("path"))

What else can I do?
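[Editor's note] One way to act on Gavin's suggestion, for both problems raised above (fields containing commas or newlines, and the occasional 11th column): run the file through Python's stdlib `csv` parser, which is quote-aware, and force every row to the expected width before handing the data to Spark. A minimal sketch — `normalize_csv`, the 3-column width, and the pad/trim policy are illustrations, not an existing API:

```python
import csv
import io

def normalize_csv(text, expected_cols):
    """Parse CSV with a quote-aware parser and force every row to
    exactly `expected_cols` fields: pad short rows with empty
    strings, drop trailing extras on long rows."""
    rows = []
    for row in csv.reader(io.StringIO(text)):
        if len(row) < expected_cols:
            row = row + [""] * (expected_cols - len(row))
        elif len(row) > expected_cols:
            row = row[:expected_cols]
        rows.append(row)
    return rows

# A quoted field with an embedded comma and newline, plus a row
# that carries one extra column.
raw = 'a,"hello, world",c\nd,"line 1\nline 2",f,EXTRA\n'
for row in normalize_csv(raw, expected_cols=3):
    print(row)
# ['a', 'hello, world', 'c']
# ['d', 'line 1\nline 2', 'f']
```

The normalized rows could then be turned into a DataFrame with an explicit schema (e.g. `spark.createDataFrame(rows, schema)`), sidestepping `inferSchema` on dirty input.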
Thanks,
Sid

On Thu, May 26, 2022 at 1:46 AM Apostolos N. Papadopoulos <papad...@csd.auth.gr> wrote:

Dear Sid,

can you please give us more info? Is it true that every line may have a different number of columns? Is there any rule followed by every line of the file? From the information you have sent, I cannot fully understand the "schema" of your data.

Regards,

Apostolos

On 25/5/22 23:06, Sid wrote:

Hi Experts,

I have the below CSV data, which is generated automatically; I can't change the data manually. It looks like this:

2020-12-12,abc,2000,,INR,
2020-12-09,cde,3000,he is a manager,DOLLARS,nothing
2020-12-09,fgh,,software_developer,I only manage the development part.

Since I don't have much experience with the other domains.

It is handled by the other people.,INR
2020-12-12,abc,2000,,USD,

The third record is the problem: the user pressed Enter while filling up the form, so its value is split across new lines. How do I handle this?

There are 6 columns and 4 records in total; these are sample records.

Should I load it as an RDD and then maybe eliminate the new lines with a regex — something like ". /n"? Or how should it be done?

Any suggestions?

Thanks,
Sid

--
Apostolos N. Papadopoulos, Associate Professor
Department of Informatics
Aristotle University of Thessaloniki
Thessaloniki, GREECE
tel: ++0030312310991918
email: papad...@csd.auth.gr
twitter: @papadopoulos_ap
web: http://datalab.csd.auth.gr/~apostol
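[Editor's note] On Sid's original sample: the multiline value in the third record is not quoted, so no combination of `multiLine`/`quote` options can recover it; some pre-pass is needed. If — and this is an assumption — every genuine record starts with a yyyy-MM-dd date, the RDD/regex idea can be sketched in plain Python (the same function would work on file contents read whole, e.g. via `wholeTextFiles`, since continuation lines must not be split across partitions):

```python
import re

# Assumption: every real record begins with a yyyy-MM-dd date.
RECORD_START = re.compile(r"^\d{4}-\d{2}-\d{2},")

def merge_broken_records(lines):
    """Glue physical lines that do not start a new record onto the
    previous record, separated by a single space."""
    records = []
    for line in lines:
        line = line.rstrip("\n")
        if RECORD_START.match(line):
            records.append(line)
        elif line.strip() and records:  # skip blank filler lines
            records[-1] += " " + line.strip()
    return records

sample = """2020-12-12,abc,2000,,INR,
2020-12-09,cde,3000,he is a manager,DOLLARS,nothing
2020-12-09,fgh,,software_developer,I only manage the development part.

Since I don't have much experience with the other domains.

It is handled by the other people.,INR
2020-12-12,abc,2000,,USD,
""".splitlines()

records = merge_broken_records(sample)
print(len(records))  # 4 records, each with 6 comma-separated fields
```

Whether a space or some other joiner is right depends on what the downstream consumer wants in the free-text field; the key point is that only lines matching the record pattern open a new row.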