Hi,

This looks like a very specific, narrowly scoped problem.

Do you think you could load the data into a pandas DataFrame and then load it
back into Spark (for example via createDataFrame, or a pandas UDF)?
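A rough sketch of that approach (untested; the path and sheet name below are
placeholders, and reading .xlsx with pandas needs openpyxl installed):

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read the workbook on the driver with pandas, then hand it to Spark.
pdf = pd.read_excel("/path/to/test_excel.xlsx", sheet_name="Sheet1")

# Spark infers the schema from the pandas dtypes; note that the whole file
# has to fit in driver memory for this to work.
df = spark.createDataFrame(pdf)
df.show()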

Koalas is now natively integrated with Spark (as the pandas API on Spark,
pyspark.pandas, from Spark 3.2 onwards), so see whether you can use those
features.
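If you are on Spark 3.2+, something along these lines might work (again just a
sketch, with the same placeholder path and openpyxl requirement as above):

import pyspark.pandas as ps

# pandas-on-Spark version of read_excel
psdf = ps.read_excel("/path/to/test_excel.xlsx", sheet_name="Sheet1")

# Convert to a regular Spark DataFrame if the rest of your code expects one
sdf = psdf.to_spark()
sdf.show()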


Regards,
Gourav

On Wed, Feb 23, 2022 at 1:31 PM Sid <flinkbyhe...@gmail.com> wrote:

> I have an Excel file which unfortunately cannot be converted to CSV format,
> and I am trying to load it using the pyspark shell.
>
> I tried invoking the below pyspark session with the jars provided.
>
> pyspark --jars
> /home/siddhesh/Downloads/spark-excel_2.12-0.14.0.jar,/home/siddhesh/Downloads/xmlbeans-5.0.3.jar,/home/siddhesh/Downloads/commons-collections4-4.4.jar,/home/siddhesh/Downloads/poi-5.2.0.jar,/home/siddhesh/Downloads/poi-ooxml-5.2.0.jar,/home/siddhesh/Downloads/poi-ooxml-schemas-4.1.2.jar,/home/siddhesh/Downloads/slf4j-log4j12-1.7.28.jar,/home/siddhesh/Downloads/log4j-1.2-api-2.17.1.jar
>
> and below is the code to read the Excel file:
>
> df = spark.read.format("excel") \
>      .option("dataAddress", "'Sheet1'!") \
>      .option("header", "true") \
>      .option("inferSchema", "true") \
> .load("/home/.../Documents/test_excel.xlsx")
>
> It is giving me the below error message:
>
>  java.lang.NoClassDefFoundError: org/apache/logging/log4j/LogManager
>
> I tried several Jars for this error but no luck. Also, what would be the
> efficient way to load it?
>
> Thanks,
> Sid
>
