Re: Spark 2.4.7

2023-08-26 Thread Mich Talebzadeh
Sorry for forgetting. Add this line to the top of the code: import sys

Re: Spark 2.4.7

2023-08-26 Thread Mich Talebzadeh
Hi guys, You can try the code below in PySpark, relying on the urllib library to download the contents of the URL and then create a new column in the DataFrame to store the downloaded contents. Spark 4.3.0 The limit explained by Varun from pyspark.sql import SparkSession from pyspark.sql.functions
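
The code in the message above is cut off in this archive view, so what follows is not Mich's code. It is only a minimal sketch of the approach he describes: download each URL with urllib and store the result in a new DataFrame column. The names download_url_content and add_content_column, the column names "url" and "content", and the 10-second timeout are all assumptions made for illustration.

```python
import urllib.request
from typing import Optional


def download_url_content(url: Optional[str], timeout: float = 10.0) -> Optional[str]:
    """Return the body of `url` as text, or None if it cannot be fetched."""
    if url is None:
        return None
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read().decode("utf-8", errors="replace")
    except (OSError, ValueError):
        # URLError/HTTPError are OSError subclasses; ValueError covers
        # strings that urllib does not recognise as a URL.
        return None


def add_content_column(df, url_col: str = "url", out_col: str = "content"):
    """Add `out_col` holding the downloaded text of `url_col` (needs pyspark).

    Hypothetical helper: the column names are assumptions, not from the thread.
    """
    # Imported lazily so download_url_content works without Spark installed.
    from pyspark.sql.functions import col, udf
    from pyspark.sql.types import StringType

    fetch_udf = udf(download_url_content, StringType())
    return df.withColumn(out_col, fetch_udf(col(url_col)))
```

Returning None on failure keeps one bad URL from failing the whole Spark task, at the cost of hiding the reason for the failure. Each row still triggers a network call from an executor, which is the pattern Varun advises against elsewhere in this thread.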

Re: Spark 2.4.7

2023-08-26 Thread Harry Jamison
Thank you Varun, this makes sense. I understand the case for a separate process for content ingestion. I was thinking it would be a separate Spark job, but it sounds like you are suggesting that ideally I should do it outside of Hadoop entirely? Thanks, Harry

Re: Spark 2.4.7

2023-08-26 Thread Varun Shah
Hi Harry, Ideally, you should not fetch a URL in your transformation job; make the API calls separately (outside the cluster if possible). Ingesting data should be treated separately from transformation / cleaning / join operations. You can create another DataFrame of URLs, dedup if require
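
One way to picture the separation Varun describes is a small ingestion step that runs outside the cluster: deduplicate the URLs first, fetch each one once, and hand the results to the Spark job afterwards. This is only a sketch under assumptions; fetch_text and ingest_urls are hypothetical names, and how the output is stored and joined back is left open because the message is truncated.

```python
import urllib.request
from typing import Callable, Dict, Iterable, Optional


def fetch_text(url: str, timeout: float = 10.0) -> Optional[str]:
    """Fetch one URL and return its body as text, or None on failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read().decode("utf-8", errors="replace")
    except (OSError, ValueError):
        return None


def ingest_urls(
    urls: Iterable[Optional[str]],
    fetch: Callable[[str], Optional[str]] = fetch_text,
) -> Dict[str, Optional[str]]:
    """Fetch each distinct, non-empty URL exactly once.

    `fetch` is injectable so the step can be tested without network access.
    """
    # dict.fromkeys removes duplicates while keeping first-seen order.
    unique_urls = list(dict.fromkeys(u for u in urls if u))
    return {u: fetch(u) for u in unique_urls}
```

The resulting mapping could be written to a file or table and joined to the main DataFrame on the URL column, so the transformation job itself never makes a network call.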