Thank you Varun, this makes sense. I understand the idea of a separate process for content ingestion. I was thinking it would be a separate Spark job, but it sounds like you are suggesting that ideally I should do it outside of Hadoop entirely?

Thanks
Harry

On Saturday, August 26, 2023 at 09:19:33 AM PDT, Varun Shah <varunshah100...@gmail.com> wrote:

Hi Harry,

Ideally, you should not fetch a URL inside your transformation job; do the API calls separately (outside the cluster if possible). Ingestion should be treated separately from transformation / cleaning / join operations.

You can create another dataframe of just the URLs, dedup it if required, and store it in a file. A plain Python function then ingests the content for each URL; after every X API calls it creates a dataframe from the results and unions it with the previous one, finally writing out the content and, if required, joining back to the original dataframe on the URL.

If fetching inside the job is absolutely necessary, here are a few ways to achieve it:

Approach 1: Use Spark's foreachPartition, which takes a function that runs once per partition. Inside it you can create a connection and limit the API calls made by that partition. This works if you add logic that checks the current number of partitions and distributes max_api_calls across them, e.g. if no_of_partitions = 4 and total_max_api_calls = 4, pass the function max_partition_api_calls = 1. The limitation is that the total allowed API calls must be at least the number of partitions.

Approach 2: Create the connection outside the function with a rate limiter (link) and use that connection variable inside the function in each partition, invoking time.sleep. This will still run into problems when many partitions try to call the API at once.

I found this Medium article which discusses the issue you are facing, but it does not give a solution. Do check the comments as well.

Regards,
Varun

On Sat, Aug 26, 2023 at 10:32 AM Harry Jamison <harryjamiso...@yahoo.com.invalid> wrote:

I am using Python 3.7 and Spark 2.4.7.

I am not sure what the best way to do this is. I have a dataframe with a URL in one of the columns, and I want to download the contents of that URL and put it in a new column. Can someone point me in the right direction on how to do this? I looked at UDFs and they seem confusing to me.

Also, is there a good way to rate limit the number of calls I make per second?
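
For illustration, a rough sketch of the separate-ingestion workflow Varun describes above. It assumes the source dataframe is named df with a column called "url", that the requests library is available on the driver, and that /tmp/url_content is a writable path; none of those names come from the thread.

    # Sketch only: dedup the URLs, fetch them with a plain Python loop on the
    # driver, batch every X calls into a dataframe, then join back on url.
    import time
    import requests
    from pyspark.sql import SparkSession, Row

    spark = SparkSession.builder.getOrCreate()

    # 1) Extract and dedup the URLs, collecting them to the driver
    #    (fine while the number of distinct URLs is modest).
    urls = [r.url for r in df.select("url").distinct().collect()]

    # 2) Fetch the content outside any Spark transformation, batching every X calls.
    X = 100                    # flush results into a dataframe after this many calls
    MAX_CALLS_PER_SEC = 5      # crude rate limit (assumed value)
    content_df = None
    batch = []

    for i, url in enumerate(urls, start=1):
        resp = requests.get(url, timeout=10)
        batch.append(Row(url=url, content=resp.text))
        time.sleep(1.0 / MAX_CALLS_PER_SEC)
        if i % X == 0 or i == len(urls):
            part = spark.createDataFrame(batch)
            content_df = part if content_df is None else content_df.union(part)
            batch = []

    # 3) Persist the fetched content, then join it back to the original df on url.
    content_df.write.mode("overwrite").parquet("/tmp/url_content")
    result = df.join(spark.read.parquet("/tmp/url_content"), on="url", how="left")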
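
A sketch in the spirit of Approach 1 follows. One deviation: it uses mapPartitions on the underlying RDD rather than foreachPartition, so the fetched content can come back as a new column (foreachPartition is an action and returns nothing). The column name url, the budget numbers, and the use of requests are assumptions, not from the thread.

    import requests
    from pyspark.sql import Row

    total_max_api_calls = 100                      # assumed overall budget
    no_of_partitions = df.rdd.getNumPartitions()
    # Limitation noted in the thread: this only works when the total budget is
    # at least the number of partitions.
    max_partition_api_calls = total_max_api_calls // no_of_partitions

    def fetch_partition(rows):
        # One connection (requests session) per partition, with a hard cap on calls.
        session = requests.Session()
        calls = 0
        for row in rows:
            content = None
            if calls < max_partition_api_calls:
                content = session.get(row.url, timeout=10).text
                calls += 1
            yield Row(url=row.url, content=content)  # rows over the cap keep content=None

    fetched = df.select("url").rdd.mapPartitions(fetch_partition).toDF()
    result = df.join(fetched, on="url", how="left")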
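
And a sketch in the spirit of Approach 2. The rate-limiter library Varun links is not reproduced here, so this uses a minimal hand-rolled time.sleep throttle instead, and the session is created inside the partition function rather than outside so it stays picklable; both are my adaptations. As Varun warns, each partition only throttles itself, so the effective cluster-wide rate is roughly the number of partitions times the per-partition rate.

    import time
    import requests
    from pyspark.sql import Row

    CALLS_PER_SECOND = 2.0   # per partition, not cluster-wide (assumed value)

    def fetch_with_throttle(rows):
        session = requests.Session()
        min_interval = 1.0 / CALLS_PER_SECOND
        last_call = 0.0
        for row in rows:
            # Sleep just long enough to keep this partition under its rate limit.
            wait = min_interval - (time.monotonic() - last_call)
            if wait > 0:
                time.sleep(wait)
            last_call = time.monotonic()
            yield Row(url=row.url, content=session.get(row.url, timeout=10).text)

    fetched = df.select("url").rdd.mapPartitions(fetch_with_throttle).toDF()
    result = df.join(fetched, on="url", how="left")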