Thank you Varun, this makes sense.
I understand having a separate process for content ingestion. I was thinking it would 
be a separate Spark job, but it sounds like you are suggesting that ideally I 
should do it outside of Hadoop entirely?
Thanks

Harry



    On Saturday, August 26, 2023 at 09:19:33 AM PDT, Varun Shah <varunshah100...@gmail.com> wrote:
 Hi Harry, 

Ideally, you should not be fetching a URL inside your transformation job; do the 
API calls separately (outside the cluster if possible). Ingesting data should 
be treated separately from transformation / cleaning / join operations. You can 
create another dataframe of URLs, dedup it if required, and store it in a file. 
A normal Python function can then ingest the data for each URL and, after every 
X API calls, create a dataframe and union it with the previous one, finally 
writing out the content and then joining with the original df on URL, if 
required.
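
To make that pattern concrete, here is a minimal sketch. It assumes the 
requests library is available on the ingestion host, and the names (ingest, 
/tmp/url_content.jsonl, df, spark) are all hypothetical; adjust the rate and 
output path to your setup:

    import json
    import time

    import requests  # assumed HTTP client; any client works

    def ingest(urls, out_path, max_calls_per_sec=5):
        """Fetch each URL outside the cluster, writing url+content as JSON lines."""
        min_interval = 1.0 / max_calls_per_sec
        with open(out_path, "w") as out:
            for url in urls:
                started = time.time()
                resp = requests.get(url, timeout=10)
                out.write(json.dumps({"url": url, "content": resp.text}) + "\n")
                # Sleep off the rest of this call's slot to stay under the rate.
                remaining = min_interval - (time.time() - started)
                if remaining > 0:
                    time.sleep(remaining)

    # Back in Spark: read the fetched content and join it onto the original df.
    content_df = spark.read.json("/tmp/url_content.jsonl")
    result_df = df.join(content_df, on="url", how="left")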

If fetching inside Spark is absolutely necessary, here are a few ways to achieve it:

Approach-1:
You can use Spark's foreachPartition, which takes a function that runs once per 
partition. In this function, you can create one connection and cap the API calls 
made from that partition.

This can work if you introduce logic that checks the current number of 
partitions and then distributes the max_api_calls across them. E.g.: if 
no_of_partitions = 4 and total_max_api_calls = 4, then you can pass this 
function a parameter max_partition_api_calls = 1.
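
A rough sketch of that budgeting follows. Since Harry wants the content back as 
a column, it uses mapPartitions rather than foreachPartition (which is 
side-effect only, returning nothing); requests, the hardcoded budget, and the 
url column name are assumptions:

    import requests  # assumed HTTP client
    from pyspark.sql import Row

    MAX_PARTITION_API_CALLS = 1  # e.g. total_max_api_calls / no_of_partitions

    def fetch_with_budget(rows):
        # One connection per partition; stop calling once the budget is spent.
        session = requests.Session()
        calls = 0
        for row in rows:
            if calls < MAX_PARTITION_API_CALLS:
                content = session.get(row.url, timeout=10).text
                calls += 1
            else:
                content = None  # budget exhausted for this partition
            yield Row(url=row.url, content=content)

    result_df = df.rdd.mapPartitions(fetch_with_budget).toDF()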

This approach has a limitation: the maximum allowed API calls must be at least 
the number of partitions.
Approach-2:
An alternative approach is to create the connection outside of the udf with a 
rate limiter (link) and use this connection variable inside the udf in each 
partition, invoking time.sleep between calls. This will definitely introduce 
issues when many partitions try to invoke the API at once.
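
A minimal sketch of this approach as a plain UDF with time.sleep. The delay 
value is an assumption, and the HTTP call is made inside the UDF itself, since 
connection objects created on the driver generally do not serialize cleanly to 
executors. Note the caveat above: each task sleeps independently, so parallel 
partitions multiply the effective rate:

    import time

    import requests  # assumed HTTP client
    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType

    DELAY_S = 1.0  # assumed per-call pause within a single task

    def fetch_throttled(url):
        time.sleep(DELAY_S)  # local throttle only; parallel tasks multiply the rate
        return requests.get(url, timeout=10).text

    fetch_udf = F.udf(fetch_throttled, StringType())
    result_df = df.withColumn("content", fetch_udf(F.col("url")))
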
I found this Medium article which discusses the issue you are facing, but it 
does not offer a solution. Do check the comments as well.

Regards,
Varun


On Sat, Aug 26, 2023 at 10:32 AM Harry Jamison 
<harryjamiso...@yahoo.com.invalid> wrote:

I am using Python 3.7 and Spark 2.4.7.
I am not sure what the best way to do this is.
I have a dataframe with a URL in one of the columns, and I want to download the 
contents of that URL and put it in a new column.
Can someone point me in the right direction on how to do this? I looked at 
UDFs and they seem confusing to me.
Also, is there a good way to rate limit the number of calls I make per second?