Thank you Varun, this makes sense.
I understand using a separate process for content ingestion. I was thinking it
would be a separate Spark job, but it sounds like you are suggesting that
ideally I should do it outside of Hadoop entirely?
Thanks
Harry
On Saturday, August 26, 2023 at 09:19:33 AM PDT, Varun Shah
<[email protected]> wrote:
Hi Harry,
Ideally, you should not be fetching a URL inside your transformation job; do the
API calls separately (outside the cluster if possible). Ingesting data should
be treated separately from transformation / cleaning / join operations. You can
create another DataFrame of URLs, dedup it if required, and store it in a file.
A normal Python function can then ingest the content for each URL and, after
every X API calls, create a DataFrame and union it with the previous one,
finally writing out the content and then joining it with the original df on the
URL column, if required.
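A minimal sketch of that flow, assuming an example DataFrame with a "url"
column, a hypothetical fetch_url() helper, an illustrative batch size X, and
paths that are readable from the ingestion script (adjust for your storage
layer):

import glob
import time
import requests
from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("http://example.com/a", 1),
                            ("http://example.com/b", 2)], ["url", "id"])

# 1) In the transformation job: dedup the urls and write them to a file.
df.select("url").distinct().coalesce(1).write.mode("overwrite").text("/tmp/urls")

# 2) In a separate ingestion process: plain Python fetches each url; after
#    every X api calls the fetched rows become a DataFrame and are unioned.
X = 100                                     # illustrative batch size

def fetch_url(url):                         # hypothetical helper
    return requests.get(url, timeout=10).text

urls = [line.strip() for path in glob.glob("/tmp/urls/part-*")
        for line in open(path) if line.strip()]
content_df, batch = None, []
for i, url in enumerate(urls, start=1):
    batch.append(Row(url=url, content=fetch_url(url)))
    if i % X == 0 or i == len(urls):
        batch_df = spark.createDataFrame(batch)
        content_df = batch_df if content_df is None else content_df.union(batch_df)
        batch = []
        time.sleep(1)                       # simple pause between batches

content_df.write.mode("overwrite").parquet("/tmp/url_content")

# 3) Finally, join the fetched content back to the original df on url.
result = df.join(spark.read.parquet("/tmp/url_content"), on="url", how="left")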
If fetching inside the Spark job is absolutely necessary, here are a few ways
to achieve this:
Approach-1:
You can use Spark's foreachPartition, which requires a function that runs once
per partition. In this function, you can create a connection and limit the API
calls made for that partition.
This can work if you introduce logic that checks the current number of
partitions and then distributes the max API calls across them. E.g., if
no_of_partitions = 4 and total_max_api_calls = 4, you can pass this function a
parameter max_partition_api_calls = 1.
This approach has a limitation: it requires the max allowed API calls to be at
least as large as the number of partitions.
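A minimal sketch of Approach-1, assuming an example DataFrame of URLs and an
illustrative budget (total_max_api_calls is not a Spark setting, just a name
used here):

import requests
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
url_df = spark.createDataFrame([("http://example.com/a",),
                                ("http://example.com/b",)], ["url"])

total_max_api_calls = 4
no_of_partitions = url_df.rdd.getNumPartitions()
# distribute the budget; assumes total_max_api_calls >= no_of_partitions
max_partition_api_calls = total_max_api_calls // no_of_partitions

def fetch_partition(rows):
    session = requests.Session()          # one connection per partition
    calls = 0
    for row in rows:
        if calls >= max_partition_api_calls:
            break                         # stop once this partition's budget is spent
        resp = session.get(row.url, timeout=10)
        calls += 1
        # persist resp.text somewhere durable (HDFS, object store, a DB);
        # foreachPartition returns nothing, so results must be written here

url_df.foreachPartition(fetch_partition)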
Approach-2:
An alternative approach is to create the connection outside of the udf, with a
rate limiter (link), and use this connection variable inside the udf in each
partition, invoking time.sleep between calls. This will still run into issues
when many partitions are trying to invoke the API at the same time.
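A minimal sketch of Approach-2, using a crude time.sleep-based limiter (the
CALLS_PER_SECOND value is illustrative). Note that each executor worker process
gets its own copy of these globals, so the cluster-wide rate is roughly the
number of parallel tasks times CALLS_PER_SECOND:

import time
import requests
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()

CALLS_PER_SECOND = 2
_min_interval = 1.0 / CALLS_PER_SECOND
_session = requests.Session()          # connection created outside the udf
_last_call = [0.0]                     # mutable holder so the udf can update it

def fetch_url(url):
    # crude rate limiter: sleep until at least _min_interval has elapsed
    wait = _last_call[0] + _min_interval - time.time()
    if wait > 0:
        time.sleep(wait)
    _last_call[0] = time.time()
    try:
        return _session.get(url, timeout=10).text
    except requests.RequestException:
        return None

fetch_udf = udf(fetch_url, StringType())

url_df = spark.createDataFrame([("http://example.com/a",)], ["url"])
result = url_df.withColumn("content", fetch_udf("url"))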
I found this Medium article which discusses the issue you are facing, but does
not discuss a solution for it. Do check the comments as well.
Regards,
Varun
On Sat, Aug 26, 2023 at 10:32 AM Harry Jamison
<[email protected]> wrote:
I am using python 3.7 and Spark 2.4.7
I am not sure what the best way to do this is.
I have a dataframe with a url in one of the columns, and I want to download the
contents of that url and put it in a new column.
Can someone point me in the right direction on how to do this? I looked at
UDFs and they seem confusing to me.
Also, is there a good way to rate limit the number of calls I make per second?