Hello experts,

I was wondering whether I could use the approach below to speed up data
loading in Spark.


from concurrent.futures import ThreadPoolExecutor, as_completed

def extract_data_from_mongodb(mongo_config):
    df = glueContext.create_dynamic_frame.from_options(
        connection_type="mongodb",
        connection_options=mongo_config
    )
    return df

# Bounds must be defined before they are referenced in mongo_config.
lower_bound = 0
upper_bound = 200
segment_size = 10

mongo_config = {
    "connection.uri": "mongodb://url",
    "database": "",
    "collection": "",
    "username": "",
    "password": "",
    "partitionColumn": "_id",
    "lowerBound": str(lower_bound),
    "upperBound": str(upper_bound),
}

# Split the key range into segments of 10 and submit each segment to its
# own thread.
segments = [(i, min(i + segment_size, upper_bound))
            for i in range(lower_bound, upper_bound, segment_size)]

with ThreadPoolExecutor() as executor:
    futures = [executor.submit(execution, segment) for segment in segments]
    for future in as_completed(futures):
        try:
            future.result()
        except Exception as e:
            print(f"Error: {e}")

I am trying to use parallel threads so that the segments are pulled
concurrently. Is this an effective approach?
