Hi,

I'm looking for a solution to the following:


We use Apache Spark extensively and need a robust mechanism to log Spark
job execution details, specifically:

• IAM roles accessing data

• Databases and tables accessed

• Timestamps of access


I don’t need query-level details, and I want to avoid using Spark Listeners
due to concerns about being flooded with too many events. Instead, I’m
considering using a Log4j-based configuration for Spark logging.


Could you guide me on the steps to implement this approach, including:

1. How to configure Log4j for Spark to log meaningful execution details
(e.g., database/table access, IAM role, and timestamps)? A rough sketch of
what I'm considering follows this list.

2. How to design a scalable, efficient logging architecture that integrates
well with our existing framework?
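
For context, here is a rough sketch of the direction I'm exploring: obtaining
a Log4j logger through the PySpark JVM gateway and emitting one structured
audit line per table access. All names here (the logger name,
log_table_access, the "access_audit" prefix) are placeholders I made up, and
I'm assuming boto3 and AWS credentials are available on the driver to resolve
the caller's IAM role:

import datetime

import boto3
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("access-audit-sketch").getOrCreate()

# Reuse the driver's Log4j setup so audit lines go through the same
# appenders already configured for Spark (the logger name is a placeholder).
audit_logger = spark._jvm.org.apache.log4j.LogManager.getLogger(
    "com.ourframework.audit"
)

# Resolve the IAM identity the job runs under (assumes AWS credentials are
# available, e.g. via the instance profile or an assumed role).
caller_arn = boto3.client("sts").get_caller_identity()["Arn"]


def log_table_access(database, table):
    """Emit one structured audit line per database/table access."""
    audit_logger.info(
        "access_audit db=%s table=%s role=%s ts=%sZ"
        % (database, table, caller_arn, datetime.datetime.utcnow().isoformat())
    )


# Example usage: record the access, then read the table as usual.
log_table_access("sales_db", "transactions")
df = spark.table("sales_db.transactions")

My understanding is that on Spark 3.3+ the org.apache.log4j classes resolve
through the Log4j 1.x bridge, so these lines should still reach whatever
appenders we configure, but I'd like confirmation that this is a sane
starting point before building on it.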


Additional Context:

• We don’t run Spark jobs directly on EMR clusters; instead, we use a
custom framework on top of Spark.

• The framework supports:

1. Low-code pipelines configured with YAML.

2. Custom PySpark code implemented via reusable components (a simplified,
hypothetical sketch of such a component follows).
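
To show how the custom-code pattern might pick this up, below is a
simplified, hypothetical reader component; AuditedReader and read_table are
illustrative names, not our real framework API. The idea is that both the
YAML-driven pipelines and hand-written PySpark code would construct their
reads through a wrapper like this, so the audit line is emitted in exactly
one place:

class AuditedReader:
    """Wraps table reads so every access goes through one audit call."""

    def __init__(self, spark, log_table_access):
        self.spark = spark
        # e.g. the log_table_access helper sketched earlier in this mail
        self.log_table_access = log_table_access

    def read_table(self, database, table):
        # Audit first, then delegate to Spark as usual.
        self.log_table_access(database, table)
        return self.spark.table("%s.%s" % (database, table))


# reader = AuditedReader(spark, log_table_access)
# df = reader.read_table("sales_db", "transactions")

Does routing both patterns through a shared component like this sound
reasonable, or is there a better place to hook in?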


The solution must support both patterns and be easy to test using our
existing framework. Could you share best practices or steps to move forward
with this setup?



Thanks and regards

AT
