Hi, I need a solution for the following:
We use Apache Spark extensively and need a robust mechanism to log Spark job execution details, specifically:
• IAM roles accessing data
• Databases and tables accessed
• Timestamps of access

I don't need query-level details, and I want to avoid Spark Listeners because of concerns about them being flooded with too many events. Instead, I'm considering a Log4j-based configuration for Spark logging. Could you guide me through the steps to implement this approach, including:
1. How to configure Log4j for Spark to log meaningful execution details (e.g., database/table access, IAM role, and timestamps)?
2. How to design a scalable and efficient logging architecture that integrates well with our existing framework?

Additional context:
• We don't run Spark jobs directly on EMR clusters; instead, we use a custom framework built on top of Spark.
• The framework supports:
  1. Low-code pipelines configured with YAML.
  2. Custom PySpark code implemented via reusable components.
The solution must support both patterns and be easy to test with our existing framework.

Could you share best practices or steps to move forward with this setup?

Thanks and regards,
AT
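
P.S. To make the requirement concrete, here is a rough sketch of the kind of access-logging helper I imagine our reusable PySpark components could call before reading a table. This is not our actual framework code: the helper name log_table_access and the logger name "audit.access" are placeholders, and I'm assuming boto3/STS is available to resolve the IAM role the job runs under.

    # Rough sketch of an access-logging helper for our reusable PySpark components.
    # Placeholder names; assumes boto3 and AWS credentials are available on the driver.
    from datetime import datetime, timezone

    import boto3
    from pyspark.sql import SparkSession


    def log_table_access(spark: SparkSession, database: str, table: str) -> None:
        # Resolve the identity (IAM role) the job is executing under via STS.
        caller_arn = boto3.client("sts").get_caller_identity()["Arn"]
        ts = datetime.now(timezone.utc).isoformat()

        # Reach the JVM-side Log4j logger through Py4J; a dedicated logger name lets
        # these events be routed to their own appender in the Log4j configuration.
        jvm_log4j = spark.sparkContext._jvm.org.apache.log4j
        audit_logger = jvm_log4j.LogManager.getLogger("audit.access")
        audit_logger.info(f"role={caller_arn} database={database} table={table} ts={ts}")


    # Example usage inside a component:
    # spark = SparkSession.builder.getOrCreate()
    # log_table_access(spark, "sales_db", "transactions")
    # df = spark.table("sales_db.transactions")

For the low-code YAML pipelines, I imagine the framework could call the same helper internally whenever it resolves a source or target table, so both patterns would emit identical audit records.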
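
And here is a hypothetical Log4j 2 properties fragment (assuming Spark 3.3+, which ships Log4j 2; the file path and names are placeholders) showing how that dedicated logger could be routed to its own file appender so audit events stay separate from the regular driver/executor logs:

    # Hypothetical fragment for log4j2.properties (placeholder names/paths)
    appender.audit.type = File
    appender.audit.name = auditFile
    appender.audit.fileName = /var/log/spark/access-audit.log
    appender.audit.layout.type = PatternLayout
    appender.audit.layout.pattern = %d{ISO8601} %m%n

    logger.audit.name = audit.access
    logger.audit.level = info
    logger.audit.additivity = false
    logger.audit.appenderRef.file.ref = auditFile

Is this roughly the right direction, or is there a better pattern for this kind of audit logging?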