prashantwason commented on issue #1289: [HUDI-92] Provide reasonable names for Spark DAG stages in Hudi.
URL: https://github.com/apache/incubator-hudi/pull/1289#issuecomment-579944539

A DAG stage name and description can be set using the JavaSparkContext.setJobDescription(...) method. The same name/description is used for all stages started from the same thread until the name/description is updated (another call to setJobDescription) or deleted (clearJobGroup).

In this PR, I am using the class name as the stage name and a textual description derived from the method logic. HUDI classes have very descriptive names, so this works well.

There are two ways this may be done:
1. Manually (this PR), by adding code to set the name/description before any DAG stages are started.
2. Using Java AOP to automatically find code locations matching some pattern and augment them with the call to setJobDescription.

To use the AOP approach, we can create a separate AspectJ file containing the Pointcuts (code locations to augment) and Advices (code to insert). There is a separate AspectJ compiler which can change the class bytecode to add the Advices.

Pros of the AOP approach:
1. Does not require any change to the current code.
2. Also covers future code automatically.
3. Easy to undo (just don't run the AspectJ compiler as part of the build).
4. Can be extended to more use cases, such as automating metrics.

Cons of the AOP approach:
1. Requires AspectJ and its compiler to be integrated with the HUDI build chain.
2. The Advice cannot be dynamic. Hence we cannot provide descriptions for the DAG stages (we can still use the class name as the DAG stage name).

Since the code has a manageable number of places where a DAG is created, I prefer the simpler manual approach. It also ends up documenting the code.
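
For illustration, the manual approach could look roughly like the sketch below. It is not code from this PR: StageLabels and CleanerSketch are hypothetical stand-ins (no Spark dependency), and the per-thread label is modelled with a plain ThreadLocal, on the assumption that this mirrors how the Spark context scopes the name/description to the calling thread.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the per-thread stage labelling done through the Spark context. */
final class StageLabels {
    // Assumption: the label is scoped to the calling thread, modelled here with a ThreadLocal.
    private static final ThreadLocal<String[]> CURRENT = new ThreadLocal<>();

    /** Sets the name and description used by every stage started afterwards on this thread. */
    static void set(String name, String description) {
        CURRENT.set(new String[] {name, description});
    }

    /** Removes the label; later stages on this thread are unnamed again. */
    static void clear() {
        CURRENT.remove();
    }

    /** Returns the label a stage started right now would carry. */
    static String describe() {
        String[] label = CURRENT.get();
        return label == null ? "(unnamed)" : label[0] + ": " + label[1];
    }
}

/** Hypothetical Hudi-like class that labels its stages by hand. */
final class CleanerSketch {
    final List<String> stagesSeen = new ArrayList<>();

    void clean() {
        // Class name as the stage name, description derived from the method logic.
        StageLabels.set(getClass().getSimpleName(), "Generates list of files to clean");
        runStage();
        runStage(); // same thread, no new call: the label carries over

        StageLabels.set(getClass().getSimpleName(), "Deletes the selected files");
        runStage();

        StageLabels.clear();
        runStage(); // label was cleared, so this stage is unnamed
    }

    // Stand-in for triggering a Spark action; it only records the label in effect.
    private void runStage() {
        stagesSeen.add(StageLabels.describe());
    }

    public static void main(String[] args) {
        CleanerSketch cleaner = new CleanerSketch();
        cleaner.clean();
        cleaner.stagesSeen.forEach(System.out::println);
    }
}
```

In real code the two StageLabels calls would be replaced by the corresponding calls on the JavaSparkContext. The sketch also shows why a label should be updated or cleared once a method is done: otherwise it sticks to every later stage on that thread.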
