> We would like to capture some information in our Hadoop Cluster.
> Can anybody please suggest how we can achieve this? Are there any tools
> available already, or do we need to scrub any logs?
Apache Atlas is the standardized solution for deeper analytics into data ownership/usage (look at the HiveHook in Atlas).

> 1. We want to know how many queries are run every day.
> 2. What are the durations of those queries?
> 3. If any queries are failing, at what step are they failing?

For a general use-case, you are probably already writing most of this data.

https://gist.github.com/t3rmin4t0r/e4bf835f10271b9e466e

That only pulls the query text + plans in JSON (to automatically look for bad plans), but the total event structure looks like this:

{
  "domain": "DEFAULT",
  "entity": "gopal_20151119211930_bae04691-f46a-44c4-9116-bef8f854e49a",
  "entitytype": "HIVE_QUERY_ID",
  "events": [
    {
      "eventinfo": {},
      "eventtype": "QUERY_COMPLETED",
      "timestamp": 1447986004954
    },
    {
      "eventinfo": {},
      "eventtype": "QUERY_SUBMITTED",
      "timestamp": 1447985970564
    }
  ],
  "otherinfo": {
    "STATUS": true,
    "TEZ": true,
    "MAPRED": false,
    "QUERY": ...
  },
  "primaryfilters": {
    "requestuser": [ "gopal" ],
    "user": [ "gopal" ]
  }
}

I have seen at least one custom KafkaHook that feeds Hive query plans into a Storm pipeline, but that was custom-built to police the system after an ad-hoc query produced a 4.5 petabyte join.

Cheers,
Gopal
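[Editor's illustration] To show how questions 1-3 map onto the event structure above, here is a minimal Python sketch that pulls HIVE_QUERY_ID entities from the YARN Application Timeline Server REST API and derives per-day query counts, durations, and success/failure from the QUERY_SUBMITTED/QUERY_COMPLETED events and the STATUS flag. The host name is a placeholder, 8188 is the usual ATS default port, and the windowStart/limit parameters assume the ATS v1 REST API; verify them against your cluster before relying on this.

import json
import urllib.request
from collections import Counter
from datetime import datetime

# Placeholder Timeline Server address -- adjust for your cluster.
ATS_URL = "http://timelineserver.example.com:8188/ws/v1/timeline/HIVE_QUERY_ID"

def fetch_hive_queries(window_hours=24, limit=500):
    """Fetch HIVE_QUERY_ID entities written by Hive's timeline hook for the last N hours."""
    now_ms = int(datetime.now().timestamp() * 1000)
    start_ms = now_ms - window_hours * 3600 * 1000
    url = "{}?windowStart={}&limit={}".format(ATS_URL, start_ms, limit)
    with urllib.request.urlopen(url) as resp:
        return json.loads(resp.read().decode("utf-8"))["entities"]

def summarize(entities):
    """Derive counts, durations and failures from the event structure shown above."""
    per_day = Counter()
    for ent in entities:
        events = {e["eventtype"]: e["timestamp"] for e in ent.get("events", [])}
        submitted = events.get("QUERY_SUBMITTED")
        completed = events.get("QUERY_COMPLETED")
        if submitted is None:
            continue
        day = datetime.fromtimestamp(submitted / 1000).date()
        per_day[day] += 1
        duration_s = (completed - submitted) / 1000 if completed else None
        ok = ent.get("otherinfo", {}).get("STATUS", False)
        print(ent.get("entity"),
              "ok" if ok else "FAILED",
              "{:.1f}s".format(duration_s) if duration_s is not None else "no completion event")
    for day, count in sorted(per_day.items()):
        print(day, count, "queries")

if __name__ == "__main__":
    summarize(fetch_hive_queries())

This only tells you that a query failed (STATUS false), not the step it failed at; for that you would still dig into the query plan pulled by the gist above or the Hive logs.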