> We would like to capture some information in our Hadoop cluster.
> Can anybody please suggest how we can achieve this? Are any tools
> available already, or do we need to scrub any logs?

Apache Atlas is the standardized solution for deeper analytics into data
ownership/usage (look at the HiveHook in Atlas).
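
If you go the Atlas route, the Hive bridge is wired in through a
post-execution hook. A minimal hive-site.xml fragment, assuming a stock
Atlas install (check the exact setup steps against your Atlas version's
docs, since the hook also needs the Atlas client jars and config on the
Hive classpath):

<property>
  <name>hive.exec.post.hooks</name>
  <value>org.apache.atlas.hive.hook.HiveHook</value>
</property>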

> 1. We want to know how many queries are run every day.
> 2. What are the durations of those queries?
> 3. If any queries are failing, at what step are they failing?

For the general use case, you are probably already writing a lot of this
data.

https://gist.github.com/t3rmin4t0r/e4bf835f10271b9e466e

That gist only pulls the query text + plans out of the JSON (to
automatically look for bad plans), but the full event structure looks like
this:

{
    "domain": "DEFAULT",
    "entity": "gopal_20151119211930_bae04691-f46a-44c4-9116-bef8f854e49a",
    "entitytype": "HIVE_QUERY_ID",
    "events": [
        {
            "eventinfo": {},
            "eventtype": "QUERY_COMPLETED",
            "timestamp": 1447986004954
        },
        {
            "eventinfo": {},
            "eventtype": "QUERY_SUBMITTED",
            "timestamp": 1447985970564
        }
    ],
    "otherinfo": {
        "STATUS": true,
        "TEZ": true,
        "MAPRED": false,
        "QUERY": ...
    },
    "primaryfilters": {
        "requestuser": [
            "gopal"
        ],
        "user": [
            "gopal"
        ]
    }
}
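
If you just need daily counts, durations, and pass/fail, a short script
against the Timeline Server REST API gets you most of the way. A rough
sketch, assuming the ATS hook (org.apache.hadoop.hive.ql.hooks.ATSHook) is
publishing these events, the timeline server is on its default port 8188,
and the hostname below is a placeholder:

import json
import urllib.request
from datetime import datetime, timedelta

# Placeholder timeline server address -- point this at your cluster.
ATS = "http://timelineserver.example.com:8188"

def fetch_hive_queries(window_start_ms, window_end_ms, limit=500):
    # Pull the HIVE_QUERY_ID entities for the given time window.
    url = (f"{ATS}/ws/v1/timeline/HIVE_QUERY_ID"
           f"?windowStart={window_start_ms}&windowEnd={window_end_ms}"
           f"&limit={limit}")
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)["entities"]

def summarize(entities):
    count, durations, failed = 0, [], []
    for e in entities:
        count += 1
        # Map eventtype -> timestamp (ms) for SUBMITTED/COMPLETED pairs.
        ts = {ev["eventtype"]: ev["timestamp"] for ev in e["events"]}
        if "QUERY_SUBMITTED" in ts and "QUERY_COMPLETED" in ts:
            durations.append((ts["QUERY_COMPLETED"] - ts["QUERY_SUBMITTED"]) / 1000.0)
        # STATUS in otherinfo is the success flag, as in the event above.
        if not e["otherinfo"].get("STATUS", False):
            failed.append(e["entity"])
    return count, durations, failed

if __name__ == "__main__":
    end = datetime.now()
    start = end - timedelta(days=1)
    ents = fetch_hive_queries(int(start.timestamp() * 1000),
                              int(end.timestamp() * 1000))
    n, durs, failed = summarize(ents)
    print(f"queries in last 24h: {n}")
    if durs:
        print(f"avg duration: {sum(durs)/len(durs):.1f}s, max: {max(durs):.1f}s")
    print(f"failed query ids: {failed}")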

I have seen at least one custom KafkaHook used to feed Hive query plans
into a Storm pipeline, but that was custom-built to police the system after
an ad-hoc query produced a 4.5 petabyte join.

Cheers,
Gopal
