What version are you using, and if you are using 1.13+, are you using the adaptive scheduler or reactive mode?

On 20/10/2021 07:39, Clemens Valiente wrote:
Hi Chesnay,
thanks a lot for the clarification.
We managed to resolve the collision, and isolated a problem to the metrics themselves.

Using the REST API at /jobs/<job_id>/metrics?get=uptime
the response is [{"id":"uptime","value":"-1"}]
despite the job running and processing data for 5 days at that point. All task, taskmanager, and jobmanager related metrics seem fine; only the job metrics are incorrect. Basically, none of these report correct values:
[{"id":"numberOfFailedCheckpoints"},{"id":"lastCheckpointSize"},{"id":"lastCheckpointExternalPath"},{"id":"totalNumberOfCheckpoints"},{"id":"lastCheckpointRestoreTimestamp"},{"id":"uptime"},{"id":"restartingTime"},{"id":"numberOfInProgressCheckpoints"},{"id":"downtime"},{"id":"numberOfCompletedCheckpoints"},{"id":"lastCheckpointProcessedData"},{"id":"fullRestarts"},{"id":"lastCheckpointDuration"},{"id":"lastCheckpointPersistedData"}]
Looking at the Gauge, the only way it can return -1 is when isTerminalState() is true, which I don't think can be the case in a running application.
Do you know where we can check on what went wrong?
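
To check several job metrics in one request, a minimal Python sketch follows. It assumes the JobManager REST endpoint is reachable at http://localhost:8081 (an assumption, adjust it to your setup); `metrics_url` and `suspicious_metrics` are hypothetical helper names. The parsing is run against the response quoted above rather than a live cluster.

```python
import json

BASE_URL = "http://localhost:8081"  # assumption: REST address of the JobManager


def metrics_url(job_id, names):
    """Build the job metrics URL; 'get' takes a comma-separated list of names."""
    return f"{BASE_URL}/jobs/{job_id}/metrics?get={','.join(names)}"


def suspicious_metrics(body):
    """Return the metrics whose value is missing or negative."""
    result = {}
    for entry in json.loads(body):
        value = entry.get("value")
        try:
            bad = value is None or float(value) < 0
        except ValueError:
            bad = False  # non-numeric values (e.g. a path) are not judged here
        if bad:
            result[entry["id"]] = value
    return result


# Response body copied from the message above.
sample = '[{"id":"uptime","value":"-1"}]'
print(suspicious_metrics(sample))
```

A negative value is only a hint: some gauges may use -1 as a legitimate "not available" marker, so each flagged metric still needs to be checked individually.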

Best Regards
Clemens


On Thu, Oct 14, 2021 at 8:55 PM Chesnay Schepler <ches...@apache.org> wrote:

    I think you are misunderstanding a few things.

    a) when you include a variable in the scope format, then Flink
    fills that in /before/ it reaches Datadog. If you set it to
    "flink.<job_name>", then what we send to Datadog is
    "flink.myAwesomeJob".
    b) the warnings you see are not coming from Datadog. They occur
    because, based on the configured scope formats, metrics from
    different jobs running in the same JobManager resolve to the same
    name (the standby jobmanager is irrelevant). Flink rejects these
    metrics, because if it were to send these out you'd get funny results
    in Datadog because all jobs would try to report the same metric.

    In short, you need to include the job id or job name in the
    metrics.scope.jm.job scope formats.
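
The collision can be illustrated with a small Python sketch. It is not Flink's implementation: `resolve_scope` and `register` are hypothetical helpers that only mimic how a scope format with a variable such as <job_name> expands to a concrete name before any reporter sees it, and the job names `jobA` and `jobB` are made up.

```python
import re


def resolve_scope(scope_format, variables):
    """Replace each <variable> placeholder in a scope format with its value."""
    return re.sub(r"<(\w+)>", lambda m: variables[m.group(1)], scope_format)


def register(scope_format, jobs, metric_name):
    """Register one metric per job; return (registered, rejected) identifiers."""
    registered, rejected = set(), []
    for job in jobs:
        scope = resolve_scope(scope_format, {"job_name": job})
        identifier = scope + "." + metric_name
        if identifier in registered:
            # Same fully resolved name as an earlier job: a name collision.
            rejected.append(identifier)
        else:
            registered.add(identifier)
    return registered, rejected


jobs = ["jobA", "jobB"]  # hypothetical job names

# Fixed scope: both jobs resolve to the same metric name, the second is rejected.
_, rejected_fixed = register("flink.jobmanager", jobs, "totalNumberOfCheckpoints")

# Scope that includes the job name: every job gets its own metric name.
_, rejected_named = register(
    "flink.jobmanager.<job_name>", jobs, "totalNumberOfCheckpoints"
)

print(rejected_fixed)
print(rejected_named)
```

With the fixed scope the second job's metric is rejected; once the job name is part of the scope format, nothing collides.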

    On 13/10/2021 06:39, Clemens Valiente wrote:
    Hi,

    we are using datadog as our metrics reporter as documented here:
    https://ci.apache.org/projects/flink/flink-docs-master/docs/deployment/metric_reporters/#datadog

    our jobmanager scope is
        metrics.scope.jm: flink.jobmanager
        metrics.scope.jm.job: flink.jobmanager
    since datadog doesn't allow placeholders in metric names, we
    cannot include the <host> or <job_name> placeholder in the scope.

    This setup worked nicely on our standalone kubernetes application
    deployment without using HA.
    But when we set up HA, we lost checkpointing metrics in datadog,
    and see this warning in the jobmanager log:
    2021-10-01 04:22:09,920 WARN  org.apache.flink.metrics.MetricGroup [] - Name collision: Group already contains a Metric with the name 'totalNumberOfCheckpoints'. Metric will not be reported.[flink, jobmanager]
    2021-10-01 04:22:09,920 WARN  org.apache.flink.metrics.MetricGroup [] - Name collision: Group already contains a Metric with the name 'numberOfInProgressCheckpoints'. Metric will not be reported.[flink, jobmanager]
    2021-10-01 04:22:09,920 WARN  org.apache.flink.metrics.MetricGroup [] - Name collision: Group already contains a Metric with the name 'numberOfCompletedCheckpoints'. Metric will not be reported.[flink, jobmanager]
    2021-10-01 04:22:09,921 WARN  org.apache.flink.metrics.MetricGroup [] - Name collision: Group already contains a Metric with the name 'numberOfFailedCheckpoints'. Metric will not be reported.[flink, jobmanager]
    2021-10-01 04:22:09,921 WARN  org.apache.flink.metrics.MetricGroup [] - Name collision: Group already contains a Metric with the name 'lastCheckpointRestoreTimestamp'. Metric will not be reported.[flink, jobmanager]
    2021-10-01 04:22:09,921 WARN  org.apache.flink.metrics.MetricGroup [] - Name collision: Group already contains a Metric with the name 'lastCheckpointSize'. Metric will not be reported.[flink, jobmanager]
    2021-10-01 04:22:09,921 WARN  org.apache.flink.metrics.MetricGroup [] - Name collision: Group already contains a Metric with the name 'lastCheckpointDuration'. Metric will not be reported.[flink, jobmanager]
    2021-10-01 04:22:09,921 WARN  org.apache.flink.metrics.MetricGroup [] - Name collision: Group already contains a Metric with the name 'lastCheckpointProcessedData'. Metric will not be reported.[flink, jobmanager]
    2021-10-01 04:22:09,921 WARN  org.apache.flink.metrics.MetricGroup [] - Name collision: Group already contains a Metric with the name 'lastCheckpointPersistedData'. Metric will not be reported.[flink, jobmanager]
    2021-10-01 04:22:09,921 WARN  org.apache.flink.metrics.MetricGroup [] - Name collision: Group already contains a Metric with the name 'lastCheckpointExternalPath'. Metric will not be reported.[flink, jobmanager]

    I assume this fails because we now have two jobmanager pods (one
    active, one standby) that both report this metric.
    But we cannot use the <host> scope in the group, otherwise we
    won't be able to build datadog dashboards conveniently.

    My question:
    - did anyone else encounter this problem?
    - how could we solve this to have checkpointing metrics again in
    HA mode without needing the <host> placeholder?

    Thanks a lot
    Clemens





