This happens because you use *-z <zk-url>* to connect to Solr. When you do
that, the prometheus-exporter assumes it is connecting to a SolrCloud
environment and collects the metrics from all nodes. Since you have started
3 prometheus-exporters, each one of them collects all metrics for the
entire cluster, so every series is reported three times.
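
You can double-check this by diffing the series served by two of your
exporters. A minimal sketch, assuming each exporter listens on port 8984 as
in your unit file and serves the standard /metrics path:

# Dump the series (name + labels, values stripped) from two exporters.
curl -s http://fqdn1.for.solr.server:8984/metrics | grep '^solr_' | cut -d' ' -f1 | sort > /tmp/exp1
curl -s http://fqdn2.for.solr.server:8984/metrics | grep '^solr_' | cut -d' ' -f1 | sort > /tmp/exp2
# An empty diff means both instances export the same cluster-wide series.
diff /tmp/exp1 /tmp/exp2 && echo "identical series on both exporters"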

You can fix this in two different ways:
1- use *-b <your-local-solr-url>* instead of *-z <zk-url>*, so each
exporter scrapes only the node on its own host (see the sketch after this
list)
2- run only one instance of the prometheus-exporter for the whole cluster
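
For option 1, it is a one-flag change in your ExecStart. A minimal sketch,
assuming the local node answers at http://localhost:8983/solr (substitute
the base URL your Solr container actually listens on):

/usr/bin/docker run --rm --name=solr-exporter --net=host --user=solr \
  solr:8.9.0 \
  /opt/solr/contrib/prometheus-exporter/bin/solr-exporter \
  -p 8984 \
  -b http://localhost:8983/solr \
  -f /opt/solr/contrib/prometheus-exporter/conf/solr-exporter-config.xml \
  -n 4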

Note that solution 1 will not retrieve the metrics you have configured in
the *<collections>* tag of your configuration, as *-b* assumes a standalone
(non-SolrCloud) instance, and those rules are gathered through the
Collections API, which only exists in SolrCloud.
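
If you want to see exactly what drops out in standalone mode, grep the
exporter output for the cluster-level series. A sketch, assuming the
default config names them with the solr_collections_ prefix:

# With -z this count should be non-zero (e.g. solr_collections_live_nodes);
# with -b it should drop to zero.
curl -s http://localhost:8984/metrics | grep -c '^solr_collections_'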

Regards,
Mathieu

On Wed, Aug 11, 2021 at 9:32 AM Joshua Hendrickson <
jhendrick...@tripadvisor.com> wrote:

> Hello,
>
> Our organization has implemented Solr 8.9.0 for a production use case. We
> have standardized on Prometheus for metrics collection and storage. We
> export metrics from our Solr cluster by deploying the public Solr image for
> version 8.9.0 to an EC2 instance and using Docker to run the exporter
> binary against Solr (which is running in a container on the same host). Our
> Prometheus scraper (hosted in Kubernetes and configured via a Helm chart)
> reports errors like the following on every scrape:
>
> ts=2021-08-10T16:44:13.929Z caller=dedupe.go:112 component=remote
> level=error remote_name=11d3d0 url=https://our.endpoint/push
> msg="non-recoverable error" count=500 err="server returned HTTP status 400
> Bad Request: user=nnnnn: err: duplicate sample for timestamp.
> timestamp=2021-08-10T16:44:13.317Z,
> series={__name__=\"solr_metrics_core_time_seconds_total\",
> aws_account=\"our-account\",
> base_url=\"http://fqdn.for.solr.server:32080/solr\", category=\"QUERY\",
> cluster=\"our-cluster\", collection=\"a-collection\",
> core=\"a_collection_shard1_replica_t13\", dc=\"aws\", handler=\"/select\",
> instance=\"fqdn.for.solr.server:8984\", job=\"solr\",
> replica=\"replica_t13\", shard=\"shard1\"}"
>
> We have confirmed that there are indeed duplicate time series when we
> query our Prometheus exporter. Here is a sample that shows the duplicate
> time series:
>
>
> solr_metrics_core_time_seconds_total{category="QUERY",handler="/select",core="a_collection_shard1_replica_t1",collection="a_collection",shard="shard1",replica="replica_t1",base_url="http://fqdn3.for.solr.server:32080/solr",} 1.533471301599E9
>
> solr_metrics_core_time_seconds_total{category="QUERY",handler="/select",core="a_collection_shard1_replica_t1",collection="a_collection",shard="shard1",replica="replica_t1",base_url="http://fqdn3.for.solr.server:32080/solr",} 8.89078653472891E11
>
> solr_metrics_core_time_seconds_total{category="QUERY",handler="/select",core="a_collection_shard1_replica_t1",collection="a_collection",shard="shard1",replica="replica_t1",base_url="http://fqdn3.for.solr.server:32080/solr",} 8.9061212477449E11
>
> solr_metrics_core_time_seconds_total{category="QUERY",handler="/select",core="a_collection_shard1_replica_t3",collection="a_collection",shard="shard1",replica="replica_t3",base_url="http://fqdn2.for.solr.server:32080/solr",} 1.63796914645E9
>
> solr_metrics_core_time_seconds_total{category="QUERY",handler="/select",core="a_collection_shard1_replica_t3",collection="a_collection",shard="shard1",replica="replica_t3",base_url="http://fqdn2.for.solr.server:32080/solr",} 9.05314998357273E11
>
> solr_metrics_core_time_seconds_total{category="QUERY",handler="/select",core="a_collection_shard1_replica_t3",collection="a_collection",shard="shard1",replica="replica_t3",base_url="http://fqdn2.for.solr.server:32080/solr",} 9.06952967503723E11
>
> solr_metrics_core_time_seconds_total{category="QUERY",handler="/select",core="a_collection_shard1_replica_t5",collection="a_collection",shard="shard1",replica="replica_t5",base_url="http://fqdn1.for.solr.server:32080/solr",} 1.667842814432E9
>
> solr_metrics_core_time_seconds_total{category="QUERY",handler="/select",core="a_collection_shard1_replica_t5",collection="a_collection",shard="shard1",replica="replica_t5",base_url="http://fqdn1.for.solr.server:32080/solr",} 9.1289401347629E11
>
> solr_metrics_core_time_seconds_total{category="QUERY",handler="/select",core="a_collection_shard1_replica_t5",collection="a_collection",shard="shard1",replica="replica_t5",base_url="http://fqdn1.for.solr.server:32080/solr",} 9.14561856290722E11
>
> This is the systemd unit file that runs the exporter container:
>
> [Unit]
> Description=Solr Exporter Docker
> After=network.target
> Wants=network.target
> Requires=docker.service
> After=docker.service
>
> [Service]
> Type=simple
> ExecStart=/usr/bin/docker run --rm \
> --name=solr-exporter \
> --net=host \
> --user=solr \
> solr:8.9.0 \
> /opt/solr/contrib/prometheus-exporter/bin/solr-exporter \
> -p 8984 -z the-various-zookeeper-endpoints \
> -f /opt/solr/contrib/prometheus-exporter/conf/solr-exporter-config.xml -n 4
>
> ExecStop=/usr/bin/docker stop -t 2 solr-exporter
> Restart=on-failure
>
> [Install]
> WantedBy=multi-user.target
>
> I compared the prometheus-exporter XML configurations between 8.6.2 (the
> previous version we used) and the latest release, and it looks like there
> was recently a major refactoring of how the exporter works. Is there
> something we are missing? Can anyone reproduce this issue on 8.9?
>
> Thanks in advance,
> Joshua Hendrickson
>
>

-- 
Mathieu Marie
Software Engineer | Salesforce
Mobile: + 33 6 98 59 62 31
