Hello,
I would like to start a discussion on introducing new JFR events for DNS
lookups. Although most lookups in cloud-native environments go to DNS, the JDK
resolves names through the configured name service, so the event naming and
semantics should not imply DNS-only behavior. I’m seeking feedback on scope,
naming, and payload fields.
Motivation
• High-frequency, latency-sensitive lookups are critical for service discovery.
• Current gaps:
  • Cannot distinguish cache hits vs. network lookups
  • Hard to trace lookup latency and diagnose timeouts/failures
  • Concurrent libraries may cause redundant lookups
• Value:
  • End-to-end observability: lookup → socket connect → data transfer
  • Troubleshooting: identify timeouts, resolution failures
  • Performance: evaluate cache policies, detect hotspot names
  • Security: audit external domains accessed
Proposed event (initial draft)
Event name: jdk.DnsLookup
When: Emitted around DNS hostname resolution call boundaries, including:
• Actual network DNS queries (when cache is disabled or cache miss occurs)
• Cache hits (when result is retrieved from DNS cache)
• Stale data usage (when expired cached data that is still within the stale
period is used)
• Background DNS cache refresh operations
Key fields (feedback welcome):
• host (String): The hostname being resolved
• result (String): Comma-separated list of resolved IP addresses, or error
message if lookup failed
• success (boolean): Whether the DNS lookup was successful
• cached (boolean): Whether the result was retrieved from the cache (true) or
from an actual DNS network query (false). Combined with stale, this
distinguishes three cases:
  • Actual network query (cached=false): real DNS network traffic
  • Cache hit (cached=true, stale=false): a repeated lookup served from fresh
    cached data
  • Stale data usage (cached=true, stale=true): the application continues with
    expired cached data that is still within the stale period when the DNS
    refresh fails
• ttl (long, seconds): Remaining time to live of the cache entry. Values:
  • 0: not cached
  • -1: cached forever
  • > 0: actual remaining TTL of the cached entry
• stale (boolean): Whether stale cached data was used (only meaningful when
cached=true). Helps identify degraded scenarios in which DNS errors occur but
the application continues using stale cached records.
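To make the field list easier to discuss, here is a rough sketch of how an event with this payload could be declared and committed using the public jdk.jfr API. It is an application-level illustration only, not the code from the PR: the event name example.NameLookup, the classes NameLookupEvent and LookupSketch, and the helpers resolve and describe are all hypothetical. Application code cannot see the JDK's internal address cache, so cached, stale and ttl are left at their defaults here; an in-JDK implementation would be the one to populate them.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.Arrays;
import java.util.stream.Collectors;

import jdk.jfr.Category;
import jdk.jfr.Event;
import jdk.jfr.Label;
import jdk.jfr.Name;
import jdk.jfr.Timespan;

// Hypothetical application-level event mirroring the proposed payload.
// The name deliberately avoids the jdk. prefix used by built-in events.
@Name("example.NameLookup")
@Label("Name Lookup")
@Category({"Example", "Network"})
class NameLookupEvent extends Event {
    @Label("Host") String host;
    @Label("Result") String result;
    @Label("Success") boolean success;
    @Label("Cached") boolean cached;   // not observable from application code
    @Label("Stale") boolean stale;     // not observable from application code
    @Label("TTL") @Timespan(Timespan.SECONDS) long ttl;
}

class LookupSketch {
    // Resolves a host and records the outcome; returns the value stored in "result".
    static String resolve(String host) {
        NameLookupEvent event = new NameLookupEvent();
        event.host = host;
        event.begin();
        try {
            InetAddress[] addresses = InetAddress.getAllByName(host);
            event.result = Arrays.stream(addresses)
                    .map(InetAddress::getHostAddress)
                    .collect(Collectors.joining(","));
            event.success = true;
        } catch (UnknownHostException e) {
            event.result = String.valueOf(e.getMessage());
            event.success = false;
        } finally {
            event.commit(); // no-op unless a recording has enabled the event
        }
        return event.result;
    }

    // Maps the (cached, stale) pair onto the three cases described for "cached".
    static String describe(boolean cached, boolean stale) {
        if (!cached) {
            return "network query";
        }
        return stale ? "stale data usage" : "cache hit";
    }
}
```

For example, LookupSketch.resolve("127.0.0.1") returns "127.0.0.1" without any network traffic, because an IP literal is parsed rather than looked up.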
Event name: jdk.DnsCacheStatistics
When: Periodic event emitted at configurable intervals (default: 5 seconds in
default.jfc, 1 second in profile.jfc). This is a statistics event similar to
jdk.ExceptionStatistics, providing aggregate metrics about the DNS cache state.
Key fields (feedback welcome):
• cacheSize (long): Current number of entries in the DNS cache. Useful for
monitoring cache growth and understanding cache utilization patterns.
• staleEntries (long): Number of stale entries currently in the cache (entries
that have expired but are still within the stale period). Helps identify how
many entries are using stale data, which is important for understanding cache
behavior in scenarios where DNS refresh fails.
• entriesRemoved (long): Number of entries that have been removed during cache
cleanup operations. This metric tracks cache eviction and helps understand
cache churn patterns, which is particularly useful in Kubernetes and
cloud-native environments where DNS entries may change frequently.
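As a rough illustration of the periodic shape, the sketch below declares an event with these three fields and registers it through FlightRecorder.addPeriodicEvent. Everything named here is hypothetical and not taken from the PR: the event name example.NameCacheStatistics, the classes NameCacheStatisticsEvent and CacheStatsSketch, and the AtomicLong counters that stand in for the real cache internals. The sketch also assumes entriesRemoved is a cumulative counter.

```java
import java.util.concurrent.atomic.AtomicLong;

import jdk.jfr.Event;
import jdk.jfr.FlightRecorder;
import jdk.jfr.Label;
import jdk.jfr.Name;
import jdk.jfr.Period;

// Hypothetical periodic event; the default period can be overridden by a .jfc file.
@Name("example.NameCacheStatistics")
@Label("Name Cache Statistics")
@Period("5 s")
class NameCacheStatisticsEvent extends Event {
    @Label("Cache Size") long cacheSize;
    @Label("Stale Entries") long staleEntries;
    @Label("Entries Removed") long entriesRemoved;
}

class CacheStatsSketch {
    // Stand-ins for the real cache state (assumption: maintained by the cache itself).
    static final AtomicLong cacheSize = new AtomicLong();
    static final AtomicLong staleEntries = new AtomicLong();
    static final AtomicLong entriesRemoved = new AtomicLong();

    // Copies the current counter values into a fresh event instance.
    static NameCacheStatisticsEvent snapshot() {
        NameCacheStatisticsEvent event = new NameCacheStatisticsEvent();
        event.cacheSize = cacheSize.get();
        event.staleEntries = staleEntries.get();
        event.entriesRemoved = entriesRemoved.get();
        return event;
    }

    // Hook invoked by JFR once per period while the event is enabled.
    static final Runnable HOOK = () -> snapshot().commit();

    static void register() {
        FlightRecorder.addPeriodicEvent(NameCacheStatisticsEvent.class, HOOK);
    }

    static boolean unregister() {
        return FlightRecorder.removePeriodicEvent(HOOK);
    }
}
```

Keeping the hook in a constant lets the same Runnable be passed to removePeriodicEvent later, which matches hooks by identity.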
Use cases:
• Monitoring DNS cache size growth over time
• Identifying cache cleanup frequency and patterns
• Understanding stale data usage in production environments
• Troubleshooting DNS-related performance issues in microservices architectures
• Observing cache behavior during DNS server failures or network partitions
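On the consumer side, use cases such as tracking cleanup frequency could be served by a small analysis over a recording. The sketch below assumes entriesRemoved is a cumulative counter and derives per-interval removals from consecutive samples. The class CacheChurnSketch and its methods are hypothetical, and the event name is passed in as a parameter because the final name is still under discussion.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

import jdk.jfr.consumer.RecordedEvent;
import jdk.jfr.consumer.RecordingFile;

class CacheChurnSketch {
    // Turns cumulative counter samples into per-interval differences.
    static long[] deltas(long[] samples) {
        if (samples.length < 2) {
            return new long[0];
        }
        long[] result = new long[samples.length - 1];
        for (int i = 1; i < samples.length; i++) {
            result[i - 1] = samples[i] - samples[i - 1];
        }
        return result;
    }

    // Reads a recording file and returns entries removed between consecutive samples.
    static long[] removedPerInterval(Path recording, String eventName) throws IOException {
        List<RecordedEvent> matching = new ArrayList<>();
        for (RecordedEvent event : RecordingFile.readAllEvents(recording)) {
            if (event.getEventType().getName().equals(eventName)) {
                matching.add(event);
            }
        }
        // Events are not guaranteed to be in chronological order in the file.
        matching.sort(Comparator.comparing(RecordedEvent::getStartTime));
        long[] samples = matching.stream()
                .mapToLong(event -> event.getLong("entriesRemoved"))
                .toArray();
        return deltas(samples);
    }
}
```

If the counter turns out to be reported per interval rather than cumulatively, the deltas step would simply be dropped.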
Prototype/PR
• A preliminary PR is available for context and discussion:
  https://git.openjdk.org/jdk/pull/28110
• I will update the design/implementation per feedback from this thread.
Thanks in advance for your feedback!
Best regards,
NeayGuyCoding