zjx990 opened a new issue, #2488:
URL: https://github.com/apache/shardingsphere-elasticjob/issues/2488
Symptoms Observed
In the production environment, we observed the following symptoms:
All shards stopped executing: a 4-shard job halted completely
ZooKeeper operations spiked: ZK OPS held steady at roughly 700 high-frequency operations
Leader election looked normal: election logs appeared only once, and the election succeeded
Application processes stayed healthy: all 4 scheduling machines kept running normally
Recoverable by restart: job execution resumed normally after the application was restarted
Root Cause Analysis
Core Issue
During a network failure, the exception handling mechanism silently swallows sharding transaction failures. The sharding state nodes are left behind as "zombie nodes", and every non-leader node enters an infinite waiting loop.
Technical Details
Exception Handling Flaw (Critical code location)
// RegExceptionHandler.java:53-55
private static boolean isIgnoredException(final Throwable cause) {
    return null != cause && (cause instanceof ConnectionLossException
            || cause instanceof NoNodeException
            || cause instanceof NodeExistsException);
}
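One short-term direction (a sketch only; handleExceptionStrictly is a hypothetical name, not an existing ElasticJob method) is to keep the lenient behavior for idempotent reads but let ConnectionLossException propagate on transactional writes:

// Sketch: a stricter variant for transactional paths. The method name is
// hypothetical; RegException is the wrapper the existing handler already throws.
public static void handleExceptionStrictly(final Exception cause) {
    if (null == cause) {
        return;
    }
    // A failed transaction must surface connection loss so the caller can
    // retry or fail fast instead of leaving zombie state nodes behind.
    if (cause instanceof ConnectionLossException
            || cause.getCause() instanceof ConnectionLossException) {
        throw new RegException(cause);
    }
    handleException(cause); // keep the lenient behavior for everything else
}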
Transaction Failure Masked (Critical code location)
// JobNodeStorage.java:168-177
public void executeInTransaction(final TransactionExecutionCallback callback) {
    try {
        // ... transaction operations
        curatorTransactionFinal.commit();
    } catch (final Exception ex) {
        RegExceptionHandler.handleException(ex); // network exceptions are silently ignored here
    }
}
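A minimal fix sketch, assuming the Curator 2.x fluent transaction API and RegException's cause-wrapping constructor: rethrow instead of swallowing, so the leader learns that sharding never committed.

// Sketch: surface transaction failures instead of swallowing them.
public void executeInTransaction(final TransactionExecutionCallback callback) {
    try {
        CuratorTransactionFinal curatorTransactionFinal =
                getClient().inTransaction().check().forPath("/").and();
        callback.execute(curatorTransactionFinal);
        curatorTransactionFinal.commit();
    } catch (final Exception ex) {
        // ConnectionLossException must NOT be ignored here: an uncommitted
        // sharding transaction leaves NECESSARY/PROCESSING nodes as zombies.
        throw new RegException(ex);
    }
}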
Infinite Loop Waiting (Critical code location)
// ShardingService.java:132-138
while (!leaderElectionService.isLeader()
        && (jobNodeStorage.isJobNodeExisted(ShardingNode.NECESSARY)
            || jobNodeStorage.isJobNodeExisted(ShardingNode.PROCESSING))) {
    BlockUtils.waitingShortTime(); // spins every 100ms with no exit condition
}
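A sketch of the timeout exit proposed under "Short-term Solutions" below; the 2-minute bound and the SLF4J log field are illustrative assumptions:

// Sketch: bound the wait instead of spinning forever.
long deadline = System.currentTimeMillis() + TimeUnit.MINUTES.toMillis(2L); // assumed bound
while (!leaderElectionService.isLeader()
        && (jobNodeStorage.isJobNodeExisted(ShardingNode.NECESSARY)
            || jobNodeStorage.isJobNodeExisted(ShardingNode.PROCESSING))) {
    if (System.currentTimeMillis() > deadline) {
        log.warn("Sharding wait timed out; state nodes may be zombies left by a failed transaction.");
        break; // exit and let a later trigger (or cleanup) re-drive sharding
    }
    BlockUtils.waitingShortTime();
}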
Failure Chain
Network Jitter → ZK Connection Loss → Sharding Transaction Commit Failure → ConnectionLossException → Exception Silently Ignored → Sharding State Nodes Remain → Non-leader Nodes Spin Forever → Complete Job Execution Halt + High-frequency ZK Operations
Impact Assessment
Business Interruption: distributed jobs stop executing entirely until a manual restart
Resource Waste: CPU is burned by the ineffective loop, and ZooKeeper bears a high-frequency stream of meaningless operations
Operational Cost: manual monitoring and restart intervention are required; there is no automatic recovery
Stability Risk: a failure as common as network jitter can trigger the problem, degrading system availability
Reproduction Steps
Start a multi-shard ElasticJob cluster
Disconnect the network connection to ZooKeeper while the leader is performing sharding
Observe the symptoms: sharding execution stops, ZK OPS becomes abnormal, and non-leader nodes show high CPU usage
Restart to verify: the application resumes normal operation after a restart (a simulation sketch follows)
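To reproduce deterministically outside production, the outage can be simulated with Curator's curator-test TestingServer (a sketch; the ElasticJob bootstrap and class name are placeholders):

import org.apache.curator.test.TestingServer;

// Sketch: simulate ZK connection loss while the leader is sharding.
public final class ConnectionLossRepro {

    public static void main(final String[] args) throws Exception {
        TestingServer zkServer = new TestingServer(2181);
        // ... start the 4-node, 4-shard job against zkServer.getConnectString()
        //     (ElasticJob bootstrap elided) ...
        Thread.sleep(10_000L);  // let a leader be elected and the job run
        zkServer.stop();        // cut connectivity during sharding
        Thread.sleep(30_000L);  // long enough for the sharding transaction to fail
        zkServer.restart();     // connectivity returns, zombie state nodes remain
        // Expected: non-leader nodes spin on NECESSARY/PROCESSING; the job never resumes.
    }
}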
Proposed Fix Solutions
Short-term Solutions
Improve Exception Handling: transaction operations must not ignore ConnectionLossException
Add Timeout Mechanism: add timeout exit logic to the sharding wait loop (see the sketch above)
Enhance Logging: raise the log level for ignored network exceptions to WARN
Long-term Solutions
Unify Node Types: make sharding state nodes ephemeral, so ZK session timeout cleans them up automatically
Add Health Checks: periodically detect and clean up zombie nodes (see the sketch after this list)
Improve Monitoring: add metrics for sharding wait time and ZK operation frequency
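To illustrate the health-check idea: a periodic task could delete sharding state nodes whose age exceeds any plausible sharding duration. The node path, the 5-minute threshold, and the curatorClient field are assumptions for illustration:

// Sketch: periodic zombie-node cleanup; all values here are illustrative.
ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
scheduler.scheduleWithFixedDelay(() -> {
    try {
        String processingPath = "/myJob/leader/sharding/processing"; // assumed node layout
        Stat stat = curatorClient.checkExists().forPath(processingPath);
        if (null != stat
                && System.currentTimeMillis() - stat.getMtime() > TimeUnit.MINUTES.toMillis(5L)) {
            // "processing" has outlived any sane sharding round: treat it as a
            // zombie left behind by a failed transaction and remove it.
            curatorClient.delete().forPath(processingPath);
        }
    } catch (final Exception ex) {
        // Log and retry on the next tick; cleanup must never kill the scheduler.
    }
}, 1, 1, TimeUnit.MINUTES);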
Affected Code Files
elastic-job-common/elastic-job-common-core/src/main/java/com/dangdang/ddframe/job/reg/exception/RegExceptionHandler.java
elastic-job-lite/elastic-job-lite-core/src/main/java/com/dangdang/ddframe/job/lite/internal/storage/JobNodeStorage.java
elastic-job-lite/elastic-job-lite-core/src/main/java/com/dangdang/ddframe/job/lite/internal/sharding/ShardingService.java
Configuration Details
monitorExecution: false (confirmed not an execution monitoring issue)
Shard Count: 4
Node Count: 4
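For reference, the reported settings correspond roughly to the following elastic-job-lite 2.x setup (a sketch; the job name, cron expression, and job class are placeholders):

// Sketch of the reported configuration: 4 shards, monitorExecution disabled.
JobCoreConfiguration coreConfig =
        JobCoreConfiguration.newBuilder("myJob", "0/10 * * * * ?", 4).build();
SimpleJobConfiguration simpleJobConfig =
        new SimpleJobConfiguration(coreConfig, MyJob.class.getCanonicalName());
LiteJobConfiguration liteJobConfig = LiteJobConfiguration
        .newBuilder(simpleJobConfig)
        .monitorExecution(false)
        .build();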
Environment Information
ElasticJob Version: 2.0.4
ZooKeeper Version: 3.4.6
Curator Version: 2.10.0
Java Version: [Please specify your Java version]
Additional Context
This issue represents a critical design flaw: the system lacks runtime self-healing capabilities and over-relies on application restarts to resolve problems. The silent exception handling masks transaction failures, so problems stay hidden until they cause system-wide impact. This is particularly concerning in production environments, where network instability is common, because it can cause extended service outages that require manual intervention.