zjx990 opened a new issue, #2488:
URL: https://github.com/apache/shardingsphere-elasticjob/issues/2488

   Symptoms Observed
   In our production environment we observed the following symptoms:
   
   All shards not executing: the 4-shard job stopped executing entirely
   ZooKeeper operations spike: the ZK operation rate (OPS) stayed at roughly 700
   Leader election normal: election log entries appeared only once, and the election succeeded
   Application processes healthy: all 4 scheduling machines were running normally
   Recoverable by restart: job execution resumed normally after the applications were restarted
   Root Cause Analysis
   Core Issue
   During a network failure, a failed sharding transaction is silently swallowed by the exception handling mechanism. The sharding state nodes are left behind as "zombie nodes", and every non-leader node then enters an infinite waiting loop.
   
   Technical Details
   Exception Handling Flaw (Critical code location)
   
   // RegExceptionHandler.java:53-55
   private static boolean isIgnoredException(final Throwable cause) {
       return null != cause && (cause instanceof ConnectionLossException || 
                              cause instanceof NoNodeException || 
                              cause instanceof NodeExistsException);
   }
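   For context, here is a rough sketch of how RegExceptionHandler.handleException presumably treats these causes, based on the ignore list above (a sketch only; the actual implementation may differ in detail):
   
   // Sketch only -- not a verbatim copy of the project source.
   public static void handleException(final Exception cause) {
       if (null == cause) {
           return;
       }
       if (isIgnoredException(cause) || (null != cause.getCause() && isIgnoredException(cause.getCause()))) {
           // ConnectionLossException lands here: it is merely logged at DEBUG and never rethrown,
           // so the caller cannot tell that its ZK operation (e.g. a transaction commit) failed.
           log.debug("Elastic job: ignored exception for: {}", cause.getMessage());
       } else {
           throw new RegException(cause);
       }
   }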
   Transaction Failure Masked (Critical code location)
   
   // JobNodeStorage.java:168-177
   public void executeInTransaction(final TransactionExecutionCallback callback) {
       try {
           // ... transaction operations
           curatorTransactionFinal.commit();
       } catch (final Exception ex) {
           RegExceptionHandler.handleException(ex); // Network exceptions silently ignored
       }
   }
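   To make the consequence concrete, here is a rough sketch of what the leader's sharding transaction is supposed to commit atomically (paths, types, and names are illustrative, not the actual 2.0.4 source):
   
   // Sketch only: the leader assigns sharding items and removes the state nodes in one transaction.
   // If commit() fails with ConnectionLossException and the exception is swallowed, the two delete
   // operations never take effect, so the necessary and processing nodes remain as zombie nodes.
   public void execute(final CuratorTransactionFinal curatorTransactionFinal) throws Exception {
       for (Map.Entry<String, List<Integer>> entry : shardingResults.entrySet()) {   // instanceId -> sharding items
           for (int item : entry.getValue()) {
               // assign each sharding item to a job instance
               curatorTransactionFinal.create()
                       .forPath("/myJob/sharding/" + item + "/instance", entry.getKey().getBytes())
                       .and();
           }
       }
       // remove the state nodes that the non-leader wait loop (next snippet) polls for
       curatorTransactionFinal.delete().forPath("/myJob/leader/sharding/necessary").and();
       curatorTransactionFinal.delete().forPath("/myJob/leader/sharding/processing").and();
   }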
   Infinite Loop Waiting (Critical code location)
   
   // ShardingService.java:132-138
   while (!leaderElectionService.isLeader() &&
           (jobNodeStorage.isJobNodeExisted(ShardingNode.NECESSARY) || 
            jobNodeStorage.isJobNodeExisted(ShardingNode.PROCESSING))) {
       BlockUtils.waitingShortTime(); // 100ms infinite loop
   }
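   For reference, each iteration of this loop boils down to two existence checks against ZooKeeper, roughly the Curator calls below (paths illustrative). With the 100 ms sleep that is about 20 exists() calls per second per stuck instance, and every stuck job on every machine adds its own loop, which is consistent with the sustained ZK OPS spike:
   
   // Sketch only: what the two isJobNodeExisted calls amount to per loop iteration.
   boolean necessary = null != curatorFramework.checkExists().forPath("/myJob/leader/sharding/necessary");
   boolean processing = null != curatorFramework.checkExists().forPath("/myJob/leader/sharding/processing");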
   Failure Chain
   Network Jitter → ZK Connection Loss → Sharding Transaction Commit Failure → ConnectionLossException →
   Exception Silently Ignored → Sharding State Nodes Remain → Non-leader Nodes Enter Infinite Loop →
   Complete Job Execution Halt + High-frequency ZK Operations
   Impact Assessment
   Business Interruption: distributed jobs stop executing entirely until a manual restart
   Resource Waste: CPU is burned by the ineffective wait loops, and ZooKeeper absorbs high-frequency, meaningless operations
   Operational Cost: recovery requires manual monitoring and restart intervention; there is no automatic recovery
   Stability Risk: common events such as network jitter can trigger the problem, degrading system availability
   Reproduction Steps
   1. Start a multi-shard ElasticJob cluster
   2. Break the network connection to ZooKeeper while the leader is performing sharding
   3. Observe the symptoms: sharding execution stops, ZK OPS becomes abnormal, and non-leader nodes show high CPU usage
   4. Restart to verify: the application resumes normal operation after a restart
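   One way to automate these steps in an integration test, as a rough sketch (uses curator-test's TestingServer; startJobCluster and triggerResharding are hypothetical helpers, not project code):
   
   // Sketch only: simulate a ZK outage in the middle of resharding.
   TestingServer zkServer = new TestingServer(3181);
   startJobCluster(zkServer.getConnectString(), 4);   // hypothetical helper: 4 instances, 4 sharding items
   
   triggerResharding();                               // hypothetical helper: e.g. change the sharding total count
   zkServer.stop();                                   // ZK becomes unreachable while the leader commits the sharding transaction
   Thread.sleep(5000L);
   zkServer.restart();                                // the network recovers, but necessary/processing nodes are left behind
   
   // Expected (buggy) behavior: no shard executes again, and the non-leader instances
   // keep spinning in the 100 ms wait loop, producing sustained exists() traffic on ZK.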
   Proposed Fix Solutions
   Short-term Solutions (the first two are sketched below)
   Improve Exception Handling: transactional operations should not silently ignore ConnectionLossException
   Add Timeout Mechanism: add timeout-and-exit logic to the sharding wait loop
   Enhance Logging: raise the log level of ignored network exceptions to WARN
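   A minimal sketch of the first two short-term ideas (the exception handling, timeout value, and recovery action are all illustrative and would need discussion):
   
   // (1) In JobNodeStorage.executeInTransaction: let connection failures surface instead of swallowing them.
   try {
       curatorTransactionFinal.commit();
   } catch (final KeeperException.ConnectionLossException ex) {
       throw new RegException(ex);   // the leader can then retry sharding or resign instead of assuming success
   } catch (final Exception ex) {
       RegExceptionHandler.handleException(ex);
   }
   
   // (2) In ShardingService: bound the wait loop instead of looping forever.
   long timeoutMillis = TimeUnit.MINUTES.toMillis(5);   // illustrative value
   long start = System.currentTimeMillis();
   while (!leaderElectionService.isLeader()
           && (jobNodeStorage.isJobNodeExisted(ShardingNode.NECESSARY)
               || jobNodeStorage.isJobNodeExisted(ShardingNode.PROCESSING))) {
       if (System.currentTimeMillis() - start > timeoutMillis) {
           log.warn("Waiting for sharding timed out; necessary/processing nodes may be stale.");
           break;   // possible recovery: clean up the stale nodes or re-trigger sharding
       }
       BlockUtils.waitingShortTime();
   }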
   Long-term Solutions (the ephemeral-node idea is sketched below)
   Unify Node Types: make the sharding state nodes ephemeral so that ZK session timeout cleans them up automatically
   Add Health Checks: periodically detect and clean up zombie nodes
   Improve Monitoring: add metrics for sharding wait time and ZK operation frequency
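   A rough sketch of the ephemeral-node idea using the Curator API (the path is illustrative; whether ephemeral semantics fit the sharding protocol needs design discussion):
   
   // Sketch only: create the sharding state node as ephemeral so it disappears with its creator's
   // session instead of surviving as a zombie node after a failed transaction commit.
   curatorFramework.create()
           .creatingParentsIfNeeded()
           .withMode(CreateMode.EPHEMERAL)
           .forPath("/myJob/leader/sharding/processing");
   // If the leader's ZK session is lost mid-sharding, ZooKeeper removes the node automatically
   // once the session times out, and the non-leader wait loop can exit.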
   Affected Code Files
   
elastic-job-common/elastic-job-common-core/src/main/java/com/dangdang/ddframe/job/reg/exception/RegExceptionHandler.java
   
elastic-job-lite/elastic-job-lite-core/src/main/java/com/dangdang/ddframe/job/lite/internal/storage/JobNodeStorage.java
   
elastic-job-lite/elastic-job-lite-core/src/main/java/com/dangdang/ddframe/job/lite/internal/sharding/ShardingService.java
   Configuration Details
   monitorExecution: false (confirmed not an execution monitoring issue)
   Shard Count: 4
   Node Count: 4
   Environment Information
   ElasticJob Version: 2.0.4
   ZooKeeper Version: 3.4.6
   Curator Version: 2.10.0
   Java Version: [Please specify your Java version]
   Additional Context
   This issue represents a critical design flaw where the system lacks runtime 
self-healing capabilities and over-relies on application restarts to resolve 
problems. The silent exception handling mechanism masks transaction failures, 
leading to hidden issues that only surface when they cause system-wide impact.
   
   The problem is particularly concerning in production environments where 
network instability is common, as it can cause extended service outages 
requiring manual intervention.

