kangkaisen opened a new issue #1279: Create tablet hang when there is a broken 
disk
URL: https://github.com/apache/incubator-doris/issues/1279
 
 
   **Describe the bug**
   1 At Fri Jun  7 02:16:27 CST 2019, the BE aborted and restarted:
   ```
   terminate called after throwing an instance of 
'boost::filesystem::filesystem_error'
     what():  boost::filesystem::directory_iterator::operator++: Input/output 
error: "/data11/olap/trash/20190604021338.147750/140342119/515119528"
   *** Aborted at 1559844987 (unix time) try "date -d @1559844987" if you are 
using GNU date ***
   PC: @     0x7fe2fb9d3277 __GI_raise
   *** SIGABRT (@0x1f400001bd1) received by PID 7121 (TID 0x7fe2ed54e700) from 
PID 7121; stack trace: ***
       @     0x7fe2fb9d32f0 (unknown)
       @     0x7fe2fb9d3277 __GI_raise
       @     0x7fe2fb9d4968 __GI_abort
       @          0x2582435 __gnu_cxx::__verbose_terminate_handler()
       @          0x24f58b6 __cxxabiv1::__terminate()
       @          0x25822c9 __cxa_call_terminate
       @          0x24f6068 __gxx_personality_v0
       @          0x258d103 _Unwind_RaiseException_Phase2
       @          0x258dc3e _Unwind_Resume
       @          0x15a3f19 
boost::filesystem::detail::directory_iterator_increment()
       @          0x15a5a4d 
boost::filesystem::detail::directory_iterator_construct()
       @          0x15a7781 (anonymous namespace)::remove_all_aux()
       @          0x15a7802 (anonymous namespace)::remove_all_aux()
       @          0x15a7802 (anonymous namespace)::remove_all_aux()
       @          0x15a7c04 boost::filesystem::detail::remove_all()
       @           0xcedc8e doris::remove_all_dir()
       @           0xc5fe2f doris::OLAPEngine::_do_sweep()
       @           0xc6d347 doris::OLAPEngine::start_trash_sweep()
       @           0xc871b5 
doris::OLAPEngine::_garbage_sweeper_thread_callback()
       @           0xc8735f 
_ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJZN5doris10OLAPEngine16_start_bg_workerEvEUlvE_EEEEE6_M_runEv
       @          0x252accf execute_native_thread_routine
       @     0x7fe2fb788e25 start_thread
       @     0x7fe2fba9bbad __clone
   ```
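
The trace above shows `boost::filesystem::remove_all()` throwing a `filesystem_error` inside the garbage sweeper thread; the exception was not handled, so the process terminated. A minimal sketch of a non-throwing alternative follows. It uses C++17 `std::filesystem` instead of Boost so that it is self-contained; Boost.Filesystem offers a similar `remove_all(path, error_code&)` overload. `remove_all_noexcept` is a hypothetical helper for illustration, not Doris's `remove_all_dir()`.

```cpp
#include <filesystem>
#include <iostream>
#include <string>
#include <system_error>

namespace fs = std::filesystem;

// Hypothetical helper (not Doris code): remove a directory tree and report
// filesystem failures through the return value instead of an exception.
bool remove_all_noexcept(const std::string& path) {
    std::error_code ec;
    // The error_code overload stores the failure in `ec` rather than
    // throwing filesystem_error.
    fs::remove_all(fs::path(path), ec);
    if (ec) {
        std::cerr << "failed to remove " << path << ": " << ec.message() << '\n';
        return false;
    }
    return true;
}
```

With this shape, a sweep loop could log the failing path and move on to the next trash directory instead of taking the whole process down on one I/O error.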
   
   2 Tasks were only inserted into the create tablet queue and never erased, so the queue kept growing:
   ```
   I0607 02:32:25.607103 11647 task_worker_pool.cpp:251] type: CREATE, 
signature: 143318375, has been inserted.queue size: 1
   I0607 02:32:25.607159 11647 task_worker_pool.cpp:251] type: CREATE, 
signature: 143318439, has been inserted.queue size: 2
   I0607 02:32:25.607225 11647 task_worker_pool.cpp:251] type: CREATE, 
signature: 143318459, has been inserted.queue size: 3
   I0607 02:32:25.607259 11480 olap_engine.cpp:2183] begin to process create 
table. [tablet=143318375, schema_hash=930700632]
   I0607 02:32:25.607352 11478 olap_engine.cpp:2183] begin to process create 
table. [tablet=143318439, schema_hash=930700632]
   I0607 02:32:25.607403 11479 olap_engine.cpp:2183] begin to process create 
table. [tablet=143318459, schema_hash=930700632]
   
   I0607 05:17:27.207422 31389 task_worker_pool.cpp:251] type: CREATE, 
signature: 143382309, has been inserted.queue size: 1050
   I0607 05:17:28.344682 31389 task_worker_pool.cpp:251] type: CREATE, 
signature: 143382326, has been inserted.queue size: 1051
   I0607 05:17:45.696979 31389 task_worker_pool.cpp:251] type: CREATE, 
signature: 143382438, has been inserted.queue size: 1052
   I0607 05:17:51.516506 31389 task_worker_pool.cpp:251] type: CREATE, 
signature: 143382695, has been inserted.queue size: 1053
   I0607 05:17:51.516546 31389 task_worker_pool.cpp:251] type: CREATE, 
signature: 143382715, has been inserted.queue size: 1054
   I0607 05:18:09.590432 31389 task_worker_pool.cpp:251] type: CREATE, 
signature: 143382754, has been inserted.queue size: 1055
   I0607 05:18:42.270471 31389 task_worker_pool.cpp:251] type: CREATE, 
signature: 143382859, has been inserted.queue size: 1056
   I0607 05:18:47.847235 31389 task_worker_pool.cpp:251] type: CREATE, 
signature: 143382948, has been inserted.queue size: 1057
   ```
   
   3 The create tablet threads printed no further output after `begin to process create 
table`:
   
![image](https://user-images.githubusercontent.com/9894906/59258201-121fb280-8c6a-11e9-8d31-25fc3d09e494.png)
   
   4 The disk /data11 (sdk) has been broken since `Jun  7 02:14:04`:
   ```
       Jun  7 02:14:04 gh-data-palo-query19 kernel: [24933529.418571] 
end_request: I/O error, dev sdk, sector 4890561312
   ```
   5 The disk became read-only at `2019-06-07 05:24`:
   ```
   [P1][Fault]
   Hostname: gh-data-palo-query19
   Metric: disk read-only  all(#3) df.mounts.ro fstype=ext4,mount=/data11 == 0
   Current value: 0
   6 Doris detected the broken disk at `W0607 05:24:37.429275`:
   ```
   W0607 05:24:37.429145 38025 store.cpp:364] fail to write test file. 
[file_name=/data11/olap/.testfile]
   W0607 05:24:37.429275 38025 store.cpp:320] store read/write test file occur 
IO Error. path=/data11/olap
   ```
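
The two warnings above come from a read/write test on `/data11/olap/.testfile`. As a rough illustration of that kind of check, here is a minimal sketch of a write-then-read probe. It is an assumption about the general technique, not the code in `store.cpp`; `disk_probe_ok` and the payload string are hypothetical names.

```cpp
#include <cstdio>
#include <fstream>
#include <string>

// Hypothetical probe (not the code in store.cpp): write a small test file,
// flush it, read it back, and compare the contents.
bool disk_probe_ok(const std::string& dir) {
    const std::string file = dir + "/.testfile";
    const std::string payload = "disk-probe";
    {
        std::ofstream out(file, std::ios::binary | std::ios::trunc);
        if (!out) return false;   // cannot open for writing, e.g. read-only mount
        out << payload;
        out.flush();
        if (!out) return false;   // write or flush reported an error
    }
    std::string got;
    {
        std::ifstream in(file, std::ios::binary);
        if (!in) return false;    // cannot open for reading
        std::getline(in, got);    // payload has no newline, so this reads all of it
    }
    std::remove(file.c_str());    // best-effort cleanup of the test file
    return got == payload;
}
```

A probe like this goes through the page cache, so it may keep succeeding while only some sectors are bad and fail only once the filesystem is remounted read-only. That would be consistent with detection at 05:24 rather than 02:14, though this is a guess, not something the logs confirm.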
   7 Between 02:14 and 05:24, there were some other error logs:
   ```
   W0607 05:18:43.704691 11568 olap_engine.cpp:675] olap table cannot be used. 
[table=143037920]
   W0607 05:18:43.705096 11568 delta_writer.cpp:69] tablet_id: 143037920, 
schema_hash: 2073545165 not found
   W0607 05:18:43.705113 11568 tablet_writer_mgr.cpp:183] close tablet writer 
failed, tablet_id=143037920, transaction_id=278264366
   W0607 05:18:43.719774 11568 tablet_writer_mgr.cpp:300] channle close failed, 
key=(id=A64DBF5FED4FA940:01D0F4D56960BDA3,index_id=62720912), sender_id=0, err_msg=close tablet writer failed
   W0607 05:18:43.720058 11568 olap_engine.cpp:675] olap table cannot be used. 
[table=143037920]
   W0607 05:18:43.720129 11568 olap_engine.cpp:675] olap table cannot be used. 
[table=143037884]
   
   
   W0607 02:32:15.576319 11416 olap_engine.cpp:675] olap table cannot be used. 
[table=143291383]
   W0607 02:32:15.576333 11416 olap_scanner.cpp:85] tablet does not exist. 
[tablet_id=143291383 schema_hash=1779403853]
   I0607 02:32:15.576339 11416 status.cpp:54] tablet does not exist: 143291383
       @           0xb974b9  doris::Status::Status()
       @          0x11c520d  doris::OlapScanner::_prepare()
       @          0x11c7260  doris::OlapScanner::OlapScanner()
       @          0x119b865  doris::OlapScanNode::start_scan_thread()
       @          0x11a0f1f  doris::OlapScanNode::start_scan()
       @          0x11a2252  doris::OlapScanNode::get_next()
       @          0x1220be6  doris::NewPartitionedAggregationNode::open()
       @          0x118f3e1  doris::TopNNode::open()
       @           0xe44337  doris::PlanFragmentExecutor::open_internal()
       @           0xe44d43  doris::PlanFragmentExecutor::open()
       @           0xde42c7  doris::FragmentExecState::execute()
       @           0xde5c9c  doris::FragmentMgr::exec_actual()
       @           0xde9284  
boost::detail::function::void_function_obj_invoker0<>::invoke()
       @           0xdac138  doris::ThreadPool::work_thread()
       @          0x158420d  thread_proxy
       @     0x7f7704ccfe25  start_thread
       @     0x7f7704fe2bad  __clone
   ```
   
   **I don't understand why the create tablet process blocked without any 
warning or error output while the disk was broken.**
   
