kangkaisen opened a new issue #1279: Create tablet hang when there is a broken disk
URL: https://github.com/apache/incubator-doris/issues/1279

**Describe the bug**

1. At Fri Jun 7 02:16:27 CST 2019, the BE restarted:

```
terminate called after throwing an instance of 'boost::filesystem::filesystem_error'
  what(): boost::filesystem::directory_iterator::operator++: Input/output error: "/data11/olap/trash/20190604021338.147750/140342 119/515119528"
*** Aborted at 1559844987 (unix time) try "date -d @1559844987" if you are using GNU date ***
PC: @ 0x7fe2fb9d3277 __GI_raise
*** SIGABRT (@0x1f400001bd1) received by PID 7121 (TID 0x7fe2ed54e700) from PID 7121; stack trace: ***
    @ 0x7fe2fb9d32f0 (unknown)
    @ 0x7fe2fb9d3277 __GI_raise
    @ 0x7fe2fb9d4968 __GI_abort
    @ 0x2582435 __gnu_cxx::__verbose_terminate_handler()
    @ 0x24f58b6 __cxxabiv1::__terminate()
    @ 0x25822c9 __cxa_call_terminate
    @ 0x24f6068 __gxx_personality_v0
    @ 0x258d103 _Unwind_RaiseException_Phase2
    @ 0x258dc3e _Unwind_Resume
    @ 0x15a3f19 boost::filesystem::detail::directory_iterator_increment()
    @ 0x15a5a4d boost::filesystem::detail::directory_iterator_construct()
    @ 0x15a7781 (anonymous namespace)::remove_all_aux()
    @ 0x15a7802 (anonymous namespace)::remove_all_aux()
    @ 0x15a7802 (anonymous namespace)::remove_all_aux()
    @ 0x15a7c04 boost::filesystem::detail::remove_all()
    @ 0xcedc8e doris::remove_all_dir()
    @ 0xc5fe2f doris::OLAPEngine::_do_sweep()
    @ 0xc6d347 doris::OLAPEngine::start_trash_sweep()
    @ 0xc871b5 doris::OLAPEngine::_garbage_sweeper_thread_callback()
    @ 0xc8735f _ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJZN5doris10OLAPEngine16_start_bg_workerEvEUlvE_EEEEE6_M_runEv
    @ 0x252accf execute_native_thread_routine
    @ 0x7fe2fb788e25 start_thread
    @ 0x7fe2fba9bbad __clone
```

2. The create tablet queue only logged inserts, never erases:

```
I0607 02:32:25.607103 11647 task_worker_pool.cpp:251] type: CREATE, signature: 143318375, has been inserted.queue size: 1
I0607 02:32:25.607159 11647 task_worker_pool.cpp:251] type: CREATE, signature: 143318439, has been inserted.queue size: 2
I0607 02:32:25.607225 11647 task_worker_pool.cpp:251] type: CREATE, signature: 143318459, has been inserted.queue size: 3
I0607 02:32:25.607259 11480 olap_engine.cpp:2183] begin to process create table. [tablet=143318375, schema_hash=930700632]
I0607 02:32:25.607352 11478 olap_engine.cpp:2183] begin to process create table. [tablet=143318439, schema_hash=930700632]
I0607 02:32:25.607403 11479 olap_engine.cpp:2183] begin to process create table. [tablet=143318459, schema_hash=930700632]
I0607 05:17:27.207422 31389 task_worker_pool.cpp:251] type: CREATE, signature: 143382309, has been inserted.queue size: 1050
I0607 05:17:28.344682 31389 task_worker_pool.cpp:251] type: CREATE, signature: 143382326, has been inserted.queue size: 1051
I0607 05:17:45.696979 31389 task_worker_pool.cpp:251] type: CREATE, signature: 143382438, has been inserted.queue size: 1052
I0607 05:17:51.516506 31389 task_worker_pool.cpp:251] type: CREATE, signature: 143382695, has been inserted.queue size: 1053
I0607 05:17:51.516546 31389 task_worker_pool.cpp:251] type: CREATE, signature: 143382715, has been inserted.queue size: 1054
I0607 05:18:09.590432 31389 task_worker_pool.cpp:251] type: CREATE, signature: 143382754, has been inserted.queue size: 1055
I0607 05:18:42.270471 31389 task_worker_pool.cpp:251] type: CREATE, signature: 143382859, has been inserted.queue size: 1056
I0607 05:18:47.847235 31389 task_worker_pool.cpp:251] type: CREATE, signature: 143382948, has been inserted.queue size: 1057
```

3. The create tablet threads produced no further output after `begin to process create table`.

4. The disk /data11 (sdk) was broken after `Jun 7 02:14:04`:

```
Jun 7 02:14:04 gh-data-palo-query19 kernel: [24933529.418571] end_request: I/O error, dev sdk, sector 4890561312
```

5. The disk became read-only at `2019-06-07 05:24`:

```
[P1][Fault]
Hostname: gh-data-palo-query19
Monitor item: disk read-only
all(#3) df.mounts.ro fstype=ext4,mount=/data11 == 0
Current value: 0
```

6. Doris detected the broken disk at `W0607 05:24:37.429275`:

```
W0607 05:24:37.429145 38025 store.cpp:364] fail to write test file. [file_name=/data11/olap/.testfile]
W0607 05:24:37.429275 38025 store.cpp:320] store read/write test file occur IO Error. path=/data11/olap
```

7. Between 02:14 and 05:24, the log also contains other errors, for example:

```
W0607 05:18:43.704691 11568 olap_engine.cpp:675] olap table cannot be used. [table=143037920]
W0607 05:18:43.705096 11568 delta_writer.cpp:69] tablet_id: 143037920, schema_hash: 2073545165 not found
W0607 05:18:43.705113 11568 tablet_writer_mgr.cpp:183] close tablet writer failed, tablet_id=143037920, transaction_id=278264366
W0607 05:18:43.719774 11568 tablet_writer_mgr.cpp:300] channle close failed, key=(id=A64DBF5FED4FA940:01D0F4D56960BDA3,index_id=62 720912), sender_id=0, err_msg=close tablet writer failed
W0607 05:18:43.720058 11568 olap_engine.cpp:675] olap table cannot be used. [table=143037920]
W0607 05:18:43.720129 11568 olap_engine.cpp:675] olap table cannot be used. [table=143037884]
W0607 02:32:15.576319 11416 olap_engine.cpp:675] olap table cannot be used. [table=143291383]
W0607 02:32:15.576333 11416 olap_scanner.cpp:85] tablet does not exist. [tablet_id=143291383 schema_hash=1779403853]
I0607 02:32:15.576339 11416 status.cpp:54] tablet does not exist: 143291383
    @ 0xb974b9 doris::Status::Status()
    @ 0x11c520d doris::OlapScanner::_prepare()
    @ 0x11c7260 doris::OlapScanner::OlapScanner()
    @ 0x119b865 doris::OlapScanNode::start_scan_thread()
    @ 0x11a0f1f doris::OlapScanNode::start_scan()
    @ 0x11a2252 doris::OlapScanNode::get_next()
    @ 0x1220be6 doris::NewPartitionedAggregationNode::open()
    @ 0x118f3e1 doris::TopNNode::open()
    @ 0xe44337 doris::PlanFragmentExecutor::open_internal()
    @ 0xe44d43 doris::PlanFragmentExecutor::open()
    @ 0xde42c7 doris::FragmentExecState::execute()
    @ 0xde5c9c doris::FragmentMgr::exec_actual()
    @ 0xde9284 boost::detail::function::void_function_obj_invoker0<>::invoke()
    @ 0xdac138 doris::ThreadPool::work_thread()
    @ 0x158420d thread_proxy
    @ 0x7f7704ccfe25 start_thread
    @ 0x7f7704fe2bad __clone
```

**I don't understand why the create tablet process is blocked, with no warning or error output, when the disk is broken.**
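
On the restart in item 1: the stack trace shows a `boost::filesystem::filesystem_error` escaping from `remove_all` inside the garbage sweeper thread, which ends in `terminate`. The following is only a minimal sketch of how a directory removal can report an I/O error instead of throwing. It assumes C++17 and uses `std::filesystem` in place of Boost so that it is self-contained; `remove_all_dir_noexcept` is a hypothetical name for illustration, not the Doris function.

```cpp
#include <cassert>
#include <cstdint>
#include <filesystem>
#include <iostream>
#include <system_error>

namespace fs = std::filesystem;

// Hypothetical helper, not the Doris implementation: remove a directory tree
// and report failure through the return value instead of an exception.
bool remove_all_dir_noexcept(const fs::path& dir) {
    std::error_code ec;
    // The error_code overload reports filesystem errors through ec
    // rather than throwing fs::filesystem_error.
    const std::uintmax_t removed = fs::remove_all(dir, ec);
    if (ec) {
        std::cerr << "failed to remove " << dir << ": " << ec.message() << '\n';
        return false;
    }
    std::cout << "removed " << removed << " entries under " << dir << '\n';
    return true;
}
```

With this shape, a caller such as a background sweep loop could log the failure and skip the bad path rather than abort the whole process. Whether that is the right behavior for a broken disk is a separate design question.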
