ccoffline commented on a change in pull request #6416: URL: https://github.com/apache/incubator-doris/pull/6416#discussion_r695365237
########## File path: fe/fe-core/src/main/java/org/apache/doris/alter/MaterializedViewHandler.java ########## @@ -1165,15 +1152,15 @@ private void getOldAlterJobInfos(Database db, List<List<Comparable>> rollupJobIn for (AlterJob selectedJob : jobs) { - OlapTable olapTable = (OlapTable) db.getTable(selectedJob.getTableId()); - if (olapTable == null) { - continue; - } - olapTable.readLock(); try { - selectedJob.getJobInfo(rollupJobInfos, olapTable); - } finally { - olapTable.readUnlock(); + OlapTable olapTable = db.getTableOrMetaException(selectedJob.getTableId(), Table.TableType.OLAP); + olapTable.readLock(); + try { + selectedJob.getJobInfo(rollupJobInfos, olapTable); + } finally { + olapTable.readUnlock(); + } + } catch (MetaNotFoundException ignored) { Review comment: This is equals to `continue` ########## File path: fe/fe-core/src/main/java/org/apache/doris/alter/MaterializedViewHandler.java ########## @@ -883,26 +878,20 @@ private void removeJobFromRunningQueue(AlterJobV2 alterJob) { } private void changeTableStatus(long dbId, long tableId, OlapTableState olapTableState) { - Database db = Catalog.getCurrentCatalog().getDb(dbId); - if (db == null) { - LOG.warn("db {} has been dropped when changing table {} status after rollup job done", - dbId, tableId); - return; - } - OlapTable tbl = (OlapTable) db.getTable(tableId); - if (tbl == null) { - LOG.warn("table {} has been dropped when changing table status after rollup job done", - tableId); - return; - } - tbl.writeLock(); try { - if (tbl.getState() == olapTableState) { - return; + Database db = Catalog.getCurrentCatalog().getDbOrMetaException(dbId); + OlapTable olapTable = db.getTableOrMetaException(tableId, Table.TableType.OLAP); + olapTable.writeLock(); + try { + if (olapTable.getState() == olapTableState) { + return; + } + olapTable.setState(olapTableState); + } finally { + olapTable.writeUnlock(); } - tbl.setState(olapTableState); - } finally { - tbl.writeUnlock(); + } catch (MetaNotFoundException e) { + LOG.warn("[INCONSISTENT META] changing table status failed after rollup job done", e); Review comment: This is called by `replayAlterJobV2()` and `onJobDone()`, which should both not throw this exception. ########## File path: fe/fe-core/src/main/java/org/apache/doris/alter/Alter.java ########## @@ -260,17 +251,11 @@ private void processModifyColumnComment(Database db, OlapTable tbl, List<AlterCl } } - public void replayModifyComment(ModifyCommentOperationLog operation) { + public void replayModifyComment(ModifyCommentOperationLog operation) throws MetaNotFoundException { Review comment: It is considerable. But "replay" throwing any exception is an extremely big risk, which will cause all FE crush and cannot recover. These `MetaNotFoundException` are mostly thrown by getDb and getTable, due to the lock inconsistence that makes editlogs out of order. Semantically, throwing this exception means some metadata the editlog want to edit on is missing during replay, which makes this replay unable to continue. Just like getDb/Table, This kind of inconsistence cannot recover anyway, and these editlogs are theoretically only affect lost metadata. I prefer to use `MetaNotFoundException` that indicate metadata has lost rather than other exception which may cause confuse. The inconsistence can be checked from warning log. ########## File path: fe/fe-core/src/main/java/org/apache/doris/alter/RollupJobV2.java ########## @@ -667,22 +638,26 @@ private void replayCancelled(RollupJobV2 replayedJob) { @Override public void replay(AlterJobV2 replayedJob) { - RollupJobV2 replayedRollupJob = (RollupJobV2) replayedJob; - switch (replayedJob.jobState) { - case PENDING: - replayCreateJob(replayedRollupJob); - break; - case WAITING_TXN: - replayPendingJob(replayedRollupJob); - break; - case FINISHED: - replayRunningJob(replayedRollupJob); - break; - case CANCELLED: - replayCancelled(replayedRollupJob); - break; - default: - break; + try { + RollupJobV2 replayedRollupJob = (RollupJobV2) replayedJob; + switch (replayedJob.jobState) { + case PENDING: + replayCreateJob(replayedRollupJob); + break; + case WAITING_TXN: + replayPendingJob(replayedRollupJob); + break; + case FINISHED: + replayRunningJob(replayedRollupJob); + break; + case CANCELLED: + replayCancelled(replayedRollupJob); + break; + default: + break; + } + } catch (MetaNotFoundException e) { + LOG.warn("[INCONSISTENT META] replay rollup job failed {}", replayedJob.getJobId(), e); Review comment: It is called by `replayAlterJobV2`. I'm afraid it would be different from the origin. https://github.com/apache/incubator-doris/blob/138e7e896dee6842d8af7583d59288190371cd86/fe/fe-core/src/main/java/org/apache/doris/alter/AlterHandler.java#L460-L469 ########## File path: fe/fe-core/src/main/java/org/apache/doris/alter/AlterHandler.java ########## @@ -302,40 +302,37 @@ protected void jobDone(AlterJob alterJob) { } } - public void replayInitJob(AlterJob alterJob, Catalog catalog) { - Database db = catalog.getDb(alterJob.getDbId()); + public void replayInitJob(AlterJob alterJob, Catalog catalog) throws MetaNotFoundException { + Database db = catalog.getDbOrMetaException(alterJob.getDbId()); alterJob.replayInitJob(db); // add rollup job addAlterJob(alterJob); } - public void replayFinishing(AlterJob alterJob, Catalog catalog) { - Database db = catalog.getDb(alterJob.getDbId()); + public void replayFinishing(AlterJob alterJob, Catalog catalog) throws MetaNotFoundException { + Database db = catalog.getDbOrMetaException(alterJob.getDbId()); alterJob.replayFinishing(db); alterJob.setState(JobState.FINISHING); // !!! the alter job should add to the cache again, because the alter job is deserialized from journal // it is a different object compared to the cache addAlterJob(alterJob); } - public void replayFinish(AlterJob alterJob, Catalog catalog) { - Database db = catalog.getDb(alterJob.getDbId()); + public void replayFinish(AlterJob alterJob, Catalog catalog) throws MetaNotFoundException { + Database db = catalog.getDbOrMetaException(alterJob.getDbId()); alterJob.replayFinish(db); alterJob.setState(JobState.FINISHED); jobDone(alterJob); } - public void replayCancel(AlterJob alterJob, Catalog catalog) { + public void replayCancel(AlterJob alterJob, Catalog catalog) throws MetaNotFoundException { removeAlterJob(alterJob.getTableId()); alterJob.setState(JobState.CANCELLED); - Database db = catalog.getDb(alterJob.getDbId()); - if (db != null) { - // we log rollup job cancelled even if db is dropped. - // so check db != null here - alterJob.replayCancel(db); - } - + // we log rollup job cancelled even if db is dropped. + // so check db != null here + Database db = catalog.getDbOrMetaException(alterJob.getDbId()); + alterJob.replayCancel(db); Review comment: I'll fix this. ########## File path: fe/fe-core/src/main/java/org/apache/doris/alter/RollupJobV2.java ########## @@ -812,10 +787,14 @@ public void gsonPostProcess() throws IOException { return; } // parse the define stmt to schema - SqlParser parser = new SqlParser(new SqlScanner(new StringReader(origStmt.originStmt), - SqlModeHelper.MODE_DEFAULT)); + SqlParser parser = new SqlParser(new SqlScanner(new StringReader(origStmt.originStmt), SqlModeHelper.MODE_DEFAULT)); ConnectContext connectContext = new ConnectContext(); - Database db = Catalog.getCurrentCatalog().getDb(dbId); + Database db; + try { + db = Catalog.getCurrentCatalog().getDbOrMetaException(dbId); + } catch (MetaNotFoundException e) { + throw new IOException("error happens when parsing create materialized view stmt: " + origStmt, e); Review comment: The origin code will throw NPE, I really don't know what to do here ########## File path: fe/fe-core/src/main/java/org/apache/doris/alter/SchemaChangeHandler.java ########## @@ -1889,17 +1889,14 @@ public void cancel(CancelStmt stmt) throws DdlException { Preconditions.checkState(!Strings.isNullOrEmpty(dbName)); Preconditions.checkState(!Strings.isNullOrEmpty(tableName)); - Database db = Catalog.getCurrentCatalog().getDb(dbName); - if (db == null) { - throw new DdlException("Database[" + dbName + "] does not exist"); - } + Database db = Catalog.getCurrentCatalog().getDbOrDdlException(dbName); AlterJob schemaChangeJob = null; AlterJobV2 schemaChangeJobV2 = null; - OlapTable olapTable = null; + OlapTable olapTable; try { - olapTable = (OlapTable) db.getTableOrThrowException(tableName, Table.TableType.OLAP); + olapTable = db.getTableOrMetaException(tableName, Table.TableType.OLAP); Review comment: ok, that make sense ########## File path: fe/fe-core/src/main/java/org/apache/doris/alter/SchemaChangeJobV2.java ########## @@ -218,9 +215,9 @@ protected void runPendingJob() throws AlterCancelException { } MarkedCountDownLatch<Long, Long> countDownLatch = new MarkedCountDownLatch<>(totalReplicaNum); - OlapTable tbl = null; + OlapTable tbl; try { - tbl = (OlapTable) db.getTableOrThrowException(tableId, TableType.OLAP); + tbl = db.getTableOrMetaException(tableId, TableType.OLAP); Review comment: ok ########## File path: fe/fe-core/src/main/java/org/apache/doris/alter/SchemaChangeJobV2.java ########## @@ -370,14 +367,11 @@ protected void runWaitingTxnJob() throws AlterCancelException { } LOG.info("previous transactions are all finished, begin to send schema change tasks. job: {}", jobId); - Database db = Catalog.getCurrentCatalog().getDb(dbId); - if (db == null) { - throw new AlterCancelException("Databasee " + dbId + " does not exist"); - } + Database db = Catalog.getCurrentCatalog().getDbOrException(dbId, s -> new AlterCancelException("Database " + s + " does not exist")); - OlapTable tbl = null; + OlapTable tbl; try { - tbl = (OlapTable) db.getTableOrThrowException(tableId, TableType.OLAP); + tbl = db.getTableOrMetaException(tableId, TableType.OLAP); Review comment: ok ########## File path: fe/fe-core/src/main/java/org/apache/doris/backup/RestoreJob.java ########## @@ -1048,11 +1049,17 @@ private boolean downloadAndDeserializeMetaInfo() { } private void replayCheckAndPrepareMeta() { - Database db = catalog.getDb(dbId); + Database db; + try { + db = catalog.getDbOrMetaException(dbId); Review comment: It is called by `replayRun`, which is called by the code below, and have some next process. https://github.com/apache/incubator-doris/blob/138e7e896dee6842d8af7583d59288190371cd86/fe/fe-core/src/main/java/org/apache/doris/backup/BackupHandler.java#L643-L655 ########## File path: fe/fe-core/src/main/java/org/apache/doris/catalog/Catalog.java ########## @@ -4548,72 +4476,43 @@ private void unprotectUpdateReplica(ReplicaPersistInfo info) { replica.setBad(false); } - public void replayAddReplica(ReplicaPersistInfo info) { - Database db = getDb(info.getDbId()); - OlapTable olapTable = (OlapTable) db.getTable(info.getTableId()); - if (olapTable == null) { - /** - * Same as replayUpdateReplica() - */ - LOG.warn("Olap table is null when the add replica log is replayed, {}", info); - return; - } + public void replayAddReplica(ReplicaPersistInfo info) throws MetaNotFoundException { + Database db = this.getDbOrMetaException(info.getDbId()); + OlapTable olapTable = db.getTableOrMetaException(info.getTableId(), TableType.OLAP); olapTable.writeLock(); try { - unprotectAddReplica(info); + unprotectAddReplica(olapTable, info); } finally { olapTable.writeUnlock(); } } - public void replayUpdateReplica(ReplicaPersistInfo info) { - Database db = getDb(info.getDbId()); - OlapTable olapTable = (OlapTable) db.getTable(info.getTableId()); - if (olapTable == null) { Review comment: ok ########## File path: fe/fe-core/src/main/java/org/apache/doris/httpv2/rest/TableRowCountAction.java ########## @@ -61,15 +61,12 @@ public Object count( String fullDbName = getFullDbName(dbName); // check privilege for select, otherwise return HTTP 401 checkTblAuth(ConnectContext.get().getCurrentUserIdentity(), fullDbName, tblName, PrivPredicate.SELECT); - Database db = Catalog.getCurrentCatalog().getDb(fullDbName); - if (db == null) { - return ResponseEntityBuilder.okWithCommonError("Database [" + dbName + "] " + "does not exists"); - } - OlapTable olapTable = null; + OlapTable olapTable; try { - olapTable = (OlapTable) db.getTableOrThrowException(tblName, Table.TableType.OLAP); + Database db = Catalog.getCurrentCatalog().getDbOrMetaException(fullDbName); + olapTable = db.getTableOrMetaException(tblName, Table.TableType.OLAP); } catch (MetaNotFoundException e) { - return ResponseEntityBuilder.okWithCommonError(e.getMessage()); + return ResponseEntityBuilder.notFound(e.getMessage()); Review comment: ok ########## File path: fe/fe-core/src/main/java/org/apache/doris/catalog/Catalog.java ########## @@ -6341,12 +6264,12 @@ public long loadCluster(DataInputStream dis, long checksum) throws IOException, // for adding BE to some Cluster, but loadCluster is after loadBackend. cluster.setBackendIdList(latestBackendIds); - String dbName = InfoSchemaDb.getFullInfoSchemaDbName(cluster.getName()); + String dbName = InfoSchemaDb.getFullInfoSchemaDbName(cluster.getName()); InfoSchemaDb db; // Use real Catalog instance to avoid InfoSchemaDb id continuously increment // when checkpoint thread load image. - if (Catalog.getServingCatalog().getFullNameToDb().containsKey(dbName)) { - db = (InfoSchemaDb)Catalog.getServingCatalog().getFullNameToDb().get(dbName); + if (Catalog.getCurrentCatalog().getFullNameToDb().containsKey(dbName)) { Review comment: It is a mischange, I'll fix this ########## File path: fe/fe-core/src/main/java/org/apache/doris/persist/EditLog.java ########## @@ -865,6 +862,8 @@ public static void loadJournal(Catalog catalog, JournalEntity journal) { throw e; } } + } catch (MetaNotFoundException e) { + LOG.warn("[INCONSISTENT META] replay failed {}", journal, e); Review comment: I considered printing the original journal content, but gave up because of the complicated deserialization. `JournalEntity` has implemented `toString()` method and has already print the op code. It's better for all the `Writable` objects to implement `toString()` method. I'll add `e.getMessage()` in one line and keep the exception stack trace. ########## File path: fe/fe-core/src/main/java/org/apache/doris/alter/SchemaChangeJobV2.java ########## @@ -218,9 +215,9 @@ protected void runPendingJob() throws AlterCancelException { } MarkedCountDownLatch<Long, Long> countDownLatch = new MarkedCountDownLatch<>(totalReplicaNum); - OlapTable tbl = null; + OlapTable tbl; try { - tbl = (OlapTable) db.getTableOrThrowException(tableId, TableType.OLAP); + tbl = db.getTableOrMetaException(tableId, TableType.OLAP); Review comment: This checks if the table exists and if the table is OLAP, so it might be easier to code this way. ########## File path: fe/fe-core/src/main/java/org/apache/doris/alter/Alter.java ########## @@ -260,17 +251,11 @@ private void processModifyColumnComment(Database db, OlapTable tbl, List<AlterCl } } - public void replayModifyComment(ModifyCommentOperationLog operation) { + public void replayModifyComment(ModifyCommentOperationLog operation) throws MetaNotFoundException { Review comment: @caiconghui we don't have any promise that the edit logs are in order. These check code is to prevent the worst from happening. Edit logs that out of order may cause meta inconsistent, which has to be fixes sooner or later. We are exploring ways to ensure consistency and minimize the cost. Until then, we have to check all NPE or let all the FE to crash. ########## File path: fe/fe-core/src/main/java/org/apache/doris/alter/Alter.java ########## @@ -260,17 +251,11 @@ private void processModifyColumnComment(Database db, OlapTable tbl, List<AlterCl } } - public void replayModifyComment(ModifyCommentOperationLog operation) { + public void replayModifyComment(ModifyCommentOperationLog operation) throws MetaNotFoundException { Review comment: @caiconghui we don't have any promise that the edit logs are in order. These check code is to prevent the worst from happening. Edit logs that out of order may cause meta inconsistent, which has to be fixed sooner or later. We are exploring ways to ensure consistency and minimize the cost. Until then, we have to check all NPE or let all the FE to crash. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@doris.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: commits-unsubscr...@doris.apache.org For additional commands, e-mail: commits-h...@doris.apache.org