Hi,

When creating a JIRA, there is a "Labels" field where we can label the JIRA as "supportability". Doing this would help, because we could then give priority to fixing these issues for better supportability, especially those JIRAs that tend to be small but very high-impact for end users.

The types of JIRAs that we can label "supportability" are (see a more detailed description at the bottom of this mail):

- improving logs and error messages
- simplifying workflows for admins
- reducing configuration complexity
- adding new metrics
- ...

For example:

HDFS-7281 <https://issues.apache.org/jira/browse/HDFS-7281> Missing block is marked as corrupted block
HDFS-7497 <https://issues.apache.org/jira/browse/HDFS-7497> Inconsistent report of decommissioning DataNodes between dfsadmin and NameNode webui
HDFS-6959 <https://issues.apache.org/jira/browse/HDFS-6959> Make the HDFS home directory location customizable.
HDFS-6403 <https://issues.apache.org/jira/browse/HDFS-6403> Add metrics for log warnings reported by JVM pauses

This email serves as a proposal to do this kind of labeling when creating new JIRAs. We could also go back and label old JIRAs for reference when time allows.

Comments are welcome.

Thanks.

--Yongjun

Below is a more detailed description of some relevant scenarios (thanks to Andrew Wang and Todd Lipcon):

1) In the presence of configuration errors, detecting them preemptively, before they result in the system getting into a funky state. For example, we used to have a possible configuration where the NN would start up bound to 0.0.0.0 and then advertise 0.0.0.0 to the SNN as its remote IP. This meant that checkpointing wouldn't work, and would fail with confusing errors. Aborting at startup made this easier to support.

2) In the presence of environmental issues, detecting them and giving meaningful errors. For example, something like the GC pause monitor that's in the NN now is helpful because when something goes wrong, you have a smoking gun (even though, in some cases, it's not exactly an NN bug that GC happens).

3) Changing non-specific error messages to specific ones. For example, we've had cases before where we throw an NPE, and the "fix" is to check for null and throw an IllegalArgumentException with a nice message or something.
It wasn't a bug that the system failed with that particular config, but the error message tells the user/supporter exactly what to do to fix it.
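
To make scenarios 1) and 3) a bit more concrete, here is a minimal Java sketch of the kind of check they describe. It is not taken from the Hadoop code base: the class name AdvertisedAddressCheck, the validate method, and the configuration key used in the usage note are all hypothetical, and a real check would likely need to handle more cases (for example, IPv6 wildcards).

```java
/**
 * Hypothetical helper (not part of Hadoop): validates an address that a
 * daemon is about to advertise to its peers, and fails fast with a message
 * that tells the admin what to change.
 */
final class AdvertisedAddressCheck {
    private AdvertisedAddressCheck() {}

    /**
     * Returns the trimmed host if it is usable, otherwise throws an
     * IllegalArgumentException naming the offending configuration key.
     */
    static String validate(String configKey, String host) {
        if (host == null || host.trim().isEmpty()) {
            // Specific error up front, instead of an NPE somewhere later.
            throw new IllegalArgumentException(
                "Configuration key '" + configKey + "' is not set; "
                + "set it to a hostname or IP address that other nodes can reach.");
        }
        String trimmed = host.trim();
        if (trimmed.equals("0.0.0.0")) {
            // The wildcard address is fine for binding, but meaningless
            // when advertised to a remote node.
            throw new IllegalArgumentException(
                "Configuration key '" + configKey + "' is set to the wildcard "
                + "address 0.0.0.0, which cannot be advertised to other nodes; "
                + "set it to a routable hostname or IP address.");
        }
        return trimmed;
    }
}
```

Calling AdvertisedAddressCheck.validate("example.address", "0.0.0.0") at startup would abort with a message that names the key and the fix, rather than letting the daemon start and fail later with a confusing error.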