This is an automated email from the ASF dual-hosted git repository.
wuweijie pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/shardingsphere.git
The following commit(s) were added to refs/heads/master by this push:
new 09e15cab949 update a blog (#26362)
09e15cab949 is described below
commit 09e15cab949e475907062b2f206e64e6a52a16b5
Author: Nan Xiang <[email protected]>
AuthorDate: Thu Jun 15 13:21:58 2023 +0800
update a blog (#26362)
---
.../material/2023_06_15_Chaos_Engineering.md | 141 +++++++++++++++++++++
docs/blog/static/img/chaos_engineering1.jpeg | Bin 0 -> 334820 bytes
docs/blog/static/img/chaos_engineering2.png | Bin 0 -> 14553 bytes
docs/blog/static/img/chaos_engineering3.png | Bin 0 -> 123039 bytes
4 files changed, 141 insertions(+)
diff --git a/docs/blog/content/material/2023_06_15_Chaos_Engineering.md
b/docs/blog/content/material/2023_06_15_Chaos_Engineering.md
new file mode 100644
index 00000000000..5d0a528f521
--- /dev/null
+++ b/docs/blog/content/material/2023_06_15_Chaos_Engineering.md
@@ -0,0 +1,141 @@
++++
+title = "Chaos Engineering: Efficient Way to Improve System Availability
+"
+weight = 101
+chapter = true
++++
+
+
+
+Resilience is a crucial requirement for ShardingSphere-Proxy, an essential
database infrastructure. Testing and verifying resilience can be efficiently
achieved through the use of chaos engineering methodology. To support
customized chaos engineering, the
[ShardingSphere-on-Cloud](https://shardingsphere.apache.org/oncloud/) project
is designing and implementing a new
[CustomResourceDefinition](https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/)
(CRD) ca [...]
+
+# What is chaos engineering
+
+System availability is a critical metric for evaluating service reliability.
Numerous methods can ensure high availability, including engineering resilience
and techniques, and others. One such technique is chaos engineering, which
involves introducing software faults into production systems to enhance
availability.
+
+According to [Principles of Chaos](https://principlesofchaos.org/) (2019), the
definition of chaos engineering is:
+
+> “Chaos Engineering is the discipline of experimenting on a system to build
confidence in the system’s capability to withstand turbulent conditions in
production.”
+
+In other words, chaos engineering is a practice that aims to enhance system
robustness by detecting potential weaknesses in software systems early,
ultimately preventing major disruptions or failures.
+
+# Why is chaos engineering needed
+
+The complexity of a system can be shown in a linear and nonlinear way as well
as reflect how changes in the input of a system affect the output.
+
+A linear system is typically predictable. There are many examples of linear
systems in nature, such as simple mathematical functions and physical
definitions.
+
+In contrast, the output of a nonlinear system cannot be accurately calculated.
In a large distributed program, components interact with each other, and we
cannot determine if expected output can be achieved under various inputs.
+
+Currently, most programs are increasingly complex. In common cloud
environments, coordinating various components is becoming more challenging
(such as Kubernetes, along with the services running on it, like Istio, Envoy,
and other software infrastructure).
+
+
+<div class="caption-center"> Figure 1. Infrastructure stack for general
service </div>
+
+The complexity and rapid changes inherent to many systems often lead to
developers having a narrow understanding of the overall picture. For example,
developers behind a mall system may not familiar with the technical details of
the infrastructure they adopted. With increased complexity, any single person’s
understanding on the model built by the system may become less accurate. Hence,
gaining a complete comprehension of a complex system is not realistic.
+
+Chaos is inherent and describes an unknown state in complex systems. **Chaos
engineering is used to discover chaos in complex systems, learn the behavior of
the system, and develop the ability to respond to failures and restore the
system to a steady state.**
+
+# The guidelines and practical ways of chaos engineering
+
+## Formulate a hypothesis about steady-state
+
+Every experiment begins with a hypothesis, often taking the form of “even in
XYZ circumstances, the system remains in a steady state.” This principle
emphasizes the establishment of hypotheses based on defining steady states.
Therefore, we should define various indicators of the system’s normal state
based on long-term monitoring of the production environment and focus on
measurable outputs, rather than internal properties of the system.
+
+When identifying a steady state, it’s often essential to consider the global
outputs of the system, such as running logs, performance logs, alerts, and
program behavior, and abstract them into steady-state conditions. Having
introduced experimental variables (faults), these steady-state conditions
should change as expected.
+
+When the system is in the steady state we defined, we should consider that the
system can provide services normally to the outside world. In addition,
monitoring the steady state is also important so that the system can recover to
the steady state in a short period of time.
+
+## Introducing diverse real-world events
+
+We ought to introduce events that are real and what we care about such as
trying to reproduce faults that occurred in the production environment, such as
cache avalanche, service degradation, etc.
+
+Behaviors that would lead to the same fault symptoms should not be introduced,
such as occupying all the memory, CPU, or disk of a service instance or
‘killing’ the instance, which system responds to bad requests. Testing should
focus on the system’s behavior after a fault occurs, rather than on how to
trigger the fault.
+
+## Experiments in the production environment
+
+When conducting experiments, we can learn about the relevant behaviors of the
system and establish confidence in the system. If we conduct experiments in a
test environment, we can only establish confidence in that specific test
environment. If there are differences between the production environment and
the test environment, we cannot establish confidence in the production
environment.
+
+This is because a complex system is a whole, environmental differences between
testing and production environments can render testing environment experiments
meaningless, causing a
“[Bullwhip-effect](https://zh.wikipedia.org/wiki/%E9%95%BF%E9%9E%AD%E6%95%88%E5%BA%94)”.
However, conducting experiments in the production environment may affect users
of the system and cause losses. We need to make trade-offs in the formal
environment and let the experimental tools mature in the quasi-product [...]
+
+## Automate experiments
+
+When testing massive experiment sets are required, automating the process is
more efficient than manually setting experiment environments, introducing
faults, and gathering results. Automated experiments save time, run
continuously, and can cover a larger number of experiment sets.
+
+When repeat experiments are required, hypotheses are not always true, and they
can be expired following iterated software, so periodic conducting regression
experiments are needed.
+
+## Minimize the blast radius
+
+Safe experiment methods can reduce the risk to the production environment,
such as using traffic shadowing or selecting a suitable time period. An
indicator in a small variable group is more significant compared to a small
control group.
+
+# Chaos maturity model
+
+Chaos maturity model provided a [model
map](https://www.oreilly.com/content/chaos-engineering/#cmm_map_image), based
on different positions to measure different types of chaos engineering in
practice.
+
+
+
+<div class="caption-center"> Figure 2. Chaos Maturity Map </div>
+
+There are two axes on the map, adoption on the X-axis and sophistication on
the Y-axis, which can be explored separately:
+
+## Adoption
+
+As chaos engineering becomes mature, chaos engineering software needs to
achieve a specific level that robustness validation alone can significantly
affect the compliance process. However, initial adoption of chaos engineering
generally starts from scratch.
+
+## Sophistication
+
+Sophistication has some different metrica: provide consultation services and
provide a set of tools. Due to the software infrastructure’s diversity, no tool
can abstract sophisticated chaos engineering experiment instances in all
environments and apply it in reality. Thus, **chaos engineering practices were
contributed from massive labor inputs, then customized solutions were gradually
developed**.
+
+Another way to understand sophistication engineering is to consider the system
levels and introducing system experiment variables. Experiments typically start
at the infrastructure level with killing pods or virtual machines at the
initial. During the initial stages of chaos experiments, the common approach is
to use methods such as killing pods or virtual machines. As the tools become
more sophisticated, chaos injection logic may be introduced into the target
system, impacting the reque [...]
+
+Additionally, when experimental variables affect business logic, we can
observe more complex experiments. For instance, returning feasible but
unexpected request responses to a service can lead to different results by
programs. The experiments in the system will be conducted from the
infrastructure layer to the application layer, and then to the business logic
layer. Moreover, low-granularity experiments such as those that tend to trigger
potential faults in the business logic layer are [...]
+
+# **Continuous verification**
+
+> “[Continuous verification (CV) is a discipline of proactive experimentation,
implemented as tooling that verifies system
behaviors.](https://www.oreilly.com/library/view/chaos-engineering/9781492043850/)”
— Casy Rosenthal
+
+Continuous validation development tools are a prime example of complexity in
the chaos maturity model. CV, like CI/CD, addresses the need for increasingly
complex operational systems. Due to resource constraints, system developers
cannot afford to verify internal plans, and must instead focus on validating
the system’s output meets desired expectations. That’s why CV is better than
verification and also this is a successful sign of managing complex systems.
+
+There are at least three types of continuous verification: feature testing,
data artifacts, and correctness.
+
+**Feature Testing**: based on the various performance indicators (concurrency,
latency deviation, execution speed, etc.), and through observation of actual
production traffic, the report and recognition of this test will be established.
+
+**Data artifacts**: databases and storage applications have various
requirements for the characters of writing and retrieving data, such as
transaction consistency, idempotence, incorrect data isolation levels, etc.
+
+**Correctness:** not all correct forms are manifested as a certain state or
ideal attribute. In some cases, the interaction between different components
must be taken by interface contracts or agreements. When an interface request
returns a seemingly correct result that is beyond its judgement logic,
unexpected errors may occur. The reason for such issues is that different
levels of code are consistent at the logical level but inconsistent between
layers.
+
+# Open-source chaos engineering platform
+
+## [Litmus Chaos](https://litmuschaos.io/)
+
+Litmus Chaos is a chaos engineering platform that provides cross-cloud
services. It’s a CNCF open-source project that many organizations have used.
[Litmus Chaos](https://litmuschaos.io/)’s mission is to help Kubernetes SRE and
developers to find weaknesses in non-Kubernetes platforms and applications that
run on Kubernetes.
+
+## [Chaos Mesh](https://chaos-mesh.org/)
+
+Chaos Mesh is a chaos engineering platform open-sourced by PingCAP. It has a
strong capability to orchestrate failure scenarios and provide comprehensive
failure simulation types, which allow users to simulate the faults that might
occur in production and testing environments and helps them identify potential
failures. Chaos Mesh provides comprehensive visual tools to help beginner
programmers conveniently run and monitor their own chaos scenarios. Chaos Mesh
was developed based on Kuber [...]
+
+* Chaos Dashboard: a visible platform of Chaos Mesh, provides a user-friendly
WebUI, allowing users to design, monitor for Chaos, and manage RABC permits.
+* Chaos Controller Manager: core logical components of Chaos Mesh, able to
schedule users’ designed Chaos CR. The component includes many CRD Controllers,
such as PodChaos Controller, WorkerFlow Controller, etc.
+* Chaos Daemon: the main execution component of Chaos Mesh. Chaos Daemon runs
as DeamonSet, and holds Privileged access by default (opt-in). Generally, this
component interferes with network equipment, file systems, and kernels by
invasion to target Pod Namespace.
+
+## [Chaos Blade](https://chaosblade.io/)
+
+Chaos Blade is a chaos engineering project designed and open-sourced by
Alibaba in 2019, which includes the chaos engineering experiment tool Chaos
Blade and platform Chaosblade-box. It helps enterprises solve high availability
issues during cloud-native processes through chaos engineering.
+
+Chaosblade supports three big platforms, four computing language applications,
involves over 200 experimental scenarios, and over 3000 parameters, allowing
for fine control of the experimental scope.
+
+ChaosBlade-Box supports the management of experimental tools, and in addition
to managing Chaos Blade, it also supports the aggregation of experimental tools
from other platforms such as Litmuschaos.
+
+# Conclusion
+
+To introduce chaos engineering into a certain system, we can refer to the
chaos maturity model and start with simple inputs. In the case of our
community, we can agree upon a date for developers of various components in the
system to perform a fault test together, record the results to enhance the
sense of participation and importance of chaos engineering for contributors.
We’ll then observe system behavior, define steady states, and design reasonable
chaos experiment plans. These experi [...]
+
+# Reference
+
+1. [Principles of Chaos Engineering](https://principlesofchaos.org/)
+2. [LitmusChaos](https://litmuschaos.io/)
+3. [Chaos Mesh: A Powerful Chaos Engineering Platform for
Kubernetes](https://chaos-mesh.org/)
+4. [Chaos Blade](https://chaosblade.io/)
+5. [Bullwhip
effect](https://zh.wikipedia.org/wiki/%E9%95%BF%E9%9E%AD%E6%95%88%E5%BA%94)
+6. [CMM Map](https://www.oreilly.com/content/chaos-engineering/#cmm_map_image)
+7. [Chaos Engineering (Rosenthal, C, Jones,N,
2020)](https://www.oreilly.com/library/view/chaos-engineering/9781492043850/)
\ No newline at end of file
diff --git a/docs/blog/static/img/chaos_engineering1.jpeg
b/docs/blog/static/img/chaos_engineering1.jpeg
new file mode 100644
index 00000000000..023512c1372
Binary files /dev/null and b/docs/blog/static/img/chaos_engineering1.jpeg differ
diff --git a/docs/blog/static/img/chaos_engineering2.png
b/docs/blog/static/img/chaos_engineering2.png
new file mode 100644
index 00000000000..a22da329d82
Binary files /dev/null and b/docs/blog/static/img/chaos_engineering2.png differ
diff --git a/docs/blog/static/img/chaos_engineering3.png
b/docs/blog/static/img/chaos_engineering3.png
new file mode 100644
index 00000000000..7ebdb1db5ea
Binary files /dev/null and b/docs/blog/static/img/chaos_engineering3.png differ