[shardingsphere] branch master updated: update a blog (#26362)

wuweijie Wed, 14 Jun 2023 22:22:48 -0700

This is an automated email from the ASF dual-hosted git repository.

wuweijie pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/shardingsphere.git



The following commit(s) were added to refs/heads/master by this push:
     new 09e15cab949 update a blog (#26362)
09e15cab949 is described below

commit 09e15cab949e475907062b2f206e64e6a52a16b5
Author: Nan Xiang <[email protected]>
AuthorDate: Thu Jun 15 13:21:58 2023 +0800

    update a blog (#26362)
---
 .../material/2023_06_15_Chaos_Engineering.md       | 141 +++++++++++++++++++++
 docs/blog/static/img/chaos_engineering1.jpeg       | Bin 0 -> 334820 bytes
 docs/blog/static/img/chaos_engineering2.png        | Bin 0 -> 14553 bytes
 docs/blog/static/img/chaos_engineering3.png        | Bin 0 -> 123039 bytes
 4 files changed, 141 insertions(+)

diff --git a/docs/blog/content/material/2023_06_15_Chaos_Engineering.md 
b/docs/blog/content/material/2023_06_15_Chaos_Engineering.md
new file mode 100644
index 00000000000..5d0a528f521
--- /dev/null
+++ b/docs/blog/content/material/2023_06_15_Chaos_Engineering.md
@@ -0,0 +1,141 @@
++++
+title = "Chaos Engineering: Efficient Way to Improve System Availability"
+weight = 101
+chapter = true 
++++
+
+![](https://shardingsphere.apache.org/blog/img/chaos_engineering1.jpeg)
+
+Resilience is a crucial requirement for ShardingSphere-Proxy, an essential 
database infrastructure. Testing and verifying resilience can be efficiently 
achieved through the use of chaos engineering methodology. To support 
customized chaos engineering, the 
[ShardingSphere-on-Cloud](https://shardingsphere.apache.org/oncloud/) project 
is designing and implementing a new 
[CustomResourceDefinition](https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/)
 (CRD) ca [...]
+
+# What is chaos engineering
+
+System availability is a critical metric for evaluating service reliability. 
Many methods help ensure high availability, including resilience engineering 
and related techniques. One such technique is chaos engineering, which 
deliberately introduces faults into production systems to expose weaknesses 
and improve availability.
+
+According to [Principles of Chaos](https://principlesofchaos.org/) (2019), the 
definition of chaos engineering is:
+
+> “Chaos Engineering is the discipline of experimenting on a system in order 
to build confidence in the system’s capability to withstand turbulent 
conditions in production.”
+
+In other words, chaos engineering is a practice that aims to enhance system 
robustness by detecting potential weaknesses in software systems early, 
ultimately preventing major disruptions or failures.
+
+# Why is chaos engineering needed
+
+A system can be characterized as linear or nonlinear according to how changes 
in its input affect its output.
+
+A linear system is typically predictable. There are many examples of linear 
systems in nature, such as simple mathematical functions and physical laws.
+
+In contrast, the output of a nonlinear system cannot be accurately predicted 
from its input. In a large distributed system, components interact with one 
another, and we cannot determine in advance whether the expected output will 
be produced for every input.
+
+Software is becoming increasingly complex. In typical cloud environments, 
coordinating the many components involved (Kubernetes and the services running 
on it, such as Istio, Envoy, and other infrastructure software) is becoming 
more challenging.
+
+![](https://shardingsphere.apache.org/blog/img/chaos_engineering2.png)
+<div class="caption-center"> Figure 1. Infrastructure stack for general 
service </div>
+
+The complexity and rapid change inherent to many systems often leave 
developers with a narrow view of the overall picture. For example, the 
developers of an online shopping system may not be familiar with the technical 
details of the infrastructure they adopt. As complexity grows, any single 
person’s mental model of the system becomes less accurate. Hence, gaining a 
complete understanding of a complex system is not realistic.
+
+Chaos is inherent in complex systems and describes their unknown states. 
**Chaos engineering is used to discover chaos in complex systems, learn how 
the system behaves, and develop the ability to respond to failures and restore 
the system to a steady state.**
+
+# The guidelines and practical ways of chaos engineering
+
+## Formulate a hypothesis about steady-state
+
+Every experiment begins with a hypothesis, often taking the form of “even in 
XYZ circumstances, the system remains in a steady state.” This principle 
emphasizes the establishment of hypotheses based on defining steady states. 
Therefore, we should define various indicators of the system’s normal state 
based on long-term monitoring of the production environment and focus on 
measurable outputs, rather than internal properties of the system.
+
+When identifying a steady state, it’s often essential to consider the global 
outputs of the system, such as running logs, performance logs, alerts, and 
program behavior, and abstract them into steady-state conditions. Once 
experimental variables (faults) are introduced, these steady-state conditions 
should change as expected.
+
+When the system is in the steady state we defined, we can assume that it is 
serving the outside world normally. Monitoring the steady state is also 
important, so that deviations are noticed and the system can be restored to 
the steady state in a short period of time.
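+
+As a minimal sketch of the idea above, steady-state conditions can be written 
as ranges over measurable outputs and checked against observed metrics. The 
metric names and thresholds here are hypothetical; real values would come from 
long-term monitoring of the production environment.
+
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SteadyStateCondition:
    """One measurable output and the range in which it counts as steady."""
    metric: str
    lower: float
    upper: float

    def holds(self, observed: dict) -> bool:
        # A missing metric counts as a violation: steadiness cannot be confirmed.
        value = observed.get(self.metric)
        return value is not None and self.lower <= value <= self.upper


def violated_conditions(conditions, observed):
    """Return the metric names whose steady-state condition does not hold."""
    return [c.metric for c in conditions if not c.holds(observed)]


# Hypothetical thresholds, chosen only for illustration.
CONDITIONS = [
    SteadyStateCondition("success_rate", 0.99, 1.0),
    SteadyStateCondition("p99_latency_ms", 0.0, 200.0),
]

baseline = {"success_rate": 0.999, "p99_latency_ms": 120.0}
during_fault = {"success_rate": 0.95, "p99_latency_ms": 180.0}

print(violated_conditions(CONDITIONS, baseline))      # []
print(violated_conditions(CONDITIONS, during_fault))  # ['success_rate']
```
+
+The check looks only at externally observable outputs, not at internal 
properties of the system, which matches the principle described in this section.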
+
+## Introduce diverse real-world events
+
+We should introduce real events that we care about, for example by 
reproducing faults that have occurred in the production environment, such as 
cache avalanches and service degradation.
+
+There is no need to introduce several behaviors that lead to the same fault 
symptoms. For example, exhausting all the memory, CPU, or disk of a service 
instance and ‘killing’ the instance all leave the system unable to respond to 
requests properly. Testing should focus on the system’s behavior after a fault 
occurs, rather than on how to trigger the fault.
+
+## Run experiments in the production environment
+
+When conducting experiments, we can learn about the relevant behaviors of the 
system and establish confidence in the system. If we conduct experiments in a 
test environment, we can only establish confidence in that specific test 
environment. If there are differences between the production environment and 
the test environment, we cannot establish confidence in the production 
environment.
+
+This is because a complex system works as a whole: environmental differences 
between the testing and production environments can render experiments in the 
testing environment meaningless, causing a 
“[bullwhip effect](https://zh.wikipedia.org/wiki/%E9%95%BF%E9%9E%AD%E6%95%88%E5%BA%94)”.
 However, conducting experiments in the production environment may affect users 
of the system and cause losses. We need to make trade-offs in the formal 
environment and let the experimental tools mature in the quasi-product [...]
+
+## Automate experiments
+
+When a large set of experiments must be run, automating the process is more 
efficient than manually setting up experiment environments, introducing 
faults, and gathering results. Automated experiments save time, can run 
continuously, and can cover a larger number of experiments.
+
+Experiments also need to be repeated. A hypothesis does not stay true 
forever; it can become outdated as the software iterates, so regression 
experiments should be conducted periodically.
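+
+The loop described above (check the steady state, introduce a fault, check 
again, clean up, record the result) might be automated roughly as follows. 
This is a sketch under stated assumptions: `FakeSystem` and the three hook 
callables are hypothetical stand-ins for real monitoring and fault-injection 
tooling, not the API of any particular platform.
+
```python
def run_experiment(name, is_steady, inject_fault, rollback):
    """Run one chaos experiment and return a result record.

    is_steady, inject_fault, and rollback are caller-supplied callables
    (hypothetical hooks into monitoring and fault-injection tooling).
    """
    if not is_steady():
        # Never inject faults into a system that is already unhealthy.
        return {"name": name, "status": "skipped"}
    inject_fault()
    try:
        held = is_steady()
    finally:
        # Always remove the fault, even if the steady-state check raises.
        rollback()
    return {"name": name, "status": "passed" if held else "failed"}


def run_suite(experiments):
    """Run every experiment in order; meant to be triggered on a schedule."""
    return [run_experiment(**e) for e in experiments]


class FakeSystem:
    """In-memory stand-in for a real system, used only for this demo."""

    def __init__(self, tolerated):
        self.tolerated = set(tolerated)
        self.active = None

    def inject(self, fault):
        self.active = fault

    def rollback(self):
        self.active = None

    def is_steady(self):
        return self.active is None or self.active in self.tolerated


system = FakeSystem(tolerated={"pod-kill"})
suite = [
    {
        "name": fault,
        "is_steady": system.is_steady,
        "inject_fault": (lambda f=fault: system.inject(f)),
        "rollback": system.rollback,
    }
    for fault in ("pod-kill", "network-delay")
]
results = run_suite(suite)
print(results)
```
+
+Running the same suite on a schedule gives the periodic regression 
experiments mentioned above: a hypothesis that held last month is checked 
again after the software has changed.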
+
+## Minimize the blast radius
+
+Safe experiment methods, such as using traffic shadowing or selecting a 
suitable time period, can reduce the risk to the production environment. 
Comparing an indicator between a small variable (experimental) group and a 
small control group also makes differences more significant.
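+
+One possible way to keep both groups small is to assign users to them 
deterministically. The sketch below is an assumption for illustration, not a 
method taken from this article: it hashes a hypothetical user ID into 10,000 
buckets and places a fixed share of buckets in each group.
+
```python
import hashlib


def assign_group(user_id: str, percent: float = 1.0) -> str:
    """Map a user to 'experiment', 'control', or 'unaffected'.

    percent is the share of users (0 to 50) placed in EACH of the two groups.
    """
    if not 0 <= percent <= 50:
        raise ValueError("percent must be between 0 and 50")
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    threshold = round(percent * 100)  # e.g. 1.0 percent -> 100 buckets
    if bucket < threshold:
        return "experiment"
    if bucket < 2 * threshold:
        return "control"
    return "unaffected"


# Count how 20,000 hypothetical users are distributed across the groups.
counts = {"experiment": 0, "control": 0, "unaffected": 0}
for i in range(20_000):
    counts[assign_group(f"user-{i}")] += 1
print(counts)
```
+
+Because the assignment depends only on the ID, a user stays in the same group 
for the whole experiment, and faults are injected only for the small 
`experiment` share of traffic.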
+
+# Chaos maturity model
+
+The chaos maturity model provides a [model 
map](https://www.oreilly.com/content/chaos-engineering/#cmm_map_image) on 
which different chaos engineering practices can be positioned and measured.
+
+
+![](https://shardingsphere.apache.org/blog/img/chaos_engineering3.png)
+<div class="caption-center"> Figure 2. Chaos Maturity Map </div>
+
+There are two axes on the map, adoption on the X-axis and sophistication on 
the Y-axis, which can be explored separately:
+
+## Adoption
+
+As chaos engineering matures, its tooling needs to reach a level at which 
robustness validation alone can significantly affect the compliance process. 
Initial adoption of chaos engineering, however, generally starts from scratch.
+
+## Sophistication
+
+Sophistication can be measured in different ways, such as whether 
consultation services are provided and whether a set of tools is provided. 
Because software infrastructure is so diverse, no tool can abstract 
sophisticated chaos engineering experiments for all environments and apply 
them in practice. Thus, **chaos engineering practices initially rely on 
substantial manual effort, and customized solutions are developed gradually**.
+
+Another way to understand sophistication is to consider the system level at 
which experimental variables are introduced. Experiments typically start at 
the infrastructure level, commonly by killing pods or virtual machines. As the 
tools become more sophisticated, chaos injection logic may be introduced into 
the target system, impacting the reque [...]
+
+Additionally, experiments become more complex when experimental variables 
affect business logic. For instance, returning plausible but unexpected 
responses to a service’s requests can cause programs to produce different 
results. Experiments thus progress from the infrastructure layer to the 
application layer, and then to the business logic layer. Moreover, 
low-granularity experiments such as those that tend to trigger potential 
faults in the business logic layer are  [...]
+
+# Continuous verification
+
+> “[Continuous verification (CV) is a discipline of proactive experimentation, 
implemented as tooling that verifies system 
behaviors.](https://www.oreilly.com/library/view/chaos-engineering/9781492043850/)”
 - Casey Rosenthal
+
+Tools for continuous verification are a prime example of sophistication in 
the chaos maturity model. CV, like CI/CD, arose from the need to operate 
increasingly complex systems. Due to resource constraints, system developers 
cannot afford to check every internal detail of the system, and must instead 
focus on checking that the system’s output meets expectations. That is why CV 
concentrates on outputs rather than internals, which is also a sign of 
successfully managing complex systems.
+
+There are at least three types of continuous verification: feature testing, 
data artifacts, and correctness.
+
+**Feature Testing**: reports and conclusions for this type of test are built 
on various performance indicators (concurrency, latency deviation, execution 
speed, etc.) and on observation of actual production traffic.
+
+**Data artifacts**: databases and storage applications have various 
requirements for the characteristics of writing and retrieving data, such as 
transaction consistency, idempotence, and appropriate data isolation levels.
+
+**Correctness:** not every form of correctness shows up as a particular state 
or ideal property. In some cases, the interaction between components must be 
governed by interface contracts or agreements. When an interface returns a 
seemingly correct result that the caller’s logic does not anticipate, 
unexpected errors may occur. Such issues arise because each layer of code is 
logically consistent on its own, yet the layers are inconsistent with one 
another.
+
+# Open-source chaos engineering platforms
+
+## [Litmus Chaos](https://litmuschaos.io/)
+
+Litmus Chaos is a chaos engineering platform that provides cross-cloud 
services. It’s a CNCF open-source project that many organizations have used. 
[Litmus Chaos](https://litmuschaos.io/)’s mission is to help Kubernetes SREs 
and developers find weaknesses in non-Kubernetes platforms and in applications 
that run on Kubernetes.
+
+## [Chaos Mesh](https://chaos-mesh.org/)
+
+Chaos Mesh is a chaos engineering platform open-sourced by PingCAP. It offers 
strong orchestration of failure scenarios and a comprehensive set of failure 
simulation types, allowing users to simulate faults that might occur in 
production and testing environments and helping them identify potential 
failures. Chaos Mesh provides comprehensive visual tools to help beginner 
programmers conveniently run and monitor their own chaos scenarios. Chaos Mesh 
was developed based on Kuber [...]
+
+* Chaos Dashboard: the visualization component of Chaos Mesh. It provides a 
user-friendly web UI that allows users to design and monitor chaos experiments 
and to manage RBAC permissions.
+* Chaos Controller Manager: the core logical component of Chaos Mesh, 
responsible for scheduling the Chaos CRs that users design. It includes many 
CRD controllers, such as the PodChaos Controller and the Workflow Controller.
+* Chaos Daemon: the main execution component of Chaos Mesh. Chaos Daemon runs 
as a DaemonSet and holds privileged access by default (which can be disabled). 
Generally, this component interferes with network devices, file systems, and 
kernels by entering the target Pod’s namespace.
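+
+To give a sense of what a Chaos CR scheduled by the Chaos Controller Manager 
can look like, the sketch below builds a `PodChaos` manifest as a Python 
dictionary and prints it as JSON, which `kubectl apply -f` also accepts. The 
experiment name, target namespace, and label are hypothetical placeholders, 
and the Chaos Mesh documentation remains the authoritative reference for the 
available fields.
+
```python
import json


def pod_kill_manifest(name, target_namespace, labels, chaos_namespace="chaos-mesh"):
    """Build a PodChaos resource that kills one matching pod.

    All argument values are supplied by the caller; the ones used in the
    demo below are placeholders, not names from a real cluster.
    """
    return {
        "apiVersion": "chaos-mesh.org/v1alpha1",
        "kind": "PodChaos",
        "metadata": {"name": name, "namespace": chaos_namespace},
        "spec": {
            "action": "pod-kill",  # fault type: kill the selected pod
            "mode": "one",         # affect a single pod among the matches
            "selector": {
                "namespaces": [target_namespace],
                "labelSelectors": dict(labels),
            },
        },
    }


manifest = pod_kill_manifest(
    name="proxy-pod-kill-demo",       # hypothetical experiment name
    target_namespace="demo",          # hypothetical namespace
    labels={"app": "example-proxy"},  # hypothetical pod label
)
print(json.dumps(manifest, indent=2))
```
+
+Using `mode: one` limits the fault to a single pod, which keeps the blast 
radius of a first experiment small.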
+
+## [Chaos Blade](https://chaosblade.io/)
+
+Chaos Blade is a chaos engineering project designed and open-sourced by 
Alibaba in 2019. It includes the chaos engineering experiment tool Chaos Blade 
and the platform ChaosBlade-Box. It helps enterprises solve high availability 
issues during cloud-native adoption through chaos engineering.
+
+Chaos Blade supports three major platforms and applications in four 
programming languages, and covers over 200 experimental scenarios with over 
3000 parameters, allowing fine control of the experimental scope.
+
+ChaosBlade-Box manages experimental tools. In addition to managing Chaos 
Blade, it supports aggregating experimental tools from other platforms such as 
Litmus Chaos.
+
+# Conclusion
+
+To introduce chaos engineering into a system, we can refer to the chaos 
maturity model and start with simple inputs. In our community, for example, we 
can agree on a date for developers of the system’s various components to 
perform a fault test together and record the results, which strengthens 
contributors’ sense of participation and their appreciation of chaos 
engineering. We’ll then observe system behavior, define steady states, and 
design reasonable chaos experiment plans. These experi [...]
+
+# References
+
+1. [Principles of Chaos Engineering](https://principlesofchaos.org/)
+2. [LitmusChaos](https://litmuschaos.io/)
+3. [Chaos Mesh: A Powerful Chaos Engineering Platform for 
Kubernetes](https://chaos-mesh.org/)
+4. [Chaos Blade](https://chaosblade.io/)
+5. [Bullwhip 
effect](https://zh.wikipedia.org/wiki/%E9%95%BF%E9%9E%AD%E6%95%88%E5%BA%94)
+6. [CMM Map](https://www.oreilly.com/content/chaos-engineering/#cmm_map_image)
+7. [Chaos Engineering (Rosenthal, C., Jones, N., 
2020)](https://www.oreilly.com/library/view/chaos-engineering/9781492043850/)
\ No newline at end of file
diff --git a/docs/blog/static/img/chaos_engineering1.jpeg 
b/docs/blog/static/img/chaos_engineering1.jpeg
new file mode 100644
index 00000000000..023512c1372
Binary files /dev/null and b/docs/blog/static/img/chaos_engineering1.jpeg differ
diff --git a/docs/blog/static/img/chaos_engineering2.png 
b/docs/blog/static/img/chaos_engineering2.png
new file mode 100644
index 00000000000..a22da329d82
Binary files /dev/null and b/docs/blog/static/img/chaos_engineering2.png differ
diff --git a/docs/blog/static/img/chaos_engineering3.png 
b/docs/blog/static/img/chaos_engineering3.png
new file mode 100644
index 00000000000..7ebdb1db5ea
Binary files /dev/null and b/docs/blog/static/img/chaos_engineering3.png differ
