Infection-induced cascading failures – impact and mitigation

Li, Bo; Saad, David

doi:10.1038/s42005-024-01638-1

Article
Open access
Published: 04 May 2024

Communications Physics volume 7, Article number: 144 (2024)

Abstract

In the context of epidemic spreading, many intricate dynamical patterns can emerge due to the cooperation of different types of pathogens or the interaction between the disease spread and other failure propagation mechanism. To unravel such patterns, simulation frameworks are usually adopted, but they are computationally demanding on big networks and subject to large statistical uncertainty. Here, we study the two-layer spreading processes on unidirectionally dependent networks, where the spreading infection of diseases or malware in one layer can trigger cascading failures in another layer and lead to secondary disasters, e.g., disrupting public services, supply chains, or power distribution. We utilize a dynamic message-passing method to devise efficient algorithms for inferring the system states, which allows one to investigate systematically the nature of complex intertwined spreading processes and evaluate their impact. Based on such dynamic message-passing framework and optimal control, we further develop an effective optimization algorithm for mitigating network failures.

Introduction

Epidemic outbreaks not only pose a direct threat to public health but also, indirectly, impact other sectors^1,2,3. For instance, when many infected individuals have to rest, be hospitalized or quarantined in order to slow down the epidemic spread, this could severely disrupt public services, causing disutility even to those who are not infected. Likewise, highly interdependent supply chains can be easily disrupted by epidemic outbreaks^4,5. Similar concerns apply to cyber security. The spread of malware is not merely detrimental to computer networks, but can also cause failures in power grids or urban transportation networks which rely on modern communication systems^6,7. Even worse, the failures of certain components of technological networks can by themselves trigger a cascade of secondary failures, which can eventually lead to large-scale outages⁸. Therefore, it is vital to understand the nature of epidemic (or malware) spreading and failure propagation on interacting networks, based on which further mitigation and control measures can be devised.

A number of previous papers address the scenario of interacting spreading processes. In the context of epidemic spreading, two types of pathogens can cooperate or compete with each other, creating many intricate patterns of disease propagation^9,10,11,12,13,14,15. For interacting technological networks (e.g., communication and power networks), the failure of components in one network layer will not only affect neighboring parts within the same network, but will also influence the second network layer through the cross-layer connections. Macroscopic analyses based on simplified models show that such a spreading mechanism can easily result in a catastrophic breakdown of the whole system^16,17,18.

Most existing research in the area of multi-layer spreading processes employs macroscopic approaches, such as the degree-distribution-based mean-field methods and asymptotic percolation analysis, in order to obtain the global picture of the models’ behavior¹⁹. Such methods typically do not consider specific network instances and lack the ability to treat the interplay between the spreading dynamics and the fine-grained network topology¹⁹. For stochastic spreading processes with specific system conditions (e.g., topology, initial conditions and individual node properties), it is common to apply extensive Monte Carlo (MC) simulations to observe the evolution of the spread, based on which important policy decisions are made²⁰. However, such simulations are computationally demanding on big networks and can be subject to large statistical uncertainty; as a result, they are difficult to use for downstream analysis or optimization tasks. Therefore, researchers have been pursuing tractable and accurate theoretical methods to tackle the complex stochastic dynamics on networks^19,21.

Among the various theoretical approaches developed, dynamic message-passing (DMP), which is based on ideas from statistical physics, offers a desirable algorithmic framework for approximate inference while remaining computationally efficient^22,23,24. The DMP method has been shown to be more accurate than the widely adopted individual-based mean-field method, especially in sparse networks^25,26. Moreover, the DMP approach yields a set of closed-form equations, which is very convenient for additional parameter estimation and optimization tasks^14,27,28.

In this work, we study a scenario where the epidemic or malware spreading on one network can trigger cascading failures on another. This is relevant in cases where epidemic outbreaks cause disruption in public services or economic activities. Similarly, it can also be applied to study the effect of malware spread on computer networks causing the breakdown of other technological networks such as the power grid. The latter phenomenon is gaining more and more attention due to the increasing interactions among various engineering networks⁷. We explore the dynamics and consequences of such infection-induced cascading failures across two-layer networks using the DMP method. Our results reveal that even relatively low infection rates can induce large-scale cascading failures, leading to widespread network disruptions. We characterize these phenomena through the derivation and analysis of DMP equations, and link the process analytically to combined bond and bootstrap percolation models. Leveraging the analytical tractability of the DMP model, we also develop optimization algorithms that effectively mitigate these network failures. By adjusting control parameters based on the back-propagation of final state impacts, these algorithms help minimize the size of system failure.

Methods

Model and framework

The model

To study the impact of infection spread of diseases or malware and their secondary effects, we consider multiplex networks comprising two layers²⁹, which are denoted as layers a and b, and are represented by two graphs G_a(V_a, E_a) and G_b(V_b, E_b). For convenience, we assume that the nodes in both layers correspond to the same set of individuals, denoted as V = V_a = V_b. This can be extended to more general settings. Denote ${\partial }_{i}^{a}$ and ${\partial }_{i}^{b}$ as the sets of nodes adjacent to node i in layers a and b, respectively. We also define ${\partial }_{i}={\partial }_{i}^{a}\cup {\partial }_{i}^{b}$. See Fig. 1 for an example of the network model under consideration.

**Fig. 1: An example of the two-layer spreading process considered in this work.**

Each node has states on both layers a and b. In layer a, each node assumes one of four states, susceptible (S), infected (I), recovered (R), and protected (P) at any particular time step. The infection spreading process occurs in layer a only, which is dictated by the stochastic discrete-time SIR model¹⁹ augmented with a protection mechanism, which we term the SIRP model. Stochastic models are commonly employed for modeling the spreads of epidemics or malware^20,30,31. The stochastic SIR model is commonly used for representing the spread of infections, wherein a susceptible individual (in state S) may become infected through contact with infected neighbors, and an infected individual (in state I) can recover, transitioning to the recovered state (R) after a certain period. The process we consider is based on the SIR model but includes one more state, P, in layer a; it admits the following state-transition rule

$$\begin{array}{c} S(i)+I(j) {\longrightarrow}^{\beta_{ji}} I(i)+I(j),\\ I(i) {\longrightarrow}^{\mu_{i}} R(i) \\ S(i) {\longrightarrow}^{\gamma_{i}(t)} P(i), \end{array}$$

(1)

where β_ji is the probability that node j being in the infected state transmits the infection to its susceptible neighboring node i at a certain time step. At each time step, an existing infected node i recovers with probability μ_i; the recovery process is assumed to occur after possible transmission activities. At time t, an existing susceptible node i turns into state P if it receives protection at time t − 1, which occurs with probability γ_i(t − 1). The protection can be achieved by vaccination in the epidemic setting or special protection measures in the malware spread setting, which is usually subject to certain budget constraints. The protection probabilities {γ_i(t)} will be the major control variables for mitigating the outbreaks. Note that when no protection is provided, i.e., all {γ_i(t)} are zero, the SIRP model reduces to the traditional SIR model. At initial time t = 0, we assume that node i has a probability ${P}_{S}^{i}(0)$ to be in state S, and probability ${P}_{I}^{i}(0)=1-{P}_{S}^{i}(0)$ to be in state I.

In layer b, each node i can either be in the normal state (N) or the failed state (F), indicated by a binary state variable x_i where x_i = 1 (0) denotes the ‘fail’ (‘normal’) state at a particular time step. A node i in layer b fails if either (i) it has been infected, i.e., node i is in state I or R in layer a, or (ii) there exist neighboring failed nodes such that ${\sum }_{j\in {\partial }_{i}^{b}}{x}_{j}{b}_{ji}\ge {\Theta }_{i}$, where Θ_i is a threshold and the influence parameter b_ji measures the importance of the failure of node j on node i. The latter case indicates that node i can fail due to the failures of its neighbors which it relies on, even though node i itself is not infected. In summary, the failure propagation process in layer b can be expressed as

$${x}_{i}=\left\{\begin{array}{ll}1,\quad &\,{{\mbox{either (i)}}}\,\,\,{{\mbox{node}}}\,\,i\,\,{{\mbox{in state}}}\,\,I\,\,{{\mbox{or}}}\,\,R\,\,{{\mbox{in layer}}}\,\,a,\\ \quad &{{\mbox{or (ii)}}}{ \sum }_{j \in {\partial }_{i}^{b}}{x}_{j}{b}_{ji} \ge {\Theta }_{i}{{\mbox{in layer}}}\,b;\hfill\\ 0,\quad &\,{{\mbox{otherwise}}}.\hfill\end{array}\right.$$

(2)

The whole process is simulated for T time steps. As we are interested in the time scale of infection spread which is usually very fast, we do not consider any repair rule in layer b. Therefore, a failed node cannot return to normality within the time window under consideration.
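The rules in Eqs. (1) and (2) can be sketched as a single Monte Carlo run. This is a minimal illustration and not the authors' simulation code: the function name `simulate_sirp_ltm`, the dictionary-based inputs, and the synchronous update order are assumptions. In particular, protection is assumed to take precedence over infection within a step, and a node in layer b is assumed to react to the failures of its neighbors from the previous step.

```python
import random


def simulate_sirp_ltm(adj_a, adj_b, beta, mu, gamma, b, theta, seeds, T, rng=None):
    """One stochastic run of the two-layer SIRP + LTM process (illustrative sketch).

    adj_a, adj_b : dict node -> list of neighbours in layer a / layer b
    beta         : dict (j, i) -> transmission probability from j to i
    mu           : dict node -> recovery probability
    gamma        : callable (node, t) -> protection probability gamma_i(t)
    b            : dict (j, i) -> influence of the failure of j on i
    theta        : dict node -> failure threshold Theta_i
    seeds        : set of nodes infected at t = 0
    """
    rng = rng or random.Random()
    nodes = list(adj_a)
    state = {i: ("I" if i in seeds else "S") for i in nodes}
    failed = {i: state[i] == "I" for i in nodes}  # infected nodes fail in layer b

    for t in range(1, T + 1):
        new_state = dict(state)
        for i in nodes:
            if state[i] == "S":
                # Assumption: protection is drawn first, then infection attempts.
                if rng.random() < gamma(i, t - 1):
                    new_state[i] = "P"
                elif any(state[j] == "I" and rng.random() < beta[(j, i)]
                         for j in adj_a[i]):
                    new_state[i] = "I"
            elif state[i] == "I" and rng.random() < mu[i]:
                # Recovery is attempted after the transmission attempts above,
                # which all read the old `state`.
                new_state[i] = "R"

        new_failed = {}
        for i in nodes:
            load = sum(b[(j, i)] for j in adj_b[i] if failed[j])
            new_failed[i] = (failed[i]                      # no repair
                             or new_state[i] in ("I", "R")  # rule (i) of Eq. (2)
                             or load >= theta[i])           # rule (ii) of Eq. (2)
        state, failed = new_state, new_failed
    return state, failed


# Deterministic toy run: chain 1-2-3 in layer a, single edge 2-4 in layer b.
adj_a = {1: [2], 2: [1, 3], 3: [2], 4: []}
adj_b = {1: [], 2: [4], 3: [], 4: [2]}
beta = {(1, 2): 1.0, (2, 1): 1.0, (2, 3): 1.0, (3, 2): 1.0}
mu = {i: 0.0 for i in adj_a}
b = {(2, 4): 1.0, (4, 2): 1.0}
theta = {i: 1.0 for i in adj_a}
state, failed = simulate_sirp_ltm(adj_a, adj_b, beta, mu, lambda i, t: 0.0,
                                  b, theta, seeds={1}, T=3)
```

In the toy run, node 4 is never infected (it has no layer-a neighbors) but still fails once node 2 has failed, which is the secondary effect the model is designed to capture.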

Such a failure propagation mechanism is equivalent to the linear threshold model (LTM), which is commonly used in studying social contagion and other cascade processes^19,32,33. The LTM model also offers a straightforward yet effective framework for understanding cascading failures in various systems, as it effectively encapsulates the pivotal dynamics where a component can become dysfunctional if a significant number of its dependent components fail^18,32. Other popular models for cascading failures incorporate more details of the system functionalities^34,35,36; these models require theoretical analyses specific to each case, which fall outside the scope of the current study.

Figure 1 illustrates the infection-induced cascades of our model in a simple network of 4 nodes. Node 1 is the initial infected node (or the seed) in layer a, which transmits the infection to node 2 at a certain time step. Now that node 2 is in the infected state in layer a, it also fails to function in layer b. If b₂₄ ≥ Θ₄, then node 4 will also fail as it loses the support from node 2, even though node 4 itself has not been infected. Such additional cascade propagation needs extra care when infections spread out. Similar interacting SIR (without a protection mechanism) and LTM processes have also been considered in the social contagion setting³⁷.

We reiterate that the infection-spreading process (described by the SIRP model) occurs in layer a only and not the entire network, while the cascade process (described by the LTM model) occurs in layer b. Typically, a holistic treatment of the combined two-layer processes is needed to understand their impact and develop mitigation strategies. We also remark that our model differs from the traditional settings of interdependent networks, which typically include reciprocal dependency.

The DMP framework

We aim to use the DMP approach to investigate the two-layer spreading processes described above. The DMP equations of the usual SIR and the LTM model have been derived, based on the microscopic dynamic belief propagation equations^24,38. As in generic belief propagation methods³⁹, the DMP method is exact for tree graphs, while it can constitute a good approximation for loopy graphs, particularly when short loops, such as those spanning 3 or 4 nodes, are scarce. The two-layer spreading processes combining the SIR and LTM model appear more involved, where approximations relying on uncorrelated multiplex networks were used³⁷. Such approximations become less adequate when the two network layers are correlated, e.g., both layers share the same network topology.

Dynamic belief propagation

To devise more accurate DMP equations for general network models and accommodate the protection mechanism for mitigation, we start from the principled dynamic belief propagation equations of the two-layer processes. One important characteristic of our model is that state transition is unidirectional, which can only take the direction S → I → R or S → P in layer a, and N → F in layer b. Note that layer b does not influence layer a. As a result, our model admits a reduced representation of the system’s dynamical trajectories that subsequently facilitates a drastic simplification of the derivation of the DMP equations, which are exact on tree networks²⁴. Nevertheless, we emphasize that the exactness of the DMP formalism for tree networks is conditioned on the unidirectional nature of the model, which no longer holds if layer b also influences layer a. Introducing reciprocal interactions between both model layers requires additional theoretical tools, which are interesting by themselves but are beyond the scope of the current study.

Following previous works^24,38, we parametrize the dynamical trajectory of each node by its state transition times. In layer a, we denote ${\tau }_{i}^{a},{\omega }_{i}^{a}$ and ${\varepsilon }_{i}^{a}$ as the first time at which node i turns into state I, R and P, respectively. In layer b, we denote ${\tau }_{i}^{b}$ as the first time at which node i turns into state F. The quantity of interest is the probability of the trajectory of node i considered in the entire graph comprising layers a and b but having a cavity where node j is absent, denoted as ${m}^{i\to j}({\tau }_{i}^{a},{\omega }_{i}^{a},{\varepsilon }_{i}^{a},{\tau }_{i}^{b})$. Throughout the manuscript, we will refer to probabilities defined within a cavity graph as cavity probabilities. It is computed by the following dynamic belief propagation equations

$$\begin{array}{lll}&&{m}^{i\to j}({\tau }_{i}^{a},{\omega }_{i}^{a},{\varepsilon }_{i}^{a},{\tau }_{i}^{b})\\ &=&{\sum}_{{\{{\tau }_{k}^{a},{\omega }_{k}^{a},{\varepsilon }_{k}^{a},{\tau }_{k}^{b}\}}_{k\in {\partial }_{i}}}{W}_{{{{{{\rm{SIRP}}}}}}}^{i}({\tau }_{i}^{a},{\omega }_{i}^{a},{\varepsilon }_{i}^{a}| | {\{{\tau }_{k}^{a},{\omega }_{k}^{a},{\varepsilon }_{k}^{a}\}}_{k\in {\partial }_{i}^{a}})\\ &&\times {W}_{{{{{{\rm{LTM}}}}}}}^{i}({\tau }_{i}^{b}| | {\tau }_{i}^{a},{\varepsilon }_{i}^{a},{\{{\tau }_{k}^{b}\}}_{k\in {\partial }_{i}^{b}})\\ &&\times {\prod}_{k\in {\partial }_{i}\backslash j}{m}^{k\to i}({\tau }_{k}^{a},{\omega }_{k}^{a},{\varepsilon }_{k}^{a},{\tau }_{k}^{b}),\end{array}$$

(3)

where ${W}_{{{{{{\rm{SIRP}}}}}}}^{i}(\cdot )$ and ${W}_{{{{{{\rm{LTM}}}}}}}^{i}(\cdot )$ are the transition kernels dictated by the dynamical rules of the SIRP and LTM model, respectively (for details see Supplementary Note 1). The marginal probability of the trajectory of node i, denoted as ${m}^{i}({\tau }_{i}^{a},{\omega }_{i}^{a},{\varepsilon }_{i}^{a},{\tau }_{i}^{b})$, can be computed in a similar way as Eq. (3), by replacing the product ${\prod }_{k\in {\partial }_{i}\backslash j}$ in the last line of Eq. (3) by ${\prod }_{k\in {\partial }_{i}}$. That is, the marginal probability mⁱ( ⋅ ) is calculated using the entire graph, in contrast to the cavity probability m^i→j( ⋅ ) which is determined with a cavity graph where node j is absent.

The probability of node i in a certain state can be computed by summing the trajectory-level probability, which will be described in the next section.

Full node-level DMP equations

The cavity probability of node i being in state S in layer a at time t (with node j removed from the graph) is obtained by tracing over the corresponding trajectory probabilities m^i→j( ⋅ ) in the cavity graph

$$\begin{array}{r}{P}_{S}^{i\to j}(t)={\sum}_{{\tau }_{i}^{a},{\omega }_{i}^{a},{\varepsilon }_{i}^{a},{\tau }_{i}^{b}}{\mathbb{I}}(t < {\tau }_{i}^{a} < {\omega }_{i}^{a}){\mathbb{I}}(t < {\varepsilon }_{i}^{a})\\ \times {m}^{i\to j}({\tau }_{i}^{a},{\omega }_{i}^{a},{\varepsilon }_{i}^{a},{\tau }_{i}^{b}),\end{array}$$

(4)

where ${\mathbb{I}}(\cdot )$ is the indicator function enforcing the order of state transitions. Similarly, we denote the cavity probability of node i in state F in layer b (in the absence of node j) as ${P}_{F}^{i\to j}(t)$; it is obtained by

$$\begin{array}{r}{P}_{F}^{i\to j}(t)={\sum}_{{\tau }_{i}^{a},{\omega }_{i}^{a},{\varepsilon }_{i}^{a},{\tau }_{i}^{b}}{\mathbb{I}}({\tau }_{i}^{b}\le t){m}^{i\to j}({\tau }_{i}^{a},{\omega }_{i}^{a},{\varepsilon }_{i}^{a},{\tau }_{i}^{b}).\end{array}$$

(5)

The marginal probabilities ${P}_{S}^{i}(t)$ and ${P}_{F}^{i}(t)$ can be computed in a similar manner, by replacing m^i→j( ⋅ ) in Eq. (4) and Eq. (5) with mⁱ( ⋅ ).

DMP equations in Layer a

We note that infection spread in layer a is not influenced by cascades in layer b, while the failure time in layer b depends on the infection time and the protection time of the corresponding node in layer a. Hence, we can decompose the message m^i→j( ⋅ ) to the respective components as

$$\begin{array}{r}{m}^{i\to j}({\tau }_{i}^{a},{\omega }_{i}^{a},{\varepsilon }_{i}^{a},{\tau }_{i}^{b})={m}_{a}^{i\to j}({\tau }_{i}^{a},{\omega }_{i}^{a},{\varepsilon }_{i}^{a})\quad \\ \times {m}_{b}^{i\to j}({\tau }_{i}^{b}| {\tau }_{i}^{a},{\varepsilon }_{i}^{a}).\end{array}$$

(6)

where ${m}_{a}^{i\to j}(\cdot )$ and ${m}_{b}^{i\to j}(\cdot )$ denote the trajectory-level probabilities of the processes in layer a and b, respectively. Note that the messages {m^i→j( ⋅ )} live in the entire network comprising layers a and b, which implies that $\{{m}_{a}^{i\to j}(\cdot ),{m}_{b}^{i\to j}(\cdot )\}$ are also defined on the entire network.

Summing ${m}_{a}^{i\to j}(\cdot )$ over ${\tau }_{i}^{a},{\omega }_{i}^{a},{\varepsilon }_{i}^{a}$ up to a certain time yields the normal DMP equations of node-level probabilities for the infection spread in layer a (see details in Supplementary Note 1). They admit the following expressions for t > 0

$${P}_{S}^{i\to j}(t)={P}_{S}^{i}(0){\prod }_{{t}^{{\prime} }=0}^{t-1}\left[1-{\gamma }_{i}({t}^{{\prime} })\right]{\prod}_{k\in {\partial }_{i}^{a}\backslash j}{\theta }^{k\to i}(t),$$

(7)

$${\theta }^{k\to i}(t)={\theta }^{k\to i}(t-1)-{\beta }_{ki}{\phi }^{k\to i}(t-1),$$

(8)

$${\phi }^{k\to i}(t)=\left(1-{\beta }_{ki}\right)\left(1-{\mu }_{k}\right){\phi }^{k\to i}(t-1)-\left\{{P}_{S}^{k\to i}(t)-{P}_{S}^{k\to i}(t-1)\left[1-{\gamma }_{k}(t-1)\right]\right\},$$

(9)

where θ^k→i(t) is the cavity probability that node k has not transmitted the infection signal to node i up to time t, and ϕ^k→i(t) is the cavity probability that k is in state I but has not transmitted the infection signal to node i up to time t. Note that the messages $\{{P}_{S}^{k\to i}(t),{\theta }^{k\to i}(t),{\phi }^{k\to i}(t)\}$ are only needed for edges belonging to layer a where the SIRP model is defined.

At time t = 0, as we consider that each node i is either in state S with probability ${P}_{S}^{i}(0)$ or in state I with probability $1-{P}_{S}^{i}(0)$, we have the following initial conditions for the messages

$$\begin{array}{rcl}{P}_{S}^{i\to j}(0)&=&{P}_{S}^{i}(0),\\ {\phi }^{i\to j}(0)&=&1-{P}_{S}^{i}(0),\\ {\theta }^{i\to j}(0)&=&1.\end{array}$$

(10)

Upon iterating the above messages (7)–(9) starting from the initial conditions (10), the node-level marginal probabilities can be computed as

$${P}_{S}^{i}(t)={P}_{S}^{i}(0){\prod }_{{t}^{{\prime} }=0}^{t-1}\left[1-{\gamma }_{i}({t}^{{\prime} })\right]{\prod}_{k\in {\partial }_{i}^{a}}{\theta }^{k\to i}(t),$$

(11)

$${P}_{R}^{i}(t)={P}_{R}^{i}(t-1)+{\mu }_{i}{P}_{I}^{i}(t-1),$$

(12)

$${P}_{P}^{i}(t)={P}_{P}^{i}(t-1)+{\gamma }_{i}(t-1){P}_{S}^{i}(t-1),$$

(13)

$${P}_{I}^{i}(t)=1-{P}_{S}^{i}(t)-{P}_{R}^{i}(t)-{P}_{P}^{i}(t).$$

(14)

The above DMP Eqs. (11)–(14) bear similarity to those of the SIR model²³, except for the protection mechanism with control parameters {γ_i(t)}. The computational complexity for obtaining the messages for the SIRP process in layer a over a total time T is O(∣E_a∣T), where ∣E_a∣ denotes the number of edges in layer a.
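The layer-a iteration can be written compactly in code. The following is a minimal sketch of Eqs. (7)–(14) with the initial conditions (10); the function name `dmp_sirp` and the dictionary-based data layout are illustrative assumptions. For readability, the cavity products are recomputed naively for every directed edge, so this version does not attain the O(∣E_a∣T) cost quoted above.

```python
def dmp_sirp(adj_a, beta, mu, gamma, ps0, T):
    """Iterate the layer-a DMP equations for T steps (illustrative sketch).

    adj_a : dict node -> list of layer-a neighbours
    beta  : dict (k, i) -> transmission probability from k to i
    mu    : dict node -> recovery probability
    gamma : callable (node, t) -> protection probability gamma_i(t)
    ps0   : dict node -> P_S^i(0)
    Returns a dict of time series, e.g. out["PS"][t][i] = P_S^i(t).
    """
    nodes = list(adj_a)
    edges = [(k, i) for i in nodes for k in adj_a[i]]  # directed messages k -> i

    # Initial conditions, Eq. (10)
    theta = {e: 1.0 for e in edges}
    phi = {(k, i): 1.0 - ps0[k] for (k, i) in edges}
    ps_cav = {(k, i): ps0[k] for (k, i) in edges}
    PS = [dict(ps0)]
    PI = [{i: 1.0 - ps0[i] for i in nodes}]
    PR = [{i: 0.0 for i in nodes}]
    PP = [{i: 0.0 for i in nodes}]
    unprotected = {i: 1.0 for i in nodes}  # prod_{t' < t} [1 - gamma_i(t')]

    for t in range(1, T + 1):
        # Eq. (8): theta(t) uses phi(t-1)
        theta = {e: theta[e] - beta[e] * phi[e] for e in edges}
        for i in nodes:
            unprotected[i] *= 1.0 - gamma(i, t - 1)

        # Eq. (7): cavity P_S^{k->i}(t), excluding i from the neighbours of k
        new_ps_cav = {}
        for (k, i) in edges:
            prod = 1.0
            for l in adj_a[k]:
                if l != i:
                    prod *= theta[(l, k)]
            new_ps_cav[(k, i)] = ps0[k] * unprotected[k] * prod

        # Eq. (9): phi(t) from phi(t-1) and the newly infected probability mass
        phi = {
            (k, i): (1.0 - beta[(k, i)]) * (1.0 - mu[k]) * phi[(k, i)]
            - (new_ps_cav[(k, i)] - ps_cav[(k, i)] * (1.0 - gamma(k, t - 1)))
            for (k, i) in edges
        }
        ps_cav = new_ps_cav

        # Marginals, Eqs. (11)-(14)
        ps, pr, pp, pi = {}, {}, {}, {}
        for i in nodes:
            prod = 1.0
            for k in adj_a[i]:
                prod *= theta[(k, i)]
            ps[i] = ps0[i] * unprotected[i] * prod
            pr[i] = PR[-1][i] + mu[i] * PI[-1][i]
            pp[i] = PP[-1][i] + gamma(i, t - 1) * PS[-1][i]
            pi[i] = 1.0 - ps[i] - pr[i] - pp[i]
        PS.append(ps)
        PI.append(pi)
        PR.append(pr)
        PP.append(pp)
    return {"PS": PS, "PI": PI, "PR": PR, "PP": PP}


# Two-node check: node 1 is the seed, node 2 starts susceptible, no protection.
out = dmp_sirp(adj_a={1: [2], 2: [1]},
               beta={(1, 2): 0.5, (2, 1): 0.5},
               mu={1: 0.5, 2: 0.5},
               gamma=lambda i, t: 0.0,
               ps0={1: 0.0, 2: 1.0},
               T=2)
```

On this two-node tree the result can be checked by hand: node 2 escapes infection up to t = 2 with probability 0.5 × (0.5 + 0.5 × 0.5) = 0.375, which is the value the iteration returns for `out["PS"][2][2]`.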

DMP equations in layer b

As for the cascade process in layer b, whether node i will turn into state F (fail) also depends on the state in layer a, making it more challenging to derive the corresponding DMP equations. The key to obtaining node-level DMP equations for ${P}_{F}^{i\to j}(t)$ in Eq. (5) (and the corresponding marginal probability ${P}_{F}^{i}(t)$) is to introduce several intermediate quantities to facilitate the calculation; the details are outlined in Supplementary Note 1.

To summarize, the node-level failure probability ${P}_{F}^{i}(t)$ can be decomposed as

$${P}_{F}^{i}(t)={P}_{I}^{i}(t)+{P}_{R}^{i}(t)+{P}_{SF}^{i}(t)+{P}_{PF}^{i}(t),$$

(15)

where ${P}_{SF}^{i}(t)$ and ${P}_{PF}^{i}(t)$ are the probabilities that node i is in state F in layer b, while it is in state S or state P in layer a, respectively. For these two cases, the failure of node i is triggered by the failure propagation of its neighbors from layer b. A similar relation holds for the cavity probability ${P}_{F}^{i\to j}(t)$.

The probability ${P}_{SF}^{i}(t)$ admits the following iteration

$${P}_{SF}^{i}(t)= \, {P}_{S}^{i}(0){\prod }_{{t}^{{\prime} }=0}^{t-1}\left[1-{\gamma }_{i}({t}^{{\prime} })\right]{\prod}_{k\in {\partial }_{i}^{a}\backslash {\partial }_{i}^{a}\cap {\partial }_{i}^{b}}{\theta }^{k\to i}(t)\\ \times {\sum}_{{\{{x}_{k}\}}_{k\in {\partial }_{i}^{b}}}{\mathbb{I}}\left({\sum}_{k\in {\partial }_{i}^{b}}{x}_{k}{b}_{ki}\ge {\Theta }_{i}\right)\\ \times {\prod}_{\begin{array}{c}k\in {\partial }_{i}^{b}\backslash {\partial }_{i}^{a}\cap {\partial }_{i}^{b},\\ {x}_{k}=1\end{array}}{P}_{F}^{k\to i}(t-1){\prod}_{\begin{array}{c}k\in {\partial }_{i}^{b}\backslash {\partial }_{i}^{a}\cap {\partial }_{i}^{b},\\ {x}_{k}=0\end{array}}\left[1-{P}_{F}^{k\to i}(t-1)\right]\\ \times {\prod}_{\begin{array}{c}k\in {\partial }_{i}^{a}\cap {\partial }_{i}^{b},\\ {x}_{k}=1\end{array}}{\chi }^{k\to i}(t){\prod}_{\begin{array}{c}k\in {\partial }_{i}^{a}\cap {\partial }_{i}^{b},\\ {x}_{k}=0\end{array}}\left[{\theta }^{k\to i}(t)-{\chi }^{k\to i}(t)\right],$$

(16)

where χ^k→i(t) is the cavity probability that node k is in state F at time t − 1, and it has not sent the infection signal to node i up to time t.

The cavity probability χ^k→i(t) can be decomposed into

$${\chi }^{k\to i}(t)={\psi }^{k\to i}(t)+{P}_{SF}^{k\to i}(t-1)+{P}_{PF}^{k\to i}(t-1),$$

(17)

where ψ^k→i(t) is the cavity probability that node k is in state I or R at time t − 1, but has not transmitted the infection signal to node i up to time t. The cavity probability ψ^k→i(t) can be computed as

$${\psi }^{k\to i}(t)= {\psi }^{k\to i}(t-1)-{\beta }_{ki}{\phi }^{k\to i}(t-1)\\ +\left[1-{\gamma }_{k}(t-2)\right]{P}_{S}^{k\to i}(t-2)-{P}_{S}^{k\to i}(t-1).$$

(18)

Similarly, the probability ${P}_{PF}^{i}(t)$ admits the following iteration

$${P}_{PF}^{i}(t)= \, {P}_{S}^{i}(0){\sum }_{\varepsilon =1}^{t}{\gamma }_{i}(\varepsilon -1){\prod }_{{t}^{{\prime} }=0}^{\varepsilon -2}\left[1-{\gamma }_{i}({t}^{{\prime} })\right]\\ \times {\prod}_{k\in {\partial }_{i}^{a}\backslash {\partial }_{i}^{a}\cap {\partial }_{i}^{b}}{\theta }^{k\to i}(\varepsilon -1){\sum}_{{\{{x}_{k}\}}_{k\in {\partial }_{i}^{b}}}{\mathbb{I}}\left({\sum}_{k\in {\partial }_{i}^{b}}{x}_{k}{b}_{ki}\ge {\Theta }_{i}\right)\\ \times {\prod}_{\begin{array}{c}k\in {\partial }_{i}^{b}\backslash {\partial }_{i}^{a}\cap {\partial }_{i}^{b},\\ {x}_{k}=1\end{array}}{P}_{F}^{k\to i}(t-1){\prod}_{\begin{array}{c}k\in {\partial }_{i}^{b}\backslash {\partial }_{i}^{a}\cap {\partial }_{i}^{b},\\ {x}_{k}=0\end{array}}\left[1-{P}_{F}^{k\to i}(t-1)\right]\\ \times {\prod}_{\begin{array}{c}k\in {\partial }_{i}^{a}\cap {\partial }_{i}^{b},\\ {x}_{k}=1\end{array}}{\tilde{\chi }}^{k\to i}(t,\varepsilon ){\prod}_{\begin{array}{c}k\in {\partial }_{i}^{a}\cap {\partial }_{i}^{b},\\ {x}_{k}=0\end{array}}\left[{\theta }^{k\to i}(\varepsilon -1)-{\tilde{\chi }}^{k\to i}(t,\varepsilon )\right],$$

(19)

where the dummy variable ε indicates the time at which node i receives the protection signal.

In Eq. (19), ${\tilde{\chi }}^{k\to i}(t,\varepsilon )$ is the cavity probability that node k is in state F at time t − 1, but has not transmitted the infection signal to node i up to time ε. It can be decomposed into

$${\tilde{\chi }}^{k\to i}(t,\varepsilon )={\tilde{\psi }}^{k\to i}(t,\varepsilon )+{P}_{SF}^{k\to i}(t-1)+{P}_{PF}^{k\to i}(t-1),$$

(20)

where ${\tilde{\psi }}^{k\to i}(t,\varepsilon )$ is the cavity probability that node k is in state I or R at time t − 1, but has not transmitted the infection signal to node i up to time ε − 1. The cavity probability ${\tilde{\psi }}^{k\to i}(t,\varepsilon )$ can be computed as

$${\tilde{\psi }}^{k\to i}(t,\varepsilon )= {\psi }^{k\to i}(\varepsilon -1)+{P}_{I}^{k\to i}(t-1)+{P}_{R}^{k\to i}(t-1)\\ -\left[{P}_{I}^{k\to i}(\varepsilon -2)+{P}_{R}^{k\to i}(\varepsilon -2)\right].$$

(21)

Note that the cavity probabilities ${P}_{SF}^{i\to j}(t)$ and ${P}_{PF}^{i\to j}(t)$ are computed using formulas similar to Eqs. (16) and (19), but in the cavity graph where node j is removed. This closes the loop for the DMP equations in layer b. We also observe in the above equations that the node-level messages for the SIRP process only enter into the DMP equations for the LTM process through the overlapping neighbors ${\partial }_{i}^{a}\cap {\partial }_{i}^{b}$.

The initial conditions for the corresponding messages are given by

$${P}_{F}^{k}(0)={P}_{F}^{k\to i}(0)={P}_{I}^{k}(0),$$

(22)

$${P}_{SF}^{k}(0)={P}_{SF}^{k\to i}(0)=0,$$

(23)

$${P}_{PF}^{k}(0)={P}_{PF}^{k\to i}(0)=0,$$

(24)

$${\psi }^{k\to i}(1)={\chi }^{k\to i}(1)=(1-{\beta }_{ki}){P}_{I}^{k}(0),$$

(25)

$${\tilde{\psi }}^{k\to i}(1,1)={\tilde{\chi }}^{k\to i}(1,1)={P}_{I}^{k}(0).$$

(26)

For t ≥ 2, ε = 1, we have

$${\tilde{\psi }}^{k\to i}(t,\varepsilon =1)={P}_{I}^{k\to i}(t-1)+{P}_{R}^{k\to i}(t-1),$$

(27)

$${\tilde{\chi }}^{k\to i}(t,\varepsilon =1)={P}_{I}^{k\to i}(t-1)+{P}_{R}^{k\to i}(t-1)+{P}_{SF}^{k\to i}(t-1)+{P}_{PF}^{k\to i}(t-1).$$

(28)

We remark that for a total time T, the computational complexity for obtaining the messages of the cascade process in layer b is O(∣E_b∣T²) where ∣E_b∣ denotes the number of edges in layer b, unlike the O(∣E_a∣T) complexity for the SIRP process in layer a. This is due to the dependency of layer b on layer a, as well as the protection mechanism in layer a. The summation over the dummy states ${\{{x}_{k}\}}_{k\in {\partial }_{i}^{b}}$ in Eq. (16) and Eq. (19) also implies a high computational demand for networks with high-degree nodes. One way to alleviate this complexity is to use the dynamic programming techniques introduced by Torrisi et al.⁴⁰.
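To illustrate why dynamic programming helps, the sketch below compares a brute-force enumeration over the dummy states {x_k} with a convolution-style recursion over the accumulated weighted sum. It is a generic illustration, not necessarily the scheme of Torrisi et al.⁴⁰. It assumes that each neighbor k contributes a factor p_k (failed) or 1 − p_k (not failed) and that the influence weights are non-negative integers; the function names are hypothetical. The same recursion would apply with other per-neighbor factor pairs, such as those appearing for overlapping neighbors in Eq. (16).

```python
from itertools import product


def tail_prob_bruteforce(p, w, threshold):
    """Sum over all 2^d configurations x of I(sum_k x_k w_k >= threshold) * prod_k factor."""
    total = 0.0
    for x in product((0, 1), repeat=len(p)):
        if sum(xk * wk for xk, wk in zip(x, w)) >= threshold:
            prob = 1.0
            for xk, pk in zip(x, p):
                prob *= pk if xk else 1.0 - pk
            total += prob
    return total


def tail_prob_dp(p, w, threshold):
    """Same quantity via dynamic programming (assumes non-negative integer weights).

    dist[s] holds the probability that the weighted sum over the neighbours
    processed so far equals s; each neighbour is folded in with one pass.
    """
    dist = [1.0]
    for pk, wk in zip(p, w):
        new = [0.0] * (len(dist) + wk)
        for s, q in enumerate(dist):
            new[s] += q * (1.0 - pk)   # neighbour k has not failed
            new[s + wk] += q * pk      # neighbour k has failed, adds weight w_k
        dist = new
    return sum(q for s, q in enumerate(dist) if s >= threshold)


p = [0.2, 0.5, 0.9]          # failure probabilities of three neighbours
w = [1, 1, 1]                # unit influence weights
brute = tail_prob_bruteforce(p, w, threshold=2)
fast = tail_prob_dp(p, w, threshold=2)
```

The brute-force version costs O(2^d) for a node of degree d, whereas the recursion costs O(d·W) with W the total weight, which is the kind of saving that matters for high-degree nodes.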

These DMP equations are exact if both layers are tree networks, while they are approximate solutions when there are loops in the underlying networks.

Simplification under small inter-layer overlap

If there are no overlaps between the neighbors of node i in layer a and those in layer b, i.e., ${\partial }_{i}^{a}\cap {\partial }_{i}^{b}=\varnothing $, the messages ${\chi }^{k\to i},{\psi }^{k\to i},{\tilde{\chi }}^{k\to i}$ and ${\tilde{\psi }}^{k\to i}$ are not needed, and the node-level probabilities ${P}_{SF}^{i}(t)$ and ${P}_{PF}^{i}(t)$ simplify considerably to

$${P}_{SF}^{i}(t)= \, {P}_{S}^{i}(t){\sum}_{{\{{x}_{k}\}}_{k\in {\partial }_{i}^{b}}}{\mathbb{I}}\left({\sum}_{k\in {\partial }_{i}^{b}}{x}_{k}{b}_{ki} \, \ge \, {\Theta }_{i}\right)\\ \times {\prod}_{k\in {\partial }_{i}^{b},{x}_{k}=1}{P}_{F}^{k\to i}(t-1){\prod}_{k\in {\partial }_{i}^{b},{x}_{k}=0}\left[1-{P}_{F}^{k\to i}(t-1)\right],$$

(29)

$${P}_{PF}^{i}(t)= \, {P}_{P}^{i}(t){\sum}_{{\{{x}_{k}\}}_{k\in {\partial }_{i}^{b}}}{\mathbb{I}}\left({\sum}_{k\in {\partial }_{i}^{b}}{x}_{k}{b}_{ki} \, \ge \, {\Theta }_{i}\right)\\ \times {\prod}_{k\in {\partial }_{i}^{b},{x}_{k}=1}{P}_{F}^{k\to i}(t-1){\prod}_{k\in {\partial }_{i}^{b},{x}_{k}=0}\left[1-{P}_{F}^{k\to i}(t-1)\right].$$

(30)

This is also a reasonable approximation if the two layers a and b have little correlation, which has been exploited by previous work³⁷. We remark that the computational complexity of obtaining messages for the cascade process in layer b using this approximate method is O(∣E_b∣T). In this work, we will employ this approximation when we consider the dynamics in the large time limit and devise an optimization algorithm for mitigating the cascading failures, in order to reduce computing time. In situations where inter-layer overlaps are significant and accuracy is important^41,42, one can always use the complete formulations of the DMP equations as detailed in the “Full Node-level DMP Equations” subsection above.
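A minimal sketch of the simplified layer-b iteration, combining Eq. (15) with Eqs. (29) and (30), is given below. It takes the layer-a marginals as given inputs and is only an illustration under stated assumptions: the names `threshold_prob` and `dmp_ltm_no_overlap` are hypothetical, the cavity message P_F^{k→i} is assumed to follow the same formula with node i removed from the layer-b neighborhood of k, and, because no layer-a neighbor is shared, the layer-a cavity probabilities are replaced by the marginals.

```python
from itertools import product


def threshold_prob(p, w, threshold):
    """P(sum_k x_k w_k >= threshold) for independent x_k ~ Bernoulli(p_k), by enumeration."""
    total = 0.0
    for x in product((0, 1), repeat=len(p)):
        if sum(xk * wk for xk, wk in zip(x, w)) >= threshold:
            prob = 1.0
            for xk, pk in zip(x, p):
                prob *= pk if xk else 1.0 - pk
            total += prob
    return total


def dmp_ltm_no_overlap(adj_b, b, Theta, PS, PI, PR, PP, T):
    """Failure probabilities P_F^i(t) in layer b, assuming no inter-layer overlap.

    adj_b : dict node -> list of layer-b neighbours
    b     : dict (k, i) -> influence of the failure of k on i
    Theta : dict node -> threshold
    PS, PI, PR, PP : layer-a marginals, indexed as PS[t][i]
    """
    nodes = list(adj_b)
    edges = [(k, i) for i in nodes for k in adj_b[i]]
    pf_cav = {(k, i): PI[0][k] for (k, i) in edges}   # Eq. (22)
    PF = [{i: PI[0][i] for i in nodes}]

    for t in range(1, T + 1):
        new_cav = {}
        for (k, i) in edges:
            nb = [l for l in adj_b[k] if l != i]      # cavity: node i removed
            q = threshold_prob([pf_cav[(l, k)] for l in nb],
                               [b[(l, k)] for l in nb], Theta[k])
            new_cav[(k, i)] = PI[t][k] + PR[t][k] + (PS[t][k] + PP[t][k]) * q
        marg = {}
        for i in nodes:
            q = threshold_prob([pf_cav[(l, i)] for l in adj_b[i]],
                               [b[(l, i)] for l in adj_b[i]], Theta[i])
            # Eq. (15) with P_SF = P_S * q (Eq. 29) and P_PF = P_P * q (Eq. 30)
            marg[i] = PI[t][i] + PR[t][i] + (PS[t][i] + PP[t][i]) * q
        PF.append(marg)
        pf_cav = new_cav
    return PF


# Toy check: layer-b chain 1-2-3, node 1 is infected with certainty and nodes 2, 3
# are isolated in layer a (so their layer-a marginals stay constant in time).
steps = 3
PI = [{1: 1.0, 2: 0.0, 3: 0.0} for _ in range(steps)]
PS = [{1: 0.0, 2: 1.0, 3: 1.0} for _ in range(steps)]
PR = [{1: 0.0, 2: 0.0, 3: 0.0} for _ in range(steps)]
PP = [{1: 0.0, 2: 0.0, 3: 0.0} for _ in range(steps)]
adj_b = {1: [2], 2: [1, 3], 3: [2]}
b = {(1, 2): 1.0, (2, 1): 1.0, (2, 3): 1.0, (3, 2): 1.0}
PF = dmp_ltm_no_overlap(adj_b, b, {1: 1.0, 2: 1.0, 3: 1.0}, PS, PI, PR, PP, T=2)
```

In the toy check the failure travels one hop per step along the chain, so node 2 fails at t = 1 and node 3 at t = 2 even though neither is ever infected.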

Results

Effectiveness of the DMP method

We first test the efficacy of the complete DMP equations derived in the “Full Node-level DMP Equations” subsection of the Methods section, by comparing the node-level probabilities ${P}_{S}^{i}(t)$ and ${P}_{F}^{i}(t)$ to those obtained by Monte Carlo simulations. The DMP theory produces exact marginal probabilities for node activities in tree networks; this is verified in Fig. 2a, b where both layers a and b are the same binary tree network of size N = 63. For random regular graphs (RRG) where there are many loops, the DMP method also yields reasonably accurate solutions; this is demonstrated in Fig. 2c, d where both layers a and b are the same RRG of size N = 100 and degree K = 5. We also validate the effectiveness of the non-overlapping approximation applied to the DMP equations for the process in layer b introduced in the subsection “Simplification under Small Inter-layer Overlap” in Methods; the results are shown in Supplementary Note 2.

**Fig. 2: Comparison of node-level probabilities.**

Impact of infection-induced cascades

The obtained DMP equations of the two-layer spreading processes allow us to examine the impact of the infection-induced cascading failures, on either a specific instance of a multiplex network or an ensemble of networks following a certain degree distribution. In this section, we do not consider the protection of nodes and set γ_i(t) = 0, in which case the process in layer a is essentially a discrete-time SIR model.

Impact on a specific network

For the process in layer a, we define the outbreak size at time t as the fraction of nodes that have been infected at that time

$${\rho }_{I}(t)+{\rho }_{R}(t)=\frac{1}{N}{\sum}_{i\in {V}_{a}}{P}_{I}^{i}(t)+\frac{1}{N}{\sum}_{i\in {V}_{a}}{P}_{R}^{i}(t).$$

(31)

For the process in layer b, we define the cascade size at time t as the fraction of nodes that have failed at that time

$${\rho }_{F}(t)=\frac{1}{N}{\sum}_{i\in {V}_{b}}{P}_{F}^{i}(t).$$

(32)

By definition, we have ρ_F(t) ≥ ρ_I(t) + ρ_R(t).
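As a minimal illustration of Eqs. (31) and (32), the sketch below averages node-level marginals into the outbreak size and the cascade size. The marginal values are made up for illustration and are not results from this study; the function names are hypothetical.

```python
def outbreak_size(p_infected, p_recovered):
    """Eq. (31): rho_I(t) + rho_R(t), averaged over the N nodes of layer a."""
    n = len(p_infected)
    return sum(p_infected) / n + sum(p_recovered) / n


def cascade_size(p_failed):
    """Eq. (32): rho_F(t), averaged over the N nodes of layer b."""
    return sum(p_failed) / len(p_failed)


# Hypothetical marginals for N = 4 nodes at some time t (illustrative values only).
P_I = [0.10, 0.00, 0.30, 0.05]
P_R = [0.20, 0.10, 0.40, 0.05]
# An infection in layer a triggers the failure of the same node in layer b,
# so each P_F[i] is chosen to be at least P_I[i] + P_R[i].
P_F = [0.50, 0.10, 0.90, 0.30]

rho_a = outbreak_size(P_I, P_R)
rho_b = cascade_size(P_F)
```

With these numbers the cascade size exceeds the outbreak size, in line with the inequality above.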

In Fig. 3, we show the time evolution of the infection outbreak size and the cascade size in a multiplex network where both layers are random regular graphs of size N = 1600. It can be observed that ρ_F is much larger than ρ_I + ρ_R asymptotically, which suggests that the failure propagation mechanism in layer b significantly amplifies the impact of the infection outbreaks in layer a. In particular, the failure can eventually propagate to the whole network even though less than 70% of the population is infected when the spread of the infection saturates. Compared to Monte Carlo simulations, the DMP method systematically overestimates the outbreak sizes due to the effect of mutual infection, but it has been shown to offer a significant improvement over the individual-based mean-field method^25,26,43.

**Fig. 3: Evolution of the sizes of the infection outbreak in layer a and total failures in layer b.**

Asymptotic properties

In the above example, the system converges to a steady state in the large time limit. The DMP approach allows us to systematically investigate the asymptotic behavior of the two-layer spreading processes.

For the process in layer a, we define an auxiliary probability

$${p}_{ij}:= \frac{{\beta }_{ij}}{{\beta }_{ij}+{\mu }_{i}-{\beta }_{ij}{\mu }_{i}}.$$

(33)

Then the messages in layer a admit the following expressions in the limit T → ∞

$$\begin{aligned}{\phi }^{i\to j}(\infty )&=0,\\ {\theta }^{i\to j}(\infty )&=1-{p}_{ij}+{p}_{ij}{P}_{S}^{i\to j}(\infty ),\\ {P}_{S}^{i\to j}(\infty )&={P}_{S}^{i}(0)\prod_{k\in {\partial }_{i}^{a}\backslash j}{\theta }^{k\to i}(\infty ),\\ {P}_{S}^{i}(\infty )&={P}_{S}^{i}(0)\prod_{k\in {\partial }_{i}^{a}}{\theta }^{k\to i}(\infty ).\end{aligned}$$

(34)

Details of the derivation can be found in Supplementary Note 3. The above asymptotic equations (34) suggest a well-known relationship between epidemic spreading and bond percolation^19,22,44. The bond percolation problem involves a network where the bonds (or edges) between nodes are randomly occupied with a certain probability (denoted as λ). The main focus is to understand the formation of a giant cluster comprising connected occupied edges in the network; in large systems, this typically occurs when λ is greater than a transition point λ_c⁴⁵.

As mentioned above, it is well established that the asymptotic properties of many stochastic epidemic spreading models can be mapped to certain bond percolation problems^44,46; we refer interested readers to two recent reviews for more details on the subject^19,45. In the SIR model studied here (where γ_i(t) = 0), the quantity p_ij defined in Eq. (33) can be interpreted as the probability that an infection transmission on edge (i, j) has been realized in the long run, corresponding to an edge occupation probability in bond percolation. When the transmission probabilities {β_ij} are large ({p_ij} will also be large), a few initially infected seeds can eventually infect a significant proportion of the population and lead to a pandemic, which corresponds to the formation of a giant cluster in percolation theory. We refer readers to Supplementary Note 3 for more details of the correspondence between our model and bond percolation. Note that the edge occupation probability p_ij in this discrete-time SIR model differs from the continuous-time counterpart^22,44 with an additional term β_ijμ_i in the denominator. The term β_ijμ_i accounts for the simultaneous events that node i infects node j and recovers within the same time step²⁵.
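The asymptotic equations (33) and (34) can be iterated to a fixed point on a given graph. The following is a minimal sketch under simplifying assumptions that are ours: homogeneous β and μ, a toy three-node path graph, and hypothetical function names. On a tree the result can be checked by hand, since a node at distance d from the single seed stays susceptible with probability 1 − p^d.

```python
def transmission_prob(beta_ij, mu_i):
    """Eq. (33): long-run probability that infection is transmitted along edge (i, j)."""
    return beta_ij / (beta_ij + mu_i - beta_ij * mu_i)


def asymptotic_layer_a(neighbors, ps0, beta, mu, tol=1e-12, max_iter=10_000):
    """Iterate Eq. (34) to a fixed point; returns P_S^i(infinity) for every node i."""
    p = transmission_prob(beta, mu)
    # theta[(k, i)] is the message theta^{k -> i}(infinity); start from 1.
    theta = {(k, i): 1.0 for i in neighbors for k in neighbors[i]}
    for _ in range(max_iter):
        new_theta = {}
        for (i, j) in theta:
            # Cavity probability P_S^{i -> j}: product over neighbors of i except j.
            ps_cavity = ps0[i]
            for k in neighbors[i]:
                if k != j:
                    ps_cavity *= theta[(k, i)]
            new_theta[(i, j)] = 1.0 - p + p * ps_cavity
        delta = max(abs(new_theta[e] - theta[e]) for e in theta)
        theta = new_theta
        if delta < tol:
            break
    ps_inf = {}
    for i in neighbors:
        prod = ps0[i]
        for k in neighbors[i]:
            prod *= theta[(k, i)]
        ps_inf[i] = prod
    return ps_inf


# Toy instance (an assumption for illustration): path 0 - 1 - 2 with node 0 as the seed.
neighbors = {0: [1], 1: [0, 2], 2: [1]}
ps0 = {0: 0.0, 1: 1.0, 2: 1.0}
beta, mu = 0.3, 0.5
p = transmission_prob(beta, mu)
ps_inf = asymptotic_layer_a(neighbors, ps0, beta, mu)
```

Here `ps_inf[1]` equals 1 − p and `ps_inf[2]` equals 1 − p², as expected for a path where the infection must cross one or two edges.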

For the process in layer b, we assume that layers a and b are weakly correlated due to their different topologies and adopt the approximation made in the subsection “Simplification under Small Inter-layer Overlap” in Methods. As no protection is applied, we have ${P}_{PF}^{i}(t)=0$. Then the messages in layer b admit the following expression in the limit T → ∞

$$\begin{aligned}{P}_{F}^{i\to j}(\infty )=\;&1-{P}_{S}^{i}(\infty )\\ &+{P}_{S}^{i}(\infty )\sum_{{\{{x}_{k}\}}_{k\in {\partial }_{i}^{b}\backslash j}}{\mathbb{I}}\left(\sum_{k\in {\partial }_{i}^{b}\backslash j}{x}_{k}{b}_{ki}\ge {\Theta }_{i}\right)\\ &\times \prod_{k\in {\partial }_{i}^{b}\backslash j,\,{x}_{k}=1}{P}_{F}^{k\to i}(\infty )\prod_{k\in {\partial }_{i}^{b}\backslash j,\,{x}_{k}=0}\left[1-{P}_{F}^{k\to i}(\infty )\right],\end{aligned}$$

(35)

where a similar expression holds for ${P}_{F}^{i}(\infty )$ by replacing ${\partial }_{i}^{b}\backslash j$ with ${\partial }_{i}^{b}$ in Eq. (35). The asymptotic equations for layer b suggest a relationship between the LTM model and bootstrap percolation³⁸.
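To make the structure of Eq. (35) concrete, the sketch below evaluates its right-hand side for a single message by brute-force enumeration of the neighbor states {x_k}. The function names and the numerical inputs are illustrative assumptions. The enumeration has 2^d terms for d neighbors, so it is only meant for small degrees.

```python
from itertools import product


def threshold_exceeded_prob(fail_probs, weights, threshold):
    """Sum over neighbor states {x_k} of I(sum_k x_k * b_ki >= Theta_i) times their probability."""
    total = 0.0
    for states in product((0, 1), repeat=len(fail_probs)):
        load = sum(x * b for x, b in zip(states, weights))
        if load >= threshold:
            prob = 1.0
            for x, q in zip(states, fail_probs):
                prob *= q if x == 1 else 1.0 - q
            total += prob
    return total


def failure_message(ps_inf, fail_probs, weights, threshold):
    """Right-hand side of Eq. (35) for one message P_F^{i -> j}(infinity)."""
    return 1.0 - ps_inf + ps_inf * threshold_exceeded_prob(fail_probs, weights, threshold)


# Illustrative numbers (assumptions): node i has three neighbors other than j,
# unit influence weights b_ki = 1 and threshold Theta_i = 2.
msg = failure_message(ps_inf=0.8, fail_probs=[0.5, 0.5, 0.5], weights=[1, 1, 1], threshold=2)
```

In this example at least two of three neighbors fail with probability 0.5, so the message evaluates to 0.2 + 0.8 × 0.5 = 0.6.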

Two-layer percolation in large homogeneous networks

The large-time behaviors of the two processes correspond to two types of percolation problems. To further examine the macroscopic critical behaviors of the two-layer percolation models, it is convenient to consider large-size random regular graphs of degree K (which have a homogeneous network topology), and homogeneous system parameters with β_ji = β, μ_i = μ, b_ji = b, Θ_i = Θ. We further assume that each node i has a vanishingly small probability of being infected at time t = 0 with ${P}_{I}^{i}(0)=1-{P}_{S}^{i}(0)\propto 1/N$. In the large size limit N → ∞, we have ${P}_{S}^{i}(0)\to 1$.

Due to the homogeneity of the system, one can assume that all messages and marginal probabilities are identical,

$${\theta }^{i\to j}(\infty )={\theta }^{\infty },$$

(36)

$${P}_{F}^{i\to j}(\infty )={P}_{F}^{\infty },$$

(37)

$${P}_{S}^{i}(\infty )={\rho }_{S}^{\infty },$$

(38)

$${P}_{F}^{i}(\infty )={\rho }_{F}^{\infty }.$$

(39)

This leads to the self-consistent equations in the large size limit (N → ∞),

$${\theta }^{\infty }=1-p+p\cdot {({\theta }^{\infty })}^{K-1},$$

(40)

$${\rho }_{S}^{\infty }={({\theta }^{\infty })}^{K},$$

(41)

$$\begin{aligned}{P}_{F}^{\infty }=\;&1-{\rho }_{S}^{\infty }\\ &+{\rho }_{S}^{\infty }\sum_{n=\lceil \Theta \rceil }^{K-1}\binom{K-1}{n}{({P}_{F}^{\infty })}^{n}{(1-{P}_{F}^{\infty })}^{K-1-n},\end{aligned}$$

(42)

$$\begin{aligned}{\rho }_{F}^{\infty }=\;&1-{\rho }_{S}^{\infty }\\ &+{\rho }_{S}^{\infty }\sum_{n=\lceil \Theta \rceil }^{K}\binom{K}{n}{({P}_{F}^{\infty })}^{n}{(1-{P}_{F}^{\infty })}^{K-n},\end{aligned}$$

(43)

where $p=\frac{\beta }{\beta +\mu -\beta \mu }$ and ⌈x⌉ is the smallest integer greater than or equal to x.
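Equations (40)–(43) can be solved numerically by simple fixed-point iteration. The sketch below is one possible way to do so, not necessarily the procedure used for the figures: it assumes b = 1 (so that the threshold enters only through ⌈Θ⌉), starts the iteration of Eq. (40) from θ = 0 so that it does not sit on the fixed point θ = 1, and starts Eq. (42) from no failures. The function name `solve_rrg` is hypothetical.

```python
from math import ceil, comb


def solve_rrg(beta, mu, K, Theta, tol=1e-13, max_iter=1_000_000):
    """Fixed-point iteration of Eqs. (40)-(43) for a random regular graph of degree K (b = 1)."""
    p = beta / (beta + mu - beta * mu)

    # Eq. (40): iterate upwards from theta = 0 towards the smallest fixed point.
    theta = 0.0
    for _ in range(max_iter):
        new = 1.0 - p + p * theta ** (K - 1)
        if abs(new - theta) < tol:
            theta = new
            break
        theta = new
    rho_s = theta ** K  # Eq. (41)

    def tail(q, m):
        """Probability that at least ceil(Theta) of m independent neighbors have failed."""
        return sum(comb(m, n) * q ** n * (1.0 - q) ** (m - n)
                   for n in range(ceil(Theta), m + 1))

    # Eq. (42): start from no failures and iterate upwards.
    pf = 0.0
    for _ in range(max_iter):
        new = 1.0 - rho_s + rho_s * tail(pf, K - 1)
        if abs(new - pf) < tol:
            pf = new
            break
        pf = new
    rho_f = 1.0 - rho_s + rho_s * tail(pf, K)  # Eq. (43)
    return 1.0 - rho_s, rho_f


# Example sweep over a few infection probabilities (parameter values are illustrative).
K, mu, Theta = 5, 0.5, 3
for beta in (0.10, 0.15, 0.17):
    infected, failed = solve_rrg(beta, mu, K, Theta)
    print(f"beta={beta:.2f}  outbreak={infected:.3f}  cascade={failed:.3f}")
```

The function returns the pair (1 − ρ_S^∞, ρ_F^∞), i.e., the asymptotic outbreak size in layer a and the cascade size in layer b. Convergence slows down close to the transition points, which is why a generous iteration cap is used.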

We observe that ${\theta }^{\infty }=1,{\rho }_{S}^{\infty }=1,{P}_{F}^{\infty }=0,{\rho }_{F}^{\infty }=0$ is always a fixed point of Eqs. (40)–(43), which corresponds to vanishing outbreak sizes. When the infection probability β is larger than a critical point ${\beta }_{c}^{a}$, this fixed-point solution becomes unstable and another fixed point with finite outbreak sizes develops.

As a concrete example, we consider random regular graphs of degree K = 5 and fix μ = 0.5, b = 1, Θ = 3. By solving Eqs. (40)–(43) for different β, we obtain outbreak sizes for both layers a and b under different infection strengths. The result is shown in Fig. 4, where the asymptotic theory accurately predicts the behavior of a large-size system (N = 1600) in the large-time limit. It is also observed that the outbreak sizes in both layers become non-zero when β is larger than a critical point ${\beta }_{c}^{a}=\frac{1}{7}$. Furthermore, the outbreak size ${\rho }_{F}^{\infty }$ in layer b exhibits a discontinuous jump to a complete breakdown (${\rho }_{F}^{\infty }=1$) when β increases and surpasses another transition point ${\beta }_{c}^{b} \, \approx \, 0.159$. However, at the transition point ${\beta }_{c}^{b}$, only about 28.6% of the population has been infected in layer a.
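The value ${\beta }_{c}^{a}=\frac{1}{7}$ can be checked with a short linear-stability calculation, sketched here under the assumption that the threshold is set by the instability of the trivial fixed point of Eq. (40). Writing ${\theta }^{\infty }=1-\delta$ with small δ, Eq. (40) gives

$$\delta =p\left[1-{(1-\delta )}^{K-1}\right]\approx p(K-1)\delta ,$$

so the trivial fixed point becomes unstable when p(K − 1) > 1, i.e., at ${p}_{c}=\frac{1}{K-1}$. Inverting the definition of p then yields, for K = 5 and μ = 0.5,

$${\beta }_{c}^{a}=\frac{{p}_{c}\,\mu }{1-{p}_{c}(1-\mu )}=\frac{\frac{1}{4}\cdot \frac{1}{2}}{1-\frac{1}{4}\cdot \frac{1}{2}}=\frac{1}{7}.$$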

**Fig. 4: Size of infection outbreak and total failures as a function of the infection probability β.**

This example again indicates that the cascading failure propagation in layer b can drastically amplify the impact of the epidemic outbreaks in layer a. Lastly, we remark that whether layer b will exhibit a discontinuous transition or not depends on the values of K and Θ³⁸, as predicted by the bootstrap percolation theory⁴⁷.

Mitigation of Infection-induced Cascades

The optimization framework

The catastrophic breakdown can be mitigated if timely protections are provided to stop the infection’s spread. In our model, this is implemented by assigning a non-zero protection probability γ_i(t) to node i, after which it is immune from infection from layer a. To minimize the size of final failures, it would be more effective to take into account the spreading processes in both layers a and b when deciding which nodes to prioritize for protection.

Here, we develop mitigation strategies by solving the following constrained optimization problem

$$\min_{\gamma }\ \mathcal{O}(\gamma ):= {\rho }_{F}(T)=\frac{1}{N}\sum_{i\in {V}_{b}}{P}_{F}^{i}(T),$$

(44)

$$\text{s.t.}\quad 0\le {\gamma }_{i}(t)\le 1\quad \forall i,t,$$

(45)

$$\sum_{i\in {V}_{b}}\sum_{t=0}^{T-1}{\gamma }_{i}(t)\le {\gamma }^{\mathrm{tot}},$$

(46)

where the constraint in Eq. (45) ensures that γ_i(t) is a probability, and Eq. (46) represents the global budget constraint on the protection resources. As the objective function $\mathcal{O}(\gamma )$ (the size of final failures) depends on the evolution of the two-layer spreading processes, the optimization problem is challenging. Lokhov and Saad introduced the optimal control framework to tackle similar problems, by estimating the marginal probabilities of individuals with the DMP methods²⁸. The success of the optimal control approach highlights another advantage of the theoretical methods over numerical simulations^14,28,48.

In this work, we adopt a similar strategy to solve the optimization problem defined in Eqs. (44)–(46), where ${P}_{F}^{i}(T)$ is estimated by the DMP equations derived in Methods. As the DMP equations are given explicitly and involve only elementary arithmetic operations, we leverage automatic differentiation tools to compute the gradient of the objective function ${\nabla }_{\gamma }\mathcal{O}(\gamma )$ in a back-propagation fashion⁴⁹. This allows us to derive gradient-based algorithms for solving the optimization problem. We remark that such a back-propagation algorithm is equivalent to optimal control with gradient descent updates on the control parameters⁵⁰. To save computing time, we adopt the approximation made in the subsection “Simplification under Small Inter-layer Overlap” in Methods for conducting the optimization, but we always use the full DMP formulation developed in the subsection “Full Node-level DMP Equations” in Methods for the evaluation of the outcomes. This approach is particularly suitable for networks with little inter-layer overlap. In scenarios where significant inter-layer overlaps exist and precision is crucial, it is always possible to resort to the complete version of the DMP equations.

To handle the box constraint in Eq. (45), we adopt the mirror descent method, which performs the gradient-based update in the dual (or mirror) space rather than the primal space where {γ_i(t)} live^51,52. In our case, we use the logit function $\Psi (x)=\log (\frac{x}{1-x})$ to map the primal control variable γ_i(t) to the dual space as ${h}_{i}(t)=\Psi ({\gamma }_{i}(t))\in {\mathbb{R}}$, where the gradient descent updates are performed. The primal variable can be recovered through the inverse mapping of Ψ( ⋅ ), which is ${\Psi }^{-1}(h)=\frac{1}{1+\exp (-h)}$. The elementary mirror descent update step is

$${g}^{n}\leftarrow {\nabla }_{\gamma }\mathcal{O}({\gamma }^{n}),$$

(47)

$${\gamma }^{n+1}\leftarrow {\Psi }^{-1}\left(\Psi ({\gamma }^{n})-s{g}^{n}\right),$$

(48)

where n is an index keeping track of the optimization process and s is the step size of the gradient update.
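The update in Eqs. (47) and (48) can be sketched in a few lines. In the example below the objective is a toy quadratic with a closed-form gradient, which is our stand-in for ρ_F(T) and its automatically differentiated gradient; the function names, target values and step size are likewise illustrative assumptions.

```python
from math import exp, log


def logit(x):
    """Mirror map Psi(x) = log(x / (1 - x)), sending (0, 1) to the real line."""
    return log(x / (1.0 - x))


def sigmoid(h):
    """Inverse mirror map Psi^{-1}(h) = 1 / (1 + exp(-h))."""
    return 1.0 / (1.0 + exp(-h))


def mirror_descent_step(gamma, grad, step):
    """Eqs. (47)-(48): gradient step in the dual space, then map back to (0, 1)."""
    return [sigmoid(logit(g) - step * dg) for g, dg in zip(gamma, grad)]


# Toy objective (an assumption, standing in for rho_F(T)):
# O(gamma) = sum_i (gamma_i - target_i)^2, whose gradient is known in closed form.
target = [0.9, 0.1, 0.5]


def toy_gradient(gamma):
    return [2.0 * (g - t) for g, t in zip(gamma, target)]


# The starting point must lie strictly inside (0, 1) for the logit to be defined.
gamma = [0.5, 0.5, 0.5]
for _ in range(2000):
    gamma = mirror_descent_step(gamma, toy_gradient(gamma), step=0.5)
```

Because every iterate is the image of a real number under the sigmoid, the box constraint in Eq. (45) holds by construction and no clipping is required.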

In general, the above optimization process tends to increase the total resources ∑_{i,t} γ_i(t). To prevent the violation of the constraint in Eq. (46) during the updates, we suppress the gradient component that increases the total resources when ∑_{i,t} γ_i(t) ≥ (1 − ϵ)γ^tot, by shifting the gradient gⁿ in Eq. (48) by a magnitude bⁿ

$${b}^{n}\leftarrow \frac{\sum_{t,i}{\gamma }_{i}^{n}(t)\left(1-{\gamma }_{i}^{n}(t)\right)\frac{\partial }{\partial {\gamma }_{i}(t)}\mathcal{O}({\gamma }^{n})}{\sum_{t,i}{\gamma }_{i}^{n}(t)\left(1-{\gamma }_{i}^{n}(t)\right)},$$

(49)

$${g}^{n}\leftarrow {\nabla }_{\gamma }\mathcal{O}({\gamma }^{n})-{b}^{n}.$$

(50)

The rationale for the choice of bⁿ is explained in Supplementary Note 4. In our implementation of the algorithm, we choose ϵ = 0.02. Even though the shifted gradient method is used, it does not strictly forbid the violation of the constraint in Eq. (46). If the resource capacity constraint is violated, we project the control variables to the feasible region through the simple rescaling

$${\gamma }^{n}\leftarrow \frac{{\gamma }^{\mathrm{tot}}}{\sum_{t,i}{\gamma }_{i}^{n}(t)}{\gamma }^{n}.$$

(51)
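The budget handling in Eqs. (49)–(51) can be sketched as follows. One way to read Eq. (49), offered here as our interpretation rather than a reproduction of Supplementary Note 4: with the logit mirror map, a dual-space step of size s changes γ_i(t) by approximately −s γ_i(t)(1 − γ_i(t)) g_i(t) to first order, so subtracting the γ(1 − γ)-weighted mean of the gradient makes the first-order change of the total resources vanish. The control variables are flattened into a single list, and all names and numbers are illustrative.

```python
def shifted_gradient(gamma, grad):
    """Eqs. (49)-(50): subtract the gamma(1 - gamma)-weighted mean of the gradient."""
    weights = [g * (1.0 - g) for g in gamma]
    b = sum(w * dg for w, dg in zip(weights, grad)) / sum(weights)
    return [dg - b for dg in grad]


def project_to_budget(gamma, budget):
    """Eq. (51): rescale gamma when its total exceeds the budget gamma_tot."""
    total = sum(gamma)
    if total <= budget:
        return list(gamma)
    return [g * budget / total for g in gamma]


gamma = [0.2, 0.5, 0.8]     # flattened gamma_i(t), illustrative values
grad = [-1.0, 0.3, -2.0]    # illustrative gradient of the objective
budget, eps = 1.5, 0.02

# Apply the shift only when the total is close to (or above) the budget.
if sum(gamma) >= (1.0 - eps) * budget:
    grad = shifted_gradient(gamma, grad)

# First-order change of the total resources under a dual-space step is
# proportional to sum_i gamma_i (1 - gamma_i) grad_i, which vanishes after the shift.
first_order_change = sum(g * (1.0 - g) * dg for g, dg in zip(gamma, grad))

# The rescaling factor is below 1 whenever it is applied, so the box constraint is preserved.
projected = project_to_budget([0.6, 0.9, 0.5], budget=1.0)
```

The shift only cancels the first-order change, which is consistent with the remark above that the constraint can still be violated and a final rescaling is needed.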

Finally, the resource capacity constraint Eq. (46) implies that a γ^tot amount of protection resources can be distributed in different time steps. In some scenarios, the resources arrive in an online fashion, e.g., a limited number of vaccines can be produced every day. In these cases, there is a resource capacity constraint at each time step. Some results of such a scenario are discussed in Supplementary Note 5.

Case study in a tree network

We first verify the effectiveness of the optimization method by considering a simple problem on a binary tree network of size N = 63, where both layers have the same topology. The results are shown in Fig. 5, where three individuals are chosen to be the infected seeds at time t = 0, and the outbreak is simulated for T = 50 time steps. Without any mitigation strategy, more than half of the population fails by the end of the process.

**Fig. 5: Mitigation of the network failures in a binary tree network of size N = 63, where both layers have the same topology.**

We then protect some vital nodes to mitigate the system failure, by using the optimization method proposed above. In Fig. 5a–c, we restrict the total resources to be γ^tot = 5. Fig. 5a shows that the optimization algorithm successfully reduces the final failure rate, which demonstrates the effectiveness of the method. We found that the optimal protection resource distribution $\{{\gamma }_{i}^{* }(t)\}$ mostly concentrates on a few nodes at a certain time step (as shown in Fig. 5b), which implies that we can confidently select which nodes to protect. All the nodes with high ${\gamma }_{i}^{* }(t)$ receive protection at time t = 0, which implies that the best mitigation strategy in this example is to distribute all γ^tot resources as early as possible to stop the infection spread. Figure 5c shows the optimal placement of resources, which can completely block the infection spread, hence minimizing the network failure. In this example, both layers a and b have the same network structure, which is depicted in Fig. 5c.

Similar phenomena are observed in the case with γ^tot = 4 as shown in Fig. 5d–f, except that the protections are not sufficient to completely block the infection spread. The optimization algorithm sacrifices only two nodes in the vicinity of the infected node in the lower right corner of Fig. 5f (indicated by a black arrow), leaving other parts of the network in the normal state.

In Fig. 6, we further examine the influence of the total resource availability, i.e., γ^tot, on the final failure size N ⋅ ρ_F(T) determined at the optimal solution ${\gamma }_{i}^{* }(t)$. It is observed that when γ^tot increases, the failure size (at the optimum) first decreases monotonically and then saturates when γ^tot reaches a value at which there are enough protection resources to completely block the infection transmission. Another interesting observation is that for the cases with more initially infected seeds, each additional unit of protection resource yields a smaller reduction in failure size than in the cases with fewer initially infected seeds.

**Fig. 6: Final failure size N ⋅ ρ_F(T) of a binary tree network evaluated at the optimal solution $\{{\gamma }_{i}^{* }(t)\}$, as a function of the amount of total resources γ^tot.**

The good performance of the optimization relies on having enough protection resources (i.e., a large γ^tot) and on knowing the origins of the outbreak. In some cases, whether a node was infected at the initial time is not fully determined but follows a probability distribution. Such cases can be easily accommodated in the DMP framework, which is intrinsically probabilistic. We investigated such a scenario with probabilistic seeding in Supplementary Note 6, and found that the optimization method can still effectively reduce the sizes of network failures.

Case study in a synthetic network

To further showcase the applicability of the optimization algorithm for failure mitigation, we consider a synthetic technological multiplex network where layer a represents a communication network and layer b represents a power network. We consider the scenario that the communication network can be attacked by malware but can also be protected by technicians, which is modeled by the proposed SIRP model. The infection of a node in the communication network causes the breakdown of the corresponding node in the power network. The breakdown of components in a power network can trigger further failures and form a cascade, which is modeled by the proposed LTM model. We have neglected the details of the power flow dynamics in order to obtain a tractable model and an insightful simple example.

Here, we extract the network topology from the IEEE 118-bus test case to form layer b⁵³, which has N = 118 nodes. We then obtain layer a by rewiring a regular graph of the same size with degree K = 4 using a rewiring probability ${p}_{\mathrm{rewire}}=0.3$, which creates a Watts-Strogatz small-world network and mimics the topology of communication networks⁵⁴. The resulting multiplex network is plotted in Fig. 7a.
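For readers who wish to build a comparable layer a, the sketch below follows the standard Watts-Strogatz construction: a ring lattice of even degree whose edges are each rewired with a fixed probability. It is an illustrative assumption about the procedure, not the code used in this study; the function name and the seed are hypothetical, and the resulting graph is not guaranteed to be connected.

```python
import random


def watts_strogatz(n, k, p_rewire, seed=None):
    """Ring lattice with n nodes of even degree k, each edge rewired with probability p_rewire."""
    rng = random.Random(seed)
    adj = {i: set() for i in range(n)}
    # Build the ring lattice: each node links to its k/2 nearest neighbors on each side.
    for i in range(n):
        for d in range(1, k // 2 + 1):
            j = (i + d) % n
            adj[i].add(j)
            adj[j].add(i)
    # Rewire the far endpoint of each lattice edge with probability p_rewire.
    for d in range(1, k // 2 + 1):
        for i in range(n):
            j = (i + d) % n
            if j in adj[i] and rng.random() < p_rewire:
                # Candidate new endpoints: no self-loop, no duplicate edge.
                choices = [v for v in range(n) if v != i and v not in adj[i]]
                if choices:
                    new_j = rng.choice(choices)
                    adj[i].discard(j)
                    adj[j].discard(i)
                    adj[i].add(new_j)
                    adj[new_j].add(i)
    return adj


layer_a = watts_strogatz(n=118, k=4, p_rewire=0.3, seed=0)
num_edges = sum(len(nbrs) for nbrs in layer_a.values()) // 2
```

Rewiring moves edges without creating or deleting them, so the number of edges stays at N·K/2 = 236 regardless of the random seed.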

**Fig. 7: A synthetic two-layer network and the evolution of its failure rate.**

As the failures in layer b are initially induced by the infections in layer a, one may wonder whether deploying the protection resources by minimizing the size of infections, i.e., minimizing ρ_I(T) + ρ_R(T) instead of minimizing ρ_F(T), is already sufficient to mitigate the final failures. To investigate this effect, we replace the objective function in Eq. (44) by ${\mathcal{O}}^{a}(\gamma )={\rho }_{I}(T)+{\rho }_{R}(T)$ and solve the optimization problem using the same techniques as in the subsection “The Optimization Framework”. The result is shown in Fig. 7b, which suggests that blocking the infection is as good as minimizing the original objective function in Eq. (44) for the purpose of minimizing the total failure size. Minimizing either objective function yields a much larger improvement than the random deployment of the same amount of protection resources in this case.

The results in Fig. 7b point to the conventional wisdom that one should do one’s best to stop the epidemic or malware spread (in layer a) to mitigate system failure. The situation will be different if there are vital components in layer b, which should be protected to prevent the failure cascade. This is typically manifested in the heterogeneity of the network connectivity or the system parameters. To showcase this effect, we manually plant a vulnerable connected cluster in layer b by setting the influence parameters b_ji for an edge (i, j) in this cluster as b_ji > Θ_i, so that the failure of node j itself is already sufficient to trigger the failure of node i. Such a set-up is relevant for commercial, industrial and engineering networks, among others; e.g., supply chain networks evolve to enhance their throughput and efficiency but may operate with little redundancy and low robustness. In this case, we found that minimizing ρ_F(T) yields a much larger improvement than minimizing ρ_I(T) + ρ_R(T) for the purpose of mitigating the system failure, as shown in Fig. 7c.

Case study in a real-world social network

Lastly, we examine Kapferer’s tailor shop network, a well-known social network dataset gathered by B. Kapferer in Zambia, documenting interactions among workers in a tailor shop^55,56. This dataset records two types of interactions across two different time frames. The first interaction type is termed “sociational”, which encapsulates friendship and socioemotional relationships among the workers. The second interaction type is termed “instrumental”, which reflects work- and assistance-related connections among them. For our analysis, the “sociational” network observed in the initial time frame is assigned to layer a, acting as the substrate for infection transmission, while the corresponding “instrumental” network is assigned to layer b, where the failure (in terms of work accomplishment) of a node can be triggered by the malfunctioning of its neighboring nodes. These networks are treated as undirected graphs for simplicity. The resulting two-layer network is depicted in Fig. 8a.

**Fig. 8: The Kapferer’s tailor shop network and the evolution of its failure rate.**

We assign homogeneous values to the majority of system parameters without deliberately introducing any vulnerable component in the network; the set-up closely aligns with the scenario depicted in Fig. 7b and presents a stark contrast to the scenario in Fig. 7c. We select the five nodes with the highest degrees to serve as the initially infected individuals, which can be viewed as super-spreaders in the network. We then protect the vital nodes to mitigate the system failures by using the optimization method as above; the result is shown in Fig. 8b. Interestingly, minimizing the size of failures (i.e., ρ_F(T)) is evidently better than minimizing the size of infections (i.e., ρ_I(T) + ρ_R(T)) for the purpose of failure mitigation. This suggests that in this realistic and natural scenario, simply blocking the infection transmission is sub-optimal and one needs to take a holistic view of the two-layer model for optimizing the network’s utility.

Conclusion

We investigate the nature of a type of two-layer spreading process in unidirectionally dependent networks comprising two interacting layers a and b. Disease or malware spreads in layer a, which can trigger cascading failures in layer b, leading to secondary disasters. The spreading processes in the two layers are modeled by the SIRP and LTM models, respectively. To tackle the complex stochastic dynamics in the two-layer networks, we utilize the dynamic message-passing method by working out the dynamic belief propagation equations. The resulting DMP algorithms have low computational complexity in sparse networks and allow us to perform accurate and efficient inference of the system states.

Based on the DMP method, we systematically studied and evaluated the impact of the infection-induced cascading failures. The cascade process in layer b can lead to large-scale network failures, even when the infection rate in layer a remains at a relatively low level. By considering a homogeneous network topology and homogeneous system parameters, we derive the asymptotic and large-size limits of the DMP equations. The asymptotic limit of the two-layer spreading processes corresponds to the coupling between a bond percolation model and a bootstrap percolation model, which can be analytically solved. The infection outbreak size in layer a changes continuously from zero to non-zero as the infection probability β surpasses a transition point ${\beta }_{c}^{a}$, while the failure size in layer b can exhibit a discontinuous jump to the completely failed state when β surpasses another transition point ${\beta }_{c}^{b}$ under certain conditions. All these results highlight the observation that cascading failure propagation in layer b can drastically amplify the impact of the epidemic outbreaks in layer a, which requires special attention.

Another advantage of the DMP method is that it yields a set of closed-form equations, which can be very useful for other downstream analyses and tasks. We exploited this property to devise optimization algorithms for mitigating network failure. The optimization method works by back-propagating the impact at the final time to adjust the control parameters (i.e., the protection probabilities). The mirror descent method and a heuristic gradient shift method were also used to handle the constraints on the control parameters. We show that the resulting algorithm can effectively minimize the size of system failures. We believe that our dedicated analyses provide valuable insights and a deeper understanding of the impact of infection-induced cascading failures on networks, and that the obtained optimization algorithms will be useful for practical applications in systems of this kind.

Data availability

Datasets cited in this study are publicly accessible and have been referenced accordingly in the manuscript.

Code availability

Source codes of the methods and analyses used in this study are available at https://github.com/boli8/DMP-for-SIRP-LTM.

References

Pak, A. et al. Economic consequences of the covid-19 outbreak: the need for epidemic preparedness. Front. Public Health 8, https://www.frontiersin.org/articles/10.3389/fpubh.2020.00241 (2020).
Chaturvedi, K., Vishwakarma, D. K. & Singh, N. Covid-19 and its impact on education, social life and mental health of students: A survey. Children Youth Serv. Rev. 121, 105866 (2021).
Cochran, A. L. Impacts of covid-19 on access to transportation for people with disabilities. Transp. Res. Interdiscipl. Perspect. 8, 100263 (2020).
Xu, Z., Elomri, A., Kerbache, L. & El Omri, A. Impacts of covid-19 on global supply chains: Facts and perspectives. IEEE Eng. Manag. Rev. 48, 153–166 (2020).
Aday, S. & Aday, M. S. Impact of COVID-19 on the food supply chain. Food Qual. Safety 4, 167–180 (2020).
Amini, M. H., Arasteh, H. & Siano, P. Sustainable Smart Cities Through the Lens of Complex Interdependent Infrastructures: Panorama and State-of-the-art, 45–68 (Springer International Publishing, Cham, 2019). https://doi.org/10.1007/978-3-319-98923-5_3.
Liu, X., Chen, B., Chen, C. & Jin, D. Electric power grid resilience with interdependencies between power and communication networks - a review. IET Smart Grid 3, 182–193 (2020).
Guo, H., Zheng, C., Iu, H. H.-C. & Fernando, T. A critical review of cascading failure analysis and modeling of power system. Renew. Sustain. Energy Rev. 80, 9–22 (2017).
Castillo-Chavez, C., Huang, W. & Li, J. Competitive exclusion in gonorrhea models and other sexually transmitted diseases. SIAM J. Appl. Math. 56, 494–508 (1996).
Castillo-Chavez, C., Huang, W. & Li, J. Competitive exclusion and coexistence of multiple strains in an sis std model. SIAM J. Appl. Math. 59, 1790–1811 (1999).
Karrer, B. & Newman, M. E. J. Competing epidemics on complex networks. Phys. Rev. E 84, 036106 (2011).
Cai, W., Chen, L., Ghanbarnejad, F. & Grassberger, P. Avalanche outbreaks emerging in cooperative contagions. Nat. Phys. 11, 936–940 (2015).
Wang, W., Liu, Q.-H., Liang, J., Hu, Y. & Zhou, T. Coevolution spreading in complex networks. Phys. Rep. 820, 1–51 (2019).
Sun, H., Saad, D. & Lokhov, A. Y. Competition, collaboration, and optimization in multiple interacting spreading processes. Phys. Rev. X 11, 011048 (2021).
Liu, J. et al. Analysis and control of a continuous-time bi-virus model. IEEE Trans. Automatic Control 64, 4891–4906 (2019).
Buldyrev, S. V., Parshani, R., Paul, G., Stanley, H. E. & Havlin, S. Catastrophic cascade of failures in interdependent networks. Nature 464, 1025–1028 (2010).
Bashan, A., Berezin, Y., Buldyrev, S. V. & Havlin, S. The extreme vulnerability of interdependent spatially embedded networks. Nat. Phys. 9, 667–672 (2013).
Valdez, L. D. et al. Cascading failures in complex networks. J. Complex Netw. 8, cnaa013 (2020).
Pastor-Satorras, R., Castellano, C., Van Mieghem, P. & Vespignani, A. Epidemic processes in complex networks. Rev. Mod. Phys. 87, 925–979 (2015).
Adam, D. Special report: The simulations driving the world’s response to COVID-19. Nature 580, 316–318 (2020).
Wang, W., Tang, M., Stanley, H. E. & Braunstein, L. A. Unification of theoretical approaches for epidemic spreading on complex networks. Rep. Progr. Phys. 80, 036603 (2017).
Karrer, B. & Newman, M. E. J. Message passing approach for general epidemic models. Phys. Rev. E 82, 016101 (2010).
Lokhov, A. Y., Mézard, M., Ohta, H. & Zdeborová, L. Inferring the origin of an epidemic with a dynamic message-passing algorithm. Phys. Rev. E 90, 012801 (2014).
Lokhov, A. Y., Mézard, M. & Zdeborová, L. Dynamic message-passing equations for models with unidirectional dynamics. Phys. Rev. E 91, 012811 (2015).
Koher, A., Lentz, H. H. K., Gleeson, J. P. & Hövel, P. Contact-based model for epidemic spreading on temporal networks. Phys. Rev. X 9, 031017 (2019).
Li, B. & Saad, D. Impact of presymptomatic transmission on epidemic spreading in contact networks: A dynamic message-passing analysis. Phys. Rev. E 103, 052303 (2021).
Lokhov, A. Reconstructing parameters of spreading models from partial observations. In Lee, D., Sugiyama, M., Luxburg, U., Guyon, I. & Garnett, R. (eds.) Proceedings of the 30th International Conference on Neural Information Processing Systems, vol. 29, 3467–3475 (Curran Associates Inc., 2016).
Lokhov, A. Y. & Saad, D. Optimal deployment of resources for maximizing impact in spreading processes. Proc. Natl Acad. Sci. 114, E8138–E8146 (2017).
Boccaletti, S. et al. The structure and dynamics of multilayer networks. Phys. Rep. 544, 1–122 (2014).
Balcan, D. et al. Modeling the spatial spread of infectious diseases: The global epidemic and mobility computational model. J. Comput. Sci. 1, 132–145 (2010).
Garetto, M., Gong, W. & Towsley, D. Modeling malware spreading dynamics. In IEEE INFOCOM 2003. Twenty-second Annual Joint Conference of the IEEE Computer and Communications Societies (IEEE Cat. No.03CH37428), 3, 1869–1879 (IEEE, 2003).
Watts, D. J. A simple model of global cascades on random networks. Proc. Natl Acad. Sci. 99, 5766–5771 (2002).
Kempe, D., Kleinberg, J. & Tardos, É. Maximizing the spread of influence through a social network. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’03, 137–146 (Association for Computing Machinery, 2003). https://doi.org/10.1145/956750.956769.
Motter, A. E. & Lai, Y.-C. Cascade-based attacks on complex networks. Phys. Rev. E 66, 065102 (2002).
Carreras, B. A., Lynch, V. E., Dobson, I. & Newman, D. E. Critical points and transitions in an electric power transmission model for cascading failure blackouts. Chaos 12, 985–994 (2002).
Crucitti, P., Latora, V. & Marchiori, M. Model for cascading failures in complex networks. Phys. Rev. E 69, 045104 (2004).
Su, Z. & Kurths, J. A dynamic message-passing approach for social contagion in time-varying multiplex networks. Europhys. Lett. 123, 68004 (2018).
Altarelli, F., Braunstein, A., Dall’Asta, L. & Zecchina, R. Large deviations of cascade processes on graphs. Phys. Rev. E 87, 062115 (2013).
Mézard, M. & Montanari, A. Information, Physics, and Computation (Oxford University Press, Oxford, 2009). https://doi.org/10.1093/acprof:oso/9780198570837.001.0001.
Torrisi, G., Annibale, A. & Kühn, R. Overcoming the complexity barrier of the dynamic message-passing method in networks with fat-tailed degree distributions. Phys. Rev. E 104, 045313 (2021).
Parshani, R., Rozenblat, C., Ietri, D., Ducruet, C. & Havlin, S. Inter-similarity between coupled networks. Europhys. Lett. 92, 68002 (2011).
Cellai, D., López, E., Zhou, J., Gleeson, J. P. & Bianconi, G. Percolation in multiplex networks with overlap. Phys. Rev. E 88, 052811 (2013).
Shrestha, M., Scarpino, S. V. & Moore, C. Message-passing approach for recurrent-state epidemic models on networks. Phys. Rev. E 92, 022821 (2015).
Grassberger, P. On the critical behavior of the general epidemic process and dynamical percolation. Math. Biosci. 63, 157–172 (1983).
Li, M. et al. Percolation on complex networks: Theory and application. Phys. Rep. 907, 1–68 (2021).
Newman, M. E. J. Spread of epidemic disease on networks. Phys. Rev. E 66, 016128 (2002).
Chalupa, J., Leath, P. L. & Reich, G. R. Bootstrap percolation on a Bethe lattice. J. Phys. C 12, L31 (1979).
Zhou, J., Zhao, Y. & Ye, Y. Complex dynamics and control strategies of SEIR heterogeneous network model with saturated treatment. Phys. A 608, 128287 (2022).
Baydin, A. G., Pearlmutter, B. A., Radul, A. A. & Siskind, J. M. Automatic differentiation in machine learning: A survey. J. Mach. Learn. Res. 18, 5595–5637 (2017).
Li, Q., Chen, L., Tai, C. & E, W. Maximum principle based algorithms for deep learning. J. Mach. Learn. Res. 18, 1–29 (2018).
Nemirovski, A. & Yudin, D. Problem Complexity and Method Efficiency in Optimization (Wiley, 1983).
Beck, A. & Teboulle, M. Mirror descent and nonlinear projected subgradient methods for convex optimization. Oper. Res. Lett. 31, 167–175 (2003).
Christie, R. Power systems test case archive, University of Washington. Available at: https://labs.ece.uw.edu/pstca/pf118/pg_tca118bus.htm (1993).
Cai, Y., Li, Y., Cao, Y., Li, W. & Zeng, X. Modeling and impact analysis of interdependent characteristics on cascading failures in smart grids. Int. J. Electrical Power Energy Syst. 89, 106–114 (2017).
Kapferer, B. Strategy and Transaction in an African Factory (Manchester University Press, Manchester, 1972).
Kapferer tailor shop data set. Available at the UCI Network Data Repository: https://networkdata.ics.uci.edu/netdata/html/kaptail.html (1972).


Acknowledgements

B.L. and D.S. acknowledge support from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie Grant Agreement No. 835913. B.L. acknowledges support from the National Natural Science Foundation of China (Grant No. 12205066), the Shenzhen Start-Up Research Funds (Grant No. BL20230925) and the start-up funding from Harbin Institute of Technology, Shenzhen (Grant No. 20210134). D.S. acknowledges support from the Leverhulme Trust (RPG-2018-092) and the EPSRC programme grant TRANSNET (EP/R035342/1).

Author information

Authors and Affiliations

School of Science, Harbin Institute of Technology (Shenzhen), Shenzhen, 518055, China
Bo Li
Non-linearity and Complexity Research Group, Aston University, Birmingham, B4 7ET, UK
Bo Li & David Saad

Authors

Bo Li
David Saad

Contributions

B.L. and D.S. conceived the project and developed the theoretical framework. B.L. carried out the theoretical calculations and the numerical simulations. B.L. and D.S. discussed the results and prepared the manuscript.

Corresponding authors

Correspondence to Bo Li or David Saad.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Communications Physics thanks Lenka Zdeborova, Louis Shekhtman, Davide Ghio and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. A peer review file is available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Peer review file

Supplementary Information

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.


About this article

Cite this article

Li, B., Saad, D. Infection-induced cascading failures – impact and mitigation. Commun Phys 7, 144 (2024). https://doi.org/10.1038/s42005-024-01638-1


Received: 07 September 2023
Accepted: 18 April 2024
Published: 04 May 2024
DOI: https://doi.org/10.1038/s42005-024-01638-1
