Article

Scheduling to Minimize Age of Incorrect Information with Imperfect Channel State Information

Department of Electrical and Computer Engineering, University of Maryland, College Park, MD 20742, USA
*
Author to whom correspondence should be addressed.
Submission received: 3 November 2021 / Revised: 21 November 2021 / Accepted: 23 November 2021 / Published: 25 November 2021
(This article belongs to the Special Issue Age of Information: Concept, Metric and Tool for Network Control)

Abstract

In this paper, we study a slotted-time system where a base station needs to update multiple users at the same time. Due to limited resources, only part of the users can be updated in each time slot. We consider the problem of minimizing the Age of Incorrect Information (AoII) when imperfect Channel State Information (CSI) is available. Leveraging the notion of the Markov Decision Process (MDP), we obtain the structural properties of the optimal policy. By introducing a relaxed version of the original problem, we develop Whittle's index policy under a simple condition. However, indexability is required to ensure the existence of Whittle's index. To avoid the indexability requirement, we develop the Indexed priority policy based on the optimal policy for the relaxed problem. Finally, numerical results are laid out to showcase the application of the derived structural properties and highlight the performance of the developed scheduling policies.

1. Introduction

The Age of Incorrect Information (AoII) is introduced in [1] as a combination of age-based metrics (e.g., Age of Information (AoI)) and error-based metrics (e.g., Minimum Mean Square Error). In communication systems, AoII captures not only the information mismatch between the source and the destination but also the aging process of inconsistent information. Hence, two functions dominate AoII. The first is the time penalty function, which reflects how the inconsistency of information affects the system over time. In real-life applications, inconsistent information will affect different communication systems in different ways. For example, machine temperature monitoring is time-sensitive because the damage caused by overheating will accumulate quickly. However, reservoir water level monitoring is less sensitive to time. Therefore, by adopting different time penalty functions, AoII can capture different aging processes of the mismatch in different systems. The second is the information penalty function, which captures the information mismatch between the source and the destination. It allows us to measure mismatches in different ways, depending on how sensitive different systems are to information inconsistencies. For example, the navigation system requires precise information to give correct instructions, but the real-time delivery tracking system does not need very accurate location information. Since we can choose different penalty functions for different systems, AoII is adaptable to various communication goals, which is why it is regarded as a semantic metric [2].
Since the introduction of AoII, several studies have been performed to reveal its fundamental nature. The authors of [3] consider a system with random packet delivery times and compare AoII with AoI and real-time error via extensive numerical results. The authors of [4] study the problem of minimizing the AoII that takes the general time penalty function. Three real-life applications are considered to showcase the performance advantages of AoII over AoI and real-time error. In [5], the authors investigate the AoII that considers the quantified mismatch between the source and the destination. The optimization problem is studied when the system is resource-constrained. The authors of [6] studied the AoII minimization problem in the context of scheduling. It considers a system where the central scheduler needs to update multiple users at the same time. However, the central scheduler cannot know the states of the sources before receiving the updates. By introducing the belief value, Whittle’s index policy is developed and evaluated. In this paper, we also consider the problem of minimizing AoII in scheduling. Different from [6], we consider the generic time penalty function and study the minimization problem in the presence of imperfect Channel State Information (CSI). Due to the existence of CSI, Whittle’s index policy becomes infeasible in general. Hence, we introduce another scheduling policy that is more versatile and has comparable performance to Whittle’s index policy.
The problem of scheduling to minimize AoI is studied under various system settings in [7,8,9,10,11]. The problem studied in this paper is different and more complicated because AoII considers the aging process of inconsistent information rather than the aging process of updates. Meanwhile, none of them consider the case where CSI is available. The problem of optimizing information freshness in the presence of CSI is studied in [12,13]. However, they focus on the system with a single user and mainly discuss the case where CSI is perfect. The scheduling problems with the goal of minimizing an error-based performance measure are considered in [14,15,16]. Our problem is fundamentally different because AoII also considers the time effect. Moreover, we consider the system where a base station observes multiple sources simultaneously and needs to send updates to multiple destinations.
The main contributions of this work can be summarized as follows. (1) We study the problem of minimizing AoII in a multi-user system where imperfect CSI is available. Meanwhile, the time penalty function is generic. (2) We derive the structural properties of the optimal policy for the considered problem. (3) We establish the indexability of the considered problem under a simple condition and develop Whittle’s index policy. (4) We obtain the optimal policy for a relaxed version of the original problem. By exploring the characteristics of the relaxed problem, we provide an efficient algorithm to obtain the optimal policy. (5) Based on the optimal policy for the relaxed problem, we develop the Indexed priority policy that is free from indexability and has comparable performance to Whittle’s index policy.
The remainder of this paper is organized in the following way. In Section 2, we introduce the system model and formulate the primal problem. Section 3 explores the structural properties of the optimal policy for the primal problem. Under a simple condition, we develop Whittle’s index policy in Section 4. Section 5 presents the optimal policy for a relaxed version of the primal problem. On this basis, we develop the Indexed priority policy in Section 6. Finally, in Section 7, the numerical results are laid out.

2. System Overview

2.1. Communication Model

We consider a slotted-time system with N users and one base station. Each user is composed of a source process, a channel, and a receiver. We assume all the users share the same structure, but the parameters are different. The structure of the communication model is provided in Figure 1.
For user i, the source process is modeled by a two-state Markov chain where transitions happen between the two states with probability p_i > 0 and self-transitions happen with probability 1 − p_i. At any time slot t, the state of the source process X_{i,t} ∈ {0,1} will be reported to the base station as an update, and the base station will decide whether to transmit this update through the corresponding channel. The channel is unreliable, but an estimate of the Channel State Information (CSI) is available at the beginning of each time slot. Let r_{i,t} ∈ {0,1} be the CSI at time t. We assume that r_{i,t} is independent across time and user indices. r_{i,t} = 1 if and only if the transmission attempt at time t will succeed, and r_{i,t} = 0 otherwise. Then, we denote by r̂_{i,t} ∈ {0,1} the estimate of r_{i,t}. We assume that r̂_{i,t} is an independent Bernoulli random variable with parameter γ_i, i.e., r̂_{i,t} = 1 with probability γ_i ∈ [0,1] and r̂_{i,t} = 0 with probability 1 − γ_i. However, the estimate is imperfect. We assume that the error depends only on the user and its estimate. More precisely, we define the probability of error as p_{e,i}^{r̂_i} ≜ Pr[r_i ≠ r̂_i | r̂_i]. We assume p_{e,i}^{r̂_i} < 0.5 because we can flip the estimate if p_{e,i}^{r̂_i} > 0.5. We are not interested in the case of p_{e,i}^{r̂_i} = 0.5 since r̂_{i,t} is useless in this case. Although the channel is unreliable, each transmission attempt takes exactly one time slot regardless of the result, and the successfully transmitted update will not be corrupted. Every time an update is received, the receiver will use it as the new estimate X̂_{i,t}. The receiver will send an ACK/NACK packet to inform the base station of its reception of the new update. Since an ACK/NACK packet is generally very small and simple, we assume that it is transmitted reliably and received instantaneously. Then, if ACK is received, the base station knows that the receiver's estimate changed to the transmitted update. If NACK is received, the base station knows that the receiver's estimate did not change. Therefore, the base station always knows the estimate at the receiver side.
At the beginning of each time slot, the base station receives updates from each source and the estimates of CSI from each channel. The old updates and estimates are discarded upon the arrival of new ones. Then, the base station decides which updates to transmit, and the decision is independent of the transmission history. Due to the limited resources, at most M < N updates are allowed per transmission attempt. We consider a base station that always transmits M updates.

2.2. Age of Incorrect Information

All the users adopt AoII as a performance metric, but the choices of penalty functions vary. Let X t and X ^ t be the true state and the estimate of the source process, respectively. Then, in a slotted-time system, AoII can be expressed as follows
\Delta_{AoII}(X_t, \hat{X}_t, t) = \sum_{k=U_t+1}^{t} g(X_k, \hat{X}_k) \times F(k - U_t),
where U_t is the last time instant before time t (including t) at which the receiver's estimate is correct. g(X_t, X̂_t) can be any information penalty function that captures the difference between X_t and X̂_t. F(t) ≜ f(t) − f(t−1), where f(t) can be any time penalty function that is non-decreasing in t. We consider the case where the users adopt the same information penalty function g(X_t, X̂_t) = |X_t − X̂_t| but possibly different time penalty functions. To ease the analysis, we require f(t) to be unbounded. Combined together, we require f(t_1) ≤ f(t_2) if t_1 < t_2 and lim_{t→+∞} f(t) = +∞. Without loss of generality, we assume f(0) = 0. As the source is modeled by a two-state Markov chain, g(X_t, X̂_t) ∈ {0,1}. Hence, Equation (1) can be simplified to
\Delta_{AoII}(X_t, \hat{X}_t, t) = \sum_{k=U_t+1}^{t} F(k - U_t) = f(s_t),
where s_t ≜ t − U_t. Therefore, the evolution of s_t is sufficient to characterize the evolution of AoII. To this end, we distinguish between the following cases.
  • When the receiver's estimate is correct at time t+1, we have U_{t+1} = t+1. Then, by definition, s_{t+1} = 0.
  • When the receiver's estimate is incorrect at time t+1, we have U_{t+1} = U_t. Then, by definition, s_{t+1} = t + 1 − U_t = s_t + 1.
To sum up, we get
s_{t+1} = \mathbb{1}\{U_{t+1} \neq t+1\} \times (s_t + 1).
A sample path of s t is shown in Figure 2. In the remainder of this paper, we use f i ( · ) to denote the time penalty function user i adopts.
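To make the recursion concrete, the following minimal Python sketch simulates a single two-state source and tracks s_t (hence the AoII value f(s_t)). The function and argument names are ours, the delivery rule is a hypothetical placeholder for the scheduling decision and channel outcome, and a linear penalty f(s) = s is assumed by default.

```python
import random

def simulate_aoii(p, T, f=lambda s: s, deliver=lambda t: False, seed=0):
    """Toy simulation of s_t for one two-state source.

    p       : source transition probability, Pr[X_{t+1} != X_t]
    f       : time penalty function with f(0) = 0 (linear by default)
    deliver : deliver(t) == True means X_t is successfully delivered at slot t,
              i.e., the receiver's estimate becomes X_t at slot t+1
    Returns the sequence of AoII values f(s_t).
    """
    rng = random.Random(seed)
    X, X_hat, s = 0, 0, 0
    aoii = [f(s)]
    for t in range(T):
        X_hat_next = X if deliver(t) else X_hat        # estimate update
        X_next = X ^ 1 if rng.random() < p else X      # source transition
        s = 0 if X_hat_next == X_next else s + 1       # recursion (3)
        X, X_hat = X_next, X_hat_next
        aoii.append(f(s))
    return aoii

# Example: never transmit; the mismatch then persists with probability 1 - p.
vals = simulate_aoii(p=0.2, T=10_000)
print(sum(vals) / len(vals))
```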
Remark 1.
Under this particular choice of the penalty function, s t can be interpreted as the time elapsed since the last time the receiver’s estimate is correct. Please note that s t is different from the Age of Information (AoI) [17], which is defined as the time elapsed since the generation time of the last received update. We can see that AoI considers the aging process of the update, while AoII considers the aging process of the estimation error. At the same time, s t is also fundamentally different from the holding time, which, according to [18,19], is defined as the time elapsed since the last successful transmission. We notice that the receiver’s estimate can become correct even when no new update is successfully transmitted. Moreover, the information carried by the update may have become incorrect by the time it is received. We also notice that [18,19] consider the problem of minimizing the estimation error. However, by adopting AoII as the performance metric, we study the impact of estimation error on the system.

2.3. System Dynamic

In this section, we characterize the system dynamics. We notice that the status of user i can be captured by the pair x_{i,t} ≜ (s_{i,t}, r̂_{i,t}). In the following, we will use x_{i,t} and (s_{i,t}, r̂_{i,t}) interchangeably. Then, the system dynamics can be fully characterized by the dynamics of x_t ≜ (x_{1,t}, …, x_{N,t}). Hence, it suffices to characterize the value of x_{t+1} given x_t and the base station's action. To this end, we denote by a_t = (a_{1,t}, …, a_{N,t}) the base station's action at time t. a_{i,t} = 1 if the base station transmits the update from user i at time t and a_{i,t} = 0 otherwise. We notice that, given action a_t, the users are independent and the action taken on user i will only affect user i. Consequently,
Pr(x_{t+1} \mid x_t, a_t) = \prod_{i=1}^{N} Pr(x_{i,t+1} \mid x_t, a_t) = \prod_{i=1}^{N} Pr(x_{i,t+1} \mid x_{i,t}, a_{i,t}).
Combined with the fact that all the users share the same structure, it is sufficient to study the dynamic of a single user. In the following discussions, we drop the user-dependent subscript i. We recall that r ^ t + 1 is an independent Bernoulli random variable. Then, we have
Pr(x_{t+1} \mid x_t, a_t) = P(\hat{r}_{t+1}) \times Pr(s_{t+1} \mid x_t, a_t).
By definition, P(r̂_{t+1} = 1) = γ and P(r̂_{t+1} = 0) = 1 − γ. Then, we only need to tackle the value of Pr(s_{t+1} | x_t, a_t). To this end, we distinguish between the following cases.
  • When x_t = (0, r̂_t), the estimate at time t is correct (i.e., X̂_t = X_t). Hence, for the receiver, X_t carries no new information about the source process. In other words, X̂_{t+1} = X̂_t regardless of whether an update is transmitted at time t. We recall that U_{t+1} = U_t if X̂_{t+1} ≠ X_{t+1} and U_{t+1} = t+1 otherwise. Since the source is binary, we obtain U_{t+1} = U_t if X_{t+1} ≠ X_t, which happens with probability p, and U_{t+1} = t+1 otherwise. According to (2), we obtain
    Pr(1 \mid (0, \hat{r}_t), a_t) = p,
    Pr(0 \mid (0, \hat{r}_t), a_t) = 1 - p.
  • When a_t = 0 and x_t = (s_t, r̂_t), where s_t > 0, the channel will not be used and no new update will be received by the receiver, and so X̂_{t+1} = X̂_t. We recall that U_{t+1} = U_t if X̂_{t+1} ≠ X_{t+1} and U_{t+1} = t+1 otherwise. Since X_t ≠ X̂_t and the source is binary, we have U_{t+1} = U_t if X_{t+1} = X_t, which happens with probability 1 − p, and U_{t+1} = t+1 otherwise. According to (2), we obtain
    Pr(s_t + 1 \mid (s_t, \hat{r}_t), a_t = 0) = 1 - p,
    Pr(0 \mid (s_t, \hat{r}_t), a_t = 0) = p.
  • When a_t = 1 and x_t = (s_t, 1), where s_t > 0, the transmission attempt will succeed with probability 1 − p_e^1 and fail with probability p_e^1. We recall that U_{t+1} = U_t if X̂_{t+1} ≠ X_{t+1} and U_{t+1} = t+1 otherwise. Then, when the transmission attempt succeeds (i.e., X̂_{t+1} = X_t), U_{t+1} = U_t if X_{t+1} ≠ X_t and U_{t+1} = t+1 otherwise. When the transmission attempt fails (i.e., X̂_{t+1} = X̂_t ≠ X_t), we have U_{t+1} = U_t if X_{t+1} = X_t and U_{t+1} = t+1 otherwise. Combining (2) with the dynamics of the source process, we obtain
    Pr(s_t + 1 \mid (s_t, 1), a_t = 1) = p_e^1 (1 - p) + (1 - p_e^1) p \triangleq \alpha,
    Pr(0 \mid (s_t, 1), a_t = 1) = p_e^1 p + (1 - p_e^1)(1 - p) = 1 - \alpha.
  • When a_t = 1 and x_t = (s_t, 0), where s_t > 0, following the same line, we obtain
    Pr(s_t + 1 \mid (s_t, 0), a_t = 1) = p_e^0 p + (1 - p_e^0)(1 - p) \triangleq \beta,
    Pr(0 \mid (s_t, 0), a_t = 1) = p_e^0 (1 - p) + (1 - p_e^0) p = 1 - \beta.
Combining these cases, we obtain the value of Pr(s_{t+1} | x_t, a_t) in all cases. As only M out of N updates are allowed per transmission attempt, we require that transmission attempts always help minimize AoII. This is equivalent to imposing Pr(s_{t+1} > s_t | (s_t, r̂_t), a_t = 0) > Pr(s_{t+1} > s_t | (s_t, r̂_t), a_t = 1) for any (s_t, r̂_t). Leveraging the results above, it is sufficient to require p < 0.5. As all the users share the same structure, we assume, for the rest of this paper, that 0 < p_i < 0.5 for 1 ≤ i ≤ N.
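The single-user dynamics above boil down to the probability that s grows by one (the complementary event being a reset to zero). A minimal sketch of that mapping, with hypothetical argument names, is given below.

```python
def increase_prob(s, r_hat, a, p, pe1, pe0):
    """Probability that s_{t+1} = s_t + 1 for one user (Section 2.3).

    s, r_hat : current AoII state and CSI estimate
    a        : 1 if the update is transmitted, 0 otherwise
    p        : source transition probability (0 < p < 0.5)
    pe1, pe0 : Pr[r != r_hat | r_hat = 1] and Pr[r != r_hat | r_hat = 0]
    The complementary probability is Pr[s_{t+1} = 0].
    """
    if s == 0:                  # estimate correct: the action is irrelevant
        return p
    if a == 0:                  # idle: mismatch persists unless the source flips back
        return 1 - p
    if r_hat == 1:              # transmit under a "good" CSI estimate
        return pe1 * (1 - p) + (1 - pe1) * p        # alpha
    return pe0 * p + (1 - pe0) * (1 - p)            # beta
```

For instance, with p = 0.2 and p_e^1 = 0.1, transmitting under r̂ = 1 yields an increase probability of α = 0.1 × 0.8 + 0.9 × 0.2 = 0.26, compared with 1 − p = 0.8 when idling.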

2.4. Problem Formulation

The communication goal is to minimize the expected AoII. Therefore, the problem can be formulated as the following
(4a)  \arg\min_{\phi \in \Phi} \ \lim_{T \to \infty} \frac{1}{T} \mathbb{E}^{\phi}\left( \sum_{t=0}^{T-1} \sum_{i=1}^{N} f_i(s_{i,t}) \right)
(4b)  \text{subject to} \quad \sum_{i=1}^{N} a_{i,t} = M, \quad \forall t,
where Φ is the set of all causal policies. We refer to the constrained minimization problem reported in (4) as the Primal Problem (PP). We notice that PP is a Restless Multi-Armed Bandit (RMAB) problem. Obtaining the optimal policy for this type of problem is generally intractable, since the problem is PSPACE-hard [20]. However, we can still derive the structural properties of the optimal policy. These structural properties can serve as a guide for the development of scheduling policies and as an indication of the good performance of the developed scheduling policies.

3. Structural Properties of the Optimal Policy

In this section, we investigate the structural properties of the optimal policy for PP. We first define an infinite-horizon average-cost Markov Decision Process (MDP) M^N(w, M) = (X^N, A^N(M), P^N, C^N(w)), where
  • X N denotes the state space. The state is x = ( x 1 , , x N ) where x i = ( s i , r ^ i ) .
  • A^N(M) denotes the action space. The feasible action is a = (a_1, …, a_N), where a_i ∈ {0,1} and ∑_{i=1}^{N} a_i = M. Note that the feasible actions are independent of the state and the time.
  • P^N denotes the state transition probabilities. We define P_{x,x'}(a) as the probability that action a at state x will lead to state x'. It is calculated by
    P_{x,x'}(a) = \prod_{i=1}^{N} P(\hat{r}'_i)\, P_{s_i, s'_i}(a_i, \hat{r}_i),
    where P_{s_i,s'_i}(a_i, r̂_i) is the transition probability from s_i to s'_i when the estimate of CSI is r̂_i and action a_i is taken. The values of P_{s_i,s'_i}(a_i, r̂_i) can be obtained easily from the results in Section 2.3.
  • C^N(w) denotes the instant cost. When the system is at state x and action a is taken, the instant cost is C(x, a) ≜ ∑_{i=1}^{N} C(x_i, a_i) ≜ ∑_{i=1}^{N} [f_i(s_i) + w a_i].
We notice that PP can be cast into M N ( 0 , M ) . Since w = 0 , the instant cost is independent of action a . Therefore, we abbreviate C ( x , a ) as C ( x ) . To simplify the analysis, we consider the case of M = 1 . Equivalently, we investigate the structural properties of the optimal policy for M N ( 0 , 1 ) .
Remark 2.
For the case of M > 1 , we can apply the same methodology. However, as M increases, the action space will grow quickly, resulting in the need to consider more feasible actions in each step of the proof. Hence, to better demonstrate the methodology, we only consider the case of M = 1 in this paper.
It is well known that the optimal policy for M N ( 0 , 1 ) can be characterized by the value function. We denote the value function of state x as V ( x ) . A canonical procedure to calculate V ( x ) is applying the Value Iteration Algorithm (VIA). To this end, we define V ν ( · ) as the estimated value function at iteration ν of VIA and initialize V 0 ( · ) = 0 . Then, VIA updates the estimated value functions in the following way
V_{\nu+1}(x) = C(x) - \theta + \min_{a \in A^N(1)} \sum_{x' \in X^N} P_{x,x'}(a)\, V_{\nu}(x'),
where θ is the optimal value of M^N(0,1). VIA is guaranteed to converge to the value function [21]. More precisely, V_ν(·) = V(·) when ν → +∞. However, the exact value function is impossible to obtain since we would need infinitely many iterations and the state space is infinite. Instead, we provide two structural properties of the value function.
Lemma 1 (Monotonicity).
For M^N(0,1), V(x) is non-decreasing in s_i for 1 ≤ i ≤ N.
Proof. 
Leveraging the iterative nature of VIA, we use mathematical induction to prove the desired results. The complete proof can be found in Appendix A.    ☐
Before introducing the next structural property, we make the following definition.
Definition 1 (Statistically identical).
Two users are said to be statistically identical if the user-dependent parameters and the adopted time penalty functions are the same.
For the users that are statistically identical, we can prove the following
Lemma 2 (Equivalence).
For M N ( 0 , 1 ) , if users j and k are statistically identical, V ( x ) = V ( P ( x ) ) where P ( x ) is state x with x j and x k exchanged.
Proof. 
Leveraging the iterative nature of VIA, we use mathematical induction to prove the desired results. At each iteration, we show that for each feasible action at state x , we can find an equivalent action at state P ( x ) . Two actions are equivalent if they lead to the same value function. The complete proof can be found in Appendix B.    ☐
Equipped with the above lemmas, we proceed with characterizing the structural properties of the optimal policy. We recall that the optimal action at each state can be characterized by the value function. Hence, we denote, by V j ( x ) , the value function resulting from choosing user j to update at state x . Then, V j ( x ) can be calculated by
V^j(x) = C(x) - \theta + \sum_{x'} \left[ \prod_{i \neq j} P_{x_i, x'_i}(0) \right] P(\hat{r}'_j)\, P_{s_j, s'_j}(1, \hat{r}_j)\, V(x').
If V^j(x) < V^k(x) for all k ≠ j, it is optimal to transmit the update from user j. When V^j(x) = V^k(x), the two choices are equally desirable. In the following, we will characterize the properties of δ_{j,k}(x) ≜ V^j(x) − V^k(x) for any j and k.
Theorem 1 (Structural properties).
For M^N(0,1), δ_{j,k}(x) has the following properties:
  • δ_{j,k}(x) ≤ 0 if r̂_k = p_{e,k}^0 = 0. The equality holds when s_j = 0 or r̂_j = p_{e,j}^0 = 0.
  • δ_{j,k}(x) is non-increasing in r̂_j and is non-decreasing in r̂_k when s_j, s_k > 0. At the same time, δ_{j,k}(x) is independent of r̂_i for any i ≠ j, k.
  • δ_{j,k}(x) ≤ 0 if s_k = 0. The equality holds when s_j = 0 or r̂_j = p_{e,j}^0 = 0.
  • δ_{j,k}(x) is non-increasing in s_j if Γ_j^{r̂_j} ≤ Γ_k^{r̂_k} and is non-decreasing in s_k if Γ_j^{r̂_j} ≥ Γ_k^{r̂_k}, when s_j, s_k > 0. We define Γ_i^1 ≜ α_i/(1 − p_i) and Γ_i^0 ≜ β_i/(1 − p_i) for 1 ≤ i ≤ N.
  • δ_{j,k}(x) ≤ 0 if s_j ≥ s_k, r̂_j ≥ r̂_k, and users j and k are statistically identical.
Proof. 
The proof can be found in Appendix C.    ☐
We notice that Γ_i^{r̂_i} can be written as
\Gamma_i^{\hat{r}_i} = \frac{Pr(s_i + 1 \mid (s_i, \hat{r}_i), a_i = 1)}{Pr(s_i + 1 \mid (s_i, \hat{r}_i), a_i = 0)} < 1,
where s_i can be any positive integer. Consequently, Γ_i^{r̂_i} is independent of any s_i > 0 and indicates the decrease in the probability of increasing s_i caused by action a_i = 1. When Γ_i^{r̂_i} is large, action a_i = 1 achieves only a small decrease in the probability of increasing s_i. In the following, we provide an intuitive interpretation of why the monotonicity in Property 4 of Theorem 1 depends on Γ_i^{r̂_i}. We take the case of Γ_j^{r̂_j} ≤ Γ_k^{r̂_k} as an example and assume that there are only users j and k in the system. Then, according to Section 2.3, the dynamics of s_j and s_k can be divided into the following three cases.
  • Neither s j nor s k increases. In this case, both s j and s k become zero.
  • Either s_j or s_k increases and the other becomes zero. We denote by P_j^k the probability that only s_k increases when a_j = 1. The notation for the other cases is defined analogously. The probabilities can be obtained easily using the results in Section 2.3.
  • Both s_j and s_k increase. We denote by P_j the probability that both s_j and s_k increase when a_j = 1. P_k is defined analogously. The probabilities can be obtained easily using the results in Section 2.3.
We notice that δ_{j,k}(x) reflects the tendency of the base station when choosing between the two users. The larger δ_{j,k}(x) is, the more the base station tends to choose user k. Thus, we investigate the base station's propensity to choose user k when s_k increases but s_j stays the same. We ignore the case where the resulting s_k is zero since it is independent of the increase in s_k. With this in mind, we first notice that P_k^k ≤ P_j^k. Meanwhile, we can easily verify that P_j / P_k = Γ_j^{r̂_j} / Γ_k^{r̂_k}. When Γ_j^{r̂_j} ≤ Γ_k^{r̂_k}, we have P_j ≤ P_k. Then, there exists a subtle trade-off. More precisely, choosing user k will result in P_k^k ≤ P_j^k, but at the cost of P_k ≥ P_j. Hence, in this case, the propensity of the base station is hard to determine. Following the same line, we can show that choosing user j will lead to P_j^j ≤ P_k^j and P_j ≤ P_k. Thus, there exists no such trade-off when we investigate the base station's propensity to choose user j as s_j increases but s_k stays the same.
Leveraging Theorem 1, we can provide some specific structural properties of the optimal policy.
Corollary 1 (Application of Theorem 1).
When M = 1 , the optimal policy for PP must satisfy the following
  • The user i with r ^ i = p e , i 0 = 0 or s i = 0 will not be chosen unless it is to break the tie.
  • When user j is chosen at state x_1, then for any state x_2 such that r̂_{1,j} ≤ r̂_{2,j} and s_{1,i} = s_{2,i} for 1 ≤ i ≤ N, the optimal choice must be in the set G = {j} ∪ {k : r̂_{1,k} < r̂_{2,k}}.
  • When N = 2, we consider two states, x_1 and x_2, which differ only in the value of s_j. Specifically, s_{1,j} ≤ s_{2,j}. If user j is chosen at state x_1 and Γ_j^{r̂_{1,j}} ≤ Γ_k^{r̂_{1,k}}, the optimal choice at state x_2 will also be user j.
  • When N = 2, we consider two states, x_1 and x_2, which differ only in the value of s_k. Specifically, s_{1,k} ≥ s_{2,k}. If user j is chosen at state x_1 and Γ_j^{r̂_{1,j}} ≥ Γ_k^{r̂_{1,k}}, the optimal choice at state x_2 will also be user j.
  • When all users are statistically identical, the optimal choice at any time slot must be either the user with x = (s_{max,1}, 1), where s_{max,1} ≜ max{s_i : r̂_i = 1}, or the user with x = (s_{max,0}, 0), where s_{max,0} ≜ max{s_i : r̂_i = 0}. Moreover,
    • If s_{max,1} ≥ s_{max,0}, it is optimal to choose the user with x = (s_{max,1}, 1).
    • If s m a x , 1 < s m a x , 0 , the optimal choice will switch from the user with x = ( s m a x , 0 , 0 ) to the user with x = ( s m a x , 1 , 1 ) when s m a x , 1 increases from 0 to s m a x , 0 solely.
Proof. 
The first property follows directly from Property 1 and Property 3 of Theorem 1. For the second property, leveraging Property 2 of Theorem 1, we have δ_{j,k}(x_2) ≤ δ_{j,k}(x_1) ≤ 0 if r̂_{1,j} ≤ r̂_{2,j}, r̂_{1,k} ≥ r̂_{2,k}, and s_{1,i} = s_{2,i} for 1 ≤ i ≤ N. Thus, the optimal choice will not be user k in this case. Then, we can conclude that the optimal choice must be in the set G = {j} ∪ {k : r̂_{1,k} < r̂_{2,k}}.
For the third property, we have proved in Property 4 of Theorem 1 that δ_{j,k}(x) is non-increasing in s_j if Γ_j^{r̂_j} ≤ Γ_k^{r̂_k}. Hence, δ_{j,k}(x_2) ≤ δ_{j,k}(x_1) ≤ 0. As we consider the case of N = 2, the optimal choice at state x_2 will also be user j. The fourth property can be shown in a similar way by noticing that δ_{j,k}(x) is non-decreasing in s_k when Γ_j^{r̂_j} ≥ Γ_k^{r̂_k}.
For the last property, we recall from Property 5 of Theorem 1 that it is always better to choose the user with a larger s if they are statistically identical and have the same r ^ . Thus, we can conclude that the optimal choice must be either the user with x = ( s m a x , 1 , 1 ) or the user with x = ( s m a x , 0 , 0 ) . Without a loss of generality, we assume x j = ( s m a x , 1 , 1 ) and x k = ( s m a x , 0 , 0 ) . Now, we distinguish between the following cases
  • According to Property 5 of Theorem 1, we can conclude that it is optimal to choose user j when s_{max,1} ≥ s_{max,0}.
  • To determine the optimal choice in the case of s_{max,1} < s_{max,0}, we recall that the optimal choice will be user k (i.e., δ_{j,k}(x) ≥ 0) if s_j = 0 and will be user j (i.e., δ_{j,k}(x) ≤ 0) if s_j = s_k. At the same time, Property 4 of Theorem 1 tells us that δ_{j,k}(x) is non-increasing in s_j when users j and k are statistically identical. Therefore, we can conclude that the optimal choice will switch from user k to user j when s_j increases from 0 to s_k solely.
   ☐

4. Whittle’s Index Policy

Whittle’s index policy is a well-known low-complexity heuristic that shows a strong performance in many problems that belong to RMAB [22,23,24]. In this section, we develop Whittle’s index policy for PP. We first present the general procedures we adopt to obtain Whittle’s index.
  • We first formulate a relaxed version of PP and apply the Lagrangian approach.
  • Then, we decouple the problem of minimizing the Lagrangian function into N decoupled problems, each of which only considers a single user. By casting the decoupled problem into an MDP, we investigate the structural properties and performance of the optimal policy.
  • Leveraging the results above and under a simple condition, we establish the indexability of the decoupled problem.
  • Finally, we obtain the expression of Whittle’s index by solving the Bellman equation.

4.1. Relaxed Problem

The first step in obtaining Whittle’s index is to formulate the Relaxed Problem (RP). More precisely, instead of requiring the limit on the number of updates allowed per transmission attempt to be met in each time slot, we relax the constraint such that the limit is not violated in an average sense. Then, RP can be formulated as
\arg\min_{\phi \in \Phi} \ \bar{\Delta} \triangleq \lim_{T \to \infty} \frac{1}{T} \mathbb{E}^{\phi}\left( \sum_{t=0}^{T-1} \sum_{i=1}^{N} f_i(s_{i,t}) \right)
\text{subject to} \quad \bar{\rho}^{\phi} \triangleq \lim_{T \to \infty} \frac{1}{T} \mathbb{E}^{\phi}\left( \sum_{t=0}^{T-1} \sum_{i=1}^{N} a_{i,t} \right) \leq M.
As RP is specified, we apply the Lagrangian approach. First of all, we write RP into its Lagrangian form.
L(\lambda, \phi) = \lim_{T \to \infty} \frac{1}{T} \mathbb{E}^{\phi}\left[ \sum_{t=0}^{T-1} \sum_{i=1}^{N} \left( f_i(s_{i,t}) + \lambda a_{i,t} \right) \right] - \lambda M,
where λ ≥ 0 is the Lagrange multiplier. Then, we investigate the problem of minimizing the Lagrangian function. Since λM is independent of the policy, we can ignore it. More precisely, we consider the following minimization problem
\min_{\phi \in \Phi} \ \lim_{T \to \infty} \frac{1}{T} \mathbb{E}^{\phi}\left( \sum_{t=0}^{T-1} \sum_{i=1}^{N} \left( f_i(s_{i,t}) + \lambda a_{i,t} \right) \right).

4.2. Decoupled Model

In this section, we formulate the decoupled problem and investigate its optimal policy. The decoupled model associated with each user follows the system model with N = 1 . Since all the users share the same structure, we drop the user-dependent subscript i for simplicity. Then, the decoupled problem can be formulated as
\min_{\phi \in \Phi} \ \lim_{T \to \infty} \frac{1}{T} \mathbb{E}^{\phi}\left( \sum_{t=0}^{T-1} \left( f(s_t) + \lambda a_t \right) \right),
where Φ is the set of all causal policies when N = 1 . We notice that problem (8) can be cast into the MDP M 1 ( λ , 1 ) . We define M = 1 when there is no restriction on the number of updates allowed per transmission attempt.
We first investigate the structural properties of the optimal policy for M 1 ( λ , 1 ) when λ is a given non-negative constant. We start with characterizing the corresponding value function V ( x ) .
Corollary 2 (Extension of Lemma 1).
For M 1 ( λ , 1 ) , V ( x ) is non-decreasing in s.
Proof. 
The proof follows the same steps as in the proof of Lemma 1. The complete proof can be found in Appendix D.    ☐
Equipped with the above corollary, we can characterize the structural properties of the optimal policy for (8).
Proposition 1 (Optimal policy for decoupled problem).
The optimal policy for the decoupled problem is a threshold policy with the following properties.
  • The optimal policy can be fully captured by n = (n_0, n_1). More precisely, when the system is at state (s, r̂), it is optimal to make a transmission attempt only when s ≥ n_{r̂}.
  • n_0 ≥ n_1 > 0.
Proof. 
We define ΔV(x) ≜ V^1(x) − V^0(x), where V^a(x) is the value function resulting from taking action a at state x. Then, the optimal action at state x is a = 1 if ΔV(x) < 0, and a = 0 is optimal otherwise. We use Corollary 2 to characterize the sign of ΔV(x). The complete proof can be found in Appendix E.    ☐
In the following, we evaluate the performance of the threshold policy detailed in Proposition 1. More precisely, we calculate the expected AoII Δ ¯ n and the expected transmission rate ρ ¯ n resulting from the adoption of threshold policy n . We will see in the following that Δ ¯ n and ρ ¯ n are essential for establishing the indexability and obtaining the expression of Whittle’s index.
Proposition 2 (Performance).
Under threshold policy n = ( n 0 , n 1 ) ,
\bar{\Delta}_n = \pi_0\, p \left[ \sum_{k=1}^{n_1 - 1} f(k)(1-p)^{k-1} + (1-p)^{n_1 - 1} \left( \sum_{k=n_1}^{n_0 - 1} f(k)\, c_1^{k - n_1} + c_1^{n_0 - n_1} \sum_{k=n_0}^{+\infty} f(k)\, c_2^{k - n_0} \right) \right],
\bar{\rho}_n = \pi_0\, p\, (1-p)^{n_1 - 1} \left[ \frac{\gamma}{1 - c_1} + c_1^{n_0 - n_1} \left( \frac{1}{1 - c_2} - \frac{\gamma}{1 - c_1} \right) \right],
where
\pi_0 = \left[ 2 + p(1-p)^{n_1 - 1} \left( \frac{1}{1 - c_1} - \frac{1}{p} + c_1^{n_0 - n_1} \left( \frac{1}{1 - c_2} - \frac{1}{1 - c_1} \right) \right) \right]^{-1},
c_1 = (1 − γ)(1 − p) + γα, and c_2 = (1 − γ)β + γα.
Proof. 
We notice that the dynamic of AoII under the threshold policy can be fully captured by a Discrete-Time Markov Chain (DTMC). Then, combined with the fact that r ^ is an independent Bernoulli random variable, we can obtain the desired results from the stationary distribution of the induced DTMC. The complete proof can be found in Appendix F.    ☐
As f(·) can be any non-decreasing function, Δ̄ can grow indefinitely. Thus, it is necessary to require that there exists at least one threshold policy that yields a finite Δ̄. By noting that 1 − p ≥ c_1 ≥ c_2, we have
\bar{\Delta}_n \geq \pi_0\, p \left[ \sum_{k=1}^{n_1 - 1} f(k)\, c_2^{k-1} + c_2^{n_1 - 1} \left( \sum_{k=n_1}^{n_0 - 1} f(k)\, c_2^{k - n_1} + c_2^{n_0 - n_1} \sum_{k=n_0}^{+\infty} f(k)\, c_2^{k - n_0} \right) \right] = \pi_0\, p \sum_{k=1}^{+\infty} f(k)\, c_2^{k-1}.
The equality is achieved when n_0 = n_1 = 1. Then, we can conclude that it is sufficient to require \sum_{k=1}^{+\infty} f(k)\, c_2^{k-1} < +\infty. This will be the underlying assumption throughout the rest of this paper.
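As a numerical companion to Proposition 2, the sketch below evaluates Δ̄_n and ρ̄_n for a finite threshold pair (n_0, n_1); the truncation level k_max for the infinite sum, the argument names, and the default linear penalty are our own choices.

```python
def threshold_policy_performance(n0, n1, p, gamma, pe1, pe0,
                                 f=lambda s: s, k_max=10_000):
    """Expected AoII and transmission rate of threshold policy n = (n0, n1),
    following Proposition 2 (sketch; finite n0, truncated tail sum)."""
    alpha = pe1 * (1 - p) + (1 - pe1) * p
    beta = pe0 * p + (1 - pe0) * (1 - p)
    c1 = (1 - gamma) * (1 - p) + gamma * alpha
    c2 = (1 - gamma) * beta + gamma * alpha

    pi0 = 1.0 / (2 + p * (1 - p) ** (n1 - 1)
                 * (1 / (1 - c1) - 1 / p
                    + c1 ** (n0 - n1) * (1 / (1 - c2) - 1 / (1 - c1))))

    head = sum(f(k) * (1 - p) ** (k - 1) for k in range(1, n1))
    mid = sum(f(k) * c1 ** (k - n1) for k in range(n1, n0))
    tail = sum(f(k) * c2 ** (k - n0) for k in range(n0, k_max))
    aoii = pi0 * p * (head + (1 - p) ** (n1 - 1) * (mid + c1 ** (n0 - n1) * tail))

    rate = pi0 * p * (1 - p) ** (n1 - 1) * (
        gamma / (1 - c1) + c1 ** (n0 - n1) * (1 / (1 - c1) * (-gamma) + 1 / (1 - c2) + 0)
        if False else
        gamma / (1 - c1) + c1 ** (n0 - n1) * (1 / (1 - c2) - gamma / (1 - c1)))
    return aoii, rate

# Example: evaluate the policy (n0, n1) = (5, 2).
print(threshold_policy_performance(5, 2, p=0.2, gamma=0.7, pe1=0.1, pe0=0.0))
```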

4.3. Indexability

In this section, we establish the indexability of the decoupled problem, which ensures the existence of Whittle’s index. We start with the definition of indexability.
Definition 2
(Indexability). The decoupled problem is indexable if the set of states in which a = 0 is the optimal action increases with λ, that is,
\lambda < \lambda' \implies D(\lambda) \subseteq D(\lambda'),
where D ( λ ) is the set of states in which a = 0 is optimal when Lagrange multiplier λ is adopted.
The Lagrange multiplier λ can be viewed as a cost associated with each transmission attempt. Intuitively, as  λ increases, the base station should stay idle (i.e., a = 0 ) for a longer time until s becomes large enough to offset the cost. Although it is intuitively correct that the decoupled problem is indexable, the indexability is hard to establish as the optimal policy is characterized by two thresholds. Thus, Whittle’s index does not necessarily exist. However, the indexability can be established when the following condition is satisfied
p_{e,i}^0 = 0 \quad \text{for } 1 \leq i \leq N. \qquad (9)
Remark 3.
Condition (9) only requires the estimate r̂_i to be perfect when r̂_i = 0. In the case of r̂_i = 1, we still allow the estimate to be inaccurate.
When (9) is satisfied, Propositions 1 and 2 reduce to the following
Corollary 3 (Consequences of (9)).
When (9) is satisfied, the optimal policy for the decoupled problem (8) is the threshold policy n = (+∞, n). The corresponding Δ̄_n and ρ̄_n are
\bar{\Delta}_n = \pi_0\, p \left[ \sum_{k=1}^{n-1} f(k)(1-p)^{k-1} + (1-p)^{n-1} \sum_{k=n}^{+\infty} f(k)\, c_1^{k-n} \right],
\bar{\rho}_n = \pi_0\, p\, (1-p)^{n-1} \frac{\gamma}{1 - c_1},
where
\pi_0 = \left[ 2 + p(1-p)^{n-1} \left( \frac{1}{1 - c_1} - \frac{1}{p} \right) \right]^{-1}.
Proof. 
We continue with the same notations as in the proofs of Propositions 1 and 2. It is sufficient to show that n_0 = +∞. To this end, we consider the state x = (s, 0). By following the same steps as in the proof of Proposition 1, we have
\Delta V(s, 0) = \lambda \geq 0.
Therefore, it is optimal to stay idle (i.e., a = 0) at state x = (s, 0) for any s ≥ 0. Equivalently, n_0 = +∞. Then, the corresponding Δ̄_n and ρ̄_n can be calculated as a special case of Proposition 2 where n_0 = +∞, n_1 = n, and p_e^0 = 0.    ☐
Leveraging Corollary 3, we can establish the indexability of the decoupled problem.
Proposition 3 (Indexability of decoupled problem).
The decoupled problem is indexable when (9) is satisfied.
Proof. 
According to Proposition 2.2 of [25], we only need to verify that the expected transmission rate ρ ¯ n is strictly decreasing in n. From Corollary 3, we have
\bar{\rho}_n = \frac{\gamma p}{(1 - c_1)\left( \dfrac{2}{(1-p)^{n-1}} + \dfrac{p}{1 - c_1} - 1 \right)}.
As 1/2 < 1 − p < 1, we can easily verify that ρ̄_n is strictly decreasing in n. Thus, the decoupled problem is indexable when (9) is satisfied.    ☐

4.4. Whittle’s Index Policy

In this section, we proceed with finding the expression of Whittle’s index and defining Whittle’s index policy. First of all, we give the definition of Whittle’s index.
Definition 3 (Whittle’s index).
When the decoupled problem is indexable, Whittle’s index at state x is defined as the infimum λ, such that both actions are equally desirable. Equivalently, Whittle’s index at state x is defined as the infimum λ such that V 0 ( x ) = V 1 ( x ) .
Let us denote by W x the Whittle’s index at state x. Then, the expression of Whittle’s index is given by the following Proposition.
Proposition 4 (Whittle’s index).
When (9) is satisfied, Whittle’s index is
W_x = \begin{cases} 0 & \text{when } x = (0, \hat{r}) \text{ or } x = (s, 0), \\[4pt] \dfrac{(1 - c_1) \sum_{k=s+1}^{+\infty} f(k)\, c_1^{k-s-1} - \bar{\Delta}_s}{\dfrac{(1 - c_1)(1 - p)}{\gamma (1 - p - \alpha)} - \dfrac{c_1}{1 - p - \alpha} + \bar{\rho}_s} & \text{when } x = (s, 1), \end{cases}
where s > 0 and c_1 = (1 − γ)(1 − p) + γα. Δ̄_s and ρ̄_s are the expected AoII and the expected transmission rate when threshold policy n = (+∞, s) is adopted, respectively. At the same time, W_x is non-negative and is non-decreasing in s.
Proof. 
Whittle’s indexes at state x = ( 0 , r ^ ) and x = ( s , 0 ) are obtained easily from the proof of Proposition 1. For state x = ( s , 1 ) , we first use backward induction to calculate the expressions of some value functions. Then, the expression of Whittle’s index can be obtained from its definition. The complete proof can be found in Appendix G.    ☐
Definition 4
(Whittle’s index policy). At any state x = ( x 1 , x 2 , , x N ) , the base station will transmit the updates from M users with the largest W x i . The ties are broken arbitrarily. W x i is calculated using Proposition 4 with the parameters of user i.
Remark 4.
Whittle’s index policy possesses the structural properties detailed in Corollary 1.
  • The first two properties can be verified by noting that W_{x_i} ≥ 0 and the equality holds when r̂_i = 0 or s_i = 0. At the same time, W_{x_i} is non-decreasing in r̂_i.
  • The third and fourth properties can be verified by noting that W x i is non-decreasing in s i .
  • For the last property, we first notice that W x j = W x k when users j and k are statistically identical and x j = x k . Then, the property can be verified by noting that W x i is non-decreasing in both s i and r ^ i .

5. Optimal Policy for Relaxed Problem

In this section, we provide an efficient algorithm to obtain the optimal policy for RP, based on which we will develop another scheduling policy for PP in the next section that is free from indexability. At the same time, the performance of the optimal policy for RP forms a universal lower bound because the following ordering holds
\bar{\Delta}_{AoII}^{RP} \leq \bar{\Delta}_{AoII}^{PP},
where Δ ¯ A o I I R P and Δ ¯ A o I I P P are the minimal expected AoII of RP and PP, respectively.
Remark 5.
Note that the optimal policy for RP may not necessarily be a valid policy for PP, as the transmitter may transmit more than M updates in one transmission attempt under RP-optimal policy.
To solve RP, we follow the discussion in Section 4.1. More precisely, we take the Lagrangian approach and consider the problem reported in (7). We will see in the following discussion that the optimal policy for RP can be characterized by the optimal policies for problem (7). Therefore, we first cast problem (7) into the MDP M N ( λ , 1 ) . However, the optimal policy for M N ( λ , 1 ) is difficult to obtain because the state space is infinite. Even though we can make the state space finite by imposing an upper limit on the value of s, the state space and the action space grow exponentially with the number of users in the system. To overcome the difficulty, we investigate the optimal policy for M 1 i ( λ , 1 ) where 1 i N . The superscript i means that the only user in the system is user i. We will show later that the optimal policy for M N ( λ , 1 ) can be fully characterized by the optimal policies for M 1 i ( λ , 1 ) where 1 i N .

5.1. Optimal Policy for Single User

In this section, we tackle the problem of finding the optimal policy for M_1^i(λ,1). Since the users share the same structure, we ignore the superscript i for simplicity. To find the optimal policy, we first use the Approximating Sequence Method (ASM) introduced in [26] to make the state space finite. More precisely, we impose s ≤ m, where m is a predetermined upper limit. The state transition probabilities P_{s,s'}(a, r̂) are modified in the following way
\tilde{P}_{s,s'}(a, \hat{r}) = \begin{cases} P_{s,s'}(a, \hat{r}) & \text{if } s' < m, \\ P_{s,s'}(a, \hat{r}) + \sum_{z > m} P_{s,z}(a, \hat{r}) & \text{if } s' = m. \end{cases}
The action space and the instant cost remain unchanged. Then, we can apply Relative Value Iteration (RVI) with convergence criterion ϵ to obtain the optimal policy. We notice that M^1(λ,1) coincides with the decoupled model studied in Section 4.2. Hence, we can utilize the threshold structure of the optimal policy to improve RVI. To this end, we classify a state as active if the optimal action at this state is a = 1. Then, the threshold structure detailed in Proposition 1 tells us the following: for any state x, if there exists an active state x_1 with s_1 ≤ s and r̂_1 ≤ r̂, then x must also be active. Hence, we can determine the optimal action at state x immediately instead of comparing all feasible actions. In this way, we can reduce the running time of RVI. The pseudocode for the improved RVI can be found in Algorithm A1 of Appendix M. A similar technique is also presented in [5].
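A compact, non-optimized rendering of this procedure is sketched below; the function name is ours, and the plain sweep over both actions omits the threshold-based speed-up of Algorithm A1.

```python
def rvi_single_user(lam, p, gamma, pe1, pe0, f=lambda s: s,
                    m=800, eps=1e-2, ref=(0, 0)):
    """Relative Value Iteration for the truncated single-user MDP (sketch).

    lam : Lagrange multiplier (per-transmission cost)
    m   : ASM truncation level, ref : reference state
    Returns (relative value function V, greedy policy).
    """
    states = [(s, r) for s in range(m + 1) for r in (0, 1)]
    V = {x: 0.0 for x in states}

    def q_up(s, r_hat, a):
        # probability that s increases by one (Section 2.3)
        if s == 0:
            return p
        if a == 0:
            return 1 - p
        return pe1 * (1 - p) + (1 - pe1) * p if r_hat == 1 \
            else pe0 * p + (1 - pe0) * (1 - p)

    def q_a(x, a, Vcur):
        s, r_hat = x
        up = q_up(s, r_hat, a)
        s_next = min(s + 1, m)                     # ASM: fold the tail onto m
        nxt = sum(pr * (up * Vcur[(s_next, r)] + (1 - up) * Vcur[(0, r)])
                  for r, pr in ((1, gamma), (0, 1 - gamma)))
        return f(s) + lam * a + nxt

    while True:
        V_new = {x: min(q_a(x, 0, V), q_a(x, 1, V)) for x in states}
        offset = V_new[ref]
        V_new = {x: v - offset for x, v in V_new.items()}
        if max(abs(V_new[x] - V[x]) for x in states) < eps:
            V = V_new
            break
        V = V_new

    policy = {x: int(q_a(x, 1, V) < q_a(x, 0, V)) for x in states}
    return V, policy
```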
For M^1(λ,1), when condition (9) is satisfied, Whittle's index exists and can be calculated efficiently using Proposition 4. Therefore, we can obtain the optimal policy using Whittle's index and further reduce the computational complexity. To this end, we denote by n_λ the optimal policy for M^1(λ,1) and present the following proposition.
Proposition 5 (Optimal deterministic policy).
When (9) is satisfied, the optimal policy for M^1(λ,1) is n_λ = (+∞, n), where n is given by
n = \begin{cases} 1 & \text{if } \lambda = 0, \\ \max\{ s \in \mathbb{N}_0 : W_s \leq \lambda \} + 1 & \text{if } \lambda > 0. \end{cases}
W_s is the Whittle's index at state (s, 1).
Proof. 
We first notice that M 1 ( λ , 1 ) coincides with the decoupled model studied in Section 4.2. Then, we show the optimal action for each state with r ^ = 1 using the definition of Whittle’s index and the fact that the decoupled problem is indexable when (9) is satisfied. The complete proof can be found in Appendix H.    ☐
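In code, Proposition 5 amounts to a scan over the non-decreasing Whittle's indices. The helper below assumes a callable whittle_index(s) returning W_s at state (s, 1) and a hypothetical search cut-off s_max.

```python
def optimal_threshold(lam, whittle_index, s_max=1000):
    """Threshold n of the policy (+inf, n) that is optimal for the decoupled
    problem with transmission cost lam, under condition (9) (Proposition 5)."""
    if lam == 0:
        return 1
    n = 1  # W_0 = 0 <= lam, so the maximiser is at least s = 0
    for s in range(s_max + 1):
        if whittle_index(s) <= lam:
            n = s + 1
        else:
            break  # W_s is non-decreasing in s, so we can stop here
    return n
```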
In the following, we provide a randomized policy that is also optimal for M 1 ( λ , 1 ) . We will see later that the randomized policy is the key to obtaining the optimal policy for RP.
Theorem 2 (Optimal randomized policy).
There exist two deterministic policies n_{λ⁺} and n_{λ⁻}, which are both optimal for M^1(λ,1). We consider the following randomized policy n_λ: every time the system reaches state (0,0), the base station makes a choice between n_{λ⁻} with probability μ and n_{λ⁺} with probability 1 − μ. The chosen policy will be followed until the next choice. Then, the randomized policy n_λ is optimal for M^1(λ,1) under any μ ∈ [0,1].
Proof. 
We show that our system verifies the assumptions given in [27]. Then, leveraging the characteristics of our system, we can obtain the optimal randomized policy. The complete proof can be found in Appendix I.    ☐
In practice, we approximate λ⁺ ≈ λ + ξ and λ⁻ ≈ λ − ξ, where ξ is a small perturbation. Then, the deterministic policies n_{λ⁺} and n_{λ⁻} can be obtained by following the discussion at the beginning of this subsection. Note that, in most cases, n_{λ⁺} and n_{λ⁻} are the same.

5.2. Optimal Policy for RP

In this section, we characterize the optimal policy for RP. Let us denote by V ( x ) and V i ( x i ) the value functions of M N ( λ , 1 ) and M 1 i ( λ , 1 ) , respectively. Then, we can prove the following
Proposition 6 (Separability).
V ( x ) = i = 1 N V i ( x i ) where x = ( x 1 , , x N ) . In other words, the policy, under which each user adopts its own optimal policy, is optimal for M N ( λ , 1 ) .
Proof. 
We show V ( x ) = i = 1 N V i ( x i ) by comparing the Bellman equations they must satisfy. The complete proof can be found in Appendix J.    ☐
We denote the optimal policy for M^N(λ,1) as ϕ_λ = [n_{λ,1}, …, n_{λ,N}], where n_{λ,i} is the optimal policy for M_1^i(λ,1). For simplicity, we define Δ̄(λ) and ρ̄(λ) as the expected AoII and the expected transmission rate associated with ϕ_λ, respectively. Δ̄_i(λ) and ρ̄_i(λ) are defined analogously for user i under policy n_{λ,i}. We also define λ* ≜ inf{λ > 0 : ρ̄(λ) ≤ M}. With Proposition 6 and the above definitions in mind, we proceed with constructing the optimal policy for RP.
Theorem 3 (Optimal policy for RP).
The optimal policy for RP can be characterized by two deterministic policies ϕ_{λ⁺*} = [n_{λ⁺*,1}, …, n_{λ⁺*,N}] and ϕ_{λ⁻*} = [n_{λ⁻*,1}, …, n_{λ⁻*,N}], where n_{λ⁺*,i} and n_{λ⁻*,i} are both optimal deterministic policies for M_1^i(λ*,1). Then, we mix ϕ_{λ⁺*} and ϕ_{λ⁻*} in the following way: for each user i, every time the user reaches state (0,0), the base station makes a choice between n_{λ⁻*,i} with probability μ_i and n_{λ⁺*,i} with probability 1 − μ_i. The chosen policy will be followed by user i until the next choice. For 1 ≤ i ≤ N, the μ_i are chosen in such a way as to satisfy
\sum_{i=1}^{N} \bar{\rho}_i(\lambda^*) = \sum_{i=1}^{N} \left[ \mu_i\, \bar{\rho}_i(\lambda^{*}_{-}) + (1 - \mu_i)\, \bar{\rho}_i(\lambda^{*}_{+}) \right] = M.
Then, the mixed policy, denoted by ϕ_{λ*}, is optimal for RP.
Proof. 
According to Lemma 3.10 of [27], a policy is optimal for RP if
  • It is optimal for M N ( λ * , 1 ) ;
  • The resulting expected transmission rate is equal to M.
Then, we construct such a policy using Theorem 2 and Proposition 6. The complete proof can be found in Appendix K.    ☐
Since we approximate λ⁺* ≈ λ* + ξ and λ⁻* ≈ λ* − ξ in practice, ρ̄_i(λ⁺*) ≤ ρ̄_i(λ⁻*) for all i according to the monotonicity given by Lemma 3.4 of [27]. Combining this with the definition of λ*, we must have ρ̄(λ⁺*) ≤ M < ρ̄(λ⁻*). Therefore, we can always find μ_i's that realize (11). In this paper, we choose
\mu_i = \mu = \frac{M - \bar{\rho}(\lambda^*_{+})}{\bar{\rho}(\lambda^*_{-}) - \bar{\rho}(\lambda^*_{+})}, \quad \text{for } 1 \leq i \leq N.
Then, we describe the algorithm used to obtain the optimal policy for RP. As detailed in Theorem 3, it is essential to find λ * . To this end, we recall that, for any user i under given λ , the optimal deterministic policy n λ , i can be obtained using the results in Section 5.1 and the resulting expected transmission rate ρ ¯ i ( λ ) is given by Proposition 2. Since ρ ¯ i ( λ ) is non-increasing in λ for all i according to Lemma 3.4 of [27], ρ ¯ ( λ ) = i = 1 N ρ ¯ i ( λ ) is also non-increasing in λ . Hence, we can regard ρ ¯ ( λ ) as a non-increasing function of λ . Then, according to the definition of λ * , we can use the Bisection search to obtain λ * efficiently. The main steps can be summarized as follows.
  • Initialize λ⁻ = 0 and λ⁺ = 1.
  • Set λ⁻ = λ⁺ and λ⁺ = 2λ⁺ until ρ̄(λ⁺) < M.
  • Run the Bisection search on the interval [λ⁻, λ⁺] until the tolerance 2ξ is met.
Then, λ⁻* and λ⁺* can simply be taken as the boundaries of the final interval. The pseudocode for the Bisection search can be found in Algorithm A2 of Appendix M. After obtaining λ⁻* and λ⁺*, the optimal policy ϕ_{λ*} is detailed in Theorem 3 and the mixing probabilities μ_i's are given by (12).
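The sketch below mirrors these steps. rho_bar is assumed to be a callable returning ρ̄(λ) = ∑_i ρ̄_i(λ) for the per-user optimal policies (non-increasing in λ), for example computed via the results of Section 5.1 and Proposition 2.

```python
def find_lambda_star(rho_bar, M, xi=0.005):
    """Bisection search for lambda* = inf{lambda > 0 : rho_bar(lambda) <= M}.
    Returns the bracket (lambda_-^*, lambda_+^*) of width at most 2*xi."""
    lo, hi = 0.0, 1.0
    while rho_bar(hi) >= M:        # grow the bracket until the rate drops below M
        lo, hi = hi, 2 * hi
    while hi - lo > 2 * xi:
        mid = (lo + hi) / 2
        if rho_bar(mid) >= M:
            lo = mid
        else:
            hi = mid
    return lo, hi
```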
Remark 6.
We recall that the optimal deterministic policy for each user can be characterized by two positive thresholds (i.e., n 0 , n 1 > 0 ). Consequently, under RP-optimal policy, the base station will never choose the user at state ( 0 , r ^ ) . Then, when M increases, the expected transmission rate achieved by RP-optimal policy will saturate before M reaches N. When the expected transmission rate saturates, the RP-optimal policy is ϕ * = [ n 1 , , n N ] where n i = ( 1 , 1 ) for 1 i N . The saturation happens when M is larger than or equal to the expected transmission rate achieved by ϕ * .

6. Indexed Priority Policy

Although the performance of Whittle’s index policy is known to be good, it requires indexability, which is usually difficult to establish. In this section, based on the primal-dual heuristic introduced in [28], we develop a policy that does not require indexability and has comparable performance to Whittle’s index policy. We start with presenting the primal-dual heuristic.

6.1. Primal-Dual Heuristic

The heuristic is based on an optimal primal and dual solution pair to the linear program associated with RP. To introduce the linear program, we define π_{x_i}^{a_i}(ϕ) ≥ 0 as the expected time that user i is at state x_i and action a_i is taken according to policy ϕ. Then, for any ϕ, π_{x_i}^{a_i}(ϕ) must satisfy the following constraints
\pi_{x_i}^{0}(\phi) + \pi_{x_i}^{1}(\phi) = \sum_{x'_i} \sum_{a_i} P_{x'_i, x_i}(a_i)\, \pi_{x'_i}^{a_i}(\phi), \quad \forall x_i, i,
\sum_{x_i} \sum_{a_i} \pi_{x_i}^{a_i}(\phi) = 1, \quad \forall i.
The objective function of RP can be rewritten as
\min_{\phi \in \Phi} \ \sum_{i=1}^{N} \sum_{x_i, a_i} C(x_i)\, \pi_{x_i}^{a_i}(\phi),
where C ( x i ) = f i ( s i ) is the instant cost at state x i . The constraint on the expected transmission rate can be rewritten as
\sum_{i=1}^{N} \sum_{x_i} \pi_{x_i}^{1}(\phi) \leq M.
Thus, the linear program associated with RP can be formulated as the following
(13a)  \min_{\pi_{x_i}^{a_i}} \ \sum_{i=1}^{N} \sum_{x_i, a_i} C(x_i)\, \pi_{x_i}^{a_i}
(13b)  \text{subject to} \quad \pi_{x_i}^{0} + \pi_{x_i}^{1} - \sum_{x'_i} \sum_{a_i} P_{x'_i, x_i}(a_i)\, \pi_{x'_i}^{a_i} = 0, \quad \forall x_i, i,
(13c)  \sum_{x_i} \sum_{a_i} \pi_{x_i}^{a_i} = 1, \quad \forall i,
(13d)  \sum_{i=1}^{N} \sum_{x_i} \pi_{x_i}^{1} \leq M,
(13e)  \pi_{x_i}^{a_i} \geq 0, \quad \forall x_i, a_i, i.
The corresponding dual problem is
(14a)  \max_{\sigma, \sigma_i, \sigma_{x_i}} \ \sum_{i=1}^{N} \sigma_i - M\sigma
(14b)  \text{subject to} \quad \sigma_{x_i} + \sigma_i - \sum_{x'_i} P_{x_i, x'_i}(0)\, \sigma_{x'_i} \leq C(x_i), \quad \forall x_i, i,
(14c)  \sigma_{x_i} + \sigma_i - \sum_{x'_i} P_{x_i, x'_i}(1)\, \sigma_{x'_i} - \sigma \leq C(x_i), \quad \forall x_i, i,
(14d)  \sigma \geq 0.
Let { π ¯ x i a i } and { σ ¯ , σ ¯ i , σ ¯ x i } be the optimal primal and dual solution pair to the problems reported in (13) and (14). We define
\bar{\psi}_{x_i}^{0} = \sum_{x'_i} P_{x_i, x'_i}(0)\, \bar{\sigma}_{x'_i} + C(x_i) - \bar{\sigma}_i - \bar{\sigma}_{x_i} \geq 0,
\bar{\psi}_{x_i}^{1} = \sum_{x'_i} P_{x_i, x'_i}(1)\, \bar{\sigma}_{x'_i} + \bar{\sigma} + C(x_i) - \bar{\sigma}_i - \bar{\sigma}_{x_i} \geq 0.
For any state x = ( x 1 , , x N ) , let h ( x ) = i = 1 N 𝟙 { π ¯ x i 1 > 0 } . Then, the heuristic operates in the following way
  • If h(x) ≥ M, the base station will choose the M users with the largest ψ̄_{x_i}^0 among these h(x) users.
  • If h(x) < M, these h(x) users are chosen by the base station. The base station will then choose M − h(x) additional users with the smallest ψ̄_{x_i}^1.
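A literal rendering of this selection rule, with hypothetical inputs (the per-user reduced costs ψ̄⁰, ψ̄¹ and the activity indicators 𝟙{π̄¹ > 0}), could read:

```python
def primal_dual_schedule(psi0, psi1, active, M):
    """Primal-dual heuristic of Section 6.1 (sketch).

    psi0, psi1 : lists of reduced costs psi_bar^0 / psi_bar^1, one per user
    active     : active[i] is True iff pi_bar^1 of user i is positive
    M          : number of updates transmitted per slot
    Returns the ids of the users to update.
    """
    chosen = [i for i, act in enumerate(active) if act]
    if len(chosen) >= M:
        # Keep the M active users with the largest psi_bar^0.
        return sorted(chosen, key=lambda i: psi0[i], reverse=True)[:M]
    # All active users are chosen; fill up with the smallest psi_bar^1.
    rest = sorted((i for i, act in enumerate(active) if not act),
                  key=lambda i: psi1[i])
    return chosen + rest[:M - len(chosen)]
```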
However, Linear Programming (LP) is a very general technique and does not appear to take advantage of the special structure of the problem. Although there are algorithms for solving rational LP that take time polynomial in the number of variables and constraints, they run extremely slowly in practice [29]. For our problem, we notice that the users have separate activity areas that are linked through a common resource constraint. Therefore, the primal problem can be solved using Dantzig-Wolfe decomposition. Even so, the problem is still computationally demanding when the system scales up. We recall that we solved the exact problem efficiently using MDP-specific algorithms in Section 5. It is more efficient because of the following reasons
  • According to Proposition 6, we can decompose the problem into N subproblems.
  • For each subproblem, the threshold structure of the optimal policy is utilized to reduce the running time of RVI.
  • As we will see later, the developed policy can be obtained directly from the result of RVI in practice.
In the following, we will translate the results in Section 5 into the optimal primal and dual solution pair and propose Indexed priority policy.

6.2. Indexed Priority Policy

We first define the Lagrangian function associated with (13).
L(\pi_{x_i}^{a_i}, \sigma, \sigma_i, \sigma_{x_i}, \psi_{x_i}^{a_i}) = \sum_{i=1}^{N} \sum_{x_i, a_i} C(x_i)\, \pi_{x_i}^{a_i} + \sum_{i, x_i} \sigma_{x_i} \left( \sum_{x'_i} \sum_{a_i} P_{x'_i, x_i}(a_i)\, \pi_{x'_i}^{a_i} - \pi_{x_i}^{0} - \pi_{x_i}^{1} \right) + \sum_{i=1}^{N} \sigma_i \left( 1 - \sum_{x_i} \sum_{a_i} \pi_{x_i}^{a_i} \right) + \sigma \left( \sum_{i=1}^{N} \sum_{x_i} \pi_{x_i}^{1} - M \right) - \sum_{i, x_i, a_i} \psi_{x_i}^{a_i}\, \pi_{x_i}^{a_i}.
Then, the corresponding Lagrangian dual function is
g(\sigma, \sigma_i, \sigma_{x_i}, \psi_{x_i}^{a_i}) = \inf_{\pi_{x_i}^{a_i}} L(\pi_{x_i}^{a_i}, \sigma, \sigma_i, \sigma_{x_i}, \psi_{x_i}^{a_i}).
Let π x i be the expected time that user i is at state x i caused by the adoption of ϕ λ * , where ϕ λ * is the optimal policy detailed in Theorem 3. Then, we define { π x i a i } as follows
  • State x i is where randomization happens (randomization happens when the actions suggested by the two optimal deterministic policies are different), and it has a value of π x i 0 = a n λ * , i ( x i ) ( 1 μ i ) π x i + a n λ + * , i ( x i ) μ i π x i and π x i 1 = π x i π x i 0 where μ i is given by (12) and a n λ , i ( x i ) is the action suggested by n λ , i at state x i .
  • For other values of x i , we have π x i 0 = ( 1 a n λ * , i ( x i ) ) π x i and π x i 1 = π x i π x i 0 .
We also define σ = λ * , σ i = θ i , and  σ x i = V i ( x i ) where λ * is specified in Section 5.2, θ i is the optimal value of M 1 i ( λ * , 1 ) , and  V i ( x i ) is the value function associated with M 1 i ( λ * , 1 ) . Lastly, we define { ψ x i a i } as follows
\psi_{x_i}^{0} = \sum_{x'_i} P_{x_i, x'_i}(0)\, \sigma_{x'_i} + C(x_i) - \sigma_i - \sigma_{x_i},
\psi_{x_i}^{1} = \sum_{x'_i} P_{x_i, x'_i}(1)\, \sigma_{x'_i} + \sigma + C(x_i) - \sigma_i - \sigma_{x_i}.
Then, we can prove the following proposition.
Proposition 7 (Optimal solution pair).
{ π x i a i } and { σ , σ i , σ x i , ψ x i a i } are primal and dual solutions to (13), respectively.
Proof. 
Since (13) is linear and strictly feasible, it is sufficient to show that { π x i a i } and { σ , σ i , σ x i , ψ x i a i } verify the KKT conditions, which can be expressed as the following four conditions.
  • Primal feasibility: the constraints in (13) are satisfied.
  • Dual feasibility: σ ≥ 0 and ψ_{x_i}^{a_i} ≥ 0 for all x_i, a_i, and i.
  • Complementary slackness: σ(∑_{i=1}^{N} ∑_{x_i} π_{x_i}^{1} − M) = 0 and ψ_{x_i}^{a_i} π_{x_i}^{a_i} = 0 for all x_i, a_i, and i.
  • Stationarity: the gradient of L ( π x i a i , σ , σ i , σ x i , ψ x i a i ) with respect to { π x i a i } vanishes.
Apparently, the first condition is satisfied by {π_{x_i}^{a_i}}. For the second condition, σ ≥ 0 since σ = λ* ≥ 0 by definition. For ψ_{x_i}^{a_i}, we can verify that ψ_{x_i}^{a_i} = V^{i,a_i}(x_i) − V_i(x_i), where V^{i,a_i}(x_i) is the value function resulting from taking action a_i at state x_i. Then, the non-negativity is guaranteed by the Bellman equation. For the third condition, the first term is zero because we choose the μ_i's given by (12). For the second term, we recall that ψ_{x_i}^{a_i} = V^{i,a_i}(x_i) − V_i(x_i). According to the definition of π_{x_i}^{a_i}, we know V_i(x_i) = V^{i,a_i}(x_i) if π_{x_i}^{a_i} > 0. Combined together, we can conclude that ψ_{x_i}^{a_i} = 0 when π_{x_i}^{a_i} > 0. Thus, the third condition is satisfied. For the last condition, setting the gradient equal to zero yields a system of linear equations. More precisely, for each x_i and 1 ≤ i ≤ N,
\begin{cases} \sum_{x'_i} P_{x_i, x'_i}(0)\, \sigma_{x'_i} + C(x_i) = \sigma_{x_i} + \sigma_i + \psi_{x_i}^{0}, \\ \sum_{x'_i} P_{x_i, x'_i}(1)\, \sigma_{x'_i} + \sigma + C(x_i) = \sigma_{x_i} + \sigma_i + \psi_{x_i}^{1}. \end{cases}
Then, { σ , σ i , σ x i , ψ x i a i } verifies the system of linear equations by definition. Since all four conditions are satisfied, we can conclude our proof.    ☐
According to Proposition 7, we know that { π x i a i } and { σ , σ i , σ x i } defined above are the optimal solutions to problems (13) and (14), respectively. As the optimal solutions are obtained, we can adopt the heuristic detailed in Section 6.1.
The heuristic can be expressed equivalently as an index policy. To this end, we define the index I x i for state x i as
$$
I_{x_i} \triangleq \bar\psi_{x_i}^{0} - \bar\psi_{x_i}^{1}.
$$
According to the complementary slackness, I x i can be reduced to the following.
  • For state x_i such that π̄_{x_i}^1 > 0 and π̄_{x_i}^0 = 0, we have ψ̄_{x_i}^1 = 0. Therefore, I_{x_i} = ψ̄_{x_i}^0 ≥ 0.
  • For state x_i such that π̄_{x_i}^1 > 0 and π̄_{x_i}^0 > 0, we have ψ̄_{x_i}^1 = ψ̄_{x_i}^0 = 0. Therefore, I_{x_i} = 0.
  • For state x_i such that π̄_{x_i}^1 = 0 and π̄_{x_i}^0 > 0, we have ψ̄_{x_i}^0 = 0. Therefore, I_{x_i} = −ψ̄_{x_i}^1 ≤ 0.
We can show that I x i possesses the following properties.
Proposition 8 (Properties of I x i ).
For 1 ≤ i ≤ N, I_{x_i} ≥ −λ* for any x_i. The equality holds when r̂_i = p_{e,i}^0 = 0 or s_i = 0. At the same time, I_{x_i} is non-decreasing in both s_i and r̂_i.
Proof. 
We notice that I x i can be expressed as a function of V i ( x i ) and λ * . Meanwhile, M 1 i ( λ * , 1 ) coincides with the decoupled model studied in Section 4.2. Then, we can verify the properties of I x i using the results in Section 4.2. The complete proof can be found in Appendix L.    ☐
Comparing with the heuristic detailed in Section 6.1, we can define the Indexed priority policy.
Definition 5 (Indexed priority policy).
At any state x = ( x 1 , x 2 , , x N ) , the base station will transmit the updates from M users with the largest I x i . The ties are broken arbitrarily.
Remark 7.
Indexed priority policy belongs to the class of priority policies introduced in [30]. These priority policies are asymptotically optimal when certain conditions are satisfied.
Remark 8.
Indexed priority policy possesses the structural properties detailed in Corollary 1.
  • The first two properties can be verified by noting that I_{x_i} ≥ −λ* and the equality holds when r̂_i = p_{e,i}^0 = 0 or s_i = 0. At the same time, I_{x_i} is non-decreasing in r̂_i.
  • The third and fourth properties can be verified by noting that I x i is non-decreasing in s i .
  • For the last property, we first notice that I x j = I x k when users j and k are statistically identical and x j = x k . Then, the property can be verified by noting that I x i is non-decreasing in both s i and r ^ i .
We notice that the θ_i's and C(x_i)'s cancel out in the definition of I_{x_i}. Therefore, I_{x_i} can be calculated using λ* and the value function of M_1^i(λ*, 1). In practice, we can use either λ_-^* or λ_+^* to approximate λ*, and the value function can be approximated by the result of the RVI detailed in Section 5.1. Since the state space is infinite, we only calculate a finite number of V_i(x_i), the number of which depends on the truncation parameter m of ASM. Meanwhile, the probabilities P_{x_i,x_i'}(a_i) in I_{x_i} are modified according to (10).
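To make this computation concrete, the Python sketch below assembles the index of each user from a truncated value function and the modified transition matrices, and then schedules the M users with the largest indices. It is only an illustrative sketch under our own naming conventions (user_index, indexed_priority_schedule, lam_star are not part of the paper's implementation); the value functions and kernels are assumed to be produced beforehand by the RVI and Bisection search routines.

```python
import numpy as np

def user_index(V, P0, P1, x, lam_star):
    """Index I_x = sum_x' P_{x,x'}(0) V(x') - sum_x' P_{x,x'}(1) V(x') - lam_star.

    V  : truncated value function of the user's decoupled problem (e.g., from RVI),
    P0 : transition matrix when the user is not scheduled (a = 0),
    P1 : transition matrix when the user is scheduled (a = 1),
    x  : flat index of the user's current state.
    """
    return P0[x] @ V - P1[x] @ V - lam_star

def indexed_priority_schedule(users, lam_star, M):
    """Return the indices of the M users with the largest I_x (ties broken arbitrarily)."""
    scores = np.array([user_index(V, P0, P1, x, lam_star) for (V, P0, P1, x) in users])
    return list(np.argsort(-scores)[:M])

# Toy usage with made-up two-state users (for illustration only).
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    users = []
    for _ in range(4):
        P0 = rng.dirichlet(np.ones(2), size=2)   # random stochastic matrices
        P1 = rng.dirichlet(np.ones(2), size=2)
        V = rng.random(2)                        # placeholder value function
        users.append((V, P0, P1, int(rng.integers(2))))
    print(indexed_priority_schedule(users, lam_star=0.3, M=2))
```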

7. Numerical Results

In this section, we provide numerical results to showcase the performance of the developed scheduling policies. To eliminate the effect of N, we plot the expected average AoII. In particular, we provide the expected average AoII achieved by the Indexed priority policy and Whittle’s index policy when M = 1. The policies are calculated using the results detailed in Section 4, Section 5 and Section 6. When obtaining the Indexed priority policy, we set the tolerance in the Bisection search to ξ = 0.005. Meanwhile, we choose the truncation parameter in ASM m = 800 and the convergence criterion in RVI ϵ = 0.01. We notice that the calculation of Whittle’s index involves an infinite sum. In practice, we approximate the result by replacing +∞ with a large enough number k_max. Here, we choose k_max = 800. For both scheduling policies, the resulting expected average AoII is obtained via simulations. Each data point is the average of 15 runs with 15,000 time slots considered in each run.
We also compare the developed policies with the optimal policy for RP, which can be calculated by following the discussion in Section 5.2. We adopt the same choices of parameters as we used to obtain the developed policies. The corresponding performance is calculated using Proposition 2. Like before, the infinite sum is approximated by replacing +∞ with k_max = 800. We also provide the expected average AoII achieved by the Greedy policy to show the performance advantages of the developed policies. When the Greedy policy is adopted, the base station always chooses the user with the largest AoII. The resulting expected average AoII is obtained via the same simulations as applied to the developed policies.
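As a rough illustration of how such simulation estimates can be produced, the following Python harness averages the AoII of a scheduling rule over 15 runs of 15,000 slots. It is a sketch under stated assumptions: the function step, which should implement the source and channel dynamics of Section 2.3, is deliberately left as a user-supplied placeholder, and greedy_policy_factory mirrors the Greedy rule just described.

```python
import numpy as np

def estimate_avg_aoii(step, policy, n_users, runs=15, horizon=15000, seed=0):
    """Monte-Carlo estimate of the expected average AoII per user.

    step(states, scheduled, rng) -> new_states : user-supplied system dynamics
        (placeholder for the source/channel model of the paper),
    policy(states) -> set of user indices scheduled in the current slot,
    states : array of current AoII values s_i (the CSI estimates stay inside `step`).
    """
    totals = []
    for r in range(runs):
        rng = np.random.default_rng(seed + r)
        states = np.zeros(n_users)
        acc = 0.0
        for _ in range(horizon):
            scheduled = policy(states)
            states = step(states, scheduled, rng)
            acc += states.mean()          # average AoII over users in this slot
        totals.append(acc / horizon)
    return float(np.mean(totals))

def greedy_policy_factory(M):
    """Greedy rule: schedule the M users with the largest current AoII."""
    def policy(states):
        return set(np.argsort(-states)[:M])
    return policy
```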
Figure 3 and Figure 4 illustrate the performance when the source processes have different dynamics and when each user’s communication goal is different, respectively. Figure 3a provides the performance when p_i = 0.05 + 0.4(i−1)/(N−1) for 1 ≤ i ≤ N. For other parameters, the users make the same choices. More precisely, f_i(s) = s, γ_i = 0.6, and p_{e,i}^0 = p_{e,i}^1 = 0.1 for 1 ≤ i ≤ N. Figure 4a provides the performance when f_i(s) = s^{0.5 + (i−1)/(N−1)} for 1 ≤ i ≤ N. Same as before, the users make the same choices for other parameters. More precisely, p_i = 0.3, γ_i = 0.6, and p_{e,i}^0 = p_{e,i}^1 = 0.1 for 1 ≤ i ≤ N. In Figure 3b and Figure 4b, we force p_{e,i}^0 = 0 for all users to ensure the existence of Whittle’s index. Other choices remain the same as in Figure 3a and Figure 4a. According to Corollary 1, the optimal policy will never choose a user with r̂ = p_e^0 = 0 unless it is to break a tie. Therefore, in Figure 3b and Figure 4b, we also consider the Greedy+ policy where the base station always chooses the user with the largest AoII among the users with r̂ = 1. The resulting expected average AoII is obtained via the same simulations as applied to the Greedy policy.
Figure 5 shows the performance in systems where the parameters for each user are generated uniformly and randomly within their ranges. In Figure 5a, we consider N = 5, γ ∈ [0, 1], p ∈ [0.05, 0.45], p_e^{r̂} ∈ [0, 0.45], and f(s) = s^τ, where τ ∈ [0.5, 1.5]. There are a total of 300 different choices, and the results are sorted by the performance of the RP-optimal policy in ascending order. Figure 5b adopts the same system settings except that we impose p_{e,i}^0 = 0 for 1 ≤ i ≤ N to ensure the feasibility of Whittle’s index policy. Meanwhile, we ignore the Greedy policy since the Greedy+ policy achieves a better performance, as indicated by Figure 3b and Figure 4b.
We can make the following observations from the figures.
  • The Greedy+ policy yields a smaller expected average AoII than that achieved by the Greedy policy. Recall that we obtained the Greedy+ policy by applying the structural properties detailed in Corollary 1. Therefore, simple applications of the structural properties of the optimal policy can improve the performance of scheduling policies.
  • The Indexed priority policy has comparable performance to Whittle’s index policy in all the system settings considered. The two policies have their own advantages. The Indexed priority policy has a broader scope of application, while Whittle’s index policy has a lower computational complexity.
  • The performance of the Indexed priority policy and Whittle’s index policy is better than that of the Greedy/Greedy+ policies and is not far from the performance of the RP-optimal policy. Recall that the performance of the RP-optimal policy forms a universal lower bound on the performance of all admissible policies for PP. Hence, we can conclude that both the Indexed priority policy and Whittle’s index policy achieve good performance.

8. Conclusions

In this paper, we studied the problem of minimizing the Age of Incorrect Information in a slotted-time system where a base station needs to schedule M users among N available users. Meanwhile, the base station has access to imperfect channel state information in each time slot. The problem is a restless multi-armed bandit problem, which is PSPACE-hard. However, by casting the problem into a Markov decision process, we obtained the structural properties of the optimal policy. Then, we introduced a relaxed version of the original problem and investigated the decoupled model. Under a simple condition, we established the indexability of the decoupled problem and obtained the expression of Whittle’s index. On this basis, we developed Whittle’s index policy. To remove the requirement of indexability, we developed the Indexed priority policy based on the optimal policy for the relaxed problem. The characteristics of the relaxed problem are explored to make the calculation of its optimal policy more efficient. Finally, through numerical results, we showed that simple applications of the structural properties can improve the performance of scheduling policies. Moreover, Whittle’s index policy and the Indexed priority policy achieve good and comparable performance.

Author Contributions

Formal analysis, Y.C.; Investigation, Y.C.; Methodology, Y.C.; Supervision, A.E.; Validation, Y.C.; Writing—original draft, Y.C.; Writing—review & editing, Y.C. and A.E. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Proof of Lemma 1

We consider two states, x_1 and x_2, that differ only in the value of s_j. Without loss of generality, we assume s_{1,j} < s_{2,j}. Then, it is sufficient to show that, for any 1 ≤ j ≤ N, V(x_1) ≤ V(x_2). Leveraging the iterative nature of VIA, we use mathematical induction to prove the monotonicity. First of all, the base case (i.e., ν = 0) is true by initialization. We assume the lemma holds at iteration ν. Then, we want to examine whether it holds at iteration ν + 1. The update step reported in problem (5) can be rewritten as follows.
$$ V_{\nu+1}(x) = \min_{a \in \mathcal{A}_N(1)} V_{\nu+1}^{a}(x), $$
where
$$ V_{\nu+1}^{a}(x) = C(x) - \theta + \sum_{x' \setminus \{x_j'\}} \prod_{i \ne j} P_{x_i,x_i'}(a_i) \sum_{\hat r_j'} P(\hat r_j')\, U_{\nu}^{j}(x,x'), $$
$$ U_{\nu}^{j}(x,x') = \sum_{s_j'} P_{s_j,s_j'}(a_j,\hat r_j)\, V_{\nu}(x'). $$
To prove the desired results, we distinguish between the following cases.
  • We first consider the case of s_{1,j} = 0 < s_{2,j} and r̂_{1,j} = r̂_{2,j} = 0. When a_j = 1 and for any x' ∖ {s_j'}, we have
$$ U_{\nu}^{j}(x_1,x') = p_j V_{\nu}(x'; s_j'=1) + (1-p_j)V_{\nu}(x'; s_j'=0), $$
$$ U_{\nu}^{j}(x_2,x') = \beta_j V_{\nu}(x'; s_j'=s_{2,j}+1) + (1-\beta_j)V_{\nu}(x'; s_j'=0), $$
where V_ν(x'; s_j' = 0) is the estimated value function of the state x' with s_j' = 0 at iteration ν (at the risk of abusing the notation, we use V(x; s_j = s_1) and V(x; s_j = s_2) to represent the value functions of two states that differ only in the value of s_j). Then, we get
$$ U_{\nu}^{j}(x_1,x') - U_{\nu}^{j}(x_2,x') \le (p_j - \beta_j)\big(V_{\nu}(x'; s_j'=1) - V_{\nu}(x'; s_j'=0)\big) \le 0. $$
The inequalities hold since β_j > p_j and Lemma 1 is true at iteration ν by assumption. Therefore, we have U_ν^j(x_1,x') ≤ U_ν^j(x_2,x') when a_j = 1 for any x' ∖ {s_j'}.
    For the case of a_i = 1 where i ≠ j, we notice that a_j = 0. Then, for any x' ∖ {s_j'}, we obtain
$$ U_{\nu}^{j}(x_1,x') = p_j V_{\nu}(x'; s_j'=1) + (1-p_j)V_{\nu}(x'; s_j'=0), $$
$$ U_{\nu}^{j}(x_2,x') = (1-p_j) V_{\nu}(x'; s_j'=s_{2,j}+1) + p_j V_{\nu}(x'; s_j'=0). $$
Therefore, when a_i = 1, we have
$$ U_{\nu}^{j}(x_1,x') - U_{\nu}^{j}(x_2,x') \le (2p_j - 1)\big(V_{\nu}(x'; s_j'=1) - V_{\nu}(x'; s_j'=0)\big) \le 0. $$
The inequalities hold since 2p_j − 1 < 0 and Lemma 1 is true at iteration ν by assumption. Combining with the case of a_j = 1, U_ν^j(x_1,x') ≤ U_ν^j(x_2,x') holds for any x' ∖ {s_j'} under any feasible action. Since x_1 and x_2 differ only in the value of s_j and C(x) is non-decreasing in s_i for 1 ≤ i ≤ N, we can see that V^a_{ν+1}(x_1) ≤ V^a_{ν+1}(x_2) for any feasible a. Then, by (A1), we can conclude that the lemma holds at iteration ν + 1 when s_{1,j} = 0 < s_{2,j} and r̂_{1,j} = r̂_{2,j} = 0.
  • When s 1 , j = 0 < s 2 , j and r ^ 1 , j = r ^ 2 , j = 1 , by replacing the β j ’s in the above case with α j ’s, we can achieve the same result.
  • When 0 < s 1 , j < s 2 , j and r ^ 1 , j = r ^ 2 , j , we notice that
$$ P_{s_{1,j},\,s_{1,j}+1}(a_j,\hat r_{1,j}) = P_{s_{2,j},\,s_{2,j}+1}(a_j,\hat r_{2,j}), \qquad P_{s_{1,j},\,0}(a_j,\hat r_{1,j}) = P_{s_{2,j},\,0}(a_j,\hat r_{2,j}). $$
    Then, leveraging the monotonicity of V ν ( x ) and C ( x ) , we can conclude with the same result.
Combining the three cases, we prove that the lemma also holds at iteration ν + 1 of VIA. Therefore, the lemma holds at any iteration ν by mathematical induction. Since the results hold for any 1 ≤ j ≤ N and VIA is guaranteed to converge to the value function when ν → +∞, we can conclude our proof.

Appendix B. Proof of Lemma 2

We inherit the notations from the proof of Lemma 1. We still use mathematical induction to obtain the desired results. The base case ν = 0 is true by initialization. We assume the lemma holds at iteration ν and examine whether it still holds at iteration ν + 1. In the case of M = 1, we rewrite (5) as
$$ V_{\nu+1}(x) = \min_{1 \le j \le N} V_{\nu+1}^{j}(x), $$
where
$$ V_{\nu+1}^{j}(x) = C(x) - \theta + \sum_{x'} \prod_{i \ne j} P^{i}_{x_i,x_i'}(0)\, P^{j}_{x_j,x_j'}(1)\, V_{\nu}(x'), $$
and P x , x i ( a i ) is the probability that action a i will lead to state x when user i is at state x. To get the desired results, we distinguish between the following cases
  • We first show that V ν + 1 j ( x ) = V ν + 1 k ( P ( x ) ) . According to (A3), we have
$$ V_{\nu+1}^{j}(x) = C(x) - \theta + \sum_{x'} \prod_{i \ne j,k} P^{i}_{x_i,x_i'}(0)\, P^{k}_{x_k,x_k'}(0)\, P^{j}_{x_j,x_j'}(1)\, V_{\nu}(x'), $$
$$ V_{\nu+1}^{k}(P(x)) = C(P(x)) - \theta + \sum_{P(x)'} \prod_{i \ne j,k} P^{i}_{P(x)_i,P(x)_i'}(0)\, P^{k}_{P(x)_k,P(x)_k'}(1)\, P^{j}_{P(x)_j,P(x)_j'}(0)\, V_{\nu}(P(x)'). $$
    It is obvious that for any P ( x ) , there always exists P ( x ) = P ( x ) . Then, we obtain
$$
\begin{aligned}
V_{\nu+1}^{k}(P(x)) &= C(P(x)) - \theta + \sum_{P(x)'} \prod_{i \ne j,k} P^{i}_{x_i,x_i'}(0)\, P^{k}_{x_j,P(x)_k'}(1)\, P^{j}_{x_k,P(x)_j'}(0)\, V_{\nu}(P(x)')\\
&= C(P(x)) - \theta + \sum_{x'} \prod_{i \ne j,k} P^{i}_{x_i,x_i'}(0)\, P^{k}_{x_j,x_j'}(1)\, P^{j}_{x_k,x_k'}(0)\, V_{\nu}(P(x'))\\
&= C(P(x)) - \theta + \sum_{x'} \prod_{i \ne j,k} P^{i}_{x_i,x_i'}(0)\, P^{k}_{x_j,x_j'}(1)\, P^{j}_{x_k,x_k'}(0)\, V_{\nu}(x').
\end{aligned}
$$
    The second equality follows from the definition of P ( · ) , the property of summation, and the assumption at iteration ν . The last equality follows from the variable renaming. Then, by the definition of statistically identical, we have P x j , x j k ( 1 ) = P x j , x j j ( 1 ) , P x k , x k j ( 0 ) = P x k , x k k ( 0 ) , and  C ( x ) = C ( P ( x ) ) . Therefore, we can conclude that V ν + 1 j ( x ) = V ν + 1 k ( P ( x ) ) .
  • Along the same lines, we can easily show that V ν + 1 k ( x ) = V ν + 1 j ( P ( x ) ) and V ν + 1 i ( x ) = V ν + 1 i ( P ( x ) ) for i j , k .
Combining the above cases with (A2), we prove that V_{ν+1}(x) = V_{ν+1}(P(x)). Then, by induction, we have V_ν(x) = V_ν(P(x)) at any iteration ν. Since VIA is guaranteed to converge to the value function when ν → +∞, we can conclude our proof.

Appendix C. Proof of Theorem 1

For arbitrary j and k
$$ \delta_{j,k}(x) = \sum_{x' \setminus \{x_j',x_k'\}} \prod_{i \ne j,k} P_{x_i,x_i'}(0) \sum_{\hat r_j',\hat r_k'} P(\hat r_j')\,P(\hat r_k')\, R_{j,k}(x,x'), $$
where
$$ R_{j,k}(x,x') = \sum_{s_j',s_k'} \Big( P_{s_k,s_k'}(0,\hat r_k)\,P_{s_j,s_j'}(1,\hat r_j) - P_{s_k,s_k'}(1,\hat r_k)\,P_{s_j,s_j'}(0,\hat r_j) \Big)\, V(x'). $$
With this in mind, we will prove the properties one by one.
Property 1: δ_{j,k}(x) ≤ 0 if r̂_k = p_{e,k}^0 = 0. The equality holds when s_j = 0 or r̂_j = p_{e,j}^0 = 0.
When r ^ k = p e , k 0 = 0 , transmitting the update from user k will necessarily fail. Therefore, P s k , s k ( 0 , 0 ) = P s k , s k ( 1 , 0 ) for any s k and s k . Then, we have
$$ R_{j,k}(x,x') = \sum_{s_k'} P_{s_k,s_k'}(0,0) \sum_{s_j'} \big( P_{s_j,s_j'}(1,\hat r_j) - P_{s_j,s_j'}(0,\hat r_j) \big)\, V(x'). $$
To identify the sign of R j , k ( x , x ) , we distinguish between the following cases
  • When s j = 0 , we can easily show that R j , k ( x , x ) = 0 for any x { s j , s k } by noticing that the two possible actions with respect to user j (i.e., a j = 1 and a j = 0 ) are equivalent when s j = 0 . Since δ j , k ( x ) is a linear combination of R j , k ( x , x ) ’s with non-negative coefficients, we can conclude that δ j , k ( x ) = 0 in this case.
  • When s j > 0 and r ^ j = 1 , for any x { s j , s k } , we have
$$ R_{j,k}(x,x') = \sum_{s_k'} P_{s_k,s_k'}(0,0)\,(\alpha_j + p_j - 1)\big( V(x'; s_j'=s_j+1) - V(x'; s_j'=0) \big) \le 0. $$
    The inequality holds because of Lemma 1 and the fact that α j + p j < 1 . We recall that δ j , k ( x ) is a linear combination of R j , k ( x , x ) ’s with non-negative coefficients. Then, we can conclude that δ j , k ( x ) 0 in this case.
  • When s j > 0 and r ^ j = 0 , by replacing the α j in (A6) with β j , we can get the same result. In this case, the equality holds when β j + p j = 1 , or, equivalently, p e , j 0 = 0 .
Combining the cases, we prove the first property.
Property 2: δ_{j,k}(x) is non-increasing in r̂_j and is non-decreasing in r̂_k when s_j, s_k > 0. At the same time, δ_{j,k}(x) is independent of r̂_i for any i ≠ j, k.
We first prove the monotonicity of δ j , k ( x ) with respect to r ^ j . To this end, we define x 1 and x 2 as two states that differ only in the value of r ^ j . Without a loss of generality, we assume r ^ 1 , j = 1 and r ^ 2 , j = 0 . Then, we investigate the sign of δ j , k ( x 1 ) δ j , k ( x 2 ) . We define x i x 1 , i = x 2 , i for i j . Then, according to (A4), δ j , k ( x 1 ) δ j , k ( x 2 ) can be written as
$$ \delta_{j,k}(x_1) - \delta_{j,k}(x_2) = \sum_{x' \setminus \{x_j',x_k'\}} \prod_{i \ne j,k} P_{x_i,x_i'}(0) \sum_{\hat r_j',\hat r_k'} P(\hat r_j')\,P(\hat r_k')\, \big( R_{j,k}(x_1,x') - R_{j,k}(x_2,x') \big). $$
Since x 1 , k = x 2 , k , we have P s 1 , k , s k ( a , r ^ 1 , k ) = P s 2 , k , s k ( a , r ^ 2 , k ) for any s k . We recall that the transition probability is independent of r ^ when a = 0 . Combining with the fact that s 1 , j = s 2 , j , we also have P s 1 , j , s j ( 0 , r ^ 1 , j ) = P s 2 , j , s j ( 0 , r ^ 2 , j ) for any s j . Combining together, we obtain
$$ P_{s_{1,k},s_k'}(1,\hat r_{1,k})\,P_{s_{1,j},s_j'}(0,\hat r_{1,j}) = P_{s_{2,k},s_k'}(1,\hat r_{2,k})\,P_{s_{2,j},s_j'}(0,\hat r_{2,j}), $$
$$ P_{s_{1,k},s_k'}(0,\hat r_{1,k}) = P_{s_{2,k},s_k'}(0,\hat r_{2,k}). $$
Leveraging the above two problems, we have
$$ R_{j,k}(x_1,x') - R_{j,k}(x_2,x') = \sum_{s_j',s_k'} P_{s_k,s_k'}(0,\hat r_k)\big( P_{s_{1,j},s_j'}(1,\hat r_{1,j}) - P_{s_{2,j},s_j'}(1,\hat r_{2,j}) \big)\, V(x'). $$
Consequently, we obtain
$$ \delta_{j,k}(x_1) - \delta_{j,k}(x_2) = \sum_{x' \setminus \{x_j'\}} \prod_{i \ne j} P_{x_i,x_i'}(0) \sum_{\hat r_j'} P(\hat r_j') \sum_{s_j'} \big( P_{s_{1,j},s_j'}(1,1) - P_{s_{2,j},s_j'}(1,0) \big)\, V(x'). $$
In the following, we characterize the sign of
$$ R_1 \triangleq \sum_{s_j'} \big( P_{s_{1,j},s_j'}(1,1) - P_{s_{2,j},s_j'}(1,0) \big)\, V(x'). $$
As s 1 , j = s 2 , j > 0 , for any x { s j } , we have
$$ R_1 = \big( (1-\alpha_j) - (1-\beta_j) \big)\, V(x'; s_j'=0) + (\alpha_j - \beta_j)\, V(x'; s_j'=s_{1,j}+1) \le 0. $$
The inequality follows from Lemma 1 and the fact that β_j > α_j. Since δ_{j,k}(x_1) − δ_{j,k}(x_2) is a linear combination of R_1's with non-negative coefficients, we can conclude that δ_{j,k}(x_1) ≤ δ_{j,k}(x_2). Since r̂_{1,j} > r̂_{2,j}, we can see that δ_{j,k}(x) is non-increasing in r̂_j.
In a very similar way, we can show that δ j , k ( x ) is non-decreasing in r ^ k . We recall that r ^ i will not affect the system dynamic if a i = 0 . Consequently, we can conclude that δ j , k ( x ) is independent of r ^ i for any i j , k .
Combining together, we prove the second property.
Property 3: δ_{j,k}(x) ≤ 0 if s_k = 0. The equality holds when s_j = 0 or r̂_j = p_{e,j}^0 = 0.
Since the probabilities are non-negative, it is sufficient to show that R j , k ( x , x ) satisfies Property 3 for any x { s j , s k } . More precisely, it is sufficient to show that R j , k ( x , x ) 0 for any x { s j , s k } when s k = 0 and the equality holds when s j = 0 or r ^ j = p e , j 0 = 0 . We recall that P s k , s k ( 1 , r ^ k ) = P s k , s k ( 0 , r ^ k ) for any s k when s k = 0 . Hence, for any x { s j , s k } , we have
$$ R_{j,k}(x,x') = \sum_{s_k'} P_{s_k,s_k'}(0,\hat r_k) \sum_{s_j'} \big( P_{s_j,s_j'}(1,\hat r_j) - P_{s_j,s_j'}(0,\hat r_j) \big)\, V(x'). $$
Then, we investigate the following quantity for any x' ∖ {s_j'}
$$ R_2 \triangleq \sum_{s_j'} \big( P_{s_j,s_j'}(1,\hat r_j) - P_{s_j,s_j'}(0,\hat r_j) \big)\, V(x'). $$
To this end, we distinguish between the following cases
  • When s j = 0 , we have P s j , s j ( 1 , r ^ j ) = P s j , s j ( 0 , r ^ j ) for any s j . Thus, we conclude that R 2 = 0 for any x { s j } . Consequently, R j , k ( x , x ) = 0 for any x { s j , s k } .
  • When s j > 0 and r ^ j = 1 , for any x { s j } , we have
$$ R_2 = (\alpha_j - 1 + p_j)\, V(x'; s_j'=s_j+1) + (1 - \alpha_j - p_j)\, V(x'; s_j'=0) \le 0. $$
    The inequality follows from Lemma 1 and the fact that α j + p j < 1 . Thus, R j , k ( x , x ) 0 for any x { s j , s k } .
  • When s j > 0 and r ^ j = 0 , by replacing the α j in (A7) with β j , we can get the same result. In this case, the equality holds when β j + p j = 1 , or, equivalently, p e , j 0 = 0 .
Combined together, we can conclude that Property 3 is true.
Property 4: δ_{j,k}(x) is non-increasing in s_j if Γ_j^{r̂_j} ≤ Γ_k^{r̂_k} and is non-decreasing in s_k if Γ_j^{r̂_j} ≥ Γ_k^{r̂_k} when s_j, s_k > 0. We define Γ_i^1 ≜ α_i/(1 − p_i) and Γ_i^0 ≜ β_i/(1 − p_i) for 1 ≤ i ≤ N.
As we did in the proof of Property 3, it is sufficient to show that R_{j,k}(x,x') satisfies Property 4 for any x' ∖ {s_j',s_k'}. We recall that R_{j,k}(x,x') depends on the values of r̂_j and r̂_k. Therefore, we distinguish between the following cases.
  • In the case of r̂_j = r̂_k = 1 and s_j, s_k > 0, for any x' ∖ {s_j',s_k'}, (A5) can be written as
$$
\begin{aligned}
R_{j,k}(x,x') ={}& \big(p_k\alpha_j - (1-p_j)(1-\alpha_k)\big)\, V(x'; s_j'=s_j+1; s_k'=0)\\
&+ \big((1-p_k)(1-\alpha_j) - p_j\alpha_k\big)\, V(x'; s_j'=0; s_k'=s_k+1)\\
&+ \big((1-p_k)\alpha_j - (1-p_j)\alpha_k\big)\, V(x'; s_j'=s_j+1; s_k'=s_k+1)\\
&+ \big(p_k(1-\alpha_j) - p_j(1-\alpha_k)\big)\, V(x'; s_j'=0; s_k'=0).
\end{aligned}
$$
As we can verify,
$$ p_k\alpha_j - (1-p_j)(1-\alpha_k) < \tfrac{1}{2}(p_k + p_j - 1) < 0, $$
$$ (1-p_k)(1-\alpha_j) - p_j\alpha_k > \tfrac{1}{2}(1 - p_k - p_j) > 0. $$
We define Γ_i^1 ≜ α_i/(1 − p_i) and Γ_i^0 ≜ β_i/(1 − p_i) for 1 ≤ i ≤ N. Then, we have
$$ \Gamma_j^1 \le \Gamma_k^1 \iff (1-p_k)\alpha_j - (1-p_j)\alpha_k \le 0. $$
Combining with Lemma 1, we can conclude that, for any x' ∖ {s_j',s_k'}, R_{j,k}(x,x') is non-increasing in s_j if Γ_j^1 ≤ Γ_k^1 and is non-decreasing in s_k if Γ_j^1 ≥ Γ_k^1.
  • In the case of r ^ j = r ^ k = 0 and s j , s k > 0 , by replacing the α ’s in the above case with β ’s, we can conclude with the same result.
  • In the case of r̂_j = 1, r̂_k = 0, and s_j, s_k > 0, for any x' ∖ {s_j',s_k'}, (A5) can be written as
$$
\begin{aligned}
R_{j,k}(x,x') ={}& \big(p_k\alpha_j - (1-p_j)(1-\beta_k)\big)\, V(x'; s_j'=s_j+1; s_k'=0)\\
&+ \big((1-p_k)(1-\alpha_j) - p_j\beta_k\big)\, V(x'; s_j'=0; s_k'=s_k+1)\\
&+ \big((1-p_k)\alpha_j - (1-p_j)\beta_k\big)\, V(x'; s_j'=s_j+1; s_k'=s_k+1)\\
&+ \big(p_k(1-\alpha_j) - p_j(1-\beta_k)\big)\, V(x'; s_j'=0; s_k'=0).
\end{aligned}
$$
As we can verify,
$$ p_k\alpha_j - (1-p_j)(1-\beta_k) < p_k\big(p_j - \tfrac{1}{2}\big) < 0, $$
$$ (1-p_k)(1-\alpha_j) - p_j\beta_k > (1-p_k)\big(\tfrac{1}{2} - p_j\big) > 0. $$
At the same time,
$$ \Gamma_j^1 \le \Gamma_k^0 \iff (1-p_k)\alpha_j - (1-p_j)\beta_k \le 0. $$
Combined with Lemma 1, we can conclude that, for any x' ∖ {s_j',s_k'}, R_{j,k}(x,x') is non-increasing in s_j if Γ_j^1 ≤ Γ_k^0 and is non-decreasing in s_k if Γ_j^1 ≥ Γ_k^0.
  • In the case of r ^ j = 0 , r ^ k = 1 , and  s j , s k > 0 , by swapping the α ’s and β ’s in the above case, we can conclude with the same result.
Combining the cases, we conclude that R_{j,k}(x,x') satisfies Property 4 for any x' ∖ {s_j',s_k'}. Consequently, δ_{j,k}(x) is non-increasing in s_j if Γ_j^{r̂_j} ≤ Γ_k^{r̂_k} and is non-decreasing in s_k if Γ_j^{r̂_j} ≥ Γ_k^{r̂_k} when s_j, s_k > 0.
Property 5: δ_{j,k}(x) ≤ 0 if s_j ≥ s_k, r̂_j ≥ r̂_k, and users j and k are statistically identical.
According to Property 3, it is sufficient to consider the case where s_j, s_k > 0. We notice that the sign of δ_{j,k}(x) can be captured by the sign of the quantity Q_{j,k}(x,x') ≜ Σ_{r̂_j',r̂_k'} P(r̂_j')P(r̂_k') R_{j,k}(x,x'). Thus, we divide our discussion into the following cases.
  • We first consider the case of s_j ≥ s_k > 0 and r̂_j = r̂_k = 0. Leveraging the definition of statistically identical users, for any x' ∖ {x_j',x_k'}, we have
$$ Q_{j,k}(x,x') = \sum_{\hat r_j',\hat r_k'} P(\hat r_j')\,P(\hat r_k')\,\kappa_1\Big( V\big(x'; x_j'=(0,\hat r_j'); x_k'=(s_k+1,\hat r_k')\big) - V\big(x'; x_j'=(s_j+1,\hat r_j'); x_k'=(0,\hat r_k')\big) \Big), $$
where κ_1 = 1 − p_j − β_j ≥ 0. Then, by substituting the values of P(r̂') and using Lemma 2, we obtain
$$
\begin{aligned}
Q_{j,k}(x,x') ={}& \gamma_j\gamma_k\,\kappa_1\, V(x'; x_j'=(s_k+1,1); x_k'=(0,1)) - \gamma_j\gamma_k\,\kappa_1\, V(x'; x_j'=(s_j+1,1); x_k'=(0,1))\\
&+ (1-\gamma_j)(1-\gamma_k)\,\kappa_1\, V(x'; x_j'=(s_k+1,0); x_k'=(0,0)) - (1-\gamma_j)(1-\gamma_k)\,\kappa_1\, V(x'; x_j'=(s_j+1,0); x_k'=(0,0))\\
&+ \gamma_k(1-\gamma_j)\,\kappa_1\, V(x'; x_j'=(s_k+1,1); x_k'=(0,0)) - \gamma_k(1-\gamma_j)\,\kappa_1\, V(x'; x_j'=(s_j+1,0); x_k'=(0,1))\\
&+ \gamma_j(1-\gamma_k)\,\kappa_1\, V(x'; x_j'=(s_k+1,0); x_k'=(0,1)) - \gamma_j(1-\gamma_k)\,\kappa_1\, V(x'; x_j'=(s_j+1,1); x_k'=(0,0)).
\end{aligned}
$$
Since users j and k are statistically identical, we have γ_j = γ_k. Then, by Lemma 1, we have Q_{j,k}(x,x') ≤ 0 for any x' ∖ {x_j',x_k'}. Since δ_{j,k}(x) is a linear combination of Q_{j,k}(x,x')'s with non-negative coefficients, we can conclude that δ_{j,k}(x) ≤ 0.
  • For the case of s j s k > 0 and r ^ j = r ^ k = 1 , by replacing β j in κ 1 with α j , we can conclude with the same result.
  • Then, we consider the case of s_j ≥ s_k > 0, r̂_j = 1, and r̂_k = 0. We first notice that, for any x' ∖ {s_j',s_k'},
$$
\begin{aligned}
R_{j,k}(x,x') ={}& \big(p_k\alpha_j - (1-p_j)(1-\beta_k)\big)\, V(x'; s_j'=s_j+1; s_k'=0)\\
&+ \big((1-p_k)(1-\alpha_j) - p_j\beta_k\big)\, V(x'; s_j'=0; s_k'=s_k+1)\\
&+ \big((1-p_k)\alpha_j - (1-p_j)\beta_k\big)\, V(x'; s_j'=s_j+1; s_k'=s_k+1)\\
&+ \big(p_k(1-\alpha_j) - p_j(1-\beta_k)\big)\, V(x'; s_j'=0; s_k'=0).
\end{aligned}
$$
As users j and k are statistically identical, we have p_j = p_k and α_j < β_k. Leveraging Lemma 1, we have
$$ R_{j,k}(x,x') \le (\alpha_j + p_j - 1)\big( V(x'; s_j'=s_j+1; s_k'=0) - V(x'; s_j'=0; s_k'=s_k+1) \big). $$
Then, for any x' ∖ {x_j',x_k'},
$$ Q_{j,k}(x,x') \le \sum_{\hat r_j',\hat r_k'} P(\hat r_j')\,P(\hat r_k')\,\kappa_2\Big( V\big(x'; x_j'=(0,\hat r_j'); x_k'=(s_k+1,\hat r_k')\big) - V\big(x'; x_j'=(s_j+1,\hat r_j'); x_k'=(0,\hat r_k')\big) \Big), $$
where κ_2 = 1 − p_j − α_j > 0. As we did in the previous cases, we can leverage Lemmas 1 and 2 to conclude that Q_{j,k}(x,x') ≤ 0 for any x' ∖ {x_j',x_k'}. Consequently, δ_{j,k}(x) ≤ 0 in this case. The details are omitted for the sake of space.
Combined together, we conclude the proof of Property 5.

Appendix D. Proof of Corollary 2

We follow the same steps as in the proof of Lemma 1. To prove the corollary, it is sufficient to show that V(x_1) ≤ V(x_2) when s_1 < s_2 and r̂_1 = r̂_2. We use mathematical induction to prove the monotonicity. First of all, the base case (i.e., ν = 0) is true by initialization. We assume the corollary holds at iteration ν. Then, we want to examine whether it holds at iteration ν + 1. For the system with a single user, the update step reported in problem (5) can be simplified and rewritten as follows
$$ V_{\nu+1}(x) = \min_{a \in \{0,1\}} V_{\nu+1}^{a}(x), $$
where
$$ V_{\nu+1}^{a}(x) = C(x,a) - \theta + \sum_{\hat r'} P(\hat r') \sum_{s'} P_{s,s'}(a,\hat r)\, V_{\nu}(s',\hat r'), $$
and θ is the optimal value for M 1 ( λ , 1 ) . To prove the desired results, we distinguish between the following cases
  • We first consider the case of s_1 = 0 < s_2 and r̂_1 = r̂_2 = 0. When a = 1, we have
$$ V_{\nu+1}^{1}(x_1) = C(x_1,1) - \theta + \sum_{\hat r'} P(\hat r')\big( p\,V_{\nu}(1,\hat r') + (1-p)\,V_{\nu}(0,\hat r') \big), $$
$$ V_{\nu+1}^{1}(x_2) = C(x_2,1) - \theta + \sum_{\hat r'} P(\hat r')\big( \beta\,V_{\nu}(s_2+1,\hat r') + (1-\beta)\,V_{\nu}(0,\hat r') \big). $$
Subtracting the two expressions yields
$$ V_{\nu+1}^{1}(x_1) - V_{\nu+1}^{1}(x_2) \le C(x_1,1) - C(x_2,1) + \sum_{\hat r'} P(\hat r')\,(p-\beta)\big( V_{\nu}(1,\hat r') - V_{\nu}(0,\hat r') \big) \le 0. $$
    The inequalities hold since β > p , C ( x , a ) is non-decreasing in s, and Corollary 2 is true at iteration ν by assumption.
    For the case of a = 0 , we obtain
$$ V_{\nu+1}^{0}(x_1) = C(x_1,0) - \theta + \sum_{\hat r'} P(\hat r')\big( p\,V_{\nu}(1,\hat r') + (1-p)\,V_{\nu}(0,\hat r') \big), $$
$$ V_{\nu+1}^{0}(x_2) = C(x_2,0) - \theta + \sum_{\hat r'} P(\hat r')\big( (1-p)\,V_{\nu}(s_2+1,\hat r') + p\,V_{\nu}(0,\hat r') \big). $$
Therefore, when a = 0, we have
$$ V_{\nu+1}^{0}(x_1) - V_{\nu+1}^{0}(x_2) \le C(x_1,0) - C(x_2,0) + \sum_{\hat r'} P(\hat r')\,(2p-1)\big( V_{\nu}(1,\hat r') - V_{\nu}(0,\hat r') \big) \le 0. $$
The inequalities hold since 2p − 1 < 0, C(x,a) is non-decreasing in s, and Corollary 2 is true at iteration ν by assumption. Combining the two actions, we can see that V^a_{ν+1}(x_1) ≤ V^a_{ν+1}(x_2) for any feasible a. Then, by problem (A8), we can conclude that the corollary holds at iteration ν + 1 when s_1 = 0 < s_2 and r̂_1 = r̂_2 = 0.
  • When s 1 = 0 < s 2 and r ^ 1 = r ^ 2 = 1 , by replacing the β ’s in the above case with α ’s, we can achieve the same result.
  • When 0 < s 1 < s 2 and r ^ 1 = r ^ 2 , we notice that P s 1 , s 1 + 1 ( a , r ^ 1 ) = P s 2 , s 2 + 1 ( a , r ^ 2 ) and P s 1 , 0 ( a , r ^ 1 ) = P s 2 , 0 ( a , r ^ 2 ) . Then, leveraging the monotonicity of V ν ( x ) and C ( x , a ) , we can conclude with the same result.
Combining the three cases, we prove that the corollary holds at iteration ν + 1 of VIA. Therefore, the corollary holds at any iteration ν by mathematical induction. Since VIA is guaranteed to converge to the value function when ν → +∞, we can conclude our proof.

Appendix E. Proof of Proposition 1

We define ΔV(x) ≜ V^1(x) − V^0(x), where V^a(x) is the value function resulting from taking action a at state x. Then, V^a(x) can be calculated as follows
$$ V^{a}(x) = C(x,a) - \theta + \sum_{x' \in \mathcal{X}} P_{x,x'}(a)\, V(x'), $$
where θ is the optimal value for M 1 ( λ , 1 ) . Hence, the optimal action at state x can be fully characterized by the sign of Δ V ( x ) . More precisely, the optimal action at state x is a = 1 if Δ V ( x ) < 0 , and  a = 0 is optimal otherwise. To determine the sign of Δ V ( x ) for each state, we distinguish between the following cases
  • We first consider the state x = ( 0 , r ^ ) . Applying the results in Section 2.3 to problem (A9), we obtain
$$ V^{0}(0,\hat r) = -\theta + (1-\gamma)(1-p)V(0,0) + (1-\gamma)p\,V(1,0) + \gamma(1-p)V(0,1) + \gamma p\,V(1,1), $$
$$ V^{1}(0,\hat r) = \lambda + V^{0}(0,\hat r). $$
Therefore, ΔV(0,r̂) = λ ≥ 0. Thus, the optimal action at state (0,r̂) is a = 0.
  • Then, we consider the state x = ( s , 0 ) where s > 0 . Applying the results in Section 2.3 to Equation (A9), we obtain
$$ V^{0}(s,0) = f(s) - \theta + (1-\gamma)p\,V(0,0) + (1-\gamma)(1-p)V(s+1,0) + \gamma p\,V(0,1) + \gamma(1-p)V(s+1,1), $$
$$ V^{1}(s,0) = f(s) + \lambda - \theta + (1-\gamma)(1-\beta)V(0,0) + (1-\gamma)\beta\,V(s+1,0) + \gamma(1-\beta)V(0,1) + \gamma\beta\,V(s+1,1). $$
Then,
$$ \Delta V(s,0) = \lambda + p_e^0(1-2p)\,\omega, $$
where ω = (1−γ)[V(0,0) − V(s+1,0)] + γ[V(0,1) − V(s+1,1)] ≤ 0.
  • Finally, we consider the state x = ( s , 1 ) where s > 0 . Following the same trajectory, we have
$$ \Delta V(s,1) = \lambda + (1-p_e^1)(1-2p)\,\omega. $$
According to Corollary 2 and the fact that p < 0.5, we can see that ΔV(s,0) and ΔV(s,1) are both a constant λ plus a term that is non-increasing in s. As the time penalty function is unbounded, the value function must also be unbounded. Then, combining the three cases, we can conclude the following. For fixed r̂, there always exists a threshold n_{r̂} > 0 such that the optimal action at state (s,r̂) with s ≥ n_{r̂} is a = 1; otherwise, a = 0 is optimal. Since r̂ ∈ {0,1}, the optimal policy can be fully captured by the pair (n_0, n_1).
In the following, we determine the relationship between n 0 and n 1 . We have
$$ \Delta V(s,1) - \Delta V(s,0) = (1 - p_e^1 - p_e^0)(1-2p)\,\omega \le 0. $$
At the same time, for the threshold n_0, we know ΔV(n_0,0) < 0. Then, we have ΔV(n_0,1) ≤ ΔV(n_0,0) < 0. Combined with the fact that ΔV(s,r̂) is non-increasing in s, we can conclude that the ordering n_0 ≥ n_1 is true.

Appendix F. Proof of Proposition 2

We notice that the dynamic of AoII under threshold policy can be fully captured by a Discrete-Time Markov Chain (DTMC). Then, the expected AoII Δ ¯ n and the expected transmission rate ρ ¯ n under threshold policy n = ( n 0 , n 1 ) can be obtained from the stationary distribution of the induced DTMC. Let the states of the induced DTMC be the values of s. We recall that r ^ is an independent Bernoulli random variable with parameter γ . Combined with the results in Section 2.3, we can easily obtain the state transition probabilities of the induced DTMC, which are shown in Figure A1.
Figure A1. DTMC induced by the threshold policy n = (n_0, n_1). In the figure, c_1 = (1−γ)(1−p) + γα and c_2 = (1−γ)β + γα.
The balance equations of the induced DTMC are the following
$$ (1-p)\pi_0 + p\sum_{k=1}^{n_1-1}\pi_k + (1-c_1)\sum_{k=n_1}^{n_0-1}\pi_k + (1-c_2)\sum_{k=n_0}^{+\infty}\pi_k = \pi_0, $$
$$ p\,\pi_0 = \pi_1, $$
$$ (1-p)\,\pi_{k-1} = \pi_k \quad \text{for } 2 \le k \le n_1, $$
$$ c_1\,\pi_{k-1} = \pi_k \quad \text{for } n_1+1 \le k \le n_0, $$
$$ c_2\,\pi_{k-1} = \pi_k \quad \text{for } n_0+1 \le k, $$
$$ \sum_{k=0}^{+\infty}\pi_k = 1. $$
Then, we can easily solve the above system of linear equations. After some algebraic manipulation, we obtain the following
$$ \pi_0 = \left[\,2 + p(1-p)^{n_1-1}\left(\frac{1}{1-c_1} - \frac{1}{p} + c_1^{\,n_0-n_1}\Big(\frac{1}{1-c_2} - \frac{1}{1-c_1}\Big)\right)\right]^{-1}, $$
$$ \pi_k = p(1-p)^{k-1}\,\pi_0 \quad \text{for } 1 \le k \le n_1, $$
$$ \pi_k = p(1-p)^{n_1-1}\,c_1^{\,k-n_1}\,\pi_0 \quad \text{for } n_1+1 \le k \le n_0, $$
$$ \pi_k = p(1-p)^{n_1-1}\,c_1^{\,n_0-n_1}\,c_2^{\,k-n_0}\,\pi_0 \quad \text{for } n_0+1 \le k. $$
Equipped with the above results, we proceed with calculating Δ ¯ n and ρ ¯ n . According to problem (6a), the expected AoII is:
$$ \bar\Delta_{n} = \sum_{k=0}^{+\infty} f(k)\,\pi_k. $$
Substituting the expressions of π k ’s, we can get the expression of Δ ¯ n . Proposition 1 tells us the following.
  • For state ( s , r ^ ) where s < n 1 , it is optimal to stay idle (i.e., a = 0 ).
  • For state ( s , r ^ ) where n 1 s < n 0 , it is optimal to make a transmission attempt only when r ^ = 1 . We recall that r ^ is an independent Bernoulli random variable with parameter γ . Therefore, the expected proportion of time that the system is at state ( s , 1 ) is γ π s .
  • For state ( s , r ^ ) where s n 0 , it is optimal to make transmission attempt regardless of  r ^ .
Combined with problem (6b), we have
$$ \bar\rho_{n} = \gamma\sum_{k=n_1}^{n_0-1}\pi_k + \sum_{k=n_0}^{+\infty}\pi_k. $$
Substituting the expressions of π k ’s, we can obtain the closed-form expression of ρ ¯ n .
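The closed forms above are straightforward to evaluate numerically. The following Python sketch (our own illustration, not the paper's code) computes a truncated stationary distribution and the corresponding expected AoII and transmission rate for a threshold policy with 1 ≤ n_1 ≤ n_0; the truncation level k_max plays the same role as in the numerical section, so the reported AoII slightly undercounts the tail.

```python
import numpy as np

def stationary_distribution(n0, n1, p, gamma, alpha, beta, k_max=800):
    """Truncated stationary distribution of the DTMC induced by threshold policy (n0, n1)."""
    c1 = (1 - gamma) * (1 - p) + gamma * alpha
    c2 = (1 - gamma) * beta + gamma * alpha
    pi0 = 1.0 / (2 + p * (1 - p) ** (n1 - 1) * (1 / (1 - c1) - 1 / p
            + c1 ** (n0 - n1) * (1 / (1 - c2) - 1 / (1 - c1))))
    pi = np.empty(k_max + 1)
    pi[0] = pi0
    for k in range(1, k_max + 1):
        if k <= n1:
            pi[k] = p * (1 - p) ** (k - 1) * pi0
        elif k <= n0:
            pi[k] = p * (1 - p) ** (n1 - 1) * c1 ** (k - n1) * pi0
        else:
            pi[k] = p * (1 - p) ** (n1 - 1) * c1 ** (n0 - n1) * c2 ** (k - n0) * pi0
    return pi

def expected_aoii_and_rate(pi, n0, n1, gamma, f=lambda s: s):
    """Expected AoII and expected transmission rate under threshold policy (n0, n1);
    f must accept a NumPy array (vectorized time penalty)."""
    k = np.arange(len(pi))
    aoii = float(np.sum(f(k) * pi))
    rate = float(gamma * pi[n1:n0].sum() + pi[n0:].sum())
    return aoii, rate
```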

Appendix G. Proof of Proposition 4

We first tackle the Whittle’s indexes at state ( 0 , r ^ ) and ( s , 0 ) where s > 0 . To this end, we distinguish between the following cases
  • We first consider the state x = ( 0 , r ^ ) . By definition, Whittle’s index is the infimum λ such that V 0 ( x ) = V 1 ( x ) . According to (A10), we can conclude that W x = 0 when x = ( 0 , r ^ ) .
  • Then, we consider the state x = ( s , 0 ) where s > 0 . We recall that p e 0 = 0 . Then, we can conclude, from (A11), that W x = 0 for all x = ( s , 0 ) where s > 0 .
Now, we tackle the Whittle’s index at state x = (s,1) where s > 0. For convenience, we denote by W_n the Whittle’s index at state x = (n,1). According to the monotonicity of ΔV(x) shown in the proof of Proposition 1, we can conclude that threshold policy n = (+∞, n+1) is optimal when V^0(n,1) = V^1(n,1). Then, we can prove the following
Lemma A1.
When (9) is satisfied and V^0(n,1) = V^1(n,1), V(s,1) = V(s,0) ≜ V(s) for 0 ≤ s ≤ n.
Proof. 
Since the value function satisfies the Bellman equation, it is sufficient to show that V ( s , 1 ) and V ( s , 0 ) satisfy the same Bellman equation. We recall that the Bellman equation for V ( x ) is given by
$$ V(x) = \min_{a \in \{0,1\}} V^{a}(x), $$
where
$$ V^{a}(x) = C(x,a) - \theta + \sum_{x'} P_{x,x'}(a)\, V(x'), $$
and θ is the optimal value of the decoupled problem. We recall, from Corollary 3, that the optimal action at state (s,0) is staying idle (i.e., a = 0) for any s. We also know that threshold policy n = (+∞, n+1) is optimal when V^0(n,1) = V^1(n,1). Therefore, the optimal actions at states (s,0) and (s,1) where s ≤ n are the same (i.e., a = 0). Equivalently, we have
$$ V(s,\hat r) = V^{0}(s,\hat r), \quad \text{for } s \le n. $$
According to the system dynamic reported in Section 2.3, we know that the state transition probabilities are independent of r ^ when a = 0 . Meanwhile, r ^ does not affect the instant cost. Let x 1 = ( s , 1 ) and x 2 = ( s , 0 ) . Then, for any x , we have
$$ P_{x_1,x'}(0) = P_{x_2,x'}(0), $$
$$ C(x_1,0) = C(x_2,0). $$
Hence, according to (A12), we can see that V^0(s,0) = V^0(s,1) for any s ≤ n. Combined with problem (A13), we can conclude that V(s,0) = V(s,1) for any 0 ≤ s ≤ n.    ☐
By definition, Whittle’s index W n is the infimum λ such that V 0 ( n , 1 ) = V 1 ( n , 1 ) . In this case, according to Lemma A1, V ( 0 , 1 ) = V ( 0 , 0 ) = V ( 0 ) . Then, V 0 ( n , 1 ) and V 1 ( n , 1 ) can be written as
$$ V^{0}(n,1) = f(n) - \theta + p\,V(0) + (1-p)\big[(1-\gamma)V(n+1,0) + \gamma V(n+1,1)\big], $$
$$ V^{1}(n,1) = f(n) + W_n - \theta + (1-\alpha)V(0) + \alpha\big[(1-\gamma)V(n+1,0) + \gamma V(n+1,1)\big]. $$
Without loss of generality, we assume V(0) = 0. Then, equating the two expressions yields
$$ W_n = (1 - p - \alpha)\big(\gamma V(n+1,1) + (1-\gamma)V(n+1,0)\big). $$
Combining problems (A14) and (A15), we conclude that W n is
$$ W_n = \frac{(1 - p - \alpha)\big(V^{0}(n,1) + \theta - f(n)\big)}{1-p}. $$
Since the optimal action at state ( n , 1 ) is a = 0 , we have V 0 ( n , 1 ) = V ( n , 1 ) = V ( n ) . Finally, we obtain
$$ W_n = \frac{(1 - p - \alpha)\big(V(n) + \theta - f(n)\big)}{1-p}. $$
Now, we tackle the expression of V(n). When V^0(n,1) = V^1(n,1), the optimal action at state (s,r̂) where 0 ≤ s < n is staying idle. Then, leveraging Lemma A1, the value function V(s) where 0 ≤ s < n satisfies the following
$$ V(s) = \begin{cases} -\theta + f(0) + p\,V(1), & s = 0,\\[2pt] -\theta + f(s) + (1-p)\,V(s+1), & 0 < s < n. \end{cases} $$
By backward induction, we end up with the following equation for 0 < s < n.
$$ V(s) = -\frac{\theta\big(1-(1-p)^{n-s}\big)}{p} + \sum_{k=1}^{n-s} f(n-k)(1-p)^{n-s-k} + (1-p)^{n-s}\,V(n). $$
Letting s = 1 yields
$$ V(1) = -\frac{\theta\big(1-(1-p)^{n-1}\big)}{p} + \sum_{k=1}^{n-1} f(n-k)(1-p)^{n-1-k} + (1-p)^{n-1}\,V(n). $$
From problem (A17), V ( 1 ) also satisfies the following
$$ V(1) = \frac{\theta - f(0)}{p}. $$
Equating the two expressions of V ( 1 ) , we obtain
$$ V(n) = -\frac{f(0)}{p(1-p)^{n-1}} + \theta\left(\frac{2}{p(1-p)^{n-1}} - \frac{1}{p}\right) - \sum_{k=1}^{n-1} \frac{f(n-k)}{(1-p)^{k}}. $$
We recall that, when V^0(n,1) = V^1(n,1), threshold policy n = (+∞, n+1) is optimal and both actions at state x = (n,1) are equally desirable. Thus, threshold policy n = (+∞, n) is also optimal. Then, we know
$$ \theta = \bar\Delta_{n} + W_n\,\bar\rho_{n}, $$
where Δ̄_n and ρ̄_n are the expected AoII and the expected transmission rate under threshold policy n = (+∞, n), respectively. Finally, combining problems (A16), (A18) and (A19), we obtain
$$ W_n = \frac{-\dfrac{f(0)}{p(1-p)^{n}} + \bar\Delta_{n}\,\dfrac{2-(1-p)^{n}}{p(1-p)^{n}} - \dfrac{1}{(1-p)^{n}}\displaystyle\sum_{k=1}^{n} f(k)(1-p)^{k-1}}{\dfrac{1}{1-p-\alpha} - \bar\rho_{n}\,\dfrac{2-(1-p)^{n}}{p(1-p)^{n}}}. $$
After some algebraic manipulation, we have
$$ W_n = \frac{(1-c_1)\displaystyle\sum_{k=n+1}^{+\infty} f(k)\,c_1^{\,k-n-1} - \bar\Delta_{n}}{\dfrac{p}{1-p-\alpha} + \bar\rho_{n}}, $$
where c 1 = ( 1 γ ) ( 1 p ) + γ α .
In the following, we investigate some properties of Whittle’s index. First of all, W n is non-negative since 1 p α and V ( n + 1 , r ^ ) in (A15) are all non-negative. Meanwhile, combining (A15) with the fact that V ( n , r ^ ) is non-decreasing in n, we can verify that W n is non-decreasing in n. Combined with the Whittle’s indexes in two other cases (i.e., x = ( 0 , r ^ ) and x = ( s , 0 ) where s > 0 ), we can easily obtain the properties of W x as detailed in Proposition 4.

Appendix H. Proof of Proposition 5

We notice that M 1 ( λ , 1 ) coincides with the decoupled model studied in Section 4.2. When problem (9) is satisfied, the decoupled problem is indexable, and, according to Corollary 3, we only need to show that n is the optimal threshold for the states with r ^ = 1 . We first tackle the case of λ > 0 . To this end, we divide our discussion into the following cases
  • For state (s,1) where s < n, W_s ≤ λ by definition. As the problem is indexable, we have D(W_s) ⊆ D(λ). We recall that W_s ≜ min{λ ≥ 0 : V^0(s,1) = V^1(s,1)}. Equivalently, W_s = min{λ ≥ 0 : (s,1) ∈ D(λ)}. Then, we know (s,1) ∈ D(W_s). Combined together, we conclude that (s,1) ∈ D(λ). In other words, the optimal action at state (s,1) where s < n is to stay idle (i.e., a = 0).
  • For state (s,1) where s ≥ n, we first recall that W_s = min{λ ≥ 0 : (s,1) ∈ D(λ)}. Consequently, for any λ < W_s, we know (s,1) ∉ D(λ). Meanwhile, we have W_s ≥ W_n > λ by the monotonicity of Whittle’s index and the definition of n. Hence, we can conclude that (s,1) ∉ D(λ). In other words, the optimal action at state (s,1) where s ≥ n is to make the transmission attempt.
Then, we conclude that n is the optimal threshold for the states with r ^ = 1 when λ > 0 . In the case of λ = 0 , according to the proof of Proposition 1, we can easily verify that the optimal threshold is 1.

Appendix I. Proof of Theorem 2

We first make the following definitions. When M 1 ( λ , 1 ) is at state x and action a is taken, cost C 1 ( x , a ) f ( s ) and C 2 ( x , a ) λ a are incurred. We denote the expected C 1 -cost and the expected C 2 -cost under policy ϕ as C ¯ 1 ( ϕ ) and C ¯ 2 ( ϕ ) , respectively. Let G be a non-empty set of states. For the given state i, we define R * ( i , G ) as the class of policies ϕ , for which the following hold
  • The probability P ϕ ( x n G f o r s o m e n 1 | x 0 = i ) = 1 where x n is the state of M 1 ( λ , 1 ) at time n.
  • The expected time m i G ( ϕ ) of a first passage from i to G under ϕ is finite.
  • The expected C_1-cost C̄_1^{i,G}(ϕ) and the expected C_2-cost C̄_2^{i,G}(ϕ) of a first passage from i to G under ϕ are finite.
With the definitions in mind, we proceed with verifying the assumptions given in [27].
  • For all d > 0 , the set A ( d ) = { x | there exists an action a such that C 1 ( x , a ) + C 2 ( x , a ) d } is finite: For any state x, the cost satisfies C 1 ( x , a ) + C 2 ( x , a ) = f ( s ) + λ a f ( s ) . The equality holds when a = 0 . Then, the states in A ( d ) must satisfy f ( s ) d . Combined with the fact that f ( s ) is a non-decreasing and unbounded function when s N 0 , we can conclude that A ( d ) is finite.
  • There exists a stationary policy e such that the induced Markov chain has the following properties: the state space S consists of a single (non-empty) positive recurrent class R and a set U of transient states such that e R * ( i , R ) for i U . Moreover, both C ¯ 1 ( e ) and C ¯ 2 ( e ) on R are finite: We consider the policy under which the base station makes a transmission attempt at every time slot. According to the system dynamic detailed in Section 2.3, we can see that all the states communicate with state ( 0 , 0 ) and ( 0 , 0 ) communicates with all other states. Thus, the state space S consists of a single (non-empty) positive recurrent class and the set of transient states can simply be an empty set. C ¯ 1 ( e ) and C ¯ 2 ( e ) are trivially finite as we can verify using Proposition 2.
  • Given any two state x y , there exists a policy ϕ such that ϕ R * ( x , y ) : We notice that, under any policy, the maximum increase of s between two consecutive time slots is 1. Meanwhile, when s decreases, it decreases to zero. Combined with the fact that r ^ is an independent Bernoulli random variable, we can conclude that there always exists a path between any x and y with positive probability. m x y ( ϕ ) , C ¯ 1 x , y ( ϕ ) , and  C ¯ 2 x , y ( ϕ ) are trivially finite.
  • If a stationary policy ϕ has at least one positive recurrent state, then it has a single positive recurrent class R. Moreover, if  x = ( 0 , 0 ) R , then ϕ R * ( x , R ) : Given that r ^ is an independent Bernoulli random variable, we can easily conclude from the system dynamic that all the states communicate with state ( 0 , 0 ) and ( 0 , 0 ) communicates with all other states under any stationary policy. Therefore, any positive recurrent class must contain state ( 0 , 0 ) . Thus, there must have only one positive recurrent class which is R = S .
  • There exists a policy ϕ such that C ¯ 1 ( ϕ ) < and C ¯ 2 ( ϕ ) < K where K ( 0 , 1 ] : We notice that C ¯ 1 ( ϕ ) and C ¯ 2 ( ϕ ) are nothing but the expected AoII and the expected transmission rate achieved by ϕ , respectively. Then, we can easily verify that such policy exists using Proposition 2.
As the assumptions are verified, we proceed with introducing the optimal randomized policy for given λ . We say a policy is λ -optimal if the policy is optimal for M 1 ( λ , 1 ) . We consider two monotone sequences λ + n λ and λ n λ . Then, there exist subsequences of λ + n and λ n such that the corresponding sequences of optimal policies converge. Then, according to Lemma 3.7 of [27], the limit points, denoted by n λ + and n λ , are both λ -optimal. By Proposition 3.2 of [27], the Markov chains induced by n λ + and n λ both contain a single non-empty positive recurrent class and state ( 0 , 0 ) is positive recurrent in both induced Markov chains. Hence, the base station can choose which policy to follow each time the system reaches state ( 0 , 0 ) while keeping the resulting randomized policy λ -optimal as suggested by Lemma 3.9 of [27]. More precisely, we consider the following randomized policy: each time the system reaches state ( 0 , 0 ) , the base station will choose n λ with probability μ and n λ + with probability 1 μ . The chosen policy will be followed until the next choice. We denote such policy as n λ and conclude that n λ is λ -optimal under any μ [ 0 , 1 ] .

Appendix J. Proof of Proposition 6

The value function V ( x ) and V i ( x i ) must satisfy their own Bellman equations. More precisely
$$ V(x) + \theta = \min_{a \in \mathcal{A}_N(1)} \Big\{ C(x,a) + \sum_{x'} Pr(x' \mid x,a)\, V(x') \Big\}, $$
$$ V_i(x_i) + \theta_i = \min_{a_i \in \{0,1\}} \Big\{ C(x_i,a_i) + \sum_{x_i'} Pr(x_i' \mid x_i,a_i)\, V_i(x_i') \Big\}, $$
where θ and θ i are the optimal values of M N ( λ , 1 ) and M 1 i ( λ , 1 ) , respectively. We recall from Section 2.3 that the users are independent when action a and current state x are given. Thus
$$ Pr(x' \mid x,a) = \prod_{i=1}^{N} Pr(x_i' \mid x,a), $$
where x' = (x_1', ..., x_N'). Then, we have
$$ \sum_{x' \setminus \{x_i'\}} Pr\big(x' \setminus \{x_i'\} \mid x,a\big) = \sum_{x' \setminus \{x_i'\}} \prod_{j \ne i} Pr(x_j' \mid x,a) = 1. $$
We also recall from Section 2.3 that the state of user i depends only on its previous state and the action with respect to user i. Thus
$$ Pr(x_i' \mid x,a) = Pr(x_i' \mid x_i,a_i). $$
Combined together, we obtain
$$
\begin{aligned}
\sum_{i=1}^{N} \sum_{x_i'} Pr(x_i' \mid x_i,a_i)\, V_i(x_i')
&= \sum_{i=1}^{N} \sum_{x_i'} \sum_{x' \setminus \{x_i'\}} \prod_{j \ne i} Pr(x_j' \mid x,a)\; Pr(x_i' \mid x_i,a_i)\, V_i(x_i')\\
&= \sum_{i=1}^{N} \sum_{x_i'} \sum_{x' \setminus \{x_i'\}} \prod_{j=1}^{N} Pr(x_j' \mid x,a)\, V_i(x_i')\\
&= \sum_{x'} Pr(x' \mid x,a) \sum_{i=1}^{N} V_i(x_i').
\end{aligned}
$$
Then, we sum problem (A20) over all users which yields
$$ \sum_{i=1}^{N}\big(V_i(x_i) + \theta_i\big) = \min_{a}\,\sum_{i=1}^{N}\Big( C(x_i,a_i) + \sum_{x_i'} Pr(x_i' \mid x_i,a_i)\, V_i(x_i') \Big). $$
We recall that C ( x , a ) = i = 1 N C ( x i , a i ) by definition. Then, leveraging problem (A21), we obtain
$$ \sum_{i=1}^{N} V_i(x_i) + \sum_{i=1}^{N}\theta_i = \min_{a \in \mathcal{A}_N(1)}\Big\{ C(x,a) + \sum_{x'} Pr(x' \mid x,a) \sum_{i=1}^{N} V_i(x_i') \Big\}. $$
Since the solution to the Bellman equation is unique [21], we must have i = 1 N V i ( x i ) = V ( x ) and i = 1 N θ i = θ . Then, we can conclude that it is optimal for M N ( λ , 1 ) if each user adopts its own optimal policy.

Appendix K. Proof of Theorem 3

In this proof, we say a policy is λ*-optimal if it is optimal for M_N(λ*, 1). In Section 4.2, we ensure that, for each user, there exists at least one threshold policy that yields a finite expected AoII. Therefore, we can conclude that, for RP, there exists at least one policy under which the expected AoII and the expected transmission rate are both finite. Then, according to Lemma 3.10 of [27], a policy is optimal for RP if
  • It is λ * -optimal;
  • The resulting expected transmission rate is equal to M.
We first construct a policy ϕ λ * that is λ * -optimal. We recall from Proposition 6 that a policy is λ * -optimal if it consists of the optimal policies for each M 1 i ( λ * , 1 ) where 1 i N . According to Theorem 2, for any i, there exist n λ * , i and n λ + * , i that are both optimal for M 1 i ( λ * , 1 ) . Then, we can construct the policy ϕ λ * in the following way.
  • For user i with n λ * , i = n λ + * , i n λ * , i , the threshold policy n λ * , i is used. Then, the deterministic policy n λ * , i is optimal for M 1 i ( λ * , 1 ) and
$$ \bar\rho_i(\lambda^*) = \bar\rho_i(\lambda^*_-) = \bar\rho_i(\lambda^*_+). $$
    In this case, the choice of μ i makes no difference.
  • For user i with n λ * , i n λ + * , i , the randomized policy n λ * , i as detailed in Theorem 2 is used. Then, for any μ i [ 0 , 1 ] , the randomized policy n λ * , i is optimal for M 1 i ( λ * , 1 )  and
$$ \bar\rho_i(\lambda^*) = \mu_i\,\bar\rho_i(\lambda^*_-) + (1-\mu_i)\,\bar\rho_i(\lambda^*_+). $$
Combining the two cases, we conclude that ϕ_{λ*} = [n_{λ*,1}, ..., n_{λ*,N}] is λ*-optimal under any μ_i ∈ [0,1]. Hence, as long as the chosen μ_i's realize Σ_{i=1}^N ρ̄_i(λ*) = M, we can conclude that the randomized policy ϕ_{λ*} is optimal for RP.

Appendix L. Proof of Proposition 8

We notice that M 1 i ( λ * , 1 ) coincides with the decoupled model studied in Section 4.2. Therefore, we can use the results in Section 4.2 to prove the properties. Since the users share the same structure, we ignore the user index i for simplicity. According to the definition of I x , we have
$$ I_x = \sum_{x'} P_{x,x'}(0)\,V(x') - \sum_{x'} P_{x,x'}(1)\,V(x') - \lambda^* = -\Delta V(x). $$
Leveraging the results in the proof of Proposition 1, we have the following
  • For state x = (0,r̂), I_x = −λ*.
  • For state x = (s,0) where s > 0, I_x = −λ* − p_e^0(1−2p)ω, where ω = (1−γ)[V(0,0) − V(s+1,0)] + γ[V(0,1) − V(s+1,1)] ≤ 0.
  • For state x = (s,1) where s > 0, I_x = −λ* − (1−p_e^1)(1−2p)ω.
From the above three cases, we can easily conclude that I_x ≥ −λ* and the equality holds when r̂ = p_e^0 = 0 or s = 0. As is proven in Corollary 2, V(x) is non-decreasing in s. Hence, we can conclude that I_x is also non-decreasing in s. To show that I_x is monotone in r̂, we consider two states x_1 = (s,1) and x_2 = (s,0). Then, we have
$$ I_{x_2} - I_{x_1} = \Delta V(s,1) - \Delta V(s,0) = (1 - p_e^1 - p_e^0)(1-2p)\,\omega \le 0. $$
Therefore, we can conclude that I x is non-decreasing in r ^ .

Appendix M

Algorithm A1 Improved Relative Value Iteration
Require:
  MDP M = (X, P, A, C)
  Convergence criterion ϵ
procedure RelativeValueIteration(M, ϵ)
  Initialize V_0(x) = 0 for all x ∈ X; ν = 0
  Choose x_ref ∈ X arbitrarily
  while V_ν has not converged (RVI converges when the maximum difference between the results of two consecutive iterations is less than ϵ) do
    for x = (s, r̂) ∈ X do
      if ∃ active state (s_1, r̂_1) s.t. s_1 ≤ s and r̂_1 ≤ r̂ then
        a*(x) = 1
        Q_{ν+1}(x) = C(x, 1) + Σ_{x'} P_{x x'}(1) V_ν(x')
      else
        for a ∈ A do
          H_{x,a} = C(x, a) + Σ_{x'} P_{x x'}(a) V_ν(x')
        a*(x) = argmin_a H_{x,a}
        Q_{ν+1}(x) = H_{x, a*(x)}
    for x ∈ X do
      V_{ν+1}(x) = Q_{ν+1}(x) − Q_{ν+1}(x_ref)
    ν = ν + 1
  return the policy a*(·) (equivalently, the corresponding threshold pair n)
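For readers who prefer code, a plain (non-improved) relative value iteration for a finite-state average-cost MDP can be sketched in Python as follows. The structural shortcut of Algorithm A1, which immediately activates states dominated by an already-active state, is omitted here, and the array layout is our own convention.

```python
import numpy as np

def relative_value_iteration(P, C, eps=0.01, ref=0, max_iter=100_000):
    """Textbook relative value iteration.

    P : array of shape (A, S, S), P[a, x, x'] = transition probability under action a
    C : array of shape (S, A),    C[x, a]     = instantaneous cost
    Returns the relative value function V, the average-cost estimate theta,
    and a greedy policy (one action index per state).
    """
    S = C.shape[0]
    V = np.zeros(S)
    for _ in range(max_iter):
        # Q[x, a] = C[x, a] + sum_x' P[a, x, x'] * V[x']
        Q = C + np.einsum('axy,y->xa', P, V)
        Q_min = Q.min(axis=1)
        V_new = Q_min - Q_min[ref]          # subtract the reference state's value
        if np.max(np.abs(V_new - V)) < eps:
            V = V_new
            break
        V = V_new
    theta = Q_min[ref]                       # gain estimate at convergence
    policy = Q.argmin(axis=1)
    return V, theta, policy
```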
Algorithm A2 Bisection Search
Require:
  Maximum updates per transmission attempt M
  MDP M_N(λ, 1) = (X^N, A_N(1), P_N, C_N(λ))
  Tolerance ξ
  Convergence criterion ϵ
procedure BisectionSearch(M_N(λ, 1), M, ξ, ϵ)
  Initialize λ_- = 0; λ_+ = 1
  ϕ_{λ_+} ← RelativeValueIteration(M_N(λ_+, 1), ϵ) using Section 5.1 and Proposition 6
  ρ̄(λ_+) ← ϕ_{λ_+} using Proposition 2
  while ρ̄(λ_+) > M do
    λ_- = λ_+; λ_+ = 2λ_+
    ϕ_{λ_+} ← RelativeValueIteration(M_N(λ_+, 1), ϵ) using Section 5.1 and Proposition 6
    ρ̄(λ_+) ← ϕ_{λ_+} using Proposition 2
  while λ_+ − λ_- ≥ 2ξ do
    λ = (λ_+ + λ_-) / 2
    ϕ_λ ← RelativeValueIteration(M_N(λ, 1), ϵ) using Section 5.1 and Proposition 6
    ρ̄(λ) ← ϕ_λ using Proposition 2
    if ρ̄(λ) > M then
      λ_- = λ
    else
      λ_+ = λ
  return (λ_+^*, λ_-^*) ← (λ_+, λ_-)
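A compact Python version of the same search, assuming only a routine expected_rate(λ) that returns the expected transmission rate of the λ-optimal policy (obtained, e.g., via RVI and Proposition 2) and that this rate is non-increasing in λ, could look as follows; expected_rate is a placeholder, not part of the paper's code.

```python
def bisection_search(expected_rate, M, xi=0.005):
    """Bisection on the Lagrange multiplier lambda.

    expected_rate(lam) -> expected transmission rate of the lambda-optimal policy
        (assumed non-increasing in lambda).
    Returns (lam_plus, lam_minus) bracketing lambda* within 2*xi.
    """
    lam_minus, lam_plus = 0.0, 1.0
    while expected_rate(lam_plus) > M:       # grow the bracket until the rate drops to M or below
        lam_minus, lam_plus = lam_plus, 2.0 * lam_plus
    while lam_plus - lam_minus >= 2.0 * xi:
        lam = 0.5 * (lam_plus + lam_minus)
        if expected_rate(lam) > M:
            lam_minus = lam
        else:
            lam_plus = lam
    return lam_plus, lam_minus
```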

References

  1. Maatouk, A.; Kriouile, S.; Assaad, M.; Ephremides, A. The age of incorrect information: A new performance metric for status updates. IEEE/ACM Trans. Netw. 2020, 28, 2215–2228.
  2. Uysal, E.; Kaya, O.; Ephremides, A.; Gross, J.; Codreanu, M.; Popovski, P.; Assaad, M.; Liva, G.; Munari, A.; Soleymani, T.; et al. Semantic communications in networked systems. arXiv 2021, arXiv:2103.05391.
  3. Kam, C.; Kompella, S.; Ephremides, A. Age of incorrect information for remote estimation of a binary markov source. In Proceedings of the IEEE INFOCOM 2020-IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS), Toronto, ON, Canada, 6–9 July 2020; pp. 1–6.
  4. Maatouk, A.; Assaad, M.; Ephremides, A. The age of incorrect information: An enabler of semantics-empowered communication. arXiv 2020, arXiv:2012.13214.
  5. Chen, Y.; Ephremides, A. Minimizing Age of Incorrect Information for Unreliable Channel with Power Constraint. arXiv 2021, arXiv:2101.08908.
  6. Kriouile, S.; Assaad, M. Minimizing the Age of Incorrect Information for Real-time Tracking of Markov Remote Sources. arXiv 2021, arXiv:2102.03245.
  7. Kadota, I.; Sinha, A.; Uysal-Biyikoglu, E.; Singh, R.; Modiano, E. Scheduling policies for minimizing age of information in broadcast wireless networks. IEEE/ACM Trans. Netw. 2018, 26, 2637–2650.
  8. Hsu, Y.P. Age of information: Whittle index for scheduling stochastic arrivals. In Proceedings of the 2018 IEEE International Symposium on Information Theory (ISIT), Vail, CO, USA, 17–22 June 2018; pp. 2634–2638.
  9. Tripathi, V.; Modiano, E. A whittle index approach to minimizing functions of age of information. In Proceedings of the 2019 57th Annual Allerton Conference on Communication, Control, and Computing (Allerton), Monticello, IL, USA, 24–27 September 2019; pp. 1160–1167.
  10. Maatouk, A.; Kriouile, S.; Assad, M.; Ephremides, A. On the optimality of the Whittle’s index policy for minimizing the age of information. IEEE Trans. Wirel. Commun. 2020, 20, 1263–1277.
  11. Sun, J.; Jiang, Z.; Krishnamachari, B.; Zhou, S.; Niu, Z. Closed-form Whittle’s index-enabled random access for timely status update. IEEE Trans. Commun. 2019, 68, 1538–1551.
  12. Nguyen, G.D.; Kompella, S.; Kam, C.; Wieselthier, J.E. Information freshness over a Markov channel: The effect of channel state information. Ad Hoc Networks 2019, 86, 63–71.
  13. Talak, R.; Karaman, S.; Modiano, E. Optimizing age of information in wireless networks with perfect channel state information. In Proceedings of the 2018 16th International Symposium on Modeling and Optimization in Mobile, Ad Hoc, and Wireless Networks (WiOpt), Shanghai, China, 7–11 May 2018; pp. 1–8.
  14. Shi, L.; Cheng, P.; Chen, J. Optimal periodic sensor scheduling with limited resources. IEEE Trans. Autom. Control 2011, 56, 2190–2195.
  15. Leong, A.S.; Dey, S.; Quevedo, D.E. Sensor scheduling in variance based event triggered estimation with packet drops. IEEE Trans. Autom. Control 2016, 62, 1880–1895.
  16. Mo, Y.; Garone, E.; Casavola, A.; Sinopoli, B. Stochastic sensor scheduling for energy constrained estimation in multi-hop wireless sensor networks. IEEE Trans. Autom. Control 2011, 56, 2489–2495.
  17. Kaul, S.; Yates, R.; Gruteser, M. Real-time status: How often should one update? In Proceedings of the 2012 Proceedings IEEE INFOCOM, Orlando, FL, USA, 25–30 March 2012; pp. 2731–2735.
  18. Leong, A.S.; Ramaswamy, A.; Quevedo, D.E.; Karl, H.; Shi, L. Deep reinforcement learning for wireless sensor scheduling in cyber–physical systems. Automatica 2020, 113, 108759.
  19. Wang, J.; Ren, X.; Mo, Y.; Shi, L. Whittle index policy for dynamic multichannel allocation in remote state estimation. IEEE Trans. Autom. Control 2019, 65, 591–603.
  20. Gittins, J.; Glazebrook, K.; Weber, R. Multi-Armed Bandit Allocation Indices; John Wiley & Sons: Hoboken, NJ, USA, 2011.
  21. Russell, S.; Norvig, P. Artificial Intelligence: A Modern Approach, 3rd ed.; Prentice Hall Press: Hoboken, NJ, USA, 2009.
  22. Whittle, P. Restless bandits: Activity allocation in a changing world. J. Appl. Probab. 1988, 25, 287–298.
  23. Weber, R.R.; Weiss, G. On an index policy for restless bandits. J. Appl. Probab. 1990, 27, 637–648.
  24. Glazebrook, K.D.; Ruiz-Hernandez, D.; Kirkbride, C. Some indexable families of restless bandit problems. Adv. Appl. Probab. 2006, 38, 643–672.
  25. Larrañaga, M. Dynamic Control of Stochastic and Fluid Resource-Sharing Systems. Ph.D. Thesis, Université de Toulouse, Toulouse, France, 2015.
  26. Sennott, L.I. On computing average cost optimal policies with application to routing to parallel queues. Math. Methods Oper. Res. 1997, 45, 45–62.
  27. Sennott, L.I. Constrained average cost Markov decision chains. Probab. Eng. Inf. Sci. 1993, 7, 69–83.
  28. Bertsimas, D.; Niño-Mora, J. Restless bandits, linear programming relaxations, and a primal-dual index heuristic. Oper. Res. 2000, 48, 80–90.
  29. Littman, M.L.; Dean, T.L.; Kaelbling, L.P. On the complexity of solving Markov decision problems. arXiv 2013, arXiv:1302.4971.
  30. Verloop, I.M. Asymptotically optimal priority policies for indexable and nonindexable restless bandits. Ann. Appl. Probab. 2016, 26, 1947–1995.
Figure 1. The structure of the communication model.
Figure 2. A sample path of $s_t$.
Figure 3. Performance when the source processes vary. We choose $p_i = 0.05 + 0.4(i-1)/(N-1)$, $f_i(s) = s$, $\gamma_i = 0.6$, $p_{e,i}^0 = p_e^0$, and $p_{e,i}^1 = 0.1$ for $1 \le i \le N$.
Figure 4. Performance when the communication goals vary. We choose $f_i(s) = s^{0.5 + (i-1)/(N-1)}$, $p_i = 0.3$, $\gamma_i = 0.6$, $p_{e,i}^0 = p_e^0$, and $p_{e,i}^1 = 0.1$ for $1 \le i \le N$.
Figure 5. Performance in systems with random parameters when $N = 5$. The parameters for each user are chosen randomly within the following intervals: $\gamma \in [0, 1]$, $p \in [0.05, 0.45]$, $p_e^0 \in I$, $p_e^1 \in [0, 0.45]$, and $f(s) = s^{\tau}$ where $\tau \in [0.5, 1.5]$.
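For readers who wish to recreate experimental setups in the spirit of Figures 3 and 5, the snippet below is a minimal, purely illustrative Python sketch of how the per-user parameters described in the captions could be generated. The function names, the numpy-based sampling, and the `pe0_interval` argument (a stand-in for the interval $I$ in the Figure 5 caption) are assumptions of this sketch, not the authors' code.

```python
import numpy as np

def linear_sweep_parameters(N, pe0):
    """Figure 3-style setup (illustrative): p_i sweeps linearly from 0.05 to 0.45
    across the N users, f_i(s) = s, and the error probabilities are fixed."""
    p = np.array([0.05 + 0.4 * (i - 1) / (N - 1) for i in range(1, N + 1)])
    gamma = np.full(N, 0.6)                 # source dynamics parameter
    p_e0 = np.full(N, pe0)                  # p_e^0, the quantity varied on the x-axis
    p_e1 = np.full(N, 0.1)                  # p_e^1
    f = [(lambda s: s) for _ in range(N)]   # linear time penalty for every user
    return p, gamma, p_e0, p_e1, f

def random_parameters(N, pe0_interval, rng=None):
    """Figure 5-style setup (illustrative): each user's parameters are drawn
    uniformly from the intervals in the caption; pe0_interval stands in for I."""
    rng = rng or np.random.default_rng()
    gamma = rng.uniform(0.0, 1.0, N)
    p = rng.uniform(0.05, 0.45, N)
    p_e0 = rng.uniform(pe0_interval[0], pe0_interval[1], N)
    p_e1 = rng.uniform(0.0, 0.45, N)
    tau = rng.uniform(0.5, 1.5, N)
    f = [(lambda s, t=t: s ** t) for t in tau]  # f_i(s) = s^tau_i
    return gamma, p, p_e0, p_e1, tau, f
```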
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
