A Derivative-Incorporated Adaptive Gradient Method for Federated Learning

Gao, Huimin; Wu, Qingtao; Cao, Hongyan; Zhao, Xuhui; Zhu, Junlong; Zhang, Mingchuan

doi:10.3390/math11153403

Open AccessArticle

A Derivative-Incorporated Adaptive Gradient Method for Federated Learning

by

Huimin Gao

¹,

Qingtao Wu

¹

,

Hongyan Cao

^2,*,

Xuhui Zhao

¹,

Junlong Zhu

¹ and

Mingchuan Zhang

¹

College of Information Engineering, Henan University of Science and Technology, Luoyang 471023, China

²

China Research Institute of Radiowave Propagation, Qingdao 266107, China

^*

Author to whom correspondence should be addressed.

Mathematics 2023, 11(15), 3403; https://0-doi-org.brum.beds.ac.uk/10.3390/math11153403

Submission received: 11 June 2023 / Revised: 28 July 2023 / Accepted: 2 August 2023 / Published: 4 August 2023

(This article belongs to the Section Mathematics and Computer Science)

Download

Browse Figures

Versions Notes

Abstract

:

As a new machine learning technique, federated learning has received more attention in recent years, which enables decentralized model training across data silos or edge intelligent devices in the Internet of Things without exchanging local raw data. All kinds of algorithms are proposed to solve the challenges in federated learning. However, most of these methods are based on stochastic gradient descent, which undergoes slow convergence and unstable performance during the training stage. In this paper, we propose a differential adaptive federated optimization method, which incorporates an adaptive learning rate and the gradient difference into the iteration rule of the global model. We further adopt the first-order moment estimation to compute the approximate value of the differential term so as to avoid amplifying the random noise from the input data sample. The theoretical convergence guarantee is established for our proposed method in a stochastic non-convex setting under full client participation and partial client participation cases. Experiments for the image classification task are performed on two standard datasets by training a neural network model, and experiment results on different baselines demonstrate the effectiveness of our proposed method.

Keywords:

federated learning; stochastic gradient descent; learning rate; differential

MSC:

68T07

1. Introduction

With huge amounts of data being generated at an increasing rate, the cost of collecting and handling data is becoming more expensive by a centralized data center. Therefore, collaborative learning [1,2] has gained more attention due to its ability to unite the collective power of devices distributed in different geographic locations. However, the exchange of information among devices may cause the risk of data leakage during the process of training. As a promising solution, federated learning [3,4,5] was proposed to train models across edge smart devices in the Internet of Things without exchanging local raw data, which can balance the need for data privacy and the benefits of collaborative machine learning in a distributed environment. In the federated learning setting, the central server coordinates multiple clients to jointly train a global model from amounts of dispersive local data on different clients with periodic model aggregation. Since there is no exchange and sharing of data between clients and the central server during the training process, federated learning protects the privacy of clients to some extent. Furthermore, the communication manner with the server after multiple local update steps reduces the communication frequency and saves the limited communication resources. Benefitting from the decentralized learning based on local data, federated learning has been widely applied in a mobile edge network [6], healthcare [7], vehicles [8], and other fields.

Federated Averaging (FedAvg) [4] is one of the most prevailing algorithms for federated learning where a subset of clients conduct multiple stochastic gradient update steps based on their local data, and then the updates are sent back to the centralized server and averaged. FedAvg degenerates into the local stochastic gradient descent algorithm [9] when the data on different clients follow the same distribution. Despite gaining lots of attention, the practical implementation of FedAvg faces many challenges. The algorithm suffers from model drift, which hinders it converging to the optimal value of the objective function. The work of [10] generalized FedAvg and proposed FedProx to tackle the statistical and systems heterogeneity problems. In [11], the authors introduced control variates to adjust the model bias across clients, which was caused by the non-identically distributed data, but this required clients to remain stateful, which is unpractical. In addition, several works introduced the momentum technique to improve the convergence performance of stochastic gradient-based methods, which used the accumulated gradient information of previous iterations in the current step. Liu et al. [12] proposed a momentum federated learning method in a deterministic setting. The literature [13] applied a discounted global momentum to embed local iterations so as to accelerate learning and control model drift. In fact, the accumulated stale information may slow down the response to changes toward the update direction, which leads to the models continuing to update in the wrong direction, even though the gradient in the current step is already pointing in the right direction. This phenomenon is called overshoot in the domain of control, which manifests as oscillation and consumes more training time and resources.

Proportional–integral–derivative (PID) control is one of the most widely used control methods in the field of industrial control. The basic idea of PID control is making use of the current, past, and future trend information between the actual system output and the desired output. In order to settle the overshoot problem based on the stochastic gradient method, the study [14] that originated from the PID method proposed an integral–derivative approach which introduces the derivative or difference of the gradient in the training process. And the study [15] proposed a complete PID optimizer to stabilize the training process. On the other hand, due to the advantages of less parameter adjustment and better convergence speed, adaptive-type methods and variants [16,17,18] have been used in federated learning settings and obtained good results. Inspired by these observations, we focus on both the descent direction and the learning rate for performance improvement in the design of an optimization algorithm based on stochastic gradient. We introduce the derivative information to the update rule of a federated model, which represents the future trend information of the gradient change. With reference to the ongoing debate on model-centric AI and data-centric AI, our work belongs to the category of model-centric AI, highlighting the significance of further developing model-centric approaches [19]. In a nutshell, the contributions are elaborated on as follows.

We develop a federated optimization method, named FedDA, which utilizes the trend of gradient change and adaptive learning rate to update the federated global model.
We provide the convergence analysis for the proposed algorithm. It achieves the convergence rate of $O (1 / \sqrt{n T H})$ for full client participation and $O (1 / \sqrt{T} \sqrt[3]{m})$ for partial client participation in a stochastic non-convex setting, where n, m represent the number of clients, H is the number of local updates and T denotes the number of communications.
We conduct the image classification tasks on a standard dataset to compare FedDA and other basineline methods. The experiment results show the effectiveness of our proposed algorithm.

The rest of this paper is organized as follows. We review the related literature in Section 2. Next, the optimization problem and algorithm design are presented in Section 3. Then, the helpful assumptions and theoretical results are displayed in Section 4. The convergence guarantee is built in detailed in Section 5. We compare the proposed algorithm with several baseline methods by extensive experiments in Section 6. We conclude the article in Section 7.

N o t a t i o n .

In this paper, we use

[n]

to denote the set

{1, 2, ∆, n}

for a positive integer n and

S_{t}

is the clients set with size

∣ S_{t} ∣ = m

for

m \in (0, n]

in the communication round t. We use

{∥ \cdot ∥}_{1}

and

∥ \cdot ∥

to represent the

ℓ_{1}

and

ℓ_{2}

norm, respectively. Let

{[\cdot]}_{i, t}^{h}

denote the model parameter of the h-th local update step of the t-th communication round for client i.

2. Related Work

Stochastic gradient descent (SGD) [20] has been extensively applied in training large-scale machine learning models because of only choosing a mini-batch data sample or even a sample for each update round for the past few years. In order to reduce the influence of noise on stochastic gradient, the momentum technique [21,22] was introduced to correct the gradient direction. The literature [23] trained deep recurrent neural networks with momentum and strong initialization. Although its advantages include its simplicity and ease of implementation, the stochastic gradient-type methods depend on the selection of hyperparameters and show a heavy-tailed behavior from empirical observations. Adaptive optimization methods and variants have been studied and applied for their good practical performance in neural network training, such as Adagrad [24], Adam [16] and AMSGrad [25]. Xie et al. [26] proposed a new variant of Adam by reformulating the Nesterov update schemes for accelerate deep neural network training. These methods generally utilize the adaptive learning rate and momentum technique, which adaptively adjust learning rates for each coordinate of model parameters and enjoy excellent convergence performance. Recently, Recht [27] has revealed the connection between PID feedback control and stochastic optimization algorithms, and the former can be viewed as an optimizer that encapsulates gradient and momentum. An et al. [14] proposed a PID controller approach to solve the overshoot problem, which added the gradient difference in the current update step to predict the change trend of the updating direction. The work of [15] proposed a complete PID algorithm by increasing the effect of current information to relieve the oscillations during the training stage. The idea of adaptive PID control [28] permeates into the stochastic optimization by taking advantage of the current, the past and predicted future information of the gradient throughout the training process. These methods are implemented in a centralized manner.

As the size of the data samples increases, distributed training has gained more attention. Despite the great advantage, it incurs privacy issues and a communication bottleneck caused by the bandwidth limits and network delays. FedAvg [4] was proposed by incorporating a local gradient update and model averaging periodically, significantly reducing the communication overhead. After that, many variants of FedAvg were developed and proposed. The convergent upper bound of FedAvg was established by [9] and improved later in [29] under different data distribution cases. Li et al. [10] introduced a proximal term and proposed the FedProx framework to solve the heterogeneity challenge. Karimireddy et al. [11] made use of the control variable to adjust the model bias across clients. Guo et al. [30] presented hybrid local SGD by means of dividing the clients into different groups to achieve a good balance between model accuracy and runtime. Most of these works used stochastic gradient descent to update local models. Some momentum-based approaches are extended to the federated learning setting. Nesterov momentum [31] was used to accelerate learning when the objective function is convex. The work of [32] added the momentum to both the server model aggregation and local model update in order to reduce high variance, which improved the iteration complexity of federated learning. More recently, the studies incorporating momentum and an adaptive technique have emerged in a federated learning setting. FedAdam [17] was first proposed as a federated version of the adaptive gradient-type method with satisfactory performance. Several adaptive gradients methods with calibrations were studied by [33]. Local AMSGrad [34] considers the adaptiveness in the local model, while FedCAMS [18] utilizes the adaptive gradient in the update of the global model. FedExP [35] was developed by using varying pseudo-gradients to adaptively decide the step size of the server. Inspired by the above studies, in this article, we introduce the gradient difference and adaptive learning rate to the server side during the update of the global model and analyze its convergence performance.

3. Problem Formulation and Algorithm Design

A typical federated learning system with n clients or devices is considered where a global model is jointly trained under the orchestration of a central server by decentralized data. The main goal for n devices in a federated learning system is to settle an optimization problem, which can be formulated as:

\begin{matrix} min_{x \in R^{d}} f (x) & = \frac{1}{n} \sum_{i = 1}^{n} f_{i} (x), \end{matrix}

(1)

where

f_{i} (x) ≜ E_{ξ_{i} \sim D_{i}} f_{i} (x; ξ_{i})

denotes the local cost function for the i-th client, which assesses the average discrepancy between the learning model and the ground truth associated with a random data sample

ξ_{i}

that draws distribution

D_{i}

. n is the number of decentralized clients, and d denotes the dimensionality of the model parameters. In practice, the dataset belonging to each client is often generated based on its own local environment, which means the distribution

D_{i}, D_{j}

could be different, i.e.,

D_{i} \neq D_{j}

, when

i \neq j

.

In this section, we present a differential adaptive federated optimization algorithm (FedDA) to solve problem (1), where the central server updates the global model by adopting the differential of the pseudo-gradient and an adaptive learning rate. The pseudo-code of FedDA is summarized in detailed in Algorithm 1. Specifically, at the beginning of communication round t, the centralized server randomly selects a subset

S_{t}

of clients and sends the current global model

x_{t}

to each client i,

i \in S_{t}

. After conducting H stochastic gradient descent steps with the local learning rate as line 6 in Algorithm 1, each client i acquires the local model parameters

x_{i, t}^{H}

and sends its model update to the central server after computing the model difference

Γ_{i, t}

, i.e.,

Γ_{i, t} = x_{i, t}^{H} - x_{t}

, where

x_{t}

is the global model. In fact, by recursion on the local update rule in line 6,

Γ_{i, t}

is the cumulative gradient for the client i. Hence, the aggregated difference

Γ_{t}

as defined in line 10 of Algorithm 1 can be regarded as the pseudo-gradient to update the global model in the sever. Then, based on

Γ_{t}

, the central server computes the momentum term

m_{t}

and

v_{t}

following AMSGrad in [25] where

{\hat{v}}_{t}

is designed to be non-decreasing, avoiding the divergence of adaptive methods, and

β_{1}

and

β_{1}

are momentum factors.

Algorithm 1 Federated Differential Adaptive Optimization Algorithm (FedDA)

Input: Initial point

x_{0}

, learning rate

η_{l}

and

η

,

m_{0} = 0

,

v_{0} = 0

,

d_{0} = 0

.

1:: for $t = 1, \dots, T$ do
2:: for each client $i \in S_{t}$ in parallel do
3:: Set $x_{i, t}^{0} = x_{t}$
4:: for $h = 0, \dots, H - 1$ do
5:: Compute stochastic gradient $g_{i, t}^{h} = \nabla F_{i} (x_{i, t}^{h}, ξ_{i, t}^{h})$
6:: Local update: $x_{i, t}^{h + 1} = x_{i, t}^{h} - η_{l} g_{i, t}^{h}$
7:: end for
8:: Let $Γ_{i, t} = x_{i, t}^{H} - x_{t}$ and upload $Γ_{i, t}$ to the server
9:: end for
10:: Compute $Γ_{t} = \frac{1}{| S_{t} |} \sum_{i \in S_{t}} Γ_{i, t}$ in the server
11:: Update $m_{t} = β_{1} m_{t - 1} + (1 - β_{1}) Γ_{t}$
12:: Update $v_{t} = β_{2} v_{t - 1} + (1 - β_{2}) Γ_{t}^{2}$
13:: Let ${\hat{v}}_{t} = max {{\hat{v}}_{t - 1}, v_{t}}$ and ${\hat{Γ}}_{t} = \frac{Γ_{t}}{\sqrt{{\hat{v}}_{t}} + ϵ}$
14:: Update $d_{t} = β_{1} d_{t - 1} + (1 - β_{1}) ({\hat{Γ}}_{t} - {\hat{Γ}}_{t - 1})$
15:: Update the global model in the server:
$x_{t + 1} = x_{t} + η \frac{m_{t}}{\sqrt{{\hat{v}}_{t}} + ϵ} + ν d_{t}$
16:: Broadcast $x_{t + 1}$ to each client
17:: end for

In particular, we add the

d_{t}

term to the update rule of the global model. It denotes the changing trend of gradient and can be interpreted as the future information along the undate direction. To weaken the impact of noise caused by mini-batch data in each optimization step, we adopt moving average decay on the

d_{t}

term, which incorporates the history value on the current update and calculates a weighted average based on these values as shown in line 14 of Algorithm 1. In addition, we introduce

{\hat{Γ}}_{t} = Γ_{t} / (\sqrt{{\hat{v}}_{t}} + ϵ)

to reduce the interference of outliers and correct the gradient, because the model may be unstable and the difference in the gradient direction by random batch sampling can be large, especially in the initial stage of training. Note that the vector square and vector division throughout the paper are element-wise.

4. Main Theoretical Results

In this section, we provide the theoretical convergence results of our proposed algorithm FedDA. Before presenting the main results, we first make some assumptions required for the analytical proof.

Assumption 1.

(Smoothness) The cost function

f_{i} (x)

is L-smooth with the constant

L > 0

; that is, for ∀

x, y \in R^{d}

and

i \in [n]

, we have

∥ \nabla f_{i} (x) - \nabla f_{i} (y) ∥ \leq L ∥ x - y ∥ .

(2)

Note that Assumption 1 suggests the critical property, i.e.,

f_{i} (x) - f_{i} (y) \leq 〈 \nabla f_{i} (y), x - y 〉 + \frac{L}{2} {∥ x - y ∥}^{2}

. It is a commonly used assumption in theoretical analysis of stochastic non-convex optimization problems [25,36]. Similarly, it is also used in the algorithm analysis of federated learning, such as [18,34].

Assumption 2.

(Unbiased and bounded gradient) For

\forall i \in [n]

and

t \in [T]

, the stochastic gradient

g_{i, t}^{h}

is unbiased, i.e.,

E [g_{i, t}^{h}] = \nabla f_{i} (x_{i, t}^{h})

; each function

f_{i} (x)

has a B-bounded stochastic gradient, i.e.,

∥ g_{i, t}^{h} ∥ \leq B

.

Assumption 2 is a general assumption condition to give the upper bound of the gradient in adaptive gradient-type methods, such as in [17,34].

Assumption 3.

(Bounded variance) For

\forall i \in [n]

and

t \in [T]

, the variance of each stochastic gradient for a local client is bounded, i.e.,

E ∥ g_{i, t}^{h} - \nabla f_{i} (x_{i, t}^{h}) ∥^{2} \leq δ_{l}^{2}

; the deviations among each loss function

f_{i} (x)

are bounded, i.e.,

\frac{1}{n} \sum_{i = 1}^{n} {∥ \nabla f_{i} (x) - \nabla f (x) ∥}^{2} \leq δ_{g}^{2}

.

Note that the local bound

δ_{l}

in Assumption 3 quantifies the random variation of the gradient estimator for each client, and the global bound

δ_{g}

describes the deviation between each loss function

f_{i} (x)

, which is mainly introduced by the heterogeneous data of each client. In particular,

δ_{g} = 0

means that the data of all the clients follow the same distribution, i.e., the homogeneous data setting.

Based on these assumptions, we present the following convergence theorems for the proposed algorithm. The case in which all clients take part in the model training during each round is considered first, i.e.,

| S_{t} | = n

, any

t \in [T]

.

Theorem 1.

(Full Client Participation) Suppose that Assumptions 1–3 hold. Let local learning rate

η_{l}

satisfy

η_{l} \leq min {\frac{1}{8 H L}, \frac{ϵ}{3 H L (H B + ϵ) (η β_{1}^{2} / {(1 - β_{1})}^{2} + η + ν^{2} / η^{2})}}

. Under the full client participation case, the iterations of Algorithm 1 satisfy

\begin{matrix} \frac{1}{T} \sum_{t = 1}^{T} E {∥\nabla f (x_{t})∥}^{2} \leq 4 (η_{l} H B + ϵ) [\frac{f (x_{0}) - f^{*}}{η η_{l} H T} + \frac{C_{1}}{T} + C_{2}], \end{matrix}

(3)

where

C_{1} = \frac{β_{1} B^{2} d}{(1 - β_{1}) ϵ} + \frac{η η_{l} H B^{2} β_{1}^{2} L d}{{(1 - β_{1})}^{2} ϵ^{2}}

and

C_{2} = \frac{5 η_{l}^{2} L^{2} H}{2 ϵ} (δ_{l}^{2} + 6 H δ_{g}^{2}) + \frac{3 η η_{l} β_{1}^{2} L δ_{l}^{2}}{2 n {(1 - β_{1})}^{2} ϵ^{2}} + \frac{6 η_{l} L H B^{2} ν^{2}}{η ϵ^{2}}

, and

f^{*}

represents the optimal function value.

Remark 1.

In Theorem 1, the expected squared gradient norm in the global model is used to measure the convergence in non-convex optimization. We find that the convergence upper bound of FedDA in (3) includes three terms: the first two terms relate to T and vanish to zero when T has a sufficiently large increase. The third term

C_{2}

is independent of T, which depends on the local variance

δ_{l}

and the global variance

δ_{g}

. We further divide

C_{2}

into three parts. The first part in

C_{2}

is the cumulative variance by conducting H local iteration steps which are greatly affected by the data heterogeneity. It will be smaller when each client follows the same data distribution, i.e.,

δ_{g} = 0

. The local variance of the gradient computed by randomly sampling data points for each client contributes to the second part of

C_{2}

. It decreases at the rate

\frac{1}{n}

, which means more data samples from more clients, resulting in the decreasing of noise of stochastic gradients. The last part of

C_{2}

depends on the number of local update H. This requires a sufficiently small ν to offset the influence of the gradient difference to make the last past in

C_{2}

small.

With Theorem 1, we restate the above convergence result for the FedDA algorithm by choosing proper learning rates:

Corollary 1.

Let

η_{l} = O (\frac{1}{\sqrt{T} H})

,

η = O (\sqrt{n H})

. For

T \geq n H

, the convergence rate of the FedDA algorithm under the all client participation case is:

\begin{matrix} \frac{1}{T} \sum_{t = 1}^{T} E {∥\nabla f (x_{t})∥}^{2} \leq O (\frac{1}{\sqrt{n T H}}) . \end{matrix}

(4)

Remark 2.

From Corollary 1, we can see that the convergence rate of the proposed algorithm is

O (\frac{1}{\sqrt{n T H}})

as long as

T \geq n H

. In [37], the generated FedAvg with local and global learning rates achieves a convergence rate of

O (1 / \sqrt{n K T})

, where K denotes the number of local update steps. For federated adaptive optimization, the convergence rate of FedAdam [17] is

O (1 / \sqrt{n K T})

, which is similar to FedYogi when the data are heterogeneous. Thus, the proposed FedDA matches the convergence result of a non-convex problem in federated learning [18].

Next, a partial client participation case is considered, which is more practical in a federated learning setting for each communication round.

S_{t}

denotes the set of clients who participate in the training of the local model and communicate with the central server in round t, and the size of

S_{t}

is m, i.e.,

| S_{t} | = m

,

0 < m \leq n

,

\forall t \in [T]

. We adopt the random sampling strategy without replacement. There is a random selection of clients in addition to the stochastic gradient. Here,

E [\cdot]

is still used to express the expectations for these two kinds of randomness.

Theorem 2.

(Partial Client Participation) Suppose that Assumptions 1–3 hold. Let local learning rate

η_{l}

satisfy

η_{l} \leq min {\frac{1}{8 H L}, \frac{m (n - 1) ϵ^{2}}{n H L (m - 1) (2 η^{2} + 3 ν^{2} + 1 + 3 η^{2} β_{1}^{2} / {(1 - β_{1})}^{2}) (H B + ϵ)}}

. Under the partial client participation case, the iterations of Algorithm 1 satisfy

\begin{matrix} \frac{1}{T} \sum_{t = 1}^{T} E {∥\nabla f (x_{t})∥}^{2} \leq 4 (η_{l} H B + ϵ) [\frac{f (x_{0}) - f^{*}}{η η_{l} H T} + \frac{C_{1}}{T} + C_{4}], \end{matrix}

(5)

where

C_{1} = \frac{β_{1} B^{2} d}{(1 - β_{1}) ϵ} + \frac{η η_{l} β_{1}^{2} L H B^{2} d}{{(1 - β_{1})}^{2} ϵ^{2}}

,

C_{4} = \frac{5 η_{l}^{2} L^{2} H}{2 ϵ} (δ_{l}^{2} + 6 H δ_{g}^{2}) + \frac{6 η_{l} L H B^{2} ν^{2}}{ϵ^{2}} + C_{3} L [\frac{η_{l} δ_{l}^{2}}{m} + \frac{η_{l} (n - m)}{m (n - 1)} \cdot (15 H^{2} L^{2} η_{l}^{2} (δ_{l}^{2} + 6 H δ_{g}^{2}) + 3 H δ_{g}^{2})]

and

C_{3} = \frac{2 η + 3 ν^{2} + 1}{2 ϵ^{2}} + \frac{3 η^{2} β_{1}^{2}}{2 {(1 - β_{1})}^{2} ϵ^{2}}

, and

f^{*}

represents the optimal function value.

Remark 3.

In Theorem 2, the similar convergence upper bound is obtained for partial client participation but with a larger bias term

C_{4}

, which suggests that the convergence of FedDA has not changed fundamentally except introducing extra variance because of low client participation and the use of a difference term, which can be controlled by choosing a small ν.

From Theorem 2, we restate the above result by choosing a specific learning rate:

Corollary 2.

If we choose

η_{l} = O (\frac{1}{\sqrt{T} H})

,

η = O (\sqrt[3]{m})

, the convergence rate of the FedDA algorithm under the partial client participation case is:

\begin{matrix} \frac{1}{T} \sum_{t = 1}^{T} E {∥\nabla f (x_{t})∥}^{2} \leq O (\frac{1}{\sqrt{T} \sqrt[3]{m}}) . \end{matrix}

(6)

From Corollary 2, we find that partial client participation does not have changes in convergence in the sense of order. It implies that a larger number of m can help convergence. In addition, when choosing

m \leq H^{3}

, we have a slightly small convergence bound compared to [18].

5. Performance Analysis

This section first gives useful lemmas and builds a theoretical convergence guarantee for the proposed algorithm FedDA. Here, we define

x_{0} = x_{1}

,

Γ_{0} = 0

; for

t \geq 1

, a virtual sequence

y_{t}

is introduced as follows:

\begin{matrix} y_{t} = x_{t} + \frac{β_{1}}{1 - β_{1}} (x_{t} - x_{t - 1}) - ν {\hat{Γ}}_{t - 1}, \end{matrix}

(7)

and we obtain the following property for the difference of

y_{t}

,

\begin{matrix} y_{t + 1} - y_{t} & = \frac{1}{1 - β_{1}} (x_{t + 1} - x_{t}) - \frac{β_{1}}{1 - β_{1}} (x_{t} - x_{t - 1}) - ν ({\hat{Γ}}_{t} - {\hat{Γ}}_{t - 1}) \\ = \frac{1}{1 - β_{1}} (\frac{η m_{t}}{\sqrt{{\hat{v}}_{t}} + ϵ} + ν d_{t}) - ν ({\hat{Γ}}_{t} - {\hat{Γ}}_{t - 1}) - \frac{β_{1}}{1 - β_{1}} (\frac{η m_{t - 1}}{\sqrt{{\hat{v}}_{t - 1}} + ϵ} + ν d_{t - 1}) \\ = \frac{η β_{1}}{1 - β_{1}} (\frac{1}{\sqrt{{\hat{v}}_{t}} + ϵ} - \frac{1}{\sqrt{{\hat{v}}_{t - 1}} + ϵ}) ⊙ m_{t - 1} + \frac{η}{\sqrt{{\hat{v}}_{t}} + ϵ} ⊙ Γ_{t} . \end{matrix}

(8)

Lemma 1.

Under Assumption 2 and Algorithm 1, we have

∥ \nabla f (x) ∥ \leq B

,

∥ Γ_{t} ∥ \leq η_{l} H B

,

∥ m_{t} ∥ \leq η_{l} H B

,

∥ v_{t} ∥ \leq η_{l}^{2} H^{2} B^{2}

and

∥ d_{t} ∥ \leq \frac{2 (1 - β_{1})}{β_{1}} \frac{η_{l} H B}{ϵ}

.

Proof of Lemma 1.

Because of the bounded stochastic gradient by Assumption 2, we have

\begin{matrix} ∥ \nabla f (x) ∥ & = ∥ E_{ξ} \nabla f (x, ξ) ∥ \leq E_{ξ} ∥ \nabla f (x, ξ) ∥ \leq B . \end{matrix}

(9)

By the definition of

Γ_{t}

in Algorithm 1, the following relation holds regardless of full participation or partial participation cases,

\begin{matrix} ∥ Γ_{t} ∥ & = ∥\frac{1}{| S_{t} |} \sum_{i \in S_{t}} Γ_{i, t}∥ = ∥\frac{1}{| S_{t} |} \sum_{i \in S_{t}} (- η_{l} \sum_{h = 0}^{H - 1} g_{i, t}^{h})∥ \leq \frac{η_{l}}{| S_{t} |} \sum_{i \in S_{t}} \sum_{h = 0}^{H - 1} ∥g_{i, t}^{h}∥ \leq η_{l} H B . \end{matrix}

(10)

Hence, we can estimate the upper bound for

m_{t}

and

v_{t}

based on (10),

\begin{matrix} ∥ m_{t} ∥ \leq (1 - β_{1}) \sum_{j = 1}^{t} β_{1}^{t - j} ∥Γ_{j}∥ \leq η_{l} H B, \end{matrix}

(11)

\begin{matrix} ∥ v_{t} ∥ \leq (1 - β_{2}) \sum_{j = 1}^{t} β_{2}^{t - j} {∥Γ_{j}∥}^{2} \leq η_{l}^{2} H^{2} B^{2} . \end{matrix}

(12)

In addition, we expand

d_{t}

and note

{\hat{Γ}}_{0} = 0

:

\begin{matrix} ∥ d_{t} ∥ & = ∥\frac{1 - β_{1}}{β_{1}} ({\hat{Γ}}_{t} - (1 - β_{1}) \sum_{j = 1}^{t} β_{1}^{t - j} {\hat{Γ}}_{j})∥ \\ \leq \frac{1 - β_{1}}{β_{1}} (∥{\hat{Γ}}_{j}∥ + (1 - β_{1}) \sum_{j = 1}^{t} β_{1}^{t - j} ∥{\hat{Γ}}_{j}∥) \\ \leq \frac{2 (1 - β_{1})}{β_{1}} \frac{η_{l} H B}{ϵ}, \end{matrix}

(13)

where the last inequality holds by

∥ {\hat{Γ}}_{t} ∥ \leq \frac{η_{l} H B}{ϵ}

. □

Lemma 2.

Under Algorithm 1, for the difference sequence of

\frac{1}{\sqrt{{\hat{v}}_{t}} + ϵ} - \frac{1}{\sqrt{{\hat{v}}_{t - 1}} + ϵ}

, we have

\sum_{t = 1}^{T} {∥\frac{1}{\sqrt{{\hat{v}}_{t}} + ϵ} - \frac{1}{\sqrt{{\hat{v}}_{t - 1}} + ϵ}∥}_{1} \leq \frac{d}{ϵ}

and

\sum_{t = 1}^{T} ∥\frac{1}{\sqrt{{\hat{v}}_{t}} + ϵ} - \frac{1}{\sqrt{{\hat{v}}_{t - 1}} + ϵ}∥ \leq \frac{d}{ϵ^{2}}

.

Proof of Lemma 2.

From the definition of

{\hat{v}}_{t} = max {{\hat{v}}_{t - 1}, v_{t}}

, we have

\begin{matrix} \sum_{t = 1}^{T} {∥\frac{1}{\sqrt{{\hat{v}}_{t}} + ϵ} - \frac{1}{\sqrt{{\hat{v}}_{t - 1}} + ϵ}∥}_{1} & = \sum_{t = 1}^{T} {∥\frac{1}{\sqrt{{\hat{v}}_{t - 1}} + ϵ}∥}_{1} - {∥\frac{1}{\sqrt{{\hat{v}}_{t}} + ϵ}∥}_{1} \\ = {∥\frac{1}{\sqrt{{\hat{v}}_{0}} + ϵ}∥}_{1} - {∥\frac{1}{\sqrt{{\hat{v}}_{T}} + ϵ}∥}_{1} \\ \leq \frac{d}{ϵ} . \end{matrix}

(14)

Similarly, according to the definition of the

ℓ_{2}

norm, we have

\begin{matrix} \sum_{t = 1}^{T} {∥\frac{1}{\sqrt{{\hat{v}}_{t - 1}} + ϵ} - \frac{1}{\sqrt{{\hat{v}}_{t}} + ϵ}∥}^{2} & = \sum_{t = 1}^{T} \sum_{j = 1}^{d} {(\frac{1}{\sqrt{{\hat{v}}_{t - 1}} + ϵ} - \frac{1}{\sqrt{{\hat{v}}_{t}} + ϵ})}_{j}^{2} \\ \leq \sum_{t = 1}^{T} \sum_{j = 1}^{d} \frac{1}{ϵ} {(\frac{1}{\sqrt{{\hat{v}}_{t - 1}} + ϵ} - \frac{1}{\sqrt{{\hat{v}}_{t}} + ϵ})}_{j} \\ \leq \sum_{j = 1}^{d} \frac{1}{ϵ} {(\frac{1}{\sqrt{{\hat{v}}_{0}} + ϵ} - \frac{1}{\sqrt{{\hat{v}}_{T}} + ϵ})}_{j} \\ \leq \frac{d}{ϵ^{2}}, \end{matrix}

(15)

where the last inequality has the non-decreasing nature of

{\hat{v}}_{t}

. We finish the proof. □

Lemma 3.

Under the scheme of randomly sampling without replacement, the average of partial model difference

Γ_{t}

is an unbiased estimator of

{\bar{Γ}}_{t}

, i.e.,

E_{S_{t}} [Γ_{t}] = {\bar{Γ}}_{t},

(16)

where

{\bar{Γ}}_{t} = \frac{1}{n} \sum_{i = 1}^{n} Γ_{i, t}

.

Proof of Lemma 3.

For a partial client participation case, let

S_{t} = {l_{1}, ∆, l_{m}}

with

| S_{t} | = m

; because the sampling distribution of each round is identical, we have

\begin{matrix} E_{S_{t}} [Γ_{t}] & = \frac{1}{m} E_{S_{t}} [\sum_{l_{i} \in S_{t}} Γ_{l_{i}, t}] = \frac{1}{m} E_{S_{t}} [\sum_{i = 1}^{m} Γ_{l_{i}, t}] = E_{S_{t}} [Γ_{l_{1}, t}] = \frac{1}{n} \sum_{i = 1}^{n} Γ_{i, t} = {\bar{Γ}}_{t} . \end{matrix}

(17)

□

Lemma 4.

Under Assumptions 1–3 and Algorithm 1, when all clients participate in communication with the server in each round, the following upper bound holds for the average of the aggregated model difference

Γ_{t}

\begin{matrix} E ∥ Γ_{t} ∥^{2} \leq \frac{η_{l}^{2} H δ_{l}^{2}}{n} + \frac{η_{l}^{2}}{n^{2}} E {∥\sum_{i = 1}^{n} \sum_{h = 0}^{H - 1} \nabla f_{i} (x_{i, t}^{h})∥}^{2}, \end{matrix}

(18)

and for a partial participation case, we have

\begin{matrix} E ∥ Γ_{t} ∥^{2} & \leq \frac{η_{l}^{2} H δ_{l}^{2}}{m} + \frac{η_{l}^{2} (n - m)}{m (n - 1)} [15 H^{3} L^{2} η_{l}^{2} (δ_{l}^{2} + 6 H δ_{g}^{2}) + 3 H^{2} δ_{g}^{2} \\ + (90 H^{4} L^{2} η_{l}^{2} + 3 H^{2}) {∥ \nabla f (x_{t}) ∥}^{2}] + \frac{η_{l}^{2} (m - 1)}{m n (n - 1)} E {∥\sum_{i = 1}^{n} \sum_{h = 0}^{H - 1} \nabla f_{i} (x_{i, t}^{h})∥}^{2} . \end{matrix}

(19)

Proof of Lemma 4.

When

| S_{t} | = n

, we have

\begin{matrix} E ∥ Γ_{t} ∥^{2} & = E {∥\frac{1}{n} \sum_{i = 1}^{n} Γ_{i, t}∥}^{2} \\ = E {∥- \frac{η_{l}}{n} \sum_{i = 1}^{n} \sum_{h = 0}^{H - 1} g_{i, t}^{h}∥}^{2} \\ = \frac{η_{l}^{2}}{n^{2}} E {∥\sum_{i = 1}^{n} \sum_{h = 0}^{H - 1} (g_{i, t}^{h} - \nabla f_{i} (x_{i, t}^{h}))∥}^{2} + \frac{η_{l}^{2}}{n^{2}} E {∥\sum_{i = 1}^{n} \sum_{h = 0}^{H - 1} \nabla f_{i} (x_{i, t}^{h})∥}^{2} \\ \leq \frac{η_{l}^{2} H δ_{l}^{2}}{n} + \frac{η_{l}^{2}}{n^{2}} E {∥\sum_{i = 1}^{n} \sum_{h = 0}^{H - 1} \nabla f_{i} (x_{i, t}^{h})∥}^{2}, \end{matrix}

(20)

where the third equation holds by

{E ∥ z ∥}^{2} = {E ∥ z - E z ∥}^{2} + {∥ E z ∥}^{2}

, and the last inequality holds by Assumption 3 and

E ∥ \sum_{= 1}^{n} z_{i} ∥^{2} = \sum_{= 1}^{n} E {∥ z_{i} ∥}^{2}

when

z_{i}

is an independent random variable with zero mean.

When

| S_{t} | = m

, and

m \in (0, n]

, we introduce the indicator function for convenience as follows

\begin{matrix} E ∥ Γ_{t} ∥^{2} & = E {∥\frac{1}{m} \sum_{i \in S_{t}} Γ_{i, t}∥}^{2} \\ = \frac{1}{m^{2}} E {∥\sum_{i = 1}^{n} I_{{i \in S_{t}}} Γ_{i, t}∥}^{2} \\ = \frac{η_{l}^{2}}{m^{2}} E [{∥\sum_{i = 1}^{n} I_{{i \in S_{t}}} \sum_{h = 0}^{H - 1} (g_{i, t}^{h} - \nabla f_{i} (x_{i, t}^{h}))∥}^{2} + {∥\sum_{i = 1}^{n} I_{{i \in S_{t}}} \sum_{h = 0}^{H - 1} \nabla f_{i} (x_{i, t}^{h})∥}^{2}] \\ = \frac{η_{l}^{2}}{m^{2}} E [{∥\sum_{i = 1}^{n} P {i \in S_{t}} \sum_{h = 0}^{H - 1} (g_{i, t}^{h} - \nabla f_{i} (x_{i, t}^{h}))∥}^{2} + {∥\sum_{i = 1}^{n} P {i \in S_{t}} \sum_{h = 0}^{H - 1} \nabla f_{i} (x_{i, t}^{h})∥}^{2}] \\ = \frac{η_{l}^{2}}{n m} E [\sum_{i = 1}^{n} \sum_{h = 0}^{H - 1} {∥g_{i, t}^{h} - \nabla f_{i} (x_{i, t}^{h})∥}^{2}] + \frac{η_{l}^{2}}{m^{2}} E [{∥\sum_{i = 1}^{n} P {i \in S_{t}} \sum_{h = 0}^{H - 1} \nabla f_{i} (x_{i, t}^{h})∥}^{2}] \\ \leq \frac{η_{l}^{2} H δ_{l}^{2}}{m} + \frac{η_{l}^{2}}{m^{2}} E {∥\sum_{i = 1}^{n} P {i \in S_{t}} \sum_{h = 0}^{H - 1} \nabla f_{i} (x_{i, t}^{h})∥}^{2}, \end{matrix}

(21)

where the third equation follows by the definition of variance, i.e.,

{E ∥ z ∥}^{2} = {E ∥ z - E z ∥}^{2} + {∥ E z ∥}^{2}

, the fifth equation holds by the fact that

E ∥ \sum_{= 1}^{n} z_{i} ∥^{2} = \sum_{= 1}^{n} E {∥ z_{i} ∥}^{2}

when

z_{i}

is an independent random variable with zero mean, and

P {i \in S_{t}} = \frac{m}{n}

. The last one is due to the bounded variance in Assumption 3.

Recalling the equation

∥ \sum_{= 1}^{n} z_{i} ∥^{2} = n \sum_{= 1}^{n} ∥ z_{i} ∥^{2} - \frac{1}{2} \sum_{i \neq j} {∥ z_{i} - z_{j} ∥}^{2}

, for the last term in (21), we have

\begin{matrix} E {∥\sum_{i = 1}^{n} P {i \in S_{t}} \sum_{h = 0}^{H - 1} \nabla f_{i} (x_{i, t}^{h})∥}^{2} \\ = \sum_{i \neq j} P {i, j \in S_{t}} 〈\nabla f_{i} (x_{i, t}^{h}), \nabla f_{j} (x_{j, t}^{h})〉 + \sum_{i = 1}^{n} P {i \in S_{t}} {∥\sum_{h = 0}^{H - 1} \nabla f_{i} (x_{i, t}^{h})∥}^{2} \\ = \frac{m}{n} \sum_{i = 1}^{n} {∥\sum_{h = 0}^{H - 1} \nabla f_{i} (x_{i, t}^{h})∥}^{2} + \frac{m (m - 1)}{n (n - 1)} \sum_{i \neq j} 〈\nabla f_{i} (x_{i, t}^{h}), \nabla f_{j} (x_{j, t}^{h})〉 \\ = - \frac{m (m - 1)}{2 n (n - 1)} \sum_{i \neq j} {∥\sum_{h = 0}^{H - 1} (\nabla f_{i} (x_{i, t}^{h}) - \nabla f_{j} (x_{j, t}^{h}))∥}^{2} + \frac{m^{2}}{n} \sum_{i = 1}^{n} {∥\sum_{h = 0}^{H - 1} \nabla f_{i} (x_{i, t}^{h})∥}^{2} \\ = \frac{m (n - m)}{n (n - 1)} \sum_{i = 1}^{n} {∥\sum_{h = 0}^{H - 1} \nabla f_{i} (x_{i, t}^{h})∥}^{2} + \frac{m (m - 1)}{n (n - 1)} {∥\sum_{i = 1}^{n} \sum_{h = 0}^{H - 1} \nabla f_{i} (x_{i, t}^{h})∥}^{2}, \end{matrix}

(22)

where the second equation holds by

P {i \in S_{t}} = \frac{m}{n}

and

P {i, j \in S_{t}} = \frac{m (m - 1)}{n (n - 1)}

, while the third one and last one hold by

∥ \sum_{= 1}^{n} z_{i} ∥^{2} = n \sum_{= 1}^{n} ∥ z_{i} ∥^{2} - \frac{1}{2} \sum_{i \neq j} {∥ z_{i} - z_{j} ∥}^{2}

and

〈 z_{1}, z_{2} 〉 = \frac{1}{2} [∥ z_{1} ∥^{2} + ∥ z_{2} ∥^{2} - ∥ z_{1} - z_{1} ∥^{2}]

, respectively. In addition, by plugging

\mp \nabla f_{i} (x_{t}) \pm \nabla f (x_{t})

into the first term of (22), we obtain

\begin{matrix} \sum_{i = 1}^{n} {∥\sum_{h = 0}^{H - 1} \nabla f_{i} (x_{i, t}^{h})∥}^{2} \\ = \sum_{i = 1}^{n} E ∥ \sum_{h = 0}^{H - 1} [\nabla f_{i} (x_{i, t}^{h}) - \nabla f_{i} (x_{t}) + \nabla f_{i} (x_{t}) - \nabla f (x_{t}) + \nabla f (x_{t})] ∥^{2} \\ \leq 3 H L^{2} \sum_{i = 1}^{n} \sum_{h = 0}^{H - 1} E {∥x_{i, t}^{h} - x_{t}∥}^{2} + 3 n H^{2} (δ_{g}^{2} + ∥ \nabla f (x_{t}) ∥^{2}) \\ \leq (90 n H^{4} L^{2} η_{l}^{2} + 3 n H^{2}) {∥ \nabla f (x_{t}) ∥}^{2} + 3 n H^{2} δ_{g}^{2} + 15 n H^{3} L^{2} η_{l}^{2} (δ_{l}^{2} + 6 H δ_{g}^{2}), \end{matrix}

(23)

where the first inequality follows by Assumptions 1, 3 and

∥ \sum_{i = 1}^{n} z_{i} ∥^{2} \leq n \sum_{i = 1}^{n} {∥ z_{i} ∥}^{2}

, and the last one follows Lemma 3 in [17]. Substituting (22), (23) into (21), we obtain (19), which finishes the proof. □

Lemma 5.

Under Assumptions 1–3 and Algorithm 1, for the full participation case, we have

\begin{matrix} \sum_{t = 1}^{T} E {∥ m_{t} ∥}^{2} \leq \frac{T H η_{l}^{2} δ_{l}^{2}}{n} + \frac{η_{l}^{2}}{n^{2}} \sum_{t = 1}^{T} E {∥\sum_{i = 1}^{n} \sum_{h = 0}^{H - 1} \nabla f_{i} (x_{i, t}^{h})∥}^{2}, \end{matrix}

(24)

and for the partial participation case, we have

\begin{matrix} \sum_{t = 1}^{T} E {∥ m_{t} ∥}^{2} & \leq \frac{T η_{l}^{2} H δ_{l}^{2}}{m} + \frac{η_{l}^{2} (n - m)}{m (n - 1)} [15 T H^{3} L^{2} η_{l}^{2} (δ_{l}^{2} + 6 H δ_{g}^{2}) + 3 T H^{2} δ_{g}^{2} + (90 H^{4} L^{2} η_{l}^{2} + 3 H^{2}) \cdot \\ \sum_{t = 1}^{T} {∥ \nabla f (x_{t}) ∥}^{2}] + \frac{η_{l}^{2} (m - 1)}{m n (n - 1)} \sum_{t = 1}^{T} E {∥\sum_{i = 1}^{n} \sum_{h = 0}^{H - 1} \nabla f_{i} (x_{i, t}^{h})∥}^{2} . \end{matrix}

(25)

Proof.

For the full client participation scheme, i.e.,

| S_{t} | = n

, according to Algorithm 1, we have

\begin{matrix} E ∥ m_{t} ∥^{2} & = E ∥ (1 - β_{1}) \sum_{j = 1}^{t} β_{1}^{t - j} Γ_{j} ∥^{2} \\ \leq {(1 - β_{1})}^{2} \sum_{τ = 1}^{d} E [{(\sum_{j = 1}^{t} β_{1}^{t - j} Γ_{j, τ})}^{2}] \\ \leq {(1 - β_{1})}^{2} \sum_{τ = 1}^{d} E [(\sum_{j = 1}^{t} β_{1}^{t - j}) (\sum_{j = 1}^{t} β_{1}^{t - j} Γ_{j, τ}^{2})] \\ \leq (1 - β_{1}) \sum_{τ = 1}^{d} E (\sum_{j = 1}^{t} β_{1}^{t - j} Γ_{j, τ}^{2}) \\ = (1 - β_{1}) \sum_{j = 1}^{t} β_{1}^{t - j} E {∥ Γ_{j} ∥}^{2} \\ \leq \frac{η_{l}^{2} H δ_{l}^{2}}{n} + \frac{η_{l}^{2}}{n^{2}} E {∥\sum_{i = 1}^{n} \sum_{h = 0}^{H - 1} \nabla f_{i} (x_{i, t}^{h})∥}^{2}, \end{matrix}

(26)

where the first inequality and second equation hold by the definition of the

ℓ_{2}

norm, and the second inequality follows by Cauchy–Schwarz inequality

{(\sum_{k = 1}^{n} a_{k} b_{k})}^{2} \leq \sum_{k = 1}^{n} a_{k}^{2} \sum_{k = 1}^{n} b_{k}^{2}

. The last inequality is due to (18) in Lemma 4 and

β_{1} < 1

. Taking summation over t from 1 to T for (26), we obtain the desired results (24).

For the partial client participation scheme, i.e.,

| S_{t} | = m

, the main difference is using (19) instead of (18) in Lemma 4 when bound to

E ∥ m_{t} ∥^{2}

. Following a similar proof above, we can obtain the result (25). □

Finally, with the previous lemmas, we return to prove Theorem 1.

Proof of Theorem 1.

Due to the smoothness of f, we have

\begin{matrix} E f (y_{t + 1}) - E f (y_{t}) \\ \leq E 〈\nabla f (y_{t}), y_{t + 1} - y_{t}〉 + \frac{L}{2} E {∥y_{t + 1} - y_{t}∥}^{2} \\ = \underset{A_{1}}{\underset{︸}{E 〈\nabla f (x_{t}), \frac{η}{\sqrt{{\hat{v}}_{t}} + ϵ} ⊙ Γ_{t}〉}} - \underset{A_{2}}{\underset{︸}{\frac{η β_{1}}{1 - β_{1}} E 〈\nabla f (y_{t}), (\frac{1}{\sqrt{{\hat{v}}_{t - 1}} + ϵ} - \frac{1}{\sqrt{{\hat{v}}_{t}} + ϵ}) m_{t - 1}〉}} \\ + \underset{A_{3}}{\underset{︸}{\frac{η^{2} L}{2} {∥\frac{Γ_{t}}{\sqrt{{\hat{v}}_{t}} + ϵ} - \frac{β_{1}}{1 - β_{1}} (\frac{1}{\sqrt{{\hat{v}}_{t - 1}} + ϵ} - \frac{1}{\sqrt{{\hat{v}}_{t}} + ϵ}) m_{t - 1}∥}^{2}}} \\ + \underset{A_{4}}{\underset{︸}{E 〈\nabla f (y_{t}) - \nabla f (x_{t}), \frac{η}{\sqrt{{\hat{v}}_{t}} + ϵ} ⊙ Γ_{t}〉}} . \end{matrix}

(27)

We first bound

A_{1}

as following:

\begin{matrix} A_{1} & = E 〈\nabla f (x_{t}), \frac{η}{\sqrt{{\hat{v}}_{t}} + ϵ} ⊙ Γ_{t}〉 \\ = η E 〈\frac{\nabla f (x_{t})}{\sqrt{{\hat{v}}_{t}} + ϵ}, Γ_{t} + η_{l} H \nabla f (x_{t}) - η_{l} H \nabla f (x_{t})〉 \\ = η E 〈\frac{\nabla f (x_{t})}{\sqrt{{\hat{v}}_{t}} + ϵ}, - \frac{η_{l}}{n} \sum_{i = 1}^{n} \sum_{h = 0}^{H - 1} g_{i, t}^{h} + \frac{η_{l} H}{n} \sum_{i = 1}^{n} \nabla f_{i} (x_{t})〉 - η η_{l} H E {∥\frac{\nabla f (x_{t})}{\sqrt{\sqrt{{\hat{v}}_{t}} + ϵ}}∥}^{2} . \end{matrix}

(28)

Note that the first term of (28) can be further estimated:

\begin{matrix} η E 〈\frac{\nabla f (x_{t})}{\sqrt{{\hat{v}}_{t}} + ϵ}, - \frac{η_{l}}{n} \sum_{i = 1}^{n} \sum_{h = 0}^{H - 1} g_{i, t}^{h} + \frac{η_{l} H}{n} \sum_{i = 1}^{n} \nabla f_{i} (x_{t})〉 \\ = η E 〈\frac{\sqrt{η_{l} H} \nabla f (x_{t})}{\sqrt{\sqrt{{\hat{v}}_{t}} + ϵ}}, - \frac{\sqrt{η_{l} H}}{H n \sqrt{\sqrt{{\hat{v}}_{t}} + ϵ}} \sum_{i = 1}^{n} \sum_{h = 0}^{H - 1} (\nabla f_{i} (x_{i, t}^{h}) - \nabla f_{i} (x_{t}))〉 \\ = \frac{η η_{l}}{2 H n^{2}} E {∥\frac{\sum_{i = 1}^{n} \sum_{h = 0}^{H - 1} (\nabla f_{i} (x_{i, t}^{h}) - \nabla f_{i} (x_{t}))}{\sqrt{\sqrt{{\hat{v}}_{t}} + ϵ}}∥}^{2} - \frac{η η_{l}}{2 H n^{2}} E {∥\frac{\sum_{i = 1}^{n} \sum_{h = 0}^{H - 1} \nabla f_{i} (x_{i, t}^{h})}{\sqrt{\sqrt{{\hat{v}}_{t}} + ϵ}}∥}^{2} \\ + \frac{η η_{l} H}{2} E {∥\frac{\nabla f (x_{t})}{\sqrt{\sqrt{{\hat{v}}_{t}} + ϵ}}∥}^{2} \\ \leq \frac{η η_{l} L^{2}}{2 n} \sum_{i = 1}^{n} \sum_{h = 0}^{H - 1} E {∥\frac{x_{i, t}^{h} - x_{t}}{\sqrt{\sqrt{{\hat{v}}_{t}} + ϵ}}∥}^{2} + \frac{η η_{l} H}{2} E {∥\frac{\nabla f (x_{t})}{\sqrt{\sqrt{{\hat{v}}_{t}} + ϵ}}∥}^{2} - \frac{η η_{l}}{2 H n^{2}} E {∥\frac{\sum_{i = 1}^{n} \sum_{h = 0}^{H - 1} \nabla f_{i} (x_{i, t}^{h})}{\sqrt{\sqrt{{\hat{v}}_{t}} + ϵ}}∥}^{2} \\ \leq \frac{3 η η_{l} H}{4} E {∥\frac{\nabla f (x_{t})}{\sqrt{\sqrt{{\hat{v}}_{t}} + ϵ}}∥}^{2} + \frac{5 η η_{l}^{3} L^{2} H^{2}}{2 ϵ} (δ_{l}^{2} + 6 H δ_{g}^{2}) - \frac{η η_{l}}{2 H n^{2}} E {∥\frac{\sum_{i = 1}^{n} \sum_{h = 0}^{H - 1} \nabla f_{i} (x_{i, t}^{h})}{\sqrt{\sqrt{{\hat{v}}_{t}} + ϵ}}∥}^{2}, \end{matrix}

(29)

where the second equation is from the equality

〈 a, b 〉 = \frac{1}{2} {(∥ a ∥}^{2} + {∥ b ∥}^{2} {- ∥ a - b ∥}^{2})

, the first inequality follows the fact that

∥ \sum_{i = 1}^{n} a_{i} ∥^{2} \leq n \sum_{i = 1}^{n} {∥ a_{i} ∥}^{2}

and the smoothness of function

f_{i}

. Lemma 3 in [17] and

η_{l} \leq \frac{1}{8 H L}

result in the last inequality.

Plugging (29) into (28), we have

\begin{matrix} A_{1} & \leq - \frac{η η_{l} H}{4} E {∥\frac{\nabla f (x_{t})}{\sqrt{\sqrt{{\hat{v}}_{t}} + ϵ}}∥}^{2} + \frac{5 η η_{l}^{3} L^{2} H^{2}}{2 ϵ} (δ_{l}^{2} + 6 H δ_{g}^{2}) \\ - \frac{η η_{l}}{2 H n^{2}} E {∥\frac{\sum_{i = 1}^{n} \sum_{h = 0}^{H - 1} \nabla f_{i} (x_{i, t}^{h})}{\sqrt{\sqrt{{\hat{v}}_{t - 1}} + ϵ}}∥}^{2} . \end{matrix}

(30)

We further estimate the term

A_{2}

.

\begin{matrix} A_{2} & = - \frac{η β_{1}}{1 - β_{1}} E 〈\nabla f (y_{t}), (\frac{m_{t - 1}}{\sqrt{{\hat{v}}_{t - 1}} + ϵ} - \frac{m_{t - 1}}{\sqrt{{\hat{v}}_{t}} + ϵ})〉 \\ \leq \frac{η β_{1}}{1 - β_{1}} ∥ \nabla f (y_{t}) ∥ ∥(\frac{m_{t - 1}}{\sqrt{{\hat{v}}_{t - 1}} + ϵ} - \frac{m_{t - 1}}{\sqrt{{\hat{v}}_{t}} + ϵ})∥ \\ \leq \frac{η_{l} η β_{1} H B^{2}}{1 - β_{1}} {∥(\frac{1}{\sqrt{{\hat{v}}_{t - 1}} + ϵ} - \frac{1}{\sqrt{{\hat{v}}_{t}} + ϵ})∥}_{1}, \end{matrix}

(31)

where the first inequality is from

〈 a, b 〉 \leq ∥ a ∥ ∥ b ∥

, and the second one holds by Lemma 1.

For the term

A_{3}

in (27), applying the inequality

{∥ a + b ∥}^{2} \leq {2 (∥ a ∥}^{2} + {∥ b ∥}^{2})

and (11) in Lemma 1 yields

\begin{matrix} A_{3} & = \frac{η^{2} L}{2} {∥- \frac{β_{1}}{1 - β_{1}} (\frac{m_{t - 1}}{\sqrt{{\hat{v}}_{t - 1}} + ϵ} - \frac{m_{t - 1}}{\sqrt{{\hat{v}}_{t}} + ϵ}) + \frac{Γ_{t}}{\sqrt{{\hat{v}}_{t}} + ϵ}∥}^{2} \\ \leq \frac{η^{2} β_{1}^{2} L}{{(1 - β_{1})}^{2}} {∥(\frac{m_{t - 1}}{\sqrt{{\hat{v}}_{t - 1}} + ϵ} - \frac{m_{t - 1}}{\sqrt{{\hat{v}}_{t}} + ϵ})∥}^{2} + L ∥ \frac{η Γ_{t}}{\sqrt{{\hat{v}}_{t}} + ϵ} ∥^{2} \\ \leq \frac{η^{2} η_{l}^{2} H^{2} B^{2} β_{1}^{2} L}{{(1 - β_{1})}^{2}} ∥ \frac{1}{\sqrt{{\hat{v}}_{t - 1}} + ϵ} - \frac{1}{\sqrt{{\hat{v}}_{t}} + ϵ} ∥^{2} + L ∥ \frac{η Γ_{t}}{\sqrt{{\hat{v}}_{t}} + ϵ} ∥^{2} . \end{matrix}

(32)

Using similar techniques to bound the term

T_{4}

, we have

\begin{matrix} A_{4} & = E 〈\nabla f (y_{t}) - \nabla f (x_{t}), \frac{η}{\sqrt{{\hat{v}}_{t}} + ϵ} ⊙ Γ_{t}〉 \\ \leq E ∥ \nabla f (y_{t}) - \nabla f (x_{t}) ∥ ∥ \frac{η}{\sqrt{{\hat{v}}_{t}} + ϵ} ⊙ Γ_{t} ∥ \\ \leq L E ∥y_{t} - x_{t}∥ ∥\frac{η}{\sqrt{{\hat{v}}_{t}} + ϵ} ⊙ Γ_{t}∥ \\ \leq \frac{L}{2} ∥ \frac{β_{1}}{1 - β_{1}} (\frac{η m_{t - 1}}{\sqrt{{\hat{v}}_{t - 1}} + ϵ} + ν d_{t - 1}) - \frac{ν Γ_{t - 1}}{\sqrt{{\hat{v}}_{t - 1}} + ϵ} ∥^{2} + \frac{L η^{2}}{2} ∥ \frac{Γ_{t}}{\sqrt{{\hat{v}}_{t}} + ϵ} ∥^{2} \\ \leq \frac{3 η^{2} L}{2} \frac{β_{1}^{2}}{{(1 - β_{1})}^{2}} ∥ \frac{m_{t - 1}}{\sqrt{{\hat{v}}_{t - 1}} + ϵ} ∥^{2} + \frac{3 L ν^{2}}{2} \frac{β_{1}^{2}}{{(1 - β_{1})}^{2}} {∥d_{t - 1}∥}^{2} + \frac{L η^{2}}{2} ∥ \frac{Γ_{t}}{\sqrt{{\hat{v}}_{t}} + ϵ} ∥^{2} \\ + \frac{3 L ν^{2}}{2} ∥ \frac{Γ_{t - 1}}{\sqrt{{\hat{v}}_{t - 1}} + ϵ} ∥^{2} \\ \leq \frac{3 η^{2} L}{2 ϵ^{2}} \frac{β_{1}^{2}}{{(1 - β_{1})}^{2}} {∥m_{t - 1}∥}^{2} + \frac{3 L ν^{2}}{2 ϵ^{2}} ∥ Γ_{t - 1} ∥^{2} + \frac{η^{2} L}{2 ϵ^{2}} {∥ Γ_{t} ∥}^{2} + \frac{6 L η_{l}^{2} H^{2} B^{2} ν^{2}}{ϵ^{2}} . \end{matrix}

(33)

Summing from

t = 1, ∆, T

and using the inequality in (24) of

\sum_{t = 1}^{T} {∥ m_{t} ∥}^{2}

, we have

\begin{matrix} \sum_{t = 1}^{T} T_{4} & \leq \frac{3 η^{2} L}{2 ϵ^{2}} \frac{β_{1}^{2}}{{(1 - β_{1})}^{2}} \frac{T η_{l}^{2} H δ_{l}^{2}}{n} + \frac{L (η^{2} + 3 ν^{2})}{2 ϵ^{2}} \sum_{t = 1}^{T} E {∥ Γ_{t} ∥}^{2} + \frac{6 T L η_{l}^{2} H^{2} B^{2} ν^{2}}{ϵ^{2}} \\ + \frac{3 η^{2} L}{2 ϵ^{2}} \frac{β_{1}^{2}}{{(1 - β_{1})}^{2}} \frac{η_{l}^{2}}{n^{2}} \sum_{t = 1}^{T} E {∥\sum_{i = 1}^{n} \sum_{h = 0}^{H - 1} \nabla f_{i} (x_{i, t}^{h})∥}^{2} . \end{matrix}

(34)

Plugging (30)–(34) into (27) and summing over t from 1 to T, we have

\begin{matrix} E f (y_{T + 1}) - E f (y_{1}) \\ \leq - \frac{η η_{l} H}{4} \sum_{t = 1}^{T} E {∥\frac{\nabla f (x_{t})}{\sqrt{\sqrt{{\hat{v}}_{t}} + ϵ}}∥}^{2} - \frac{η η_{l}}{2 H n^{2}} \sum_{t = 1}^{T} E {∥\frac{\sum_{i = 1}^{N} \sum_{h = 0}^{H - 1} \nabla f_{i} (x_{i, t}^{h})}{\sqrt{\sqrt{{\hat{v}}_{t}} + ϵ}}∥}^{2} \\ + \frac{η η_{l} β_{1} H B^{2}}{1 - β_{1}} \sum_{t = 1}^{T} {∥\frac{1}{\sqrt{{\hat{v}}_{t - 1}} + ϵ} - \frac{1}{\sqrt{{\hat{v}}_{t}} + ϵ}∥}_{1} + \frac{η^{2} η_{l}^{2} H^{2} B^{2} β_{1}^{2} L}{{(1 - β_{1})}^{2}} \sum_{t = 1}^{T} ∥ \frac{1}{\sqrt{{\hat{v}}_{t - 1}} + ϵ} - \frac{1}{\sqrt{{\hat{v}}_{t}} + ϵ} ∥^{2} \\ + \frac{(2 ϵ + 1) L η^{2} + 3 L ν^{2}}{2 ϵ^{2}} \sum_{t = 1}^{T} E {∥ Γ_{t} ∥}^{2} + \frac{3 η^{2} η_{l}^{2} β_{1}^{2} L H T δ_{l}^{2}}{2 n {(1 - β_{1})}^{2} ϵ^{2}} + \frac{5 η η_{l}^{3} L^{2} H^{2} T}{2 ϵ} (δ_{l}^{2} + 6 H δ_{g}^{2}) \\ + \frac{3 η^{2} L}{2 ϵ^{2}} \frac{β_{1}^{2}}{{(1 - β_{1})}^{2}} \frac{η_{l}^{2}}{n^{2}} \sum_{t = 1}^{T} E {∥\sum_{i = 1}^{n} \sum_{h = 0}^{H - 1} \nabla f_{i} (x_{i, t}^{h})∥}^{2} + \frac{6 T η_{l}^{2} L H^{2} B^{2} ν^{2}}{ϵ^{2}} . \end{matrix}

(35)

By using Lemma 2 and rearranging the terms, we have

\begin{matrix} E f (y_{T + 1}) - E f (y_{1}) \\ \leq - \frac{η η_{l} H}{4 (η_{l} H B + ϵ)} \sum_{t = 1}^{T} E {∥\nabla f (x_{t})∥}^{2} + \frac{η η_{l} β_{1} H B^{2} d}{(1 - β_{1}) ϵ} + \frac{3 η^{2} η_{l}^{2} β_{1}^{2} L H T δ_{l}^{2}}{2 n {(1 - β_{1})}^{2} ϵ^{2}} + \frac{6 T η_{l}^{2} L H^{2} B^{2} ν^{2}}{ϵ^{2}} \\ - (\frac{η η_{l}}{2 H n^{2} (H B + ϵ)} - \frac{3 η^{2} η_{l}^{2} β_{1}^{2} L}{2 n^{2} {(1 - β_{1})}^{2} ϵ^{2}} - \frac{3 L (η^{2} + ν^{2}) η_{l}^{2}}{2 n^{2} ϵ^{2}}) \sum_{t = 1}^{T} E {∥\sum_{i = 1}^{n} \sum_{h = 0}^{H - 1} \nabla f_{i} (x_{i, t}^{h})∥}^{2} \\ + \frac{5 η η_{l}^{3} L^{2} H^{2} T}{2 ϵ} (δ_{l}^{2} + 6 H δ_{g}^{2}) + \frac{η^{2} η_{l}^{2} H^{2} B^{2} β_{1}^{2} L d}{{(1 - β_{1})}^{2} ϵ^{2}} \\ \leq - \frac{η η_{l} H}{4 (η_{l} H B + ϵ)} \sum_{t = 1}^{T} E {∥\nabla f (x_{t})∥}^{2} + \frac{η η_{l} β_{1} H B^{2} d}{(1 - β_{1}) ϵ} + \frac{5 η η_{l}^{3} L^{2} H^{2} T}{2 ϵ} (δ_{l}^{2} + 6 H δ_{g}^{2}) \\ + \frac{3 η^{2} η_{l}^{2} β_{1}^{2} L H T δ_{l}^{2}}{2 n {(1 - β_{1})}^{2} ϵ^{2}} + \frac{η^{2} η_{l}^{2} H^{2} B^{2} β_{1}^{2} L d}{{(1 - β_{1})}^{2} ϵ^{2}} + \frac{6 T η_{l}^{2} L H^{2} B^{2} ν^{2}}{ϵ^{2}}, \end{matrix}

(36)

where the first inequality follows by

\frac{1}{η_{l} H B + ϵ} \geq \frac{1}{H B + ϵ}

and

ϵ \leq 1

, and the last one holds by the fact that

\frac{η η_{l}}{2 H n^{2} (H B + ϵ)} - \frac{3 η^{2} η_{l}^{2} β_{1}^{2} L}{2 n^{2} {(1 - β_{1})}^{2} ϵ^{2}} - \frac{3 L (η^{2} + ν^{2}) η_{l}^{2}}{2 n^{2} ϵ^{2}} \geq 0

; that is,

η_{l} \leq \frac{η ϵ^{2}}{3 H L (H B + ϵ) (η^{2} β_{1}^{2} / {(1 - β_{1})}^{2} + η^{2} + ν^{2})}

.

Dividing by T both sides of (36), we obtain

\begin{matrix} \frac{η η_{l} H}{4 (η_{l} H B + ϵ) T} \sum_{t = 1}^{T} E {∥\nabla f (x_{t})∥}^{2} & \leq \frac{f (y_{1}) - E f (y_{T + 1})}{T} + \frac{η η_{l} β_{1} H B^{2} d}{T (1 - β_{1}) ϵ} + \frac{η^{2} η_{l}^{2} H^{2} B^{2} β_{1}^{2} L d}{T {(1 - β_{1})}^{2} ϵ^{2}} \\ + \frac{3 η^{2} η_{l}^{2} β_{1}^{2} L H δ_{l}^{2}}{2 n {(1 - β_{1})}^{2} ϵ^{2}} + \frac{5 η η_{l}^{3} L^{2} H^{2}}{2 ϵ} (δ_{l}^{2} + 6 H δ_{g}^{2}) \\ + \frac{6 η_{l}^{2} L H^{2} B^{2} ν^{2}}{ϵ^{2}} . \end{matrix}

(37)

Therefore,

\begin{matrix} \frac{1}{T} \sum_{t = 1}^{T} E {∥\nabla f (x_{t})∥}^{2} & \leq 4 (η_{l} H B + ϵ) [\frac{f (y_{1}) - f^{*}}{η η_{l} H T} + \frac{β_{1} B^{2} d}{T (1 - β_{1}) ϵ} + \frac{3 η η_{l} β_{1}^{2} L δ_{l}^{2}}{2 n {(1 - β_{1})}^{2} ϵ^{2}} \\ + \frac{5 η_{l}^{2} L^{2} H}{2 ϵ} (σ_{l}^{2} + 6 H σ_{g}^{2}) + \frac{6 η_{l} L H B^{2} ν^{2}}{η ϵ^{2}}] . \end{matrix}

(38)

When using

C_{1} = \frac{β_{1} B^{2} d}{T (1 - β_{1}) ϵ} + \frac{η η_{l} H B^{2} β_{1}^{2} L d}{T {(1 - β_{1})}^{2} ϵ^{2}}

and

C_{2} = \frac{5 η_{l}^{2} L^{2} H}{2 ϵ} (δ_{l}^{2} + 6 H δ_{g}^{2}) + \frac{3 η η_{l} β_{1}^{2} L δ_{l}^{2}}{2 n {(1 - β_{1})}^{2} ϵ^{2}} + \frac{6 η_{l} L H B^{2} ν^{2}}{η ϵ^{2}}

, (38) can be expressed as follows

\begin{matrix} \frac{1}{T} \sum_{t = 1}^{T} E {∥\nabla f (x_{t})∥}^{2} \leq 4 (η_{l} H B + ϵ) [\frac{f (y_{1}) - f^{*}}{η η_{l} H T} + \frac{C_{1}}{T} + C_{2}] . \end{matrix}

(39)

The proof of Theorem 1 is finished. □

Next, we consider the case of partial client participation. Different from full client participation, the aggregated model difference

Γ_{t}

over the subset

S_{t}

is

\frac{1}{m} \sum_{i \in S_{t}} Γ_{i, t}

in round t. For convenience and the consistency of notations, we denote

{\bar{Γ}}_{t}

as the average of the aggregated global model difference hereafter, i.e.,

{\bar{Γ}}_{t} = \frac{1}{n} \sum_{i = 1}^{n} Γ_{i, t}

. In fact,

Γ_{t}

is an unbiased estimate of

{\bar{Γ}}_{t}

by Lemma 3.

Proof of Theorem 2.

According to Assumption 1 and taking expectation over the randomness in round t, we have

\begin{matrix} E f (y_{t + 1}) - E f (y_{t}) \\ \leq E 〈\nabla f (y_{t}), y_{t + 1} - y_{t}〉 + \frac{L}{2} E {∥y_{t + 1} - y_{t}∥}^{2} \\ = \underset{A_{1}^{'}}{\underset{︸}{E 〈\nabla f (x_{t}), \frac{η}{\sqrt{{\hat{v}}_{t}} + ϵ} ⊙ Γ_{t}〉}} - \underset{A_{2}^{'}}{\underset{︸}{\frac{η β_{1}}{1 - β_{1}} E 〈\nabla f (y_{t}), (\frac{m_{t - 1}}{\sqrt{{\hat{v}}_{t - 1}} + ϵ} - \frac{m_{t - 1}}{\sqrt{{\hat{v}}_{t}} + ϵ})〉}} \\ + \underset{A_{3}^{'}}{\underset{︸}{\frac{η^{2} L}{2} {∥\frac{Γ_{t}}{\sqrt{{\hat{v}}_{t}} + ϵ} - \frac{β_{1}}{1 - β_{1}} (\frac{m_{t - 1}}{\sqrt{{\hat{v}}_{t - 1}} + ϵ} - \frac{m_{t - 1}}{\sqrt{{\hat{v}}_{t}} + ϵ})∥}^{2}}} \\ + \underset{A_{4}^{'}}{\underset{︸}{E 〈\nabla f (y_{t}) - \nabla f (x_{t}), \frac{η}{\sqrt{{\hat{v}}_{t}} + ϵ} ⊙ Γ_{t}〉}} . \end{matrix}

(40)

Since

E_{S_{t}} [Γ_{t}] = {\bar{Γ}}_{t}

by Lemma 3, the term

A_{1}^{'}

can be estimated by the same upper bound as in (30). Similar to (31) and (32), we obtain the estimator for

A_{2}^{'}

and

A_{3}^{'}

, respectively.

\begin{matrix} \sum_{t = 1}^{T} A_{2}^{'} & \leq \frac{η η_{l} β_{1} H B^{2}}{1 - β_{1}} \sum_{t = 1}^{T} E ∥\frac{1}{\sqrt{{\hat{v}}_{t - 1}} + ϵ} - \frac{1}{\sqrt{{\hat{v}}_{t}} + ϵ}∥, \end{matrix}

(41)

and

\begin{matrix} \sum_{t = 1}^{T} A_{3}^{'} \leq \frac{η^{2} η_{l}^{2} H^{2} B^{2} β_{1}^{2} L}{{(1 - β_{1})}^{2}} \sum_{t = 1}^{T} ∥ \frac{1}{\sqrt{{\hat{v}}_{t - 1}} + ϵ} - \frac{1}{\sqrt{{\hat{v}}_{t}} + ϵ} ∥^{2} + \frac{η^{2} L}{ϵ^{2}} \sum_{t = 1}^{T} E {∥ Γ_{t} ∥}^{2}, \end{matrix}

(42)

and

\begin{matrix} \sum_{t = 1}^{T} A_{4}^{'} \leq \frac{3 η^{3} L}{2 ϵ^{2}} \frac{β_{1}^{2}}{{(1 - β_{1})}^{2}} \sum_{t = 1}^{T} E {∥m_{t}∥}^{2} + \frac{6 T L η_{l}^{2} η H^{2} B^{2} ν^{2}}{ϵ^{2}} + \frac{L η (1 + 3 ν^{2})}{2 ϵ^{2}} \sum_{t = 1}^{T} E {∥ Γ_{t} ∥}^{2} . \end{matrix}

(43)

Summing over t from 1 to T for (40) and plugging (30), (41)–(43), we have

\begin{matrix} E f (y_{T + 1}) - E f (y_{1}) \\ \leq \sum_{t = 1}^{T} (A_{1}^{'} + A_{2}^{'} + A_{3}^{'} + A_{4}^{'}) \\ \leq - \frac{η η_{l} H}{4} \sum_{t = 1}^{T} E {∥\frac{\nabla f (x_{t})}{\sqrt{\sqrt{{\hat{v}}_{t}} + ϵ}}∥}^{2} + \frac{5 η η_{l}^{3} L^{2} H^{2} T}{2 ϵ} (δ_{l}^{2} + 6 H δ_{g}^{2}) + \frac{η η_{l} β_{1} H B^{2} d}{(1 - β_{1}) ϵ} \\ - \frac{η η_{l}}{2 H n^{2}} \sum_{t = 1}^{T} E {∥\frac{\sum_{i = 1}^{n} \sum_{h = 0}^{H - 1} \nabla f_{i} (x_{i, t}^{h})}{\sqrt{\sqrt{{\hat{v}}_{t - 1}} + ϵ}}∥}^{2} + \frac{2 L η^{2} + (3 ν^{2} + 1) L η}{2 ϵ^{2}} \sum_{t = 1}^{T} E {∥ Γ_{t} ∥}^{2} \\ + \frac{η^{2} η_{l}^{2} H^{2} B^{2} β_{1}^{2} L d}{{(1 - β_{1})}^{2} ϵ^{2}} + \frac{3 η^{3} β_{1}^{2} L}{2 {(1 - β_{1})}^{2} ϵ^{2}} \sum_{t = 1}^{T} E {∥ m_{t} ∥}^{2} + \frac{6 T L η_{l}^{2} η H^{2} B^{2} ν^{2}}{ϵ^{2}}, \end{matrix}

(44)

where the inequality results from Lemma 2. Applying (19) and (25), we obtain

\begin{matrix} E f (y_{T + 1}) - E f (y_{1}) \\ \leq - \frac{η η_{l} H}{4 (η_{l} H B + ϵ)} \sum_{t = 1}^{T} E {∥\nabla f (x_{t})∥}^{2} + \frac{5 η η_{l}^{3} L^{2} H^{2} T}{2 ϵ} (δ_{l}^{2} + 6 H δ_{g}^{2}) + \frac{η η_{l} β_{1} H B^{2} d}{(1 - β_{1}) ϵ} \\ - [\frac{η η_{l}}{2 H n^{2} (H B + ϵ)} - \frac{C_{3} η_{l}^{2} η L (m - 1)}{m n (n - 1)}] \sum_{t = 1}^{T} E {∥\sum_{i = 1}^{n} \sum_{h = 0}^{H - 1} \nabla f_{i} (x_{i, t}^{h})∥}^{2} \\ + C_{3} η L T [\frac{η_{l}^{2} H δ_{l}^{2}}{m} + \frac{η_{l}^{2} (n - m)}{m (n - 1)} (15 H^{3} L^{2} η_{l}^{2} (δ_{l}^{2} + 6 H δ_{g}^{2}) + 3 H^{2} δ_{g}^{2})] \\ + \frac{C_{3} η_{l}^{2} η L (n - m)}{m (n - 1)} (90 H^{4} L^{2} η_{l}^{2} + 3 H^{2}) \sum_{t = 1}^{T} {∥ \nabla f (x_{t}) ∥}^{2} + \frac{η^{2} η_{l}^{2} β_{1}^{2} H^{2} B^{2} L d}{{(1 - β_{1})}^{2} ϵ^{2}} \\ + \frac{6 T L η_{l}^{2} η H^{2} B^{2} ν^{2}}{ϵ^{2}}, \end{matrix}

(45)

where

C_{3} = \frac{2 η + 3 ν^{2} + 1}{2 ϵ^{2}} + \frac{3 η^{2} β_{1}^{2}}{2 {(1 - β_{1})}^{2} ϵ^{2}}

. We rearrange the terms and let

η_{l}

satisfy the condition

\frac{η η_{l}}{2 H n^{2} (H B + ϵ)} - \frac{C_{3} η_{l}^{2} η L (m - 1)}{m n (n - 1)} \geq 0

, that is

η_{l} \leq \frac{m (n - 1) ϵ^{2}}{n H L (m - 1) (2 η^{2} + 3 ν^{2} + 1 + 3 η^{2} β_{1}^{2} / {(1 - β_{1})}^{2}) (H B + ϵ)}

. In addition,

η_{l}

also satisfies

\frac{η η_{l} H}{4 (η_{l} H B + ϵ)} - \frac{C_{3} η_{l}^{2} η L (n - m)}{m (n - 1)} (90 H^{4} L^{2} η_{l}^{2} + 3 H^{2}) \geq \frac{η η_{l} H}{8 (η_{l} H B + ϵ)}

. Thus, from (45), we have

\begin{matrix} \frac{η η_{l} H}{8 T (η_{l} H B + ϵ)} \sum_{t = 1}^{T} E {∥\nabla f (x_{t})∥}^{2} \\ \leq \frac{f (y_{1}) - E f (y_{T + 1})}{T} + \frac{5 η η_{l}^{3} L^{2} H^{2}}{2 ϵ} (δ_{l}^{2} + 6 H δ_{g}^{2}) + \frac{η η_{l} β_{1} H B^{2} d}{T (1 - β_{1}) ϵ} \\ + C_{3} η L [\frac{η_{l}^{2} H δ_{l}^{2}}{m} + \frac{η_{l}^{2} (n - m)}{m (n - 1)} (15 H^{3} L^{2} η_{l}^{2} (δ_{l}^{2} + 6 H δ_{g}^{2}) + 3 H^{2} δ_{g}^{2})] \\ + \frac{6 L η_{l}^{2} η H^{2} B^{2} ν^{2}}{ϵ^{2}} + \frac{η^{2} η_{l}^{2} β_{1}^{2} L H^{2} B^{2} L d}{T {(1 - β_{1})}^{2} ϵ^{2}} . \end{matrix}

(46)

Therefore,

\begin{matrix} \frac{1}{T} \sum_{t = 1}^{T} E {∥\nabla f (x_{t})∥}^{2} \leq 8 (η_{l} H B + ϵ) [\frac{f (y_{1}) - f^{*}}{η η_{l} H T} + \frac{C_{1}}{T} + C_{4}], \end{matrix}

(47)

where

C_{1} = \frac{β_{1} B^{2} d}{(1 - β_{1}) ϵ} + \frac{η η_{l} β_{1}^{2} L H B^{2} d}{{(1 - β_{1})}^{2} ϵ^{2}}

and

C_{4} = \frac{5 η_{l}^{2} L^{2} H}{2 ϵ} (δ_{l}^{2} + 6 H δ_{g}^{2}) + \frac{6 η_{l} L H B^{2} ν^{2}}{ϵ^{2}} + C_{3} L [\frac{η_{l} δ_{l}^{2}}{m} + \frac{η_{l} (n - m)}{m (n - 1)} (15 H^{2} L^{2} η_{l}^{2} (δ_{l}^{2} + 6 H δ_{g}^{2}) + 3 H δ_{g}^{2})]

.

The proof of Theorem 2 is finished. □

6. Experiments

In this section, the experiment for the image classification task is performed to evaluate the performance of the proposed algorithm. We first give the performance comparison between the proposed algorithm and other baseline algorithms. Then, we explore the effect of the number of local epochs and the number of clients on the algorithm performance.

We evaluate the performance of the proposed algorithm by conducting multi-class image classification tasks on the CIFAR-10 and CIFAR-100 datasets [38] with the convolutional neural network model RestNet-18 [39], respectively. The CIFAR-10 dataset consists of 60,000 32 × 32 pixels color pictures, which include 50,000 training images and 10,000 test images in a total of 10 categories with 6000 images in each category. The CIFAR-100 dataset has the same number of training samples and test samples with CIFAR-10, except it has 100 categories with 500 training images and 100 images test images per category. We first divide the training samples and test samples randomly and evenly among multiple clients. Based on the training set of each client, a local model is trained by an optimizer and sent to the centralized server. Then, the aggregated model is evaluated on a test set, which implies the generalization ability of the aggregated model. In the experiment, all the federated optimization algorithms including our proposed algorithm are implemented on a workstation equipped with two Intel(R) Xeon(R) Silver 4114 CPUs and two NVIDIA GeForce GTX 1080 Ti GPUs. The experiment environment is based on PyTorch 1.11.0. We consider a federated learning system with 100 clients in all and 0.1 as the ratio of client participation, which means 10 clients out of 100 are selected for training in each communication round. Specifically, each client randomly selected by the centralized server uses the same SGD optimizer and conducts 3 local epoch steps, and the local batch size is set to 20. The local learning rate and global learning rate are set to

0.01

and 1 by grid searching for the best performance. The parameter

ν

is searched from

{10^{- 2}, 10^{- 6}, 10^{- 8}, 10^{- 10}, 10^{- 12}}

for the best performance. In addition, momentum factors

β_{1}, β_{2}

are set to

0.9

and

0.99

by default, which follows the adaptive algorithms [16,18]. We compare the proposed algorithm with the classic federated average algorithm (FedAvg) [4] and federated adaptive-type algorithms [17] including FedAdam, FedYogi and FedAMS [18] as baselines. The experimental results are demonstrated in following figures.

Figure 1 and Figure 2 provide the convergence results for FedDA and other baselines on the CIFAR-10 dataset and CIFAR-100 dataset with the RestNet-18 model in terms of training loss and test accuracy. In order to show the difference between different curves clearly, we zoom in on the last few rounds as subgraphs and place them in the blank space of the figure. For the CIFAR-10 dataset, from the convergence curve of training loss against the communication rounds in Figure 1, we find that FedDA goes down faster with the increase of communication rounds, which suggests faster convergence than other baselines. Specifically, FedDA has similar performance with FedYogi and FedAMS in the middle of training, while it works slightly better in training loss and has the highest test accuracy among all the baseline algorithms in the end. These results empirically demonstrate the effectiveness of the proposed algorithm. The similar experiment results are obtained from Figure 2 for the CIFAR-100 dataset. As shown, the proposed FedDA converges slightly faster at the beginning and final stages of training, and it achieves a higher test accuracy, which suggests that FedDA behaves well.

As shown in Figure 3, the training loss and test accuracy of different client numbers are compared while other parameters remain the same. In this setting, we explore the impact of the client number n involved in training on the performance of FedDA by choosing different parameters n from

{5, 10, 20, 50}

. From the left figure of Figure 3, we observe that the convergence curve with

n = 50

has the fastest convergence rate. On the other hand, a smaller n requires more communication rounds to achieve the same target accuracy, as shown in the right figure of Figure 2.

In Figure 4, we investigate the impact of local epoch number H on the performance of FedDA by training the RestNet-18 model on the CIFAR-10 dataset while keeping the other parameters the same. We choose different H values from

{1, 3, 5, 10}

. From the learning curve of training loss against the communication round in Figure 4, we can find that the local training with a larger H can help accelerate the convergence. From the right figure of Figure 4, we observe that to achieve the same target accuracy, a larger local epoch H requires a smaller communication round.

7. Conclusions

In this paper, motivated by the idea of PID, we introduce the gradient difference to the stochastic optimization method and propose a differential adaptive method for federated learning, which aims to improve the performance while training the neural network by correcting the decent direction and adaptively updatingthe model parameters. We then analyze the convergence performance for our algorithm under full client participation and partial client participation cases in a stochastic non-convex optimization setting, where the convergence rate of

O (1 / \sqrt{T})

is achieved. Experiments are conducted on the CIFAR-10 dataset and CIFAR-100 dataset with the RestNet-18 model, respectively. The experiment results demonstrate that the proposed method FedDA and federated adaptive methods (FedYogi, FedAMS) have a better performance than FedAvg. FedDA surpasses the other baselines in terms of initial and final training loss, and it achieves the highest test accuracy in the end. Compared to other baselines, the proposed algorithm does not incur additional communication cost, while these performance gains are attained by increasing the computational cost of the difference terms. Thus, it is important to achieve a good trade-off between model performance and algorithm efficiency. In addition, the risk of information leakage exists when uploading model differences from clients to the centralized server. Hence, methods of privacy protection could be combined in the local training at the clients’ side. The balance between privacy levels and model accuracy can be considered as future research work.

Author Contributions

Conceptualization, H.G. and H.C.; methodology, H.G. and H.C.; software, X.Z. and Q.W.; validation, X.Z., J.Z. and M.Z.; formal analysis, H.G. and Q.W.; investigation, H.C. and M.Z.; resources, Q.W., X.Z., J.Z. and M.Z.; data curation, J.Z. and M.Z.; writing—original draft preparation, H.G. and Q.W.; writing—review and editing, Q.W. and J.Z.; visualization, X.Z.; supervision, J.Z. and M.Z.; project administration, J.Z. and M.Z.; funding acquisition, Q.W. and M.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the National Natural Science Foundation of China (NSFC) under Grants No. 61976243 and No. 61971458, in part by the Leading talents of science and technology in the Central Plain of China under Grant No. 224200510004, in part by the Science & Technology Innovation Talents in the University of Henan Province of China under Grant No. 22HASTIT014, in part by the basic research projects in the University of Henan Province, China under Grants No. 23ZX003, and in part by the International Cooperation Project of Henan Province under No. 232102521005.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

Portelli, K.; Anagnostopoulos, C. Leveraging Edge Computing through Collaborative Machine Learning. In Proceedings of the 5th International Conference on Future Internet of Things and Cloud Workshops, FiCloud Workshops, Prague, Czech Republic, 21–23 August 2017; IEEE Computer Society: New York, NY, USA, 2017; pp. 164–169. [Google Scholar]
Hu, Y.; Niu, D.; Yang, J.; Zhou, S. FDML: A Collaborative Machine Learning Framework for Distributed Features. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Anchorage, AK, USA, 4–8 August 2019; ACM: New York, NY, USA, 2019; pp. 2232–2240. [Google Scholar]
Konečný, J.; McMahan, H.B.; Yu, F.X.; Richtárik, P.; Suresh, A.T.; Bacon, D. Federated Learning: Strategies for Improving Communication Efficiency. arXiv 2016, arXiv:1610.05492. [Google Scholar]
McMahan, B.; Moore, E.; Ramage, D.; Hampson, S.; y Arcas, B.A. Communication-Efficient Learning of Deep Networks from Decentralized Data. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, Lauderdale, FL, USA, 20–22 April 2017; Volume 54, pp. 1273–1282. [Google Scholar]
Pinto Neto, E.C.; Sadeghi, S.; Zhang, X.; Dadkhah, S. Federated Reinforcement Learning in IoT: Applications, Opportunities and Open Challenges. Appl. Sci. 2023, 13, 6497. [Google Scholar] [CrossRef]
Huang, X.; Han, L.; Li, D.; Xie, K.; Zhang, Y. A reliable and fair federated learning mechanism for mobile edge computing. Comput. Netw. 2023, 226, 109678. [Google Scholar] [CrossRef]
Salim, M.M.; Park, J.H. Federated Learning-Based Secure Electronic Health Record Sharing Scheme in Medical Informatics. IEEE J. Biomed. Health Inform. 2023, 27, 617–624. [Google Scholar] [CrossRef] [PubMed]
Kong, X.; Gao, H.; Shen, G.; Duan, G.; Das, S.K. FedVCP: A Federated-Learning-Based Cooperative Positioning Scheme for Social Internet of Vehicles. IEEE Trans. Comput. Soc. Syst. 2022, 9, 197–206. [Google Scholar] [CrossRef]
Stich, S.U. Local SGD Converges Fast and Communicates Little. In Proceedings of the 7th International Conference on Learning Representations, New Orleans, LA, USA, 6–9 May 2019. [Google Scholar]
Li, T.; Sahu, A.K.; Zaheer, M.; Sanjabi, M.; Talwalkar, A.; Smith, V. Federated Optimization in Heterogeneous Networks. In Proceedings of the Machine Learning and Systems, Austin, TX, USA, 2–4 March 2020. [Google Scholar]
Karimireddy, S.P.; Kale, S.; Mohri, M.; Reddi, S.J.; Stich, S.U.; Suresh, A.T. SCAFFOLD: Stochastic Controlled Averaging for Federated Learning. In Proceedings of the 37th International Conference on Machine Learning, Virtual Event, 13–18 July 2020; Volume 119, pp. 5132–5143. [Google Scholar]
Liu, W.; Chen, L.; Chen, Y.; Zhang, W. Accelerating Federated Learning via Momentum Gradient Descent. IEEE Trans. Parallel Distrib. Syst. 2020, 31, 1754–1766. [Google Scholar] [CrossRef] [Green Version]
Ozfatura, E.; Ozfatura, K.; Gündüz, D. FedADC: Accelerated Federated Learning with Drift Control. In Proceedings of the IEEE International Symposium on Information Theory, Melbourne, VA, Australia, 12–20 July 2021; pp. 467–472. [Google Scholar]
An, W.; Wang, H.; Sun, Q.; Xu, J.; Dai, Q.; Zhang, L. A PID Controller Approach for Stochastic Optimization of Deep Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8522–8531. [Google Scholar]
Shi, L.; Zhang, Y.; Wang, W.; Cheng, J.; Lu, H. Rethinking The Pid Optimizer For Stochastic Optimization Of Deep Networks. In Proceedings of the IEEE International Conference on Multimedia and Expo, London, UK, 6–10 July 2020; pp. 1–6. [Google Scholar]
Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. In Proceedings of the International Conference on Learning Representations, San Diego, CA, USA, 7–9 May 2015. [Google Scholar]
Reddi, S.J.; Charles, Z.; Zaheer, M.; Garrett, Z.; Rush, K.; Konečný, J.; Kumar, S.; McMahan, H.B. Adaptive Federated Optimization. In Proceedings of the 9th International Conference on Learning Representations, Virtual Event, 3–7 May 2021. [Google Scholar]
Wang, Y.; Lin, L.; Chen, J. Communication-Efficient Adaptive Federated Learning. In Proceedings of the International Conference on Machine Learning, Baltimore, MA, USA, 17–23 July 2022; Volume 162, pp. 22802–22838. [Google Scholar]
Hamid, O.H. Data-Centric and Model-Centric AI: Twin Drivers of Compact and Robust Industry 4.0 Solutions. Appl. Sci. 2023, 13, 2753. [Google Scholar] [CrossRef]
Robbins, H.; Monro, S. A stochastic approximation method. Ann. Math. Stat. 1951, 22, 400–407. [Google Scholar] [CrossRef]
Polyak, B.T. Some methods of speeding up the convergence of iteration methods. Ussr Comput. Math. Math. Phys. 1964, 4, 1–17. [Google Scholar] [CrossRef]
Nesterov, Y. A method of solving a convex programming problem with convergence rate O (1/k²). Sov. Math. Dokl. 1983, 269, 372–376. [Google Scholar]
Sutskever, I.; Martens, J.; Dahl, G.E.; Hinton, G.E. On the importance of initialization and momentum in deep learning. In Proceedings of the 30th International Conference on Machine Learning, Atlanta, GA, USA, 17–19 June 2013; pp. 1139–1147. [Google Scholar]
Duchi, J.C.; Hazan, E.; Singer, Y. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. J. Mach. Learn. Res. 2011, 12, 2121–2159. [Google Scholar]
Reddi, S.J.; Kale, S.; Kumar, S. On the Convergence of Adam and Beyond. In Proceedings of the International Conference on Learning Representations, Vancouver, BC, USA, 30 April–3 May 2018. [Google Scholar]
Xie, X.; Zhou, P.; Li, H.; Lin, Z.; Yan, S. Adan: Adaptive Nesterov Momentum Algorithm for Faster Optimizing Deep Models. arXiv 2022, arXiv:2208.06677. [Google Scholar]
Recht, B. A tour of reinforcement learning: The view from continuous control. Annu. Rev. Control. Robot. Auton. Syst. 2019, 2, 253–279. [Google Scholar] [CrossRef] [Green Version]
Weng, B.; Sun, J.; Sadeghi, A.; Wang, G. AdaPID: An Adaptive PID Optimizer for Training Deep Neural Networks. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Singapore, 23–27 May 2022; pp. 3943–3947. [Google Scholar]
Khaled, A.; Mishchenko, K.; Richtárik, P. Tighter Theory for Local SGD on Identical and Heterogeneous Data. In Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics, Palermo, Italy, 26–28 August 2020; Volume 108, pp. 4519–4529. [Google Scholar]
Guo, Y.; Sun, Y.; Hu, R.; Gong, Y. Hybrid Local SGD for Federated Learning with Heterogeneous Communications. In Proceedings of the 10th International Conference on Learning Representations, Virtual, 25–29 April 2022. [Google Scholar]
Qu, Z.; Lin, K.; Kalagnanam, J.; Li, Z.; Zhou, J.; Zhou, Z. Federated Learning’s Blessing: FedAvg has Linear Speedup. arXiv 2020, arXiv:2007.05690. [Google Scholar]
Das, R.; Acharya, A.; Hashemi, A.; Sanghavi, S.; Dhillon, I.S.; Topcu, U. Faster non-convex federated learning via global and local momentum. In Proceedings of the 38th Conference on Uncertainty in Artificial Intelligence, Eindhoven, The Netherlands, 1–5 August 2022; Volume 180, pp. 496–506. [Google Scholar]
Tong, Q.; Liang, G.; Bi, J. Calibrating the adaptive learning rate to improve convergence of ADAM. Neurocomputing 2022, 481, 333–356. [Google Scholar] [CrossRef] [PubMed]
Chen, X.; Li, X.; Li, P. Toward Communication Efficient Adaptive Gradient Method. In Proceedings of the FODS ’20: ACM-IMS Foundations of Data Science Conference, Virtual Event, 19–20 October 2020; pp. 119–128. [Google Scholar]
Jhunjhunwala, D.; Wang, S.; Joshi, G. FedExP: Speeding up Federated Averaging Via Extrapolation. arXiv 2023, arXiv:2301.09604. [Google Scholar]
Zhuang, J.; Tang, T.; Ding, Y.; Tatikonda, S.C.; Dvornek, N.C.; Papademetris, X.; Duncan, J.S. AdaBelief Optimizer: Adapting Stepsizes by the Belief in Observed Gradients. In Proceedings of the Advances in Neural Information Processing Systems, Virtual Event, 6–12 December 2020. [Google Scholar]
Yang, H.; Fang, M.; Liu, J. Achieving Linear Speedup with Partial Worker Participation in Non-IID Federated Learning. In Proceedings of the 9th International Conference on Learning Representations, Virtual Event, 3–7 May 2021. [Google Scholar]
Krizhevsky, A.; Hinton, G. Learning Multiple Layers of Features From Tiny Images; Technical Report; University of Toronto: Toronto, ON, USA, 2009. [Google Scholar]
He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]

Figure 1. Comparison for FedDA and other baselines on the CIFAR-10 dataset with the RestNet-18 model.

Figure 2. Comparison for FedDA and other baselines on the CIFAR-100 dataset with the RestNet-18 model.

Figure 3. Training loss and test accuracy with respect to client number on CIFAR-10 dataset.

Figure 4. Training loss and test accuracy with respect to different local epoch numbers on the CIFAR-10 dataset.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Gao, H.; Wu, Q.; Cao, H.; Zhao, X.; Zhu, J.; Zhang, M. A Derivative-Incorporated Adaptive Gradient Method for Federated Learning. Mathematics 2023, 11, 3403. https://0-doi-org.brum.beds.ac.uk/10.3390/math11153403

AMA Style

Gao H, Wu Q, Cao H, Zhao X, Zhu J, Zhang M. A Derivative-Incorporated Adaptive Gradient Method for Federated Learning. Mathematics. 2023; 11(15):3403. https://0-doi-org.brum.beds.ac.uk/10.3390/math11153403

Chicago/Turabian Style

Gao, Huimin, Qingtao Wu, Hongyan Cao, Xuhui Zhao, Junlong Zhu, and Mingchuan Zhang. 2023. "A Derivative-Incorporated Adaptive Gradient Method for Federated Learning" Mathematics 11, no. 15: 3403. https://0-doi-org.brum.beds.ac.uk/10.3390/math11153403

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

A Derivative-Incorporated Adaptive Gradient Method for Federated Learning

Abstract

1. Introduction

2. Related Work

3. Problem Formulation and Algorithm Design

4. Main Theoretical Results

5. Performance Analysis

6. Experiments

7. Conclusions

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI