Article

Wireless Network Optimization for Federated Learning with Model Compression in Hybrid VLC/RF Systems †

1 Beijing Key Laboratory of Network System Architecture and Convergence, School of Information and Communication Engineering, Beijing University of Posts and Telecommunications, Beijing 100876, China
2 Department of Electrical and Computer Engineering, Princeton University, Princeton, NJ 08544, USA
* Author to whom correspondence should be addressed.
† This paper is an extended version of our paper published in the Proceedings of the IEEE Wireless Communications and Networking Conference, Nanjing, China, 29 March–1 April 2021.
Submission received: 27 August 2021 / Revised: 16 October 2021 / Accepted: 17 October 2021 / Published: 27 October 2021
(This article belongs to the Special Issue Machine Learning for Communications)

Abstract

In this paper, the optimization of network performance to support the deployment of federated learning (FL) is investigated. In the considered model, each user trains a local machine learning (ML) model on its own dataset and transmits the model parameters to a base station (BS), which aggregates them into a global ML model and transmits it back to each user. Due to limited radio frequency (RF) resources, the number of users that can participate in FL is restricted. Meanwhile, the repeated uploading and downloading of FL parameters increases the communication cost, which further reduces the number of participating users. To this end, we propose to introduce visible light communication (VLC) as a supplement to RF, to use compression methods to reduce the resources needed to transmit FL parameters over wireless links, and to simultaneously optimize the wireless network through user selection and resource allocation. The user selection and bandwidth allocation problem is formulated as an optimization problem whose goal is to minimize the FL training loss. We first use a model compression method to reduce the size of the FL model parameters that are transmitted over wireless links. Then, the optimization problem is separated into two subproblems. The first subproblem is a user selection problem with a given bandwidth allocation, which is solved by a traversal algorithm. The second subproblem is a bandwidth allocation problem with a given user selection, which is solved by a numerical method. The final user selection and bandwidth allocation are obtained by iteratively compressing the model and solving these two subproblems. Simulation results show that the proposed FL algorithm can improve the accuracy of object recognition by up to 16.7% and the number of selected users by up to 68.7%, compared to a conventional FL algorithm using only RF.

1. Introduction

Federated learning (FL), which allows edge devices to cooperatively train a shared machine learning model without transmitting private data, is an emerging distributed machine learning technique [1,2]. The FL training process needs to iteratively transmit machine learning parameters over wireless links. However, due to dynamic wireless channels and imperfect wireless transmission, the performance of FL will be significantly affected by wireless communication. In addition, due to limited communication resources, the number of users that can participate in FL is limited.

1.1. Related Work

A number of prior studies in [3,4,5,6,7,8,9,10,11,12,13,14,15,16] have investigated important problems related to wireless network optimization of FL. The works in [3,4,5,6] provided comprehensive surveys of existing studies and summarized open problems in FL. One key challenge is the contradiction between the huge communication cost required by FL parameter transmission and the limited available communication resources [5]. Therefore, on one hand, the existing studies in [7,8,9,10,11,12] proposed to compress the FL model parameters to reduce the communication cost. In particular, the authors of [7] proposed a sparsification and quantization method that compresses the trained FL model. In addition, they also proposed a low-rank method and a random mask method, which directly learn a model from a restricted space. In [8], the authors combined quantization and sparsification to implement sketched updates with a low sparsity rate. The work in [9] proposed a sparse ternary compression (STC) method that incorporates gradient sparsification, ternary quantization, and lossless encoding, which further improves the compression gains. The authors of [10] proposed to introduce the STC of [9] into structured updates, thus focusing on compression during the training phase instead of compressing the trained FL model. Overall, as demonstrated in [7,10], FL model compression can significantly reduce communication costs with minor impact on training accuracy.
On the other hand, the studies in [13,14,15,16] proposed to optimize resource allocation to improve communication efficiency in FL. In [13], the authors studied a joint learning, wireless resource allocation, and user selection optimization problem to improve FL performance. In [14], the trade-offs between time and energy consumption and between computation and communication delay were studied. Meanwhile, the authors of [15] optimized a joint computation and transmission problem, whose goal is to minimize the total energy consumption under communication constraints such as limited computational resources and transmission energy. In addition, the authors of [16] formulated a resource allocation optimization problem and introduced the use of artificial neural networks (ANNs) to predict the unselected users' FL model parameters, so as to improve the FL convergence speed and training loss.
Based on the above research, we can observe that existing studies improve communication efficiency either by reducing the amount of transmitted data or by optimizing wireless resource allocation to improve each user's data rate. As a complement to RF, visible light communication (VLC) has several advantages, including a large license-free bandwidth, high energy efficiency, and freedom from interference with RF systems. Introducing VLC into FL can significantly supplement the communication resources available for FL training. In addition, model compression can further reduce the resources needed to transmit FL model parameters and increase the number of users participating in FL training.

1.2. Contribution

The main contribution of this paper is a novel hybrid VLC/RF FL algorithm that jointly optimizes user selection, bandwidth allocation, and model compression (USBA-MC). To the best of our knowledge, this is the first work that introduces the use of VLC techniques for FL performance optimization. The contributions are summarized as follows.
  • We propose a USBA-MC algorithm over a hybrid VLC/RF system. In the USBA-MC algorithm, each user obtains a local FL model by training on its own dataset and transmits the model parameters to a base station (BS). The BS aggregates the received local models to generate a global FL model and transmits it back to each user. For the considered FL model, the performance is significantly affected by wireless factors such as the available bandwidth and the users' channel state information. This yields a joint user selection and bandwidth allocation problem whose goal is to minimize the FL training loss.
  • To solve this problem, we first introduce a model compression method to reduce the size of FL model parameters that are transmitted over wireless links. To this end, we first sort the model parameters and design a threshold selection mechanism according to the sparsity rate. Then, we cut off the redundant model parameters based on the threshold and, thus, compress an FL model of each user.
  • Following the model compression, we separate the joint user selection and bandwidth allocation problem into two subproblems. The first subproblem is a user selection problem with a given bandwidth allocation, which is solved by a traversal algorithm. The second subproblem is a bandwidth allocation problem with a given user selection, which is solved by a numerical method. The ultimate user selection and bandwidth allocation are obtained by iteratively compressing the model and solving these two subproblems.
Simulation results show that the proposed FL algorithm can improve the accuracy of object recognition by up to 16.7% and improve the number of selected users by up to 68.7%, compared to a conventional FL algorithm using only RF.
The remainder of this paper is organized as follows. In Section 2, we introduce the hybrid VLC/RF system model. Section 3 introduces a model compression method. The joint user selection, bandwidth allocation, and model compression algorithm is described in Section 4. Simulation results are presented and discussed in Section 5. Finally, Section 6 draws some important conclusions.

2. System Model and Problem Formulation

In this section, we first introduce a hybrid VLC/RF system for FL. Then, we introduce the computational model and the communication models of RF and VLC systems. Finally, based on the established model, we introduce a user selection and bandwidth allocation problem.

2.1. FL Model

In this model, each user n stores a local dataset $\mathcal{D}_n$ with $D_n$ being the number of training data samples. Therefore, the total number of training data samples of all users is $D = \sum_{n=1}^{N} D_n$. We assume that the training data samples of user n can be expressed by $\{\boldsymbol{x}_n, \boldsymbol{y}_n\}$ with $\boldsymbol{x}_n = [x_{n1}, \ldots, x_{nD_n}]$ and $\boldsymbol{y}_n = [y_{n1}, \ldots, y_{nD_n}]$, where each $x_{ni}$ is an input vector of the FL algorithm and $y_{ni}$ is the output of $x_{ni}$.
For each user, the FL training purpose is to find the model parameter $\omega$ that minimizes the loss function
$$J_n(\omega) := \frac{1}{D_n} \sum_{i \in \mathcal{D}_n} f_i(\omega), \qquad (1)$$
where $f_i(\omega)$ is a loss function that captures the performance of the FL algorithm. For example, for a linear regression FL, the loss function is $f_i(\omega) = \frac{1}{2}(\boldsymbol{x}_i^T \omega - y_i)^2$ [14].
All users aim to minimize the following global loss function:
$$\min_{\omega \in \mathbb{R}^d} J(\omega) := \sum_{n=1}^{N} \frac{D_n}{D} J_n(\omega). \qquad (2)$$
To solve (2), the BS first transmits the global FL model parameters to its users, and the users use the received global parameters to train their local FL models. Then, the users transmit their local FL model parameters to the BS to update the global FL model. For a strongly convex objective $J(\omega)$, the maximum number of global iterations that an FL algorithm needs to converge is [17]
$$K(\varepsilon, \theta) = \frac{\mathcal{O}(\log(1/\varepsilon))}{1 - \theta}, \qquad (3)$$
where $\varepsilon$ is the accuracy of the global model and $\theta$ is the accuracy of the local model. We consider a fixed global accuracy $\varepsilon$.
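As a concrete illustration of this training loop, the following Python sketch implements the local-update and weighted-aggregation steps of (1) and (2) for the linear-regression loss; the function names, learning rate, and toy data are illustrative assumptions, not the exact configuration used in this paper.

```python
import numpy as np

def local_update(w_global, X, y, lr=0.01, local_iters=10):
    """Illustrative local training: gradient steps on the linear-regression
    loss f_i(w) = 0.5 * (x_i^T w - y_i)^2 of Eq. (1)."""
    w = w_global.copy()
    for _ in range(local_iters):
        grad = X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w

def global_round(w_global, user_data):
    """One global iteration: each user trains locally, the BS aggregates
    the local models weighted by D_n / D, as in Eq. (2)."""
    D = sum(len(y) for _, y in user_data)
    w_new = np.zeros_like(w_global)
    for X, y in user_data:
        w_n = local_update(w_global, X, y)
        w_new += (len(y) / D) * w_n
    return w_new

# Toy example with N = 3 users and random data.
rng = np.random.default_rng(0)
users = [(rng.normal(size=(20, 5)), rng.normal(size=20)) for _ in range(3)]
w = np.zeros(5)
for t in range(50):  # number of global iterations, cf. K(eps, theta) in Eq. (3)
    w = global_round(w, users)
```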

2.2. FL Based on Hybrid VLC/RF System

Due to the limited wireless bandwidth, only a subset of users can be selected for FL training, which can seriously degrade the training accuracy. To enable more users to join the FL training process, we design a hybrid VLC/RF system. The system structure is shown in Figure 1.
The considered system consists of one BS, home gateways, and users that cooperatively perform an FL algorithm for data analysis and inference. Denote the set of all users by $\mathcal{N}$ with $N$ users, the set of indoor users by $\mathcal{N}_1$ with $N_1$ users, and the set of outdoor users by $\mathcal{N}_2$ with $N_2$ users. In this model, the BS sends the global FL model parameters to outdoor users over RF. Meanwhile, the BS transmits the global model parameters to the home gateways, which are connected to the indoor VLC access points (APs). Then, the VLC APs transmit the global FL model parameters to indoor users through visible light signals. We assume that the BS and the home gateways are connected by fiber, over which bit errors are negligible.
In indoor scenarios, each VLC AP consists of an LED lamp. Each user is served by the AP that provides the strongest signal. In addition, we assume that all indoor users can be covered by visible light. We also assume that there is a central unit (CU) that controls both the VLC and RF systems. Note that there is no interference between the RF and VLC systems, which is a key benefit of introducing VLC for the deployment of FL over wireless networks.

2.3. Computational Model

Let $c_n$ be the number of CPU cycles for user n to process one sample of data. As the data size of each training data sample is equal, the number of CPU cycles required for user n to execute one local iteration is $c_n D_n$. Denote the CPU-cycle frequency of user n by $f_n$. Then, the energy consumption of user n updating its local FL model in one global iteration can be expressed as
$$E_n^P = \nu \frac{\alpha_n}{2} c_n D_n f_n^2 \log(1/\theta), \qquad (4)$$
where $n = 1, 2, \ldots, N$, $\frac{\alpha_n}{2}$ is the effective capacitance coefficient of the computing chipset of user n, and $\nu$ is a positive constant that depends on the data size of a training data sample and the number of conditions of the local problem [14].
Furthermore, the computational time per local iteration of user n is $\frac{c_n D_n}{f_n}$, $n = 1, 2, \ldots, N$. The total computational time, however, depends on the number of local iterations, which is upper bounded by $\mathcal{O}(\log(1/\theta))$. Therefore, the required computational time of user n for data processing is
$$t_n^P = \nu \frac{c_n D_n}{f_n} \log(1/\theta). \qquad (5)$$
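For illustration, the per-round computation cost of (4) and (5) can be evaluated as in the following sketch; the numeric inputs are hypothetical placeholders, with $\alpha_n$ taken from Table 1.

```python
import math

def local_computation_cost(c_n, D_n, f_n, theta, alpha_n=2e-28, nu=1.0):
    """Energy (Eq. (4)) and time (Eq. (5)) of local training per global
    iteration; the number of local iterations scales with log(1/theta)."""
    iters = nu * math.log(1.0 / theta)
    energy = (alpha_n / 2.0) * c_n * D_n * f_n**2 * iters
    time = (c_n * D_n / f_n) * iters
    return energy, time

# Hypothetical user: 20 cycles/sample, 500 samples, 1 GHz CPU, theta = 0.1.
E_P, t_P = local_computation_cost(c_n=20, D_n=500, f_n=1e9, theta=0.1)
```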

2.4. RF Transmission Model

We use the orthogonal frequency division multiple access (OFDMA) technique for both uplink and downlink RF transmissions. The uplink rate of user n is given by
$$r_n^U = \sum_{i=1}^{R^U} r_{n,i}^U B^U \log_2\left(1 + \frac{P_n h_n}{\sum_{i' \in \mathcal{U}_n} P_{i'} h_{i'} + B^U N_0^{RF}}\right), \qquad (6)$$
where $\boldsymbol{r}_n^U = [r_{n,1}^U, \ldots, r_{n,R^U}^U]$ is a resource block (RB) allocation vector and $R^U$ is the total number of RBs that the BS can allocate to the users; $r_{n,i}^U \in \{0, 1\}$ and $\sum_{i=1}^{R^U} r_{n,i}^U = 1$, where $r_{n,i}^U = 1$ implies that RB i is allocated to user n and otherwise $r_{n,i}^U = 0$; $\mathcal{U}_n$ represents the set of users that are located in other service areas and transmit data over RB i; $B^U$ is the bandwidth of each RB and $P_n$ is the transmit power of user n; $h_n$ is the channel gain between user n and the BS; $N_0^{RF}$ is the noise power spectral density; $\sum_{i' \in \mathcal{U}_n} P_{i'} h_{i'}$ is the interference caused by users located in other service areas that use the same RB.
On the other hand, the downlink data rate of the BS transmitting the global FL model parameters to each user n is given by
$$r_n^D = \sum_{i=1}^{R^D} r_{n,i}^D B^D \log_2\left(1 + \frac{P_B h_n}{\sum_{j \in \mathcal{B}} P_B h_{nj} + B^D N_0^{RF}}\right), \qquad (7)$$
where $B^D$ is the bandwidth of each RB that the BS uses to transmit the global FL model to each user n; $\boldsymbol{r}_n^D = [r_{n,1}^D, \ldots, r_{n,R^D}^D]$ is an RB allocation vector with $R^D$ being the total number of RBs that the BS can use for FL parameter transmission; $r_{n,i}^D \in \{0, 1\}$ and $\sum_{i=1}^{R^D} r_{n,i}^D = 1$, where $r_{n,i}^D = 1$ indicates that RB i is allocated to user n and otherwise $r_{n,i}^D = 0$; $P_B$ is the transmit power of the BS; $\mathcal{B}$ is the set of other BSs that cause interference to the BS that performs the FL algorithm; $h_{nj}$ is the channel gain between user n and BS j. Let $B^R$ be the total RF bandwidth, so that $R^U \times B^U + R^D \times B^D \leq B^R$. For simplicity, we assume $B^U = B^D$, which means the bandwidth of an uplink RB is equal to that of a downlink RB.
Denote the data size in bits of the FL model that each user needs to upload by $s^L$. To upload the local FL model within the transmission delay requirement $t_n^U$, we need $t_n^U r_n^U \geq s^L$. Meanwhile, the energy required by user n for transmitting FL parameters is $E_n^M = t_n^U P_n$. Similarly, we assume that the data size in bits of the global parameters transmitted to the users is $s^G$. To download the global FL model within the transmission delay $t_n^D$, we need $t_n^D r_n^D \geq s^G$.
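A minimal sketch of how the rate expression (6) translates into the upload delay and energy bounds is given below; all channel, interference, and power values are illustrative placeholders.

```python
import math

def rf_rate(rb_alloc, B, P, h, interference, N0=1e-21):
    """Achievable OFDMA rate in the form of Eqs. (6)-(7): the sum over
    allocated RBs of B * log2(1 + SINR)."""
    sinr = P * h / (interference + B * N0)
    return sum(rb_alloc) * B * math.log2(1.0 + sinr)

# Uplink delay bound t_n^U >= s_L / r_n^U for an s_L-bit local model.
s_L = 1e6                                  # 1 Mb local model (Table 1)
r_up = rf_rate([1], B=1e6, P=0.1, h=1e-7, interference=1e-9)
t_up = s_L / r_up                          # minimum upload delay
E_up = t_up * 0.1                          # E_n^M = t_n^U * P_n
```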

2.5. VLC Transmission Model

The optical channel gain of a line-of-sight (LoS) channel can be expressed as [18]
$$u = \begin{cases} \frac{(m+1) A_p}{2 \pi d^2} T_s(\theta) g(\theta) \cos^m(\varphi) \cos(\theta), & 0 < \theta \leq \Theta_F, \\ 0, & \theta > \Theta_F, \end{cases} \qquad (8)$$
where $m = -\frac{1}{\log_2(\cos(\theta_{1/2}))}$ is the Lambertian index, which is a function of the half-intensity radiation angle $\theta_{1/2}$; $A_p$ is the physical area of the receiver's photo-diode; $d$ is the distance from the VLC AP to the optical receiver; $\varphi$ is the angle of irradiation and $\theta$ is the angle of incidence; $\Theta_F$ is the half angle of the receiver's field of view (FoV); $T_s(\theta)$ is the gain of the optical filter; and the concentrator gain $g(\theta)$ can be written as
$$g(\theta) = \begin{cases} \frac{n_0^2}{\sin^2 \Theta_F}, & 0 < \theta \leq \Theta_F, \\ 0, & \theta > \Theta_F, \end{cases} \qquad (9)$$
where n 0 is the refractive index. For a given user n connected to a VLC AP k, the signal-to-interference-plus-noise ratio (SINR) can be written as
s n k = ( γ u n k P v ) 2 N 0 V L C B + l k ( γ u n l P v ) 2 ,
where $\gamma$ is the optical-to-electric conversion efficiency, $P_v$ is the transmitted optical power of a VLC AP, $N_0^{VLC}$ is the noise power spectral density, $u_{nk}$ is the channel gain between user n and VLC AP k, $u_{nl}$ is the channel gain between user n and the interfering VLC AP l, and $B$ is the bandwidth of each VLC RB. Each user is served by the single VLC AP that provides the largest SINR for that user. In the VLC system, optical OFDMA is employed. It is known that the input signal of the LEDs is amplitude constrained. Therefore, the classical Shannon capacity formula for complex, average-power-constrained signals is not applicable in VLC, and a lower bound on the achievable data rate is used instead, which can be expressed as [19]
$$r_n = \sum_{i=1}^{R^V} r_{n,i}^V \frac{B}{2} \log_2\left(1 + \frac{2}{\pi e} s_n\right), \qquad (11)$$
where $s_n$ is the largest SINR, evaluated as $s_n = \max\{s_{n1}, \ldots, s_{nK}\}$ with $K$ being the total number of VLC APs; $\boldsymbol{r}_n^V = [r_{n,1}^V, \ldots, r_{n,R^V}^V]$ is an RB allocation vector with $R^V$ being the total number of VLC RBs; $r_{n,i}^V \in \{0, 1\}$ and $\sum_{i=1}^{R^V} r_{n,i}^V = 1$, where $r_{n,i}^V = 1$ indicates that RB i is allocated to user n and otherwise $r_{n,i}^V = 0$. Similarly, we have $R^V \times B \leq B^V$, where $B^V$ is the total bandwidth of VLC.
As the data size of the global parameters is $s^G$, the downlink transmission delay of indoor user n in each global iteration is $t_{dn} = \frac{s^G}{r_n}$.
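The following sketch evaluates the LoS gain (8)–(9) and the rate lower bound (11) for a single AP; the geometry is hypothetical, while the device parameters follow Table 1.

```python
import math

def vlc_los_gain(d, phi, theta, theta_half=math.radians(60),
                 A_p=1e-4, Theta_F=math.radians(90), T_s=1.0, n0=1.5):
    """LoS optical channel gain of Eq. (8) with the concentrator gain of
    Eq. (9); parameter defaults follow Table 1."""
    if theta > Theta_F:
        return 0.0
    m = -1.0 / math.log2(math.cos(theta_half))       # Lambertian index
    g = n0**2 / math.sin(Theta_F)**2                 # concentrator gain
    return ((m + 1) * A_p / (2 * math.pi * d**2)
            * T_s * g * math.cos(phi)**m * math.cos(theta))

def vlc_rate(u, B=40e6, P_v=9.0, gamma=0.53, N0=1e-21, interference=0.0):
    """Rate lower bound of Eq. (11): (B/2) * log2(1 + (2/(pi*e)) * SINR)."""
    sinr = (gamma * u * P_v)**2 / (N0 * B + interference)
    return (B / 2.0) * math.log2(1.0 + (2.0 / (math.pi * math.e)) * sinr)

# Hypothetical indoor geometry: 2.5 m link, 20 deg irradiation/incidence.
u = vlc_los_gain(d=2.5, phi=math.radians(20), theta=math.radians(20))
r = vlc_rate(u)     # the downlink delay is then t_dn = s_G / r
```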

2.6. Problem Formulation

Next, we introduce the optimization problem. Our goal is to minimize the global loss function under time, energy, and bandwidth allocation constraints. The minimization problem is given by
$$\min_{B, B^D, B^U, \mathcal{S}} J(\omega) \qquad (12)$$
$$\text{s.t.} \quad R^U \times B^U + R^D \times B^D \leq B^R, \qquad (12a)$$
$$R^V \times B \leq B^V, \qquad (12b)$$
$$t_{dn} + t_n^U + t_n^P + t_d \leq T_{round}, \quad \forall n \in \mathcal{S}_1, \qquad (12c)$$
$$t_n^D + t_n^U + t_n^P \leq T_{round}, \quad \forall n \in \mathcal{S}_2, \qquad (12d)$$
$$\mathcal{S}_1 \cup \mathcal{S}_2 = \mathcal{S}, \qquad (12e)$$
$$E_n^M + E_n^P \leq \gamma_n^E, \quad \forall n \in \mathcal{N}, \qquad (12f)$$
$$R^U = |\mathcal{S}|, \quad R^D = |\mathcal{S}_2|, \quad R^V = |\mathcal{S}_1|, \qquad (12g)$$
where $\mathcal{S}$ denotes the set of selected users participating in FL, $\mathcal{S}_1$ denotes the set of selected indoor users, $\mathcal{S}_2$ denotes the set of selected outdoor users, and $|\cdot|$ denotes the cardinality of a set. In addition, $T_{round}$ is the time threshold of each round and $t_d$ denotes the delay between the BS and the home gateway; $\gamma_n^E$ is the energy constraint of user n. (12a) and (12b) are the bandwidth constraints of the RF and VLC links, respectively. Constraint (12c) is the per-round delay constraint for all selected indoor users, while (12d) is the per-round delay constraint for all selected outdoor users. (12e) defines the set of selected users, and (12f) is the energy consumption requirement for performing the FL algorithm.

3. Model Compression

In this section, we first analyze the optimization problem (12) so as to figure out how the communication factors affect the FL performance. Then, we introduce a model compression method to reduce the size of FL model parameters that are transmitted over wireless links so as to increase the number of users that participate in FL.

3.1. Problem Analysis

To simplify problem (12), we first provide the following lemma:
Lemma 1. 
Given the transmit power of each user, the optimization problem (12) can be transformed into an optimization problem that maximizes the total size of the data samples of the selected users, which can be written as
$$\max_{B, B^D, B^U, \mathcal{S}} \sum_{n=1}^{N_1} \sum_{i=1}^{R^V} D_n r_{n,i}^V + \sum_{n=N_1+1}^{N} \sum_{i=1}^{R^D} D_n r_{n,i}^D \qquad (13)$$
$$\text{s.t.} \quad R^U \times B^U + R^D \times B^D \leq B^R, \qquad (13a)$$
$$R^V \times B \leq B^V, \qquad (13b)$$
$$t_{dn} + t_n^U + t_n^P + t_d \leq T_{round}, \quad \forall n \in \mathcal{S}_1, \qquad (13c)$$
$$t_n^D + t_n^U + t_n^P \leq T_{round}, \quad \forall n \in \mathcal{S}_2, \qquad (13d)$$
$$\mathcal{S}_1 \cup \mathcal{S}_2 = \mathcal{S}, \qquad (13e)$$
$$E_n^M + E_n^P \leq \gamma_n^E, \quad \forall n \in \mathcal{N}, \qquad (13f)$$
$$R^U = |\mathcal{S}|, \quad R^D = |\mathcal{S}_2|, \quad R^V = |\mathcal{S}_1|. \qquad (13g)$$
Proof. 
Minimizing the global loss function is equivalent to minimizing the gap between the global loss function $J(\omega^t)$ at time t and the optimal global loss function $J(\omega^*)$. According to Theorem 1 in [13], the gap is caused by the packet error rate (PER) and the number of selected users. Here, we do not consider packet errors and, hence, we have $q_i = 0$. Using the same simplification method as in [13], the optimization problem can be transformed into problem (13). This ends the proof.    □

3.2. Model Weights Compression

From problem (13), we observe that as the number of users that implement FL increases, the gap decreases and the performance improves. This coincides with the experimental conclusions in [20]. To maximize the number of users in FL, we introduce a model compression method to reduce the transmission delay, energy, and bandwidth, so as to increase the number of users that participate in FL. In particular, the FL model has data redundancy during training and, thus, we prune the connections with small weight updates to reduce the size of the transmitted model parameters. Meanwhile, although model compression loses part of the model information, the experiments in [9,10,21] have shown that appropriate compression methods do not significantly affect the convergence speed and accuracy under proper sparsity rates. In this section, we first introduce a compression method with non-fixed thresholds. Then, we analyze the impact of the model compression on the optimization problem (13).
An FL model needs to be carefully compressed without affecting the global model training. The change of the weights in a model can be used to evaluate their importance [22]. Therefore, an appropriate pruning threshold is the key to FL model compression. To ensure that the gradients of an FL model are of the same order, we first normalize the gradients in each layer [23]. In particular, the gradients of an FL model can be given by
$$G_n^\tau = \mathrm{Train}(W_n^\tau, \mathcal{D}_n) - W_n^\tau, \qquad (14)$$
where $G_n^\tau \in \mathbb{R}^{d_1 \times d_2}$ is the gradient of user n at iteration $\tau$, $W_n^\tau \in \mathbb{R}^{d_1 \times d_2}$ is the trained local model weights, and $\tau \in \{1, \ldots, T\}$ is a global iteration; $d_1$ and $d_2$ represent the output and input dimensions, respectively; and $\mathrm{Train}(W_n^\tau, \mathcal{D}_n)$ refers to the trained model weights of user n. For a given sparsity rate, we obtain a threshold according to the sorted gradients. In particular, the mask entries at positions whose gradients are smaller than the threshold are set to 0, while those at positions with larger gradients are set to 1. This process can be expressed by a sparsifying filter mask $M_n \in \mathbb{R}^{d_1 \times d_2}$ for user n. Therefore, the compressed local model weights can be written as
$$W_{n,C} = W_n \otimes M_n, \qquad (15)$$
where $W_n \in \mathbb{R}^{d_1 \times d_2}$ is the local model weights of user n and $\otimes$ is the Hadamard product. Similarly, the compressed global model weights can be expressed as
$$W_C = W \otimes M, \qquad (16)$$
where $W \in \mathbb{R}^{d_1 \times d_2}$ is the global model weights and $M \in \mathbb{R}^{d_1 \times d_2}$ is the sparsifying filter mask of the BS. From (16), we observe that each user receives the same sparse global model.
The gradients that are not transmitted to the BS or the users are called residuals [24], and they are reused in local model training and global FL model generation. Therefore, residuals can be used to mitigate the errors caused by sparsification and accelerate the FL convergence speed [21]. In particular, the residuals of user n can be defined by
$$R_n^T = \sum_{\tau=1}^{T} (G_n^\tau - G_{n,C}^\tau) = R_n^{T-1} + G_n^T - G_{n,C}^T, \qquad (17)$$
where $R_n^T \in \mathbb{R}^{d_1 \times d_2}$ is the accumulated residual at iteration T, and $G_{n,C}^\tau \in \mathbb{R}^{d_1 \times d_2}$ denotes the compressed $G_n^\tau$. Similarly, the residuals of the BS can be defined by
$$R^T = \sum_{\tau=1}^{T} (G^\tau - G_C^\tau) = R^{T-1} + G^T - G_C^T, \qquad (18)$$
where $G^\tau \in \mathbb{R}^{d_1 \times d_2}$ is the model gradient at the BS, and $G_C^\tau \in \mathbb{R}^{d_1 \times d_2}$ is the compressed $G^\tau$.
During transmission, users only need to transmit the positions of the non-zero parameters and their values. After receiving this information, the BS can recover the model and obtain the sparsifying filter masks. We denote the initial model weights by $W^I \in \mathbb{R}^{d_1 \times d_2}$, the final output by $W^F \in \mathbb{R}^{d_1 \times d_2}$, and the all-ones matrix by $\mathbf{1}$. The overall process of model compression is shown in Algorithm 1.
Let $p_n$ and $p$ be the sparsity rates corresponding to $M_n$ and $M$, respectively. Then, the sizes of the compressed local FL model and the compressed global FL model can be expressed as $s_C^L = s^L \cdot p_n$ and $s_C^G = s^G \cdot p$, respectively. Using the proposed compression scheme, the optimization problem (13) can be rewritten as
$$\max_{B, B^D, B^U, \mathcal{S}} \sum_{n=1}^{N_1} \sum_{i=1}^{R^V} D_n r_{n,i}^V + \sum_{n=N_1+1}^{N} \sum_{i=1}^{R^D} D_n r_{n,i}^D \qquad (19)$$
$$\text{s.t.} \quad R^U \times B^U + R^D \times B^D \leq B^R, \qquad (19a)$$
$$R^V \times B \leq B^V, \qquad (19b)$$
$$t_{dn,C} + t_{n,C}^U + t_n^P + t_{d,C} \leq T_{round}, \quad \forall n \in \mathcal{S}_1, \qquad (19c)$$
$$t_{n,C}^D + t_{n,C}^U + t_n^P \leq T_{round}, \quad \forall n \in \mathcal{S}_2, \qquad (19d)$$
$$\mathcal{S}_1 \cup \mathcal{S}_2 = \mathcal{S}, \qquad (19e)$$
$$E_{n,C}^M + E_n^P \leq \gamma_n^E, \quad \forall n \in \mathcal{N}, \qquad (19f)$$
$$R^U = |\mathcal{S}|, \quad R^D = |\mathcal{S}_2|, \quad R^V = |\mathcal{S}_1|, \qquad (19g)$$
where $t_{dn,C} = \frac{s_C^G}{r_n}$, $t_{n,C}^U = \frac{s_C^L}{r_n^U}$, $t_{n,C}^D = \frac{s_C^G}{r_n^D}$, and $E_{n,C}^M = t_{n,C}^U \cdot P_n$. Accordingly, we denote the compression algorithm that reduces the communication cost in each iteration by MC($s^L$, $s^G$).
Algorithm 1: FL Model Compression
1: Input: $W^I$
2: for $t \in \{1, \ldots, T\}$ do
3:   for $n \in \{1, \ldots, N\}$ do
4:     Client n does:
5:       $(W_C^{t-1}, M) \leftarrow \mathrm{Download}_{BS \to n}(W_C^{t-1})$
6:       $W_n^t \leftarrow W_C^{t-1} + (\mathbf{1} - M) \otimes W_n^{t-1}$
7:       $G_n^t \leftarrow \mathrm{Train}(W_n^t, \mathcal{D}_n) + R_n^{t-1} - W_n^t$
8:       $M_n \leftarrow \mathrm{Compress}(G_n^t)$
9:       $G_{n,C}^t \leftarrow G_n^t \otimes M_n$
10:      $W_{n,C}^t \leftarrow W_n^t \otimes M_n$
11:      $R_n^t \leftarrow G_n^t - G_{n,C}^t$
12:      $\mathrm{Save}(R_n^t, W_n^t)$
13:      $\mathrm{Upload}_{n \to BS}(W_{n,C}^t)$
14:   end for
15:   BS does:
16:   for $n \in \mathcal{S} = \mathcal{S}_1 \cup \mathcal{S}_2$ do
17:     $(W_{n,C}^t, M_n) \leftarrow W_{n,C}^t$
18:   end for
19:   $W_C^t \leftarrow \mathrm{Aggregate}\left(\frac{1}{|\mathcal{S}|} \sum_{n \in \mathcal{S}} W_{n,C}^t\right) + R^{t-1}$
20:   $W^t \leftarrow W_C^t + (\mathbf{1} - M_1) \otimes (\mathbf{1} - M_2) \otimes \cdots \otimes (\mathbf{1} - M_N) \otimes W^{t-1}$
21:   $G^t \leftarrow W^t - W^{t-1}$
22:   $M \leftarrow \mathrm{Compress}(G^t)$
23:   $G_C^t \leftarrow G^t \otimes M$
24:   $W_C^t \leftarrow W^t \otimes M$
25:   $R^t \leftarrow G^t - G_C^t$
26:   $\mathrm{Save}(R^t, W^t)$
27:   $\mathrm{Transmit}_{BS \to n}(W_C^t)$
28: end for
29: Return $W^F$
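The threshold-based sparsification and residual accumulation used in Algorithm 1 (lines 7–11) can be sketched in Python as follows; this is a simplified single-matrix illustration under an assumed sparsity rate of 0.4, not the authors' exact implementation.

```python
import numpy as np

def compress(G, sparsity_rate):
    """Threshold selection from the sorted gradient magnitudes: keep the
    fraction `sparsity_rate` of entries with the largest magnitudes
    (mask = 1) and zero out the rest, as in lines 8-10 of Algorithm 1."""
    k = max(1, int(sparsity_rate * G.size))
    thresh = np.sort(np.abs(G), axis=None)[-k]   # k-th largest magnitude
    M = (np.abs(G) >= thresh).astype(G.dtype)    # sparsifying filter mask
    return M

# One client-side step with residual accumulation (Eqs. (14) and (17)).
rng = np.random.default_rng(1)
W_prev, W_trained = rng.normal(size=(4, 4)), rng.normal(size=(4, 4))
R_prev = np.zeros((4, 4))

G = W_trained + R_prev - W_prev   # gradient plus residual (line 7)
M = compress(G, sparsity_rate=0.4)
G_C = G * M                       # compressed gradient (Hadamard product)
R = G - G_C                       # residual carried to the next round
```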

4. The Proposed Algorithm

To solve (19), in this section, we propose a joint user selection, bandwidth allocation, and model compression algorithm, USBA-MC, which divides problem (19) into two subproblems and solves them iteratively. In particular, we first fix the bandwidth allocation and optimize the user selection. Then, the bandwidth allocation problem is formulated and solved with the obtained subset of selected users. The model compression and these two subproblems are solved iteratively until a convergent solution of (19) is obtained.

4.1. Optimal User Selection

Given the bandwidth of each RB, (19) can be further simplified as
$$\max_{\mathcal{S}} \sum_{n=1}^{N_1} \sum_{i=1}^{R^V} D_n r_{n,i}^V + \sum_{n=N_1+1}^{N} \sum_{i=1}^{R^D} D_n r_{n,i}^D \qquad (20)$$
$$\text{s.t.} \quad t_{dn,C} + t_{n,C}^U + t_n^P + t_{d,C} \leq T_{round}, \quad \forall n \in \mathcal{S}_1, \qquad (20a)$$
$$t_{n,C}^D + t_{n,C}^U + t_n^P \leq T_{round}, \quad \forall n \in \mathcal{S}_2, \qquad (20b)$$
$$\mathcal{S}_1 \cup \mathcal{S}_2 = \mathcal{S}, \qquad (20c)$$
$$E_{n,C}^M + E_n^P \leq \gamma_n^E, \quad \forall n \in \mathcal{N}, \qquad (20d)$$
$$R^U = |\mathcal{S}|, \quad R^D = |\mathcal{S}_2|, \quad R^V = |\mathcal{S}_1|. \qquad (20e)$$
We can observe from (20) that if the bandwidth of each RB is fixed, the subset of selected users is determined by the users' computing power and channel conditions. The algorithm that optimizes the user selection under a fixed bandwidth allocation is given in Algorithm 2.
Algorithm 2: User Selection Algorithm GetS($B^U$, $B^D$, $B$)
1: Input: $\mathcal{N}_1$, $\mathcal{N}_2$
2: for $n \in \mathcal{N}_1$ do
3:   if $t_{dn,C} + t_{n,C}^U + t_n^P + t_{d,C} \leq T_{round}$ and $E_{n,C}^M + E_n^P \leq \gamma_n^E$ then
4:     $\mathcal{S}_1 \leftarrow \mathcal{S}_1 \cup \{n\}$
5:   end if
6: end for
7: for $n \in \mathcal{N}_2$ do
8:   if $t_{n,C}^D + t_{n,C}^U + t_n^P \leq T_{round}$ and $E_{n,C}^M + E_n^P \leq \gamma_n^E$ then
9:     $\mathcal{S}_2 \leftarrow \mathcal{S}_2 \cup \{n\}$
10:  end if
11: end for
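A direct translation of Algorithm 2 into Python is sketched below; each user is represented by a dictionary with illustrative keys for its compressed payload sizes, computation terms, and per-Hz spectral efficiencies, the latter standing in for the log terms of (6), (7), and (11).

```python
def get_S(indoor, outdoor, B_U, B_D, B, T_round=2.5, gamma_E=2.0):
    """Sketch of Algorithm 2 (GetS): keep every user whose per-round delay
    and energy satisfy constraints (19c)/(19d) and (19f)."""
    S1, S2 = [], []
    for u in indoor:
        t_up = u["s_L_C"] / (B_U * u["se_up"])    # t_{n,C}^U over RF uplink
        t_dn = u["s_G_C"] / (B * u["se_vlc"])     # t_{dn,C} over VLC downlink
        if (t_dn + t_up + u["t_P"] + u["t_d"] <= T_round
                and t_up * u["P_n"] + u["E_P"] <= gamma_E):
            S1.append(u)
    for u in outdoor:
        t_up = u["s_L_C"] / (B_U * u["se_up"])
        t_dn = u["s_G_C"] / (B_D * u["se_dn"])    # t_{n,C}^D over RF downlink
        if (t_dn + t_up + u["t_P"] <= T_round
                and t_up * u["P_n"] + u["E_P"] <= gamma_E):
            S2.append(u)
    return S1, S2
```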

4.2. Optimal RB Bandwidth

With an obtained subset of users, we need to find the optimal $B$, $B^U$, and $B^D$ that further optimize the capacity of the hybrid VLC/RF system. Note that the larger the bandwidth of an RB is, the smaller the delay becomes, implying that more users can potentially be selected. Based on this observation, the optimal RB bandwidth allocation problems are
$$\max B^U \qquad (21)$$
$$\text{s.t.} \quad R^U \times B^U + R^D \times B^D \leq B^R, \qquad (21a)$$
$$B^U = B^D, \qquad (21b)$$
$$R^U = |\mathcal{S}|, \quad R^D = |\mathcal{S}_2|, \qquad (21c)$$
and
$$\max B \qquad (22)$$
$$\text{s.t.} \quad R^V \times B \leq B^V, \qquad (22a)$$
$$R^V = |\mathcal{S}_1|. \qquad (22b)$$
Lemma 2. 
The maximum RB bandwidths are obtained when $R^U \times B^U + R^D \times B^D = B^R$ and $R^V \times B = B^V$.
Proof. 
We prove Lemma 2 by contradiction. First, assume that maximum bandwidths $B_0^U$, $B_0^D$, and $B_0$ exist while (21a) and (22a) hold with strict inequality. Then, we have
$$B_0^U = B_0^D < \frac{B^R}{|\mathcal{S}| + |\mathcal{S}_2|}, \qquad (23)$$
and
$$B_0 < \frac{B^V}{|\mathcal{S}_1|}. \qquad (24)$$
However, when (21a) and (22a) hold with equality, $B_1^U$, $B_1^D$, and $B_1$ satisfy
$$B_1^U = B_1^D = \frac{B^R}{|\mathcal{S}| + |\mathcal{S}_2|}, \qquad (25)$$
and
$$B_1 = \frac{B^V}{|\mathcal{S}_1|}. \qquad (26)$$
Obviously, $B_1^U = B_1^D > B_0^U = B_0^D$ and $B_1 > B_0$, which contradicts the assumption. This ends the proof.    □
Therefore, we have
$$B^U = B^D = \frac{B^R}{R^U + R^D} = \frac{B^R}{|\mathcal{S}| + |\mathcal{S}_2|}, \qquad (27)$$
and
$$B = \frac{B^V}{R^V} = \frac{B^V}{|\mathcal{S}_1|}. \qquad (28)$$
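The closed forms (27)–(28) make GetB a one-line computation; the sketch below assumes the bandwidth budgets of Table 1 as defaults and guards against empty sets for robustness.

```python
def get_B(S, S1, S2, B_R=20e6, B_V=40e6):
    """Optimal RB bandwidths from the closed forms (27)-(28) of Lemma 2:
    the RF budget is split over |S| uplink plus |S2| downlink RBs, and the
    VLC budget over |S1| RBs."""
    B_U = B_D = B_R / max(1, len(S) + len(S2))
    B = B_V / max(1, len(S1))
    return B_U, B_D, B
```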

4.3. Iterative Solution

In each iteration, we first use the proposed model compression method to reduce the transmission delay and energy. Then, we update the selected users based on the constraints using GetS($B^U$, $B^D$, $B$). Finally, the bandwidth allocation is obtained from the given selected users, which is denoted by GetB($\mathcal{S}$). The iteration ends when both the user selection and the bandwidth allocation remain fixed. Clearly, the algorithm always converges after a certain number of iterations. We summarize the proposed USBA-MC algorithm in Algorithm 3.
Algorithm 3: USBA-MC Algorithm
1: Input: $B_0$, $B_0^D$, $B_0^U$
2: $(t_{n,C}^U, t_{n,C}^D, t_{dn,C}, E_{n,C}^M) \leftarrow$ MC($s^L$, $s^G$)
3: $\mathcal{S}_0 \leftarrow$ GetS($B_0^U$, $B_0^D$, $B_0$)
4: for $\tau \in \{1, \ldots, T\}$ do
5:   $(B_\tau^U, B_\tau^D, B_\tau) \leftarrow$ GetB($\mathcal{S}_{\tau-1}$)
6:   $(t_{n,C}^U, t_{n,C}^D, t_{dn,C}, E_{n,C}^M) \leftarrow$ MC($s^L$, $s^G$)
7:   $\mathcal{S}_\tau \leftarrow$ GetS($B_\tau^U$, $B_\tau^D$, $B_\tau$)
8:   if $\mathcal{S}_\tau == \mathcal{S}_{\tau-1}$ and $(B_\tau^U, B_\tau^D, B_\tau) == (B_{\tau-1}^U, B_{\tau-1}^D, B_{\tau-1})$ then
9:     break
10:  end if
11: end for
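Combining the two sketches above, the alternating structure of Algorithm 3 can be illustrated as follows; the compressed sizes produced by MC($s^L$, $s^G$) are assumed to be stored in each user dictionary, and for brevity this simplified sketch does not re-run the compression step inside the loop.

```python
def usba_mc(indoor, outdoor, B0_U, B0_D, B0, B_R=20e6, B_V=40e6, max_iters=50):
    """Sketch of Algorithm 3: alternate user selection (get_S) and bandwidth
    allocation (get_B) until both remain unchanged."""
    B_U, B_D, B = B0_U, B0_D, B0
    S1, S2 = get_S(indoor, outdoor, B_U, B_D, B)       # initial selection
    for _ in range(max_iters):
        S = S1 + S2
        B_U_new, B_D_new, B_new = get_B(S, S1, S2, B_R, B_V)
        S1_new, S2_new = get_S(indoor, outdoor, B_U_new, B_D_new, B_new)
        if ((S1_new, S2_new) == (S1, S2)
                and (B_U_new, B_D_new, B_new) == (B_U, B_D, B)):
            break                                       # converged
        S1, S2 = S1_new, S2_new
        B_U, B_D, B = B_U_new, B_D_new, B_new
    return S1, S2, B_U, B_D, B
```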

4.4. Convergence, Implementation, and Complexity Analysis

(1) Convergence Analysis: We first analyze the convergence of the proposed algorithm. Let the indoor user selection vector be $\boldsymbol{s}_1 = [s_{11}, \ldots, s_{1N}]$ and the outdoor user selection vector be $\boldsymbol{s}_2 = [s_{21}, \ldots, s_{2N}]$, where $s_{1n} = 1$ or $s_{2n} = 1$ indicates that user n performs the FL algorithm; otherwise, $s_{1n} = 0$ or $s_{2n} = 0$. Assume that the gradient $\nabla J(\omega(\boldsymbol{s}_1, \boldsymbol{s}_2))$ of $J(\omega(\boldsymbol{s}_1, \boldsymbol{s}_2))$ is uniformly Lipschitz continuous with respect to $\omega(\boldsymbol{s}_1, \boldsymbol{s}_2)$ [25]. Therefore, we have
$$\|\nabla J(\omega^{t+1}(\boldsymbol{s}_1, \boldsymbol{s}_2)) - \nabla J(\omega^t(\boldsymbol{s}_1, \boldsymbol{s}_2))\| \leq L \|\omega^{t+1}(\boldsymbol{s}_1, \boldsymbol{s}_2) - \omega^t(\boldsymbol{s}_1, \boldsymbol{s}_2)\|, \qquad (29)$$
where $\omega^t(\boldsymbol{s}_1, \boldsymbol{s}_2)$ is the global model at step t, $L$ is a positive constant, and $\|\cdot\|$ denotes the norm. Assume that $J(\omega(\boldsymbol{s}_1, \boldsymbol{s}_2))$ is strongly convex with positive parameter $\mu$. Therefore, we have
$$J(\omega^{t+1}(\boldsymbol{s}_1, \boldsymbol{s}_2)) \geq J(\omega^t(\boldsymbol{s}_1, \boldsymbol{s}_2)) + (\omega^{t+1}(\boldsymbol{s}_1, \boldsymbol{s}_2) - \omega^t(\boldsymbol{s}_1, \boldsymbol{s}_2))^T \nabla J(\omega^t(\boldsymbol{s}_1, \boldsymbol{s}_2)) + \frac{\mu}{2} \|\omega^{t+1}(\boldsymbol{s}_1, \boldsymbol{s}_2) - \omega^t(\boldsymbol{s}_1, \boldsymbol{s}_2)\|^2. \qquad (30)$$
We also assume that $J(\omega(\boldsymbol{s}_1, \boldsymbol{s}_2))$ is twice continuously differentiable. Moreover, we assume $\|\nabla f_i(\omega^t(\boldsymbol{s}_1, \boldsymbol{s}_2))\|^2 \leq \vartheta_1 + \vartheta_2 \|\nabla J(\omega^t(\boldsymbol{s}_1, \boldsymbol{s}_2))\|^2$ with $\vartheta_1 \geq 0$ and $\vartheta_2 \geq 0$. These assumptions are easy to satisfy for widely used loss functions such as linear or logistic regression [25]. The expected convergence rate of the proposed algorithm is given by the following lemma:
Lemma 3. 
Given the optimal global FL model $\omega^*$, the convergence upper bound of $E[J(\omega^{t+1}(\boldsymbol{s}_1, \boldsymbol{s}_2)) - J(\omega^*)]$ for the considered hybrid VLC/RF system satisfies
$$E[J(\omega^{t+1}(\boldsymbol{s}_1, \boldsymbol{s}_2)) - J(\omega^*)] \leq \frac{2\vartheta_1}{LD} \left( \sum_{n=1}^{N_1} \sum_{i=1}^{R^V} D_n (1 - r_{n,i}^V) + \sum_{n=N_1+1}^{N} \sum_{i=1}^{R^D} D_n (1 - r_{n,i}^D) \right) \frac{1}{1 - F}, \qquad (31)$$
where $F = 1 - \frac{\mu}{L} + \frac{4\mu\vartheta_2}{LD} \left( \sum_{n=1}^{N_1} \sum_{i=1}^{R^V} D_n (1 - r_{n,i}^V) + \sum_{n=N_1+1}^{N} \sum_{i=1}^{R^D} D_n (1 - r_{n,i}^D) \right)$.
Proof. 
As $s_{1n} + s_{2n} = 1$ if user n is selected (i.e., $s_{1n} = 1, s_{2n} = 0$ or $s_{1n} = 0, s_{2n} = 1$) and $s_{1n} + s_{2n} = 0$ otherwise, we have
$$1 - (s_{1n} + s_{2n}) = \begin{cases} 0, & s_{1n} = 1, s_{2n} = 0 \text{ or } s_{1n} = 0, s_{2n} = 1, \\ 1, & s_{1n} = 0, s_{2n} = 0. \end{cases}$$
Therefore, $1 - (s_{1n} + s_{2n}) \geq 0$ for $n = 1, \ldots, N$. Then, the upper bound of $E[J(\omega^{t+1}(\boldsymbol{s}_1, \boldsymbol{s}_2)) - J(\omega^*)]$ can be obtained according to Theorem 1 in [13] as
$$E[J(\omega^{t+1}(\boldsymbol{s}_1, \boldsymbol{s}_2)) - J(\omega^*)] \leq F^t E[J(\omega^0) - J(\omega^*)] + \frac{2\vartheta_1}{LD} \sum_{n=1}^{N} D_n (1 - (s_{1n} + s_{2n})) \frac{1 - F^t}{1 - F}, \qquad (32)$$
where $F = 1 - \frac{\mu}{L} + \frac{4\mu\vartheta_2}{LD} \sum_{n=1}^{N} D_n (1 - (s_{1n} + s_{2n}))$. As each resource block is assigned to a participating user, the upper bound can be further converted into
$$E[J(\omega^{t+1}(\boldsymbol{s}_1, \boldsymbol{s}_2)) - J(\omega^*)] \leq F^t E[J(\omega^0) - J(\omega^*)] + \frac{2\vartheta_1}{LD} \left( \sum_{n=1}^{N_1} \sum_{i=1}^{R^V} D_n (1 - r_{n,i}^V) + \sum_{n=N_1+1}^{N} \sum_{i=1}^{R^D} D_n (1 - r_{n,i}^D) \right) \frac{1 - F^t}{1 - F}, \qquad (33)$$
where $F = 1 - \frac{\mu}{L} + \frac{4\mu\vartheta_2}{LD} \left( \sum_{n=1}^{N_1} \sum_{i=1}^{R^V} D_n (1 - r_{n,i}^V) + \sum_{n=N_1+1}^{N} \sum_{i=1}^{R^D} D_n (1 - r_{n,i}^D) \right)$. From (33), we can observe that when $F < 1$, $F^t$ approaches 0 as t increases. Therefore, $E[J(\omega^{t+1}(\boldsymbol{s}_1, \boldsymbol{s}_2)) - J(\omega^*)] \leq \frac{2\vartheta_1}{LD} \left( \sum_{n=1}^{N_1} \sum_{i=1}^{R^V} D_n (1 - r_{n,i}^V) + \sum_{n=N_1+1}^{N} \sum_{i=1}^{R^D} D_n (1 - r_{n,i}^D) \right) \frac{1}{1 - F}$ and the FL algorithm converges. $F < 1$ holds when $\frac{4\mu\vartheta_2}{LD} \left( \sum_{n=1}^{N_1} \sum_{i=1}^{R^V} D_n (1 - r_{n,i}^V) + \sum_{n=N_1+1}^{N} \sum_{i=1}^{R^D} D_n (1 - r_{n,i}^D) \right) < \frac{\mu}{L}$. As $\frac{1}{D} \left( \sum_{n=1}^{N_1} \sum_{i=1}^{R^V} D_n (1 - r_{n,i}^V) + \sum_{n=N_1+1}^{N} \sum_{i=1}^{R^D} D_n (1 - r_{n,i}^D) \right) \leq 1$, it suffices to let $\vartheta_2 < \frac{1}{4}$, which can be satisfied since $\vartheta_2$ can be any value with $\vartheta_2 \geq 0$. This completes the proof. □
From Lemma 3, we can also observe that there is a gap of $\frac{2\vartheta_1}{LD} \left( \sum_{n=1}^{N_1} \sum_{i=1}^{R^V} D_n (1 - r_{n,i}^V) + \sum_{n=N_1+1}^{N} \sum_{i=1}^{R^D} D_n (1 - r_{n,i}^D) \right)$ between $E[J(\omega^{t+1}(\boldsymbol{s}_1, \boldsymbol{s}_2))]$ and $E[J(\omega^*)]$. As the number of participating users increases, the gap decreases. Meanwhile, as the number of users increases, the value of $F$ also decreases, which improves the convergence speed of the FL algorithm.
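To see how the user selection drives the bound of Lemma 3, the following sketch numerically evaluates $F$ and the gap term under illustrative constants $\vartheta_1$, $\vartheta_2$, $\mu$, and $L$; the selection vectors collapse $s_{1n} + s_{2n}$ into a single Boolean per user.

```python
def convergence_gap(D_n, selected, vartheta1=0.5, vartheta2=0.2, mu=1.0, L=4.0):
    """Evaluate F and the gap term of Lemma 3 for a given selection; the
    constants are illustrative assumptions satisfying vartheta2 < 1/4."""
    D = sum(D_n)
    unselected = sum(d for d, s in zip(D_n, selected) if not s)
    F = 1 - mu / L + 4 * mu * vartheta2 / (L * D) * unselected
    assert F < 1, "the bound requires F < 1 (e.g. vartheta2 < 1/4)"
    gap = 2 * vartheta1 / (L * D) * unselected / (1 - F)
    return F, gap

# Selecting more users shrinks both F and the gap.
print(convergence_gap([100] * 10, [True] * 6 + [False] * 4))
print(convergence_gap([100] * 10, [True] * 9 + [False] * 1))
```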
(2) Implementation Analysis: Next, we analyze the implementation of the proposed algorithm. To find the optimal user selection set $\mathcal{S}$, the BS must first calculate the total delay and the energy consumption $E_{n,C}^M + E_n^P$ of each user. In our system, the total delay includes the RF link delay $t_{n,C}^D + t_{n,C}^U + t_n^P$ and the VLC link delay $t_{dn,C} + t_{n,C}^U + t_n^P + t_{d,C}$. In order to calculate the total delay, the BS must know the model size required by the FL algorithm and the computational time. The size of the FL model depends on the learning task and the sparsity rate. Before implementing an FL algorithm, the BS must first transmit the task information and model information to each user and set the model sparsity rate. Therefore, the BS knows the FL model size and sparsity rate before training. In order to calculate the energy consumption and the computational time, the BS also needs to know the users' device information, such as the transmit power and CPU frequency. In an FL algorithm, the BS can learn this device information when the users initially connect to the BS. Given the total delays $t_{n,C}^D + t_{n,C}^U + t_n^P$ and $t_{dn,C} + t_{n,C}^U + t_n^P + t_{d,C}$ and the energy consumption $E_{n,C}^M + E_n^P$, the BS can compute $\mathcal{S}_1$ and $\mathcal{S}_2$ using GetS($B^U$, $B^D$, $B$). Given $\mathcal{S}_1$ and $\mathcal{S}_2$, the BS can compute the bandwidth of each RB using GetB($\mathcal{S}$). As the objective function in (20) is linear, the USBA-MC algorithm can determine a user selection set $\mathcal{S}$ that improves the FL training loss.
(3) Complexity Analysis: Regarding the complexity of the USBA-MC algorithm, we first analyze the complexity of the model compression algorithm, which depends on the number of model parameters. Let $W_O$ be the number of model parameters; the complexity of the model compression algorithm is then $\mathcal{O}(W_O \log W_O)$ [26]. Next, we analyze the complexity of the traversal algorithm. Since the total number of users is N, the complexity of the traversal algorithm is $\mathcal{O}(N)$ [27]. In addition, the complexity of the numerical method is $\mathcal{O}(1)$, since we only need to allocate RBs according to the user set. With T global iterations, the total complexity of the USBA-MC algorithm is $\mathcal{O}(T W_O \log W_O)$.

5. Simulation Results and Analysis

Consider a circular network area with a radius of r = 50 m and one BS at its center. There are N = 50 uniformly distributed users; 80% of the users are indoors and 20% are outdoors. The system parameters are summarized in Table 1. The following two baselines are considered: (a) the USBA algorithm in a hybrid VLC/RF system [28] and (b) the FL algorithm in an RF-only system. To comprehensively evaluate the performance of the proposed USBA-MC algorithm in federated learning systems, we conduct experiments on three learning tasks: (a) predicting housing prices, (b) identifying handwritten digits from 0 to 9, and (c) classifying 10 categories of color images.
In the housing price prediction task, our goal is to compare the performance of the proposed USBA-MC algorithm under different sparsity rates, and to compare the performance of USBA-MC with baselines (a) and (b). The dataset used to train the FL algorithm is the Boston house price dataset (http://lib.stat.cmu.edu/datasets/boston (accessed on 27 March 2021)), which is randomly and equally allocated among the users. In this task, each user trains a feedforward neural network (FNN) with one hidden layer composed of 10 neurons.
In the handwritten digit identification task, we train FNNs on the MNIST dataset [29]. The sizes of the neuron weight matrices are 784 × 200, 200 × 200, and 200 × 10. Sixty thousand handwritten digits are used to train the network and 10,000 are used to test it.
Finally, we train CNNs on CIFAR-10 [30] to investigate the performance of the USBA-MC algorithm with different sparsity rates on non-IID data. The sizes of the neuron weight matrices are 5 × 5 × 3 × 64, 5 × 5 × 64 × 64, 2304 × 384, 384 × 192, and 192 × 10. Fifty thousand images are used to train the network and 10,000 are used to test it.

5.1. Performance over Different Sparsity Rates

Figure 2 shows the performance of the proposed USBA-MC algorithm in two learning tasks under different sparsity rates. We use the coefficient of determination ($R^2$) to measure the quality of the model in the housing price prediction task, and classification accuracy in the handwritten digit identification task. Moreover, we average over 10 experiments to ensure the reliability of the results. The accuracy first increases and then decreases with the sparsity rate, because model information is lost at low sparsity rates. In particular, the best sparsity rate is 0.4 for predicting housing prices and 0.2 for identifying handwritten digits.
Figure 3 compares the predictive performance of USBA-MC and baselines (a) and (b). The green line shows the true values of the data samples, and the sparsity rate of USBA-MC is set to 0.4. Before training, we randomly select 18 samples to form a test set. In Figure 3, we can observe that the proposed USBA-MC algorithm achieves better performance than baselines (a) and (b). In particular, the proposed FL algorithm improves the $R^2$ by up to 11% and 15% compared to baselines (a) and (b), respectively.
Figure 4 compares the identification performance in the handwritten digit task. USBA-MC outperforms baselines (a) and (b) in most global communication rounds, and the final accuracies of the three algorithms are 96.52%, 96.45%, and 96.39%, respectively. This is because USBA-MC introduces visible light communication and reduces the size of the transmitted model, which increases the number of selected users and further improves FL performance.

5.2. Number of Selected Users

This subsection evaluates the user selection performance of USBA-MC. We first compare the number of users selected under different bandwidth conditions and resource constraints.
Figure 5 shows how the number of selected users changes as the total number of users varies in different systems. It can be observed that as the total number of users increases, the number of selected users also increases for all three algorithms. However, compared with baselines (a) and (b), the USBA-MC algorithm enables more users to participate in the training process, and this advantage becomes more obvious as the total number of users grows. For instance, when the total number of users is 150, USBA-MC improves the number of selected users by up to 37.8% and 68.7% compared to baselines (a) and (b), respectively. Table 2 shows the ratio of selected users, and we can see that USBA-MC always achieves the highest ratio. Figure 5 also compares the user selection under different VLC and RF bandwidths; the proposed USBA-MC algorithm outperforms baselines (a) and (b) under all bandwidth settings.
Figure 6 compares the number of selected users under different transmission data sizes. We can observe that the number of selected users decreases as the data size increases. Table 3 shows the selection ratio of the different algorithms for different data sizes when the total number of users is 150. The results show that USBA-MC achieves better system performance than the other algorithms. This advantage is important when the model size grows due to more complex neural networks.

5.3. Non-IID Data

In this subsection, we explore the accuracy of the USBA-MC algorithm on non-IID data [31]. To obtain a non-IID dataset, we use the same method as in [31].
As shown in Figure 7, the model is trained on a dataset of non-IID nature. We can clearly observe the advantages of USBA-MC over the other algorithms: in terms of both stability and accuracy, the USBA-MC algorithm achieves the best performance. In USBA-MC, a lower sparsity rate increases the stability of the system, since more users can participate. However, at a sparsity rate of 0.2, USBA-MC achieves lower accuracy but higher stability than at 0.4, because decreasing the sparsity rate increases the loss of model information; moreover, the model will not converge when the sparsity rate is too low. Therefore, there is a trade-off between the sparsity rate and the model performance.
Figure 8 shows the use of the proposed USBA-MC algorithm for image identification. In this simulation, the BS broadcasts the global model, and the local models are trained on CIFAR-10. As shown in Figure 8, the proposed USBA-MC algorithm still achieves the best performance among the three algorithms in terms of both accuracy and stability. In particular, USBA-MC improves the accuracy by up to 3.27% and 6.35% compared to baselines (a) and (b), respectively.
We now analyze the gain of USBA-MC on non-IID data. According to Section 3 in [31], weight divergence reduces accuracy on non-IID datasets. The weight divergence is caused by the distance between the data distribution of each user and the population distribution, which can be evaluated with the earth mover's distance (EMD). According to the central limit theorem (CLT) [32], as the number of local models increases, the mean of the EMD approximately follows a normal distribution. Therefore, the weight divergence of the trained global model becomes smaller, and the model performance improves, as the number of users increases. Compared with the accuracy of the transmitted model, the system is more sensitive to the number of selected users; therefore, increasing the number of users improves the robustness and accuracy of the global model.
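For intuition, the EMD used in [31] reduces, for discrete label distributions, to a simple distance between a user's label histogram and the population histogram; the sketch below computes it for a hypothetical two-class-dominant user.

```python
import numpy as np

def emd(p, q):
    """Earth mover's distance between two discrete label distributions as
    used in [31]; for 1-D class histograms it is the L1 distance."""
    return np.abs(np.asarray(p) - np.asarray(q)).sum()

population = np.full(10, 0.1)               # uniform over 10 classes
user = np.array([0.5, 0.5] + [0.0] * 8)     # highly non-IID user
print(emd(user, population))                # large distance -> large divergence
```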
Figure 9 reports the average and variance of the model accuracy over the last 10 global communication rounds. It can be observed that the accuracy first increases and then decreases with the sparsity rate. USBA-MC improves the accuracy by up to 7% and 16.7% compared to baselines (a) and (b), respectively, at a sparsity rate of 0.4. We can also observe that as the number of users increases, the model becomes more stable, until the sparsity rate is so low that the model cannot converge.

6. Conclusions

This paper has proposed the introduction of VLC into conventional RF systems to support FL. We have formulated a joint user selection and bandwidth allocation problem for FL in a hybrid VLC/RF system. To solve this problem, we first used a model compression method to reduce the size of the FL model parameters that are transmitted over wireless links, and then we separated the optimization problem into two subproblems. The first subproblem is a user selection problem with a given bandwidth allocation, which is solved by a traversal algorithm. The second subproblem is a bandwidth allocation problem with a given user selection, which is solved by a numerical method. The convergent solution is obtained by iteratively compressing the model and solving these two subproblems. Simulation results have demonstrated that the USBA-MC algorithm outperforms USBA and FL in RF-only systems.

Author Contributions

Conceptualization, W.H., M.C. and Y.Y.; methodology, W.H. and H.V.P.; software, W.H., M.C. and C.L.; validation, W.H., M.C., Y.Y. and C.L.; formal analysis, W.H., H.V.P. and C.L.; investigation, W.H., M.C., Y.Y. and C.L.; resources, H.V.P. and C.F.; data curation, W.H., Y.Y. and C.L.; writing—original draft preparation, W.H.; writing—review and editing, M.C., Y.Y. and H.V.P.; visualization, W.H. and M.C.; supervision, H.V.P. and C.F.; project administration, M.C. and Y.Y.; funding acquisition, H.V.P. and C.F. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Fundamental Research Funds for the Central Universities (2021RC03), the National Natural Science Foundation of China (61901047 and 61871047), the Beijing Natural Science Foundation (4204106), and the Proof-of-Concept Project of Zhongguancun Open Laboratory (202103001).

Data Availability Statement

Publicly available datasets were analyzed in this study. These data can be found here: http://lib.stat.cmu.edu/datasets/boston, http://yann.lecun.com/exdb/mnist, and http://www.cs.toronto.edu/kriz/cifar.html (accessed on 27 March 2021).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. McMahan, H.B.; Moore, E.; Ramage, D.; Hampson, S.; Arcas, B.A. Communication-efficient learning of deep networks from decentralized data. In Proceedings of the International Conference on Artificial Intelligence and Statistics, Fort Lauderdale, FL, USA, 20–22 April 2017.
  2. Konecny, J.; McMahan, H.B.; Ramage, D.; Richtarik, P. Federated optimization: Distributed machine learning for on-device intelligence. arXiv 2016, arXiv:1610.02527.
  3. Niknam, S.; Dhillon, H.S.; Reed, J.H. Federated learning for wireless communications: Motivation, opportunities and challenges. IEEE Commun. Mag. 2020, 58, 46–51.
  4. Lim, W.Y.B.; Luong, N.C.; Hoang, D.T.; Jiao, Y.; Liang, Y.C.; Yang, Q.; Niyato, D.; Miao, C. Federated learning in mobile edge networks: A comprehensive survey. IEEE Commun. Surv. Tutor. 2020, 22, 2031–2063.
  5. Li, T.; Sahu, A.K.; Talwalkar, A.; Smith, V. Federated learning: Challenges, methods, and future directions. IEEE Signal Process. Mag. 2020, 37, 50–60.
  6. Chen, M.; Gündüz, D.; Huang, K.; Saad, W.; Bennis, M.; Feljan, A.V.; Poor, H.V. Distributed learning in wireless networks: Recent progress and future challenges. arXiv 2021, arXiv:2104.02151.
  7. Konečnỳ, J.; McMahan, H.B.; Yu, F.X.; Richtárik, P.; Suresh, A.T.; Bacon, D. Federated learning: Strategies for improving communication efficiency. arXiv 2016, arXiv:1610.05492.
  8. Tsuzuku, Y.; Imachi, H.; Akiba, T. Variance-based gradient compression for efficient distributed deep learning. arXiv 2018, arXiv:1802.06058.
  9. Sattler, F.; Wiedemann, S.; Müller, K.R.; Samek, W. Robust and communication-efficient federated learning from non-IID data. IEEE Trans. Neural Netw. Learn. Syst. 2019, 31, 3400–3413.
  10. Xu, J.; Du, W.; Jin, Y.; He, W.; Cheng, R. Ternary compression for communication-efficient federated learning. IEEE Trans. Neural Netw. Learn. Syst. 2020, 1–15.
  11. Shao, R.; Liu, H.; Liu, D. Privacy preserving stochastic channel-based federated learning with neural network pruning. arXiv 2019, arXiv:1910.02115.
  12. Chen, M.; Shlezinger, N.; Poor, H.V.; Eldar, Y.C.; Cui, S. Communication-efficient federated learning. Proc. Natl. Acad. Sci. USA 2021, 118, e2024789118.
  13. Chen, M.; Yang, Z.; Saad, W.; Yin, C.; Poor, H.V.; Cui, S. A joint learning and communications framework for federated learning over wireless networks. IEEE Trans. Wireless Commun. 2021, 20, 269–283.
  14. Tran, N.H.; Bao, W.; Zomaya, A.; Nguyen, M.N.H.; Hong, C.S. Federated learning over wireless networks: Optimization model design and analysis. In Proceedings of the IEEE Conference on Computer Communications, Paris, France, 29 April–2 May 2019.
  15. Yang, Z.; Chen, M.; Saad, W.; Hong, C.S.; Bahaei, M.S. Energy efficient federated learning over wireless communication networks. IEEE Trans. Wireless Commun. 2021, 20, 1935–1949.
  16. Chen, M.; Poor, H.V.; Saad, W.; Cui, S. Convergence time optimization for federated learning over wireless networks. IEEE Trans. Wireless Commun. 2021, 20, 2457–2471.
  17. Ma, C.; Konečný, J.; Jaggi, M.; Smith, V.; Jordan, M.I.; Richtárik, P.; Takáč, M. Distributed optimization with arbitrary local solvers. Optim. Methods Softw. 2017, 32, 813–848.
  18. Yang, Y.; Zeng, Z.; Cheng, J.; Guo, C.; Feng, C. A relay-assisted OFDM system for VLC uplink transmission. IEEE Trans. Commun. 2019, 67, 6268–6281.
  19. Pham, T.V.; Pham, A.T. Coordination/cooperation strategies and optimal zero-forcing precoding design for multi-user multi-cell VLC networks. IEEE Trans. Commun. 2019, 67, 4240–4251.
  20. Hsu, T.H.; Qi, H.; Brown, M. Measuring the effects of non-identical data distribution for federated visual classification. arXiv 2019, arXiv:1909.06335.
  21. Sattler, F.; Wiedemann, S.; Müller, K.R.; Samek, W. Sparse binary compression: Towards distributed deep learning with minimal communication. In Proceedings of the International Joint Conference on Neural Networks, Budapest, Hungary, 14–19 July 2019; pp. 1–8.
  22. Xie, H.; Qin, Z. A lite distributed semantic communication system for internet of things. IEEE J. Sel. Areas Commun. 2020, 39, 142–153.
  23. Aji, A.F.; Heafield, K. Sparse communication for distributed gradient descent. arXiv 2017, arXiv:1704.05021.
  24. Lin, Y.; Han, S.; Mao, H.; Wang, Y.; Dally, W.J. Deep gradient compression: Reducing the communication bandwidth for distributed training. arXiv 2017, arXiv:1712.01887.
  25. Friedlander, M.P.; Schmidt, M. Hybrid deterministic-stochastic methods for data fitting. SIAM J. Sci. Comput. 2012, 34, A1380–A1405.
  26. Jiang, Y.; Wang, S.; Valls, V.; Ko, B.J.; Lee, W.H.; Leung, K.K.; Tassiulas, L. Model pruning enables efficient federated learning on edge devices. arXiv 2019, arXiv:1909.12326.
  27. Cheung, T.Y. Graph traversal techniques and the maximum flow problem in distributed computation. IEEE Trans. Softw. Eng. 1983, SE-9, 504–512.
  28. Liu, C.; Guo, C.; Yang, Y.; Chen, M.; Poor, H.V.; Cui, S. Optimization of user selection and bandwidth allocation for federated learning in VLC/RF systems. In Proceedings of the IEEE Wireless Communications and Networking Conference, Nanjing, China, 29 March–1 April 2021; pp. 1–6.
  29. LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 1998, 86, 2278–2324.
  30. Krizhevsky, A.; Nair, V.; Hinton, G. The CIFAR-10 Dataset. Available online: http://www.cs.toronto.edu/kriz/cifar.html (accessed on 27 March 2021).
  31. Zhao, Y.; Li, M.; Lai, L.; Suda, N.; Civin, D.; Chandra, V. Federated learning with non-IID data. arXiv 2018, arXiv:1806.00582.
  32. Rosenblatt, M. A central limit theorem and a strong mixing condition. Proc. Natl. Acad. Sci. USA 1956, 42, 43.
Figure 1. Illustration of FL based on a hybrid VLC/RF system.
Figure 2. The accuracy achieved by different sparsity rates of USBA-MC on the Boston housing dataset and the MNIST dataset. (a) Boston housing dataset. (b) MNIST dataset.
Figure 3. Comparison of accuracy on the Boston housing dataset trained by a BP neural network.
Figure 4. Comparison of accuracy on the MNIST dataset trained by a BP neural network.
Figure 5. Comparison of user selection under different bandwidth settings with different numbers of users.
Figure 6. Comparison of user selection under different transmission data sizes with different numbers of users. (a) Data size of 1 M. (b) Data size of 3 M. (c) Data size of 5 M. (d) Data size of 7 M.
Figure 7. The accuracy of USBA-MC with different sparsity rates and baselines (a) and (b) on CIFAR-10 with non-IID nature.
Figure 8. Comparison of accuracy on the CIFAR-10 dataset with a broadcast channel.
Figure 9. The average and variance of the accuracy of USBA-MC with different sparsity rates and the baselines. (a) Average of accuracy. (b) Variance of accuracy.
Table 1. Simulation parameters.

Parameter | Value
Transmitted optical power per VLC AP, $P_v$ | 9 W
Modulation bandwidth of an LED lamp, B | 40 MHz
Physical area of a PD, $A_p$ | 1 cm²
Half-intensity radiation angle, $\theta_{1/2}$ | 60 deg.
Gain of optical filter, $T_s(\theta)$ | 1.0
Receiver FoV semi-angle, $\Theta_F$ | 90 deg.
Refractive index, $n_0$ | 1.5
Optical-to-electric conversion efficiency, $\gamma$ | 0.53 A/W
Noise power spectral density, $N_0^{VLC}$, $N_0^{RF}$ | $10^{-21}$ A²/Hz
Total RF bandwidth, $B^R$ | 20 MHz
Transmit power of the BS, $P_B$ | 1 W
Number of users, N | 50
Delay requirement, $T_{round}$ | 2.5 s
Energy consumption requirement, $\gamma_n^E$ | 2 J
Energy consumption coefficient, $\alpha$ | $2 \times 10^{-28}$
User update size, s | 1 Mb
Table 2. Selection ratio of users with different total numbers of users.

Total Number of Users | USBA-MC | Baseline (a) | Baseline (b)
50 | 78% | 66% | 64%
100 | 74% | 58% | 51%
150 | 75.33% | 54.67% | 44.67%
Table 3. Selection ratio of users with different data sizes.

Data Size | USBA-MC | Baseline (a) | Baseline (b)
1 M | 73.8% | 50.07% | 41%
3 M | 59.6% | 27.13% | 23.6%
5 M | 51.47% | 12.13% | 11.07%
7 M | 44.07% | 5.67% | 3.6%
