Deep Reinforcement Learning for Intelligent Reflecting Surface-assisted D2D Communications

In this paper, we propose a deep reinforcement learning (DRL) approach for solving the sum-rate maximisation problem in device-to-device (D2D) communications supported by an intelligent reflecting surface (IRS). The IRS is deployed to mitigate the interference and enhance the signal between the D2D transmitter and the associated D2D receiver. Our objective is to jointly optimise the transmit power at the D2D transmitter and the phase shift matrix at the IRS to maximise the network sum-rate. We formulate a Markov decision process and then propose the proximal policy optimisation algorithm for solving the maximisation problem. Simulation results show impressive performance in terms of the achievable rate and processing time.


I. INTRODUCTION
Device-to-device (D2D) communications play a critical role in 5G networks by allowing users to communicate directly without the involvement of base stations. This helps reduce the latency and improve the information transmission efficiency [1], [2]. In [1], optimised power allocation at the D2D transmitters was proposed to maximise the energy efficiency (EE), following a machine learning-based approach. In [2], the D2D transmitters harvest energy through the simultaneous wireless information and power transfer (SWIPT) protocol. Then, a game theory approach with pricing strategies was proposed to solve the power allocation and power splitting at SWIPT for maximising the network performance.
Intelligent reflecting surface (IRS), referring to the technology of a massive number of reflective elements with flexible reflection capability controlled by an intelligent unit, has recently attracted great attention from the research community as an efficient means to expand wireless coverage. Through its controller, the IRS can manage the incoming signal and efficiently adapt the angle of passive reflection from the transmitters toward the receivers [3]-[6]. In [4], the IRS harvests energy from the access point (AP) and uses it for reflecting the signal in two phases; the AP beamforming vector, the IRS's phase scheduling, and the passive beamforming were jointly optimised to maximise the information rate. In [5], a channel estimation scheme for a multi-user multiple-input multiple-output (MIMO) system was designed with the support of double IRS panels. Some research works have investigated the efficiency of the IRS in assisting D2D communications [7], [8]. In [7] and [8], two sub-problems with a fixed passive beamforming vector and a fixed phase shift matrix were considered. To solve the power allocation optimisation with the fixed phase shift matrix, the authors in [7] used the gradient descent method while the authors in [8] employed the Dinkelbach method. For the phase shift optimisation, a local search algorithm was proposed in [7] while fractional programming was utilised in [8]. However, these approaches assume discrete phase shifts and only reach a sub-optimal solution. Moreover, these works assume ideal conditions, e.g., perfect channel state information (CSI). In addition, these algorithms incur large delays due to their high computational complexity.

K. K. Nguyen, A. Masaracchia, C. Yin, and T. Q. Duong are with Queen's University Belfast, UK (e-mail: {knguyen02,a.masaracchia,cyin01,trung.q.duong}@qub.ac.uk). L. D. Nguyen is with Duy Tan University, Vietnam (e-mail: dinhlonghcmut@gmail.com). O. A. Dobre is with Memorial University, Canada (e-mail: odobre@mun.ca).
Very recently, deep reinforcement learning (DRL) has been applied as an effective solution for solving complicated problems in wireless networks [9]-[14]. In [9], we defined discrete power levels and used a DRL algorithm to choose the transmit power at the D2D transmitter for maximising the EE. In [11], discrete and continuous action spaces were considered for the beamforming vector and the IRS phase shift in multiple-input single-output (MISO) communications; two DRL algorithms were then used to maximise the total throughput. In [12], a DRL-based method was used to optimise the unmanned aerial vehicle (UAV)'s altitude and the IRS diagonal matrix to minimise the sum age-of-information. In [13], the authors used the DRL technique to maximise the signal-to-noise ratio.
In this paper, we propose a DRL algorithm for solving the joint power allocation and phase shift matrix optimisation in IRS-assisted D2D communications. Firstly, we conceive a D2D communication system with the support of the IRS. The D2D channel is a combination of the direct link and the reflective link, and the IRS is used for mitigating the interference and enhancing the information transmission channel. Secondly, we formulate a Markov decision process (MDP) [15] for the network throughput maximisation in the IRS-assisted D2D communications, in which the optimisation variables are the transmit powers at the D2D users and the phase shifts at the IRS. Then, a DRL algorithm is used to search for an optimal policy that maximises the network sum-rate. Finally, we compare the efficiency of our proposed method with that of other schemes in terms of the achievable network sum-rate.

II. SYSTEM MODEL AND PROBLEM FORMULATION
We consider an IRS-assisted wireless network with N pairs of D2D users distributed randomly and an IRS panel, as shown in Fig. 1. Each pair of D2D users comprises a single-antenna D2D transmitter (D2D-Tx) and a single-antenna D2D receiver (D2D-Rx). An IRS panel with K reflective elements is deployed to enhance the signal from the D2D-Tx to the associated D2D-Rx and mitigate the interference from other D2D-Txs. The IRS reflects the incident signal according to the phase shift matrix controlled by an intelligent unit. The received signal at the D2D-Rx is composed of a direct signal and a reflective one.

[arXiv:2108.02892v1 [eess.SP], 6 Aug 2021]
We denote the position of the nth D2D-Tx at time step t as $X^t_n(\text{Tx}) = \big(x^t_n(\text{Tx}), y^t_n(\text{Tx})\big)$, $n = 1, \ldots, N$, and that of the mth D2D-Rx as $X^t_m(\text{Rx}) = \big(x^t_m(\text{Rx}), y^t_m(\text{Rx})\big)$, $m = 1, \ldots, N$. The IRS is fixed at the position $(x_{\text{IRS}}, y_{\text{IRS}}, z_{\text{IRS}})$. The phase shift value of each element of the IRS belongs to $[0, 2\pi]$. We denote the direct channel from the nth D2D-Tx to the mth D2D-Rx at time step t by $h^t_{nm}$, and the reflective channel by $H^t_{nm}$. The phase shift matrix at the IRS at time step t is defined by

$$\Phi^t = \text{diag}\big(\eta^t_1 e^{j\phi^t_1}, \ldots, \eta^t_K e^{j\phi^t_K}\big),$$

where $\eta^t_k \in [0, 1]$ and $\phi^t_k \in [0, 2\pi]$ represent the amplitude and the phase shift value of the kth element, respectively. In this paper, we assume that the amplitudes of all elements are set to $\eta^t_k = 1$.

Fig. 1. System model of the IRS-assisted D2D network with information transmission and interference links.

The distance between the nth D2D-Tx and the mth D2D-Rx at time step t is defined as

$$d^t_{nm} = \sqrt{\big(x^t_n(\text{Tx}) - x^t_m(\text{Rx})\big)^2 + \big(y^t_n(\text{Tx}) - y^t_m(\text{Rx})\big)^2}. \quad (1)$$

Similarly, the distance between the nth D2D-Tx and the IRS is $d^t_{n,\text{IRS}}$ and the distance between the IRS and the mth D2D-Rx is $d^t_{\text{IRS},m}$ at time step t. The direct channel is formulated as

$$h^t_{nm} = \sqrt{\beta_0 \big(d^t_{nm}\big)^{-\kappa_0}}\, \hat{h}_{nm},$$

where $\beta_0$ is the channel power gain at the reference distance $d_0 = 1$ m, $\hat{h}_{nm}$ is the small-scale fading coefficient, and $\kappa_0$ is the path-loss exponent of the D2D link. Here, we assume that the small-scale fading follows the Nakagami-m distribution with m as the fading severity parameter.
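The distance and direct-channel model above can be sketched as follows. This is a minimal illustration, not the paper's code: the values of $\beta_0$, $\kappa_0$, and the Nakagami parameter m are assumed for the example, and the Gamma-sampling step uses the standard identity that the squared amplitude of a Nakagami-m variable with unit spread is Gamma(m, 1/m)-distributed.

```python
import numpy as np

rng = np.random.default_rng(0)

def direct_channel(tx_pos, rx_pos, beta0=1e-3, kappa0=3.0, m=2.0):
    """Sample the direct D2D channel h_nm: distance-based path loss with
    exponent kappa0 times a Nakagami-m small-scale fading amplitude.
    beta0 is the channel power gain at the reference distance d0 = 1 m.
    (beta0, kappa0, m are illustrative values, not the paper's.)"""
    # 2-D Euclidean distance between Tx and Rx, as in Eq. (1).
    d = np.hypot(tx_pos[0] - rx_pos[0], tx_pos[1] - rx_pos[1])
    # Nakagami-m amplitude: squared amplitude ~ Gamma(shape=m, scale=1/m).
    amp = np.sqrt(rng.gamma(shape=m, scale=1.0 / m))
    phase = rng.uniform(0.0, 2.0 * np.pi)
    return np.sqrt(beta0 * d ** (-kappa0)) * amp * np.exp(1j * phase)

h = direct_channel((0.0, 0.0), (6.0, 8.0))  # Tx-Rx distance is 10 m
```

With unit spread, the mean fading power E[amp²] equals one, so the average channel power is governed purely by the path-loss term.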
The reflective channel via the IRS from the nth D2D-Tx toward the mth D2D-Rx at time step t is modelled as a Rician fading channel:

$$H^t_{nm} = \sqrt{\frac{\beta_1}{1+\beta_1}}\, \bar{h}^{\text{LoS}}_{nm} + \sqrt{\frac{1}{1+\beta_1}}\, \bar{h}^{\text{NLoS}}_{nm},$$

where $\beta_1$ is the Rician factor, and $\bar{h}^{\text{LoS}}_{nm}$, $\bar{h}^{\text{NLoS}}_{nm}$ are the line-of-sight (LoS) and non-line-of-sight (NLoS) components of the reflected channel, respectively. Specifically, the LoS component is modelled as in [7], where $\theta \in [0, 2\pi]$ is a random phase. The NLoS component is defined with the path-loss exponent $\kappa_1$ and the small-scale fading $\hat{h}^{\text{NLoS}}_{nm} \sim \mathcal{CN}(0, 1)$, i.e., i.i.d. complex Gaussian with zero mean and unit variance.
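A sketch of sampling the K cascaded IRS channel coefficients under this Rician model. Two details here are assumptions for illustration only: the path loss is applied over the product distance $d_{n,\text{IRS}} \cdot d_{\text{IRS},m}$, and the LoS term shares one common random phase $\theta$ across elements; the paper's exact per-element LoS structure follows [7].

```python
import numpy as np

rng = np.random.default_rng(1)

def reflective_channel(d_tx_irs, d_irs_rx, K=20, beta0=1e-3, beta1=3.0, kappa1=2.2):
    """Sample K cascaded Tx -> IRS -> Rx coefficients as Rician fading:
    a LoS term with a random common phase theta plus a CN(0, 1) NLoS term,
    mixed by the Rician factor beta1. The product-distance path loss and
    the common LoS phase are illustrative modelling assumptions."""
    pl = np.sqrt(beta0 * (d_tx_irs * d_irs_rx) ** (-kappa1))
    theta = rng.uniform(0.0, 2.0 * np.pi)
    h_los = np.exp(-1j * theta) * np.ones(K)
    h_nlos = (rng.standard_normal(K) + 1j * rng.standard_normal(K)) / np.sqrt(2)
    return pl * (np.sqrt(beta1 / (1 + beta1)) * h_los
                 + np.sqrt(1 / (1 + beta1)) * h_nlos)

H = reflective_channel(30.0, 40.0)  # one draw of the K-element cascaded channel
```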
The received signal at the nth D2D-Rx at time step t can be written as

$$y^t_n = \sqrt{p^t_n}\,\big(h^t_{nn} + H^t_{nn}\big) u^t_n + \sum_{m \neq n} \sqrt{p^t_m}\,\big(h^t_{mn} + H^t_{mn}\big) u^t_m + w^t_n,$$

where $p^t_n$ is the transmit power at the nth D2D-Tx at time step t, $u^t_n$ is the transmitted symbol from the nth D2D-Tx, and $w^t_n \sim \mathcal{CN}(0, \alpha^2)$ is the complex additive white Gaussian noise.
Accordingly, the received signal-to-interference-plus-noise ratio (SINR) at the nth D2D-Rx can be represented as

$$\gamma^t_n = \frac{p^t_n \big|h^t_{nn} + H^t_{nn}\big|^2}{\sum_{m \neq n} p^t_m \big|h^t_{mn} + H^t_{mn}\big|^2 + \alpha^2}.$$

The achievable rate of the nth D2D pair during time step t is defined as

$$R^t_n = B \log_2\big(1 + \gamma^t_n\big),$$

where B is the bandwidth.
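The SINR and sum-rate computation can be sketched as below. The function assumes the cascaded per-element IRS channels are combined through the phase shifts $e^{j\phi_k}$ before being added to the direct link; the array shapes, noise power, and bandwidth are illustrative placeholders.

```python
import numpy as np

def sum_rate(p, h_direct, h_refl, phi, noise=1e-10, B=1.0):
    """Network sum-rate for N D2D pairs.
    p: (N,) transmit powers; h_direct: (N, N) direct channels, indexed [tx, rx];
    h_refl: (N, N, K) cascaded IRS channels; phi: (K,) IRS phase shifts.
    Effective Tx-m -> Rx-n link = direct link + phase-combined reflected link."""
    refl = h_refl @ np.exp(1j * phi)            # (N, N) reflected links
    g = np.abs(h_direct + refl) ** 2            # combined channel power gains
    N = len(p)
    rates = np.empty(N)
    for n in range(N):
        signal = p[n] * g[n, n]
        interf = sum(p[m] * g[m, n] for m in range(N) if m != n)
        rates[n] = B * np.log2(1.0 + signal / (interf + noise))
    return rates.sum()
```

For a single pair with unit channel, unit power, and unit noise, the rate reduces to log2(2) = 1, which makes a convenient sanity check.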
In this paper, we aim at optimising the power allocation of all N pairs of D2D users, $P = \{p_1, p_2, \ldots, p_N\}$, and the phase shift matrix $\Phi$ of the IRS to maximise the network sum-rate while satisfying all the constraints. The considered network optimisation problem can be formulated as follows:

$$\max_{P, \Phi} \; \sum_{n=1}^{N} R^t_n \quad \text{s.t.} \quad 0 \le p^t_n \le P_{\max},\ \forall n, \quad R^t_n \ge r_{\min},\ \forall n, \quad \phi^t_k \in [0, 2\pi],\ \forall k, \qquad (9)$$

where $P_{\max}$ is the maximum transmit power at the D2D-Tx and the constraint $R^t_n \ge r_{\min},\ \forall n$, guarantees the quality-of-service (QoS) of the D2D communications.

III. JOINT OPTIMISATION OF POWER ALLOCATION AND PHASE SHIFT MATRIX
Given the optimisation problem (9), we formulate the MDP with the agent, the state space S, the action space A, the transition probability P, the reward function R, and the discount factor ζ. Let us denote $P_{ss'}(a)$ as the probability that the agent takes action $a^t \in A$ at the state $s = s^t \in S$ and transfers to the next state $s' = s^{t+1} \in S$. In particular, we formulate the MDP as follows:
• State space: The channel gains of the D2D users form the state space, i.e., $s^t = \big\{\big|h^t_{nm} + H^t_{nm}\big|^2,\ n, m = 1, \ldots, N\big\}$.
• Action space: The D2D-Txs adjust the transmit powers and the IRS changes the phase shifts to maximise the expected reward. Thus, the action space for the D2D users and the IRS is $a^t = \{p^t_1, \ldots, p^t_N, \phi^t_1, \ldots, \phi^t_K\}$.
• Reward function: The agent needs to find an optimal policy that maximises the reward. In our problem, the objective is to maximise the network sum-rate; thus, the reward function is defined as

$$r^t = \sum_{n=1}^{N} R^t_n. \quad (12)$$

By following the MDP, the agent interacts with the environment and receives its response so as to achieve the best expected reward. Particularly, the state of the agent at time step t is $s^t$. The agent chooses and executes the action $a^t$ under the policy π. The environment responds with the reward $r^t$. After taking the action $a^t$, the agent moves to the new state $s^{t+1}$ with probability $P_{ss'}(a)$. The interactions are executed iteratively and the policy is updated towards the optimal reward.
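The MDP loop above can be sketched as a minimal environment. Everything here is a simplification for illustration: `rate_fn` is a hypothetical callback standing in for the sum-rate computation, the linear scaling of a normalised action vector into powers and phases is an assumed parameterisation, and the observation is a placeholder where channel gains would be re-sampled.

```python
import numpy as np

class IrsD2dEnv:
    """Minimal MDP sketch: the action packs the N transmit powers and the
    K IRS phase shifts, and the reward is the network sum-rate of (12),
    computed by a user-supplied rate_fn (a hypothetical placeholder)."""

    def __init__(self, n_pairs, n_elements, p_max, rate_fn):
        self.N, self.K = n_pairs, n_elements
        self.p_max, self.rate_fn = p_max, rate_fn

    def step(self, action):
        # First N entries in [0, 1] scale to transmit powers in [0, P_max];
        # the remaining K entries scale to phase shifts in [0, 2*pi].
        p = self.p_max * np.clip(action[:self.N], 0.0, 1.0)
        phi = 2.0 * np.pi * np.clip(action[self.N:], 0.0, 1.0)
        reward = self.rate_fn(p, phi)           # reward = sum-rate, Eq. (12)
        next_state = self._observe(p, phi)
        return next_state, reward

    def _observe(self, p, phi):
        # Placeholder: a real environment would return fresh channel gains.
        return np.concatenate([p, phi])
```

A one-line `rate_fn` stub is enough to exercise the interface before plugging in the full channel model.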
In this paper, we propose a DRL approach to search for an optimal policy that maximises the reward in (12). The optimal policy can be obtained by modifying the estimation of the value function or directly from the objective. We use an on-policy algorithm, namely proximal policy optimisation (PPO) with the clipping surrogate technique [16]. Considering the probability ratio between the current policy and the old policy, $\rho^t(\theta) = \frac{\pi(s, a; \theta)}{\pi(s, a; \theta_{\text{old}})}$, we need to find the optimal policy that maximises the total expected reward as follows:

$$\max_{\theta} \; \mathbb{E}\big[\rho^t(\theta)\, A_\pi(s, a)\big], \quad (13)$$

where $\mathbb{E}[\cdot]$ is the expectation operator and $A_\pi(s, a) = Q_\pi(s, a) - V_\pi(s)$ denotes the advantage function [17]; $V_\pi(s)$ denotes the state-value function while $Q_\pi(s, a)$ is the action-value function.
In the PPO method, we limit the current policy so that it does not move far from the previously obtained policy, using techniques such as clipping or the Kullback-Leibler divergence penalty [17]. In this work, we use the clipping surrogate method to prevent excessive modification of the objective value, as follows:

$$L^{\text{CLIP}}(\theta) = \mathbb{E}\Big[\min\Big(\rho^t(\theta) A_\pi(s, a),\ \text{clip}\big(\rho^t(\theta), 1 - \epsilon, 1 + \epsilon\big) A_\pi(s, a)\Big)\Big], \quad (14)$$

where $\epsilon$ is a hyperparameter. When the advantage $A_\pi(s, a)$ is positive, the term $(1 + \epsilon)$ caps the objective value; meanwhile, when the advantage $A_\pi(s, a)$ is negative, the term $(1 - \epsilon)$ bounds it. Moreover, the advantage function $A_\pi(s, a)$ is estimated from the temporal-difference errors of the value function, following [18], where the state-value function $V_\pi(s)$ is obtained at the state s under the policy π as

$$V_\pi(s) = \mathbb{E}\Big[\sum_{t} \zeta^t r^t \,\Big|\, s^0 = s,\ \pi\Big].$$

To train the policy network, we store the transitions in a mini-batch memory D and then use the stochastic gradient descent (SGD) method to maximise the objective. Denoting the policy parameter by θ, it is updated by ascending the gradient of $L^{\text{CLIP}}(\theta)$. The PPO algorithm for the joint optimisation of the transmit power and the phase shift matrix in the IRS-aided D2D communications is presented in Algorithm 1, where M denotes the maximum number of episodes and T is the number of iterations during a period of time.
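The clipping surrogate can be sketched in a few lines. This is a minimal NumPy version of the standard PPO clipped objective, not the paper's TensorFlow implementation; the default ε value is an assumption.

```python
import numpy as np

def clipped_surrogate(ratio, advantage, eps=0.2):
    """PPO clipped surrogate objective: the minimum of the raw
    ratio-weighted advantage and the advantage weighted by the ratio
    clipped to [1 - eps, 1 + eps], averaged over the mini-batch."""
    ratio, advantage = np.asarray(ratio), np.asarray(advantage)
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return np.minimum(unclipped, clipped).mean()
```

With ε = 0.2, a sample with ratio 2.0 and advantage +1 contributes only 1.2 (the cap for positive advantages), while a sample with ratio 0.5 and advantage −1 contributes −0.8 (the bound for negative advantages), which is exactly the capping/bounding behaviour described above.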

IV. SIMULATION RESULTS
For the numerical results, we use TensorFlow 1.13.1 [19]. The IRS is deployed at (0, 0, 0), while the D2D devices are randomly distributed within a circle of radius 100 m around the centre. The maximum distance between the D2D-Tx and the associated D2D-Rx is set to 10 m. We assume d/λ = 1/2 and set the learning rate of the PPO algorithm to 0.0001. For the neural networks, we initialise two hidden layers with 128 and 64 units, respectively. All other parameters are provided in Table I. We consider the following algorithms in the numerical results.
• The proposed algorithm: We use the PPO algorithm with the clipping surrogate technique to solve the joint optimisation of the power allocation and the phase shift matrix of the IRS.

• Maximum power transmission (MPT): The D2D-Tx transmits information with the maximum power, P_max. We use the PPO algorithm to optimise the phase shift matrix of the IRS panel.
Algorithm 1 Proposed approach based on the PPO algorithm for the IRS-assisted D2D communications.
1: Initialise the policy π with the parameter θ_π
2: Initialise other parameters
3: for episode = 1, . . ., M do
4:   Receive initial observation state s^0
5:   for iteration = 1, . . ., T do
6:     Obtain the action a^t at state s^t by following the current policy
7:     Execute the action a^t
8:     Receive the reward r^t according to (12)
9:     Observe the new state s^{t+1}
10:    Update the state s^t = s^{t+1}
11:    Collect the set of partial trajectories with D transitions
12:    Estimate the advantage function according to (15)
13:  end for
14:  Update the policy parameters using SGD with the mini-batch D
15: end for

• Random phase shift matrix selection (RPS): We optimise the power allocation at the D2D-Tx with a random selection of the phase shift matrix Φ.
• Without IRS: The D2D-Tx transmits information without the support of the IRS. We optimise the power allocation by using the PPO algorithm.

Firstly, we compare the achievable network sum-rate provided by our proposed algorithm with that of the other schemes. Fig. 2 plots the sum-rate versus different numbers of IRS elements, K, where the number of D2D pairs is set to N = 5. As can be observed from this figure, the PPO algorithm-based technique outperforms the other schemes and is followed by the MPT technique. The RPS and Without-IRS schemes show poorer performance in terms of the network sum-rate. The achievable network sum-rate using our proposed algorithm and MPT improves with an increasing number of IRS elements. The results show that with a monotonic increase in the value of K, the communication quality between the D2D-Tx and the associated D2D-Rx is enhanced, while the interference from other D2D-Txs is suppressed.
Next, the performance of the four schemes mentioned above is compared while varying the number of D2D pairs, N, in Fig. 3. We set the number of IRS elements to K = 20 and average over 500 episodes to obtain the results. Our proposed algorithm shows the best performance, followed by MPT. With a higher number of D2D users, N ≥ 6, the performance attained by the proposed algorithm still increases while it decreases for the other schemes. The RPS and Without-IRS models show the worst performance. Further, we set N = 5, K = 20 and compare the performance of the four schemes while changing the value of the threshold, r_min, in Fig. 4. As r_min grows large, the number of D2D pairs that satisfy the QoS constraint decreases and the sum-rate of all schemes tends to 0. The proposed algorithm outperforms the other schemes for all values of r_min. The gap between our algorithm and the others widens as r_min increases beyond r_min ≥ 5. The MPT algorithm exhibits the worst performance when r_min ≥ 7. This suggests that the optimisation of the power allocation is important for efficient D2D communications.
Next, we compare the total sum-rate of the four schemes under different maximum transmit powers at the D2D-Tx, P_max, in Fig. 5, with N = 5, K = 20. As P_max varies from 100 mW to 400 mW, the performance of the four schemes follows a similar upward trend. The gap between our proposed algorithm and the other schemes increases with the value of P_max, as we jointly optimise both the power allocation at the D2D-Tx and the IRS's phase shift matrix. It is clear that the proposed algorithm is more effective in mitigating interference and providing a better communication quality. Furthermore, we use neural networks to build the DRL algorithm. Thus, after iterative interactions with the environment, the neural networks are trained towards an optimal solution. After offline training, the neural networks can be deployed to the system for online execution. The online neural networks can determine the proper action for the IRS phase shift values and the D2D-Tx power allocation to maximise the network sum-rate in real time.

V. CONCLUSION
In this paper, we have presented a DRL-based optimal resource allocation scheme for IRS-assisted D2D communications. The PPO algorithm with the clipping surrogate technique has been proposed for the joint optimisation of the D2D-Tx power and the IRS's phase shift matrix. Numerical results have shown a significant improvement in the achievable network sum-rate compared with the benchmark schemes, demonstrating the superiority of the IRS in mitigating interference in D2D communications.

Fig. 2. The network sum-rate versus the number of IRS elements, K.

Fig. 3. The network sum-rate versus the number of D2D pairs, N.

Fig. 4. The network sum-rate versus the QoS threshold, r_min.