A survey on graph neural networks

In recent years, deep learning has driven major advances in machine learning, achieving state-of-the-art performance in computer vision, speech recognition, natural language processing, and many other tasks. However, the data in these tasks is typically represented in Euclidean space. As technology develops, more and more applications generate data from non-Euclidean domains and represent it as graphs with complex relationships and interdependencies between objects. This poses a significant challenge to deep learning algorithms, because the unique characteristics of graphs make it difficult to apply deep learning to the ubiquitous graph data. Graph Neural Networks (GNNs) have emerged to address learning in non-Euclidean domains. A GNN is a neural model that captures the dependencies within a graph by passing messages between its nodes. This paper introduces commonly used graph neural networks, their learning methods, and common datasets for graph neural networks. It also provides an outlook on the future of GNNs.


Introduction
With the rapid development of neural networks in recent years, deep learning has become the "jewel" of artificial intelligence and machine learning [1]. Many machine learning tasks that once relied on hand-crafted feature extraction (e.g., image recognition, machine translation) are now handled by more advanced deep learning methods. The success of deep learning in areas such as image classification, video processing, speech recognition, and natural language understanding is no accident: it is due not only to big data and high-performance computing power but also to the effectiveness of deep learning [2] itself in extracting latent representations from Euclidean data. Graphs, by contrast, can be regular or irregular. A graph may have a variable number of unordered nodes, and nodes in the same graph may have different numbers of neighbours and different neighbourhoods. As a result, some operations of deep learning algorithms (such as convolution) that work well on Euclidean data cannot be applied directly to graphs in tasks such as classification, clustering, and prediction. Unlike traditional machine learning algorithms, which require graphs to be transformed into vectors or matrices, GNNs improve the representation of graph data by performing calculations directly on the graph, using the relationships between nodes. For example, in social network analysis, GNNs can help discover community structure and predict user interests and behaviour. In chemical molecular analysis, GNNs can help with tasks such as classifying, clustering, and predicting the properties of molecules. In recommender systems, GNNs can use the relationships between users to improve the effectiveness of recommendations.

Background
Early research on graph neural networks (GNNs) belongs to the category of recurrent graph neural networks (RecGNNs) and has a high computational overhead. Sperduti and Starita [5] applied neural networks to directed acyclic graphs, which motivated research on GNNs. Gori, Monfardini [6] first introduced the concept of graph neural networks, which was further elaborated in [7,8]. In recent years, with the wide application of non-Euclidean data, more and more researchers have focused on graph neural networks. Wu, Pan [9] classify graph neural networks into four categories. Zhang, Cui [10] give a comprehensive review of deep learning methods on different types of graphs. Thomas, Moallemy-Oureh [11] classify graph neural networks according to their ability to process different graph types and attributes. Waikhom and Patgiri [12] summarise the learning modes of graph neural networks. Zhou, Cui [13] propose a generic design pipeline for graph neural network models. There are also many research works on graph neural network learning methods. Cao, Li [14] extracted feature information in hyperspectral classification while avoiding the over-smoothing problem of message passing found in [15]. As research progressed, contrastive learning methods were also successful. Okuda, Satoh [16] proposed unsupervised graph representation learning to discover and localise common objects and a set of specific objects in an image. The node classification and edge detection method of [17] combines two techniques, random walks and language modelling, and the learned representations can be used for downstream tasks.
This paper provides a comprehensive review of different graph neural network models and of how graph neural networks learn. In summary, the main contributions of this paper are: (1) a comprehensive and detailed review of graph neural network models; (2) a discussion of graph-based training approaches; and (3) an outline of the challenges for future research on graph neural networks.
The rest of the paper is organised as follows. Section 3 introduces the concept and notation of graphs. Section 4 introduces graph convolutional networks (GCN), graph attention networks (GAT), and graph autoencoders (GAE) together with their learning methods. It also describes the difference between GAT and GCN and draws a simple distinction between GAT and GAN. Section 5 describes the datasets commonly used for graph neural networks. Section 6 summarises the paper and discusses the challenges faced by graph neural networks.

Concept and notation representation of graphs
A graph neural network [18,19] is a deep neural network suited to the analysis of graph-structured data. The notation used in this paper is listed in Table 1. A graph is written as $G = (V, E)$, where $V = \{v_1, v_2, v_3, \ldots, v_N\}$ is the set of $N = |V|$ nodes and $E \subseteq V \times V$ is the set of edges between nodes [20]. We use $A \in \mathbb{R}^{N \times N}$ to denote the adjacency matrix [21]. The $i$-th row of $A$ is written $A(i, :)$, the $j$-th column is written $A(:, j)$, and $A(i, j)$ is the element in the $i$-th row and $j$-th column of $A$.
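As a minimal sketch of this notation, the following builds a dense adjacency matrix from an edge list with NumPy. The helper name `adjacency_matrix` and the small example graph are illustrative assumptions, and node indices are 0-based here, whereas the text counts nodes from 1.

```python
import numpy as np

def adjacency_matrix(num_nodes, edges, directed=False):
    """Build a dense N x N adjacency matrix A from (i, j) node-index pairs."""
    A = np.zeros((num_nodes, num_nodes), dtype=float)
    for i, j in edges:
        A[i, j] = 1.0
        if not directed:
            A[j, i] = 1.0  # undirected graphs give a symmetric A
    return A

# Hypothetical example graph with N = 4 nodes and 3 undirected edges.
edges = [(0, 1), (0, 2), (2, 3)]
A = adjacency_matrix(4, edges)

row_0 = A[0, :]   # A(i, :) for i = 0: the neighbours of node 0
col_3 = A[:, 3]   # A(:, j) for j = 3
entry = A[2, 3]   # A(i, j): 1.0 because nodes 2 and 3 share an edge
```

A dense matrix is used only for readability; large graphs are normally stored in a sparse format.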

Structure of Graph Neural Network
Table 1 (notation; only part of the table is recoverable): $\Theta$, learnable parameters; $s$, the sample size; further rows define a non-linear activation function and the element-wise multiplication.
The graph structure is such that each node is defined by its own features and by the features of the nodes connected to it. The purpose of a GNN [22] is to learn, for each node, a state that encodes the information of its neighbourhood, and to use this state to produce an output. (i) Assume that $f(\cdot)$ is a parametric function called the local transition function, which is shared among all nodes and updates the node state based on input from neighbouring nodes. (ii) Assume that $g(\cdot)$ is the local output function, which describes how the output is generated. Let $F$ and $G$, called the global transition function and the global output function respectively, be the stacked versions of $f$ and $g$ for all nodes in the graph. According to Banach's fixed-point theorem, provided $F$ is a contraction map, a GNN can compute the node states with the conventional iterative scheme: the global transition function is applied repeatedly until the states converge to a fixed point.
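The iterative scheme can be illustrated with a small sketch. The transition function below is a hypothetical choice made only so that it is provably a contraction (a tanh of a linear map whose Lipschitz constant is at most `mu = 0.5`); it is not the function used in [22], and the names `global_transition`, `S`, `W`, `B` are assumptions introduced for this example.

```python
import numpy as np

def global_transition(H, X, S, W, B, mu=0.5):
    """Hypothetical global transition F(H, X): new node states from the
    neighbours' states (S @ H) and the node features X."""
    return np.tanh(mu * S @ H @ W + X @ B)

def iterate_to_fixed_point(X, S, W, B, H0=None, tol=1e-8, max_iter=1000):
    """Apply F repeatedly until two successive states differ by less than tol."""
    H = np.zeros((X.shape[0], W.shape[0])) if H0 is None else H0
    for step in range(1, max_iter + 1):
        H_next = global_transition(H, X, S, W, B)
        if np.linalg.norm(H_next - H) < tol:
            return H_next, step
        H = H_next
    return H, max_iter

# Example graph: 4 nodes, undirected edges 0-1, 0-2, 2-3 (no isolated nodes).
A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 0],
              [1, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)
d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
S = d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]  # spectral norm <= 1

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))        # node features
B = rng.normal(size=(3, 4))        # feature-to-state weights
W = rng.normal(size=(4, 4))
W /= np.linalg.norm(W, 2)          # rescale so that ||W||_2 = 1

H_star, steps = iterate_to_fixed_point(X, S, W, B)
```

Because the map is a contraction, the iteration reaches the same `H_star` from any initial state, which is the property that Banach's theorem guarantees.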

Graph Convolutional Network (GCN)
GCN is a convolutional neural network that acts directly on the graph and makes use of its structural information. The main idea of GCN [24][25][26] is that, for each node, we consider all of its neighbours and the feature information they contain. If we use, for example, an average() function as the aggregator, we obtain for each node an averaged representation that can be fed into the neural network. Modern GCNs mimic CNNs by designing convolution and readout functions to learn common local and global structural patterns of graphs. We first discuss the convolution operation and then move to the readout operation and some other improvements. Convolutional neural networks [27] play a central role in building many other complex GNN models.
(1) Spatial-based graph convolutional neural networks
The spatial-domain graph convolutional structure consists of three main types of operators: neighbour sampling [35], message computation and message aggregation. In GCN, an aggregation operation combines the representations of a node's adjacent nodes, which realises message passing between nodes. Figure 1 shows how node information is passed in a spatial-domain GCN.

Figure 1: GCN node information transmission based on spatial domain
The simplest aggregation is to multiply the topological structure information of the graph (the adjacency matrix A) by the node features of the graph (X), i.e. to compute AX. The exact process is shown in Figure 2.

Figure 2: Information aggregation process for spatial domain-based GCNs
The aggregation in Figure 2 has two problems: it does not include a node's own features, and it aggregates by direct summation, which can cause gradients to explode or vanish. To solve them, we can add the identity matrix I to the adjacency matrix A and aggregate the features of the neighbouring nodes by taking a weighted average.
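A small numerical sketch of the two aggregation schemes follows. The graph and feature values are made up for illustration, and the row normalisation by the inverse degree matrix is one possible weighted average, chosen here as an assumption; symmetric normalisation is a common alternative.

```python
import numpy as np

# Hypothetical 4-node undirected graph and 2-dimensional node features.
A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 0],
              [1, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [2.0, 2.0],
              [4.0, 0.0]])

# Plain product A X: each row is the SUM of the neighbours' features.
# A node's own features are ignored and the scale grows with its degree.
sum_neighbours = A @ X

# Add self-loops (A + I), then divide each row by its degree so that every
# node receives the MEAN of its own and its neighbours' features.
A_hat = A + np.eye(A.shape[0])
D_hat_inv = np.diag(1.0 / A_hat.sum(axis=1))
mean_with_self = D_hat_inv @ A_hat @ X
```

Every row of `D_hat_inv @ A_hat` sums to 1, so the aggregated features stay on the same scale as the inputs regardless of node degree.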
According to how the convolutional layers are stacked, spatial-based GCNs can be further divided into two categories: recurrent-based and composition-based. Recurrent-based approaches use the same graph convolution layer to update the hidden representation, whereas composition-based approaches use a different graph convolution layer at each step. Figure 3 illustrates this difference.

Figure 3: Comparison of recurrent-based and composition-based
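The distinction between the two stacking styles can be sketched in a few lines. The propagation rule (mean aggregation followed by tanh) and all weight shapes are assumptions made for this example; only the weight-sharing pattern is the point.

```python
import numpy as np

def propagate(S, H, layer_weights):
    """Apply one graph convolution step per weight matrix in layer_weights."""
    for W in layer_weights:
        H = np.tanh(S @ H @ W)  # aggregate neighbours, transform, squash
    return H

rng = np.random.default_rng(0)
A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 0],
              [1, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)
A_hat = A + np.eye(4)
S = A_hat / A_hat.sum(axis=1, keepdims=True)  # mean over self and neighbours
H0 = rng.normal(size=(4, 3))

# Recurrent-based: the SAME layer (one weight matrix) is applied three times,
# so the hidden dimension cannot change between steps.
W_shared = rng.normal(size=(3, 3))
recurrent_out = propagate(S, H0, [W_shared] * 3)

# Composition-based: three DIFFERENT layers, each with its own weights.
W1 = rng.normal(size=(3, 5))
W2 = rng.normal(size=(5, 5))
W3 = rng.normal(size=(5, 2))
compositional_out = propagate(S, H0, [W1, W2, W3])
```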
The spatial approach defines the convolution directly in the spatial domain. The difficulty is that each node has a different number of neighbours, so a neighbourhood of fixed size cannot be defined and parameter sharing is hard to achieve. The underlying idea is nevertheless that convolution is a weighted average over a node's neighbouring nodes, and many subsequent approaches aim to solve the parameter-sharing problem.
(2) Spectral-based graph convolutional neural networks
Convolution based on spectral methods is a special case of convolution based on spatial methods. Spectral-domain graph convolution studies the properties of a graph through the eigenvalues and eigenvectors of its Laplacian matrix, and introduces filters to define convolution from a signal processing perspective. Because graph data does not satisfy translation invariance, convolution is difficult to define directly in the spatial domain. The spectral method therefore transforms the signal into the frequency domain, realises the convolution there, and transforms the result back: first, by the convolution theorem, the signals are multiplied in the spectral domain; second, the inverse Fourier transform maps the result back to the original space. Graph convolutional neural networks based on spectral methods assume that the graph is undirected. The normalised graph Laplacian matrix is a mathematical representation of an undirected graph, defined as:

$$L = I_N - D^{-1/2} A D^{-1/2}$$

where $D$ is the diagonal degree matrix with $D(i, i) = \sum_j A(i, j)$. Since $L$ is real, symmetric and positive semi-definite, it can be factorised as $L = U \Lambda U^T$, where the columns of $U$ are its eigenvectors and $\Lambda$ is the diagonal matrix of its eigenvalues. The graph Fourier transform of a signal $x$ is:

$$\hat{x} = U^T x$$

where $\hat{x}$ is the result of the Fourier transform. From this definition we can see that the graph Fourier transform projects the input graph signal into an orthonormal space whose basis is formed by the eigenvectors of the normalised graph Laplacian. The elements of the transformed signal are the coordinates of the graph signal in the new space, so the original input signal can be expressed as:

$$x = \sum_i \hat{x}_i u_i = U \hat{x}$$

This is the inverse Fourier transform. Next we can define the graph convolution of the input signal $x$ with a filter $g$:

$$x *_G g = U (U^T x \odot U^T g)$$

where $g$ is the filter we define and $\odot$ indicates the Hadamard product. Suppose we define the filter [36] as $g_\theta = \mathrm{diag}(U^T g)$. The graph convolution operation [37] can then be written in simplified form as:

$$x *_G g_\theta = U g_\theta U^T x \qquad (13)$$

Spectral-based graph convolution networks all follow this pattern, with the key difference between them being the choice of filter. The following spectral-based graph convolution network models exist: Spectral CNN, Chebyshev [31] Spectral CNN (ChebNet), and Adaptive Graph Convolution Network (AGCN).
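The spectral pipeline (transform, filter, inverse transform) can be sketched with NumPy's symmetric eigensolver. The example graph, the signal `x` and the exponential low-pass filter are illustrative assumptions; the filter is not one of the models named above, it merely stands in for a diagonal filter applied in the spectral domain.

```python
import numpy as np

def normalized_laplacian(A):
    """L = I - D^{-1/2} A D^{-1/2} for an undirected graph without isolated nodes."""
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
    return np.eye(A.shape[0]) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]

A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 0],
              [1, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)
L = normalized_laplacian(A)

# eigh is meant for symmetric matrices: eigenvalues come back in ascending
# order and the columns of U are orthonormal eigenvectors, L = U diag(eigvals) U^T.
eigvals, U = np.linalg.eigh(L)

x = np.array([1.0, -2.0, 0.5, 3.0])  # one scalar signal value per node
x_hat = U.T @ x                      # graph Fourier transform
x_back = U @ x_hat                   # inverse transform recovers x

# Hypothetical low-pass filter: damp components with large eigenvalues.
g_theta = np.diag(np.exp(-2.0 * eigvals))
y = U @ g_theta @ U.T @ x            # filtered signal, U g_theta U^T x
```

The eigendecomposition costs on the order of N^3 operations for a dense N x N Laplacian, which is the efficiency concern usually raised against purely spectral models.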

Comparison Between Spectral and Spatial Models
As the earliest graph convolutional networks, spectral-based models have achieved impressive results in many graph-related analysis tasks. These models have a theoretical basis in graph signal processing: by designing new graph signal filters, we can in principle design new graph convolutional networks. However, spectral-based models have several serious drawbacks, which we discuss below in terms of efficiency, generality and flexibility.
In terms of efficiency, the computational cost of spectral-based models increases dramatically with the size of the graph, because they either need to compute eigenvectors or process the entire graph at once, which makes them difficult to apply to large graphs. Spatial-based models have the potential to handle large graphs because they perform convolution directly in the graph domain by aggregating neighbouring nodes. The computation can be performed on a batch of nodes rather than on the whole graph, and sampling techniques can be introduced to improve efficiency when the number of neighbouring nodes grows.
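The batch-plus-sampling idea can be sketched as follows. The function names, the mean aggregator and the star-shaped example graph are assumptions for illustration; this is a generic sketch rather than the procedure of any specific model cited here.

```python
import numpy as np

def sample_neighbours(adj_list, node, k, rng):
    """Return at most k neighbours of `node`, drawn without replacement."""
    neighbours = adj_list[node]
    if len(neighbours) <= k:
        return list(neighbours)
    return rng.choice(neighbours, size=k, replace=False).tolist()

def aggregate_batch(batch, adj_list, X, k, rng):
    """Mean-aggregate each batch node with its sampled neighbours.

    Only the rows of X touched by the batch are read, so the cost depends
    on the batch size and on k, not on the size of the whole graph.
    """
    out = np.zeros((len(batch), X.shape[1]))
    for row, v in enumerate(batch):
        sampled = sample_neighbours(adj_list, v, k, rng)
        out[row] = X[[v] + sampled].mean(axis=0)
    return out

# Hypothetical star graph: node 0 is connected to nodes 1..4.
adj_list = {0: [1, 2, 3, 4], 1: [0], 2: [0], 3: [0], 4: [0]}
X = np.arange(10.0).reshape(5, 2)
rng = np.random.default_rng(0)

batch_out = aggregate_batch([0, 1], adj_list, X, k=2, rng=rng)
```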
In terms of generality, spectral-based models assume a fixed graph, which makes it difficult to add new nodes to the graph. Spatial-based models, on the other hand, perform graph convolution locally at each node and can easily share weights between different locations and structures.
In terms of flexibility, spectral-based models are limited to undirected graphs. The Laplacian matrix of a directed graph is not clearly defined, so the only way to apply a spectral-based model to a directed graph is to convert it to an undirected graph. Spatial-based models are more flexible in dealing with multiple source inputs, which can be combined in the aggregation function. As a result, spatial models have received increasing attention in recent years.

Graph Attention Network (GAT)
Graph Attention Network (GAT) [38] consists of a stack of functionally identical blocks (graph attention layers) [39]. Its properties include high efficiency, low storage requirements, inductive learning and full graph access. The input to a graph attention layer is a set of node features $h = \{\vec{h}_1, \vec{h}_2, \ldots, \vec{h}_N\}$, $\vec{h}_i \in \mathbb{R}^F$, where $N$ represents the number of nodes and $F$ represents the dimensionality of the node features. The layer outputs a new set of node features $h' = \{\vec{h}'_1, \vec{h}'_2, \ldots, \vec{h}'_N\}$, $\vec{h}'_i \in \mathbb{R}^{F'}$, where $F'$ is the dimensionality of the output node features, as shown in Figure 4.

Figure 4: Attention layer in GAT
The purpose of using self-attention is to improve the expressiveness of $h'$. In the graph attention layer, a shared weight matrix $W$ is first applied to each node, and then a shared self-attention mechanism, denoted $a$, is used to compute an attention coefficient for each pair of nodes:

$$e_{ij} = a(W\vec{h}_i, W\vec{h}_j) \qquad (14)$$

$e_{ij}$ represents the importance of node $j$ for node $i$. In theory we can calculate the weight of any node in the graph with respect to the central node. In GAT, to simplify the calculation, the nodes are restricted to the one-hop neighbours $N_i$ of the central node, and each node also counts itself as a neighbour. Existing studies offer many choices for $a$. For example, choosing a single-layer feedforward network with parameter $\vec{a} \in \mathbb{R}^{2F'}$, applying LeakyReLU as the non-linearity, and normalising over the neighbourhood with softmax gives:

$$\alpha_{ij} = \frac{\exp\left(\mathrm{LeakyReLU}\left(\vec{a}^T [W\vec{h}_i \,\Vert\, W\vec{h}_j]\right)\right)}{\sum_{k \in N_i} \exp\left(\mathrm{LeakyReLU}\left(\vec{a}^T [W\vec{h}_i \,\Vert\, W\vec{h}_k]\right)\right)}$$

The output feature $\vec{h}'_i$ is obtained by weighting the transformed input features with these coefficients and applying a non-linearity $\sigma$:

$$\vec{h}'_i = \sigma\left(\sum_{j \in N_i} \alpha_{ij} W \vec{h}_j\right)$$

To improve the generalisation of the attention mechanism, GAT uses a multi-head attention layer, i.e. a set of $K$ mutually independent single-head attention [40] layers whose results are concatenated. In this case $\vec{h}'_i$ is:

$$\vec{h}'_i = \Big\Vert_{k=1}^{K} \sigma\left(\sum_{j \in N_i} \alpha_{ij}^{(k)} W^k \vec{h}_j\right)$$

where $\Vert$ represents the concatenation operation, $\alpha_{ij}^{(k)}$ represents the attention coefficient computed by the $k$-th attention mechanism, and $W^k$ is the weight matrix of the $k$-th head. To reduce the dimensionality of the feature vector, we can also use averaging instead of concatenation:

$$\vec{h}'_i = \sigma\left(\frac{1}{K} \sum_{k=1}^{K} \sum_{j \in N_i} \alpha_{ij}^{(k)} W^k \vec{h}_j\right)$$
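A compact NumPy sketch of a graph attention layer is given below. It follows the single-layer feedforward attention with LeakyReLU and a softmax over one-hop neighbours (self included). The function names, the random weights and the use of tanh as the output non-linearity are assumptions for illustration. Node features are stored as rows, so the weight matrix here has shape (F, F') and is multiplied on the right.

```python
import numpy as np

def leaky_relu(z, slope=0.2):
    return np.where(z > 0, z, slope * z)

def attention_head(H, A, W, a):
    """One attention head. H: (N, F), A: (N, N), W: (F, F_out), a: (2 * F_out,)."""
    n = H.shape[0]
    Wh = H @ W                                    # transformed features, (N, F_out)
    f_out = Wh.shape[1]
    # a^T [Wh_i || Wh_j] splits into a[:f_out] . Wh_i + a[f_out:] . Wh_j
    scores = leaky_relu((Wh @ a[:f_out])[:, None] + (Wh @ a[f_out:])[None, :])
    mask = (A + np.eye(n)) > 0                    # one-hop neighbours plus self
    scores = np.where(mask, scores, -np.inf)      # non-neighbours get zero weight
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    alpha = np.exp(scores)
    alpha /= alpha.sum(axis=1, keepdims=True)     # softmax over each neighbourhood
    return alpha, alpha @ Wh

def multi_head(H, A, Ws, a_vecs, concat=True, act=np.tanh):
    """Combine K independent heads by concatenation or by averaging."""
    outputs = [attention_head(H, A, W, a)[1] for W, a in zip(Ws, a_vecs)]
    if concat:
        return np.concatenate([act(o) for o in outputs], axis=1)
    return act(np.mean(outputs, axis=0))

rng = np.random.default_rng(0)
A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 0],
              [1, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)
H = rng.normal(size=(4, 3))                       # N = 4 nodes, F = 3
Ws = [rng.normal(size=(3, 5)) for _ in range(2)]  # K = 2 heads, F' = 5
a_vecs = [rng.normal(size=10) for _ in range(2)]

alpha, head_out = attention_head(H, A, Ws[0], a_vecs[0])
concat_out = multi_head(H, A, Ws, a_vecs, concat=True)    # shape (4, 10)
mean_out = multi_head(H, A, Ws, a_vecs, concat=False)     # shape (4, 5)
```

Concatenation multiplies the output width by K, while averaging keeps it at F', which is why averaging is the usual choice for a final prediction layer.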

Graph Autoencoder(GAE)
Starting with the graph autoencoder proposed by Kipf and Welling [41], graph autoencoders have proved useful in many fields thanks to their simple encoder-decoder [42,43] structure and efficient encoding capability. Figure 5 briefly describes the workflow of a graph autoencoder (GAE), from the input graph to the reconstructed graph.

Figure 5: GAE workflow
Encoder
GAE [44,45] uses the GCN as an encoder [46] to obtain latent representations (embeddings) of the nodes, a process that can be expressed in one short equation [47]:

$$Z = \mathrm{GCN}(X, A) \qquad (20)$$

Here the GCN is treated as a function: $X$ and $A$ are its inputs and $Z \in \mathbb{R}^{N \times f}$ is its output, representing the latent representations (embeddings) of all nodes. The GCN is defined as follows:

$$\mathrm{GCN}(X, A) = \tilde{A}\, \mathrm{ReLU}(\tilde{A} X W_0)\, W_1 \qquad (21)$$

where $\tilde{A} = D^{-1/2} A D^{-1/2}$ and $W_0$, $W_1$ represent the parameters to be learned. The GCN is thus equivalent to a function with the node features [48,49] and the adjacency matrix as input and the node embeddings [50] as output, with the sole aim of obtaining the embeddings.

Decoder
GAE uses the inner product as the decoder to reconstruct the original graph:

$$\hat{A} = \sigma(Z Z^T) \qquad (22)$$

where $\hat{A}$ represents the reconstructed adjacency matrix and $\sigma$ is the logistic sigmoid function. A good $Z$ should make the reconstructed adjacency matrix [51] as similar as possible to the original adjacency matrix, because the adjacency matrix determines the structure of the graph. Therefore, GAE uses cross-entropy as the loss function [52] in the training process:

$$\mathcal{L} = -\frac{1}{N^2} \sum_{i=1}^{N} \sum_{j=1}^{N} \left[ y \log \hat{y} + (1 - y) \log(1 - \hat{y}) \right]$$

where $y = A(i, j)$ represents the value of an element in the adjacency matrix $A$ (0 or 1) and $\hat{y} = \hat{A}(i, j)$ represents the value of the corresponding element in the reconstructed adjacency matrix $\hat{A}$ (between 0 and 1). It can be seen from the loss function that the closer the reconstructed adjacency matrix (or reconstructed graph) is to the original adjacency matrix (or original graph), the smaller the loss.
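The encoder, decoder and loss can be put together in a short forward-pass sketch. Several choices are assumptions made for this example: self-loops are added before normalisation (which also avoids division by zero for isolated nodes), the features are one-hot (a featureless graph), the weights are random rather than trained, and the loss is averaged over all entries of A.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sym_normalize(A):
    """D^{-1/2} A D^{-1/2}; every node must have non-zero degree."""
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
    return d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]

def gcn_encoder(X, A, W0, W1):
    """Two-layer GCN encoder producing the node embeddings Z."""
    A_norm = sym_normalize(A + np.eye(A.shape[0]))  # assumption: self-loops added
    hidden = np.maximum(A_norm @ X @ W0, 0.0)       # ReLU
    return A_norm @ hidden @ W1

def decode(Z):
    """Inner-product decoder: edge probabilities sigmoid(Z Z^T)."""
    return sigmoid(Z @ Z.T)

def reconstruction_loss(A, A_hat, eps=1e-12):
    """Binary cross-entropy averaged over all entries of A."""
    return -np.mean(A * np.log(A_hat + eps) + (1 - A) * np.log(1 - A_hat + eps))

rng = np.random.default_rng(0)
A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 0],
              [1, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)
X = np.eye(4)                         # one-hot features
W0 = 0.1 * rng.normal(size=(4, 8))    # untrained, hypothetical weights
W1 = 0.1 * rng.normal(size=(8, 2))

Z = gcn_encoder(X, A, W0, W1)         # embeddings, shape (4, 2)
A_hat = decode(Z)                     # reconstructed adjacency, values in (0, 1)
loss = reconstruction_loss(A, A_hat)
```

In training, `W0` and `W1` would be updated by gradient descent on `loss`; that step is omitted here because NumPy has no automatic differentiation.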

Differences and connections between GCN and GAT
The core difference between GAT and GCN is how they collect and aggregate the feature representations of neighbouring nodes at distance 1. GAT replaces the fixed normalisation in GCN with an attention mechanism: in essence, GAT replaces the normalisation function of GCN with a neighbour feature aggregation function that uses attention weights. GAT was created to address the shortcomings of GCN.
Disadvantages of GCN include: (i) The weights assigned to different neighbours in the same-order neighbourhood are identical, and it is not possible to assign different weights to different nodes in the neighbourhood. This limits the model's ability to capture the relevance of spatial information and is the reason why it is inferior to GAT on many tasks. (ii) The way in which GCN combines the features of adjacent nodes depends on the structure of the graph, which makes the trained model relatively poor at generalising to graphs with other structures.
Benefits of GAT include: (i) Different weights can be assigned to different nodes in the neighbourhood. (ii) Once the attention mechanism is introduced, the computation depends only on neighbouring nodes (nodes that share an edge), and information about the whole graph is not needed. As a consequence, (1) the graph need not be undirected (if an edge does not exist, its attention coefficient is simply not computed), and (2) the technique is directly applicable to inductive learning, including tasks in which the model is evaluated on graphs that were completely unseen during training.
The classification process of GAT is very similar to that of GCN: both use a softmax function, a cross-entropy loss function and gradient descent. In essence, both GCN and GAT aggregate the features of neighbouring vertices into the central vertex (an aggregation operation) and exploit the local stationarity of the graph to learn new vertex feature representations.
The difference is that GCN uses the Laplacian matrix and GAT uses the attention coefficients. To a certain extent, GAT is stronger because the correlation between vertex features is better integrated into the model. In contrast to GCN, GAT is suitable for inductive tasks: its important learnable parameters are $W$ and $a(\cdot)$, which relate only to vertex features and have nothing to do with the structure of the graph. Therefore, changing the structure of the graph in a test task has little effect on GAT, and only requires changing $N_i$ and recomputing the output. GCN, in contrast, is a graph-wide calculation, in which the node features of the whole graph are updated in a single calculation. The parameters it learns are largely tied to the graph structure, which puts GCN in a difficult position for inductive tasks.
It is also important to note that, although GAT and GAN [53][54][55][56] differ by only one letter, their meanings are quite different. GAT is a kind of graph neural network, but GAN (generative adversarial network) is not. The following brief introduction to GAN gives a clearer idea of the difference between the two. A GAN [53][54][55][56] is composed of two neural networks. The first one is called the discriminator, D(Y). It takes an input Y (such as an image) and outputs a value that indicates whether Y looks "real". D(Y) can be thought of as a kind of energy function that is close to 0 when Y is a real sample and positive when Y is noisy or strange [57][58][59]. The other network is called the generator, G(Z). Here Z is usually a vector randomly sampled from a simple distribution (e.g., a Gaussian distribution [60]), and the generator G(Z) is used to generate pictures, which are then used to train the discriminator D(Y) (to give lower values to real pictures and higher values to other pictures).
When training D, we give it a real picture and let it adjust its parameters to output a lower value; we then give it a picture produced by G and let it adjust its parameters to output a larger value D(G(Z)). On the other hand, as G is trained, it adjusts its internal parameters to make the images it produces more and more realistic. That is, it keeps optimising the images it produces to fool D into thinking that they are real. This means that, for these generated images, G wants to minimise the output of D, while D wants to maximise it. The two networks have opposite aims and are in an adversarial relationship. This is called adversarial training, or GAN. In the following we explain the training process of GAN in relation to the formula. First, a generator neural network is built, with all of its parameters represented by θ. Its purpose is to generate images x, and these samples x all obey a distribution determined by θ.

Datasets
In recent years, commonly used datasets for graph neural networks include Cora, Citeseer, and PubMed. Table 2 provides a comparison of these three datasets.
(1) Cora dataset
The Cora dataset contains two files, cora.content and cora.cites. The content of cora.content is described in the following format: <paper_id> <word_attributes>+ <class_label>. The first entry in each line (paper_id) is a unique ID for each paper, the subsequent entries (word_attributes) are 1433 binary values indicating whether each word in the vocabulary is present (1) or absent (0) in the paper, and the last entry (class_label) is the class label of the paper. The file cora.cites contains the citation relations of the papers in the corpus in the format <ID of cited paper> <ID of citing paper>. Each line contains the IDs of two papers: the first entry (ID of cited paper) is the paper being cited and the second entry (ID of citing paper) is the paper that cites it.
(2) Citeseer dataset
The Citeseer dataset is a selection of papers from the CiteSeer digital library, classified into six categories: Agents, AI, DB, IR, ML, HCI. Papers were selected in such a way that each paper in the final corpus cites or is cited by at least one other paper. There are 3327 papers in the entire corpus. After stemming and the removal of stop words and of all words with a document frequency of less than 10, 3703 unique words remained.
The dataset contains two files, citeseer.content and citeseer.cites. The contents of citeseer.content are in the format <paper_id> <word_attributes>+ <class_label>. The first entry in each line (paper_id) is a unique ID for each paper, the subsequent entries (word_attributes) are 3703 binary values indicating whether each word in the vocabulary is present (1) or absent (0) in the paper, and the last entry (class_label) is the class label of the paper. The file citeseer.cites contains the citation relationships of the papers in the corpus in the format <ID of cited paper> <ID of citing paper>. Each line contains the IDs of two papers: the first entry (ID of cited paper) is the paper being cited and the second entry (ID of citing paper) is the paper that cites it.
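The two file formats described above (shared by Cora and Citeseer) can be parsed with a few lines of Python. The sketch below works on in-memory lines so that it is self-contained; the sample lines, paper IDs and label names are made up, and the decision to store an edge from the citing paper to the cited paper is an assumption about how one might orient the graph.

```python
import numpy as np

def parse_content(lines):
    """Parse '<paper_id> <word_attributes>+ <class_label>' lines."""
    ids, features, labels = [], [], []
    for line in lines:
        parts = line.split()
        if not parts:
            continue                      # skip blank lines
        ids.append(parts[0])              # IDs are kept as strings
        features.append([int(v) for v in parts[1:-1]])
        labels.append(parts[-1])
    return ids, np.array(features, dtype=np.int8), labels

def parse_cites(lines, ids):
    """Parse '<ID of cited paper> <ID of citing paper>' lines into A."""
    index = {paper_id: k for k, paper_id in enumerate(ids)}
    A = np.zeros((len(ids), len(ids)), dtype=np.int8)
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue
        cited, citing = parts
        if cited in index and citing in index:   # ignore IDs with no features
            A[index[citing], index[cited]] = 1   # edge: citing -> cited
    return A

# Hypothetical sample data with a 4-word vocabulary.
content_lines = ["p1 0 1 1 0 Theory",
                 "p2 1 0 0 0 Neural_Networks",
                 "p3 0 0 1 1 Theory"]
cites_lines = ["p1 p2", "p1 p3", "p9 p2"]        # p9 has no content entry

ids, X, labels = parse_content(content_lines)
A = parse_cites(cites_lines, ids)
```

To read the real files, the same functions can be applied to the lines of an opened file; for an undirected graph, `A` can be symmetrised afterwards with `np.maximum(A, A.T)`.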
(3) PubMed dataset
The PubMed dataset consists of 19717 scientific publications on diabetes from the PubMed database, divided into three categories: Diabetes Mellitus Experimental, Diabetes Mellitus Type 1, and Diabetes Mellitus Type 2. The citation network consists of 44,338 links. Each publication in the dataset is described by a TF-IDF weighted word vector over a dictionary of 500 unique words. TF-IDF (term frequency-inverse document frequency) is a common weighting technique used in information retrieval and data mining, where TF is the term frequency and IDF is the inverse document frequency. It is a statistical method for assessing the importance of a word to one document in a collection or corpus. The importance of a word increases in proportion to the number of times it appears in the document, but decreases with the frequency with which it appears across the corpus. The dataset consists of three files:
(1) Pubmed-Diabetes.NODE.paper.tab. Its content format is <paper_id> <label=***> <TF-IDF>. The first entry of each line (paper_id) is the unique ID of the paper, and the second entry is "label=***", where "***" indicates the category of the paper. These are followed by 500 floating point TF-IDF values in the form "word=***", where "word" is the term and "***" is the TF-IDF value of the term.
(2) Pubmed-Diabetes.GRAPH.pubmed.tab. This file is not needed here and can be ignored.
(3) Pubmed-Diabetes.DIRECTED.cites.tab.
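The TF-IDF weighting described above can be sketched in plain Python. Several TF-IDF variants are in use; the one below (relative term frequency times the natural logarithm of N divided by the document frequency) is an assumption chosen for simplicity and is not necessarily the variant used to build the PubMed features. The toy documents are made up.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenised document.

    tf  = occurrences of the term in the document / document length
    idf = log(number of documents / number of documents containing the term)
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))        # count each term once per document
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical tokenised abstracts.
docs = [["insulin", "glucose", "insulin"],
        ["glucose", "diet"],
        ["glucose", "insulin", "therapy"]]
weights = tf_idf(docs)
```

In this toy corpus "glucose" appears in every document, so its weight is 0 everywhere, while "insulin" is weighted more heavily in the first document, where it occurs twice.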

Conclusion
Firstly, this paper introduces four commonly used graph neural networks and their respective learning methods. Secondly, this paper also introduces the datasets that have been commonly used for graph neural networks in recent years.
Graph neural networks are very promising and have a wide range of applications in areas such as social network analysis, recommender systems, biomedicine and visualisation. As research progresses, graph neural network technology continues to evolve and new models and algorithms keep emerging. We can therefore foresee that graph neural networks will be used more widely and more deeply in future research and applications. However, graph neural networks currently face many challenges. (1) The challenge of processing large-scale graph data. As the scale of graph data continues to increase, how to efficiently store, sample, acquire and transmit graph data has become an important challenge for graph neural networks. (2) The challenge of model interpretability. As with other deep learning models, the black-box nature of graph neural networks makes interpretability an issue that cannot be ignored. How to better explain the decision process and results of a model, so that users can understand it and use it well, is an important challenge. (3) The challenge of data sparsity. Unlike other deep learning models, graph neural networks usually have to handle sparse data, which poses a great challenge to training and inference. How to better handle sparse data and improve the performance and efficiency of the model is another important challenge. Future research on graph neural networks will have to take these issues into account.