MIED: An Improved Graph Neural Network for Node Embedding in Heterogeneous Graphs



Introduction
Graph data, which uses nodes and edges to represent entities and their relationships, is a powerful structure for modeling complex real-world data [1,2], including social networks [3] and academic citation networks [4,5]. Computing low-dimensional embedding vectors for the nodes of a graph is an important task. Compared to other types of data, graph data provides flexibility and expressive power for handling intricate relationships [6]. To take advantage of this, a variety of graph embedding techniques [7] have been developed, such as HOPE [8] and GraRep [9], which use matrix factorization [10], and DeepWalk [11] and Node2Vec [12], which rely on random walks. Powerful models such as Graph Neural Networks (GNNs) [13] can also be used to compute embeddings. These methods are carefully designed and demonstrate strong embedding capabilities, but they primarily focus on homogeneous graph data [14].
However, these techniques encounter difficulties on real-world data, which often takes the form of heterogeneous rather than homogeneous graphs [15]. The diverse entity and relationship types in such graphs reflect the complexity of real-world systems such as movie recommendation systems [16]. Recognizing these complexities, researchers have proposed many heterogeneous graph embedding algorithms, among which metapath-based methods [17] are widely used. Metapaths are predefined sequences of node types that encapsulate high-order semantic relationships within a heterogeneous graph. By reducing complex relationships to homogeneous subgraphs, metapaths make it possible to apply the homogeneous graph embedding algorithms discussed above.
MAGNN [17], for instance, is one of the best-performing methods that integrate metapaths and an attention mechanism to achieve efficient node representation learning on heterogeneous graphs. First, it uses dense layers to transform the features of different node types into the same dimension. Then it uses metapaths to connect nodes of the same type that are not directly connected in the original graph, and aggregates the features along each metapath. When combining the results of different metapaths or their instances, it uses an attention mechanism to find appropriate weights. This method is highly adaptable: metapaths enable efficient aggregation of information in a wide range of heterogeneous graphs, and the attention mechanism assigns appropriate weights to different metapaths so as to extract as much information as possible from the graph.
Despite these advantages, metapath-based methods still have room for improvement. Current models often neglect important adjacency relationships between nodes during feature alignment. To address this limitation, we propose a novel model called Metapath-Infused Exponential Decay graph neural network (MIED) for heterogeneous graph embedding. The model has three parts: GCN-based node feature alignment, an exponential decay encoder and metapath aggregation. Specifically, MIED first uses GCN to align the node features in the heterogeneous graph, so that the aligned features have the same dimension and reside in a unified feature space. Because GCN also uses the metapath-based neighborhood information during feature alignment, the resulting graph embeddings are better suited to downstream tasks. Next, when encoding features on metapath instances, we introduce an exponential decay encoder that aggregates the features of the nodes on the metapath with an importance that varies with their distance to the starting node. Finally, MIED employs an attention mechanism for metapath aggregation, fusing the latent vectors obtained from multiple metapaths into the final node embeddings.
In summary, this work makes the following major contributions:
(1) We propose a novel GCN-based node feature alignment method for metapath-based heterogeneous graph node embedding.
(2) We design an effective encoder function for metapath instances, called the exponential decay encoder, which encodes the features on the metapath according to node importance.
(3) We conduct extensive experiments on the IMDb [18] and DBLP [19] datasets for node classification and node clustering, demonstrating that the node embeddings learned by MIED consistently outperform those generated by other state-of-the-art baselines.

Related works
This section focuses on the graph neural network and heterogeneous graph representation learning methods relevant to our research. Analyzing these methods reveals their characteristics in practical applications and provides insights for our work.

Graph Neural Networks
GNNs [13,20] are a class of machine learning models designed for processing graph data. Graph Convolutional Networks (GCNs) extend the convolution operation to graphs. However, GCN suffers from "over-smoothing" [21][22][23], where its performance decreases as the number of layers increases. Several algorithms have been proposed to address this problem or extend the GCN model. For example, Simplified Graph Convolution (SGC) [24] simplifies GCN by removing the non-linear activation functions to enhance the propagation mechanism.
GCN requires that all nodes of the graph be present during training and does not naturally generalize to newly added nodes. GraphSAGE [25] learns aggregators that sample and aggregate features from a node's neighbors, which allows it to efficiently generate node embeddings for previously unseen data.
Graph Attention Networks (GAT) [26] introduced attention-based neighborhood aggregation, allowing each node to attend to its neighbors and update its representation. GATv2 [27] modifies the order of internal operations in the attention function, which works better in some cases.

Heterogeneous Graph Embedding
The distinct structures and properties of heterogeneous graphs make the direct application of homogeneous algorithms unsuitable [28][29][30]. To address this limitation, researchers have introduced heterogeneous graph embedding techniques.
In this area, early algorithms such as Metapath2Vec [31] used metapaths to generate node sequences in heterogeneous networks, capturing complex semantic relationships. However, they primarily focused on local structures, neglecting global information. To rectify this, Heterogeneous Graph Attention Networks (HAN) [32][33][34] and ESim [35] were introduced. HAN incorporates node types and neighbor data through a hierarchical attention mechanism, enhancing the capture of node attributes and topological data. ESim provides an embedding-based similarity search framework for heterogeneous information networks, enabling effective handling of large-scale networks and capturing rich semantics.
In the analysis of heterogeneous graph networks, MAGNN [17] has demonstrated promising performance. This method uses a multi-head attention mechanism to adaptively allocate weights during learning, enabling a more comprehensive exploration of node data. MAGNN enhances graph embedding by combining node content transformation, intra-metapath aggregation and inter-metapath aggregation of neighbor features.
Furthermore, the application of knowledge graphs in online learning frameworks has been explored, with a focus on access control decision-making [36,37]. Another notable work is SMiLE, which presents a schema-augmented multi-level contrastive learning approach for knowledge graph link prediction [38].

Definition
Before formally defining heterogeneous graph embedding, we clarify the notation for the underlying concepts. Table 1 lists the symbols used throughout the discussion together with their descriptions.
Heterogeneous Graph. A heterogeneous graph is represented as $G = (V, E)$, where $V$ and $E$ represent the sets of nodes and edges. $T$ and $R$ represent the sets of node types and edge types. $\psi_V : V \to T$ and $\psi_E : E \to R$ denote the node type mapping function and the edge type mapping function, with $|T| + |R| > 2$.
Metapath. In the context of heterogeneous graphs, a metapath is a sequence of alternating node types and edge types. A metapath $P$ is formally defined by Equation (1), where $t_i$ and $r_i$ respectively represent the node types and edge types in the graph.
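For reference, a metapath is commonly written in the following form. We assume Equation (1) matches it; the symbol $l$ for the number of edge types on the path is our own notation, and `\xrightarrow` requires the amsmath package.

```latex
P = t_1 \xrightarrow{r_1} t_2 \xrightarrow{r_2} \cdots \xrightarrow{r_l} t_{l+1}
```

As a hypothetical illustration on a movie graph, the metapath Movie-Actor-Movie connects two movies that share an actor.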
Metapath-based Neighbor. For a given metapath $P$ and a node $v \in V$, the metapath-based neighbors of node $v$, denoted as $N_P(v)$, are defined as the nodes that can be reached from node $v$ through the metapath $P$.
Metapath-based Subgraph. For a node type $t$ and a given metapath $P$, we can derive a homogeneous subgraph $G^t_P = (V^t, E^t_P)$ from graph $G$. We define the adjacency matrix $A^t_P = [a_{i,j}] \in \mathbb{R}^{n_t \times n_t}$, where $n_t$ represents the number of nodes of type $t$. If node $v_i$ is adjacent to node $v_j$ with respect to metapath $P$, then $a_{i,j} = 1$; otherwise $a_{i,j} = 0$. The node feature matrix for node type $t$ is represented as $X^t \in \mathbb{R}^{n_t \times d_t}$, where $d_t$ represents the dimension of the original node features.
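The construction of the metapath-based adjacency matrix can be sketched as follows. This is a minimal illustration, assuming each edge type is stored as a 0/1 matrix between its two node types; the function name `metapath_adjacency` and the toy movie/actor data are hypothetical and not taken from the paper.

```python
import numpy as np

def metapath_adjacency(relation_matrices):
    """Compose typed 0/1 adjacency matrices along a metapath.

    relation_matrices: list of arrays; the i-th has shape
    (n_{t_i}, n_{t_{i+1}}), and the first and last node types coincide.
    Returns a binary matrix A with A[i, j] = 1 iff at least one
    metapath instance connects node i to node j.
    """
    counts = relation_matrices[0].astype(np.int64)
    for rel in relation_matrices[1:]:
        counts = counts @ rel.astype(np.int64)  # number of path instances
    return (counts > 0).astype(np.int64)

# Hypothetical toy data: 3 movies, 2 actors.
# movie_actor[i, k] = 1 if actor k plays in movie i.
movie_actor = np.array([[1, 0],
                        [1, 1],
                        [0, 1]])

# Movie-Actor-Movie metapath: movies that share at least one actor.
A_mam = metapath_adjacency([movie_actor, movie_actor.T])
print(A_mam)
```

Every movie reaches itself through any of its actors, so the diagonal is 1 in this sketch. Whether to keep these self-connections is a design choice; `np.fill_diagonal(A_mam, 0)` removes them if self-loops should come only from a later $A + I$ step.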
Heterogeneous Graph Embedding. For a node type $t$ in heterogeneous graph $G$, $X^t \in \mathbb{R}^{n_t \times d_t}$ denotes its feature matrix, in which $d_t$ represents the dimension of the node features. Heterogeneous graph embedding aims to learn $d$-dimensional node representations $H^t \in \mathbb{R}^{n_t \times d}$ (usually $d \ll d_t$) that capture structural information for all nodes of type $t$.

Methodology
Our model MIED consists of three main parts: GCN-based node feature alignment, an exponential decay encoder and metapath aggregation. It is outlined in Figure 1. The calculation process of heterogeneous graph embedding is illustrated in Algorithm 1.

GCN-based node feature alignment
In heterogeneous graphs, different types of nodes may have different feature dimensions. Even with the same dimension, different node types may have different feature distributions, which demands feature space alignment. To take the structural information of the graph into account, we propose node feature alignment using GCN.
If the adjacency matrix of a homogeneous graph is denoted as $A$ and $I$ is the identity matrix of the same size, the output of the $(l+1)$-th GCN layer, $H^{(l+1)}$, is calculated as in Equation (2), where $\tilde{A} = A + I$, $H^{t(0)}_P = X^t$, $\tilde{D}$ is the diagonal matrix with $\tilde{D}_{i,i} = \sum_j \tilde{A}_{i,j}$, and $W^{(l)}$ is the weight matrix of the $l$-th layer. $\varphi$ denotes the activation function, such as $\mathrm{ReLU}(x) = \max(0, x)$ or $\mathrm{sigmoid}(x) = \frac{1}{1 + e^{-x}}$.
For a heterogeneous graph $G$ and a node type $t$, given a metapath $P$, we can derive a homogeneous subgraph from $G$ and denote its adjacency matrix as $A^t_P$. Applying GCN to each homogeneous subgraph yields the transformed hidden states, as in Equation (3).
Given a set of metapaths $\mathcal{P}_t = \{P_1, P_2, \ldots, P_n\}$ for node type $t$, we apply GCN to each metapath as described above. We then sum the hidden states computed for each metapath to obtain the final hidden states for each node of type $t$, as shown in Equation (4), where $H^t_P$ represents the final output of the GCN layers for metapath $P$.
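The alignment step can be sketched in NumPy as follows. This is a minimal illustration that assumes the standard symmetrically normalized GCN propagation rule, ReLU as the activation, one layer per metapath, and a separate weight matrix for each metapath; the function names are hypothetical and the paper's Equations (2) to (4) may differ in detail.

```python
import numpy as np

def gcn_layer(adj, features, weight):
    """One GCN layer: ReLU(D^{-1/2} (A + I) D^{-1/2} H W)."""
    a_tilde = adj + np.eye(adj.shape[0])      # add self-loops
    deg = a_tilde.sum(axis=1)                 # D_ii = sum_j A~_ij, always >= 1
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    a_hat = a_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(a_hat @ features @ weight, 0.0)

def align_features(metapath_adjs, features, weights):
    """Sum the one-layer GCN outputs over all metapaths of one node type.

    ASSUMPTION: each metapath has its own weight matrix.
    """
    return sum(gcn_layer(adj, features, w)
               for adj, w in zip(metapath_adjs, weights))

# Toy example: two nodes of one type, linked under two metapaths.
adj = np.array([[0.0, 1.0],
                [1.0, 0.0]])
x = np.eye(2)                                 # one-hot input features
w = np.array([[1.0], [2.0]])                  # maps 2 features to 1
aligned = align_features([adj, adj], x, [w, w])
```

Because every metapath of a node type shares the same output dimension, the per-metapath outputs can be summed directly, which is what places all node types in a common feature space.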
During the alignment process, the features of each node type are transformed into the same dimension and feature space, and graph information from neighbor nodes is aggregated, enriching the information contained in the aligned results.
Regarding computational complexity, compared with the baseline MAGNN, GCN-based node feature alignment adds a GCN aggregation for each node type. Consequently, the increase in computational complexity can be expressed as $O\left(\sum_{i=1}^{|T|} N_i \cdot M_i \cdot K_i\right)$, where $|T|$ represents the number of node types, $N_i$ denotes the number of nodes of the $i$-th type, $M_i$ is the mean number of neighbors for the $i$-th type, and $K_i$ represents the number of features for the $i$-th type. In many real-world graphs, the mean number of neighbors and the number of features of each node type are bounded by a constant. Therefore, if we denote the maximum number of neighbors and features by $M$ and $K$ respectively, the total increase in computational complexity caused by GCN-based node feature alignment compared to MAGNN is $O(M \cdot K \cdot N)$, where $N$ is the total number of nodes in the graph.
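The step from the per-type cost to the final bound can be written out in one line. This is a sketch that assumes the per-type cost is $N_i M_i K_i$, with $M = \max_i M_i$, $K = \max_i K_i$ and $N = \sum_i N_i$:

```latex
\sum_{i=1}^{|T|} N_i M_i K_i \;\le\; M K \sum_{i=1}^{|T|} N_i \;=\; M K N
```

When $M$ and $K$ are bounded by constants, the added cost therefore grows linearly with the number of nodes.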

Exponential Decay Encoder
After feature alignment, feature aggregation on the metapath must take into account that nodes farther from the starting point are less relevant. Some proposed methods use an average encoder, a linear encoder or an RNN-based encoder, but they are not effective at capturing the decay of node weights along a metapath because of the long gradient backpropagation path. MAGNN's relational rotation encoder [39] mitigates this to some extent, but it increases computational complexity and leaves room for improvement.
We propose an exponential decay encoder (EDE) for better feature aggregation on a metapath. Given a metapath $P$ for node type $t$, the encoder function is $f_\theta(P(v, u))$, where $P(v, u)$ denotes the metapath instance from target node $v$ to a metapath-based neighbor $u \in N_P(v)$. The aggregation process is given in Equation (5), with $h^{(i)}_P \in \mathbb{R}^d$ the aligned feature vector of the $i$-th node on $P(v, u)$ from $v$ to $u$, and $\alpha$ the decay parameter.
Algorithm 1 (continued).
    for $i = 1, 2, \ldots, I$ do
    [...]
    for node type $t \in T$ do
12:   for metapath $P \in \mathcal{P}_t$ do
13:     for $v \in V^t$ do
14:       Calculate $h^l_{P(v,u)}$ for all $u \in N_P(v)$ using EDE, see Equation (5);
15:       Calculate $\beta_{P(v,u)}$, see Equation (7);
16:       Combine the extracted metapath instances with multi-head attention: [...]
17:     end for
      end for
20:   Calculate $\gamma_P$ for each metapath $P \in \mathcal{P}_t$, see Equation (12), and get: [...]
21:   [...]
    end for
23: Layer output projection: [...]
To reduce the impact of different feature magnitudes on EDE, the aligned features can also be normalized. Many normalization methods can be used, such as Min-Max normalization and Z-score normalization [40]. During actual training, the choice of method can be treated as a hyperparameter so that the most suitable normalization is selected.
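The following sketch illustrates the idea of a decay-weighted encoder with optional Z-score normalization. It is not the paper's Equation (5): we assume that the $i$-th node on the instance (counting from the starting node, $i = 0$) receives weight $\alpha^i$ and that the weights are normalized to sum to one. The function names are hypothetical.

```python
import numpy as np

def z_score(features, eps=1e-12):
    """Column-wise Z-score normalization of an (n_nodes, d) matrix."""
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    return (features - mean) / (std + eps)

def exponential_decay_encode(path_features, alpha):
    """Encode one metapath instance as a decay-weighted average.

    path_features: (n, d) array; row i holds the aligned features of
    the i-th node on the instance, row 0 being the starting node v.
    ASSUMPTION: node i gets weight alpha**i and the weights are
    normalized to sum to one.
    """
    acc = np.zeros(path_features.shape[1])
    weight, total = 1.0, 0.0
    for h in path_features:   # running sum, no per-node intermediates kept
        acc += weight * h
        total += weight
        weight *= alpha
    return acc / total

# Toy instance with two nodes and two aligned features.
path = np.array([[1.0, 0.0],
                 [0.0, 1.0]])
encoded = exponential_decay_encode(path, alpha=0.5)
```

Under this assumed form, $\alpha = 1$ reduces to the average encoder, and smaller values of $\alpha$ concentrate the encoding on the nodes closest to the starting node.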
The computational complexity of EDE on a metapath instance with $n$ nodes is $\sum_{i=1}^{n}$ [...], where $K$ represents the number of aligned features.
In contrast, the comparable relational rotation encoder of the baseline model has a computational complexity of [...] $K$ [...]. EDE thus has an increase of $O(n^2)$, but $n$ is the length of a metapath, which is usually small and much smaller than $K$. Additionally, EDE reduces the data storage space from $O(K \cdot n)$ to $O(1)$ compared with the relational rotation encoder.

Metapath Aggregation
After aggregating the features of the nodes on each metapath instance into $h_{P(v,u)}$, we aggregate all metapath instances linked to node $v$. Appropriate weights must be assigned to each metapath instance's vector representation, as the instances influence the target node $v$ differently. A graph attention layer is applied to the metapath instances related to $v$, allowing the model to find optimal weights. In the equations, $\|$ denotes vector concatenation.
$$\beta_{P(v,u)} = \frac{\exp(e_{P(v,u)})}{\sum_{s \in N_P(v)} \exp(e_{P(v,s)})} \qquad (7)$$
Similarly, this method can be extended to incorporate a multi-head attention mechanism, as shown in Equation (9). Here, $K$ represents the number of attention heads and $\beta^k_{P(v,u)}$ represents the relative contribution of the $k$-th attention head.
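A single attention head over the metapath instances of one target node can be sketched as below. The scoring function is an assumption: we use a MAGNN-style score, LeakyReLU applied to the dot product of an attention vector with the concatenation $[h_v \| h_{P(v,u)}]$. The names `aggregate_instances` and `attn_vec` are hypothetical. The softmax step follows Equation (7).

```python
import numpy as np

def leaky_relu(x, slope=0.01):
    return np.where(x > 0, x, slope * x)

def aggregate_instances(h_target, h_instances, attn_vec):
    """Attention-weighted sum of the metapath instances of one node.

    h_target:    (d,)   features of the target node v
    h_instances: (m, d) encoded instances h_{P(v,u)}, one row per u
    attn_vec:    (2d,)  attention parameters (hypothetical name)
    ASSUMPTION: score e = LeakyReLU(attn_vec . [h_v || h_{P(v,u)}]).
    """
    m = h_instances.shape[0]
    concat = np.concatenate([np.tile(h_target, (m, 1)), h_instances], axis=1)
    scores = leaky_relu(concat @ attn_vec)
    scores = scores - scores.max()            # stabilizes the softmax
    beta = np.exp(scores) / np.exp(scores).sum()
    return beta, beta @ h_instances

h_v = np.array([1.0, 0.0])
h_inst = np.array([[1.0, 0.0],
                   [0.0, 1.0]])
beta, h_agg = aggregate_instances(h_v, h_inst, attn_vec=np.zeros(4))
```

With a zero attention vector all scores are equal, so the weights are uniform and the result is the plain mean of the instances. A multi-head variant would run several such heads with independent parameters and combine their outputs.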
After aggregating the instances of the metapath $P$ for node $v$, we need to further aggregate the information over the metapath set $\mathcal{P}_t = \{P_1, P_2, \ldots, P_n\}$ for node type $t$. For a node $v$, we denote $\{h_v^{P_1}, h_v^{P_2}, \ldots, h_v^{P_m}\}$ as the aggregated representation of each metapath, where $m$ represents the number of metapaths corresponding to its node type $t$. Since the metapaths contribute differently, we apply an attention mechanism to find their weights.
First, for each metapath $P \in \mathcal{P}_t$, we average the representations of all nodes of type $t$, as shown in Equation (10). Here, $|V^t|$ represents the number of nodes of type $t$, and $M_t \in \mathbb{R}^{d_m \times d}$ and $b_t \in \mathbb{R}^{d_m}$ are the weight matrix and bias vector. We denote the dimension of the parameterized attention vector as $d_m$.
Then the attention mechanism can be represented as follows, where $q_t \in \mathbb{R}^{d_m}$ is the parameterized attention vector. By summing all the weighted vectors, we calculate the aggregation of information for node $v$.
$$e_P = q_t^{\top} \cdot s_P \qquad (11)$$
$$\gamma_P = \frac{\exp(e_P)}{\sum_{P_i \in \mathcal{P}_t} \exp(e_{P_i})} \qquad (12)$$
Finally, we use a linear layer followed by a nonlinear activation function to map the aggregated information for node $v$ into the desired dimension of the vector feature space. This is represented by Equation (14), where $\sigma$ is the activation function and $W_o \in \mathbb{R}^{d_o \times d}$ is the weight matrix. The entire equation can be seen as the output layer that connects to downstream tasks.
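The metapath-level attention can be sketched as follows. Equations (11) and (12) are implemented directly; the summary vector $s_P$ is an assumption, computed here as the node-wise average of $\tanh(M_t h_v^P + b_t)$ in the style of MAGNN, since the corresponding equation is not reproduced in the text. The function name is hypothetical.

```python
import numpy as np

def aggregate_metapaths(h_by_metapath, M, b, q):
    """Fuse the per-metapath node representations of one node type.

    h_by_metapath: (m, n, d) array: m metapaths, n nodes, d features
    M: (d_m, d) weight matrix, b: (d_m,) bias, q: (d_m,) attention vector
    ASSUMPTION: tanh is the nonlinearity inside the summary s_P.
    """
    # s_P: average over nodes of the transformed vectors, shape (m, d_m)
    s = np.tanh(h_by_metapath @ M.T + b).mean(axis=1)
    e = s @ q                                  # e_P = q^T s_P
    e = e - e.max()                            # stabilizes the softmax
    gamma = np.exp(e) / np.exp(e).sum()        # softmax over metapaths
    fused = np.tensordot(gamma, h_by_metapath, axes=1)  # (n, d)
    return gamma, fused

rng = np.random.default_rng(0)
h = rng.normal(size=(2, 5, 4))                 # 2 metapaths, 5 nodes, d = 4
M = rng.normal(size=(3, 4))                    # d_m = 3
gamma, fused = aggregate_metapaths(h, M, b=np.zeros(3), q=np.zeros(3))
```

Note that the weights $\gamma_P$ are shared by all nodes of a type: one scalar per metapath, not one per node.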

Training
After obtaining the embedding representation $h_v$ for each node, we use the node labels to perform backpropagation and gradient descent, minimizing the cross-entropy to optimize the model weights. The loss function is given by Equation (15), where $C$ represents the number of classes, $y_v[c]$ is the $c$-th entry of the one-hot label vector of node $v$, and $h_v[c]$ is the predicted probability of class $c$, which enters the loss as $\log(h_v[c])$.
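For reference, the standard cross-entropy over labeled nodes has the form below. We assume Equation (15) matches it; the symbol $V_L$ for the set of labeled nodes is our own notation.

```latex
\mathcal{L} = -\sum_{v \in V_L} \sum_{c=1}^{C} y_v[c] \, \log\big(h_v[c]\big)
```

Because $y_v$ is one-hot, only the term of the true class contributes for each node, so the loss is the negative log-probability the model assigns to the correct label.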
The MIED forward propagation algorithm, given in Algorithm 1, produces node embeddings from a heterogeneous graph $G$ with its nodes and edges.
As inputs, the algorithm receives the heterogeneous graph $G$, the node types $T$, the metapaths $\mathcal{P}_t$ for each node type $t$, the initial node features $X^t$, the number of attention heads $K$, the number of layers $L$, the number of GCN layers $I$, and the exponential decay weight $\alpha$. The goal is to compute the node embedding $h_v$ for every node $v$ in $G$.
For every node type $t$, lines 1-9 of Algorithm 1 initialize the feature matrix for each related metapath $P$ using $X^t$ and then refine it over $I$ GCN layers. The formula for $H^{t(i)}_P$ (line 5 of Algorithm 1) is the GCN transformation, which combines the adjacency and degree matrices with the node features to capture the graph structure. After iterating over all metapaths, the aligned node feature matrix is produced in line 8 by summing the outputs.
Lines 10-24 of Algorithm 1 iterate over $L$ layers. Within each layer, for every node $v$ of type $t$, the algorithm calculates features for all its neighbors using the metapath and the exponential decay encoder (EDE), as detailed in line 14 of Algorithm 1. The attention mechanism is then invoked in lines 15-17 to aggregate these features into a single representation for the node. This ensures that more significant neighboring nodes, as described by the metapath, have a stronger influence on the node's representation.
After aggregating the features for each metapath, the algorithm computes the weights $\gamma_P$ for every metapath in line 20 of Algorithm 1. These weights are then used in line 21 to combine the embeddings across metapaths, forming a complete representation for each node type. Lastly, line 26 of Algorithm 1 applies a dense layer to refine these embeddings, making them suitable for downstream tasks.
By combining information from various metapaths in a heterogeneous graph and using both GCN and attention mechanisms, this algorithm derives rich and informative node embeddings.

Experiment
In this section, we apply MIED to two datasets and compare it with baselines on the node classification and node clustering tasks. We also try different normalization strategies and different values of the hyperparameter $\alpha$ of the exponential decay encoder to further explore their impact.
Dataset. We experiment with two heterogeneous graph datasets, following the base model MAGNN. The detailed information of the two datasets is shown in Table 2 and the schemas of the datasets are shown in Figure 2. IMDb is a database about movies and television programs and has three kinds of labels: [...] [41], and we use a subset of the dataset extracted by [17]. The authors in DBLP have four kinds of labels: Database, Data Mining, Artificial Intelligence and Information Retrieval. We divide the DBLP dataset into training, validation, and testing sets with 400, 400 and 3257 nodes, respectively. We use one-hot id vectors as input features for nodes with no features in these datasets. We conduct node classification and node clustering tasks on these two datasets to evaluate the performance of our model.

Experimental Settings
Baselines and Hyperparameters. We compare MIED with various types of graph embedding models, including MAGNN. These models include homogeneous models such as node2vec, GCN and GAT, as well as heterogeneous models such as ESim, metapath2vec, HERec and HAN. For MIED, we use the same settings and metapaths as MAGNN where applicable. We set the dropout rate to 0.5 and the learning rate to 0.005. The number of attention heads is set to 8 and the dimension of the attention vectors is set to 128. We set the dimension of the aligned features to 64 for MAGNN and MIED. For the exponential decay encoder, we conduct a grid search on the decay parameter $\alpha$, using both original and Z-score normalized inputs, to find the optimal model.

Node Clustering
We use the same split into training, validation and testing sets and use the K-Means [42] algorithm to cluster the embeddings of labeled nodes, with the number of clusters set to the number of classes of each dataset. Normalized mutual information (NMI) [43] and adjusted Rand index (ARI) [44] are used as the evaluation metrics. MIED is tested over 10 runs, and we report the average results in Table 3. MIED outperforms MAGNN and the other baselines on all metrics. Compared with the base model, MIED improves by a maximum of 7.23% and a minimum of 1.27%. For the IMDb dataset, the best result is attained with a decay value of 2/3 on the original input. For the DBLP dataset, the best result is attained with a decay value of 2/3 on Z-score normalized input.
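The clustering evaluation can be sketched with scikit-learn as below. This is an illustration under assumptions: the ten runs are taken here to be K-Means runs with different random seeds on fixed embeddings, the helper name `clustering_scores` is hypothetical, and the blob data merely stands in for learned node embeddings.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

def clustering_scores(embeddings, labels, n_runs=10):
    """Average NMI and ARI of K-Means over several random seeds."""
    n_classes = len(np.unique(labels))
    nmi, ari = [], []
    for seed in range(n_runs):
        km = KMeans(n_clusters=n_classes, n_init=10, random_state=seed)
        pred = km.fit_predict(embeddings)
        nmi.append(normalized_mutual_info_score(labels, pred))
        ari.append(adjusted_rand_score(labels, pred))
    return float(np.mean(nmi)), float(np.mean(ari))

# Synthetic stand-in for node embeddings: three well-separated blobs.
rng = np.random.default_rng(0)
centers = ([0.0, 0.0], [10.0, 10.0], [-10.0, 10.0])
emb = np.vstack([rng.normal(c, 0.1, size=(20, 2)) for c in centers])
lab = np.repeat([0, 1, 2], 20)
nmi, ari = clustering_scores(emb, lab, n_runs=3)
```

Both metrics are invariant to permutations of the cluster ids, so no alignment between K-Means cluster indices and class labels is needed.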

Node Classification
To evaluate the embeddings of labeled nodes generated by each model, we use a linear support vector machine (SVM) [45] to classify them with varying training proportions. We test MIED over 10 runs and report the averaged Macro-F1 and Micro-F1, as shown in Table 4. MIED performs best across all metrics. The largest improvement over the base model, 1.97%, occurs on the IMDb dataset with a training proportion of 20%. For the IMDb dataset, the best result is attained with a decay value of 1/2 on the original input. For the DBLP dataset, the best result is attained with a decay value of 1/3 on the original input.
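The classification protocol can be sketched with scikit-learn as follows. This is a hedged illustration: the stratified random splits, the default `LinearSVC` settings and the helper name `svm_f1` are our assumptions, and the blob data stands in for learned node embeddings.

```python
import numpy as np
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

def svm_f1(embeddings, labels, train_ratio, n_runs=10):
    """Average Macro-F1 and Micro-F1 of a linear SVM over random splits."""
    macro, micro = [], []
    for seed in range(n_runs):
        x_tr, x_te, y_tr, y_te = train_test_split(
            embeddings, labels, train_size=train_ratio,
            stratify=labels, random_state=seed)
        pred = LinearSVC().fit(x_tr, y_tr).predict(x_te)
        macro.append(f1_score(y_te, pred, average="macro"))
        micro.append(f1_score(y_te, pred, average="micro"))
    return float(np.mean(macro)), float(np.mean(micro))

# Synthetic stand-in for node embeddings: three well-separated blobs.
rng = np.random.default_rng(0)
centers = ([0.0, 0.0], [10.0, 10.0], [-10.0, 10.0])
emb = np.vstack([rng.normal(c, 0.1, size=(20, 2)) for c in centers])
lab = np.repeat([0, 1, 2], 20)
macro_f1, micro_f1 = svm_f1(emb, lab, train_ratio=0.2, n_runs=3)
```

Calling `svm_f1` once per training proportion (for example 0.2, 0.4, 0.6 and 0.8) gives one pair of scores per column of a results table.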

Module Analysis
To further explore the effect of the two modules, we train a model that adds only the GCN module to MAGNN, on the same data as in the node clustering and node classification experiments, and collect the results for comparison and analysis. We average the node classification scores over the different training proportions and show them in Table 5.
As shown in the table, after incorporating the GCN method into MAGNN, the model performs better across all metrics. This shows that replacing the dense layer with GCN is effective. After adding the EDE, the performance of the model improves further: of the 8 sets of comparative data, 7 show further improvement, which shows that EDE is also effective. Overall, the table shows that our methods enhance metapath-based heterogeneous graph embedding.

Parameter Analysis
We also inspect the influence of the hyperparameters of the two modules. For GCN, we use one layer in our best result. We also experimented with two GCN layers, but the performance of the model gets worse. We believe the reason is that with two (or more) GCN layers, the model considers neighbors that are two (or more) hops away in the homogeneous graphs derived from the metapaths, and these neighbors are much farther away in the original heterogeneous graph.
For the exponential decay encoder, an important parameter is $\alpha$, and data normalization is also an important factor, so we run more experiments to inspect the influence of these two factors. We search for the best $\alpha$ in the set $\{1/3, 1/2, 2/3, 1\}$ on both the original input and Z-score normalized input. The resulting curves (see Figure 3) show that the model works relatively well in the middle part of the curves. Therefore, when selecting $\alpha$ we recommend choosing a value in $[1/2, 2/3]$ or performing a grid search on $(0, 1)$.

Conclusions and future work
In this paper, we propose MIED, which contains two modules that enhance the node embeddings of heterogeneous graphs: (1) metapath-based GCN in the feature alignment, which incorporates graph information; (2) the EDE, which distinguishes the importance of different nodes when aggregating features on the metapath. Our comparative experiments demonstrate the effectiveness of our methods.
We also analyze the computational complexity of the two proposed methods. Both increase the computational complexity, but the increase is manageable relative to the gains they bring. Metapath-based methods face some scalability challenges when applied to large-scale graphs, but these can be controlled by limiting the number and length of metapaths. After all, the value of information along very long metapaths diminishes rapidly.
As heterogeneous graphs play an important role in fields such as multi-mode heterogeneous graph recommendation, we will apply our methods to graphs in these fields in the future, with the aim of further improving the performance of these models.

Figure 1. Framework of MIED (GCN feature transformations for yellow and blue nodes are executed but not illustrated for clarity).

Figure 2. Schemas of the two heterogeneous graph datasets used in the experiments.

Figure 3. NMI and mean Macro-F1 of IMDb and DBLP with different settings.

EAI Endorsed Transactions on Scalable Information Systems 2023 | Volume 10 | Issue 6

Table 1. Symbols and Descriptions.
Symbol            Description
[...]             Aggregated information representation of node $v$ for node type $t$
$\tilde{A}$       Adjusted adjacency matrix
$\tilde{D}$       Diagonal matrix corresponding to $\tilde{A}$
$H^{(l+1)}$       Output of the $(l+1)$-th GCN layer
$W^{(l)}$         Weights of the $l$-th GCN layer
$\alpha$          Decay parameter in the exponential decay encoder

Algorithm 1: MIED forward propagation.
Input: The heterogeneous graph $G = (V, E)$, node types $T = \{t_1, t_2, \ldots, t_{|T|}\}$, metapaths $\mathcal{P}_t = \{P_1, P_2, \ldots, P_{|\mathcal{P}_t|}\}$ for node type $t$, node features $\{X^t, \forall t \in T\}$, the number of attention heads $K$, the number of layers $L$, the number of GCN layers $I$, the exponential decay weight $\alpha$
Output: The node embeddings $\{h_v, \forall v \in V\}$
1: for node type $t \in T$ do
[...]
4: [...]

Table 3. Experimental results (%) on the datasets for the node clustering task (n2vec is short for node2vec and m2vec is short for metapath2vec in the table).

Table 4. Experimental results (%) on the datasets for the node classification task (n2vec is short for node2vec and m2vec is short for metapath2vec in the table).

Table 5. The effect of the two modules on the model.