Articles - Social Network Analysis

Network Analysis and Manipulation using R

  |   31123  |  Post a comment  |  Social Network Analysis

This chapter describes how to manipulate and analyze a network graph in R using the tidygraph package.

The tidygraph package provides a tidy framework to easily manipulate different types of relational data, including: graph, network and trees.

In the tidygraph framework, network data are considered as two tidy data tables, one describing the node data and the other is for edge data. The package provides a simple solution to switch between the two tables and provides dplyr verbs for manipulating them.

You will learn methods for detecting important or central entities in a network graph. We’ll also introduce how to detect community (or cluster) in a network.

Contents:


Load required packages

  • tidyverse for general data manipulation and visualization.
  • tidygraph for manipulating and analyzing network graphs.
  • ggraph for visualizing network objects created using the tidygraph package.
library(tidyverse)
library(tidygraph)
library(ggraph)

Create network objects

Key R functions:

  • tbl_graph(). Creates a network object from nodes and edges data
  • as_tbl_graph(). Converts network data and objects to a tbl_graph network.

Demo data set: phone.call2 data [in the navdata R package], which is a list containing the nodes and the edges list prepared in the chapter @ref(network-visualization-essentials).

Use tbl_graph

  • Create a tbl_graph network object using the phone call data:
library("navdata")
data("phone.call2")
phone.net <- tbl_graph(
  nodes = phone.call2$nodes, 
  edges = phone.call2$edges,
  directed = TRUE
  )
  • Visualize:
ggraph(phone.net, layout = "graphopt") + 
  geom_edge_link(width = 1, colour = "lightgray") +
  geom_node_point(size = 4, colour = "#00AFBB") +
  geom_node_text(aes(label = label), repel = TRUE)+
  theme_graph()

Use as_tbl_graph R function

One can also use the as_tbl_graph() function to converts the following data structure and network objects:

  • data.frame, list and matrix data [R base]
  • igraph network objects [igraph package]
  • network network objects [network pakage]
  • dendrogram and hclust [stats package]
  • Node [data.tree package]
  • phylo and evonet [ape package]
  • graphNEL, graphAM, graphBAM [graph package (in Bioconductor)]

In the following example, we’ll create a correlation matrix network graph. The mtcars data set will be used.

Compute the correlation matrix between cars using the corrr package:

    1. Use the mtcars data set
    1. Compute the correlation matrix: correlate()
    1. Convert the upper triangle to NA: shave()
    1. Stretch the correlation data frame into long format
    1. Keep only high correlation
library(corrr)
res.cor <- mtcars [, c(1, 3:6)] %>%  # (1)
  t() %>% correlate() %>%            # (2)
  shave(upper = TRUE) %>%            # (3)
  stretch(na.rm = TRUE) %>%          # (4)
  filter(r >= 0.998)                 # (5)
res.cor
## # A tibble: 59 x 3
##               x             y     r
##                     
## 1     Mazda RX4 Mazda RX4 Wag 1.000
## 2     Mazda RX4      Merc 230 1.000
## 3     Mazda RX4      Merc 280 0.999
## 4     Mazda RX4     Merc 280C 0.999
## 5     Mazda RX4    Merc 450SL 0.998
## 6 Mazda RX4 Wag      Merc 230 1.000
## # ... with 53 more rows

Create the correlation network graph:

set.seed(1)
cor.graph <- as_tbl_graph(res.cor, directed = FALSE)
ggraph(cor.graph) + 
  geom_edge_link() + 
  geom_node_point() +
  geom_node_text(
    aes(label = name), size = 3, repel = TRUE
    ) +
  theme_graph()

Network graph manipulation

With the tidygraph package, you can easily manipulate the nodes and the edges data in the network graph object using dplyr verbs. For example, you can add new columns or rename columns in the nodes/edges data.

You can also filter and arrange the data. Note that, applying filter()/slice() on node data will remove the edges terminating at the removed nodes.

In this section we’ll manipulate the correlation network graph.

  1. Modify the nodes data:
    1. Group the cars by the “cyl” variable (number of cylinders) in the original mtcars data set. We’ll color the cars by groups.
    1. Join the group info to the nodes data
    1. Rename the column “name”, in the nodes data, to “label”

You can use the dplyr verbs as follow:

# Car groups info
cars.group <- data_frame(
  name = rownames(mtcars),
  cyl = as.factor(mtcars$cyl)
)
# Modify the nodes data
cor.graph <- cor.graph %>%
  activate(nodes) %>%
  left_join(cars.group, by = "name") %>%
  rename(label = name)
  1. Modify the edge data. Rename the column “r” to “weight”.
cor.graph <- cor.graph %>%
  activate(edges) %>%
  rename(weight = r)
  1. Display the final modified graphs object:
cor.graph
## # A tbl_graph: 24 nodes and 59 edges
## #
## # An undirected simple graph with 3 components
## #
## # Edge Data: 59 x 3 (active)
##    from    to weight
##      
## 1     1     2  1.000
## 2     1    20  1.000
## 3     1     8  0.999
## 4     1     9  0.999
## 5     1    11  0.998
## 6     2    20  1.000
## # ... with 53 more rows
## #
## # Node Data: 24 x 2
##           label    cyl
##            
## 1     Mazda RX4      6
## 2 Mazda RX4 Wag      6
## 3    Datsun 710      4
## # ... with 21 more rows
  1. Visualize the correlation network.
  • Change the edges width according to the variable weight
  • Scale the edges width by setting the minimum width to 0.2 and the maximum to 1.
  • Change the color of cars (nodes) according to the grouping variable cyl.
set.seed(1)
ggraph(cor.graph) + 
  geom_edge_link(aes(width = weight), alpha = 0.2) + 
  scale_edge_width(range = c(0.2, 1)) +
  geom_node_point(aes(color = cyl), size = 2) +
  geom_node_text(aes(label = label), size = 3, repel = TRUE) +
  theme_graph()

Network analysis

In this sections, we described methods for detecting important or central entities in a network graph. We’ll also introduce how to detect community (or cluster) in a network.

Centrality

Centrality is an important concept when analyzing network graph. The centrality of a node / edge measures how central (or important) is a node or edge in the network.

We consider an entity important, if he has connections to many other entities. Centrality describes the number of edges that are connected to nodes.

There many types of scores that determine centrality. One of the famous ones is the pagerank algorithm that was powering Google Search in the beginning.

Examples of common approaches of measuring centrality include:

  • betweenness centrality. The betweenness centrality for each nodes is the number of the shortest paths that pass through the nodes.

  • closeness centrality. Closeness centrality measures how many steps is required to access every other nodes from a given nodes. It describes the distance of a node to all other nodes. The more central a node is, the closer it is to all other nodes.

  • eigenvector centrality. A node is important if it is linked to by other important nodes. The centrality of each node is proportional to the sum of the centralities of those nodes to which it is connected. In general, nodes with high eigenvector centralities are those which are linked to many other nodes which are, in turn, connected to many others (and so on).

  • Hub and authority centarlities are generalization of eigenvector centrality. A high hub node points to many good authorities and a high authority node receives from many good hubs.

The tidygraph package contains more than 10 centrality measures, prefixed with the term centrality_. These measures include:

centrality_authority()
centrality_betweenness()
centrality_closeness()
centrality_hub()
centrality_pagerank()
centrality_eigen()
centrality_edge_betweenness()

All of these centrality functions returns a numeric vector matching the nodes (or edges in the case of `centrality_edge_betweenness()).

In the following examples, we’ll use the phone call network graph. We’ll change the color and the size of nodes according to their values of centrality.

set.seed(123)
phone.net %>%
  activate(nodes) %>%
  mutate(centrality = centrality_authority()) %>% 
  ggraph(layout = "graphopt") + 
  geom_edge_link(width = 1, colour = "lightgray") +
  geom_node_point(aes(size = centrality, colour = centrality)) +
  geom_node_text(aes(label = label), repel = TRUE)+
  scale_color_gradient(low = "yellow", high = "red")+
  theme_graph()

For a given problem at hand, you can test the different centrality score to decide which centrality measure makes most sense for your specific question.

Clustering

Clustering is a common operation in network analysis and it consists of grouping nodes based on the graph topology.

It’s sometimes referred to as community detection based on its commonality in social network analysis.

Many clustering algorithms from are available in the tidygraph package and prefixed with the term group_. These include:

  • Infomap community finding. It groups nodes by minimizing the expected description length of a random walker trajectory. R function: group_infomap()
  • Community structure detection based on edge betweenness. It groups densely connected nodes. R function: group_edge_betweenness().

In the following example, we’ll use the correlation network graphs to detect clusters or communities:

set.seed(123)
cor.graph %>%
  activate(nodes) %>%
   mutate(community = as.factor(group_infomap())) %>% 
  ggraph(layout = "graphopt") + 
  geom_edge_link(width = 1, colour = "lightgray") +
  geom_node_point(aes(colour = community), size = 4) +
  geom_node_text(aes(label = label), repel = TRUE)+
  theme_graph()

Three communities are detected.

Read more