Network Analysis and Manipulation using R

kassambara | 28/11/2017 | 30849 | Post a comment | Social Network Analysis

This chapter describes how to manipulate and analyze a network graph in R using the tidygraph package.

The tidygraph package provides a tidy framework to easily manipulate different types of relational data, including: graph, network and trees.

In the tidygraph framework, network data are considered as two tidy data tables, one describing the node data and the other is for edge data. The package provides a simple solution to switch between the two tables and provides dplyr verbs for manipulating them.

You will learn methods for detecting important or central entities in a network graph. We’ll also introduce how to detect community (or cluster) in a network.

Contents:

Load required packages
Create network objects
- Use tbl_graph
- Use as_tbl_graph R function
Print out a network object
Network graph manipulation
Network analysis
- Centrality
- Clustering
Read more

The Book:

Network Analysis and Visualization in R: Quick Start Guide

Load required packages

tidyverse for general data manipulation and visualization.
tidygraph for manipulating and analyzing network graphs.
ggraph for visualizing network objects created using the tidygraph package.

library(tidyverse)
library(tidygraph)
library(ggraph)

Create network objects

Key R functions:

tbl_graph(). Creates a network object from nodes and edges data
as_tbl_graph(). Converts network data and objects to a tbl_graph network.

Demo data set: phone.call2 data [in the navdata R package], which is a list containing the nodes and the edges list prepared in the chapter @ref(network-visualization-essentials).

Use tbl_graph

Create a tbl_graph network object using the phone call data:

library("navdata")
data("phone.call2")
phone.net <- tbl_graph(
  nodes = phone.call2$nodes, 
  edges = phone.call2$edges,
  directed = TRUE
  )

Visualize:

ggraph(phone.net, layout = "graphopt") + 
  geom_edge_link(width = 1, colour = "lightgray") +
  geom_node_point(size = 4, colour = "#00AFBB") +
  geom_node_text(aes(label = label), repel = TRUE)+
  theme_graph()

Use as_tbl_graph R function

One can also use the as_tbl_graph() function to converts the following data structure and network objects:

data.frame, list and matrix data [R base]
igraph network objects [igraph package]
network network objects [network pakage]
dendrogram and hclust [stats package]
Node [data.tree package]
phylo and evonet [ape package]
graphNEL, graphAM, graphBAM [graph package (in Bioconductor)]

In the following example, we’ll create a correlation matrix network graph. The mtcars data set will be used.

Compute the correlation matrix between cars using the corrr package:

1. Use the mtcars data set
1. Compute the correlation matrix: correlate()
1. Convert the upper triangle to NA: shave()
1. Stretch the correlation data frame into long format
1. Keep only high correlation

library(corrr)
res.cor <- mtcars [, c(1, 3:6)] %>%  # (1)
  t() %>% correlate() %>%            # (2)
  shave(upper = TRUE) %>%            # (3)
  stretch(na.rm = TRUE) %>%          # (4)
  filter(r >= 0.998)                 # (5)
res.cor

## # A tibble: 59 x 3
##               x             y     r
##                     
## 1     Mazda RX4 Mazda RX4 Wag 1.000
## 2     Mazda RX4      Merc 230 1.000
## 3     Mazda RX4      Merc 280 0.999
## 4     Mazda RX4     Merc 280C 0.999
## 5     Mazda RX4    Merc 450SL 0.998
## 6 Mazda RX4 Wag      Merc 230 1.000
## # ... with 53 more rows

Create the correlation network graph:

set.seed(1)
cor.graph <- as_tbl_graph(res.cor, directed = FALSE)
ggraph(cor.graph) + 
  geom_edge_link() + 
  geom_node_point() +
  geom_node_text(
    aes(label = name), size = 3, repel = TRUE
    ) +
  theme_graph()

Print out a network object

cor.graph

## # A tbl_graph: 24 nodes and 59 edges
## #
## # An undirected simple graph with 3 components
## #
## # Node Data: 24 x 1 (active)
##                name
##               
## 1         Mazda RX4
## 2     Mazda RX4 Wag
## 3        Datsun 710
## 4    Hornet 4 Drive
## 5 Hornet Sportabout
## 6           Valiant
## # ... with 18 more rows
## #
## # Edge Data: 59 x 3
##    from    to     r
##     
## 1     1     2 1.000
## 2     1    20 1.000
## 3     1     8 0.999
## # ... with 56 more rows

The output shows:

a tbl_graph object with 24 nodes and 59 edges. Nodes are the car names and the edges are the correlation links.
the first six rows of “Node Data”" and the first three of “Edge Data”.
that the Node Data is active.

The notion of an active tibble within a tbl_graph object makes it possible to manipulate the data in one tibble at a time. The nodes tibble is activated by default, but you can change which tibble is active with the activate() function.

If you want to rearrange the rows in the edges tibble to list those with the highest “r” first, you could use activate() and then arrange(). For example, type the following R code:

cor.graph %>% 
  activate(edges) %>% 
  arrange(desc(r))

Note that, to extract the current active data as a tibble, you can use the function as_tibble(cor.graph).

Network graph manipulation

With the tidygraph package, you can easily manipulate the nodes and the edges data in the network graph object using dplyr verbs. For example, you can add new columns or rename columns in the nodes/edges data.

You can also filter and arrange the data. Note that, applying filter()/slice() on node data will remove the edges terminating at the removed nodes.

In this section we’ll manipulate the correlation network graph.

Modify the nodes data:

1. Group the cars by the “cyl” variable (number of cylinders) in the original mtcars data set. We’ll color the cars by groups.
1. Join the group info to the nodes data
1. Rename the column “name”, in the nodes data, to “label”

You can use the dplyr verbs as follow:

# Car groups info
cars.group <- data_frame(
  name = rownames(mtcars),
  cyl = as.factor(mtcars$cyl)
)
# Modify the nodes data
cor.graph <- cor.graph %>%
  activate(nodes) %>%
  left_join(cars.group, by = "name") %>%
  rename(label = name)

Modify the edge data. Rename the column “r” to “weight”.

cor.graph <- cor.graph %>%
  activate(edges) %>%
  rename(weight = r)

Display the final modified graphs object:

cor.graph

## # A tbl_graph: 24 nodes and 59 edges
## #
## # An undirected simple graph with 3 components
## #
## # Edge Data: 59 x 3 (active)
##    from    to weight
##      
## 1     1     2  1.000
## 2     1    20  1.000
## 3     1     8  0.999
## 4     1     9  0.999
## 5     1    11  0.998
## 6     2    20  1.000
## # ... with 53 more rows
## #
## # Node Data: 24 x 2
##           label    cyl
##            
## 1     Mazda RX4      6
## 2 Mazda RX4 Wag      6
## 3    Datsun 710      4
## # ... with 21 more rows

Visualize the correlation network.

Change the edges width according to the variable weight
Scale the edges width by setting the minimum width to 0.2 and the maximum to 1.
Change the color of cars (nodes) according to the grouping variable cyl.

set.seed(1)
ggraph(cor.graph) + 
  geom_edge_link(aes(width = weight), alpha = 0.2) + 
  scale_edge_width(range = c(0.2, 1)) +
  geom_node_point(aes(color = cyl), size = 2) +
  geom_node_text(aes(label = label), size = 3, repel = TRUE) +
  theme_graph()

Network analysis

In this sections, we described methods for detecting important or central entities in a network graph. We’ll also introduce how to detect community (or cluster) in a network.

Centrality

Centrality is an important concept when analyzing network graph. The centrality of a node / edge measures how central (or important) is a node or edge in the network.

We consider an entity important, if he has connections to many other entities. Centrality describes the number of edges that are connected to nodes.

There many types of scores that determine centrality. One of the famous ones is the pagerank algorithm that was powering Google Search in the beginning.

Examples of common approaches of measuring centrality include:

betweenness centrality. The betweenness centrality for each nodes is the number of the shortest paths that pass through the nodes.
closeness centrality. Closeness centrality measures how many steps is required to access every other nodes from a given nodes. It describes the distance of a node to all other nodes. The more central a node is, the closer it is to all other nodes.
eigenvector centrality. A node is important if it is linked to by other important nodes. The centrality of each node is proportional to the sum of the centralities of those nodes to which it is connected. In general, nodes with high eigenvector centralities are those which are linked to many other nodes which are, in turn, connected to many others (and so on).
Hub and authority centarlities are generalization of eigenvector centrality. A high hub node points to many good authorities and a high authority node receives from many good hubs.

The tidygraph package contains more than 10 centrality measures, prefixed with the term centrality_. These measures include:

centrality_authority()
centrality_betweenness()
centrality_closeness()
centrality_hub()
centrality_pagerank()
centrality_eigen()
centrality_edge_betweenness()

All of these centrality functions returns a numeric vector matching the nodes (or edges in the case of `centrality_edge_betweenness()).

In the following examples, we’ll use the phone call network graph. We’ll change the color and the size of nodes according to their values of centrality.

set.seed(123)
phone.net %>%
  activate(nodes) %>%
  mutate(centrality = centrality_authority()) %>% 
  ggraph(layout = "graphopt") + 
  geom_edge_link(width = 1, colour = "lightgray") +
  geom_node_point(aes(size = centrality, colour = centrality)) +
  geom_node_text(aes(label = label), repel = TRUE)+
  scale_color_gradient(low = "yellow", high = "red")+
  theme_graph()

For a given problem at hand, you can test the different centrality score to decide which centrality measure makes most sense for your specific question.

Clustering

Clustering is a common operation in network analysis and it consists of grouping nodes based on the graph topology.

It’s sometimes referred to as community detection based on its commonality in social network analysis.

Many clustering algorithms from are available in the tidygraph package and prefixed with the term group_. These include:

Infomap community finding. It groups nodes by minimizing the expected description length of a random walker trajectory. R function: group_infomap()
Community structure detection based on edge betweenness. It groups densely connected nodes. R function: group_edge_betweenness().

In the following example, we’ll use the correlation network graphs to detect clusters or communities:

set.seed(123)
cor.graph %>%
  activate(nodes) %>%
   mutate(community = as.factor(group_infomap())) %>% 
  ggraph(layout = "graphopt") + 
  geom_edge_link(width = 1, colour = "lightgray") +
  geom_node_point(aes(colour = community), size = 4) +
  geom_node_text(aes(label = label), repel = TRUE)+
  theme_graph()

Three communities are detected.

Thomas Lin Pedersen. Introducing tidygraph. https://www.data-imaginist.com/2017/introducing-tidygraph/
Shirin Glander. Network analysis of Game of Thrones. https://datascienceplus.com/network-analysis-of-game-of-thrones/

0 Note

Enjoyed this article? Give us 5 stars (just above this text block)! Reader needs to be STHDA member for voting. I’d be very grateful if you’d help it spread by emailing it to a friend, or sharing it on Twitter, Facebook or Linked In.

Show me some love with the like buttons below... Thank you and please don't forget to share and comment below!!

Avez vous aimé cet article? Donnez nous 5 étoiles (juste au dessus de ce block)! Vous devez être membre pour voter. Je vous serais très reconnaissant si vous aidiez à sa diffusion en l'envoyant par courriel à un ami ou en le partageant sur Twitter, Facebook ou Linked In.

Montrez-moi un peu d'amour avec les like ci-dessous ... Merci et n'oubliez pas, s'il vous plaît, de partager et de commenter ci-dessous!

Recommended for You!

Machine Learning Essentials: Practical Guide in R

Practical Guide to Cluster Analysis in R

Practical Guide to Principal Component Methods in R

R Graphics Essentials for Great Data Visualization

Network Analysis and Visualization in R

More books on R and data science

Recommended for you

This section contains best data science and self-development resources to help you on your path.

Coursera - Online Courses and Specialization

Books - Data Science

Our Books

Practical Guide to Cluster Analysis in R by A. Kassambara (Datanovia)
Practical Guide To Principal Component Methods in R by A. Kassambara (Datanovia)
Machine Learning Essentials: Practical Guide in R by A. Kassambara (Datanovia)
R Graphics Essentials for Great Data Visualization by A. Kassambara (Datanovia)
GGPlot2 Essentials for Great Data Visualization in R by A. Kassambara (Datanovia)
Network Analysis and Visualization in R by A. Kassambara (Datanovia)
Practical Statistics in R for Comparing Groups: Numerical Variables by A. Kassambara (Datanovia)
Inter-Rater Reliability Essentials: Practical Guide in R by A. Kassambara (Datanovia)

Others

R for Data Science: Import, Tidy, Transform, Visualize, and Model Data by Hadley Wickham & Garrett Grolemund
Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems by Aurelien Géron
Practical Statistics for Data Scientists: 50 Essential Concepts by Peter Bruce & Andrew Bruce
Hands-On Programming with R: Write Your Own Functions And Simulations by Garrett Grolemund & Hadley Wickham
An Introduction to Statistical Learning: with Applications in R by Gareth James et al.
Deep Learning with R by François Chollet & J.J. Allaire
Deep Learning with Python by François Chollet

Comments

You are not authorized to post a comment

Load required packages

Create network objects

Use tbl_graph

Use as_tbl_graph R function

Print out a network object

Network graph manipulation

Network analysis

Centrality

Clustering

Read more