Note | Applied Data Analytics in Python
date
Oct 23, 2019
slug
applied-data-analytics-in-python
status
Published
summary
This is a lecture note of mine.
tags
Academic
Data Analysis
Engineering
Study
Methodology
type
Post
I. Why Study Networks and Basics of NetworkX
Definition
- Networks (of Graph): a representation of connections among a set of
- Nodes: items
- Edges: connections
G = nx.Graph()
- Edge Direction:
- Symmetric relationships
- Asymmetric relationships
G.add_edge('A', 'B')
- Weighted networks: a network where edges are assigned a (typically numerical) weight
G.add_edge('A', 'B', weight = 6)
- Signed networks: a networks where edges are assined positive of negative sign
G.add_edge('A', 'B', sigh = '+')
- Other edge attributes
G.add_edge('A', 'B', relation = 'friend')
- Multigraph: a network where multiple edges can connect the same nodes (parallel edges
nx.MultiGraph
Node and Edge Attributes
- Adding
- Accessing
Bipartite graphs
- Bipartite Graph: a graph whose nodes can be split into two sets L and R and every edge connects an node in L with a node in R
- Check if bipartite graph:
bipartite.is_bipartite(B)
- Check if bipartite nodes:
- Getting each set of a bipartite graph:
bipartite.sets(B)
- L-bipartite graph projection: Network of nodes in group L, where a pair of nodes is connected if they have a common neighbor in R in the bipartite graph.
- L-bipartite weighted graph projection: An L-bipartite graph projection with Weights on the edges that are proportional to the number of common neighbors between the nodes
bipartite.weighted_projected_graph(B, X)
II. Network Connectivity
Clustering coefficient
- Triadic closure: The tendency for people who share connections in a social network to become connected
- Clustering coefficient: measures the degree to which nodes in a network tend to "cluster" or form triangles
- Local clustering coefficient of a node:
Fraction of pairs of the nodes friends that are friends with each other
nx.clustering(G, F)
- Global clustering coefficient
- Average:
nx.average_clustering(G)
- Transitivity: Ratio of number of triangles and number of open triads in a network:
nx.transitivity(G)
Distance measures
- Distance:
- Paths: a sequence of nodes connected by an edge
- Path length: number of steps it contains from beginning to end
- Distance between two nodes: the length of the shortest path between them
- Breadth-first search: a systematic and efficient procedure for computing distances from a node to all other nodes in a large network by"discovering nodes in layers
nx.bfs_tree(G, 'A')
- Eccentricity of a node N is the largest distance between N and all other nodes
nx.eccentricity(G)
- Characterizing distances in a network:
- Average distance: between every pair of nodes
- Diameter: maximum distance between any pair of nodes
- Radius of a graph is the minimum eccentricity
nx.radius(G)
- Identifying central and peripheral nodes
- The periphery of a graph is the set of nodes that have eccentricity equal to the diameter.
- The center of a graph is the set of nodes that have eccentricity equal to the radius
Connected components
Undirected graphs
- An undirected graph is connected if, for every pair nodes, there is a path between them
nx.is_connected(G)
- Connected components is a subset of nodes such as:
- Every node in the subset has a path to every other node
- No other node has a path to any node in the subset
Directed graphs
- Connectivity in Directed Graphs
- A directed graph is strongly connected if for every pair nodes u and v, there is a directed path from u to v and a directed path from y to u
nx.is_strongly_connected(G)
- A directed graph is weakly connected if replacing all directed edges with undirected edges produces a connected undirected graph
nx.is_weakly_connecte(G)
- Strongly connected component is a subset of nodes such as:
- Every node in the subset has a directed path to every other node
- No other node has a directed path to and from every node in the subset
Network robustness
- Network robustness: the ability of a network to maintain its general structural properties when it faces failures or attacks
- Type of attacks: removal of nodes or edges
- Structural properties: connectivity
- Node connectivity: Minimum number of nodes needed to disconnect a graph or pair of nodes
nx.node_connectivity(G)
/nx.minimum_node_cut(G)
- Edge connectivity: Minimum number of edges needed to disconnect a graph or pair of nodes
nx.edge_connectivity(G)
/nx.minimum_edge_cut(G)
- Robust networks have large minimum node and edge cuts
III. Influence Measures and Network Centralization
Degree and Closeness Centrality
- Network Centrality: measures identify the most important nodes in a network
- Measures
- Degree centrality
- Assumption: important nodes have many connections
- Number of neighbors
- Undirected: degree
nx.degree_centrality(G)
- Directed: in/out-degree
-
nx.in_degree_centrality(G)
-
nx.out_degree_centrality(G)
- Closeness centrality
- Assumption: important nodes are close to other nodes
- Disconnected Nodes: Normalized Closeness Centrality
- Assumption: important nodes connect other nodes
- When computing betweenness centrality, we only consider nodes such that there is at least one path between them
- Normalized Betweenness Centrality: betwenness centrality values will be larger in graphs with many nodes. To control for this, we divide centrality values by the number of pairs of nodes in the graph (excluding v)
nx.betweeness_centrality(G, normalized=Treu,endpoints=False)
- undirected:
- directed:
- Approximation: computationally:
betweenness_centrality(G, normalized=True, endpoints=False,k=10)
- Subsets:
betweenness_centrality_subset(G, normalized=True)
- Edge:
nx.edge_betweenness_centrality(G, normalized=True)
- Load centrality
- Page Rank
- Katz centrality
- Percolation centrality
nx.closeness_centrality(G)
nx.closeness_centrality(G, normalized=True)
Betweenness centrality
Basic PageRank
- Start with
- Perform the Basic PageRank Update Rule:
- Each node gives an equal share of its current PageRank to all the nodes it links to
- The new PangeRank of each node is the sum of all the PageRank it received from other nods
- For most networks, convergence
Scaled PageRank
- Problem of Basic PageRank: "stuck" ←The PageRank of a node at step k is the probability that a random walker lands on the node after taking k steps
- Solution: damping parameter
nx.pagerank(G, alpha=0.8)
- with probability , random walk
- with probability , randomly choose a node to walk
Hubs and Authorities 暂不要
- root
- base
- HITS Algorithm
- authoreity update rule
- hub update rule
Centrality Examples
IV. Network Evolution
Preferential Attachment Model
Degree Distribution
- The degree of a node in an undirected graph is the number of neighbors it has
- The degree distribution of a graph is the probability distribution of the degrees over the entire network
In-degree Distribution
- The in-degree of a node in a directed graph is the number of in-links it has
- The in-degree distribution, Pin (k), of this network has the following values:
Degree distribution in Real Networks
Power law: Networks with power law distribution have many nodes with small degree and a few nodes with very large degree
Preferential Attachment Model
Small World Networks
Milgram Small World Experiment
Small World of Instant Message
Small World of Facebook
Clustering Coefficient
- Facebook2011: High Average CC (decreases with degree)
- Microsoft IM: Average CC of 0.13
- IMDB actor network: Average CC 0.78
Path Length and Clustering
Social networks tento have high CC and small average path length
Small World Model
Summary
- Real social networks appear to have small shortest paths between nodes and high clustering coefficient
- The preferential attachment model produces networks with small shortest paths but very small clustering coefficient
- The small world model starts with a ring lattice with nodes connected to k nearest neighbors (high local clustering), and it rewires edges with probability p
- For small values of p, small world networks have small average shortest path and high clustering coefficient, matching what we observe in real networks
Link Prediction
Basic
- Measure I: Common Neighbors
- Measure II: Jaccard coeffient
- Measure III: Resource Allocation
- Measure IV: Adamic-Adar Index
- Measure V: Pre. Attachment
Community Information Based
- Measure VI: Community Common Neighbors
- Community Structure:
- Assume the nodes in this network belong to different communities(sets of nodes)
- Pairs of nodes who belong to the same community and have many common neighbors in their community are likely to form an edge
- Measure VII: Community Resource Allocation