Note  Applied Data Analytics in Python
date
Oct 23, 2019
slug
applieddataanalyticsinpython
status
Published
summary
This is a lecture note of mine.
tags
Academic
Data Analysis
Engineering
Study
Methodology
type
Post
I. Why Study Networks and Basics of NetworkX
Definition
 Networks (of Graph): a representation of connections among a set of
 Nodes: items
 Edges: connections
G = nx.Graph()
 Edge Direction:
 Symmetric relationships
 Asymmetric relationships
G.add_edge('A', 'B')
 Weighted networks: a network where edges are assigned a (typically numerical) weight
G.add_edge('A', 'B', weight = 6)
 Signed networks: a networks where edges are assined positive of negative sign
G.add_edge('A', 'B', sigh = '+')
 Other edge attributes
G.add_edge('A', 'B', relation = 'friend')
 Multigraph: a network where multiple edges can connect the same nodes (parallel edges
nx.MultiGraph
Node and Edge Attributes
 Adding
 Accessing
Bipartite graphs
 Bipartite Graph: a graph whose nodes can be split into two sets L and R and every edge connects an node in L with a node in R
 Check if bipartite graph:
bipartite.is_bipartite(B)
 Check if bipartite nodes:
 Getting each set of a bipartite graph:
bipartite.sets(B)
 Lbipartite graph projection: Network of nodes in group L, where a pair of nodes is connected if they have a common neighbor in R in the bipartite graph.
 Lbipartite weighted graph projection: An Lbipartite graph projection with Weights on the edges that are proportional to the number of common neighbors between the nodes
bipartite.weighted_projected_graph(B, X)
II. Network Connectivity
Clustering coefficient
 Triadic closure: The tendency for people who share connections in a social network to become connected
 Clustering coefficient: measures the degree to which nodes in a network tend to "cluster" or form triangles
 Local clustering coefficient of a node:
Fraction of pairs of the nodes friends that are friends with each other
nx.clustering(G, F)
 Global clustering coefficient
 Average:
nx.average_clustering(G)
 Transitivity: Ratio of number of triangles and number of open triads in a network:
nx.transitivity(G)
Distance measures
 Distance:
 Paths: a sequence of nodes connected by an edge
 Path length: number of steps it contains from beginning to end
 Distance between two nodes: the length of the shortest path between them
 Breadthfirst search: a systematic and efficient procedure for computing distances from a node to all other nodes in a large network by"discovering nodes in layers
nx.bfs_tree(G, 'A')
 Eccentricity of a node N is the largest distance between N and all other nodes
nx.eccentricity(G)
 Characterizing distances in a network:
 Average distance: between every pair of nodes
 Diameter: maximum distance between any pair of nodes
 Radius of a graph is the minimum eccentricity
nx.radius(G)
 Identifying central and peripheral nodes
 The periphery of a graph is the set of nodes that have eccentricity equal to the diameter.
 The center of a graph is the set of nodes that have eccentricity equal to the radius
Connected components
Undirected graphs
 An undirected graph is connected if, for every pair nodes, there is a path between them
nx.is_connected(G)
 Connected components is a subset of nodes such as:
 Every node in the subset has a path to every other node
 No other node has a path to any node in the subset
Directed graphs
 Connectivity in Directed Graphs
 A directed graph is strongly connected if for every pair nodes u and v, there is a directed path from u to v and a directed path from y to u
nx.is_strongly_connected(G)
 A directed graph is weakly connected if replacing all directed edges with undirected edges produces a connected undirected graph
nx.is_weakly_connecte(G)
 Strongly connected component is a subset of nodes such as:
 Every node in the subset has a directed path to every other node
 No other node has a directed path to and from every node in the subset
Network robustness
 Network robustness: the ability of a network to maintain its general structural properties when it faces failures or attacks
 Type of attacks: removal of nodes or edges
 Structural properties: connectivity
 Node connectivity: Minimum number of nodes needed to disconnect a graph or pair of nodes
nx.node_connectivity(G)
/nx.minimum_node_cut(G)
 Edge connectivity: Minimum number of edges needed to disconnect a graph or pair of nodes
nx.edge_connectivity(G)
/nx.minimum_edge_cut(G)
 Robust networks have large minimum node and edge cuts
III. Influence Measures and Network Centralization
Degree and Closeness Centrality
 Network Centrality: measures identify the most important nodes in a network
 Measures
 Degree centrality
 Assumption: important nodes have many connections
 Number of neighbors
 Undirected: degree
nx.degree_centrality(G)
 Directed: in/outdegree

nx.in_degree_centrality(G)

nx.out_degree_centrality(G)
 Closeness centrality
 Assumption: important nodes are close to other nodes
 Disconnected Nodes: Normalized Closeness Centrality
 Assumption: important nodes connect other nodes
 When computing betweenness centrality, we only consider nodes such that there is at least one path between them
 Normalized Betweenness Centrality: betwenness centrality values will be larger in graphs with many nodes. To control for this, we divide centrality values by the number of pairs of nodes in the graph (excluding v)
nx.betweeness_centrality(G, normalized=Treu,endpoints=False)
 undirected:
 directed:
 Approximation: computationally:
betweenness_centrality(G, normalized=True, endpoints=False,k=10)
 Subsets:
betweenness_centrality_subset(G, normalized=True)
 Edge:
nx.edge_betweenness_centrality(G, normalized=True)
 Load centrality
 Page Rank
 Katz centrality
 Percolation centrality
nx.closeness_centrality(G)
nx.closeness_centrality(G, normalized=True)
Betweenness centrality
Basic PageRank
 Start with
 Perform the Basic PageRank Update Rule:
 Each node gives an equal share of its current PageRank to all the nodes it links to
 The new PangeRank of each node is the sum of all the PageRank it received from other nods
 For most networks, convergence
Scaled PageRank
 Problem of Basic PageRank: "stuck" ←The PageRank of a node at step k is the probability that a random walker lands on the node after taking k steps
 Solution: damping parameter
nx.pagerank(G, alpha=0.8)
 with probability , random walk
 with probability , randomly choose a node to walk
Hubs and Authorities 暂不要
 root
 base
 HITS Algorithm
 authoreity update rule
 hub update rule
Centrality Examples
IV. Network Evolution
Preferential Attachment Model
Degree Distribution
 The degree of a node in an undirected graph is the number of neighbors it has
 The degree distribution of a graph is the probability distribution of the degrees over the entire network
Indegree Distribution
 The indegree of a node in a directed graph is the number of inlinks it has
 The indegree distribution, Pin (k), of this network has the following values:
Degree distribution in Real Networks
Power law: Networks with power law distribution have many nodes with small degree and a few nodes with very large degree
Preferential Attachment Model
Small World Networks
Milgram Small World Experiment
Small World of Instant Message
Small World of Facebook
Clustering Coefficient
 Facebook2011: High Average CC (decreases with degree)
 Microsoft IM: Average CC of 0.13
 IMDB actor network: Average CC 0.78
Path Length and Clustering
Social networks tento have high CC and small average path length
Small World Model
Summary
 Real social networks appear to have small shortest paths between nodes and high clustering coefficient
 The preferential attachment model produces networks with small shortest paths but very small clustering coefficient
 The small world model starts with a ring lattice with nodes connected to k nearest neighbors (high local clustering), and it rewires edges with probability p
 For small values of p, small world networks have small average shortest path and high clustering coefficient, matching what we observe in real networks
Link Prediction
Basic
 Measure I: Common Neighbors
 Measure II: Jaccard coeffient
 Measure III: Resource Allocation
 Measure IV: AdamicAdar Index
 Measure V: Pre. Attachment
Community Information Based
 Measure VI: Community Common Neighbors
 Community Structure:
 Assume the nodes in this network belong to different communities(sets of nodes)
 Pairs of nodes who belong to the same community and have many common neighbors in their community are likely to form an edge
 Measure VII: Community Resource Allocation