Note | Applied Data Analytics in Python

date
Oct 23, 2019
slug
applied-data-analytics-in-python
status
Published
summary
This is a lecture note of mine.
tags
Academic
Data Analysis
Engineering
Study
Methodology
type
Post
Contents

I. Why Study Networks and Basics of NetworkX

Definition

  1. Networks (of Graph): a representation of connections among a set of
      • Nodes: items
      • Edges: connections
      • G = nx.Graph()
  1. Edge Direction:
      • Symmetric relationships
      • Asymmetric relationships
      • G.add_edge('A', 'B')
  1. Weighted networks: a network where edges are assigned a (typically numerical) weight
      • G.add_edge('A', 'B', weight = 6)
  1. Signed networks: a networks where edges are assined positive of negative sign
      • G.add_edge('A', 'B', sigh = '+')
  1. Other edge attributes
      • G.add_edge('A', 'B', relation = 'friend')
  1. Multigraph: a network where multiple edges can connect the same nodes (parallel edges nx.MultiGraph

    Node and Edge Attributes

    1. Adding
      1. Accessing

        Bipartite graphs

        1. Bipartite Graph: a graph whose nodes can be split into two sets L and R and every edge connects an node in L with a node in R
          1. Check if bipartite graph: bipartite.is_bipartite(B)
          1. Check if bipartite nodes:
            1. Getting each set of a bipartite graph: bipartite.sets(B)
            1. L-bipartite graph projection: Network of nodes in group L, where a pair of nodes is connected if they have a common neighbor in R in the bipartite graph.
              1. L-bipartite weighted graph projection: An L-bipartite graph projection with Weights on the edges that are proportional to the number of common neighbors between the nodes bipartite.weighted_projected_graph(B, X)

              II. Network Connectivity

              Clustering coefficient

              1. Triadic closure: The tendency for people who share connections in a social network to become connected
              1. Clustering coefficient: measures the degree to which nodes in a network tend to "cluster" or form triangles
              1. Local clustering coefficient of a node: Fraction of pairs of the nodes friends that are friends with each other nx.clustering(G, F)
              1. Global clustering coefficient
                  • Average: nx.average_clustering(G)
                  • Transitivity: Ratio of number of triangles and number of open triads in a network: nx.transitivity(G)
                    • notion image

              Distance measures

              1. Distance:
              1. Paths: a sequence of nodes connected by an edge
              1. Path length: number of steps it contains from beginning to end
              1. Distance between two nodes: the length of the shortest path between them
                1. Breadth-first search: a systematic and efficient procedure for computing distances from a node to all other nodes in a large network by"discovering nodes in layers nx.bfs_tree(G, 'A')
                  1. notion image
                1. Eccentricity of a node N is the largest distance between N and all other nodes nx.eccentricity(G)
                1. Characterizing distances in a network:
                    • Average distance: between every pair of nodes
                    • Diameter: maximum distance between any pair of nodes
                    • Radius of a graph is the minimum eccentricity nx.radius(G)
                1. Identifying central and peripheral nodes
                    • The periphery of a graph is the set of nodes that have eccentricity equal to the diameter.
                    • The center of a graph is the set of nodes that have eccentricity equal to the radius

                Connected components

                Undirected graphs

                1. An undirected graph is connected if, for every pair nodes, there is a path between them nx.is_connected(G)
                1. Connected components is a subset of nodes such as:
                    • Every node in the subset has a path to every other node
                    • No other node has a path to any node in the subset

                Directed graphs

                1. Connectivity in Directed Graphs
                    • A directed graph is strongly connected if for every pair nodes u and v, there is a directed path from u to v and a directed path from y to u nx.is_strongly_connected(G)
                    • A directed graph is weakly connected if replacing all directed edges with undirected edges produces a connected undirected graph nx.is_weakly_connecte(G)
                1. Strongly connected component is a subset of nodes such as:
                    • Every node in the subset has a directed path to every other node
                    • No other node has a directed path to and from every node in the subset

                Network robustness

                1. Network robustness: the ability of a network to maintain its general structural properties when it faces failures or attacks
                    • Type of attacks: removal of nodes or edges
                    • Structural properties: connectivity
                      • Node connectivity: Minimum number of nodes needed to disconnect a graph or pair of nodes nx.node_connectivity(G) / nx.minimum_node_cut(G)
                      • Edge connectivity: Minimum number of edges needed to disconnect a graph or pair of nodes nx.edge_connectivity(G) / nx.minimum_edge_cut(G)
                1. Robust networks have large minimum node and edge cuts

                III. Influence Measures and Network Centralization

                 

                Degree and Closeness Centrality

                1. Network Centrality: measures identify the most important nodes in a network
                1. Measures
                    • Degree centrality
                      • Assumption: important nodes have many connections
                      • Number of neighbors
                        • Undirected: degree nx.degree_centrality(G)
                        • Directed: in/out-degree
                          • nx.in_degree_centrality(G)
                          • nx.out_degree_centrality(G)
                    • Closeness centrality
                      • Assumption: important nodes are close to other nodes
                        • nx.closeness_centrality(G)
                      • Disconnected Nodes: Normalized Closeness Centrality
                        • nx.closeness_centrality(G, normalized=True)

                    Betweenness centrality

                    • Assumption: important nodes connect other nodes
                    • When computing betweenness centrality, we only consider nodes such that there is at least one path between them
                    • Normalized Betweenness Centrality: betwenness centrality values will be larger in graphs with many nodes. To control for this, we divide centrality values by the number of pairs of nodes in the graph (excluding v) nx.betweeness_centrality(G, normalized=Treu,endpoints=False)
                      • undirected:
                      • directed:
                    • Approximation: computationally: betweenness_centrality(G, normalized=True, endpoints=False,k=10)
                    • Subsets: betweenness_centrality_subset(G, normalized=True)
                    • Edge: nx.edge_betweenness_centrality(G, normalized=True)
                    • Load centrality
                    • Page Rank
                    • Katz centrality
                    • Percolation centrality

                Basic PageRank

                1. Start with
                1. Perform the Basic PageRank Update Rule:
                    • Each node gives an equal share of its current PageRank to all the nodes it links to
                    • The new PangeRank of each node is the sum of all the PageRank it received from other nods
                    • For most networks, convergence

                Scaled PageRank

                1. Problem of Basic PageRank: "stuck" ←The PageRank of a node at step k is the probability that a random walker lands on the node after taking k steps
                1. Solution: damping parameter nx.pagerank(G, alpha=0.8)
                    • with probability , random walk
                    • with probability , randomly choose a node to walk

                Hubs and Authorities 暂不要

                1. root
                1. base
                1. HITS Algorithm
                    • authoreity update rule
                    • hub update rule

                Centrality Examples

                notion image

                IV. Network Evolution

                Preferential Attachment Model

                Degree Distribution

                1. The degree of a node in an undirected graph is the number of neighbors it has
                1. The degree distribution of a graph is the probability distribution of the degrees over the entire network

                In-degree Distribution

                1. The in-degree of a node in a directed graph is the number of in-links it has
                1. The in-degree distribution, Pin (k), of this network has the following values:

                Degree distribution in Real Networks

                Power law: Networks with power law distribution have many nodes with small degree and a few nodes with very large degree
                notion image

                Preferential Attachment Model

                notion image

                Small World Networks

                Milgram Small World Experiment

                notion image

                Small World of Instant Message

                notion image

                Small World of Facebook

                notion image

                Clustering Coefficient

                • Facebook2011: High Average CC (decreases with degree)
                • Microsoft IM: Average CC of 0.13
                • IMDB actor network: Average CC 0.78

                Path Length and Clustering

                Social networks tento have high CC and small average path length

                Small World Model

                notion image
                notion image
                 
                notion image

                Summary

                • Real social networks appear to have small shortest paths between nodes and high clustering coefficient
                • The preferential attachment model produces networks with small shortest paths but very small clustering coefficient
                • The small world model starts with a ring lattice with nodes connected to k nearest neighbors (high local clustering), and it rewires edges with probability p
                • For small values of p, small world networks have small average shortest path and high clustering coefficient, matching what we observe in real networks

                Link Prediction

                Basic

                1. Measure I: Common Neighbors
                1. Measure II: Jaccard coeffient
                1. Measure III: Resource Allocation
                1. Measure IV: Adamic-Adar Index
                1. Measure V: Pre. Attachment

                Community Information Based

                1. Measure VI: Community Common Neighbors
                    • Community Structure:
                      • Assume the nodes in this network belong to different communities(sets of nodes)
                      • Pairs of nodes who belong to the same community and have many common neighbors in their community are likely to form an edge
                1. Measure VII: Community Resource Allocation

                © Rongxin 2021 - 2024