Monday, December 10, 2012

Community detection algorithm with igraph and R - (1)

In the first entry on this blog I gave an example on how to load huge graphs with R. The biggest problem, however, is actually doing something useful with huge graphs. This post is somewhat of a preparation for the next post on iterators in igraph. I thought it would be better to explain the limitations and how to avoid them with a real example: Community detection. igraph actually provides several different community-detection algorithms, which are written in plain C and, therefore, run quite fast. However, it does not provide a label propagation algorithm (LPA) which belongs to the broader spectrum of modularity optimization algorithms.

How does it work?

Essentially, LPA tries to find a consensus of labels in densely connected clusters (communities) of a graph. It does so in several iterations. On the first iteration, every vertex in the graph chooses a unique community. This could be, for example, its own label. Every vertex will then "send" its own label to all its neighbors.
On every subsequent iteration, each vertex counts the labels it received from it's neighbors in the previous iteration and selects the label with the highest count. If this is the second iteration, each label will only be received once (it is as if all neighbors sent their own, unique, id). However, the vertices will select a random label, if there is no clear winner in the counts.
Ultimately, the algorithm terminates if, during the current iteration, no vertex had to change its label.

You can see, that eventually some labels will win over others. If you do not believe it, you are not wrong either: "Resolution limit in community detection", is just one of the many scientific publications on issues with community detection.

An implementation in igraph

There is actually a basic implementation of this algorithm on the igraph-wiki. Here is a slightly adapted version which creates an actual community-object. This way, you can use it in other igraph-algorithms as well (such as create nice plots).

  LargeScaleCommunity <- function(g, mode="all") {
    cat("Assigning initial communities...\n")
    V(g)$group <- V(g)$name
    ## random order in which vertices will be processed
    cat("Generating random order...\n")
    order <- sample(vcount(g), vcount(g))
    t <- 0
    done <- FALSE
  
    while (!done) {
      t <- t + 1
      cat("round: ", t, "\n")
      ## change to FALSE whenever a node changes groups
      done <- TRUE

      for(i in order) {
        ## get the neighbor group frequencies:
        group.freq <- table(V(g)[neighbors(g, i, mode = mode)]$group)
        ## pick one of the most frequent:
        new.group <- sample(names(group.freq)[group.freq == max(group.freq)], 1)
        if (done) {
          ## we are only done if new group is the same as old group
          done <- (new.group == V(g)[i]$group)
        }
        V(g)[i]$group <- new.group
      }
    }

    cat("Creating community-object...\n")
    comms <- list(membership = as.numeric(V(g)$group),
                  vcount = vcount(g),
                  algorithm = "LPA",
                  names = V(g)$name)
    class(comms) <- "communities"
    return(comms)
  }

Input

One obvious application for these algorithms are graphs of social interaction. Since I wanted to make this as independent as possible from external data-sources, we'll have to construct an artificial graph. The interesting thing is that their structure differs only in one important point from purely random graphs: All social graphs are "scale-free". This means that very few vertices will have the majority of the edges and the majority of the vertices will have very few edges. It is actually easy to see why that is the case: A famous artist will have many connection (e.g. followers on Twitter or g+). But the ordinary person will have only a few.

Generating such a graph in R is quite easy:

  cat("--- Creating data.frame...\n")
  start <- proc.time()
  df <- data.frame(src = sample(1:100, 1000000, replace = TRUE, prob = exp(0.5*(1:100))),
                   dst = sample(1:100, 1000000, replace = TRUE, prob = exp(0.5*(1:100))))
  cat(sprintf("--- elapsed time: %fs\n\n", (proc.time() - start)[1]))


In contrast to the last time, we provide a vector of probabilities for the sample: "prob". This vector is constructed using a "power law" - i.e. from the 100 available vertices it will decide much more often for the 100, than for the 1.

Running the algorithm

Subsequently, we create the graph-object and run the community-detection:

  cat("--- Creating graph...\n")
  start <- proc.time()
  vertex.attrs <- list(name = unique(c(df$src, df$dst)))
  edges <- rbind(match(df$src, vertex.attrs$name), match(df$dst, vertex.attrs$name))
  G <- graph.empty(n = 0, directed = T)
  G <- add.vertices(G, length(vertex.attrs$name), attr = vertex.attrs)
  G <- add.edges(G, edges)
  cat(sprintf("--- elapsed user-time: %fs\n\n", (proc.time() - start)[1]))

  cat("--- Detecting communities...\n")

  start <- proc.time()
  C <- LargeScaleCommunity(G)

  cat(sprintf("--- elapsed time: %f\n\n", (proc.time() - start)[1]))


Running this example on my laptop results in a runtime of around 7s. Not that bad, right? But wait, it's a fairly small graph with only 100 different vertices and 1,000,000 edges. For 1000 unique vertices and 10,000,000  edges it's already around 40s. I won't dare to run this for a real large graph on my notebook. The good news is, however, it can be much faster. But that's material for the next post
.