Tuesday, February 26, 2013

The Cost of Convenience

The previous posts were all about saving processing time with R. However, a question that came to my mind was: How fast would it be in C++? Or, in other words: What does the convenience of using R cost? Given that we consider really large data, this might be an important question to answer before starting the implementation.

Executive Summary

Last time we reached a runtime of 7s for 10,000,000 edges with the optimized version of LPA on top of R and igraph. Let's compare that to a simple C++ program which does nothing but LPA. In a way, that is very unfair to R/igraph: naturally, they lose performance by offering a more general interface which does much, much more than just LPA or graph processing. But in some cases it might indeed pay off to implement a specialized function once and have it run many times on large data-sets in a fraction of the time.

If you don't want to read through the details, here is the executive summary: the C++ LPA finishes in about 0.6s. Note that it is in no way optimized, other than the '-O3' offered by the compiler. This means it's more than 11 times faster than its R counterpart! However, as you'll see below, the code is also longer and quite specialized.
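
For reference, a minimal build-and-run line might look like the following (the file names lpa.cpp and edges.txt are illustrative, not taken from my actual setup):

  # compile with optimization, then run on an edge-list file
  g++ -O3 -o lpa lpa.cpp
  ./lpa edges.txt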

Implementation Details

Let's start with the preamble - the data-structures. We use a map to describe the graph. Intuitively, the keys are vertices and the values are their neighbors. However, we need to store some more information - e.g. the groups. There are, in fact, many options to do that. I chose not to rely on the ordering of the map (i.e. I do not use a separate "groups" vector that shares its index with the map, like we did in the R counterpart).

Instead, the values in my map are somewhat "fat": they include the group of the vertex, the list of neighbors and a buffer for proposals. The latter is needed because I chose to implement LPA in a two-phase manner: first, there is a proposal phase where every vertex proposes its own group to its neighbors. Then, every vertex evaluates the received proposals and chooses a (possibly) new community based on them.

  #include <map>
  #include <vector>
  #include <algorithm>
  #include <string>
  #include <iostream>
  #include <fstream>
  #include <stdint.h>
  #include <ctime>

  // contains the neighbors, all proposals and the group of a vertex
  struct info {
    std::vector<uint64_t> neighbors;
    std::map<uint64_t, uint64_t> proposals;
    uint64_t group;
  };

  // the actual graph, implemented as a map
  typedef std::map<uint64_t, struct info> GraphT;
  GraphT G;

  // we use this to obtain the most frequent group in our neighborhood;
  // ties are broken by std::max_element, which keeps the first maximum
  bool value_comparer(const std::map<uint64_t, uint64_t>::value_type &i1,
                      const std::map<uint64_t, uint64_t>::value_type &i2)
  {
    return i1.second < i2.second;
  }

This is the function we use to initially assign a unique group to every vertex, using the vertex name as the group. It is actually so boring that we could just skip it ;)

  // assign unique groups initially
  void initial_groups() {
    // run over all vertices and assign unique groups
    for (GraphT::iterator v = G.begin(); v != G.end(); ++v) {
      v->second.group = v->first;
    }
  }

Here it gets interesting. As noted above, this algorithm has two phases (1: proposal phase, 2: decision phase) which are repeated until no more vertices change their group. It is in no way surprising that this is reminiscent of consensus: LPA is in fact striving to reach a consensus among densely connected clusters in a graph.

  // the actual algorithm
  void run_LPA() {
    std::clock_t begin = std::clock();
    while(true) {
      std::cout << "proposal phase\n";
      bool done = true;
        
      // run over all vertices and send proposals
      for (GraphT::iterator v = G.begin(); v != G.end(); ++v) {
        for (std::vector<uint64_t>::const_iterator n = v->second.neighbors.begin();
             n != v->second.neighbors.end();
             ++n)
        {
          G[*n].proposals[v->second.group]++;
        }
      }

      std::cout << "decision phase\n";
      // run over all vertices and evaluate proposals
      for (GraphT::iterator v = G.begin(); v != G.end(); ++v) {
        if (v->second.proposals.size() == 0) continue;
            
        uint64_t new_grp = std::max_element(v->second.proposals.begin(),
                                            v->second.proposals.end(),
                                            value_comparer)->first;
        if (new_grp != v->second.group) done = false;
            
        v->second.group = new_grp;
        v->second.proposals.clear();
      }

      if (done) break;
    }
    std::clock_t end = std::clock();
    double elapsed_secs = double(end - begin) / CLOCKS_PER_SEC;
    std::cout << "Elapsed time: " << elapsed_secs << "s\n";
  }

Loading the graph and running the algorithm.

  // main program: read file, run LPA
  int main(int argc, char* argv[]) {
    // check that an input file was given
    if (argc < 2) {
      std::cerr << "usage: " << argv[0] << " <edge-list-file>\n";
      return 1;
    }

    // read the file: one "src dst" pair per line
    std::ifstream graph(argv[1]);
    if (!graph.is_open()) {
      std::cerr << "could not open " << argv[1] << "\n";
      return 1;
    }

    uint64_t src;
    uint64_t dst;
    while (graph >> src >> dst) {
      G[src].neighbors.push_back(dst);
      // make sure vertices that only appear as a destination are in the
      // map before the initial groups are assigned (otherwise they would
      // silently enter the graph later with the default group 0)
      G[dst];
    }
    graph.close();

    // run the algorithm
    initial_groups();
    run_LPA();
  }

Final Remarks

Let me point out some final thoughts and observations. First, I do not claim that this is the most efficient code out there. For example, STL maps are not exactly known for their performance. I'd bet that a plain C program would be even faster. Moreover, if pursued seriously, one would also look at things like the Boost Graph Library or GraphChi.
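
To make that point a bit more concrete, here is a minimal sketch of what swapping the ordered std::map for a hash map (std::unordered_map, C++11) would look like. Consider it an untested variation on the code above, not a benchmarked improvement:

  #include <unordered_map>
  #include <vector>
  #include <stdint.h>

  // same "fat" value type as before, but the graph and the proposal
  // buffer now live in hash maps: O(1) average lookups instead of O(log n)
  struct info {
    std::vector<uint64_t> neighbors;
    std::unordered_map<uint64_t, uint64_t> proposals;
    uint64_t group;
  };

  typedef std::unordered_map<uint64_t, info> GraphT;
  GraphT G;

The value_comparer would need the matching unordered_map value_type in its signature, and since hash maps iterate in unspecified order, ties in max_element may break differently from the std::map version.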

2 comments:

  1. I'm guessing that Rcpp would give you the best of both worlds, but I don't know C++ at all.

  2. Very true! The igraph-package is actually implemented in C++ (with some exceptions). Maybe this would be an interesting blog-entry as well: implementing the LPA inside the igraph-C++ package. Especially interesting, because I've been playing a bit with their random-walk algorithm and surprisingly it ran out of memory pretty quickly for large graphs. Then again, random walks are a completely different story.
