Thursday, August 14, 2014

And now for something completely different

I started this blog roughly two years ago. At the time I felt an occasional frustration with my work as I started to get involved in a completely new kind of science: Big Data, Big Graphs, Statistics, Machine Learning... Whatever you want to call it - it was great and difficult at the same time. While I learned a lot, it also meant dealing with a host of issues, to which I found sub-optimal solutions - then changed direction, found better solutions and so on. Yet, even after many months of hard work I was still far away from publishing at a conference. In fact, I would be for quite a while.

So I started this blog. It became a logbook for many things I did since. In contrast to scientific conferences, there is no program committee to get past and nobody dictates what to publish, except me. And better yet, it might even be useful for others. Heck, I'd even go as far as to say that this blog probably has had more readers than some of my actual scientific publications. And after all, it's a lot of fun to write.

So what has changed after these years? Well, I'm on the train to Warsaw where I'll be presenting the work that frustrated me so much and motivated me to start this blog. And it turns out I'm a pretty lucky man in that my current job allows me to do so many seemingly unrelated things. So while I was busy doing my research on community detection, I also joined a project with the goal of providing a highly scalable publish/subscribe system. At the time, we called it StreamHUB. It was a massive beast, limited only by the number of machines available to us. But it was also hard to configure and run (and, well, understand, once there were problems). It was what people call a "research prototype". Eventually, I had to use it again for a project we received funding for, and I knew I'd have to put in a massive amount of work to get StreamHUB ready for use within that project. To this day, I'm thankful to a colleague who suggested that if I went for a complete rewrite, Erlang might be a better fit for this problem than C++.

So I tried and was sold. Having done distributed systems in C++ and a little in Python, I was simply amazed at what Erlang could do out of the box. Sending messages between processes - regardless of which machine they run on - without having to choose and then learn some library was a revelation to me, and I finally started to understand what "the right language for the problem" actually means. After a couple of months I had not only reached feature parity with StreamHUB but also gained elasticity - the ability to add more silicon to the computation without a restart of the system.

So without any further ado I present uPS, a scalable, elastic content-based publish/subscribe system. Please go ahead and try it out. It's far from perfect and there are some things missing (see the Issues section), but perfection is impossible anyway and it feels like the right time to publish. If you have any thoughts on it, please let me know.

Wednesday, April 23, 2014

Some fun (and insights) with facebook's graph API

Until now I've always used randomly generated graphs, which was completely fine, but I've always wanted to use a real graph at some point. And since I'm going to be sitting in a train for a couple of hours and have a thing for social graphs, I thought it'd be nice to look at what information we can extract from Facebook's likes. And if that's not reason enough - this is my first post which is going to be linked from the famous r-bloggers web site.

Introduction


Facebook provides access to the "who likes whom" graph through its graph API, which can be queried through a REST interface: "https://graph.facebook.com/---some-object-id---/likes?access_token=---your-access-token---". The semantics of the returned data depend on the object ID. In our case, we are interested in public pages. So if we query the interface for, e.g., "Blogf.social", we receive a list of pages which like Blogf. Obviously, we can then recursively get the pages which like those pages and so on. That's essentially a breadth-first search on the like graph of Facebook. So what are we going to find? Blogf is a community of German-speaking female bloggers assembled under the slogan "women write better blogs". Regardless of whether that's true or not - it's an interesting mix of people and topics. In any case, pages which like Blogf would be expected to be somewhat related to women.
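
For illustration, here is a minimal sketch of a single likes query in R (the crawler actually used for this post is written in Python, see the conclusion); the function name get_likes as well as page_id and token are placeholders, and the httr and jsonlite packages are assumed:

  library(httr)
  library(jsonlite)

  ## query the pages which like a given page; `page_id` and `token` are
  ## hypothetical placeholders for a real object ID and access token
  get_likes <- function(page_id, token) {
    url  <- sprintf("https://graph.facebook.com/%s/likes?access_token=%s",
                    page_id, token)
    resp <- GET(url)
    data <- fromJSON(content(resp, as = "text"))
    data$data   # typically a data frame with the id and name of each liking page
  }

Calling such a function repeatedly on the returned pages, while keeping track of the current depth, is all the breadth-first search described above needs.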

Ego Graph


The following graph shows the result of a breadth-first search of depth 3, starting at Blogf. Vertices are pages and edges represent likes - i.e. "A -> B" means "A likes B". Green vertices have substantially fewer likes than red vertices; the vertex and label sizes encode the same information. I also removed the vertices for which we didn't query the likes (because we had reached the maximum depth of 3). Blogf sits at the center of the graph with several clusters of pages around it. What is interesting is that while most of the pages are about women or online media, they cluster into different communities. On the top left, for example, there is a cluster of pages about the environment, connecting to Blogf through "Netzfrauen" (English: "Netwomen"). Likewise, to the right, there is a more political cluster, unsurprisingly connected to Blogf through "Gender equality". Interestingly, however, this seems to be a completely different kind of political site (with an emphasis on human rights) than the crowd on the bottom right (with an emphasis on left-wing politics), which connects through "Wer braucht Feminismus" (Who needs feminism).
In the literature, such graphs are often referred to as "ego-graphs" because they represent the network as seen by an individual entity. It has been shown that by using such graphs to detect communities, the result is closer to the expected communities than when a complete graph is used (see: "DEMON: a local-first discovery method for overlapping communities"). However, I didn't want to write yet another article about community detection. My goal was something else: Can we use this graph and group the pages by interest?

Common Interests Graph


To this end we apply a technique that I came across in a security paper: it's a projection of the graph in which vertices are connected by undirected edges if they like the same page. In "Intrusion as (anti)social communication: characterization and detection", the authors use this technique to identify clusters of source IP addresses which talk to the same target IP addresses. Most of the IP addresses are then assumed to be embedded within dense clusters, but some are connected to many clusters, which the authors define as intrusion (or, in their words: "Entering a community to which one does not belong"). The result for our "who likes whom" or "common interest" graph is shown below.
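
To make the projection step concrete, here is a minimal sketch in R. It assumes the crawled likes are available as a data frame `likes` with columns `liker` and `liked` (both hypothetical names) and uses base R's combn to connect every pair of pages that likes the same page:

  library(igraph)

  project.common.interests <- function(likes) {
    per.page <- split(likes$liker, likes$liked)
    edges <- lapply(names(per.page), function(page) {
      likers <- unique(per.page[[page]])
      if (length(likers) < 2) return(NULL)
      pairs <- t(combn(likers, 2))       # every pair sharing this common interest
      data.frame(from = pairs[, 1], to = pairs[, 2],
                 label = page, stringsAsFactors = FALSE)
    })
    ## undirected projection; the commonly liked page becomes the edge label
    graph.data.frame(do.call(rbind, edges), directed = FALSE)
  }

In the resulting graph the commonly liked page ends up as an edge attribute, which is what the green edge labels mentioned below are based on.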

To get an intuitive understanding of the graph, consider the vertices to be people standing in a room, talking with their neighbors. Some are part of tight groups (in fact, they are completely connected, forming a clique), some are literally caught between the groups, talking to several of them at once. Finally, the topic they talk about is the page they like (as denoted by the green labels on the edges). Now it becomes obvious that the graph from before indeed contains a diverse assortment of interests. The large cluster on the top is the set of pages which like "Wer braucht Feminismus" (Who needs feminism). Interestingly, this cluster is disconnected, which means that "Wer braucht Feminismus" is not liked by any other page in Blogf's ego graph. It is also composed of a quite diverse set of pages - ranging from political sites (such as the left-wing party "Die Linke") and several pages of federal states to sites clearly targeted towards women. There is one other large, much more homogeneous cluster which is also disconnected: the set of pages which like "Gender Equality Germany", mostly composed of human-rights-oriented pages. On the top right we find a community of start-up- and business-oriented pages, and the set of pages which like Blogf can be found at the center of the graph, well connected to the rest. Particularly interesting are the pages through which different clusters connect - such as "Gender Equality Germany" - because those pages bridge different communities through their diverse interests. As such, they are prime targets for closer collaboration (such as exchanging likes, cross-linking and so on).

Conclusion


For me it was quite interesting to see the different topics that circle around a page like Blogf. For Blogf, this not only means that there are several pages which still don't know about or like Blogf (or which, as a matter of fact, should be liked by Blogf), but it also gives a hint at which kinds of topics people who like Blogf are interested in - i.e. it could help the editor to select the right blog posts from the blogosphere. If you'd like to run the same analysis for your site, you can find the code here. The crawler which queries the likes from Facebook is written in Python; the plotting and projection tasks are done in R.

Update (04/28/2014)
Paul Sorenson from metrak.com provided some improvements to the code. Not only can you now use the command line to specify which pages you want to crawl and up to which depth - the real improvement is that the Python script now keeps the connection to Facebook alive, making the crawler run much faster. He also pointed me to the "combn" function in R, which does exactly what is needed to construct the projection - only much faster than my plain R implementation. Finally, he found a bug in my code - I didn't upload the version which actually prints the edge labels on the common interest graph. Sorry for that! Check out the update on the Bitbucket repository and many thanks to Paul!

Tuesday, April 1, 2014

From Random Walks to Personalized PageRank

In graph theory (and its applications) it is often required to model how information spreads within a given graph. This is interesting for many applications, such as attack prediction, sybil detection, and recommender systems - just to name a few. The question now becomes: can the spread of information be mathematically modeled and generalized, such that we can reuse the same model for any kind of information?

Random Walks

Let's consider the graph below. And let's assume that we drop a piece of information on the red vertex. Then we'd like to know a couple of things. For example, where does it spread first? How far does it spread? Will the vertices close to the red vertex obtain more information than those far away? And finally, will the information continue to go back and forth forever or will we reach a stable distribution eventually?
Surprisingly, the solution lies on the shoulders of small processes randomly walking through the graph. I say "surprisingly" because at first sight it might seem counter-intuitive to expect a deterministic answer from a random process. But randomness becomes deterministic when applied infinitely often: if you toss a coin once, nobody can predict which side will be up. But if you toss a coin an infinite number of times, we know that 50% of the time heads will be up - given that the coin is fair. Of course, in the case of graphs, we can't toss coins. But we can walk along the graph's edges. And well, we can do so randomly. For this purpose, we introduce the random walker. He starts at the red vertex, chooses an edge at random and moves to the vertex on the other side of that edge. If we repeat that experiment an infinite number of times, we obtain a deterministic distribution of random walkers, given that they started at the red vertex and were allowed to travel one hop away from it.
Well, that's boring. Of course we knew it: the random walkers had no choice but to walk along a single edge. But what happens if they are allowed to travel along two or three or four consecutive edges? It already becomes more difficult to see how this ends. The good news is that we don't even have to run the experiment infinitely often. In fact, we can use the matrix of transition probabilities A and multiply it with the vector containing the initial distribution of random walkers x: x' = Ax. That's for one hop - the second hop can be computed with x'' = Ax' and so on. Below, you see the distribution of random walkers after random walks of different lengths.
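
As a small illustration (this is not the code behind the animations), such a hop-by-hop computation might look as follows in R, assuming an igraph graph `g` and a start vertex index `seed` standing in for the red vertex:

  library(igraph)

  ## A[i, j] is the probability of walking from vertex j to vertex i,
  ## so every column of A sums to one
  ## (assumes every vertex has at least one edge)
  adj <- get.adjacency(g, sparse = FALSE)
  A   <- t(adj / rowSums(adj))

  x <- rep(0, vcount(g))
  x[seed] <- 1                   # all random walkers start at the seed vertex

  for (hop in 1:4) {
    x <- as.vector(A %*% x)      # distribution after one more hop
    print(round(x, 3))
  }
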
As we see, the random walkers go back and forth and in fact, they will continue to do so forever. This is because the graph is bipartite - meaning that vertices can be grouped in such a way that edges never connect any two vertices from the same group. For such graphs, the distribution of random walkers will never become stable. There is a solution, though: Lazy random walks.

Lazy Random Walks

The idea of lazy random walks is that we allow the random walkers to remain on a vertex with probability 1/2. Hence, our formula becomes x' = 1/2(A + I)*x. In this formula, I is the identity matrix and A is the original matrix of transition probabilities. In the animation below, the thickness of an edge corresponds to the geometric mean of the amount of information of its adjacent vertices. The color of the edge is determined by the "information current": The difference in the amount of information between the adjacent vertices. In other words: The thicker the edge, the more information flows along the edge. Edges with an adjacent vertex that has a high degree tend to be colored red. These are the vertices which contain a significantly larger part of information than the rest of the network.
As we can see, the distribution converges. And not only that: it also becomes less and less apparent where the information originated from. In fact, for the vertices which are well connected, the distribution of information (or random walkers - whatever you want to call it) approximates their degree distribution. This means that high-degree vertices will contain proportionally more information than low-degree vertices.
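
Sticking with the little R sketch from above (same assumptions, purely illustrative), the lazy variant only changes the matrix and can be iterated until the distribution stops changing:

  ## the walkers stay put with probability 1/2, which breaks the
  ## oscillation on bipartite graphs
  A.lazy <- 0.5 * (A + diag(nrow(A)))

  x <- rep(0, vcount(g))
  x[seed] <- 1
  repeat {
    x.new <- as.vector(A.lazy %*% x)
    if (max(abs(x.new - x)) < 1e-9) break   # converged to a stable distribution
    x <- x.new
  }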

But what if we actually wanted the vertex that contains the information initially to play an important role? One way to model that is to not stall on any vertex, but let the random walkers jump back to this specific vertex with a given probability (i.e. the teleport probability, alpha).

Personalized PageRank

It turns out that this is exactly what "Personalized PageRank" is all about. It models the distribution of rank, given that the distance random walkers (the paper calls them random surfers) can travel from their source (often referred to as the "seed") is determined by alpha. In fact, the expected walk length is 1/alpha. The formula now becomes x' = (1-alpha)*Ax + alpha*E. Here, alpha is a constant between 0 and 1 and E is the vector containing the source of information - i.e. in our case it is all zero, except for the red vertex where our information starts to spread.
In this animation, alpha is fixed to 1/2 in order to allow for a comparison with the lazy random walks. This is pretty high, though - with the result that our information indeed remains close to the seed vertex. In many cases we will want the random walkers to travel farther. Below is an animation for alpha = 0.1.
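
In the same sketchy R code as before (same assumptions: the transition matrix A and a seed vertex), the Personalized PageRank iteration would look like this:

  alpha <- 0.1                        # teleport probability
  E <- rep(0, vcount(g))
  E[seed] <- 1                        # the information originates at the seed

  x <- E
  repeat {
    x.new <- as.vector((1 - alpha) * (A %*% x) + alpha * E)
    if (max(abs(x.new - x)) < 1e-9) break
    x <- x.new
  }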

Conclusion

Now what is the verdict? First, any diffusion of information in a graph can be modeled with random walks. The most interesting parameter of this algorithm is the length of the random walk: the longer it is, the farther the information spreads. In sybil detection, for example, this is used to detect users which have only very few 'real' friends. The idea is that if we trust some real users of a social network, we can diffuse this trust among close friends of these trusted users by using random walks. If these walks are short enough, the trust does not diffuse to distant and sparsely connected users, which might be sybils, but remains within the well connected group of honest users. Look at the animation depicting the lazy random walks: the information indeed spreads well among the well connected vertices, but it does not spread to the cluster on the left, which is only connected through a single edge.

For well connected networks it quickly becomes irrelevant (i.e. already for short random walks) where the information started to spread. In the case of honest users this is a good property: we can choose any honest user, and no matter which one we choose, the diffusion of trust will converge to the same distribution among these well connected honest users. Sometimes, however, we want the initial source of information to play an important role. The ranking of web pages is such an example: if you set a homepage in your browser or visit the same set of web pages frequently, search engines use this fact and rank web pages higher if they are closer to the set of web pages you visit often. This closes the circle to the Personalized PageRank algorithm, which was designed to model exactly that. People, however, have applied it to many different domains, such as predicting future targets of cyber attacks or even community detection.

I hope this clarifies some parts of Personalized PageRank and how it relates to random walks. While this information is spread across several research papers and lectures, I thought it would be good to condense it into a single blog post. And even better: with animations of what's happening and why. If you want to alter some of the parameters and try it for yourself: check out the source code for the animations above in my BitBucket repository.

Tuesday, February 26, 2013

The Cost of Convenience

The previous posts were all about saving processing time with R. However, a question that came to my mind was: how fast would it be in C++? Or, in other words: what does the convenience of using R cost? Given that we consider really large data, this might be an important question to answer before the implementation is started.

Executive Summary

Last time we reached a runtime of 7s for 10,000,000 edges with the optimized version of LPA on top of R and igraph. Let's compare that to a simple C++ program - which does nothing but LPA. In a way, that is very unfair to R/igraph. Naturally, they lose performance by offering a more general interface which does much, much more than just LPA or graph processing. But in some cases it might indeed pay off to implement a specialized function once and have it run many times on large data sets in a fraction of the time.

If you don't wanna read through the details, here is the executive summary: the C++ LPA finishes in about 0.6s. Note that it is in no way optimized, other than the '-O3' offered by the compiler. This means it's more than 11 times faster than its R counterpart! However, as you'll see below, the code is longer and quite specialized.

Implementation Details

Let's start with the preamble - the data structures. We use a map to describe the graph. Intuitively, the keys are vertices and the values are the neighbors. However, we need to store some more information - e.g. the groups. There are, in fact, many options to do that. I chose not to rely on the ordering of the map (i.e. by using a "groups" vector which shares the index of the map, like we did in the R counterpart).

Instead, the values in my map are somewhat "fat": they include the group of the vertex, the list of neighbors and a buffer for proposals. The latter is needed because I chose to implement LPA in a two-phase manner: first, there is a proposal phase in which every vertex proposes its own group to its neighbors. Then, every vertex evaluates the received proposals and chooses a (possibly) new community based on them.

  #include <map>
  #include <vector>
  #include <algorithm>
  #include <string>
  #include <iostream>
  #include <fstream>
  #include <stdint.h>
  #include <ctime>

  // contains the neighbors, all proposals and the group of a vertex
  struct info {
    std::vector<uint64_t> neighbors;
    std::map<uint64_t, uint64_t> proposals;
    uint64_t group;
  };

  // the actual graph, implemented as a map
  typedef std::map<uint64_t, struct info> GraphT;
  GraphT G;

  // we use this to obtain the most frequent group in our neighborhood
  bool value_comparer(std::map<uint64_t, uint64_t>::value_type &i1,
                      std::map<uint64_t, uint64_t>::value_type &i2)
  {
    return i1.second < i2.second;
  }

This is the function we use to initially assign a unique group to every vertex, using the vertex names as groups. It is actually so boring that we can just skip it ;)

  // assign unique groups initially
  void initial_groups() {
    // run over all vertices and assign unique groups
    for (GraphT::iterator v = G.begin(); v != G.end(); ++v) {
      v->second.group = v->first;
    }
  }

Here it gets interesting. As noted above, this algorithm has two phases (1: proposal phase, 2: decision phase) which are repeated until no more vertices change their group. It is in no way surprising that this is reminiscent of consensus: LPA is in fact striving to reach a consensus within densely connected clusters of a graph.

  // the actual algorithm
  void run_LPA() {
    std::clock_t begin = std::clock();
    while(true) {
      std::cout << "proposal phase\n";
      bool done = true;
        
      // run over all vertices and send proposals
      for (GraphT::iterator v = G.begin(); v != G.end(); ++v) {
        for (std::vector<uint64_t>::const_iterator n = v->second.neighbors.begin();
             n != v->second.neighbors.end();
             ++n)
        {
          G[*n].proposals[v->second.group]++;
        }
      }

      std::cout << "decision phase\n";
      // run over all vertices and evaluate proposals
      for (GraphT::iterator v = G.begin(); v != G.end(); ++v) {
        if (v->second.proposals.size() == 0) continue;
            
        uint64_t new_grp = std::max_element(v->second.proposals.begin(),
                                            v->second.proposals.end(),
                                            value_comparer)->first;
        if (new_grp != v->second.group) done = false;
            
        v->second.group = new_grp;
        v->second.proposals.clear();
      }

      if (done) break;
    }
    std::clock_t end = std::clock();
    double elapsed_secs = double(end - begin) / CLOCKS_PER_SEC;
    std::cout << "Elapsed time: " << elapsed_secs << "s\n";
  }

Loading the graph and running the algorithm.

  // main program: read file, run LPA
  int main(int argc, char* argv[]) {
    // read the file
    std::ifstream graph;
    uint64_t src;
    uint64_t dst;
    
    graph.open(argv[1]);
    if (graph.is_open()) {
      while(graph.good()) {
        graph >> src >> dst;
        if (graph.eof()) break;
            
        G[src].neighbors.push_back(dst);
        G[dst];  // ensure vertices that only appear as destinations also get an initial group
      }
      graph.close();
    }

    // run the algorithm
    initial_groups();
    run_LPA();
  }

Final Remarks

Let me point out some final thoughts and observations. First, I do not even think that this is the most efficient code out there. For example, STL maps are not exactly known for their performance. I'd bet that a plain C program would be even faster. Moreover, if pursued seriously, one would also take a look at things like the Boost Graph Library or GraphChi.

Thursday, December 13, 2012

Community detection algorithm with igraph and R - (2)

In the last post I presented a slightly modified LPA algorithm from the igraph wiki. That algorithm needed around 40s for 10,000,000 edges and 1000 unique vertices. I promised that one could do much better. Here you go!

Iterators in igraph

To understand why this is possible, have a look at this quote from the igraph help:
"A note about the performance of the V and E functions, and the selection of edges and vertices. Since all selectors are evaluated as logical vectors on all vertices/edges, their performance is bad on large graphs. (Time complexity is proportional to the total number of vertices/edges.) We suggest using the neighborsincident functions and simple R vector operations for manipulating vertex/edge sequences in large graphs."
So these functions have a bad performance - but can we avoid them in our algorithm and, if yes, how much benefit will we get?

Iteratorless LPA

If you look at the algorithm again, you'll notice that the iterator function "V" is used to access the "group" property of the nodes. The problem is that the "group" property is accessed in the inner-most loop of the algorithm. This means that the overhead of each call to "V" is multiplied by how often we visit the inner-most loop - i.e. |V| times.
The solution is not to save the group property in the igraph object, but to save it in a regular (and fast) R vector. It's really as easy as that, because the indices in the vector are the same as the indices in the iterator - i.e. "V(g)[i]$group == vgroups[i]". This yields the following algorithm (the altered lines are those where the group is read from and written to vgroups instead of V(g)):

  LargeScaleCommunityFast <- function(g, mode="all") {
    cat("Assigning initial communities...\n")
    vgroups <- V(g)$name
    # random order in which vertices will be processed
    cat("Generating random order...\n")
    order <- sample(vcount(g), vcount(g))
    t <- 0
    done <- FALSE

    while (!done) {

      t <- t + 1
      cat("round: ", t, "\n")

      ## change to FALSE whenever a node changes groups

      done <- TRUE

      for(i in order) {

        ## get the neighbor group frequencies:
        group.freq <- table(vgroups[neighbors(g, i, mode = mode)])
        ## pick one of the most frequent:
        new.group <- sample(names(group.freq)[group.freq == max(group.freq)], 1)
        if (done) {
          ## we are only done if new group is the same as old group
          done <- (new.group == vgroups[i])
        }
        vgroups[i] <- new.group
      }
    }

    cat("Creating community-object...\n")

    comms <- list(membership = as.numeric(vgroups),
                  vcount = vcount(g),
                  algorithm = "LPA",
                  names = V(g)$name)
    class(comms) <- "communities"
    return(comms)
  }

Executing the algorithm

Running this algorithm against the same input as used in the previous post results in a runtime of 0.7s for 1,000,000 edges and 5s for 10,000,000 edges. That is almost 10 times faster than the previous version (7s and 40s, respectively).
You can do the same with properties of edges (like weights) - this is one of the things I'd like to write about next.

Monday, December 10, 2012

Community detection algorithm with igraph and R - (1)

In the first entry on this blog I gave an example of how to load huge graphs with R. The biggest problem, however, is actually doing something useful with huge graphs. This post is somewhat of a preparation for the next post on iterators in igraph. I thought it would be better to explain the limitations, and how to avoid them, with a real example: community detection. igraph actually provides several different community detection algorithms which are written in plain C and therefore run quite fast. However, it does not provide a label propagation algorithm (LPA), which belongs to the broader spectrum of modularity optimization algorithms.

How does it work?

Essentially, LPA tries to find a consensus of labels in densely connected clusters (communities) of a graph. It does so in several iterations. In the first iteration, every vertex in the graph chooses a unique community label - for example, its own name. Every vertex then "sends" its own label to all its neighbors.
In every subsequent iteration, each vertex counts the labels it received from its neighbors in the previous iteration and selects the label with the highest count. In the second iteration, each label will be received only once (it is as if all neighbors sent their own, unique, id). Whenever there is no clear winner in the counts, the vertex selects one of the most frequent labels at random.
Ultimately, the algorithm terminates when, during the current iteration, no vertex had to change its label.

You can see that eventually some labels will win over others. If you do not quite believe that this works out nicely, you are not wrong either: "Resolution limit in community detection" is just one of many scientific publications on issues with community detection.

An implementation in igraph

There is actually a basic implementation of this algorithm on the igraph wiki. Here is a slightly adapted version which creates an actual community object. This way, you can use it with other igraph algorithms as well (for example, to create nice plots).

  LargeScaleCommunity <- function(g, mode="all") {
    cat("Assigning initial communities...\n")
    V(g)$group <- V(g)$name
    ## random order in which vertices will be processed
    cat("Generating random order...\n")
    order <- sample(vcount(g), vcount(g))
    t <- 0
    done <- FALSE
  
    while (!done) {
      t <- t + 1
      cat("round: ", t, "\n")
      ## change to FALSE whenever a node changes groups
      done <- TRUE

      for(i in order) {
        ## get the neighbor group frequencies:
        group.freq <- table(V(g)[neighbors(g, i, mode = mode)]$group)
        ## pick one of the most frequent:
        new.group <- sample(names(group.freq)[group.freq == max(group.freq)], 1)
        if (done) {
          ## we are only done if new group is the same as old group
          done <- (new.group == V(g)[i]$group)
        }
        V(g)[i]$group <- new.group
      }
    }

    cat("Creating community-object...\n")
    comms <- list(membership = as.numeric(V(g)$group),
                  vcount = vcount(g),
                  algorithm = "LPA",
                  names = V(g)$name)
    class(comms) <- "communities"
    return(comms)
  }

Input

One obvious application for these algorithms are graphs of social interaction. Since I wanted to make this as independent as possible from external data sources, we'll have to construct an artificial graph. The interesting thing is that the structure of social graphs differs in only one important point from purely random graphs: they are "scale-free". This means that very few vertices will have the majority of the edges, and the majority of the vertices will have very few edges. It is actually easy to see why that is the case: a famous artist will have many connections (e.g. followers on Twitter or g+), but the ordinary person will have only a few.

Generating such a graph in R is quite easy:

  cat("--- Creating data.frame...\n")
  start <- proc.time()
  df <- data.frame(src = sample(1:100, 1000000, replace = TRUE, prob = exp(0.5*(1:100))),
                   dst = sample(1:100, 1000000, replace = TRUE, prob = exp(0.5*(1:100))))
  cat(sprintf("--- elapsed time: %fs\n\n", (proc.time() - start)[1]))


In contrast to last time, we provide a vector of probabilities to the sample: "prob". This vector is constructed with exponentially growing weights - i.e. of the 100 available vertices, the sampling will pick vertex 100 much more often than vertex 1, which gives us the heavily skewed degree distribution we are after.
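
To get a feeling for how skewed these weights are, here is a toy example with just 10 vertices (purely illustrative, not part of the benchmark): almost the entire probability mass sits on the highest-numbered vertices.

  w <- exp(0.5 * (1:10))
  round(w / sum(w), 3)
  table(sample(1:10, 10000, replace = TRUE, prob = w))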

Running the algorithm

Subsequently, we create the graph-object and run the community-detection:

  cat("--- Creating graph...\n")
  start <- proc.time()
  vertex.attrs <- list(name = unique(c(df$src, df$dst)))
  edges <- rbind(match(df$src, vertex.attrs$name), match(df$dst, vertex.attrs$name))
  G <- graph.empty(n = 0, directed = T)
  G <- add.vertices(G, length(vertex.attrs$name), attr = vertex.attrs)
  G <- add.edges(G, edges)
  cat(sprintf("--- elapsed user-time: %fs\n\n", (proc.time() - start)[1]))

  cat("--- Detecting communities...\n")

  start <- proc.time()
  C <- LargeScaleCommunity(G)

  cat(sprintf("--- elapsed time: %f\n\n", (proc.time() - start)[1]))


Running this example on my laptop results in a runtime of around 7s. Not that bad, right? But wait - it's a fairly small graph with only 100 different vertices and 1,000,000 edges. For 1000 unique vertices and 10,000,000 edges it's already around 40s. I don't dare to run this for a really large graph on my notebook. The good news, however, is that it can be much faster. But that's material for the next post.

Thursday, December 6, 2012

Almost stateless aggregate

Recently, I had to aggregate the third column of a huge file by groups of the first and second columns. You'll find many examples on the web of how to aggregate a file by groups. However, they usually construct an associative array along the way and print the whole array at once in the end. But what happens when you run out of memory before you can dump the array to disk?

One solution

One solution is to sort the file first - such that the groups appear in consecutive blocks, rather than spread across the complete file. This permits an (almost) stateless implementation of the aggregate, which I'll show below using "awk". Of course, you could use any other tool just as well.

Doing it the *nix way

If you know your way around "sort", it's quite easy to re-order the lines of the file in such a way that the result is grouped by the first two columns. Since the file is comma-separated, we also tell sort to use ',' as the field delimiter:

  sort -t, -k1,2 <input>

This will turn a list like this:

  1,2,10
  4,8,11
  1,4,12
  1,2,13
  4,8,14

into a grouped list:

  1,2,10
  1,2,13
  1,4,12
  4,8,11
  4,8,14


All you need now is this "awk" script, which does the aggregation on such a sorted file without consuming more memory than is needed to store a single entry in an associative array.

  # executed before everything else
  # tells awk, that we're dealing with
  # a CSV-file
  BEGIN {
    FS = ","
  }

  # executed on every line of the file
  {
    if (!(($1 "," $2) in current) && NR > 1) {
      for (s in current) {
        print s "," current[s]
        delete current[s]
      }
    }

    current[$1 "," $2] += $3
  }

  # executed after the last line has been read
  END {
    for (s in current) {
      print s "," current[s]
    }
  }

The key point is that the script uses the first and second columns of the file as the key for the associative array: "current[$1 "," $2]" therefore points to the aggregated value of the third column for the current group. Once we encounter a combination of "$1 "," $2" which does not exist yet (i.e. a new group), all entries in the array are printed (there is always just one) and deleted. Then the new group is added and so on. The last group is printed when we reach the "END" part of the script.