1. Walkthrough: A MongoDB Map/Reduce Job 2 years ago

    Map/Reduce is a great way to do aggregations and ETL-type operations with MongoDB. Unfortunately, it can be a bit tricky when you’re just getting started. Kyle did a great post explaining the basics, and there are a couple of examples on the MongoDB Cookbook - it’s probably best to read through those first if you’re totally new to map/reduce or MongoDB. Anyway, for this post I thought it’d be helpful to walk through a real-world example of a map/reduce job we’re using at Fiesta.

    The Problem

    Fiesta supports custom domains (white-labelled service) and an API for creating mailing lists  - both of those features are in private beta (let us know if you’re interested in trying them out).

    Recently, another startup started using the API to create lists for one of their projects. Immediately after their service started creating lists, a request came in to provide some analytics on how people were using their lists. This was something we were planning on adding, but hadn’t gotten around to yet. Anyway, I set out to throw some quick analytics together for custom domains.

    The Data

    By default, Fiesta never stores any message content; we do, however, store some message metadata. Here’s an example of a MongoDB document that gets stored for each incoming message:

    {
      "_id": oid,
      "from": from_address,
      "group_name": name,
      "domain": domain,
      "group_id": group_id,
      "message_id": message_id,
      "parent_id": parent_id,
      "thread_id": thread_id
    }

    Since the _id field is an ObjectId, it contains a built-in timestamp. Between the timestamp and the domain field, we’ve got all the data we need to provide some analytics about how many messages were sent on a given custom-domain over time. Let’s jump into the map/reduce that we use to compute those analytics.

    The Map/Reduce

    Let’s start by looking at our map() function:

    function map () {
      var ts = this._id.getTimestamp();
      var month = ts.getUTCFullYear() + "-" + (ts.getUTCMonth() + 1);
      var stats = {};
      stats[month + "-" + ts.getUTCDate()] = 1;
      emit(this.domain, stats);
    }

    The map() function will be called once for each document in the “messages” collection. Our goal here is just to group each message by domain and date. We emit() our statistics using the message’s domain as the key. This means that after the reduce() step we’ll end up with one result per domain. The value that we emit is basically just the date the message was sent, something like this:

    {
      "2011-10-2": 1
    }

    After the map() step is finished, we’ll have a bunch of those timestamp documents associated with each custom domain. The next step is to reduce them into a single document:

    function reduce(key, values) {
      var out = {};
      function merge(a, b) {
        for (var k in b) {
          if (!b.hasOwnProperty(k)) {
            continue;
          }
          a[k] = (a[k] || 0) + b[k];
        }
      }
      for (var i=0; i < values.length; i++) {
        merge(out, values[i]);
      }
      return out;
    }

    The key in this case is a specific domain, like “example.org”, and values is an Array of emitted values for that key. Our goal is just to go through each of the emitted values and sum them up, so we get a total for each date. The important thing to remember, and the cause of most map/reduce bugs, is that the values array can contain values emitted during the map() step or values computed from a previous reduce() iteration. We need to be careful to handle both of those cases in our reduce function (a good approach is to keep the cases similar - make sure the output of a reduce step “looks like” the output of the map step).

    In this case, the merge() function takes care of that for us. We walk through each item in the values Array, and merge it with the running total. All merge() does is get each key in one document and add its value to the value of the corresponding key in the other document (or 0). When all is said and done, we’ll end up with results that look like this:

    {
      "_id": "example.com",
      "value": {
        "2011-9-26": 21.0,
        "2011-9-28": 25.0,
        "2011-10-1": 142.0,
        "2011-10-2": 16.0
      }
    }

    Putting it all Together

    For the sake of completeness, let’s take a look at how we might run this entire job from Python, using PyMongo:

    def compute_domain_analytics():
        """
        Compute statistics about the number of messages being sent to a domain.
        """
        map = """function map () {
      var ts = this._id.getTimestamp();
      var month = ts.getUTCFullYear() + "-" + (ts.getUTCMonth() + 1);
      var stats = {};
      stats[month + "-" + ts.getUTCDate()] = 1;
      emit(this.domain, stats);
    }"""
    
        reduce = """function reduce(key, values) {
      var out = {};
      function merge(a, b) {
        for (var k in b) {
          if (!b.hasOwnProperty(k)) {
            continue;
          }
          a[k] = (a[k] || 0) + b[k];
        }
      }
      for (var i=0; i < values.length; i++) {
        merge(out, values[i]);
      }
      return out;
    }"""
        db.db.messages.map_reduce(map, reduce, out="domain_analytics")

    This function can be called regularly (using a cron job or the like), for Fiesta we run it once an hour to keep the analytics fresh.

  2. Comments
  1. victoriastoica reblogged this from fiestacc
  2. bitpimpin reblogged this from fiestacc
  3. fiestacc posted this