Securing HTTP handlers in Go

So you’ve got a Go web application, and you want to secure your handlers… easy, right?

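package user

import (
    "net/http"

    "github.com/gorilla/pat" // assumed: the router providing pat.Router
)

func routes(p *pat.Router) {
    p.Post("/user/password", changePassword) // path is illustrative
}

func changePassword(w http.ResponseWriter, req *http.Request) {
    identity, err := data.GetUserFromAuth(req.Header.Get("Authorization"))
    if err != nil {
        w.WriteHeader(http.StatusUnauthorized)
        return
    }

    // ...
}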

The problem with that? It gets repeated for every handler.

Making it easier

We can create a helper function which does this bit for us – and we can wrap our handlers using it, centralising authentication and enforcing it at the routing layer.

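package user

import (
    "net/http"

    "github.com/gorilla/pat" // assumed: the router providing pat.Router
)

func routes(p *pat.Router) {
    p.Post("/user/password", isAuthenticated(changePassword)) // path is illustrative
}

func isAuthenticated(f func(w http.ResponseWriter, req *http.Request)) func(w http.ResponseWriter, req *http.Request) {
    return func(w http.ResponseWriter, req *http.Request) {
        // the identity isn't used yet, so discard it to keep the compiler happy
        _, err := data.GetUserFromAuth(req.Header.Get("Authorization"))
        if err != nil {
            w.WriteHeader(http.StatusUnauthorized)
            return
        }

        f(w, req)
    }
}

func changePassword(w http.ResponseWriter, req *http.Request) {
    // ...
}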

All we’ve done is define a function which accepts a handler function, and returns another handler function. The returned function wraps the one we passed in, performing authentication and potentially returning a 401 error to the client.

The request doesn’t go anywhere near the handler without authentication. If authentication is successful, we call the inner handler as normal.
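To see the wrapper doing its job, here is a minimal sketch that runs using only the standard library. lookupUser is a hypothetical stand-in for data.GetUserFromAuth, and the token, the /password path and the statusFor helper are all made up for the demonstration.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// lookupUser is a hypothetical stand-in for data.GetUserFromAuth.
// It accepts one hard-coded token purely for demonstration.
func lookupUser(auth string) (string, error) {
	if auth == "Bearer secret-token" {
		return "alice", nil
	}
	return "", errors.New("invalid credentials")
}

// isAuthenticated wraps a handler and answers 401 before the
// inner handler is ever called if authentication fails.
func isAuthenticated(f http.HandlerFunc) http.HandlerFunc {
	return func(w http.ResponseWriter, req *http.Request) {
		if _, err := lookupUser(req.Header.Get("Authorization")); err != nil {
			http.Error(w, "unauthorised", http.StatusUnauthorized)
			return
		}
		f(w, req)
	}
}

func changePassword(w http.ResponseWriter, req *http.Request) {
	fmt.Fprint(w, "password changed")
}

// statusFor sends a test request through h and returns the status code.
func statusFor(h http.Handler, auth string) int {
	req := httptest.NewRequest(http.MethodPost, "/password", nil)
	if auth != "" {
		req.Header.Set("Authorization", auth)
	}
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	h := isAuthenticated(changePassword)
	fmt.Println(statusFor(h, ""))                    // 401
	fmt.Println(statusFor(h, "Bearer secret-token")) // 200
}
```

Using httptest keeps the check in-process, so there is no need to start a real server just to confirm the 401.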

But there’s a problem

Two, actually.

Passing the identity around

In the example above, we need to change the user’s password – but which user?

We can pass a header, use a global context variable, or define our own (type-incompatible) implementation of http.Request with an identity field, etc. But they’re not very nice solutions.


The second problem is more difficult to manage. Since we’re now trusting authentication to a wrapper, we need to make sure we don’t accidentally change our routing to allow unauthenticated requests through. Risky!

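This, for example, would now be a very bad idea:

    p.Post("/user/password", changePassword) // path is illustrative; note the missing isAuthenticated wrapper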


We could add an assertion to every handler, but then there probably wasn’t much point centralising the authentication code into a wrapper in the first place.

A nice solution

There’s a clean solution which solves both of those problems.

package user

import (
    "net/http"

    "github.com/gorilla/pat" // assumed: the router providing pat.Router
)

func routes(p *pat.Router) {
    p.Post("/user/password", isAuthenticated(changePassword)) // path is illustrative
}

type Identity string

func isAuthenticated(f func(w http.ResponseWriter, req *http.Request) func(Identity)) func(w http.ResponseWriter, req *http.Request) {
    return func(w http.ResponseWriter, req *http.Request) {
        identity, err := data.GetUserFromAuth(req.Header.Get("Authorization"))
        if err != nil {
            w.WriteHeader(http.StatusUnauthorized)
            return
        }

        f(w, req)(Identity(identity))
    }
}

func changePassword(w http.ResponseWriter, req *http.Request) func(Identity) {
    return func(identity Identity) {
        // ...
    }
}

By getting our handlers to return another function, we’ve avoided potential routing errors. It also gives us the opportunity to use a closure to pass the identity around.

As an extra safety measure, we’ve made Identity its own defined type – if we add other wrappers in future, we can’t mix them up in our routing, and we can’t accidentally call the handlers from an unauthenticated part of our code – we have compile-time handler safety!
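Here is a runnable sketch of that idea using only the standard library. As before, lookupUser is a hypothetical stand-in for data.GetUserFromAuth, and authedHandler, respond, the token and the path are names invented for this example.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// Identity is a distinct type, so a plain string can't be passed by accident.
type Identity string

// authedHandler (a made-up name) is a handler that returns a function
// which still needs an Identity before it does any work.
type authedHandler func(w http.ResponseWriter, req *http.Request) func(Identity)

// lookupUser is a hypothetical stand-in for data.GetUserFromAuth.
func lookupUser(auth string) (string, error) {
	if auth == "Bearer secret-token" {
		return "alice", nil
	}
	return "", errors.New("invalid credentials")
}

func isAuthenticated(f authedHandler) http.HandlerFunc {
	return func(w http.ResponseWriter, req *http.Request) {
		identity, err := lookupUser(req.Header.Get("Authorization"))
		if err != nil {
			http.Error(w, "unauthorised", http.StatusUnauthorized)
			return
		}
		f(w, req)(Identity(identity))
	}
}

func changePassword(w http.ResponseWriter, req *http.Request) func(Identity) {
	return func(identity Identity) {
		fmt.Fprintf(w, "password changed for %s", identity)
	}
}

// respond sends a test request through h and returns status and body.
func respond(h http.Handler, auth string) (int, string) {
	req := httptest.NewRequest(http.MethodPost, "/password", nil)
	if auth != "" {
		req.Header.Set("Authorization", auth)
	}
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	return rec.Code, rec.Body.String()
}

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/password", isAuthenticated(changePassword))
	// mux.HandleFunc("/password", changePassword) would not compile:
	// changePassword no longer has the plain handler signature.

	code, body := respond(mux, "Bearer secret-token")
	fmt.Println(code, body) // 200 password changed for alice
	code, _ = respond(mux, "")
	fmt.Println(code) // 401
}
```

The commented-out line is the point: forgetting the wrapper is now a compile error rather than a security hole.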


Management and the frozen lake (and how coffee helps)

It’s winter, and you’ve gone out for a walk. You’ve already passed the shops and an occasional snowman. You smile, briefly, but carry on. In front of you there should be a lake – you’re convinced there should be a lake – but it looks frosty, dull, and blends into its surroundings. You doubt yourself, perhaps it isn’t a lake after all?

You carry on walking – happy that you can get home sooner. It’s cold out here, and the warmth of home seems far away, you’re in a hurry. Something doesn’t feel right, it sounds a bit crunchier than land, not quite what you’d expect – but you think what the hell, if it looks like land, and it works like land, it must be land. So you continue – you reason with yourself that it’s probably too late to go back now anyway.

After a few minutes, you think you’re getting close. So close you can see home, you can nearly feel it. The tingly cold feeling slowly replaced with tingly warm. Then all of a sudden, it’s gone! Air replaced with water, your vision of home with one of drowning, anxiety and death. It seems it wasn’t land after all.

Months later, following an expensive rescue operation and a long stay in hospital, you’re back home – everything seems fine now, winter has gone and the lake is definitely back (you’ve checked, it’s water!).

You begin to remember falling in the lake, and soon have a realisation. You’ve fallen in the lake every year – and you keep on doing it! What’s even stranger – you hear of lots of other people doing it too. Somehow, you’ve adapted to ignoring the ice, falling in the lake, and you hadn’t even noticed.

You wonder to yourself, why isn’t everyone falling in the lake? They have the same environment, the same hats, gloves and scarves. Why aren’t they getting wet? How are they getting home? They must just be lucky.

So why are some people just so damn lucky?

Now it’s winter again – you’ve nipped out for your annual walk to wherever. But this year, on your way back, you try something a bit different. You’ve walked past the shops, and the snowmen, but you stop. You buy a coffee. There’s a bench – it’s cold, but you’ve wrapped up warm and you’re in no hurry. It’d be nice to be home, but you don’t really want to spend another few months in hospital.

So you wipe away the snow, take a seat, and sip at your coffee, burning your lips, then pretending to “smoke”. Ahh, how relaxing.

After just a few minutes, someone walks past. They’re in a hurry – you don’t know (or care) why. They get to where you thought the lake should be. They pause for a moment, then carry on – they’ve got somewhere to be. But you don’t follow – besides, you’ve still got some coffee left, and the bench feels a lot warmer than when you got here!

You don’t have to wait long until a few more people walk past, and they all hurry off in the same direction.

After a longer wait (though probably still not that long, your coffee is definitely still hot!), you spot something a bit unusual. Someone has got to where the lake should be, but they’ve stopped. How bizarre, you think to yourself, and take another sip of your coffee.

You see them crouch down, and hear a small scraping noise. It’s a bit too dark to see what they’re doing, but you watch patiently anyway. Unexpectedly, they stand up, turn around, backtrack a few steps, and walk off in a completely different direction. You’re confused, nobody else has done that, but most importantly, you still have coffee.

As time passes, you see more and more people walk past. Some of them stop, crouch down, then walk off in different directions. But most of them carry on walking – they’re in too much of a hurry to stop, or even consider taking the longer route. It is cold after all – they’d seen the weather warnings, but they weren’t prepared enough for this.

Your coffee’s gone. Damnit, but at least you’re warmed up a bit for the walk home. Just as you stand up you hear a scream (definitely sounded like those people you saw earlier), a deafening crack, and a very wet sounding splash. You see a crack rip towards the shore, and the ice disappears into the water below.

You walk off, with a grin on your face – clearly pleased with yourself – and you make it home. A little later than expected, but sooner than last year! You didn’t follow the others, you did something different too, and although the results weren’t immediate they were definitely worth waiting for.

Uh, seriously… what?

I think I enjoyed writing that way too much – but there is a serious point to be made.

Over time, organisations and businesses can become frozen lakes. Managers and employees become the people and the fish, separated by thick, unbreakable ice. And although the results can occasionally be spectacular, the dangers are hidden, and often overlooked or massively underestimated.

And the solution is to repeat the same mistakes, use the same broken logic and failing communication mechanisms which created exactly the problem we’re trying to solve. And somehow we forget, or remember inaccurately, and do it again and again.

So what does it all mean?

Your perspective on this will depend on your position in your organisation.

The higher up in management you are, the more like the “people” you are. You can hurry out onto the ice, oblivious to the precarious situation you’re in, and all you see is a distorted reflection of your surroundings, everything just looks “kind of ok”. One wrong move, disaster for you – wait too long, disaster for you and the fish!

And lower down, you’re more like the “fish” – or whichever lake-dwelling animal you prefer. You can clearly see the danger, it’s the overpowering dullness above you, but you’re powerless to help. You’re trapped – you need air, but the surface is unreachable. You can scream as much as you can – but the ice, as well as hiding danger, is an effective sound barrier, and nobody hears a thing. And even if they did, they’re in too much of a hurry to notice.

But you saw in the story, some people got home – they didn’t make the same mistakes, they didn’t get wet. And they even helped a few fish along the way.

So what did they do differently? Why didn’t they make the same mistake? And how did they help those damn fish?

You need to gently break the ice, metaphorically speaking

Those people you saw crouching down – they weren’t “managers”.

They were leaders.

And that’s a significant difference!

They were scratching away at the surface – digging deeper and deeper into the ice. Of course, they wanted to serve their own purpose, to find the dangers below to avoid them. But they went about it differently – and it had a side effect, one even they weren’t expecting. They cut a hole in the ice, and a lucky few fish got the air they desperately needed.

Most importantly, instead of trusting that if it looks like ice, it must be ice, they took the initiative to check. And by checking for themselves, they found out that ice could be misleading. It could easily distort their view of the world, hiding dangers and increasing risks, despite the largest population (the fish) seeing things clearly.

The same applies in organisations – and this is the mistake you made in the story. Year after year, despite repeating the same mistakes, and knowing their outcome, you used the same technique (“if it looks like ice”) to deliver the same failed results (expensive rescue and lengthy hospitalisation), then clearly remembered the technique and completely forgot the outcome. Ad infinitum.

The leaders were different. They didn’t trust what worked before – especially since they knew it didn’t actually work last time – they slowed down, and found out for themselves, to avoid repeating the same mistakes again. They didn’t “ask the ice” what was happening below – they broke through it, went straight to the source, and got far clearer, and far more valuable information as a result. And helped a few fish survive of course!

What does this mean for me?

As either a manager or employee, your view of the ice is similar – everything looks dull and bland, but kind of similar to how things look now (the distorted reflections and white dullness). The management-gap between employees and management is the “ice” – and the more levels of management there are, the thicker the ice, the more distorted the view, and the harder it is to break through, from either side.

As a manager, particularly holding a senior position in an organisation, you should stop relying on (often) historical management structures to communicate with your employees. Go straight to the source – talk to them, spend time with them, and you’ll get a lot more back. Don’t ask the ice, break it.

And if you’re near the top, don’t just help the fish near the top, help them all, even the ones right at the bottom. And I don’t mean every year either – I mean every day. Live with the fish if you have to – they provide the food that makes you what you are, protect them and care about them.

It’ll be weird at first, and your employees might panic, but they’ll adjust. They’ll benefit too, but they’ll probably hate you until they can see it. That’s just tough, buy their temporary appreciation with nice toys if you have to – but work hard to earn their respect and trust, and in return you’ll get loyal, engaged and probably happy workers, who’ll back you up – even when you make mistakes.

Um, tell me a bit more about those fish?

If you’re a fish, you have a lot to gain from this too. But often you’ll eventually lose. While the “people” have the power to harm you, it requires commitment and participation from both sides if we want to keep winning.

You can just about survive when the people take the shortcut and smash the ice. It’s a turbulent few minutes, but as an uninjured fish, you’ll be fine again until next year.

Some fish are unlucky – the annual strain gets too much, and eventually they die, or they sadly get their faces smashed in with a flying shard of ice (intensity intended… it happens to some employees, hopefully not literally speaking!). But however they go, they take with them some knowledge, or skill, which would ultimately benefit the school in warmer times.

But when someone does break through the ice, even if only a tiny bit, you get air, a lifeline. It might not be much, but it helps repair just a bit of the damage – and now someone knows, maybe they’ll be able to help a bit more next year?

And what’s this about coffee?

The coffee is a metaphor for taking a break – sitting back and watching. See what people do. Don’t act. It’s far better to do the right thing later than do the wrong thing now, but sadly it’s too tempting to do the wrong thing now – the results come sooner, and often whether they’re good or bad is unimportant.

Management stuck in the frozen lake cycle just need to take a break, stop intervening – let people do their jobs. And help them when they ask.

If the fish ask for comically small pickaxes, and you just happen to have some (or can afford to buy some), help them, act quickly! They’re probably not crying wolf (or shark, perhaps?). And if they don’t ask, don’t give them anything – if you have to think they need it, then they probably don’t.

But it ONLY applies to management. Clearly fish can’t drink coffee.

If you’re a fish (or an employee, as is more likely!), you need to do the exact opposite. You’re trapped, you can’t get air, and you’re going to die soon. And if you don’t die, you’ll need time to heal, time to rest, a long summer before the next cold winter. You could just go about your fishy business, but you can’t do that for long – one way or another, disaster is coming!

So what can you do? Remember those tiny holes in the surface – the ones the leaders dug? Find them – scream and shout at them, don’t leave them alone. They’re the only ones who stand a chance of hearing you. And importantly, they’re up there – if they shout loud enough too, perhaps they can get other people to help – break apart the ice and save the fish, instead of just making sure they get home safe.

Sadly, as in nature, it might not work. If the people at the top are tricked by the distorted reflection they see, and no leaders walk past to help, you’ll either have some sense of normalcy followed by a major blowout (highly likely), repeated consistently, or you’ll suffer and eventually die (thankfully far less likely).

Happily ever after

So I conclude my story about management and frozen lakes (and some snowmen, fish and a coffee).

Disappointingly, there is no happily ever after. Not in this story, that’s something you’ll have to get over. Please don’t cry. Ok, cry – I’ll wait.

But there is a “happily for as long as we keep trying”. If the people keep digging and the fish keep screaming, there is some hope – and eventually you get home, warm and dry, and the fish get to live (albeit cold and wet). And you’re all better for it – you won’t make the same mistakes next year.

Better still, you won’t even blindly walk around the lake next time – despite your commitment to remembering it’s a lake – because that won’t help the fish.

Every year, from now on, you’ll stop, have a coffee, dig through the ice, and save the fish! And you can be happy knowing that everything is working just how it should – you’re sure it is, because the ice has gone and you can see clearly again.

And although fish might not be able to scream, you can!

(p.s. generally harassing your managers about stuff is as good as screaming – it’s quite possible that you’ll get escorted off the premises if you literally start screaming in work, and may end up in a mental hospital if you persist – be warned)


Introducing Go-MailHog

Go-MailHog is a lightweight portable application which acts like an SMTP server.

It was inspired by MailCatcher, and does almost exactly the same thing, but without the slow and painful installation you get with Ruby.

Edit: I originally blamed Python here. It was Ruby, not Python – but painful nonetheless!

It was originally written in Perl based on code from M3MTA, but it’s been rewritten in Go for portability, and now runs on any supported platform without installation.

MailHog is designed for testing emails during application development. Instead of using a real SMTP server which delivers messages or has strict rules on email recipients, you can use MailHog and send messages to any email address with a valid format.

Instead of being delivered, messages are stored in MongoDB (edit: or memory!), and you can view them using the MailHog web interface, or retrieve them using an API for automated testing.

Using MailHog is as simple as downloading the binary release from GitHub and running it.

With no configuration, it will listen on port 1025 for SMTP, 8025 for HTTP (both web and API), and will connect to MongoDB running on localhost port 27017.

To view messages, point your browser at http://localhost:8025 or send a GET request to http://localhost:8025/api/v1/messages.
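As a rough sketch of what an automated test might do, the following Go program sends a message to MailHog and then reads the message list back from the API. It assumes MailHog is running locally on the default ports described above. The email addresses and the buildMessage helper are made up for the example, and the API response is printed as-is rather than parsed.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/smtp"
)

// buildMessage assembles a minimal email with CRLF line endings:
// headers, a blank line, then the body.
func buildMessage(from, to, subject, body string) string {
	return "From: " + from + "\r\n" +
		"To: " + to + "\r\n" +
		"Subject: " + subject + "\r\n" +
		"\r\n" + body + "\r\n"
}

func main() {
	from, to := "app@example.com", "anyone@example.com"
	msg := buildMessage(from, to, "Hello", "Testing MailHog")

	// Default MailHog SMTP port; nil means no SMTP authentication.
	err := smtp.SendMail("localhost:1025", nil, from, []string{to}, []byte(msg))
	if err != nil {
		fmt.Println("send failed (is MailHog running?):", err)
		return
	}

	// Default MailHog HTTP port; the same endpoint mentioned in the post.
	resp, err := http.Get("http://localhost:8025/api/v1/messages")
	if err != nil {
		fmt.Println("API request failed:", err)
		return
	}
	defer resp.Body.Close()

	data, err := io.ReadAll(resp.Body)
	if err != nil {
		fmt.Println("reading response failed:", err)
		return
	}
	fmt.Println(string(data))
}
```

If MailHog isn’t running, the program just reports the failed connection and exits.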



Action composition in Mojolicious

Something about the routing in Mojolicious has been making things difficult, and Play Framework had the answer.

Full source code for these examples can be found on GitHub

Routing differences in Play Framework and Mojolicious

It wasn’t obvious at first, but the routing model in Mojolicious quickly becomes an unfathomable mess – and very difficult to debug.


Mojolicious:

  • Routes are complicated, can be chained together, and can have bridges and conditions which control the way routes behave:
    • Conditions are synchronous, and can switch between sub-routes on a per-request basis
    • Bridges can be asynchronous, but always ultimately end at the same destination (e.g., you couldn’t have two identical routes running through different bridges, since Mojolicious couldn’t determine which to use or if a route even exists, which is why conditions are synchronous).
  • Controller actions are plain subs, called directly at run-time by the Mojolicious router
    • Actions manipulate the stash and call Mojolicious methods directly (e.g. render)
    • The return value from an action is of limited use

Play Framework:

  • Routes are simple – they’re written statically and have no concept of bridges or conditions.
    • One route can point to one action, limited flexibility at the routing layer
  • Actions are composed by nesting action functions, with outer functions calling inner functions
    • The action returns either a result, or a future result
    • Actions can choose to call other actions, or contain multiple nested actions inside one action

Action composition

At first, the Play Framework routing model seemed very inflexible. Routes are static, and there’s no concept of bridges, conditions or any other way to intercept a request.

With Mojolicious, bridges and chained routes make it easy to abstract the ‘how I get there’ logic away from the ‘once I’ve got here’ logic – and the Play Framework way of life didn’t seem to offer the same flexibility.

Play Framework uses action composition, which means nesting actions inside other actions, and it felt a bit too restrictive.

But action composition quickly proved to be the more powerful of the two. And definitely the easiest to understand, especially when tracing requests through a complicated web application.

Why the Mojolicious model starts well, but ends badly

The Mojolicious model is extremely powerful for a simple application – we can define route stubs or bridges, attach more routes to those, intercept requests in bridges and return 404 or 500 errors. Which is great – but then we end in a messy refactoring nightmare:

  • The code used in routing (e.g. authenticated bridges) needs to be shared between applications
  • We move the code into a shared library, and get it to create named bridges
  • We update our applications to use the named bridge from the shared library

All good so far.

Now we want to use a hook to intercept some part of the request – adding headers for example:

  • We write a plugin which registers a hook
  • The hook adds custom headers to a response
  • We include the plugin in our application, and all responses get new headers

Still good. But once we’ve done this a few times, we end up with all routing and request manipulation being done by shared libraries. We have limited code visibility, and all we get to show for our efforts are:

  • A few ->plugin lines, adding ‘unknown’ functionality to our routing

  • A dangerous action sub, with no idea how the request gets there

    sub this_should_be_authenticated_and_authorised {
      my $self = shift;
      $self->render(text => 'hope I really was authenticated...');
    }

And now we’re potentially screwed.

Mojolicious routing has become well hidden technical debt (or a serious defect/PR disaster) waiting to bite.

Why the Play Framework model is better

Although routing becomes far more static in Play Framework, we can still refactor our routing code into shared libraries.

But there’s one important difference.

With Play Framework action composition, we maintain full code visibility at the controller:

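def index = Authenticated { request =>
  Authorised(request, request.user) {
    // ... the request is authenticated and authorised
  } otherwise {
    // ... the developer decides how to handle the failure
  }
}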

Instead of tracing through multiple plugins to find out what happens to a request, it’s all there in front of us. We know the request is Authenticated, and is then Authorised. If we have a bug, it’s easy to follow a request to see what happens.

We can also choose how much code visibility we get – for example, Authenticated takes care of what happens if user authentication fails, but Authorised leaves it to the developer to decide how to handle an authorisation failure.

So what can we do about it

This is Perl – there’s always a way!

Let’s define the syntax first:

# a plain action
Action 'welcome' => sub {
  shift->render(text => '');
};

# an asynchronous action (without a render_later call)
Action 'login' => Async {
  my $self = shift;
  Mojo::IOLoop->timer(0.25 => sub {
    $self->render(text => '');
  });
};

# an authenticated action
Action 'private1' => Authenticated {
  shift->render(text => '');
};

# nested actions, with parameters
Action 'private2' => Async { Authenticated {
  WithPermission '' => sub {
    shift->render(text => '');
  }, sub {
    shift->render(text => 'not authorised');
  };
} };

Implementing it in Mojolicious

To start with, we need a way to define an action. This is essentially the same as the default sub{}, but lets us capture its contents.

The basic Action action

We need to be able to pass in a sub, and return a sub (though, for the top-level Action, we’ll need to monkey patch it so the Mojolicious router can find it).

Since we need to be able to chain these actions together, we also need to recursively call the inner action. We’ll also need to do that for any other actions we define, so let’s make it generic:

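use Mojo::Util 'monkey_patch';

# keep calling whatever the previous action returned, until it isn't a sub
sub go {
  my ($controller, $inner) = @_;
  my $i = $inner;
  while($i) {
    # avoid "my ... if ...", which has undefined behaviour in Perl
    my $res = ref($i) eq 'CODE' ? $i->($controller) : undef;
    $i = ref($res) eq 'CODE' ? $res : undef;
  }
  return;
}

sub Action($$) {
  my ($action, $inner) = @_;
  # scalar context, so we only get the calling package name
  my $caller = caller;
  monkey_patch $caller, $action => sub { go(shift, $inner) };
}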

This is enough to let us define a new action like this:

Action 'myaction' => sub {
    shift->render(text => '');
};
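The loop inside go is the part that makes nesting work: it keeps calling whatever the last action returned until it gets something that isn’t callable. For readers more at home in Go than Perl, here is the same idea as a small sketch. Controller, Action and run are names invented for this illustration and have nothing to do with the Mojolicious API.

```go
package main

import "fmt"

// Controller is a hypothetical stand-in for the Mojolicious controller.
type Controller struct {
	Log []string
}

// Action does some work and may return another Action to run next,
// or nil to stop. This mirrors a Perl sub returning a code ref.
type Action func(c *Controller) Action

// run keeps invoking whatever the previous action returned.
func run(c *Controller, inner Action) {
	for a := inner; a != nil; {
		a = a(c)
	}
}

func main() {
	outer := func(c *Controller) Action {
		c.Log = append(c.Log, "outer")
		// Returning a function asks run to call it next.
		return func(c *Controller) Action {
			c.Log = append(c.Log, "inner")
			return nil
		}
	}

	c := &Controller{}
	run(c, outer)
	fmt.Println(c.Log) // [outer inner]
}
```

An outer action can stop the chain simply by returning nil instead of the inner action, which is how a wrapper like Authenticated can refuse to continue.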

A nested action – Async

Adding another action type is just as easy.

Since Async is our first ‘nested’ action, we’ll implement that:

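sub Async(&) {
  my $inner = shift;
  return sub {
    my $controller = shift;
    # disable automatic rendering so the inner action can respond later
    $controller->render_later;
    go($controller, $inner);
  };
}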

Now we can define an Async action, without needing to call render_later:

Action 'login' => Async {
  my $self = shift;
  Mojo::IOLoop->timer(0.25 => sub {
    $self->session(auth => 1);
    $self->render(text => '');
  });
};

An action with parameters

We’ll skip over the Authenticated action for now – it’s almost identical to Async, with the exception of needing to perform a session lookup to decide whether to continue the action chain.

Instead, we’ll implement WithPermission – an action with run-time parameters.

We need to be able to pass in some custom parameters, and a sub, and have it return a sub which, like the others, invokes the inner sub when it’s called:

sub WithPermission($&;&) {
  my ($permission, $inner, $error) = @_;
  return sub {
    my $controller = shift;
    if($permission eq 'blog.delete') {
      go($controller, $inner);
    } else {
      $error ? go($controller, $error) : $controller->render_exception('Unauthorised');
    }
    return undef;
  };
}

Which lets us define an action like this:

Action 'private1' => Authenticated { WithPermission '' => sub {
  shift->render(text => '');
}, sub {
  shift->render(text => 'need permission');
} };

If the user is authenticated and has the permission ‘’, the first inner action gets executed, otherwise the second is called instead. However, since our WithPermission action only accepts ‘blog.delete’, this will always fail.

The WithPermission action also implements a default failure action, so we could skip the second inner action completely.


The simple implementation above gives us the flexibility to move our bridge and condition code away from the routing and into the controller layer, improving code visibility for developers.

We can get many of the benefits that Play Framework offers, but there’s still one big difference.

In Play Framework, we could invoke an action and inspect its result, and still choose to not return that content to the user. This gives us the flexibility to invoke multiple actions, and decide later which to use, for example:

def index = Authenticated { request =>
  val res1 = Foo(request);
  val res2 = Bar(request);
  if((res1.body.json \ 'some_key').isDefined)
    Authorised(res1, request.user) {
      // ...
    }
  // ...
}

In Mojolicious, the client might end up with a mixture of both responses, or just whichever gets called first. Because Mojolicious doesn’t use the return value, as soon as an action calls ->render it will immediately return a response to the client.

It should be possible – by creating a new stash and fake response objects, and possibly patching ->render – to trick Mojolicious into supporting a future/promise based interface.


Getting started with HashiCorp Serf

In my post about Apache Mesos I briefly mentioned Serf.

Serf (from HashiCorp, who also make Vagrant and Packer) is a decentralised service discovery tool with support for custom events.

By installing a Serf agent on each node in a network, and (maybe) bootstrapping each agent with the IP address of another agent, you are quickly provided with a scalable membership system with the ability to propagate events across the network.

Once it’s installed and agents are started, running serf members from any node will produce output similar to this:

vagrant@master10:~$ serf members
master10    alive    role=master
zk10    alive    role=zookeeper
slave10    alive    role=slave
mongodb10    alive    role=mongodb

Which is when you realise you’ve still got a Mesos cluster running that you’d forgotten about…

The output from Serf shows the hostname, IP address, status and any tags the Serf agent is configured with. In this case, I’ve set a role tag which lets us quickly find a particular instance type on the network:

vagrant@master10:~$ serf members | grep mongodb
mongodb10    alive    role=mongodb

This, with its event system, makes Serf ideal for efficiently maintaining cluster node state, and reducing or eliminating application configuration.

In a Mesos cluster, it lets us make core parts of the infrastructure (like ZooKeepers and Mesos masters) simple to scale with no manual configuration.

Serf has lots of other potential uses too, some of them documented on the Serf website.
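The grep trick shown earlier is fine at the console, but from application code you’d probably want something a little more structured. Here is a minimal sketch in Go which parses the text output of serf members and filters on a tag. It assumes four whitespace-separated columns (name, address, status, tags) as described above. The Member type, both helper functions and the sample addresses are made up for the example.

```go
package main

import (
	"fmt"
	"strings"
)

// Member is one row of `serf members` output.
type Member struct {
	Name, Addr, Status string
	Tags               map[string]string
}

// parseMembers assumes whitespace-separated columns: name, address,
// status and an optional comma-separated list of key=value tags.
func parseMembers(output string) []Member {
	var members []Member
	for _, line := range strings.Split(output, "\n") {
		fields := strings.Fields(line)
		if len(fields) < 3 {
			continue // blank or unrecognised line
		}
		m := Member{Name: fields[0], Addr: fields[1], Status: fields[2], Tags: map[string]string{}}
		if len(fields) > 3 {
			for _, pair := range strings.Split(fields[3], ",") {
				kv := strings.SplitN(pair, "=", 2)
				if len(kv) == 2 {
					m.Tags[kv[0]] = kv[1]
				}
			}
		}
		members = append(members, m)
	}
	return members
}

// withTag returns the names of members whose tag key has the given value.
func withTag(members []Member, key, value string) []string {
	var names []string
	for _, m := range members {
		if m.Tags[key] == value {
			names = append(names, m.Name)
		}
	}
	return names
}

func main() {
	// Sample output with made-up addresses.
	out := "master10   192.0.2.10:7946  alive  role=master\n" +
		"zk10       192.0.2.11:7946  alive  role=zookeeper\n" +
		"mongodb10  192.0.2.12:7946  alive  role=mongodb\n"

	fmt.Println(withTag(parseMembers(out), "role", "mongodb")) // [mongodb10]
}
```

Matching on the parsed tag rather than the whole line avoids false positives, such as a node whose hostname happens to contain the role name.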

Getting started with Serf


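Installing Serf couldn’t be easier: download the release for your platform from the Serf website, unzip it, and move the serf binary somewhere on your path (such as /usr/local/bin).

You should then be able to run serf from the command line and see a list of available commands.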

Trying it out

To try Serf, you need two console windows open and you’re ready!

  • Run serf agent from one console
  • Run serf members from another

You should see something like this:

vagrant@example:~$ serf members
example    alive

That output shows a cluster containing your local machine (with a hostname of ‘example’) and that the node is alive. The real output also includes the address the node is available at.

It’s that simple!

Starting Serf automatically

This isn’t much use on its own – we need Serf to start every time a node boots up, so that as soon as a new node comes online, the cluster finds out about it immediately, and the new node can configure itself.

Serf provides us with example scripts for upstart and systemd.

For Ubuntu, copy upstart.conf to /etc/init/serf-agent.conf then run start serf-agent (you might need to modify the upstart script if Serf isn’t installed to /usr/local/bin/serf).


Now we’ve got our Serf agent running, we need to configure it so it knows what to do.

You can configure Serf using either command line options (useful if you’re talking to a remote Serf agent or using non-standard ports), or you can provide configuration files (which are JSON files, loaded from the configuration directory in alphabetical order).

If you’ve used the Ubuntu upstart script, creating config.json in /etc/serf will work.

All of the configuration options are documented on the Serf website.

The examples below are in JSON, but they can all be provided as command line arguments instead.

IP addresses

This caught me out a few times – Serf, by default, will advertise the bind address (usually the IP address of your first network interface, e.g. eth0).

In a Vagrant environment, you will always have a NAT connection as your first interface (the one Vagrant uses to communicate with the VM). This was causing my agents to advertise an IP which other nodes couldn’t connect to.

To fix this, Serf lets us override the IP address it advertises to the cluster:

    "advertise": ""

Setting tags

Serf used to provide a ‘role’ command line option (it still does, but it’s deprecated). In its place, we have tags, which are far more flexible.

Tags are key-value pairs which provide metadata about the agent. In the example above, I’ve created a tag named role which describes the purpose of the node.

    "tags": {
        "role": "mongodb"
    }

You can set multiple tags, but there is a limit – the Serf documentation doesn’t specify the limit, except to say:

“There is a byte size limit for the maximum number of tags, but in practice dozens of tags may be used.”

You can also replace tags while the Serf agent is still running using the serf tags command, though changes aren’t persisted to configuration files.


You shouldn’t need to set the protocol – it should default to the latest version (currently 3).

It does, when started from the command line. But it didn’t seem to when started using upstart and a configuration directory. Easy to fix though:

    "protocol": 3

This might not be a bad practice anyway: with the protocol pinned, you can update Serf on all nodes without worrying about protocol compatibility.

Forming a cluster

I’ll cover this in more detail later, but you can set either start_join or discover to join your agent to a cluster.

Scripting it

Since Serf is ideal for a cloud environment, it’s useful to script its installation and configuration.

Here’s an example using bash similar to the one in my vagrant-mongodb example. It installs Serf, configures upstart, and writes an example JSON configuration file.

Because it’s from a Vagrant build, it uses a workaround to find the correct IP.

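# move the Serf binary and the upstart job into place
mv serf /usr/local/bin
mv upstart.conf /etc/init/serf-agent.conf
mkdir -p /etc/serf
# Vagrant workaround: advertise the address of eth1 rather than the NAT interface
ip=$(ip addr list eth1 | grep "inet " | cut -d ' ' -f6 | cut -d/ -f1)
# $1 is the address of an existing node to join
echo "{ \"start_join\": [\"$1\"], \"protocol\": 3, \"tags\": { \"role\": \"mongodb\" }, \"advertise\": \"$ip\" }" | tee /etc/serf/config.json
exec start serf-agent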
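
Escaping JSON by hand inside echo gets fiddly as the configuration grows. As a rough sketch (in Python rather than bash, and not part of the original build), the same file could be produced with a JSON library. The function name build_config and the example addresses are hypothetical; the keys are the ones used in the script above.

```python
import json


def build_config(join_address, advertise_ip, role="mongodb"):
    """Return the Serf agent configuration from the script above as a JSON string."""
    config = {
        "start_join": [join_address],   # a node to join ($1 in the bash script)
        "protocol": 3,
        "tags": {"role": role},
        "advertise": advertise_ip,      # the address other nodes can reach us on
    }
    return json.dumps(config, indent=4)


if __name__ == "__main__":
    # Hypothetical addresses, for illustration only.
    print(build_config("192.168.1.10", "192.168.1.11"))
```

The output could then be written to /etc/serf/config.json in place of the echo line.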

Forming a cluster

So far we have just a single Serf agent. The next step is to set up another Serf agent, and join them together, forming a (small) cluster of Serf agents.

Using multicast DNS (mDNS)

Serf supports multicast DNS, so in a contained environment with multicast support we don’t need to provide it with a neighbour.

Using the discover configuration option, we provide Serf with a cluster name which it will use to automatically discover Serf peers.

    "discover": "mycluster"

In a cloud environment this removes the need to bootstrap Serf, making it truly autonomous.

Providing a neighbour

If we can’t use multicast DNS, we can provide a neighbour and Serf will discover the rest of the cluster from there.

This could be problematic, but if we’re in a Mesos cluster it becomes easy. We know at least one ZooKeeper must always be available, so we can give Serf the hostnames of our known ZooKeeper instances:

    "start_join": [ "zk1", "zk2", "zk3" ]

Or, if we get the ZooKeepers to update a load balancer (using Serf!) when they join or leave the cluster, we can make our configuration even easier:

    "start_join": [ "zk" ]

We can also use the same technique to configure Serf on the ZooKeeper nodes.

How clusters are formed

A cluster is formed as soon as one agent discovers another (whether this is through multicast DNS or using a known neighbour).

As soon as a cluster is formed, agents will share information between them using the Gossip Protocol.

If agents from two existing clusters discover each other, the two clusters will become a single cluster. Full membership information is propagated to every node.

Once two clusters have merged, it would be difficult to split them without restarting all agents in the cluster or forcing agents to leave the cluster using the force-leave command (and preventing them from discovering each other again!).
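
To make the merging behaviour concrete, here is a toy model of it in Python. It is only a sketch: real Serf gossips to a random subset of peers, whereas this version floods membership deterministically so the result is easy to follow. The function names and node names are made up for the example.

```python
def join(views, a, b):
    """Agents a and b discover each other and exchange membership lists."""
    merged = views[a] | views[b]
    views[a] = set(merged)
    views[b] = set(merged)


def gossip_round(views):
    """Every agent pushes its view to each member it knows about.

    Returns True if any agent learned something new.
    """
    changed = False
    for node in sorted(views):
        for peer in sorted(views[node]):
            if not views[node] <= views[peer]:
                views[peer] |= views[node]
                changed = True
    return changed


def merge_clusters():
    # Two existing clusters: each agent only knows its own cluster.
    views = {
        "a1": {"a1", "a2"},
        "a2": {"a1", "a2"},
        "b1": {"b1", "b2"},
        "b2": {"b1", "b2"},
    }
    join(views, "a1", "b1")       # one agent from each cluster meets
    while gossip_round(views):    # repeat until nothing new is learned
        pass
    return views
```

After merge_clusters() every agent holds the same four-member view, which is why a single discovery between two clusters is enough to make them one.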

Nodes leaving

If a node chooses to leave the cluster (e.g. scaling down or restarting nodes), other nodes in the cluster will be informed with a leave event.

Its membership information will be updated to show a ‘left’ state:

example    left    role=mongodb

A node leaving the cluster is treated differently to a failure.

This is determined by the signal sent to the Serf agent to terminate the process. An interrupt signal (Ctrl+C or kill -2) will tell the node to leave the cluster, while a kill signal (kill -9) will be treated as a node failure.

Node failures

When a node fails, other nodes are informed with a failed event.

Its membership information will be updated to show a ‘failed’ state:

example    failed    role=mongodb

Knowledge of the failed node is kept by other Serf agents in the cluster. They will periodically attempt to reconnect to the node, and eventually remove the node if further attempts are unsuccessful.


Events

Serf uses events to propagate membership information across the cluster: member-join, member-leave, member-failed or member-update.

You can also send custom events (which use the user event type), and provide a custom event name and data payload to send with it:

serf event dosomething "{ \"foo\": \"bar\" }"

Events with the same name within a short time frame are coalesced into one event, although this can be disabled using the -coalesce=false command line argument.

This makes Serf useful as an automation tool – for example, to install applications on cluster nodes or configure ZooKeeper or Mesos instances.

Event handlers

Event handlers are scripts which are executed as a shell command in response to the events.

Shell environment

Within the shell created by Serf, we have the following environment variables available:

  • SERF_EVENT – the event type
  • SERF_SELF_NAME – the current node name
  • SERF_SELF_ROLE – the role of the node, but presumably deprecated
  • SERF_TAG_${TAG} – one for each tag set (uppercased)
  • SERF_USER_EVENT – the user event type, if SERF_EVENT is ‘user’
  • SERF_USER_LTIME – the Lamport timestamp of the event, if SERF_EVENT is ‘user’

Any data payload given by the event is piped to STDIN.
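
As an illustration, a handler that uses these variables might look something like the sketch below. It is written in Python, but any executable would do. The function name describe_event is hypothetical, and the sketch only reads STDIN for user events.

```python
import os
import sys


def describe_event(env, payload):
    """Summarise a Serf event from the handler's environment and payload."""
    event = env.get("SERF_EVENT", "")
    node = env.get("SERF_SELF_NAME", "")
    # Tags arrive as SERF_TAG_<NAME>, with the tag name uppercased.
    role = env.get("SERF_TAG_ROLE", "")
    if event == "user":
        name = env.get("SERF_USER_EVENT", "")
        return f"{node} ({role}): user event {name!r} with payload {payload.strip()!r}"
    return f"{node} ({role}): {event}"


if __name__ == "__main__":
    # This sketch only reads the payload from STDIN for user events.
    data = sys.stdin.read() if os.environ.get("SERF_EVENT") == "user" else ""
    print(describe_event(os.environ, data))
```

Keeping the logic in a function that takes the environment and payload as arguments makes the handler easy to exercise without a running Serf agent.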

Creating event handlers

Serf’s event handler syntax is quite flexible, and lets you listen to all events or filter based on event type.

  • The most basic option is to invoke a script for every event:

        "event_handlers": [
  • You can listen for a specific event type:

        "event_handlers": [
  • You can specify multiple event types:

        "event_handlers": [
  • You can listen to just user events:

        "event_handlers": [
  • You can listen for specific user event types:

        "event_handlers": [

Multiple event handlers can be specified, and all event handlers which match for an event will be invoked.
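
The filtering rules can be modelled in a few lines of Python. This is a sketch of the matching logic only, not Serf's implementation, and it assumes the filter forms as I understand them from the Serf documentation: a bare script, event=script, a comma-separated list of events, user=script and user:name=script. The script name handler.sh is hypothetical.

```python
def parse_handler(spec):
    """Split an event_handlers entry into (filters, script).

    An entry without "=" runs its script for every event.
    """
    if "=" not in spec:
        return ["*"], spec
    filters, script = spec.split("=", 1)
    return filters.split(","), script


def matches(spec, event, user_event=None):
    """Return True if the handler entry should fire for the given event."""
    filters, _ = parse_handler(spec)
    for f in filters:
        if f == "*" or f == event:
            return True
        if event == "user" and user_event is not None and f == f"user:{user_event}":
            return True
    return False


def handlers_for(specs, event, user_event=None):
    """All matching handlers are invoked, so return every matching script."""
    return [parse_handler(s)[1] for s in specs if matches(s, event, user_event)]
```

For example, handlers_for(["handler.sh", "user:deploy=deploy.sh"], "user", "deploy") returns both scripts, while a member-join event would only select handler.sh.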

Reloading configuration

Serf can reload its configuration without restarting the agent.

To do this, send a SIGHUP signal to the Serf process, for example using killall serf -HUP or kill -1 PID.

You could even use custom user events to rewrite Serf configuration files and reload them across the entire cluster.


A quick introduction to Apache Mesos

Apache Mesos is a centralised fault-tolerant cluster manager. It’s designed for distributed computing environments to provide resource isolation and management across a cluster of slave nodes.

In some ways, Mesos provides the opposite to virtualisation:

  • Virtualisation splits a single physical resource into multiple virtual resources
  • Mesos joins multiple physical resources into a single virtual resource

It schedules CPU and memory resources across the cluster in much the same way the Linux kernel schedules local resources.

A Mesos cluster is made up of four major components:

  • ZooKeepers
  • Mesos masters
  • Mesos slaves
  • Frameworks


ZooKeeper

Apache ZooKeeper is a centralised configuration manager, used by distributed applications such as Mesos to coordinate activity across a cluster.

Mesos uses ZooKeeper to elect a leading master and for slaves to join the cluster.

Mesos masters

A Mesos master is a Mesos instance in control of the cluster.

A cluster will typically have multiple Mesos masters to provide fault-tolerance, with one instance elected the leading master.

Mesos slaves

A Mesos slave is a Mesos instance which offers resources to the cluster.

They are the ‘worker’ instances – tasks are allocated to the slaves by the Mesos master.


Frameworks

On its own, Mesos only provides the basic “kernel” layer of your cluster. It lets other applications request resources in the cluster to perform tasks, but does nothing itself.

Frameworks bridge the gap between the Mesos layer and your applications. They are higher level abstractions which simplify the process of launching tasks on the cluster.


Chronos

Chronos is a cron-like fault-tolerant scheduler for a Mesos cluster.

You can use it to schedule jobs, receive failure and completion notifications, and trigger other dependent jobs.


Marathon

Marathon is the equivalent of the Linux upstart or init daemons, designed for long-running applications.

You can use it to start, stop and scale applications across the cluster.


There are a few other frameworks:

You can also write your own framework, using Java, Python or C++.

The quick start guide

If you want to get a Mesos cluster up and running, you have a few options:

Using Vagrant

Vagrant and the vagrant-mesos Vagrantfile can help you quickly build:

  • a standalone Mesos instance
  • a multi-machine Mesos cluster of ZooKeepers, masters and slaves

Unfortunately, the network configuration is a bit difficult to work with – it uses a private network between the VMs, and SSH tunnelling to provide access to the cluster.

Using Mesosphere and Amazon Web Services

Mesosphere provide Elastic Mesosphere, which can quickly launch a Mesos cluster using Amazon EC2.

This is far easier to work with than the Vagrant build, but it isn’t free – around $1.50 an hour for 6 instances or $4.50 for 18.

A simpler Vagrant build

I’ve put together some Vagrantfiles to build individual components of a Mesos cluster. It’s a work in progress, but it can already build a working Mesos cluster without the networking issues. It uses bridged networking, with dynamically assigned IPs, so all instances can be accessed directly through your local network.

You’ll need the following GitHub repositories:

At the moment, a cluster is limited to one ZooKeeper, but can support multiple Mesos masters and slaves.

Each of the instances is also built with Serf to provide decentralised service discovery. You can use serf members from any instance to list all other instances.

To help test deployments, there’s also a MongoDB build with Serf installed:

Like the ZooKeeper instances, the MongoDB instance joins the same Serf cluster but isn’t part of the Mesos cluster.

Once your cluster is running

You’ll need to install a framework.

Mesosphere lets you choose to install Marathon on Amazon EC2, so that could be a good place to start.

Otherwise, manually installing and configuring Marathon or another framework is easy. The quick and dirty way is to install them on the Mesos masters, but it would be better if they had their own VMs.

With Marathon or Aurora, you can even run other frameworks in the Mesos cluster for scalability and fault-tolerance.


The automation what-for

Today, our developers and testers were asked to justify the use of test automation – a surprising question after we’ve invested 5 years in writing automated test cases.

The challenge was to prove the value in continuing to automate our test cases, on the basis that it should be up to scrutiny if the value really does exist.

So we tried…

  • Automated tests are repeatable and consistent
  • Automated tests and testing platforms can be easily scaled as the code base grows
  • Automated tests can be executed concurrently against many environments
  • Automated tests can provide rapid feedback on system/code changes

Of course, the same isn’t true for manual testing

  • Manual tests can be unpredictable
    • different testers may produce different results
    • testers may use workarounds to avoid some bugs
  • Manual testing isn’t scalable – employing more people is the only option
  • An individual tester can only (sensibly) test one environment at a time
  • Manual tests are slow – feedback might take days, weeks or months

But these weren’t the right answers.

We couldn’t understand why.

So, let’s have a closer look at automation…

What is automation?

Automation, by definition, is:

the technique, method, or system of operating or controlling a process by highly automatic means, as by electronic devices, reducing human intervention to a minimum

Originally used circa 1940, the word was an irregular formation combining “automatic” and “action”, but process automation had become a well established practice long before then.

Everything from manufacturing and agriculture to construction and transportation – in modern history, humans have automated nearly every aspect of their lives.

On an industrial scale the benefits are immediately obvious. A farmer wouldn’t employ manual labour to plough fields any more than a car manufacturer would employ manual labour to assemble cars.

The return on investment (ROI) for the automation of these processes is clear – though of course it wasn’t in the decades leading up to the industrial revolution.

Why do we automate things?

There are many reasons to automate processes – some are purely economic while others are psychological.

  • Simple or boring tasks (paying bills)
  • Time consuming tasks (washing dishes)
  • Beyond our physical capability (lifting shipping containers)
  • To reduce cost (human labour)
  • To reduce risk (bomb disposal robots)

In all cases, our ability to automate something is limited by our mental capacity to perform that task. We can only automate the things we understand, that are simple enough and repeatable enough. We can’t easily automate tasks requiring creativity or emotion.

For example, we can easily automate opening a shop door (we could do it manually), but we would find it difficult to automate brain surgery (most humans couldn’t do it even manually) or software engineering (many have tried).

But sometimes, even when a process can be automated, we decide not to.

Why do we choose not to automate things?

Just as there are some processes we would like to automate, but can’t – there are some processes we could automate, but don’t:

  • Things we enjoy doing
  • Economic cost, e.g. R&D investment is too high or unpredictable
  • Social cost, e.g. unemployment and poverty

Just as our ability to automate is limited by our mental capacity to perform a task, our ability to choose not to automate is limited by our physical capacity to perform it (we wouldn’t even consider using human labour to lift a shipping container).

How we decide what to automate

Deciding whether we automate a process comes down to a cost-benefit analysis, determining if the investment required (whether an economic, physical or psychological investment) is worth the benefit we get in return.

As with any cost-benefit analysis, the time frame over which we calculate the costs and benefits can have a considerable impact on the ROI.

For example, if Ford had only planned to make 1000 cars over a 2 year time frame, then it would be obvious that the ROI on designing, building, testing and deploying an automated car manufacturing process would be terrible, and would probably result in a net loss (or bankruptcy) for the company.

But if Ford wanted to continue producing cars – maybe another 350 million cars over 109 years – then the ROI becomes far more appealing.

Although the up-front investment in research and development is high, the long-term benefit is far higher, ultimately making Ford one of the world’s leading car manufacturers and forging the modern automotive industry.

Why software testing is no different

Just like agriculture and manufacturing, automating software testing comes with a high initial (and sometimes on-going) cost:

  • Developers and testers need to learn how to write automated tests
  • Test suites need to be written and maintained
  • An automated testing platform must be created

And just like agriculture and manufacturing, some of it doesn’t need automation (or can’t be automated):

  • If it’s throwaway/one-use code
  • Exploratory testing which requires creativity
  • Visual testing (does it look/feel right)

But in most cases, well written automated tests provide a level of confidence unmatched by manual testing:

  • Entire system components can be updated or replaced efficiently
  • Codebases can be safely refactored
  • Integration and release can be automated
  • Fixed defects can’t silently regress
  • More platforms can be tested (desktop, web, mobile, etc)

By developing an automated testing suite, testing resources can then be reallocated to more productive work:

  • Improving test coverage
  • Collaborating with developers
  • Exploratory and visual testing
  • Accessibility testing

So, what was the answer?

It certainly won’t be “because we should” or “it’s the right thing to do”, or even “it’ll reduce defects” or “it’ll improve code quality”.

It will come down to proving, through cost-benefit analysis, that the investment in automated testing provides a strong enough ROI. This will largely depend on the time frame used for the ROI calculation.

If the focus is short-term (“we want a great product now”) then any further investment in test automation will yield no value, and manual testing is the only choice.

But if the focus is long-term (“we still want a great product in 5 years”) then test automation is invaluable (supplemented with appropriate manual testing), and provably so in any cost-benefit analysis.
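
A toy calculation shows how much the time frame matters. Every figure below is hypothetical (hours of effort, not taken from any real project): automation is assumed to cost 200 hours up front and 1 hour per test run, against 20 hours for each manual run.

```python
import math


def cumulative_cost(upfront, per_run, runs):
    """Total cost after a number of test runs."""
    return upfront + per_run * runs


def break_even_runs(auto_upfront, auto_per_run, manual_per_run):
    """Smallest number of runs at which automation costs no more than manual testing.

    Returns None if automation never pays for itself.
    """
    saving_per_run = manual_per_run - auto_per_run
    if saving_per_run <= 0:
        return None
    return math.ceil(auto_upfront / saving_per_run)


if __name__ == "__main__":
    # Hypothetical figures, in hours of effort.
    print(break_even_runs(200, 1, 20))
    for runs in (5, 100):
        print(runs, cumulative_cost(200, 1, runs), cumulative_cost(0, 20, runs))
```

With these made-up numbers automation breaks even at the 11th run. Over 5 runs manual testing is cheaper (100 hours against 205), but over 100 runs automation costs 300 hours against 2000.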

Is there a middle ground?

The middle ground does seem attractive:

  • Manual testing to get a quick delivery
  • Automate tests longer-term

It seems to promise a good short and long term ROI. We get our quick delivery, to an acceptable standard. We also get our test automation. And eventually we get a high quality product.

But until the test automation happens, developers are constrained by the existing codebase:

  • refactoring becomes difficult or impossible
  • updating components carries significant risk
  • minor changes, bug fixes and features take an inordinate amount of time to develop, test and deploy

This has a substantial consequence for the product or service being delivered:

  • If test automation never happens (no time is made available), the entire product will suffer and eventually adding new features or fixing bugs will become impossible
  • If test automation happens (quickly), new features will be held up while test suites are automated, delaying the creation of business value

Either way, the middle ground eventually becomes technical debt, and the short term business value gained through a reduction in the initial investment must eventually be repaid (through reduced longer-term value).

A cost-benefit analysis

Many cost-benefit analyses of test automation have already been carried out, so I’m not going to write “Yet Another Cost-Benefit Analysis” – but here are a few links instead:


Given the historical importance of process automation throughout the industrial revolution, the rapid improvement to standards of living that we’re still benefiting from today, and the significant expansion of the human race as a result of the earliest technological automation, it seems counter-intuitive to even question the value in automating software testing.

Though I agree that some tests shouldn’t be automated (or can’t be), when products or services are expected to have a long “shelf-life”, test automation becomes the only sensible solution.

It’s also important to consider the human element in any cost-benefit analysis.

Testers and developers, like anyone else, get bored easily when tasked with simple and repetitive work. If we have the opportunity to automate this work, we leave humans with the more complex and creative work – the stuff we’re really good at, the stuff that we can’t automate, and the stuff that’s just more satisfying to do.


Building a scalable sequence generator (in Scala)

Building a scalable sequence generator was more difficult than I’d anticipated.

The challenge

  • Build a scalable sequence generator (must scale out and provide resilience)
  • Master sequence number is stored in MongoDB, updated atomically using find and modify
  • Sequence numbers must never be repeated (but strict ordering isn’t required)

The problem

Since the sequence number is a single value stored in a single document in a single collection, the document gets locked on every request. MongoDB can’t help with scaling:

  • Starting multiple instances of our sequence generator doesn’t help – they all need to lock the same document
  • Multiple MongoDB nodes don’t help either – we’d need replica acknowledged write concern to avoid duplicate sequence numbers

The solution

The solution is to take batches of sequence numbers from MongoDB, so that each find and modify reserves several numbers at once. For example, using a batch size of 10 means we can run (approximately) 10 instances of our sequence generator to our 1 MongoDB document, though any instance failure could waste up to 10 sequence numbers.

Using batches also dramatically improves our performance – we make far fewer MongoDB requests, generating less network traffic and reducing service response times.
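
The saving is easy to see in a small synchronous model, sketched here in Python. The class names are made up, and MasterCounter simply stands in for the MongoDB document. Because the model is single-threaded, it says nothing about concurrent requests; it only counts how many master requests are needed.

```python
class MasterCounter:
    """Stand-in for the MongoDB document holding the master sequence number."""

    def __init__(self):
        self.seq = 0
        self.requests = 0

    def take(self, count):
        """Reserve `count` numbers and return the new maximum."""
        self.requests += 1
        self.seq += count
        return self.seq


class BatchedGenerator:
    def __init__(self, master, batch_size):
        self.master = master
        self.batch_size = batch_size
        self.local_seq = 0
        self.local_max = 0

    def next_seq(self):
        if self.local_seq >= self.local_max:
            # Local batch exhausted: reserve another from the master.
            self.local_max = self.master.take(self.batch_size)
            self.local_seq = self.local_max - self.batch_size
        self.local_seq += 1
        return self.local_seq


def demo(requests=100, batch_size=10):
    master = MasterCounter()
    generator = BatchedGenerator(master, batch_size)
    numbers = [generator.next_seq() for _ in range(requests)]
    return numbers, master.requests
```

Calling demo() hands out 1 to 100 using 10 master requests, where a batch size of 1 would need 100. Two generators sharing one MasterCounter never overlap, because each batch is reserved before any number from it is used.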

The unscalable sequence generator

Building an unscalable sequence generator is easy. We can just find and modify the next sequence number and let MongoDB take care of the rest.

An implementation might look a bit like this:

import scala.concurrent._
import ExecutionContext.Implicits.global

object UnscalableSequenceGenerator extends App {
  // the master sequence number
  var seq = 0

  def nextSeq : Future[Int] = Future { blocking {
    // pretend we're doing a find and modify asynchronously
    this.synchronized {
      seq = seq + 1
      seq
    }
  } }

  // simulate calling our HTTP service 100 times
  for(i <- 1 to 100) {
    nextSeq map { j =>
      // pretend we're doing something useful with the sequence number
      print(s"$j ")
      if(i % 10 == 0) println()
    }
  }

  // the global execution context uses daemon threads, so give the futures time to finish
  Thread.sleep(1000)
}


Running that example produces output like this (the exact ordering of numbers may be different):

2 3 1 4 5 7 6 8 9 10 11 12 
14 13 16 17 15 19 18 21 22 20 24 23 25 
26 27 28 30 29 
31 32 34 33 36 35 37 38 39 40 41 43 42 
44 46 45 47 48 49 50 51 52 
53 55 54 56 57 58 60 59 
62 61 63 64 65 66 67 68 69 70 
71 72 74 73 75 76 78 77 80 79 81 
82 83 85 84 86 87 89 88 90 91 
93 95 92 96 97 94 99 98 100 

No duplicates, but it’s not scalable, and the performance is terrible.

Making it scalable

To make it scalable (and get a performance boost), we can use sequence number batches. But that turned out to be more difficult than I’d expected.

The first attempt looked a bit like this:

import scala.concurrent._
import ExecutionContext.Implicits.global

object BatchedSequenceGenerator extends App {
  // the master sequence number and batch size
  var seq = 0
  val batch_size = 10

  // our current sequence and maximum sequence numbers
  var local_seq = 0
  var local_max = 0

  def newBatch : Future[Int] = Future { blocking {
    // pretend we're doing a find and modify asynchronously
    this.synchronized {
      seq = seq + batch_size
      seq
    }
  } }

  def nextSeq : Future[Int] = {
    if(local_seq >= local_max) {
      // Get a new batch of sequence numbers
      newBatch map { new_max =>
        // Update our local sequence
        local_max = new_max
        local_seq = local_max - batch_size
        local_seq = local_seq + 1
        local_seq
      }
    } else {
      // Use our local sequence number
      val next_seq = local_seq
      local_seq = local_seq + 1
      Future.successful(next_seq)
    }
  }

  // simulate calling our HTTP service 100 times
  for(i <- 1 to 100) nextSeq map { j =>
    // pretend we're doing something useful with the sequence number
    print(s"$j ")
    if(i % 10 == 0) println()
  }

  // the global execution context uses daemon threads, so give the futures time to finish
  Thread.sleep(1000)
}


While it does at least take batches of sequence numbers, we get the following unexpected but understandable output:

11 1 41 61 71 91 21 31 121 181 191 131 141 151 161 171 201 211 221 
101 81 111 51 
231 251 241 261 271 281 291 301 
311 321 331 341 351 361 371 381 
391 401 411 421 441 431 451 461 471 
481 491 501 511 521 531 541 551 
561 571 581 591 601 611 621 631 641 651 661 671 681 701 701 731 731 721 
741 751 761 791 781 771 801 811 821 831 841 
861 871 851 881 891 911 
901 921 931 951 961 941 971 
991 981

We’re only using 1/10th of each batch, and we get to 991 in only 100 requests. It’s no more scalable than the unbatched version.

It should probably have been obvious, but the problem is caused by requests arriving between requesting a new batch and getting a response:

  • The 10th request gives out the last local sequence number
  • The 11th request gets a new batch asynchronously
  • The 12th request arrives before we get a new batch, and requests another new batch asynchronously
  • We get the 11th request batch, reset our sequence numbers and return a sequence
  • We get the 12th request batch, and again reset our sequence numbers and return a sequence, wasting the rest of the previous batch
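
The interleaving above can be replayed deterministically. This Python sketch is a simplified model of the first attempt rather than a port of it: the class and function names are made up, and the two overlapping requests are simulated by sending both batch requests before handling either reply.

```python
class NaiveBatcher:
    def __init__(self, batch_size):
        self.batch_size = batch_size
        self.master = 0       # master sequence number
        self.local_seq = 0
        self.local_max = 0

    def request_batch(self):
        """Send a batch request; the reply (the new maximum) arrives later."""
        self.master += self.batch_size
        return self.master

    def receive_batch(self, new_max):
        """Handle a reply by resetting the local sequence, as the first attempt does."""
        self.local_max = new_max
        self.local_seq = new_max - self.batch_size + 1
        return self.local_seq


def replay_race(batch_size=10):
    batcher = NaiveBatcher(batch_size)
    first = batcher.request_batch()     # one request finds the local batch exhausted
    second = batcher.request_batch()    # the next arrives before the first reply
    handed_out = [batcher.receive_batch(first), batcher.receive_batch(second)]
    # The second reset skips everything left in the first batch.
    wasted = [n for n in range(1, first + 1) if n not in handed_out]
    return handed_out, wasted
```

replay_race() hands out 1 and 11, and the numbers 2 to 10 are never used, which matches the pattern in the output above.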

To fix it, we need the 12th request to wait for the 11th request to complete first.

Making it work

This was the tricky bit – implementing it led me down a road of endless compiler errors, but the idea was simple.

When we call nextSeq, we need to know if a new batch request is pending. If one is, we wait for it to complete instead of requesting another; otherwise we handle the request as normal.

We can do this by chaining futures together, keeping track of whether a batch request is currently in progress.

It’s a fairly simple change to our batched sequence generator (or at least, in hindsight it is):

import scala.concurrent._
import ExecutionContext.Implicits.global

object BatchedSequenceGenerator extends App {
  // the master sequence number and batch size
  var seq = 10
  val batch_size = 10

  // our current sequence and maximum sequence numbers
  var local_seq = 0
  var local_max = 10
  var pending : Option[Future[Int]] = None

  def newBatch : Future[Int] = Future { blocking {
    // pretend we're doing a find and modify asynchronously
    this.synchronized {
      seq = seq + batch_size
      seq
    }
  } }

  def nextSeq : Future[Int] = this.synchronized {
    pending match {
      case None =>
        if(local_seq >= local_max) {
          // Get a new batch of sequence numbers
          pending = Some(newBatch map { new_max =>
            // Update our local sequence
            local_max = new_max
            local_seq = local_max - batch_size + 1
            local_seq
          })
          // Clear the pending future once we've got the batch
          pending.get andThen { case _ => pending = None }
        } else {
          // Use our local sequence number
          local_seq = local_seq + 1
          val seq = local_seq
          Future.successful(seq)
        }
      case Some(f) =>
        // Wait on the pending future
        f flatMap { _ => nextSeq }
    }
  }

  // simulate calling our HTTP service 100 times
  for(i <- 1 to 100) nextSeq map { j =>
    // pretend we're doing something useful with the sequence number
    print(s"$j ")
    if(i % 10 == 0) println()
  }

  // the global execution context uses daemon threads, so give the futures time to finish
  Thread.sleep(1000)
}


And running that example generates output like this:

3 5 6 2 7 8 9 10 
4 1 13 11 12 14 15 17 19 20 
16 18 23 21 24 26 27 28 29 30 22 
25 34 35 31 33 37 38 39 40 
32 36 45 41 44 46 47 48 43 
49 50 42 52 53 55 51 60 54 56 
57 58 59 62 
64 70 63 61 65 66 67 68 69 72 75 71 
73 74 76 77 78 80 79 82 83 85 84 86 87 81 
89 88 90 92 95 93 99 98 100 97 
96 91 94 

The changes we made are straightforward:

  • When we request a new sequence number, check if a pending future exists
    • If it does, wait on that and return a new call to nextSeq
    • If not, check if a new batch is required
      • If it is, store the future before returning
      • If not, use the existing batch as normal
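
The same steps can be sketched with Python's asyncio, where the pending future becomes a task shared by every waiting caller. This is an illustration of the technique under simplified assumptions (a single event loop, no threads), not a translation of the Scala code, and the class names are made up.

```python
import asyncio


class Master:
    """Stand-in for the MongoDB find-and-modify call."""

    def __init__(self):
        self.seq = 0
        self.requests = 0

    async def take(self, count):
        self.requests += 1
        self.seq += count
        new_max = self.seq
        await asyncio.sleep(0)    # pretend there is a network round trip
        return new_max


class BatchedSequence:
    def __init__(self, master, batch_size):
        self.master = master
        self.batch_size = batch_size
        self.local_seq = 0
        self.local_max = 0
        self.pending = None       # the in-flight batch request, if any

    async def _refill(self):
        try:
            self.local_max = await self.master.take(self.batch_size)
            self.local_seq = self.local_max - self.batch_size
        finally:
            self.pending = None   # clear the marker once the request has finished

    async def next_seq(self):
        while self.local_seq >= self.local_max:
            if self.pending is None:
                # Nobody has asked yet: start exactly one batch request.
                self.pending = asyncio.create_task(self._refill())
            # Everyone, including the requester, waits on the same request.
            await self.pending
        self.local_seq += 1
        return self.local_seq


async def simulate(requests=100, batch_size=10):
    master = Master()
    sequence = BatchedSequence(master, batch_size)
    numbers = await asyncio.gather(*(sequence.next_seq() for _ in range(requests)))
    return sorted(numbers), master.requests
```

Running asyncio.run(simulate()) issues 100 concurrent requests and returns the numbers 1 to 100 with only 10 batch requests, because callers that arrive while a batch is in flight wait on it instead of asking for another.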

A limitation of this approach – if we have a sufficiently small batch size with a high volume of requests, the considerable number of chained futures could potentially cause out of memory errors.

Getting it to work felt like an achievement, but I’m still not happy with the code. It looks like there should be a nicer way to do it, and it doesn’t feel all that functional, but I can’t see it yet!


Quick start with Perl and Mojolicious

To get started with Mojolicious, just as quick and dirty as with Scala and Play Framework, you need only these:

Once they’re all installed, it’s this easy:

  • Open Git Bash
  • Clone my Mojolicious/Perl vagrant repository: git clone mojoserver
  • Change directory: cd mojoserver
  • Start the virtual machine: vagrant up (might take a while, installing Perl is slow!)
  • Once it’s complete, connect using SSH: vagrant ssh
  • Create a new Mojolicious app: mojo generate app MyApp
  • Change directory: cd my_app
  • Start your application: ./script/my_app daemon
  • View your new Mojolicious site in a browser: http://localhost:3000

It installs the latest version of Mojolicious and Mango along with Perl 5.18.2 and cpanminus using Perlbrew.

To help you get started, the Vagrantfile also installs MongoDB and sets up port forwarding for ports 3000 and 8080 (Mojolicious with Morbo and Hypnotoad) and port 27017 (MongoDB).


Quick start with Scala and Play Framework

For the quick and dirty way to get Play Framework up and running, you need only these:

Once they’re all installed, it’s this easy:

  • Open Git Bash
  • Clone my Play/Scala vagrant repository: git clone playserver
  • Change directory: cd playserver
  • Start the virtual machine: vagrant up
  • Once it’s complete, connect using SSH: vagrant ssh
  • Create a new play app: play new MyApp
  • Change directory: cd MyApp
  • Start your application: play run
  • View your new Play site in a browser: http://localhost:9000

If you want to edit your Play project in IntelliJ Idea, create the project files from the command line using play gen-idea.

To help you get started, the Vagrantfile also installs MongoDB and sets up port forwarding for port 9000 (Play) and port 27017 (MongoDB).
