Heck - A distributed health checker

Motivation

This talk at QCon 2017 describes the design of a “real” distributed system that is part of Fastly’s core architecture. The problem is interesting.

Unfortunately, the speaker didn’t show any real code or an open-source library, so I thought of building one.

Problem

The problem is simple - a distributed health-check monitor.

Fastly does edge compute; let’s assume a simple CDN for instance. A CDN is just a middleman that connects “your users” to “your origin servers” in an efficient way. Fastly cannot forward traffic to one of your unhealthy origin servers, so there has to be a system in place to do the job of health-checking your origin servers.

This is where the problem gets interesting. A client can have thousands of servers, spread across tens or hundreds of data centres, running their origin servers. Meanwhile, Fastly can also have tens or hundreds of data centres with thousands of servers in each. They all have to know about the health of the client’s origin servers. (Well, technically not all, but every data centre that forwards traffic to that client’s origin servers has to know about it.)

One naive way to solve the problem is to health-check all origin servers from every Fastly server. That is called a DDoS.

Let’s zoom in a bit and look at the problem per Fastly data centre. Within a data centre, every server can do health checks for one or more origin servers. Now it’s a distributed-systems problem. The primary goal is that origin servers shouldn’t be bombarded with health-check requests from many Fastly servers (ideally, one server owns one or more origin servers, and only that server does health checks for those origin servers).

To be precise:

Given a set of nodes and a list of origin servers, monitor the health of all the origin servers and distribute the work among the available nodes.
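
To make that problem statement concrete, here is a minimal sketch of the two pieces of data the system deals with. The package name heck and the type and field names (OriginServer, HealthStatus) are my own assumptions for illustration, not anything from the talk.

```go
package heck

import "time"

// OriginServer identifies one customer origin that must be health-checked.
type OriginServer struct {
	ID  string // stable identifier, e.g. "customer-1/origin-3"
	URL string // endpoint to poll, e.g. "https://origin3.example.com/healthz"
}

// HealthStatus is the per-origin state that nodes record and exchange.
type HealthStatus struct {
	OriginID  string
	Healthy   bool
	CheckedAt time.Time
	CheckedBy string // name of the node that owns (and therefore checked) this origin
}
```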

Goals

  1. The system should be eventually consistent, meaning the latency of consensus is not acceptable.
  2. We are building an AP system from the perspective of the CAP theorem.
  3. It should be possible to add or remove nodes dynamically.
  4. Handle failure detection among the nodes.
  5. Fault tolerant.
  6. The system should be coordination-free.

Sub-problems

  1. Doing health checks: simple HTTP polling at an interval (see the polling sketch after this list).
  2. Load balancing/ownership: how do you find which node is the owner of a “single” origin server (a.k.a. load balancing)? A single node can own multiple origin servers, but ownership has to be deterministic (see the rendezvous-hashing sketch after this list).
  • Potential solutions: any simple deterministic hashing, consistent hashing, or rendezvous hashing (as per the talk).
  3. Failure detection/cluster membership: we need some way to ensure all the current nodes are working and in a good state to be assigned work. What we need is a membership protocol. We could use the SWIM protocol, and there is already a good Go library for it from HashiCorp (https://github.com/hashicorp/memberlist); see the membership sketch after this list.
  4. Gossip: pull-based anti-entropy, or something similarly simple; the only coordination in the system is exchanging state across the nodes (the state is basically the health-check information of the origin servers that a particular node owns).
  5. Convergence: delta-CRDTs, vector clocks, version vectors, and eventual consistency (a last-writer-wins merge sketch covering gossip and convergence follows this list).
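
Sub-problem 1 in code - a minimal sketch of the polling loop, reusing the OriginServer/HealthStatus types sketched earlier. The 5-second timeout, the report callback, and the “any status below 400 is healthy” policy are assumptions, not requirements from the talk.

```go
package heck

import (
	"context"
	"net/http"
	"time"
)

// pollOrigin polls one origin at a fixed interval and reports every result.
func pollOrigin(ctx context.Context, o OriginServer, every time.Duration, report func(HealthStatus)) {
	client := &http.Client{Timeout: 5 * time.Second}
	ticker := time.NewTicker(every)
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			resp, err := client.Get(o.URL)
			healthy := err == nil && resp.StatusCode < 400
			if resp != nil {
				resp.Body.Close()
			}
			report(HealthStatus{OriginID: o.ID, Healthy: healthy, CheckedAt: time.Now()})
		}
	}
}
```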
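
Sub-problem 2 - a sketch of deterministic ownership using rendezvous (highest-random-weight) hashing, the option mentioned in the talk. The FNV-1a hash and the function name ownerOf are illustrative choices.

```go
package heck

import "hash/fnv"

// ownerOf picks the owning node for an origin: every node scores the
// (node, origin) pair and the highest score wins. Every node computes the
// same answer independently, so no coordination is needed, and removing a
// node only reassigns the origins that node owned.
func ownerOf(originID string, nodes []string) string {
	var best string
	var bestScore uint64
	for _, n := range nodes {
		h := fnv.New64a()
		h.Write([]byte(n))
		h.Write([]byte{0}) // separator so "ab"+"c" != "a"+"bc"
		h.Write([]byte(originID))
		if s := h.Sum64(); best == "" || s > bestScore {
			best, bestScore = n, s
		}
	}
	return best
}
```

A node starts pollOrigin for an origin only when ownerOf(origin.ID, liveNodes) equals its own name, which gives single ownership without any coordination.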
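
Sub-problem 3 - a sketch of cluster membership on top of hashicorp/memberlist, which implements SWIM. The helper names joinCluster and liveNodes are assumptions; DefaultLANConfig, Create, Join, and Members are the library’s actual API.

```go
package heck

import "github.com/hashicorp/memberlist"

// joinCluster starts this node as a SWIM member and joins the cluster via
// any known peer. peers may be empty for the very first node.
func joinCluster(nodeName string, peers []string) (*memberlist.Memberlist, error) {
	cfg := memberlist.DefaultLANConfig()
	cfg.Name = nodeName

	ml, err := memberlist.Create(cfg)
	if err != nil {
		return nil, err
	}
	if len(peers) > 0 {
		if _, err := ml.Join(peers); err != nil {
			return nil, err
		}
	}
	return ml, nil
}

// liveNodes returns the names of all currently alive members; this is the
// node list that feeds ownerOf above.
func liveNodes(ml *memberlist.Memberlist) []string {
	var names []string
	for _, m := range ml.Members() {
		names = append(names, m.Name)
	}
	return names
}
```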
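
Sub-problems 4 and 5 - a sketch of the state that gets gossiped and how replicas converge: a last-writer-wins map keyed by origin ID, merged on the CheckedAt timestamp. This is the simplest state-based CRDT that fits; delta-CRDTs or version vectors, as listed above, would be refinements that shrink the gossip payload. All names here are assumptions.

```go
package heck

import "sync"

// StatusMap is a last-writer-wins map of origin ID -> HealthStatus.
// Merging two replicas keeps, per origin, the entry with the newest
// CheckedAt, so replicas converge to the same state regardless of the
// order in which gossip exchanges happen.
type StatusMap struct {
	mu sync.Mutex
	m  map[string]HealthStatus
}

func NewStatusMap() *StatusMap {
	return &StatusMap{m: make(map[string]HealthStatus)}
}

// Set records a local observation (called by the owning node's poller).
func (s *StatusMap) Set(st HealthStatus) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.m[st.OriginID] = st
}

// Merge folds in a snapshot received from another node.
func (s *StatusMap) Merge(remote map[string]HealthStatus) {
	s.mu.Lock()
	defer s.mu.Unlock()
	for id, rst := range remote {
		if cur, ok := s.m[id]; !ok || rst.CheckedAt.After(cur.CheckedAt) {
			s.m[id] = rst
		}
	}
}

// Snapshot copies the current state for gossiping to a peer.
func (s *StatusMap) Snapshot() map[string]HealthStatus {
	s.mu.Lock()
	defer s.mu.Unlock()
	out := make(map[string]HealthStatus, len(s.m))
	for id, st := range s.m {
		out[id] = st
	}
	return out
}
```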

Interfaces

  1. Should be able to add/remove a node from the cluster at runtime (cluster membership).
  2. Add or remove a list of origin servers (input to the system).
  3. Get the health check of a specific origin server.
  4. Get the health checks of all the origin servers from any single node (a Go interface sketch covering these operations follows this list).
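
A hypothetical Go interface mirroring the four operations above; the method names and signatures are assumptions for illustration, not a fixed API.

```go
package heck

// Node is the surface each Heck node exposes.
type Node interface {
	// Cluster membership.
	Join(peers []string) error
	Leave() error

	// Origin servers to monitor (input to the system).
	AddOrigins(origins ...OriginServer) error
	RemoveOrigins(ids ...string) error

	// Read side: any single node can answer for the whole cluster,
	// because state is gossiped to every node.
	Health(originID string) (HealthStatus, bool)
	AllHealth() map[string]HealthStatus
}
```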

Before refining this design doc

  1. Create a simple broadcast key-value store where every update on one node is broadcast to the others (see the delegate sketch after this list).
  2. Something like this already exists built on the memberlist library: https://github.com/hashicorp/memberlist
  3. Then try to use a CRDT instead of a normal map (take inspiration from SoundCloud’s Roshi project).
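
For step 1, a sketch of how the state-exchange hook could look using memberlist’s Delegate interface: the periodic push/pull sync carries a whole StatusMap snapshot, and the last-writer-wins Merge from the earlier sketch resolves conflicts. The delegate struct, the JSON encoding, and leaving per-update broadcasts empty are all assumptions for a first cut.

```go
package heck

import (
	"encoding/json"

	"github.com/hashicorp/memberlist"
)

// delegate wires a StatusMap into memberlist's periodic push/pull state sync.
type delegate struct {
	state *StatusMap
}

var _ memberlist.Delegate = (*delegate)(nil)

func (d *delegate) NodeMeta(limit int) []byte                  { return nil }
func (d *delegate) NotifyMsg(msg []byte)                       {} // no per-update broadcasts yet
func (d *delegate) GetBroadcasts(overhead, limit int) [][]byte { return nil }

// LocalState ships our full snapshot to the peer we are syncing with.
func (d *delegate) LocalState(join bool) []byte {
	buf, _ := json.Marshal(d.state.Snapshot())
	return buf
}

// MergeRemoteState folds the peer's snapshot into ours (LWW merge).
func (d *delegate) MergeRemoteState(buf []byte, join bool) {
	var remote map[string]HealthStatus
	if err := json.Unmarshal(buf, &remote); err == nil {
		d.state.Merge(remote)
	}
}

// Wiring it up (inside joinCluster, before memberlist.Create):
//   cfg.Delegate = &delegate{state: statusMap}
//   cfg.PushPullInterval = 15 * time.Second // how often full state is exchanged
```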