The Four Ts of Distributed Systems

Like any theoretical field, software is full of checklists and rules of thumb. For example, in a previous article I referred to Fred George’s rules for microservices, and there’s also a lot to be said for the guidelines of “Twelve Factor Apps”. In this article I shall discuss my own set of hints for distributed systems, which I refer to as “The Four Ts”.

Before I hit the rules themselves, a bit of context. Over the last few decades the spread of inter-computer networking, and especially the collection of protocols we know as the internet, has led to massive growth of distributed systems. Hardly anyone seems to write the stand-alone software we used to call “programs” any more. Everything relies on networks to connect with other systems in order to do its job.

This in turn has led to a proliferation of techniques, frameworks and buzzwords. Once there were DCE and CORBA; now you might encounter SOA, Web Services, XML, HTTP, JSON, REST and more. The list is effectively endless. As a huge generalisation, these things tend to address the communication between two (or occasionally more) systems which already know of each other’s existence, and have a good idea how to communicate once a network connection is established. I’m not going to address that crowded space. The problem of “service discovery” - finding previously unknown remote systems and determining or negotiating whether they can help - is much trickier, and I’m not going to address that one directly either, although it is an important part of the picture.

Instead, I’m hoping to highlight four key areas which are often overlooked when architecting distributed systems. To cut to the chase, my four Ts are: Transience, Trust, Time, and Traffic.

These are very general terms, chosen more for alliteration than precision, and I’m sure you all consider some aspects of each of these. So let’s dig into what I mean by each one. And before you ask, no, I’m not going to give pat answers and “best practices”. These are complex issues, and the best I can offer is a few suggestions and some encouragement to bear these points in mind when working with distributed systems.

Transience

Nothing lasts forever. This is especially true in distributed systems. Hardware and network failures, misconfigurations and maintenance, restarts and upgrades; anything can cause a service to become unavailable. Luckily, most system architects are aware of this, and usually use redundancy, backups and failovers to keep things going. However, all these techniques are still chasing the rainbow of 100% uptime. There are always more weak links and more things which can go wrong.

There’s another, more subtle, issue with this approach, too. The more layers of backups, hot- and cold-standbys, fail-overs and whatnot you pile on a system, the harder it becomes to upgrade any of the components. And this becomes combinatorially worse the more dependencies there are in the system. Several times I have seen a tiny change to one component require a whole massively-redundant system to be brought down for an “upgrade”.

Things which can help with this include designing systems with minimal coupling, components which can work “offline” if their dependencies are temporarily unavailable, and supporting progressive upgrades where multiple versions of components can co-exist in a running system.
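To make the “offline” idea a little more concrete, here is a minimal Python sketch of a client that keeps serving the last values it saw when its dependency is unreachable. The endpoint, the data being fetched and the staleness limit are all hypothetical placeholders; a real system would need to decide how stale is too stale for its own domain.

```python
import json
import time
import urllib.request

class RateClient:
    """Fetches data from a remote service, but keeps serving the last
    known values if the service is temporarily unavailable."""

    def __init__(self, url, max_stale_seconds=3600):
        self.url = url                      # hypothetical endpoint for illustration
        self.max_stale = max_stale_seconds  # how long stale data is acceptable
        self._cache = None
        self._cached_at = 0.0

    def get_rates(self):
        try:
            with urllib.request.urlopen(self.url, timeout=2) as resp:
                self._cache = json.load(resp)
                self._cached_at = time.time()
        except OSError:
            # Dependency is down or unreachable: fall back to the cached
            # values rather than failing, as long as they are not too stale.
            if self._cache is None or time.time() - self._cached_at > self.max_stale:
                raise
        return self._cache
```

The point is not the caching itself, but that the client has an explicit answer to the question “what do I do when my dependency isn’t there?”, rather than simply propagating the failure.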

Trust

On the surface this seems obvious. Security is always a consideration. We use a wide variety of authentication and authorization techniques to verify that clients connecting to a service have valid credentials. It’s somewhat less common, though, for a client to be able to verify that a service is legitimate. Traditional distributed systems tend to be pre-configured: each client already knows the addresses of all the (implicitly trusted) services it requires.

In a truly distributed system this is less obvious, though. If services are liable to be replaced at any time, then what’s to prevent a third party from introducing a malicious server into a system? The one-sided security of the WWW is particularly prone to this at the moment, with continual warnings about “trojans” and “phishing”. Even most bank web sites fail this test, although I did once use a bank account which would confirm with a token I had specified as part of the login process.

As distributed systems grow and become more eclectic this is bound to become more of an issue. A reasonable first step is to allow clients to validate services using client-supplied, out-of-band data, but in a more general case this may not be directly possible, so smarter options such as third-party accreditation, escrow or cryptocurrency-style encryption chains may be needed.
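As one possible sketch of the out-of-band idea (not a prescription), a client could check a signature on each response using a secret agreed with the service in advance, outside the normal request channel. The secret value and the way the signature is delivered below are assumptions for illustration only.

```python
import hashlib
import hmac

# Secret agreed out-of-band (for example, at sign-up), never sent over the wire.
SHARED_SECRET = b"example-secret-agreed-out-of-band"   # placeholder value

def service_is_legitimate(response_body: bytes, signature_hex: str) -> bool:
    """Verify that a response really came from the service we trust,
    by checking an HMAC computed with the pre-shared secret."""
    expected = hmac.new(SHARED_SECRET, response_body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature_hex)
```

This mirrors the bank-token example above: the client supplies (or agrees) something in advance which an impostor service cannot reproduce.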

Time

Time is at once a common part of distributed systems, and one of the aspects most often swept under the carpet. Most remote communication APIs acknowledge the existence of delays and the need for timeouts. Just as with communication protocols, though, the focus is on one “hop” at a time. This largely works for a single client and a single server, but as soon as the system includes multiple levels of dependency or alternate paths with different timing characteristics the system as a whole is open to all sorts of inadvertent failure modes. Timeouts are often treated as static configuration, or worse: hard-coded or even left to manufacturer’s default values.

In a flexible distributed system where the timing characteristics of a service can change at any time, pre-configuring timeouts can be a dangerous oversight. My suggestion for a robust system is that timeouts be established dynamically, and change as the system changes. This may be achieved using a variety of means such as services exposing aggregated timing information for clients, clients indicating their timing requirements as part of a request, some more general negotiation process, and so on.
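As a rough sketch of what “established dynamically” might look like, the following Python class derives a timeout from recently observed latencies rather than from a configured constant. The window size, multiplier and fallback value are illustrative assumptions, not recommendations.

```python
import statistics
from collections import deque

class AdaptiveTimeout:
    """Derives a timeout from recent observed latencies, so it tracks
    the service as its timing characteristics change."""

    def __init__(self, floor=0.05, window=100, multiplier=3.0):
        self.samples = deque(maxlen=window)  # recent latencies, in seconds
        self.floor = floor                   # never go below this
        self.multiplier = multiplier         # headroom over typical latency

    def record(self, latency_seconds):
        self.samples.append(latency_seconds)

    def current(self):
        if not self.samples:
            return 1.0   # arbitrary fallback before any observations exist
        typical = statistics.median(self.samples)
        return max(self.floor, typical * self.multiplier)
```

A client would call record() after each successful request and use current() as the timeout for the next one; the same idea extends to timing data aggregated and published by the service itself.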

Traffic

In some sense traffic is a bit like time. Architects try to design systems to cope with anticipated traffic levels using replication, load-balancing, throttling and so on. Just as with time, though, the design often includes built-in assumptions which may not hold true for the life of the system. This becomes particularly apparent where there are multiple instances of clients using multiple instances of services. Client services pester like selfish and petulant children, even when doing so makes the whole system run more slowly.

Traditional load-balancing, based on evenly distributing client requests between all available servers, is a step in the right direction, but it becomes considerably less effective when the servers have differing performance or load characteristics. Just as with timings, treating load-sharing strategies as static configuration or pre-defined architecture can lead to unwanted emergent behaviour.

In general, the more feedback about current load and performance characteristics that a service can provide to its clients, the more they can make sensible decisions about where (or when) to route their requests.
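One simple way a client might act on such feedback, sketched below, is the “power of two choices” approach: sample two instances at random and send the request to the one reporting the lower load. The server list and load figures here are invented for illustration; how the load figure is gathered, and how fresh it needs to be, are design decisions in their own right.

```python
import random

def choose_server(servers):
    """Pick a server using load feedback: sample two at random and
    route to the less loaded one ("power of two choices")."""
    a, b = random.sample(servers, 2)
    return a if a["load"] <= b["load"] else b

# Example: load figures as last reported by each instance (hypothetical values).
servers = [
    {"addr": "10.0.0.1:8080", "load": 0.72},
    {"addr": "10.0.0.2:8080", "load": 0.31},
    {"addr": "10.0.0.3:8080", "load": 0.55},
]
print(choose_server(servers)["addr"])
```

Even this crude use of feedback tends to behave far better under uneven load than blind round-robin, without requiring every client to know about every server’s state.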