Monday, November 2, 2009

Event Driven Architecture for Systems that Never Stop

InfoQ posted an excellent talk by Joe Armstrong (of Erlang fame) titled Systems that Never Stop (and Erlang). Joe presents 6 laws that he sees as crucial to systems that never stop (i.e. systems with X nines of reliability): Isolation, Concurrency, Failure Detection, Fault Identification, Live Code Upgrade, and Stable Storage.

The Erlang language and runtime were designed for exactly this kind of environment, so it comes as no surprise that Erlang shines in all these areas. However, it occurred to me that an Event Driven Architecture goes a long way toward building a system that never stops with more mainstream tools like Java and a relational database.
  • Isolation: Crashing of one process should not influence any other process, effectively isolating different processes. The big boon here is that this can really improve the reliability of your system: the probability of two independent processes both failing is the product of their individual failure probabilities (p² rather than p), far lower than the probability of a single process failing.
    In an event driven world events are typically handled independently of one-another. Failure to process a particular event does not affect the processing of any other event in the system.
  • Concurrency: A runtime environment should not put any undue constraints on how far you can take concurrency, i.e. the number of threads or processes in your system. Designing a system with concurrency in mind allows you to take advantage of current day hardware (multi-cores and the like), but also goes hand in hand with the first law: independent, isolated processes can run concurrently without much problem.
    In an event driven system, multiple independent event processors can run concurrently on a single machine, taking advantage of all CPU cores available on the machine. Furthermore, event driven systems scale horizontally, allowing you to easily add additional machines with even more event processors.
  • Failure Detection: You must be able to detect when things go wrong, for instance by having a nanny like process that can monitor the system and can take appropriate action when failures occur. Furthermore, you should fail early, limiting the potential impact of the failure.
    In an event driven system you typically have failover handling that holds on to events that failed processing for later analysis or retry. One system property that Joe does not mention in his talk but is very relevant in this area is idempotence: being able to retry things without affecting the final result. With some care in how you design your event handlers, idempotence is readily attainable in an event driven system.
  • Fault Identification: If something has gone wrong, and you've detected it as per the previous law, you need to be able to identify what went wrong.
    Again, in event driven systems this comes naturally: the event that failed processing is normally retained by the failover system, along with error messages and the like, and can easily be used to reproduce what happened.
  • Live Code Upgrade: You shouldn't need to bring down your system for an upgrade. Whether or not this is important to you is of course very dependent on the application that you're writing. Still, if you have a cluster of event processors, you can upgrade them one by one, keeping the system live at all times. On a single machine you could use things like OSGi to attain similar results.
  • Stable Storage: This essentially boils down to having transactional (ACID) storage available. In your average enterprise application this storage will be provided by an RDBMS (Oracle, ...). Most current-day MOM solutions (MQ, ...) are also fully transactional, and even XA aware. So there should never be a need to lose any information in an event driven system built on top of these technologies.
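Several of these laws can be illustrated in plain Java. The sketch below (all class and method names are hypothetical, not from any particular framework) shows a tiny event pipeline: each event is handled in isolation on a thread pool, a duplicate-detection set makes retries idempotent, and events that fail processing end up in a dead-letter queue for later fault identification and retry.

```java
import java.util.*;
import java.util.concurrent.*;

// Minimal sketch of an event-driven pipeline illustrating isolation,
// concurrency, idempotence, and failure detection. Names are illustrative.
public class EventPipeline {

    static final class Event {
        final String id;
        final String payload;
        Event(String id, String payload) { this.id = id; this.payload = payload; }
    }

    interface EventHandler {
        void handle(Event e) throws Exception;
    }

    // Concurrency: as many processors as the machine has cores.
    final ExecutorService pool =
            Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());

    // Idempotence: remember which events were already processed, so a
    // redelivered event is silently skipped instead of applied twice.
    final Set<String> processed = ConcurrentHashMap.newKeySet();

    // Failure detection / fault identification: failed events are kept
    // around (here in memory; in practice in stable storage) for analysis.
    final Queue<Event> deadLetter = new ConcurrentLinkedQueue<>();

    void submit(Event e, EventHandler handler) {
        pool.submit(() -> {
            if (!processed.add(e.id)) {
                return; // duplicate delivery: already handled, retrying is safe
            }
            try {
                handler.handle(e); // Isolation: an exception here...
            } catch (Exception ex) {
                processed.remove(e.id);
                deadLetter.add(e); // ...affects only this one event
            }
        });
    }
}
```

In a production system the `processed` set and the dead-letter queue would live in the transactional store (law 6) rather than in memory, but the structure of the code stays the same.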

All in all, an event driven architecture goes a long way toward bringing you a system that never stops. This is of course in stark contrast with batch systems, which have a hard time living up to a number of these laws.


  1. Interesting write up. Thanks!

    Something worth pointing out is that OSGi (specifically the Spring dm Server environment) does not handle runtime deployment nicely. It does so in theory, but after several deployments you're going to see permgen errors. This is, of course, not Spring dm's fault; it's a well documented issue in the JVM / core libraries (many suspect the bean info classes). You do mention the solution that works for many - using multiple consumers to deal with rolling restarts - but the problem still exists. This is one of those things that we need to deal with. There's no reason we should have to deal with restarting app servers these days. We need to fix Java class loading and isolation, and fast.

  2. I knew of the classical PermGen-like problems typically caused by commons-logging when redeploying apps in, for instance, Tomcat, and I have spent quite some time in the past chasing down a memory leak in the JVM reflection implementation that also caused PermGen OOM errors, but I wasn't aware this is still an issue in OSGi environments.
    You are right: we need to fix this!

  3. I'd like to clarify the position regarding SpringSource dm Server, although I accept the general statement that Java applications can encounter permgen leaks if they, or the libraries they depend on, are badly written.

    Permgen leaks depend on the application and many applications can be redeployed arbitrarily often in dm Server without leaking permgen space.

    This is something which is remarked on in dm Server classes in comparison to Tomcat which often needs to be recycled to free up space.

    E. Sammer: if you are aware of a specific application which causes such a leak, please file a bug and we'll see what can be done. (For example, we recently fixed such a bug in the use of the JDBC DriverManager.)

    Glyn Normington
    SpringSource dm Server Development