A few years ago, when I was consulting in London, we’d just finished implementing some automation and were planning to roll out a big change on a Friday afternoon to show it off. Firstly, this is a bad idea: as Chief Jim Hopper would say, “[Fridays] are for coffee and contemplation”, not rolling out big changes. Plus, we had a pub to be at in just over 2 hours…
The change we were making was related to the way the backend UNIX fleet authenticated user logins, so it should have been fairly innocuous, but it could definitely still go wrong. After some testing and approvals we rolled out the change and everything was looking good: there weren’t any errors being reported and we could now log in to servers with our regular user accounts, which was perfect. Only 45 minutes until the pub. Then the phone started to ring…
“Hey, I can’t save PDFs anymore!” Wait, what? I thought to myself. We haven’t touched any laptops; how could we possibly have broken the ability to save PDFs? It turned out a whole department was basically at a standstill because nobody could save PDFs, and this was absolutely critical to their workflow. We started frantically looking into it and found that they weren’t clicking Print -> Save as PDF like you’d expect; they had an actual printer called “PDF Printer” that they printed to instead. Behind it was an ancient server that someone had set up years ago, which created the PDF and then ran a script that moved it into the user’s home folder. And of course our change meant that the server was no longer seeing the users in the same way (UID resolution was different) and no longer had permission to move the PDFs into their home folders. Fortunately, a few hastily written scripts later, the “PDF Printer” was back up and working. Plus we still got to the pub at a reasonable time, though with a bit more excitement than we were intending.
At the pub, finally, we were discussing: how could we have prevented something like that, how could we have caught it before the users did, and how could we have figured it out and fixed it faster? Certainly nobody would have thought to test it, because nobody remembered it existed. Also, nobody would have thought “better check we don’t have an ancient UNIX server printing PDFs”, because why would that exist? Every browser and OS can just save them directly by clicking “Save as PDF”! This is where the idea of Overmind was born. You need something that discovers which processes are important to your users and learns how they usually work, not something that only watches what it’s told to, because most of the time the thing that’s going to get between you and the pub on a Friday afternoon isn’t something you already knew you needed to be watching for.
Since that eventful afternoon I’ve had a lot more time to think about what the next pillar of observability needs to be. It has to let us know what is going on not just in the few top-priority applications, but also in the vast majority of apps that have poor metrics, logs, and tracing, and whose signals nobody would know how to interpret even if they were good. Overmind will allow users to observe and traverse their cloud, Kubernetes, or traditional infrastructure in a unified way they have never seen before, and will hopefully extend the value that people have been getting from applying observability to top-priority apps to absolutely everything in their infrastructure.
It’s already capable of mapping the config for an entire application and allowing users to explore it as a relational graph: for example, all the way from the issuing certificate authority for the web page, through the load balancers, down to the details of the serverless function that hosts it. All of this happens without you needing to change any application config, or even know in advance that the application exists. It’s getting better every day and I can’t wait to share more with you all.