A few years ago, when I was consulting in London, we’d just finished implementing some automation and were planning to roll out a big change on a Friday afternoon to show it off. For a start, this is a bad idea: as Chief Jim Hopper would say, “[Fridays] are for coffee and contemplation”, not rolling out big changes. Plus we had a pub to be at in just over 2 hours…
The change we were making was related to the way the backend UNIX fleet authenticated user logins, so it should have been fairly innocuous, but could definitely still go wrong. After some testing and approvals we rolled out the change and everything was looking good. There weren’t any errors being reported and we could now log in to servers with our regular user accounts, which was perfect. Only 45 minutes until the pub. Then the phone started to ring…
“Hey, I can’t save PDFs anymore!” Wait, what? I thought to myself. We hadn’t touched any laptops, so how could we possibly have broken the ability to save PDFs? It turned out a whole department was basically at a standstill because nobody could save PDFs, and this was absolutely critical to their workflow. We started frantically looking into it and found that they weren’t clicking Print -> Save as PDF like you’d expect. They had an actual printer called “PDF Printer” that they printed to instead, which it appeared we’d broken somehow. Now to work out what on earth this thing was, why it existed, and how to fix it.
We asked around, but nobody knew about it. There was a wiki, but it wasn't mentioned. There was a CMDB, but it wasn't in there. We ended up just discovering the IP address of this "printer" and trying to log in via SSH, which worked! What followed then was a particularly complicated session of infrastructure palaeontology which, much like regular palaeontology, is much more tedious and less interesting than Indiana Jones makes it look, but we eventually cracked it!
It turned out this was an ancient server that someone had set up years ago, which created the PDF and then ran a script that moved it into the user’s home folder. And of course our change meant that the server was no longer seeing the users in the same way (UID resolution was different) and no longer had permission to move the PDFs into their home folders. Once we’d decided on a fix, and satisfied ourselves that it wouldn’t make everything worse (or break some other unforeseen critical relic), the “PDF Printer” was back up and working. But not before the entire department had gone home for the weekend without being able to finish their work. Thankfully it wasn’t so late that we couldn’t make it to the pub afterwards.
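To make the failure mode concrete, here is a minimal Python sketch of the kind of check that would have surfaced it. It assumes the problem came down to a username resolving to a different numeric UID than the one that owns that user's home folder on disk, which is one plausible reading of what happened rather than a reconstruction of the actual script. The `HomeDir` type, the function names, and the example users and UIDs are all hypothetical.

```python
import os
import pwd
from dataclasses import dataclass


@dataclass(frozen=True)
class HomeDir:
    path: str
    owner_uid: int  # numeric owner recorded on disk


def find_uid_mismatches(resolved_uids: dict[str, int],
                        homes: dict[str, HomeDir]) -> list[str]:
    """Return users whose resolved UID no longer matches their home folder's owner.

    A user missing from resolved_uids (the name no longer resolves at all)
    is also reported.
    """
    mismatched = []
    for user, home in sorted(homes.items()):
        uid = resolved_uids.get(user)
        if uid is None or uid != home.owner_uid:
            mismatched.append(user)
    return mismatched


def live_mismatch(username: str) -> bool:
    """Same check against the real system (UNIX only).

    Raises KeyError if the username cannot be resolved.
    """
    entry = pwd.getpwnam(username)
    return os.stat(entry.pw_dir).st_uid != entry.pw_uid


if __name__ == "__main__":
    # Hypothetical data: home folders owned by the old, local UIDs.
    homes = {
        "alice": HomeDir("/home/alice", 1001),
        "bob": HomeDir("/home/bob", 1002),
    }
    # Hypothetical result of the auth change: bob now resolves to a new UID.
    after_change = {"alice": 1001, "bob": 50002}
    print(find_uid_mismatches(after_change, homes))
```

Run against every host before and after an authentication change, a check along these lines would flag any account whose files suddenly belong to "someone else". It would not, of course, have told us that the “PDF Printer” existed in the first place.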
At the pub, finally, we got to discussing it: how could we have prevented something like that, how could we have caught it before the users did, and how could we have figured it out and fixed it faster? Certainly nobody would have thought to test it, because nobody remembered it existed. Nor would anybody have thought “better check we don’t have an ancient UNIX server printing PDFs”, because why would that exist? Every browser and OS can just save them directly by clicking “Save as PDF”! This is where the idea of Overmind was born. You need something that discovers which processes are important to your users and learns how they usually work, not something that only watches what it’s told to, because most of the time the thing that’s going to get between you and the pub on a Friday afternoon isn’t something you already knew you needed to be watching for.
Since that eventful afternoon I’ve had a lot more time to think about what the next pillar of observability needs to be. It has to let us know what is going on not just in the few top-priority applications, but also in the vast majority of apps that have poor metrics, logs, and tracing, and that nobody would know how to interpret even if the telemetry were there. Overmind will allow users to observe and traverse their cloud, Kubernetes, or traditional infrastructure in a unified way they have never seen before, and will hopefully extend the value that people have been getting from applying observability to top-priority apps to absolutely everything in their infrastructure.
It’s already capable of mapping the config for an entire application, all the way from the issuing certificate authority for the web page, through the load balancers down to the details of the serverless function that hosts it (for example), and allowing users to explore it as a relational graph. All this is without you needing to change any application config, or even know that the application exists in advance. It’s getting better every day and I can’t wait to share more with you all.