Here are a bunch of things (good) feature-team engineers do on every project that infrastructure engineers should really start doing:
- Measure prior to building anything to validate that there’s actually something worth fixing.
- Measure again during your rollout to make sure you’re seeing the results you expect (i.e. that your code is actually doing something) and that the error rate is low (i.e. that it’s not breaking stuff). Ideally, measure accurately and quickly enough that if your change is breaking things, you can fix it before your customers complain.
- Measure again after the final rollout to prove you delivered the improvement you claimed you would.
- Talk to your customers (feature engineers) throughout the process. Talk to them before writing a line of code, talk to them prior to doing a partial demo to one of them, talk to them after the demo to see how it went, and tell them when it goes live.
- Do a limited rollout to a small set of customers (feature engineers) first. Ask for feedback and accept it graciously, especially if they are going out on a limb and telling you something you don’t want to hear.
- Learn your customers’ (feature engineers’) process.
- Have a rollback plan.
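
To make the "measure during rollout" and "have a rollback plan" items a bit more concrete, here is a minimal Python sketch of a rollout check. Everything in it is an assumption for illustration: the `Window` shape, the `should_roll_back` name, and the threshold values are hypothetical, and you would feed it counts from whatever monitoring you already have.

```python
from dataclasses import dataclass


@dataclass
class Window:
    """Request and error counts from one measurement window (hypothetical shape)."""
    requests: int
    errors: int

    @property
    def error_rate(self) -> float:
        # Treat an empty window as 0.0 rather than dividing by zero.
        return self.errors / self.requests if self.requests else 0.0


def should_roll_back(baseline: Window, canary: Window,
                     max_increase: float = 0.005,
                     min_requests: int = 200) -> bool:
    """Return True if the canary's error rate exceeds the baseline's
    by more than max_increase (both thresholds are made-up defaults)."""
    if canary.requests < min_requests:
        return False  # too little data to judge; keep watching
    return canary.error_rate - baseline.error_rate > max_increase


if __name__ == "__main__":
    before = Window(requests=10_000, errors=30)   # measured before the change
    partial = Window(requests=500, errors=10)     # measured on the limited rollout
    print("roll back?", should_roll_back(before, partial))
```

The point is less the specific thresholds than having the baseline number in hand before the rollout starts, so that "is it breaking stuff?" is a comparison and not a guess.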
These may sound obvious, but in practice I rarely see them done. For example, suppose you want to change from CI/CD tool A to CI/CD tool B, and you’re claiming the change will improve reliability/speed/uptime/whatever. If you really believe this (and don’t just want to play with a new toy), then measure all of these things (reliability, speed, uptime) across all your CI/CD jobs beforehand and do the math (e.g. if you’re already 99.7% reliable, then the most you can possibly gain is 0.3 percentage points of reliability).
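
The "do the math" step can be as small as this Python sketch. The function names and the job counts are made up for illustration; the real inputs would come from your CI/CD tool's job history.

```python
def reliability(succeeded: int, total: int) -> float:
    """Fraction of jobs that succeeded, between 0.0 and 1.0."""
    if total <= 0:
        raise ValueError("need at least one job to measure reliability")
    return succeeded / total


def max_gain_points(current: float) -> float:
    """Upper bound on improvement, in percentage points, if the new
    tool were perfectly reliable."""
    return (1.0 - current) * 100


if __name__ == "__main__":
    # Hypothetical numbers: 9,970 of the last 10,000 jobs succeeded.
    current = reliability(succeeded=9_970, total=10_000)
    print(f"current reliability: {current:.1%}")
    print(f"best possible gain: {max_gain_points(current):.1f} points")
```

If the best possible gain is a fraction of a percentage point, that number belongs in the proposal next to the cost of the migration.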
Or if you are saying everybody is going to love switching from tool A to tool B, perhaps send out an informal survey (Google Forms, or even in Slack) on how people feel about the current tool, then run the same survey again 6 months after your migration and compare the results.
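
Comparing the two survey rounds does not need anything fancy. Here is a rough Python sketch that assumes a 1-to-5 satisfaction score per respondent; the scale, the "4 or higher counts as satisfied" cutoff, and the sample scores are all placeholders I made up, not real data.

```python
from statistics import mean


def summarize(scores: list[int]) -> dict:
    """Summarize one survey round of 1-5 scores."""
    return {
        "n": len(scores),
        "mean": mean(scores),
        # Share of respondents who answered 4 or 5 (assumed cutoff).
        "share_satisfied": sum(s >= 4 for s in scores) / len(scores),
    }


def compare(before: list[int], after: list[int]) -> dict:
    """Difference between the post-migration and pre-migration rounds."""
    b, a = summarize(before), summarize(after)
    return {
        "n_before": b["n"],
        "n_after": a["n"],
        "mean_change": a["mean"] - b["mean"],
        "satisfied_change": a["share_satisfied"] - b["share_satisfied"],
    }


if __name__ == "__main__":
    before = [2, 3, 3, 4, 2]   # placeholder answers before the migration
    after = [4, 4, 3, 5, 4]    # placeholder answers 6 months later
    print(compare(before, after))
```

With only a handful of responses this is a sanity check and not statistics, which is why the sketch reports the response counts alongside the changes.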
To be fair, many feature teams have a product owner who focuses on a lot of these concerns, and for some odd reason that’s rare in platform infra. But it shouldn’t be. Platform infra teams seem like one of the most likely places in a company for engineers to spend 6+ months on a project that gets canceled or otherwise delivers no measurable improvement to its customers.