Z.07: The Metrics of Merit, Part 1
Meritocracy matters. The jobs of soldiers, marines, airmen, sailors, coasties, and guardians demand people who are committed to what British General Sir John Hackett called a ‘contract of unlimited liability’. This is because, as servicemembers, we forgo some of the rights conferred by the very document we swear an oath to protect. We can be called upon to fight and to kill, over and over, until we are ultimately called upon to die for the country that constitution established. To be ready to answer those calls, every servicemember needs to strive every day to meet the standards of the profession.
The new Secretary of Defense, Pete Hegseth, has voiced a commitment to meritocracy, repeating the word six times during his confirmation hearing and again in his first message to the troops. This is a good thing.
I am firmly committed to meritocracy. I’ve made my career setting high standards for myself and seeking out new challenges. I’ve gravitated toward units and teams that did the same and, as a result, I have had the privilege to serve alongside hundreds of incredible soldiers.
But meritocracy needs metrics. Metrics are why I argued in favor of CAP; the old way we picked our commanders had none.1 It was highly subjective, with too much room for luck, chance, and bias to determine outcomes. I’ve also argued some jobs should have different standards, designating certain positions as requiring higher physical standards that every soldier serving in them must meet. And I’ve argued commanders need to be data literate, because meritocratic standards and decisions depend on understanding the data behind them. These are just a few examples of what meritocracy means in practice: selecting people who meet clear standards based on the requirements of the job. So far, we have memos advertising a commitment to the idea of meritocracy. We’ll have to be patient and wait to see what the metrics and standards actually are.
Governing Via Hotfix
While we await those standards, the first priority has instead been an aggressive deletion of everything that might be DEI. So aggressive, in fact, that things that aren’t DEI have been caught up in a moment of unclear guidance where anything can be someone’s version of ‘woke’. In many units, SHARP, EO, and EEO training were suspended, only to be reinstated once it was clarified these were not to be viewed as DEI.2 DoD social media managers are struggling to figure out whether an MLK Day post needs to be deleted, and thus far no guidance on what is not DEI has been sent down.
In what appear to be some of the clumsiest cases, the NSA and other agencies have simply been running a search for a list of terms and deleting anything they find. That list includes ‘DEI’ but also ‘bias’. Some have argued the chaos is the purpose.
You don’t have to search hard on the internet to find cringe-inducing examples of DEI gone wrong. Though I, in fact, do have to go search the internet for examples, because I don’t have any of my own. I haven’t done any DEI training in any of the last five years.
Not one minute.
That certainly contradicts the image of a DoD overrun with DEI which has been fueled by shitposters on social media for years. When I met up with my old platoon to bury one of our gunners back in 2022, my brothers from Ramadi peppered me with questions about ‘woke’ units with no time to train. I told them I had no idea what they were talking about.
I’m just one soldier and my lone anecdote is surely not everyone’s experience, but when I pinged my peers on NSTR, only one fellow soldier could cite doing any DEI training. Even that was part of the civilian graduate program he’s currently attending in California.
Regardless, DEI’s problems aren’t just the examples you find in crazy PowerPoints. Independent research has highlighted that some DEI programs can hurt the very people they seek to help. DEI training can fail to reduce bias and can even leave minorities labeled as mere ‘quota hires’ rather than people who earned their jobs on merit. What drives this isn’t clear; it’s hard to argue DEI originated an idea that a lot of Americans took as fact for most of our history. Either way, DEI training hasn’t succeeded at eliminating it. While targeted recruitment and mentorship programs were found to be beneficial, a lot of DEI training itself had limited impact, with effects that vanished within days.
The benefits of diverse teams have solid academic research behind them. But a recent study has shown this benefit depends heavily on teams being able to pick their own members instead of having them assigned by algorithm. Meanwhile, groups in the majority often feel threatened by programs intended to ensure diverse teams. It comes down to signals of fairness, inclusion, and competence, which aren’t going to be read the same way by everyone.
Diversity & Inclusion
I’m in favor of diverse teams because I’ve seen the power of divergent opinions and experiences. I’ve also seen the utter train wreck of multiple commands filled with myopic piles of ‘yes men’. When everyone shares the same background, they tend to approach problems the same way and see the same solutions. When everyone is looking in the same direction, you’re missing something. In my profession, those uncovered flanks get you killed.
One of my bosses had the key insight that diversity itself isn’t enough. He explained that unless your team feels included, they won’t participate in the way you need them to. While I couldn’t agree more, inclusion without diversity is that team of ‘yes men’ that’s barreling toward failure. You need both: diversity and inclusion.
Sidney Dekker, a professor who studied safety extensively, offers a clear example of why diverse perspectives and thinking are critical to success.
“Complex systems can remain resilient if they maintain diversity: the emergence of innovative strategies can be enhanced by ensuring diversity. Diversity also begets diversity: with more inputs into problem assessment, more responses get generated, and new approaches can even grow as the combination of those inputs.”
The key here is finding people who think differently. Being of a different race or gender does not automatically give a person a different experience. People are more complex than their skin color or their genitalia. I’m not ‘espousing the value of minorities’, because I don’t think being a minority is the thing of value.
Instead, my teams need people whose brains complement each other. This is part of why my co-authors and I argued the army shouldn’t preclude neurodivergent candidates. Neurospicy brains fundamentally work differently and can be real assets to our formations.
But that’s not the same as saying every neurodivergent person should be able to serve, or that a fixed percentage of each unit in the army should be neurodivergent. I’m anti-quota, particularly quotas set to match demographics. People are complex and contradictory; we contain multitudes.
Metrics Matter
When I was just a company commander, my battalion commander asked me to rack and stack my team leaders, which I did fairly hastily in what I thought was a meritocratic way. But he never asked me for metrics, and I never took the time to devise any.
It never occurred to me even to have metrics until I stumbled across a Harvard Business Review article while I was at HRC. Critically, the authors introduced me to a way to measure the difference between performance and potential. That insight is important because it helps you separate the people who are good at their current job from the people who can rise to the next one.
Later, when I started battalion command, I looked back at the captains I’d previously bet on as future winners. Some were absolutely crushing it as field grades. But about half weren’t, which meant my assessment had been no better than a coin flip. A lack of metrics led me to misdiagnose potential, just like the managers cited in the HBR article.
So, I adopted a similar matrix, with distinct measurable assessments for my commanders to use. We gave a copy to the officers as well, so they could see how they were being assessed. That measurable matrix had the added benefit of helping me spot commanders who had clearly started with the ranking they wanted and worked backwards.
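To make the distinction concrete, here’s a minimal sketch of what a two-axis performance/potential grid can look like in code. The scores, labels, and cutoffs are all hypothetical; this is not the actual matrix my battalion used, just an illustration of scoring the two measurements separately.

```python
# Hypothetical nine-box: performance measured against the current job,
# potential measured against the requirements of the next one.
from dataclasses import dataclass

@dataclass
class Officer:
    name: str
    performance: int  # 1-3, scored against current-job standards
    potential: int    # 1-3, scored against next-job requirements

def nine_box(o: Officer) -> str:
    """Place an officer on the grid; labels are illustrative."""
    corners = {
        (3, 3): "star: ready for the next job now",
        (3, 1): "strong performer, low potential: keep developing in place",
        (1, 3): "high potential, underperforming: dig into why",
        (1, 1): "underperformer: coach or move",
    }
    return corners.get((o.performance, o.potential), "solid: keep developing")

roster = [Officer("CPT A", 3, 3), Officer("CPT B", 3, 1), Officer("CPT C", 2, 2)]
for o in roster:
    print(f"{o.name}: {nine_box(o)}")
```

The point isn’t the code; it’s that performance and potential get scored separately, against different standards, so one can’t quietly stand in for the other.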
Without metrics, we tend to rely overconfidently on our very subjective judgement. Our propensity is ‘ducks pick ducks’. A lot of this isn’t overt or even conscious. It comes down to bias.
There’s Bias in Every Model
A few months back I shared an image I worked on while I was learning how AI/ML works. Transforming Iron Maiden’s British ‘Eddie’ into an American Green Beret ended up taking about 1,000 generations. A sizeable chunk of them were spent getting the flag right. I tried three different models in Stable Diffusion, but every time the flag came out wrong. This is because AI/ML doesn’t actually know what it’s drawing. It’s just doing Bayesian math based on what’s in its training data.
American flags were represented in all three of the models I ended up using. But all the algorithm knew about them was to draw a corner of blue with ‘many’ white stars, then a series of alternating horizontal red and white stripes. It didn’t know how many stars, nor how many stripes, because it never needed to.
Stable Diffusion is an impressive tool that can capture even the subtle ways a flag folds and ripples, in realistic fashion, with minimal input from an amateur like me. But it’s hamstrung by what’s in its training data. You might have missed a key word above: ‘horizontal’. None of the American flags in the three different models were suspended the way Eddie’s was, dangling down from the end of a guidon. After a lot of frustration and long nights trying to get the flag right, I hit on an idea and rotated the whole image 90°. It worked almost instantly. Scroll back up to see the result.
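For the curious, the workaround looks something like the sketch below. This assumes the Hugging Face diffusers img2img pipeline; the model ID, file names, prompt, and strength value are illustrative stand-ins, not a record of what I actually ran.

```python
# Rotate the init image so the flag reads as horizontal (matching the
# training data), let the model redraw it, then rotate the result back.
import torch
from diffusers import StableDiffusionImg2ImgPipeline
from PIL import Image

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

init = Image.open("eddie_guidon.png").convert("RGB")
rotated = init.rotate(90, expand=True)  # vertical flag now lies horizontal

out = pipe(
    prompt="american flag hanging from a guidon, photorealistic",
    image=rotated,
    strength=0.6,  # keep most of the composition, redraw the details
).images[0]

out.rotate(-90, expand=True).save("eddie_fixed.png")  # restore orientation
```

The model never learned vertical flags, so you hand it the problem in the orientation it does know.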
That’s bias in the model. It’s not something racist, or even intentional. No one took the time to root out images of vertically aligned American flags from the models’ training data as part of some ‘woke’ crusade. It’s just what inevitably happens when you build a model: you only have the data you have. My fix took a literal change in perspective.
Our minds can work in a very similar way.
‘Where do I know you from?’
I have a unique advantage in the army: I’m the default model. When our brigade rushed to equip a couple thousand soldiers in Korea with desert combat uniforms in time for a hasty deployment to Al Anbar, there were plenty of medium-regular tops and bottoms to go around. If I kick the seat up in a Bradley fighting vehicle and just stand on the floor, I’m right at ‘name tape defilade’, which is the height where you can see out of the track but can also duck back inside in a hurry if you need to. My platoon sergeant and my gunners were taller, so they were often stuck holding a very uncomfortable half-crouch for hours on end.
I’m such the default model that at least once a quarter some fellow soldier will swear they saw me somewhere I wasn’t. Lots of people are sure they know me from somewhere. There are just a ton of ‘five-foot-nine medium build bald white guys’ running around the army. While that may lead to the occasional inconvenience of mistaken identity, it certainly comes with some advantages.
Just like those horizontal American flags, there are a ton of me in the army’s ‘training data’. This is part of why people like me keep coming up so often. It doesn’t mean I got my job through overt racism, nor does it diminish the hard work I put in to get selected for the jobs I’ve earned. But it does mean it was very easy for my leadership to see me in those jobs. This is what Kahneman and Tversky described as the ‘…inconsistency between the logic of probability and the logic of representativeness’.
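A toy simulation shows how base rates alone produce this, and how a little representativeness compounds it. Every number here is invented for illustration; the point is the mechanism, not the percentages.

```python
# When one profile dominates the candidate pool, even a profile-blind
# picker mostly selects that profile; a small "looks like the job" bump
# on top compounds the effect.
import random

random.seed(0)
pool = ["default"] * 70 + ["other"] * 30  # made-up 70/30 base rate

def share_of_defaults(bump: float = 0.0, n: int = 10_000) -> float:
    """Fraction of n picks that land on the dominant profile."""
    weights = [1.0 + bump if c == "default" else 1.0 for c in pool]
    picks = random.choices(pool, weights=weights, k=n)
    return picks.count("default") / n

print(f"profile-blind picker:            {share_of_defaults():.0%}")     # ~70%
print(f"with a 20% 'fits the mold' bump: {share_of_defaults(0.2):.0%}")  # ~74%
```

The logic of probability says start from the 70/30 pool; the logic of representativeness quietly adds the bump.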
The only thing that ever prevented me from being seen as belonging in a room was my rank. Plenty of my peers had to work twice as hard as I did to be seen as having a place in those same rooms. They had narrower margins of error, because any defect made them stand out in a way my failures never did. My failures surprised people, while theirs risked confirming previously held suspicions.
Which brings us back to the NSA’s list of 27 banned words. I’ve already made clear that I’m in favor of ‘diversity’ and ‘inclusive’. But that list also bans terms like ‘bias’ and ‘confirmation bias’.
There’s a sizeable body of scientific research establishing that cognitive biases are absolutely a thing. I don’t imagine I’ll change the mind of anyone too partisan to see the science there, so I won’t bother. But more importantly, everyone working in AI knows about bias. It’s inimical to models, and addressing it is a significant part of the training process. The Venn diagram of ‘AI programmers’ and ‘people who don’t believe in bias’ has zero overlap. They don’t exist.
A ton of things can impact what gets into your data set, and judging endpoint outcomes while ignoring the inputs is a surefire way to get the math wrong. Back at HRC, there was a noticeable disparity in how many commanders were selected for CSL across the five groups each year. When we combed through the data, though, we found some groups simply had fewer candidates.3 A group with fewer candidates had fewer tickets in the lottery of command. To determine what mattered, we had to find the p-values in our data to figure out what might be driving outcomes. A p-value estimates how likely you’d be to see a disparity at least that large by chance alone; it’s a way of checking whether your data shows a real effect or just noise.
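Here’s a minimal sketch of that kind of check. The candidate and selection counts below are made up for illustration; a chi-squared test asks whether the differences in selection rates across groups are bigger than chance alone would produce.

```python
# Hypothetical counts for five groups: selected vs. not selected.
# Similar selection *rates* with different pool sizes should yield a
# large p-value, i.e., the raw disparity is explained by pool size.
from scipy.stats import chi2_contingency

selected     = [30, 22, 18, 25, 12]   # CSL selects per group (invented)
not_selected = [70, 58, 42, 75, 28]   # remaining candidates (invented)

chi2, p, dof, expected = chi2_contingency([selected, not_selected])
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p:.3f}")
# A large p (conventionally > 0.05) means the outcome gap is consistent
# with some groups simply fielding fewer candidates.
```

That’s the pattern we found at HRC: the gap in selections tracked the gap in candidates.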
This post ended up a lot longer than I’d originally thought it would, so I’m breaking it into two parts. In my next post, I’ll take a look at what the p-value of SF battalion commands looks like.
Command Assessment Program. An assessment run in the fall where the Army evaluates the next generation of the Command Select List (CSL) leaders among its Lieutenant Colonels, Colonels, and Sergeants Major.
Sexual Harassment/Assault Response and Prevention, Equal Opportunity, and Equal Employment Opportunity programs.