While NIST doesn’t speculate as to why, it found that the performance of 189 facial recognition algorithms from 99 different vendors varied by “race, sex and age”: that is, the systems performed significantly worse when asked to recognize people who weren’t young, white and male.
Note that fixing this isn’t necessarily a good thing. Chinese companies struggled to train accurate facial recognition for people of African ancestry, so they had an African state with extensive policing and telecoms ties to large Chinese firms furnish them with its national driver’s license photo database, and now Chinese surveillance works really well on people of African descent. Success?
1. For one-to-one matching, the team saw higher rates of false positives for Asian and African American faces relative to images of Caucasians. The differentials often ranged from a factor of 10 to 100 times, depending on the individual algorithm. False positives might present a security concern to the system owner, as they may allow access to impostors.
2. Among U.S.-developed algorithms, there were similar high rates of false positives in one-to-one matching for Asians, African Americans and native groups (which include Native American, American Indian, Alaskan Indian and Pacific Islanders). The American Indian demographic had the highest rates of false positives.
3. However, a notable exception was for some algorithms developed in Asian countries. There was no such dramatic difference in false positives in one-to-one matching between Asian and Caucasian faces for algorithms developed in Asia. While Grother reiterated that the NIST study does not explore the relationship between cause and effect, one possible connection, and area for research, is the relationship between an algorithm’s performance and the data used to train it. “These results are an encouraging sign that more diverse training data may produce more equitable outcomes, should it be possible for developers to use such data,” he said.
4. For one-to-many matching, the team saw higher rates of false positives for African American females. Differentials in false positives in one-to-many matching are particularly important because the consequences could include false accusations. (In this case, the test did not use the entire set of photos, but only one FBI database containing 1.6 million domestic mugshots.)
5. However, not all algorithms give this high rate of false positives across demographics in one-to-many matching, and those that are the most equitable also rank among the most accurate. This last point underscores one overall message of the report: Different algorithms perform differently.
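To make the "factor of 10 to 100" figure in the list above concrete, here is a minimal sketch of how a false positive differential between groups could be computed for one-to-one matching. It is not NIST's code or methodology: the function names, the group labels, the 0.5 threshold and the synthetic scores are all made up for illustration. It assumes each record is a comparison between photos of two different people, so any score at or above the threshold counts as a false positive.

```python
from collections import defaultdict


def false_match_rates(impostor_scores, threshold):
    """Return {group: false positive rate} for impostor comparisons.

    impostor_scores is an iterable of (group, score) pairs, where every
    pair comes from comparing photos of two *different* people.
    """
    totals = defaultdict(int)
    false_matches = defaultdict(int)
    for group, score in impostor_scores:
        totals[group] += 1
        if score >= threshold:  # the system wrongly says "same person"
            false_matches[group] += 1
    return {g: false_matches[g] / totals[g] for g in totals}


def differential(rates, reference_group):
    """Express each group's rate as a multiple of the reference group's."""
    ref = rates[reference_group]
    if ref == 0:
        raise ValueError("reference group has no false positives")
    return {g: r / ref for g, r in rates.items()}


# Synthetic, hypothetical data: 1,000 impostor comparisons per group.
synthetic = (
    [("group_a", 0.9)] * 1 + [("group_a", 0.1)] * 999
    + [("group_b", 0.9)] * 10 + [("group_b", 0.1)] * 990
)

rates = false_match_rates(synthetic, threshold=0.5)
ratios = differential(rates, reference_group="group_a")
print(rates)
print(ratios)
```

With these invented numbers, group_b is falsely matched about ten times as often as group_a at the same threshold, which is the low end of the range the report describes.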
NIST Study Evaluates Effects of Race, Age, Sex on Face Recognition Software [NIST]
(Thanks, Gary Price!)
(Image: Rusty Clark, CC BY, modified)