22 November 2024

AI Magic

'The reanimation of pseudoscience in machine learning and its ethical repercussions' by Mel Andrews, Andrew Smart and Abeba Birhane, (2024) 5(9) Patterns 101027 (Cell Press), comments:

 Machine learning has a pseudoscience problem. An abundance of ethical issues arising from the use of machine learning (ML)-based technologies—by now, well documented—is inextricably entwined with the systematic epistemic misuse of these tools. We take a recent resurgence of deep learning-assisted physiognomic research as a case study in the relationship between ML-based pseudoscience and attendant social harms—the standard purview of “AI ethics.” In practice, the epistemic and ethical dimensions of ML misuse often arise from shared underlying reasons and are resolvable by the same pathways. Recent use of ML toward the ends of predicting protected attributes from photographs highlights the need for philosophical, historical, and domain-specific perspectives of particular sciences in the prevention and remediation of misused ML. 

The present perspective outlines how epistemically baseless and ethically pernicious paradigms are recycled back into the scientific literature via machine learning (ML) and explores connections between these two dimensions of failure. We hold up the renewed emergence of physiognomic methods, facilitated by ML, as a case study in the harmful repercussions of ML-laundered junk science. We summarize and analyze several such studies, with attention to the means by which unsound research lends itself to social harms, and explore some of the many factors contributing to poor practice in applied ML. In conclusion, we offer developers and practitioners resources for best research practice.

The fields of AI/machine learning (ML) ethics and responsible AI have documented an abundance of social harms, both actual and potential, enabled by the methods of ML. Although the topic is comparatively obscure, critics have also sought to draw attention to the epistemic failings of ML-based systems: failures of functionality and scientific legitimacy. The connection between the ethicality and the epistemic soundness of deployed ML, however, has received scant attention.

We urge that if the field of AI ethics is to be efficacious in preventing and remediating the social harms flowing from deployed ML systems, it must first grapple with the discrepancies between what these tools are presumed to do epistemically and what they actually achieve in practice. While this observation is not novel (see Raji et al.), we build on prior work both by analyzing the issue from a philosophical vantage point and by venturing into the intricacies of in-practice epistemic and ethical misuses of ML systems. We argue that philosophical, historical, and scientific perspectives are necessary for confronting these issues and that the ethical and epistemic problems cannot, and should not, be confronted independently.

A recent surge of deep learning-based studies has claimed the ability to predict unobservable latent character traits, including homosexuality, political ideology, and criminality, from photographs of human faces or other records of outward appearance; see, among others, Alam et al., Chandraprabha et al., Hashemi and Hall, Kabir et al., Kachur et al., Kosinski et al., Mindoro et al., Parde et al., Peterson et al., Mujeeb Rahman and Subashini, Reece and Danforth, Su et al., Tsuchiya et al., Verma et al., Vrskova et al., and Wang and Kosinski. In response, government and industry actors have adapted such methods into technologies deployed on the public in the form of products such as Faception, Hirevue, and Turnitin. The term of art for methods endeavoring to predict character traits from human morphology is “physiognomy.” Research in the physiognomic tradition goes back centuries, and while its methods largely fell out of favor with the downfall of the Third Reich, the prospects of ML have renewed scientific interest in the subject. Much like historical forays into this domain, this new wave of physiognomy, resurrected yet apparently not sufficiently rebranded, has faced harsh criticism on both ethical and epistemic grounds.

This critical response, however, has yet to explore how the confused inferential bases of these studies are responsible for their ethically problematic nature. There are several conclusions we wish to draw from the detailed study of these examples, which we believe extrapolate to the relation between ethical and epistemic issues in deployments of ML at large. (1) No inference is theory neutral. (2) Leaving a theory or hypothesis tacit means it is not held to account, and its conclusions are not critically evaluated before the results of such work are deployed or acted upon. (3) If a study informs a policy, intervention, or technology that will materially impact human lives (in other words, if a study is at all informative) and it misrepresents the human reality within which it is deployed, harms to humans should be expected. Wrong theories generate wrong interventions; wrong interventions cause harm. (4) ML models are developed and deployed to extract complex, high-dimensional statistical patterns from large datasets. These patterns are typically taken to represent unobservable latent features of the systems from which the training data were drawn, yet the norms and procedures for correctly inferring unobservable latent variables from correlational measures differ by scientific field and must be indexed to subject matter. (5) Meta-narratives and cycles of hype surrounding ML, we argue, play a direct role in encouraging errant usage of the tools. When ML tools deliver false inferences, the outcomes are rarely ethically innocuous; this is true in general, but it is all the more salient for ML tools deployed in socially sensitive arenas. In bringing to light the connection between pseudoscientific methods in applied ML and the ethical harms they perpetuate, we hope to encourage greater care in the design and usage of such systems.

Physiognomy resurrected 

“Physiognomy” is “the facility to identify, from the form and constitution of external parts of the human body, chiefly the face, exclusive of all temporary signs of emotions, the constitution of the mind and the heart.” Georg Christoph Lichtenberg, 1778

Recent years have seen an abundance of papers promulgating physiognomic methods resting on ML models. Work of this ilk is undertaken by academic research groups, private firms, and government agencies. A number of representative instances of each claim to have trained ML classifiers to predict personality, behavioral, or identity characteristics from image, text, voice, or other biometric data. Inferred labels have included race, sexuality, mental illness, criminal propensity, autism, and neuroticism. These studies have predominantly relied on deep neural networks (DNNs), sometimes in tandem with simpler regression techniques. The practice of wielding the methods of ML toward the (putative) prediction of internal mental states, dispositions, or behavioral propensities based on outwardly visible morphology has been labeled “AI pseudoscience,” “digital phrenology,” “physiognomic AI,” “AI snake oil,” “bogus AI,” and “junk science.” These technologies, however, do not exist only in the abstract: a growing number of companies now market physiognomic capabilities, including the ability to detect academic dishonesty in students and future performance in prospective employees. Remarkably, a single tool marketed to defense contractors boasts of the ability to predict “pedophilia,” “terrorism,” and “bingo playing.”

In this section, we review several representative examples of physiognomic ML. These case studies are intended to illustrate the kinds of reasoning, epistemic foundations, and logic behind research into, and applications of, automated inference from images portraying human likenesses. They are representative of the genre, not a comprehensive overview.

Inferring sexual orientation 

Utilizing DNNs, Wang and Kosinski extracted features from images of human faces, which they then regressed in a supervised learning task against self-reported sexual orientation labels. The classifier achieved 81% and 71% accuracy on sexual orientation for male and female subjects, respectively, a higher classification accuracy than experimentally determined human judgment. The researchers scraped their data from social media profiles, claiming that training their classifiers on “self-taken, easily accessible digital facial images increases the ecological validity of our results.” Wang and Kosinski report that the “findings advance our understanding of the origins of sexual orientation.” The authors explain the ability of their models to discriminate sexual orientation with the claim that “the faces of gay men and lesbians tend to be gender atypical.” The validation of this hypothesis depended on training an additional DNN for gender discrimination, which assigned each face image a likelihood of being female. The researchers then interpreted this likelihood as a measure of facial femininity, assessing the faces of homosexual-tagged individuals against an average femininity score for heterosexual individuals, and claimed that their results revealed that “the faces of gay men were more feminine and the faces of lesbians were more masculine than those of their respective heterosexual counterparts.” “The high accuracy of the classifier,” Wang and Kosinski report, “confirmed that much of the information about sexual orientation is retained in fixed facial features.” The contention of the researchers is that high classification accuracy of sexual orientation from facial features, alongside the evidence they supply for the gender atypicality of facial morphology, lends support to a particular theory of the genesis of same-sex attraction: the prenatal hormone theory (PHT) of homosexuality, which proposes that same-sex attraction is a developmental response to atypical testosterone exposure in fetal development. Wang and Kosinski’s results, they claim in their preprint, “provide strong support for the PHT, which argues that same-gender sexual orientation stems from the underexposure of male fetuses and overexposure of female fetuses to prenatal androgens responsible for the sexual differentiation of faces, preferences, and behavior.”
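To make the shape of this pipeline concrete, here is a minimal sketch of the two-stage approach described above, using synthetic stand-in data: fixed face embeddings, a supervised classifier fit against self-reported labels, and a second classifier whose predicted probability of a “female” label is then reread as a per-face “femininity” score. The array shapes and model choices are our assumptions for illustration, not the authors' code.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Stand-ins for DNN face embeddings (e.g., 512-dimensional vectors a
# pretrained network might emit) and self-reported binary labels.
X = rng.normal(size=(1000, 512))
y = rng.integers(0, 2, size=1000)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("held-out accuracy:", accuracy_score(y_te, clf.predict(X_te)))

# A separate gender classifier's predicted probability of "female" is
# reinterpreted as a "femininity" score per face -- the inferential leap
# at issue: a model output renamed as a latent trait.
gender_labels = rng.integers(0, 2, size=len(X_tr))
gender_clf = LogisticRegression(max_iter=1000).fit(X_tr, gender_labels)
femininity_scores = gender_clf.predict_proba(X_te)[:, 1]

Nothing in this pipeline licenses the causal story the authors tell; the sketch reproduces only the statistical scaffolding on which that story is hung.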

Personality psychology 

Kachur et al. write that “morphological and social cues in a human face provide signals of human personality and behaviour.” Their stated hypothesis is that a “photograph contains cues about personality that can be extracted using machine learning.” The authors further claim to have “circumvented the reliability limitations of human raters by developing a neural network and training it on a large dataset labelled with self-reported Big Five traits.” Here, deep learning is invoked as a means to obtain objectivity beyond human judgment, even though the training labels were themselves human self-reports. The predictive accuracy is interpreted as prima facie evidence for their hypothesis that structural features of human faces carry information about human personality and behavior, and the authors state that their “study presents new evidence confirming that human personality is related to individual facial appearance.”

In this study, participants self-reported personality characteristics by completing an online questionnaire and then uploaded several photographs, which the researchers used to construct their training and test datasets. As in Wang and Kosinski, the researchers treated the accuracy of their ML model as confirmatory evidence of a joint causal basis for both facial morphology and self-reported personality. Kachur et al. report “several theoretical reasons to expect associations between facial images and personality,” including that “genetic background contributes to both face and personality,” and describe their results as indicative of “a potential biological basis” for the discovered association between face images and self-reported personality characteristics.
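The underlying regression setup can be sketched in a few lines. The following is an illustrative reconstruction with synthetic data, not the study's code: face embeddings regressed against continuous self-reported Big Five scores, producing held-out correlations of the kind that get read back as evidence of a “biological basis.”

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)

# Synthetic stand-ins: face embeddings and self-reported Big Five scores.
X = rng.normal(size=(800, 256))
Y = rng.normal(size=(800, 5))

X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, test_size=0.25, random_state=1)
model = Ridge(alpha=1.0).fit(X_tr, Y_tr)   # one multi-output linear model
preds = model.predict(X_te)

traits = ["openness", "conscientiousness", "extraversion",
          "agreeableness", "neuroticism"]
for i, name in enumerate(traits):
    # Correlation between predicted and self-reported scores per trait.
    r = np.corrcoef(preds[:, i], Y_te[:, i])[0, 1]
    print(f"{name}: r = {r:.2f}")

Even where such correlations are real, they establish an association between pixels and questionnaire answers, not a causal path through biology.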

“Abnormality” classification 

A recent study constructed a “normal” and “abnormal” human facial expression dataset for the purpose of automatically detecting such “abnormal” traits as drug addiction, autism, and criminality from facial images. The authors argued that “facial expression reflects our mental activities and provides useful information on human behaviors.” Kabir et al. “developed a combined method of Convolutional Neural Network (CNN) and Recurrent Neural Network (RNN) to classify human abnormalities.” “This approach,” they contend, “analyzes the human face and finds the abnormalities, such as Drug addiction, Autism, Criminalism [sic].”
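The paper does not fully specify the combined architecture, so the following PyTorch sketch is our assumption about what a “CNN plus RNN” classifier over still face images might look like: convolutional features unrolled into a sequence, passed through an LSTM, then a four-way softmax over the paper's categories.

import torch
import torch.nn as nn

class CnnRnnClassifier(nn.Module):
    """Hypothetical combined CNN + RNN over a single face image."""
    def __init__(self, n_classes: int = 4):
        super().__init__()
        # Convolutional feature extractor: 64x64 RGB input -> 32 x 16 x 16.
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        # The 16x16 grid of 32-channel features is read as a 256-step sequence.
        self.rnn = nn.LSTM(input_size=32, hidden_size=64, batch_first=True)
        self.head = nn.Linear(64, n_classes)

    def forward(self, x):                    # x: (batch, 3, 64, 64)
        f = self.cnn(x)                      # (batch, 32, 16, 16)
        seq = f.flatten(2).transpose(1, 2)   # (batch, 256, 32)
        _, (h, _) = self.rnn(seq)
        return self.head(h[-1])              # logits over the four labels

logits = CnnRnnClassifier()(torch.randn(2, 3, 64, 64))
print(logits.shape)                          # torch.Size([2, 4])

Nothing about the architecture validates the labels: whatever the training data encode, including collection artifacts, is what the softmax learns to separate.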

The researchers utilized images “gathered from the web using the web gathering technique,” although the details of this technique were not further elucidated, and the manuscript does not make clear on what basis images were classified as “normal,” “drug addicted,” “autistic,” or “criminal.” The researchers reported a validation accuracy of 89.5% on the four categories; both the provenance of the labels and the validation criteria are left undisclosed.

In a similar vein, Vrskova et al. claim to be able to detect “abnormal” human activities such as “begging,” “drunkenness,” “robbery,” and “terrorism” from video footage.

Lie detection 

Automated deception detection has long been of interest to law enforcement, judicial systems, academic institutions, corporations, and governments. A recent study by Tsuchiya et al. applied facial analysis and ML toward the putative automatic detection of deception in remote job-interview scenarios. The stated purpose of this research was to create an ML-based tool to detect when someone on a video call might be lying. Participants were asked to knowingly generate false descriptions of images while being recorded via video and biometric sensors. The researchers then used these data to train an ML model to predict deception based on facial or head movements, pulse rate, and eye movements. The researchers obtained a high accuracy rate using their classifier on the study’s four participants. As in the other studies reviewed here, the predictive accuracy of the model was taken to substantiate the hypothesis that particular facial features or movements are evidence of unobservable character or behavioral traits; in this instance, deception.
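The pipeline can be sketched as follows; the feature choices and classifier here are illustrative assumptions, not the authors' design. Per-recording time series are collapsed into summary statistics and fed to a standard classifier, and the tiny sample size (four participants in the original) is exactly where such accuracy figures become fragile.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)

def summarize(ts: np.ndarray) -> np.ndarray:
    """Collapse a (frames, channels) recording into simple summary statistics."""
    return np.concatenate([ts.mean(axis=0), ts.std(axis=0), np.ptp(ts, axis=0)])

# 40 synthetic recordings, each 300 frames x 5 channels
# (head motion x/y, pulse, gaze x/y), labeled truthful (0) or deceptive (1).
X = np.stack([summarize(rng.normal(size=(300, 5))) for _ in range(40)])
y = rng.integers(0, 2, size=40)

scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5)
print("cross-validated accuracy:", scores.mean().round(2))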

Criminality detection 

A study by Wu and Zhang purported to “empirically establish the validity of automated face-induced inference on criminality.” The authors trained four canonical ML models on a dataset of ID photographs of Chinese citizens to predict the label of criminality. Wu and Zhang stated that their models detect “criminality based solely on still face images, which is free of any biases of subjective judgments of human observers.” The convolutional neural network achieved an accuracy rate of 89.51% at picking out subjects who had been arrested for a crime. Hashemi and Hall also claim to have developed a deep learning-based criminality detector.
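For the comparative setup the study describes (several canonical classifiers fit to the same image-derived features and ranked by a single accuracy figure), a minimal sketch might look like the following, with synthetic features and labels standing in for the ID-photograph data:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(3)

# Synthetic stand-ins for image-derived features and arrest-record labels.
X = rng.normal(size=(600, 128))
y = rng.integers(0, 2, size=600)

for model in (LogisticRegression(max_iter=1000), KNeighborsClassifier(), SVC()):
    acc = cross_val_score(model, X, y, cv=5).mean()
    print(type(model).__name__, round(acc, 2))

No choice among these models can repair the deeper problem: the label records who was arrested, not who is “criminal,” so the classifier can at best learn the correlates of arrest and of how the photographs were collected.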