The Impact of Biases in Facial Recognition Artificial Neural Networks
Ezra Wingard, SUNY Oswego
Abstract: This study probes how biases form within artificial neural networks for facial recognition, and how they might be mitigated. Current research on facial recognition neural networks shows that biases and prejudices can degrade a network's accuracy on characteristics such as gender status and gender identity. To test this, two pre-trained neural networks were each fed two novel datasets - one of cisgender faces and one of transgender faces. Accuracy rates, calculated from the models' direct prediction outputs, were then analyzed with respect to gender identity and gender status. Notable gender-related biases were found in both models across both datasets.
Introduction
Neural Networks
Artificial neural networks (ANNs) are a subset of artificial intelligence and machine learning that can provide us with predictions of important information. Simply put, they consist of inputs and outputs. Inputs can be many different types of information; they are fed into the neural network (NN) to make it “learn” and, after it has “learned” from sufficient information, to “test” how well the algorithm performs. This “learning” is achieved by shifting the weights of the artificial neurons within the network, which facilitates the “conversations” that happen between neurons (1). These conversations form connections between neurons, sometimes across more than one layer (called a hidden layer) (2), and those connections combine to form the predictions that serve as the output. For example, within the scope of this paper, the inputs are images of cisgender and transgender people of different races and ages, and the output is a prediction of their gender. The working definition of cisgender at the time of publication is someone who identifies with their assigned gender at birth; transgender refers to someone who does not identify with their assigned gender at birth. The neural network used in this study is a classification algorithm: it takes an input image of a face and classifies it into the demographic categories of interest.
The specific type of neural network used for this project is a Convolutional Neural Network (CNN). A CNN differs from a traditional fully connected ANN in that its early layers perform convolutions over the image: small learned filters slide across the input and produce feature maps that preserve spatial structure. In short, a CNN is a two-part network, a convolutional feature-extraction stage followed by a classification stage that maps those features to output predictions, whereas a traditional ANN feeds the raw input directly through fully connected layers of weighted nodes.
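To make the distinction concrete, below is a minimal, illustrative sketch in PyTorch of a convolutional classifier of the kind described; it is not the architecture of either pre-trained model used in this study.

import torch
import torch.nn as nn

class TinyGenderCNN(nn.Module):
    """Illustrative CNN: convolutional layers learn spatial feature maps,
    then a fully connected layer maps those features to class scores."""
    def __init__(self, num_classes: int = 2):
        super().__init__()
        self.features = nn.Sequential(                       # feature extraction
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                                  # 224 -> 112
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                                  # 112 -> 56
        )
        self.classifier = nn.Sequential(                      # prediction from features
            nn.Flatten(),
            nn.Linear(32 * 56 * 56, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))

# One 224x224 RGB face image produces two class scores (man/woman).
scores = TinyGenderCNN()(torch.randn(1, 3, 224, 224))
print(scores.shape)  # torch.Size([1, 2])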
Why Is This Important?
Historically, transgender and non-white people have been left out of important discussions, even when those discussions predominantly affect them. Research on artificial neural networks has progressed rapidly and driven substantial development of the technology. However, the ethics of using AI lags behind the pace of technological development.
There have been conversations involving research and development on AI for a long time; however, according to The AI Index 2022 Annual Report (Zhang et al., 2022), ethics within machine learning only began to be discussed widely circa 2014. Additionally, Zhang et al. (2022) reported that publications involving AI ethics have increased by 400% since 2014, with 71% of those publications produced by the industry sector, and the report expects the number of papers to keep growing each year. The literature discussed below shows how pertinent the issue of ethics is whenever new technology such as NNs is being developed. Scheuerman et al.’s (2019) work, How Computers See Gender, found that transgender people are often misgendered by computer vision algorithms, while Buolamwini and Gebru (2018) focused on race and gender, specifically bias against Black women. According to the United States Census Bureau, around 0.6% of the US population that responded to the Household Pulse Survey identified as transgender (2021). Although transgender individuals are not a majority according to the census, it is still extremely important to include diverse data in training sets to avoid potential bias and discrimination.
Others have noted that some government programs have used AI in ethically questionable ways, such as the police in Detroit, Michigan (Johnson, 2022), and in England and Wales (Radiya-Dixit, 2022). Several authors have documented the ethical ramifications of AI that misgenders transgender people and offers no way to correct an algorithmically assigned gender (Keyes, 2018; Scheuerman et al., 2019, 2020). When AI discriminates against transgender people, there may be adverse effects on mental health due to misgendering (McLemore, 2018). The consequence of not fixing such discriminatory AI may be further disproportionate targeting of minority groups.
In addition to the research findings mentioned above, it has been documented that these marginalized groups are typically underrepresented in the datasets used to train and test the algorithms (Wu et al., 2020; Kärkkäinen and Joo, 2021). Importantly, if a neural network is not exposed to diverse demographics during the training phase, it will not perform well even on basic recognition tasks involving such individuals. The training phase is critical for the outputs to be accurate and representative of the population the network is tested on. Despite the clear need for such diversity, many larger projects default to the datasets that have been cited most often in the AI literature. Although these pre-existing datasets may be convenient, they may be detrimental to the equity and accuracy experienced by the groups most affected by this software.
There have been other attempts in the past to create datasets specifically to help NNs train on transgender faces, such as the HRT Trans Database (AIAAIC, 2023). This dataset was criticized by several authors, including Keyes and Scheuerman, for questionable ethics involving the lack of information on consent from the subjects, the lack of publication of the dataset despite public funding, and disputes over the dataset having been created for “national security purposes” (AIAAIC, 2023). Other papers that included transgender people in their testing and training datasets have opted not to publish their datasets online because of other potential ethical concerns. Gay individuals have also been included in computer vision research, but such research has likewise been deemed problematic. In 2017, two authors sought to determine the accuracy of deep neural networks at predicting sexuality from facial image data (Wang and Kosinski, 2017). Many individuals and organizations were quick to point out potential problems with the study, including its reliance on stereotypes of the LGB community and the lack of inclusion of any non-white people (AIAAIC, 2017). Because of transphobic, homophobic, and racist targeting, there may be legitimate concerns about datasets that specifically catalog these populations.
It is important to note that not all of the biases in ML algorithms are formed during the training process. A considerable body of research examines how biases form and how we can mitigate them before they cause harm (Buolamwini and Gebru, 2018; Scheuerman et al., 2020; Wu et al., 2020; Google, 2022). Balancing datasets is one of many proposed ways to lessen the biases that can form within AI applications, and several researchers have framed it as a form of harm reduction (Wu et al., 2020; Kärkkäinen and Joo, 2021). Others have argued that balancing datasets alone will not be enough for bias mitigation within NN models (Zhang et al., 2018; Wang et al., 2019A; Wang et al., 2019B; Albiero et al., 2020; Gong et al., 2020).
This honors thesis set out to examine which methods can potentially reduce bias and prejudice against transgender people in facial recognition. By creating novel datasets of scraped images of transgender and cisgender people across two identity categories, potential biases could be tested for (and mitigated), even in a model that prides itself on being balanced across race, gender, and age. The questions were whether FairFace’s dataset balancing, performed on cisgender people, would yield different accuracy rates on non-cisgender populations, and whether using a different model trained on a non-balanced dataset would change the gender classification outputs.
Methods
Models
To find and mitigate biases, I used a pre-trained neural network model from the GitHub repository for FairFace (Kärkkäinen and Joo, 2021), which is described as having been trained on a balanced dataset (3). This is in direct contrast to more widely used datasets, which make no such claims of being “balanced”; these datasets typically contain a high proportion of cisgender white men and do not appear to account for diversity of race, age, and gender during the training phase. It was hypothesized that, because the FairFace model was trained on a more diverse dataset, it may be less prejudiced against transgender people, especially those of different age and race groups.
The race, gender, and age outputs used were those already specified by the pre-trained model: a 4-race grouping (White, Black, Asian, Indian), a 7-race grouping (White, Black, Indian, East Asian, Southeast Asian, Latino/Hispanic, and Middle Eastern), gender, and age. Some gender identity terminology was changed (“Male” to “Man”, etc.). The switch from sex-based to gender-based language was intended to include transgender people within this study, as transgender people may not identify with their assigned sex at birth. Several snippets of code from the original FairFace model were modified to fit the purposes of this project.
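As a small illustration of the relabeling step described above (with hypothetical file and column names; the exact layout of FairFace's prediction output may differ), the sex-based labels in a predictions CSV could be remapped as follows:

import pandas as pd

# Hypothetical file and column names: the prediction script is assumed to
# write a CSV of per-face predictions containing a "gender" column with
# "Male"/"Female" values.
preds = pd.read_csv("test_outputs.csv")
preds["gender"] = preds["gender"].map({"Male": "Man", "Female": "Woman"})
preds.to_csv("test_outputs_relabeled.csv", index=False)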
To empirically test whether differences stem from the fairness of the model’s gender categories, a second pre-trained model was used with the same novel test datasets. The model in question is InceptionResNet v1, which makes no claims of debiasing or balancing the dataset on which it was trained. Parameters similar to those of the FairFace model were used for the outputs (7-race, man and woman) for consistency. However, certain characteristics, such as the 4-race contrast and age, could not be parsed into outputs within this model. Only the gender outputs were measured in the data analysis across both models.
The code required to implement the InceptionResNet v1 model was taken from the author’s GitHub repository (Sandberg, 2023), and original code was written for this model’s classification outputs and predictions. The IRNv1 model was pre-trained on the VGGFace2 dataset, which consists of over 3 million images (Cao et al., 2017). The reported proportion of men within the dataset is approximately 59.7%, with no report of racial demographics in the original paper (Cao et al., 2017).
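For illustration, a pre-trained InceptionResNet v1 backbone can be loaded and paired with a gender classification head along the following lines. This is a sketch only: it uses the facenet-pytorch port of the model rather than the TensorFlow code from Sandberg's repository, and the classification head shown is hypothetical, not the original code written for this thesis.

import torch
import torch.nn as nn
from facenet_pytorch import InceptionResnetV1  # PyTorch port of the architecture

# Backbone pre-trained on VGGFace2; it maps a cropped face to a 512-d embedding.
backbone = InceptionResnetV1(pretrained="vggface2").eval()
gender_head = nn.Linear(512, 2)  # hypothetical man/woman classification head

with torch.no_grad():
    faces = torch.randn(4, 3, 160, 160)    # a batch of cropped 160x160 face images
    logits = gender_head(backbone(faces))  # shape (4, 2): one score per gender class
    predictions = logits.argmax(dim=1)     # index of the predicted gender class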
Datasets
Because there are no known ethically sourced datasets of transgender faces available on the internet for facial recognition purposes, I used the Instaloader Python API (Instaloader, 2023) to scrape images of transgender and cisgender people from Instagram. An API, or Application Programming Interface, can be used for a wide variety of tasks involving communication between applications and platforms. In this case, Instaloader’s API was used to access data from Instagram via an original Python script, which downloaded the images needed from the target demographics for dataset formation. The script took a hashtag as an input string, and the scraper downloaded all images posted under that hashtag. Examples of hashtags used for transgender populations include #GirlsLikeUs and #FTMtransgender. The scraped images were then cleaned accordingly (images whose subject matter was not a face were omitted), and the test datasets were thus created. The images in the datasets were found and used on the basis of public information available on Instagram, in accordance with Instagram’s and Instaloader’s privacy regulations and guidelines.
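A minimal sketch of how Instaloader can be driven from such a script is shown below (illustrative only, not the original scraper; the download limit and folder naming are assumptions):

import instaloader

# Download images posted under a given hashtag into a folder named "#<hashtag>".
L = instaloader.Instaloader(download_videos=False, download_comments=False,
                            save_metadata=False)

def download_hashtag(hashtag: str, limit: int = 200) -> None:
    for i, post in enumerate(L.get_hashtag_posts(hashtag)):
        if i >= limit:  # assumed cap on downloads per hashtag
            break
        L.download_post(post, target=f"#{hashtag}")

download_hashtag("GirlsLikeUs")     # example hashtag used for trans women
download_hashtag("FTMtransgender")  # example hashtag used for trans men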
The individuals within the cisgender dataset were gathered based on self-identification of race through hashtags. Categorization of images into folders was done by racial self-identification via hashtags (4) (7 race categories) rather than by gender (as was done for transgender men and women) to denote that they were in the cisgender dataset. At the time the images were scraped, all individuals in this dataset self-identified with their respective gender status.
While running the pre-trained models, images from the constructed datasets were cleaned, cropped around any faces found within the files, and automatically sorted into a folder of detected faces. This process happened automatically, without supervision or action required on the part of the author. Final dataset information is found in Table 1.
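One way such an automatic detect-and-crop pass can be implemented is sketched below with the MTCNN detector from facenet-pytorch; this is an assumption for illustration, since the study relied on the face detection bundled with each pre-trained model's preprocessing.

from pathlib import Path
from PIL import Image
from facenet_pytorch import MTCNN

detector = MTCNN(image_size=160, margin=20)  # detects a face and returns a crop

src, dst = Path("raw_images"), Path("detected_faces")
dst.mkdir(exist_ok=True)
for img_path in sorted(src.glob("*.jpg")):
    img = Image.open(img_path).convert("RGB")
    # Returns a cropped face tensor (or None if no face was found);
    # save_path also writes the cropped face image to the output folder.
    face = detector(img, save_path=str(dst / img_path.name))
    if face is None:
        print(f"No face detected in {img_path.name}; image skipped.")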
Analyses
For individuals whose images were included in the test set of this study, their self-identified gender identity was used to mark classifications as correct or incorrect. Because of the binary categories within the study, the accuracy formula adapted for this study is as follows:

Accuracy = True Positives / N, where N = True Positives + False Positives

In this adapted formula, the True Positives represent the correctly classified individuals, the False Positives represent the misclassified individuals, and N represents the total number of images within each dataset.
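A worked version of this calculation, with illustrative counts only (not the study's actual numbers):

def accuracy(true_positives: int, false_positives: int) -> float:
    # N is the total number of images: each image is either correctly
    # classified (a true positive) or misclassified (a false positive).
    n = true_positives + false_positives
    return true_positives / n

print(accuracy(true_positives=90, false_positives=10))  # 0.9, i.e., 90% accuracy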
Further analyses of the test datasets (both the transgender and cisgender test sets) on both models were conducted using RStudio and the R programming language. Logistic regressions were performed to assess the effect of each predictor variable on the outcome variable. The predictor variables were model (FairFace vs. InceptionResNetv1, or IRNv1), gender identity (man vs. woman), and gender status (transgender vs. cisgender self-identification). The sole outcome variable was the calculated accuracy (correct vs. incorrect classification for each image).
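The analyses themselves were run in R; for illustration, an equivalent logistic regression specification in Python's statsmodels (with hypothetical column names) might look like the following, where each row is one classified image and "correct" is 1 for a correct gender classification and 0 otherwise:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical input: one row per classified image, with categorical predictors
# for model (FairFace vs. IRNv1), gender identity (man vs. woman), and gender
# status (cisgender vs. transgender), plus a binary "correct" outcome.
df = pd.read_csv("classification_results.csv")

fit = smf.logit("correct ~ model * gender_identity * gender_status", data=df).fit()
print(fit.summary())
print(np.exp(fit.params))  # exponentiated coefficients are the odds ratios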
Results
Both Models
It is important to note that, unless otherwise specified, gender classification rates refer only to gender identity. Outside of specific instances, the analyses described below consider gender identity without respect to gender status. In Table 2, FF and IRNv1 on the vertical axis represent the models (FairFace and InceptionResNetv1) used within the study. The highest accuracy is for cisgender women within the FairFace model (97.5%), and the lowest is for transgender men within the IRNv1 model (47.4%).
A logistic regression was calculated to determine the odds ratios and effects of model, gender identity, and gender status on accuracy rates. There were significant main effects of the model employed, gender status, and gender identity. Regarding the models, FairFace was 6.27 times more likely than the IRNv1 model to be accurate on general gender classifications, p < 0.001, 95% CI [-0.30, -0.18]. Based on gender status, the odds of a cisgender person being gendered correctly were 12.34 times those of a transgender person, p < 0.001, 95% CI [-0.47, -0.33]. Lastly, with gender identity, the odds of a woman being gendered correctly were 2.73 times those of a man, p < 0.001, 95% CI [-0.02, 0.11].
In the same logistic regression, two of the four interactions were significant: the interaction between model and gender identity, and the interaction between model and gender status. FairFace had higher accuracy on men, with the odds of a man being gendered correctly 2.99 times higher than within the IRNv1 model, p = 0.05, 95% CI [-0.15, 0.03]. Additionally, FairFace was around 4.89 times more likely than the IRNv1 model to gender a cisgender person correctly, p < 0.001, 95% CI [0.07, 0.28]. The interaction between gender status and gender identity (1.12, p = 0.82, 95% CI [0.10, 0.30]) and the three-way interaction between model, gender identity, and gender status (2.83, p = 0.1, 95% CI [-0.08, 0.21]) were not significant.
Within a logistic regression, the odds ratios provided can be used as the effect size. The largest effect on the measured accuracy outputs came from gender status (an odds ratio of 12.34 for gender status alone). The smallest effect size came from the statistically insignificant interaction between gender status and gender identity (an odds ratio of 1.12). Figure 1 is separated by Model - FairFace (FF) and InceptionResNetv1 (IRNv1), Gender Status - cisgender (cis) and transgender (trans), and Gender Identity - man (blue / left bars) and woman (pink / right bars). The dotted line denotes chance-level classification accuracy (50%).
FairFace Model
Preliminary analyses were conducted on accuracy rates for the FairFace model. The binary transgender test dataset showed an overall accuracy rate of 66.7%. For trans women specifically, the accuracy rate was 77.93%, in contrast to 53.65% for trans men. This is in stark contrast to the binary cisgender test dataset, where cisgender women were classified with 97.5% accuracy and cisgender men lagged behind at 93.5%. The total accuracy rate on the cisgender dataset was 95.3%.
A logistic regression was conducted to determine the FairFace model’s rate of correct gender classification for women relative to men. Holding gender status constant, the odds of a woman being gendered correctly within the FairFace model were 2.73 times those of a man, p = 0.05, 95% CI [0.18, 19.95]. With respect to gender status, holding gender identity constant, the odds of a cisgender person being gendered correctly were 1.08 times those of a transgender person within this model, p < 0.001, 95% CI [-3.06, -2.01]. The interaction between gender status and gender identity within the FairFace model was not significant (1.12), p = 0.82, 95% CI [-0.92, 1.04].
InceptionResNetv1 Model
Preliminary calculations were done on the IRNv1 model’s accuracy rates. The overall accuracy rate on the binary transgender test dataset was 60.7%. For trans women specifically, accuracy within the IRNv1 model’s predictions was 72.2%, while for trans men it was 47.4% (5). The overall accuracy rate on the cisgender test dataset was 68.6%; cisgender men had an accuracy of 69.5%, while cisgender women had an accuracy of 67.5%.
An additional logistic regression was conducted to determine the IRNv1 model’s rate of correct gender classification for women relative to men. There was a significant main effect of gender status and a significant interaction between gender status and gender identity. Holding gender identity constant, the odds of a cisgender person being gendered correctly within the IRNv1 model were 1.47 times those of a transgender person, p < 0.001, 95% CI [-1.32, -0.58]. Additionally, the odds of a transgender woman being gendered accurately within this model were 3.22 times those of a cisgender man, p < 0.001, 95% CI [0.64, 1.71]. However, the main effect of gender identity on accuracy rates within the IRNv1 model was not significant (2.47), p = 0.56, 95% CI [-0.45, 0.24]. With respect to the interaction between gender identity and gender status, a transgender gender status produced an effect of gender identity that was not present on its own, given that the main effect of gender identity was statistically insignificant.
Discussion
NN Model Outcomes
The results of this study provide important insights into the accuracy of gender identity and status classification between two different Deep Learning models, FairFace and IRNv1. The two models within this study were chosen because of the datasets used within pre-training.
Overall, both models were typically better at gendering women correctly (regardless of gender status) (6); however, with regard to gender status alone, cisgender people were more likely to be gendered correctly than transgender people. The study found that FairFace was significantly more accurate in general gender classification (both gender status and gender identity) than the InceptionResNetv1 model.
Further analysis revealed significant interactions between the model used and each of the gender variables (gender identity and gender status). FairFace was more accurate than InceptionResNetv1 at gendering men and at gendering cisgender individuals. The study also found that the FairFace model was more accurate at correctly gendering women than men, whereas the InceptionResNetv1 model showed no significant difference in accuracy between gender identities. This suggests that the FairFace model may be biased towards gendering women correctly.
The findings of this study have important implications for the development and use of gender classification algorithms. It is crucial to consider the potential biases that may exist in these models and to ensure that they are trained on diverse and representative datasets. The study also highlights the importance of scrutinizing the datasets used to pre-train NN models. Although FairFace performed better than the InceptionResNetv1 model on all gender classification tasks, there were still discrepancies in gender status classification. Despite claims of FairFace being “fair” and “balanced” with respect to gender, that balancing was done solely on cisgender individuals (7), which may have contributed to a major lapse in accuracy on the transgender dataset. Within both models, the transgender accuracy rates were dismal in comparison to the cisgender accuracy rates, especially for transgender men (8), who were misgendered the most frequently across both models. It remains to be seen whether adding transgender individuals to datasets similar to those used to train FairFace and/or IRNv1 would improve accuracy rates for these populations.
Interestingly, the findings of this study echo previous research on facial recognition software classifying transgender individuals, in which transgender women were misgendered less often than transgender men (Scheuerman et al., 2019). Within and between the models, there were noticeable differences between transgender and cisgender accuracy rates. Specifically, significant main effects of gender status were found in the logistic regressions for both the IRNv1 and FairFace models, as well as in the between-model comparison. When both models were considered together, cisgender people were, in general, 12.34 times more likely to be gendered correctly than transgender people.
When interpreting the odds ratio results from the logistic regressions, we can also treat them as effect sizes. The “size” of an odds ratio was interpreted in a fashion similar to Cohen’s d, following the calculations of Chen et al. (2010). Regarding the effect of gender status on each model and across models, there was a small effect of gender status on accuracy rates within the IRNv1 and FairFace models individually. When both models were taken into account, there was an extremely large effect of gender status on accuracy rates, and the interaction between model and gender status also had a large effect on accuracy outcomes.
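A hedged sketch of the benchmark mapping this interpretation relies on, using the odds-ratio thresholds that Chen et al. (2010) equate with small, medium, and large effects under their assumptions:

# Chen, Cohen, and Chen (2010) equate odds ratios of roughly 1.68, 3.47, and
# 6.71 with Cohen's d values of 0.2, 0.5, and 0.8, respectively, under their
# stated assumptions. This is an illustrative helper, not the thesis's own code.
CHEN_BENCHMARKS = [(6.71, "large"), (3.47, "medium"), (1.68, "small")]

def odds_ratio_magnitude(odds_ratio: float) -> str:
    for threshold, label in CHEN_BENCHMARKS:
        if odds_ratio >= threshold:
            return label
    return "below the 'small' benchmark"

for or_value in (1.12, 2.73, 6.27, 12.34):  # a few of the odds ratios reported above
    print(f"OR = {or_value}: {odds_ratio_magnitude(or_value)}")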
Within the FairFace model alone, the findings suggest that the model may have some limitations in accurately classifying gender for transgender individuals, particularly for trans men. Generally, the FairFace model's accuracy rates for gender classification were influenced more by the gender status of the individual (cisgender vs. transgender) than by their gender identity (man vs. woman). Because the interaction between gender status and gender identity was not significant within this model, the model's accuracy in classifying gender status was not strongly affected by whether the person was a man or a woman.
Within the IRNv1 model alone, the results suggest a large discrepancy between gender identity and gender status accuracy rates. There was a significant interaction between gender identity and gender status, but no significant main effect of gender identity. This suggests that, within this model, gender identity matters for classification accuracy primarily among transgender individuals. Overall, however, the model performed significantly better on cisgender people than on transgender people, without respect to gender identity.
As stated above, across both models there were critical differences between the accuracy rates for transgender individuals and cisgender individuals. Depending on the within-model analyses and the between-model analysis, gender classification accuracy rates differed. The larger takeaway, however, is that there are tangible biases in the models used within this study, regardless of the debiasing precautions taken (i.e., balancing a dataset). This may be because of the lack of inclusion of transgender individuals, or because of other confounding variables such as race and age. Figure 1 alone shows extreme differences between and within models in transgender accuracy rates (9). Further research is needed to understand the reasons behind the differences discussed here.
Future Work
As suggested by Hamidi et al. (2018), incorporating gender identity and presentation as a spectrum would be beneficial for including gender-diverse people such as non-binary individuals. Moving away from a dichotomous view of gender would allow for the incorporation of non-binary individuals and create more ethical algorithms for cisgender and other individuals who do not abide by strict gender norms. Lastly, by classifying gender on a spectrum, we could better identify biases within humans as well as AI with respect to masculinity and femininity.
The inclusion of transgender individuals in the datasets used to train ML algorithms is a must. The misgendering of transgender people in real life is already an epidemic, and it would be irresponsible to allow NN algorithms to misgender transgender people as well. More studies should be done on the opinions of marginalized groups (such as transgender and gender-diverse individuals) about the ethics and potential uses of such technology, like the study conducted by Scheuerman et al. (2019).
Study Limitations
There were several limitations to this study, including but not limited to the test datasets, the exclusion of racial and age groups from analysis, and the lack of tangible bias mitigation measures. Specifically, the test datasets contained uneven numbers of images across gender identity and gender status. Although the cisgender dataset was more “balanced” with respect to age and gender categories than the transgender dataset (10), the number of images also differed between datasets (11). Race and age were not considered as factors in the accuracy and logistic regression analyses, and these could serve as confounds affecting the outcomes and thus the interpretation of the results. Within the IRNv1 model, age and race could not be programmed correctly at the time of the gender analyses and were therefore omitted from all calculations and predictions used in this study.
This study also did not include specific bias mitigation measures recommended by several researchers within the field of ML, such as treating gender variables as continuous rather than discrete and using bias detection programs such as InsideBias (Hamidi et al., 2018; Serna et al., 2020). Within the context of this study, it was unknown how to implement gender variables as a spectrum, so the scope of the study was narrowed to include only binary transgender individuals.
Conclusion
Although dataset balancing can provide some benefits for gender classification across gender identities, NN classification with respect to gender status still lags behind. The findings from this study highlight the urgent need for further research into AI models that are sensitive to the nuances of gender. Additionally, we must critically examine the underlying biases and prejudices that may be ingrained in these models and work to address and mitigate them. This research also underscores the importance of diverse and inclusive datasets for training AI models, as biased data can lead to biased outcomes.
As AI continues to play an increasingly prominent role in our lives, it is crucial that we strive to ensure that these systems are equitable. If biases within NN algorithms are left unchecked, marginalized groups may be severely affected within real-world applications. By recognizing and addressing biases in NN models, we can move towards a more inclusive future for all individuals, regardless of characteristics like gender.
Footnotes
(1) Conversations in the context of neurons within ANNs means that one neuron or node within the algorithm will pass along its information to similar nodes to form connections.
(2) A hidden layer in NN terminology denotes that there is one or more layers in between the input and output layers that can create more connections and provide more detailed and potentially accurate outputs.
(3) “Balanced,” in the context of NN datasets, means that the images used to train the neural network were sorted so that the numbers of photos across race, age, and gender categories were equal. This “balancing” is thought to reduce the risk of a NN model overfitting to one specific population, such as older cisgender white men.
(4) In the transgender dataset, a trans woman would be named something like TW1_1, denoting an individual face and the first image of that person. In the cisgender dataset, images were named by race and gender: for example, Latino men would be LM#_# and Black women BW#_#.
(5) This accuracy rate of 47.4% was the lowest of ALL average accuracy rates across the demographics measured in both models. For contrast, the highest average accuracy rate for a demographic was 98.2%, for cisgender women within the FairFace model.
(6) There is one exception to this generalization across models. Within the cisgender dataset on the IRNv1 model, cisgender men were 2% more likely to be gendered correctly than cisgender women.
(7) The FairFace paper makes no notable mention of including transgender people in its attempt to create a balanced dataset. Thus, because no transgender individuals were identified, it is assumed that the FairFace pre-training dataset contains only cisgender people.
(8) Although transgender men had the worst accuracy rates, both models generally performed worse on men within both gender status groups.
(9) Although the percentages displayed in that figure demonstrate major differences, it is important to note that the data points shown were averages between each demographic. Within the logistic regressions calculated, all data outputs from the model were included.
(10) The cisgender dataset had around a 1% difference between men and women, and its images were gathered based on racial groups. The transgender dataset had around a 7.24% difference between men and women, and its images were not balanced on race.
(11) The number of images within the cisgender dataset was 550, and the number of images within the transgender dataset was 414 - a difference of 136 images or 28.22%.
References
AIAAIC - HRT Transgender Dataset. (2023). https://www.aiaaic.org/aiaaic-repository/ai-and-algorithmic-incidents-and-controversies/hrt-transgender-dataset
AIAAIC - Gaydar AI sexual orientation predictions. (2017). https://www.aiaaic.org/aiaaic-repository/ai-and-algorithmic-incidents-and-controversies/gaydar-ai-sexual-orientation-predictions
Albiero, V., Krishnapriya, K. S., Vangara, K., Zhang, K., King, M. C., & Bowyer, K. W. (2020). Analysis of Gender Inequality In Face Recognition Accuracy (arXiv:2002.00065). arXiv. http://arxiv.org/abs/2002.00065
Bhardwaj, A. (2020, October 12). What is a Perceptron? – Basics of Neural Networks. Medium. https://towardsdatascience.com/what-is-a-perceptron-basics-of-neural-networks-c4cfea20c590
Cao, Q., Shen, L., Xie, W., Parkhi, O. M., & Zisserman, A. (2018). VGGFace2: A dataset for recognising faces across pose and age (arXiv:1710.08092). arXiv. http://arxiv.org/abs/1710.08092
Chen, H., Cohen, P., & Chen, S. (2010). How Big is a Big Odds Ratio? Interpreting the Magnitudes of Odds Ratios in Epidemiological Studies. Communications in Statistics - Simulation and Computation, 39(4), 860–864. https://doi.org/10.1080/03610911003650383
Classification: Accuracy | Machine Learning | Google for Developers. (n.d.). https://developers.google.com/machine-learning/crash-course/classification/accuracy
European Parliament. Directorate General for Parliamentary Research Services. (2020). The ethics of artificial intelligence: Issues and initiatives. Publications Office. https://data.europa.eu/doi/10.2861/6644
Sandberg, D. (2023). facenet/src/models/inception_resnet_v1.py [Source code]. GitHub. https://github.com/davidsandberg/facenet/blob/master/src/models/inception_resnet_v1.py
Gong, S., Liu, X., & Jain, A. K. (2020). Jointly De-biasing Face Recognition and Demographic Attribute Estimation (arXiv:1911.08080). arXiv. http://arxiv.org/abs/1911.08080
Guo, Y., Liu, Y., Oerlemans, A., Lao, S., Wu, S., & Lew, M. S. (2016). Deep learning for visual understanding: A review. Neurocomputing, 187, 27-48.
Group Overview ‹ Ethics and Governance of Artificial Intelligence. (n.d.). MIT Media Lab. https://www.media.mit.edu/groups/ethics-and-governance/overview/
Hamidi, F., Scheuerman, M. K., & Branham, S. M. (2018). Gender Recognition or Gender Reductionism?: The Social Implications of Embedded Gender Recognition Systems. Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, 1–13. https://doi.org/10.1145/3173574.3173582
Instaloader—Download Instagram Photos and Metadata. (n.d.). https://instaloader.github.io/
Kärkkäinen, K., & Joo, J. (2019). FairFace: Face Attribute Dataset for Balanced Race, Gender, and Age (arXiv:1908.04913). arXiv. http://arxiv.org/abs/1908.04913
Keyes, O. (2018). The Misgendering Machines: Trans/HCI Implications of Automatic Gender Recognition. Proceedings of the ACM on Human-Computer Interaction, 2(CSCW), 1–22. https://doi.org/10.1145/3274357
McLemore, K. A. (2018). A minority stress perspective on transgender individuals’ experiences with misgendering. Stigma and Health, 3(1), 53–64. https://doi.org/10.1037/sah0000070
Minsky, M., & Papert, S. (1969). Perceptrons. M.I.T. Press.
Papers with Code—Face Recognition. (n.d.). https://paperswithcode.com/task/face-recognition
Radiya-Dixit, E., & Neff, G. (2023). A Sociotechnical Audit: Assessing Police Use of Facial Recognition. 2023 ACM Conference on Fairness, Accountability, and Transparency, 1334–1346. https://doi.org/10.1145/3593013.3594084
Rosenblatt, F. (1958). The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65(6), 386–408. https://doi.org/10.1037/h0042519
Scheuerman, M. K., Paul, J. M., & Brubaker, J. R. (2019). How Computers See Gender: An Evaluation of Gender Classification in Commercial Facial Analysis Services. Proceedings of the ACM on Human-Computer Interaction, 3(CSCW), 1–33. https://doi.org/10.1145/3359246
Scheuerman, M. K., Wade, K., Lustig, C., & Brubaker, J. R. (2020). How We’ve Taught Algorithms to See Identity: Constructing Race and Gender in Image Databases for Facial Analysis. Proceedings of the ACM on Human-Computer Interaction, 4(CSCW1), 1–35. https://doi.org/10.1145/3392866
Serna, I., Peña, A., Morales, A., & Fierrez, J. (2020). InsideBias: Measuring Bias in Deep Networks and Application to Face Gender Biometrics (arXiv:2004.06592). arXiv. http://arxiv.org/abs/2004.06592
Wang, T., Zhao, J., Yatskar, M., Chang, K.-W., & Ordonez, V. (2019). Balanced Datasets Are Not Enough: Estimating and Mitigating Gender Bias in Deep Image Representations (arXiv:1811.08489). arXiv. http://arxiv.org/abs/1811.08489
Wang, Y., & Kosinski, M. (2017). Deep neural networks are more accurate than humans at detecting sexual orientation from facial images. https://doi.org/10.17605/OSF.IO/ZN79K
Wang, Z., Qinami, K., Karakozis, I. C., Genova, K., Nair, P., Hata, K., & Russakovsky, O. (2020). Towards Fairness in Visual Recognition: Effective Strategies for Bias Mitigation (arXiv:1911.11834). arXiv. http://arxiv.org/abs/1911.11834
Wu, W., Protopapas, P., Yang, Z., & Michalatos, P. (2020, July 6). Gender Classification and Bias Mitigation in Facial Images. 12th ACM Conference on Web Science. https://doi.org/10.1145/3394231
Zhang, B. H., Lemoine, B., & Mitchell, M. (2018). Mitigating Unwanted Biases with Adversarial Learning. Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society, 335–340. https://doi.org/10.1145/3278721.3278779
Zhang, D., Maslej, N., Brynjolfsson, E., Etchemendy, J., Lyons, T., Manyika, J., Ngo, H., Niebles, J. C., Sellitto, M., Sakhaee, E., Shoham, Y., Clark, J., & Perrault, R. (2022). The AI Index 2022 Annual Report (arXiv:2205.03468). arXiv. http://arxiv.org/abs/2205.03468