research | Jennifer Mickel

My research interests lie in AI and algorithmic fairness and in natural language processing (NLP). My work, anchored in intersectionality and a multicultural, interdisciplinary perspective, aims to

understand the social impact of generative AI on users and society (such as understanding social biases and representation)
develop robust algorithms, frameworks, and evaluations for addressing and understanding the social impact of generative AI in varying contexts

2026

Who Evaluates AI’s Social Impacts? Mapping Coverage and Gaps in First and Third Party Evaluations

Anka Reuel, Avijit Ghosh, Jenny Chim, and 32 more authors

In Proceedings of the 43rd International Conference on Machine Learning (ICML), 2026

Foundation models are increasingly central to high-stakes AI systems, and governance frameworks now depend on evaluations to assess their risks and capabilities. Although general capability evaluations are widespread, social impact assessments covering bias, fairness, privacy, environmental costs, and labor practices remain uneven across the AI ecosystem. To characterize this landscape, we conduct the first comprehensive analysis of both first-party and third-party social impact evaluation reporting across a wide range of model developers. Our study examines 186 first-party release reports and 183 post-release evaluation sources, and complements this quantitative analysis with interviews of model developers. We find a clear division of evaluation labor: first-party reporting is sparse, often superficial, and has declined over time in key areas such as environmental impact and bias, while third-party evaluators including academic researchers, nonprofits, and independent organizations provide broader and more rigorous coverage of bias, harmful content, and performance disparities. However, this complementarity has limits. Only model developers can authoritatively report on data provenance, content moderation labor, financial costs, and training infrastructure, yet interviews reveal that these disclosures are often deprioritized unless tied to product adoption or regulatory compliance. Our findings indicate that current evaluation practices leave major gaps in assessing AI’s societal impacts, highlighting the urgent need for policies that promote developer transparency, strengthen independent evaluation ecosystems, and create shared infrastructure to aggregate and compare third-party evaluations in a consistent and accessible way.
@inproceedings{reuel2026who, title = {Who Evaluates AI's Social Impacts? Mapping Coverage and Gaps in First and Third Party Evaluations}, author = {Reuel, Anka and Ghosh, Avijit and Chim, Jenny and Tran, Andrew and Long, Yanan and Mickel, Jennifer and Gohar, Usman and Yadav, Srishti and Ammanamanchi, Pawan Sasanka and Allaham, Mowafak and Rahmani, Hossein A. and Akhtar, Mubashara and Friedrich, Felix and Scholz, Robert and Riegler, M. A. and Batzner, Jan and Habba, Eliya and Saxena, Arushi and Kornilova, Anastassia and Wei, Kevin L. and Soni, Prajna and Mathew, Yohan and Klyman, Kevin and Sania, Jeba and Sahoo, Subramanyam and Bruvik, O. and Sadeghi, Pouya and Goswami, Sujata and Wang, Angelina and Jernite, Yacine and Talat, Zeerak and Biderman, Stella and Kochenderfer, Mykel J. and Koyejo, Sanmi and Solaiman, Irene}, year = {2026}, booktitle = {Proceedings of the 43rd International Conference on Machine Learning (ICML)}, }
When AI Benchmarks Plateau: A Systematic Study of Benchmark Saturation

Mubashara Akhtar, Anka Reuel, Prajna Soni, and 34 more authors

In Proceedings of the 43rd International Conference on Machine Learning (ICML), 2026

Artificial Intelligence (AI) benchmarks play a central role in measuring progress in model development and guiding deployment decisions. However, many benchmarks quickly become saturated, meaning that they can no longer differentiate between the best-performing models, diminishing their long-term value. In this study, we analyze benchmark saturation across 60 Large Language Model (LLM) benchmarks selected from technical reports by major model developers. To identify factors driving saturation, we characterize benchmarks along 14 properties spanning task design, data construction, and evaluation format. We test five hypotheses examining how each property contributes to saturation rates. Our analysis reveals that nearly half of the benchmarks exhibit saturation, with rates increasing as benchmarks age. Notably, hiding test data (i.e., public vs. private) shows no protective effect, while expert-curated benchmarks resist saturation better than crowdsourced ones. Our findings highlight which design choices extend benchmark longevity and inform strategies for more durable evaluation.
@inproceedings{akhtar2026when, title = {When AI Benchmarks Plateau: A Systematic Study of Benchmark Saturation}, author = {Akhtar, Mubashara and Reuel, Anka and Soni, Prajna and Ahuja, Sanchit and Ammanamanchi, Pawan Sasanka and Rawal, Ruchit and Zouhar, Vilém and Yadav, Srishti and Whitehouse, Chenxi and Ki, Dayeon and Mickel, Jennifer and Choshen, Leshem and Šuppa, Marek and Batzner, Jan and Chim, Jenny and Sania, Jeba and Long, Yanan and Rahmani, Hossein A. and Knight, Christina and Nan, Yiyang and Raj, Jyoutir and Fan, Yu and Singh, Shubham and Sahoo, Subramanyam and Habba, Eliya and Gohar, Usman and Pawar, Siddhesh and Scholz, Robert and Subramonian, Arjun and Ni, Jingwei and Kochenderfer, Mykel and Koyejo, Sanmi and Sachan, Mrinmaya and Biderman, Stella and Talat, Zeerak and Ghosh, Avijit and Solaiman, Irene}, year = {2026}, booktitle = {Proceedings of the 43rd International Conference on Machine Learning (ICML)}, }
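The abstract above describes a saturated benchmark as one that can no longer differentiate between the best-performing models. As a rough illustration only, that idea can be sketched as a check on how far apart the top scores are. This is not the paper's measurement procedure: the function names, the choice of `top_k`, and the `gap_threshold` value are all hypothetical assumptions made for this sketch.

```python
def top_spread(scores, top_k=3):
    """Return the gap between the best and the k-th best score.

    `scores` is a list of benchmark scores, one per model.
    `top_k` is a hypothetical choice of how many leading models to compare.
    """
    if top_k < 2:
        raise ValueError("top_k must be at least 2 to measure a spread")
    if len(scores) < top_k:
        raise ValueError("need at least top_k scores")
    top = sorted(scores, reverse=True)[:top_k]
    return top[0] - top[-1]


def is_saturated(scores, top_k=3, gap_threshold=1.0):
    """Illustrative heuristic: flag a benchmark as saturated when the
    leading models are within `gap_threshold` points of each other.

    The threshold is an assumption for this sketch, not a value taken
    from the paper.
    """
    return top_spread(scores, top_k) <= gap_threshold


if __name__ == "__main__":
    # Made-up scores for two imaginary benchmarks.
    print(is_saturated([98.1, 97.9, 97.6, 80.0]))
    print(is_saturated([90.0, 80.0, 70.0]))
```

A single spread threshold ignores measurement noise and benchmark age, both of which the abstract suggests matter, so this is best read as a toy definition rather than a usable test.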
Challenges to Grassroots Organization Engagement with AI Policy

Carter Buckner, Jennifer Mickel, Nandini Swaminathan, and 5 more authors

In Proceedings of the 2026 ACM Conference on Fairness, Accountability, and Transparency, 2026

Public policies are being developed around the world to address privacy, economic, intellectual property, energy, and other risks that AI technologies pose. Involvement from the general public is essential to governance as an accountability and alignment mechanism. However, participating in and impacting policymaking can be challenging for sections of the public that lack extensive networks, lobbying capabilities, and other forms of power. This challenge is especially acute for marginalized communities. In this paper, we present a case study of our organization’s efforts to bring participatory design (PD) principles to AI policymaking in the US. We describe our engagements with several US policy bodies, and our participatory development of AI policy for queer people. We highlight challenges with PD practice with marginalized communities, and offer suggestions to alleviate them. We conclude with actionable recommendations for policy makers and other organizers working in marginalized communities.
@inproceedings{buckner2026challenges, title = {Challenges to Grassroots Organization Engagement with AI Policy}, author = {Buckner, Carter and Mickel, Jennifer and Swaminathan, Nandini and Agnew, William and Arora, Sarthak and Lin, Michelle and Long, Yanan and {Organizers of Queer in AI}}, year = {2026}, publisher = {Association for Computing Machinery}, address = {New York, NY, USA}, booktitle = {Proceedings of the 2026 ACM Conference on Fairness, Accountability, and Transparency}, location = {Montreal, Canada}, series = {FAccT '26} }
Queer NLP: A Critical Survey on Literature Gaps, Biases and Trends

Sabine Weber, Angelina Wang, Ankush Gupta, and 16 more authors

2026

Natural language processing (NLP) technologies are rapidly reshaping how language is created, processed, and analyzed by humans. With current and potential applications in hiring, law, healthcare, and other areas that impact people’s lives, understanding and mitigating harms towards marginalized groups is critical. In this survey, we examine NLP research papers that explicitly address the relationship between LGBTQIA+ communities and NLP technologies. We systematically review all such papers published in the ACL Anthology, to answer the following research questions: (1) What are current research trends? (2) What gaps exist in terms of topics and methods? (3) What areas are open for future work? We find that while the number of papers on queer NLP has grown within the last few years, most papers take a reactive rather than a proactive approach, pointing out bias more often than mitigating it, and focusing on shortcomings of existing systems rather than creating new solutions. Our survey uncovers many opportunities for future work, especially regarding stakeholder involvement, intersectionality, interdisciplinarity, and languages other than English. We also offer an outlook from a queer studies perspective, highlighting understudied topics and gaps in the harms addressed in NLP papers. Beyond being a roadmap of what has been done, this survey is a call to action for work towards more just and inclusive NLP technologies.
@article{weber2026queer, title = {Queer NLP: A Critical Survey on Literature Gaps, Biases and Trends}, author = {Weber, Sabine and Wang, Angelina and Gupta, Ankush and Subramonian, Arjun and Ulmer, Dennis and Tanwar, Eshaan and Aich, Geetanjali and Devinney, Hannah and Hobbs, Jacob and Mickel, Jennifer and Tint, Joshua and Sosto, Mae and Groshan, Ray and Astarita, Simone and Gautam, Vagrant and Blaschke, Verena and Agnew, William and Lee, Wilson Y and Long, Yanan}, year = {2026}, }

2025

More of the Same: Persistent Representational Harms Under Increased Representation

Jennifer Mickel, Maria De-Arteaga, Leqi Liu, and 1 more author

In Proceedings of the 39th Conference on Neural Information Processing Systems (NeurIPS), 2025

To recognize and mitigate the harms of generative AI systems, it is crucial to consider whether and how different societal groups are represented by these systems. A critical gap emerges when naively measuring or improving who is represented, as this does not consider how people are represented. In this work, we develop GAS(P), an evaluation methodology for surfacing distribution-level group representational biases in generated text, tackling the setting where groups are unprompted (i.e., groups are not specified in the input to generative systems). We apply this novel methodology to investigate gendered representations in occupations across state-of-the-art large language models. We show that, even though the gender distribution when models are prompted to generate biographies leads to a large representation of women, representational biases persist in how different genders are represented. Our evaluation methodology reveals that there are statistically significant distribution-level differences in the word choice used to describe biographies and personas of different genders across occupations, and we show that many of these differences are associated with representational harms and stereotypes. Our empirical findings caution that naively increasing (unprompted) representation may inadvertently proliferate representational biases, and our proposed evaluation methodology enables systematic and rigorous measurement of the problem.
@inproceedings{mickel2025more, title = {More of the Same: Persistent Representational Harms Under Increased Representation}, author = {Mickel, Jennifer and De-Arteaga, Maria and Liu, Leqi and Tian, Kevin}, booktitle = {Proceedings of the 39th Conference on Neural Information Processing Systems (NeurIPS)}, year = {2025}, eprint = {2503.00333}, archiveprefix = {arXiv}, url = {https://neurips.cc/virtual/2025/loc/san-diego/poster/118054}, }
Challenges to Grassroots Organization Engagement with AI Policy

Jennifer Mickel, Carter Buckner, William Agnew, and 8 more authors

In ACA Workshop (oral presentation) @ NeurIPS 2025, 2025

Around the world, policies are being developed to address privacy, economic, intellectual property, energy, and other risks that AI technologies pose. Simultaneously, institutions are creating standards and best practices to further the use of AI. The development of standards and policies involves many well-resourced actors and opaque development processes, often sidelining the needs of marginalized populations, who often lack extensive networks, lobbying capabilities, and other forms of power. In this paper, we present the participatory development of AI policies that meet the needs of queer people through grassroots advocacy. We use collaborative autoethnography to surface granular challenges our organization has faced in doing so, along with factors that assisted us. We conclude with actionable recommendations for empowering marginalized communities to participate in policy development and insights for other marginalized communities working to develop policy to mitigate harms.
@inproceedings{mickel2025challenges, title = {Challenges to Grassroots Organization Engagement with AI Policy}, author = {Mickel, Jennifer and Buckner, Carter and Agnew, William and Long, Yanan and Lin, Michelle and Alaka, B and Wang, Angelina and Arora, Sarthak and Swaminathan, Nandini and Subramonian, Arjun and {Organizers of Queer in AI}}, booktitle = {ACA Workshop (oral presentation) @ NeurIPS 2025}, year = {2025}, }
Write Code that People Want to Use

Stella Biderman, Jennifer Mickel, and Baber Abbasi

In Championing Open-source DEvelopment in ML Workshop @ ICML 2025, 2025

“Research code” is a common, often self-effacing, term used to refer to the type of code that is commonly released alongside research papers. Research code is notorious for being fragile, poorly documented, and difficult for others to run or extend. In this position paper, we argue that, while research code seems to meet the short-term needs of research projects, in fact the practice hurts researchers by limiting the impact of their work and causing fewer people to build on their research. We explore the structural incentives and dynamics of the field that drive these behaviors. We argue that extensibility matters far more than strict reproducibility for research impact, and propose both pragmatic approaches for individual researchers and institutional reforms to encourage the development of more usable and maintainable research software.
@inproceedings{bidermanwrite, title = {Write Code that People Want to Use}, author = {Biderman, Stella and Mickel, Jennifer and Abbasi, Baber}, booktitle = {Championing Open-source DEvelopment in ML Workshop @ ICML 2025}, year = {2025}, url = {https://openreview.net/forum?id=oH0XhgzJt0}, }

Global PIQA: Evaluating Physical Commonsense Reasoning Across 100+ Languages and Cultures

Tyler A. Chang, Catherine Arnett, Abdelrahman Eldesokey, and 333 more authors

arXiv preprint arXiv:2510.24081, 2025

To date, there exist almost no culturally-specific evaluation benchmarks for large language models (LLMs) that cover a large number of languages and cultures. In this paper, we present Global PIQA, a participatory commonsense reasoning benchmark for over 100 languages, constructed by hand by 335 researchers from 65 countries around the world. The 116 language varieties in Global PIQA cover five continents, 14 language families, and 23 writing systems. In the non-parallel split of Global PIQA, over 50% of examples reference local foods, customs, traditions, or other culturally-specific elements. We find that state-of-the-art LLMs perform well on Global PIQA in aggregate, but they exhibit weaker performance in lower-resource languages (up to a 37% accuracy gap, despite random chance at 50%). Open models generally perform worse than proprietary models. Global PIQA highlights that in many languages and cultures, everyday knowledge remains an area for improvement, alongside more widely-discussed capabilities such as complex reasoning and expert knowledge. Beyond its uses for LLM evaluation, we hope that Global PIQA provides a glimpse into the wide diversity of cultures in which human language is embedded.

@article{chang2025global,
  title = {Global PIQA: Evaluating Physical Commonsense Reasoning Across 100+ Languages and Cultures},
  author = {Chang, Tyler A. and Arnett, Catherine and Eldesokey, Abdelrahman and Sadallah, Abdelrahman Boda and Kashar, Abeer and Daud, Abolade and Olanihun, Abosede Grace and Mohammed, Adamu Labaran and Praise, Adeyemi and Sharma, Adhikarinayum Meerajita and Gupta, Aditi and Iyigun, Afitab and Simpl{\'i}cio, Afonso and Essouaied, Ahmed and Chorana, Aicha and Eppa, Akhil and Oladipo, Akintunde and Ramesh, Akshay and Dorkin, Aleksei and Kondoro, Alfred Malengo and Aji, Alham Fikri and {\c{C}}etinta{\c{s}}, Ali Eren and Hanbury, Allan and Demb{\'e}l{\'e}, Alou and Niksarli, Alp and Arroyo, {\'A}lvaro and Bajand, Amin and Khanna, Amol and Chkhaidze, Ana and Condez, Ana Carolina and Mkhonto, Andiswa and Hoblitzell, Andrew and Tran, Andrew and Poulis, Angelos and Majumder, Anirban and Vacalopoulou, Anna and Wong, Annette Kuuipolani Kanahele and Simonsen, Annika and Kovalev, Anton and Ashvanth.S and Lana, Ayodeji Joseph and Kinay, Barkin and Alhafni, Bashar and Busole, Benedict Cibalinda and Ghanem, Bernard and Nathani, Bharti and {\DJ}uri{\'c}, Biljana Stojanovska and Agbonile, Bola and Bergsson, Bragi and Fischer, Bruce Torres and Tutar, Burak and {\c{C}}inar, Burcu Alaku{\c{s}} and Kane, Cade J. Kanoniakapueo and Udomcharoenchaikit, Can and Helwe, Chadi and Nerella, Chaithra Reddy and Liu, Chen Cecilia and Nwokolo, Chiamaka Glory and Espa{\~n}a-Bonet, Cristina and Amol, Cynthia and Lee, DaeYeop and Arad, Dana and Dzenhaliou, Daniil and Pugacheva, Daria and Choi, Dasol and Abolade, Daud Olamide and Liu, David and Semedo, David and Popoola, Deborah and Mataciunas, Deividas and Nyaboke, Delphine and Kumar, Dhyuthy Krishna and Gl{\'o}ria-Silva, Diogo and Tavares, Diogo and Goyal, Divyanshu and Lee, DongGeon and Anajemba, Ebele Nwamaka and Grace, Egonu Ngozi and Mickel, Elena and Tutubalina, Elena and Herranen, Elias and Anand, Emile and Habumuremyi, Emmanuel and Ajiboye, Emuobonuvie Maria and Yulianrifat, Eryawan Presma and Adenuga, Esther and Rudnicka, Ewa and Itiola, Faith Olabisi and Butt, Faran Taimoor and Thekkekara, Fathima and Haouari, Fatima and Tjiaranata, Filbert Aurelian and Laakom, Firas and Grasso, Francesca and Orabona, Francesco and Periti, Francesco and Solomon, Gbenga Kayode and Ngo, Gia Nghia and Udhehdhe-oze, Gloria and Martins, Gon{\c{c}}alo Vinagre and Challagolla, Gopi Naga Sai Ram and Son, Guijin and Abdykadyrova, Gulnaz and Einarsson, Hafsteinn and Hu, Hai and Saffari, Hamidreza and Zaidi, Hamza and Zhang, Haopeng and Shairah, Harethah Abu and Vuong, Harry and Kuulmets, Hele-Andra and Bouamor, Houda and Yu, Hwanjo and Debess, Iben Nyholm and Deveci, {\.I}brahim Ethem and Hanif, Ikhlasul Akmal and Cho, Ikhyun and Calvo, Ines and Vieira, Ines and Manzi, Isaac and Daud, Ismail and Itzhak, Itay and Alekseenko, Iuliia and Belashkin, Ivan and Spada, Ivan and Zhelyazkov, Ivan and Brinton, Jacob and Isbarov, Jafar and {\v{C}}ibej, Jaka and {\v{C}}uhel, Jan and Koco{\'n}, Jan and Krito, Jauza Akbar and Purbey, Jebish and Mickel, Jennifer and Za, Jennifer and Kunz, Jenny and Jeong, Jihae and D{\'a}valos, Jimena Tena and Lee, Jinu and Magalhaes, Joao and Yi, John and Kim, Jongin and Chataignon, Joseph and Imperial, Joseph Marvin and Thevakumar, Jubeerathan and Land, Judith and Jiang, Junchen and Kim, Jungwhan and Sirts, Kairit and Kamesh, R and Kamesh, V and Tshinu, Kanda Patrick and Kukk, K{\"a}triin and Ponkshe, Kaustubh and Huseynova, Kavsar and He, Ke and Buchanan, Kelly and Sarveswaran, Kengatharaiyer and Zaman, Kerem and Mrini, Khalil and Kyars, Kian and Kruusmaa, Krister and Chouhan, Kusum and Krishnakumar, Lainitha and S{\'a}nchez, Laura Castro and Moscoso, Laura Porrino and Choshen, Leshem and Sencan, Levent and Ovrelid, Lilja and Alazraki, Lisa and Ehimen-Ugbede, Lovina and Thevakumar, Luheerathan and Thavarasa, Luxshan and Malik, Mahnoor and Keita, Mamadou K. and Jangid, Mansi and Santis, Marco De and Garc{\'i}a, Marcos and Suppa, Marek and D'Ciofalo, Mariam and Ojastu, Marii and Sikander, Maryam and Narayan, Mausami and Skandalis, Maximos and Mehak, Mehak and Bozkurt, Mehmet {\.I}lteri{\c{s}} and Workie, Melaku Bayu and Velayuthan, Menan and Leventhal, Michael and Marci{\'n}czuk, Michal and Poto{\v{c}}njak, Mirna and Shafiei, Mohammadamin and Sharma, Mridul and Indoria, Mrityunjaya and Habibi, Muhammad Ravi Shulthan and Koli{\'c}, Murat and Galant, Nada and Permpredanun, Naphat and Maugin, Narada and Correa, Nicholas Kluge and Ljube{\v{s}}i{\'c}, Nikola and Thomas, Nirmal and de Silva, Nisansa and Joshi, Nisheeth and Ponkshe, Nitish and Habash, Nizar and Udeze, Nneoma Chinemerem and Thomas, Noel and Ligeti-Nagy, No{\'e}mi and Coulibaly, Nouhoum Souleymane and Faustin, Nsengiyumva and Buliaminu, Odunayo Kareemat and Ogundepo, Odunayo and Fejiro, Oghojafor Godswill and Funmilola, Ogundipe Blessing and God'spraise, Okechukwu and Samuel, Olanrewaju and Oluwaseun, Olaoye Deborah and Akindejoye, Olasoji and Popova, Olga and Snissarenko, Olga and Chiemezie, Onyinye Anulika and Kınay, Orkun and Tursun, Osman and Moses, Owoeye Tobiloba and Joshua, Oyelade Oluwafemi and Fiyinfoluwa, Oyesanmi and Gamallo, Pablo and Fern{\'a}ndez, Pablo Rodr{\'i}guez and Arora, Palak and Valente, Pedro and Rupnik, Peter and Ekiugbo, Philip Oghenesuowho and Sahoo, Pramit and Prokopidis, Prokopis and Niau-Puhipau, Pua and Yahya, Quadri and Mignone, Rachele and Singhal, Raghav and Kadiyala, Ramyakrishna and Merx, Raphael and Afolayan, Rapheal and Rajalakshmi, Ratnavel and Ghosh, Rishav and Oji, Romina and Solis, Ron Kekeha and Guerra, Rui and Zawar, Rushikesh and Bashir, Sa'ad Nasir and Alzaabi, Saeed and Sandeep, Sahil and Batchu, Sailaja and Kantareddy, Sai Nithin Reddy and Pranida, Salsabila Zahirah and Buchanan, Sam and Rutunda, Samuel and Land, Sander and Sulollari, Sarah and Ali, Sardar and Sapkota, Saroj and Tautvai{\v{s}}as, Saulius and Sen, Sayambhu and Banerjee, Sayantani and Diarra, S{\'e}bastien and SenthilNathan.M and Lee, Sewoong and Shah, Shaan and Venkitachalam, Shankar and Djurabaeva, Sharifa and Ibejih, Sharon and Dutta, Shivanya Shomir and Gupta, Siddhant and Su{\'a}rez, Silvia Paniagua and Ahmadi, Sina and Sukumar, Sivasuthan and Song, Siyuan and Snegha, A and Sofianopoulos, Sokratis and Simon, Sona Elza and Ben{\v{c}}ina, Sonja and Gvasalia, Sophie and More, Sphurti Kirit and Dragazis, Spyros and Kaufhold, Stephan P. and Suba.S and Alrashed, Sultan and Ranathunga, Surangika and Someya, Taiga and Punger{\v{s}}ek, Taja Kuzman and Haklay, Tal and Jibril, Tasi'u and Aoyama, Tatsuya and Abashidze, T B and Cruz, Terenz Jomar Dela and Blevins, Terra and Nikas, Themistoklis and Idoko, Theresa Dora and Do, Thu Mai and Chubakov, Tilek and Gargiani, Tommaso and Rathore, Uma and Johannesen, Uni and Ugwu, Uwuma Doris and Putra, Vallerie Alexandra and Kumar, Vanya Bannihatti and Jeyarajalingam, Varsha and Arzt, Varvara and Nedumpozhimana, Vasudevan and Ondrejova, Viktoria and Horbik, Viktoryia and Kummitha, Vishnu Vardhan Reddy and Dini{\'c}, Vuk and Sewunetie, Walelign Tewabe and Wu, Winston and Zhao, Xiaojing and Diarra, Yacouba and Nikankin, Yaniv and Mathur, Yash and Chen, Yixi and Li, Yiyuan and Xavier, Yolanda and Belinkov, Yonatan and Abayomi, Yusuf Ismail and Alyafeai, Zaid and Shan, Zhengyang and Tam, Zhi Rui and Tang, Zilu and Naďov{\'a}, Zuzana and Abbasi, Baber and Biderman, Stella and Stap, David and Ataman, Duygu and Schmidt, Fabian and Gonen, Hila and Wang, Jiayi and Adelani, David Ifeoluwa},
  journal = {arXiv preprint arXiv:2510.24081},
  year = {2025},
  url = {https://arxiv.org/pdf/2510.24081},
}

2024

Evaluating the Social Impact of Generative AI Systems in Systems and Society

Irene Solaiman, Zeerak Talat, William Agnew, and 28 more authors

2024

Generative AI systems across modalities, ranging from text (including code), image, audio, and video, have broad social impacts, but there is no official standard for means of evaluating those impacts or for which impacts should be evaluated. In this paper, we present a guide that moves toward a standard approach in evaluating a base generative AI system for any modality in two overarching categories: what can be evaluated in a base system independent of context and what can be evaluated in a societal context. Importantly, this refers to base systems that have no predetermined application or deployment context, including a model itself, as well as system components, such as training data. Our framework for a base system defines seven categories of social impact: bias, stereotypes, and representational harms; cultural values and sensitive content; disparate performance; privacy and data protection; financial costs; environmental costs; and data and content moderation labor costs. Suggested methods for evaluation apply to listed generative modalities and analyses of the limitations of existing evaluations serve as a starting point for necessary investment in future evaluations. We offer five overarching categories for what can be evaluated in a broader societal context, each with its own subcategories: trustworthiness and autonomy; inequality, marginalization, and violence; concentration of authority; labor and creativity; and ecosystem and environment. Each subcategory includes recommendations for mitigating harm.
@misc{solaiman2024evaluatingsocialimpactgenerative, title = {Evaluating the Social Impact of Generative AI Systems in Systems and Society}, author = {Solaiman, Irene and Talat, Zeerak and Agnew, William and Ahmad, Lama and Baker, Dylan and Blodgett, Su Lin and Chen, Canyu and Daumé, III, Hal and Dodge, Jesse and Duan, Isabella and Evans, Ellie and Friedrich, Felix and Ghosh, Avijit and Gohar, Usman and Hooker, Sara and Jernite, Yacine and Kalluri, Ria and Lusoli, Alberto and Leidinger, Alina and Lin, Michelle and Lin, Xiuzhu and Luccioni, Sasha and Mickel, Jennifer and Mitchell, Margaret and Newman, Jessica and Ovalle, Anaelia and Png, Marie-Therese and Singh, Shubham and Strait, Andrew and Struppek, Lukas and Subramonian, Arjun}, year = {2024}, eprint = {2306.05949}, archiveprefix = {arXiv}, primaryclass = {cs.CY}, url = {https://zeerak.org/papers/Evaluating_the_Social_Impact_of_Generative_AI_Systems_in_Systems_and_Society__preprint_.pdf}, }
Racial/Ethnic Categories in AI and Algorithmic Fairness: Why They Matter and What They Represent

Jennifer Mickel

In Proceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency, 2024

Racial diversity has become increasingly discussed within the AI and algorithmic fairness literature, yet little attention is focused on justifying the choices of racial categories and understanding how people are racialized into these chosen racial categories. Even less attention is given to how racial categories shift and how the racialization process changes depending on the context of a dataset or model. An unclear understanding of who comprises the racial categories chosen and how people are racialized into these categories can lead to varying interpretations of these categories. These varying interpretations can lead to harm when the understanding of racial categories and the racialization process is misaligned from the actual racialization process and racial categories used. Harm can also arise if the racialization process and racial categories used are irrelevant or do not exist in the context they are applied. In this paper, we make two contributions. First, we demonstrate how racial categories with unclear assumptions and little justification can lead to varying datasets that poorly represent groups obfuscated or unrepresented by the given racial categories and models that perform poorly on these groups. Second, we develop a framework, CIRCSheets, for documenting the choices and assumptions in choosing racial categories and the process of racialization into these categories to facilitate transparency in understanding the processes and assumptions made by dataset or model developers when selecting or using these racial categories.
@inproceedings{mickel2024racial, author = {Mickel, Jennifer}, title = {Racial/Ethnic Categories in AI and Algorithmic Fairness: Why They Matter and What They Represent}, year = {2024}, isbn = {9798400704505}, publisher = {Association for Computing Machinery}, address = {New York, NY, USA}, url = {https://doi.org/10.1145/3630106.3659050}, doi = {10.1145/3630106.3659050}, booktitle = {Proceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency}, pages = {2484–2494}, numpages = {11}, keywords = {algorithmic fairness, race and ethnicity, racial categories, racialization}, location = {Rio de Janeiro, Brazil}, series = {FAccT '24}, }
Intersectional Insights for Robust Models: Introducing FOG 😶‍🌫️ for Improving Worst Case Performance Without Group Information

Jennifer Mickel

Turing Scholars Honors Thesis, 2024

Standard training through empirical risk minimization (ERM) can result in seemingly well-performing models that reach high accuracy on average but achieve low accuracy on specific groups. Group-specific low accuracy is especially of concern in cases in which groups are underrepresented in the training data or when spurious correlations are present within data. Furthermore, instances can be a part of multiple groups, as in the case of demographic groups. Previous approaches, such as group distributionally robust optimization (Group DRO), achieve high worst-group accuracy yet require group information. Group information is not always available due to legal, data quality, or cost constraints. Other approaches that do not require group information exist, but performance gaps between these approaches and Group DRO persist, and these approaches seldom consider overlapping groups. We develop a model development cycle and algorithm, FOG, to improve the performance of the worst-performing group without group information while accounting for overlapping groups. We first train a model using ERM and use the model features corresponding to the data to identify groups. We then use these identified groups with Group DRO to train a new model. This process can be repeated to improve performance. Using our method, we find that we can improve the performance of the worst-performing group compared to ERM and other algorithms not requiring group information, such as JTT.
@article{mickel2024intersectional, title = {Intersectional Insights for Robust Models: Introducing FOG 😶‍🌫️ for Improving Worst Case Performance Without Group Information}, author = {Mickel, Jennifer}, school = {The University of Texas at Austin}, journal = {Turing Scholars Honors Thesis}, year = {2024}, }
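The thesis abstract above contrasts average accuracy with worst-group accuracy, notes that groups can overlap, and builds on Group DRO. The following is a minimal sketch of those two ingredients only, not the FOG algorithm itself: it computes per-group accuracy when an example may belong to several groups, and applies the exponentiated weight update commonly used in Group DRO, which shifts weight toward groups with higher loss. All function names, the `step_size` value, and the toy data are assumptions made for illustration.

```python
import math


def group_accuracies(correct, memberships):
    """Accuracy per group, where each example may belong to several groups.

    correct:     list of bools, one per example (was the prediction right?)
    memberships: list of sets of group names, one set per example
    """
    totals, hits = {}, {}
    for ok, groups in zip(correct, memberships):
        for g in groups:
            totals[g] = totals.get(g, 0) + 1
            hits[g] = hits.get(g, 0) + int(ok)
    return {g: hits[g] / totals[g] for g in totals}


def worst_group(correct, memberships):
    """Return (group name, accuracy) for the lowest-accuracy group."""
    accs = group_accuracies(correct, memberships)
    name = min(accs, key=accs.get)
    return name, accs[name]


def update_group_weights(weights, group_losses, step_size=0.1):
    """One exponentiated-gradient step on the group weights.

    Groups with higher loss receive more weight; the result is
    renormalized to sum to 1. `step_size` is a hypothetical setting.
    """
    unnormalized = {
        g: w * math.exp(step_size * group_losses[g]) for g, w in weights.items()
    }
    total = sum(unnormalized.values())
    return {g: w / total for g, w in unnormalized.items()}


if __name__ == "__main__":
    # Toy data: the second example belongs to both groups "a" and "b".
    correct = [True, True, False, True]
    memberships = [{"a"}, {"a", "b"}, {"b"}, {"a"}]
    print(group_accuracies(correct, memberships))
    print(worst_group(correct, memberships))
    print(update_group_weights({"a": 0.5, "b": 0.5}, {"a": 0.0, "b": 1.0}))
```

In this toy data the overall accuracy looks acceptable while one group fares noticeably worse, which is the failure mode the abstract describes. How FOG identifies groups from model features is not specified in the abstract and is deliberately left out here.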

2023

The Importance of Multi-Dimensional Intersectionality in Algorithmic Fairness and AI Model Development

Jennifer Mickel

Polymathic Scholars Honors Thesis, 2023

People are increasingly interacting with artificial intelligence (AI) systems and algorithms, but oftentimes, these models are embedded with unfair biases. These biases can lead to harm when an AI system’s output is implicitly or explicitly racist, sexist, or derogatory. If the output is offensive to a person interacting with it, it can cause the person emotional harm that may manifest physically. Alternatively, if a person agrees with the model’s output, the person’s negative biases may be reinforced, inciting the person to engage in discriminatory behavior. Researchers have recognized the harm AI systems can lead to, and they have worked to develop fairness definitions and methodologies for mitigating unfair biases in machine learning models. Unfortunately, these definitions (typically binary) and methodologies are insufficient for preventing AI models from learning unfair biases. To address this, fairness definitions and methodologies must account for intersectional identities in multicultural contexts. The limited scope of fairness definitions allows for models to develop biases against people with intersectional identities that are unaccounted for in the fairness definition. Existing frameworks and methodologies for model development are based in the US cultural context, which may be insufficient for fair model development in different cultural contexts. To assist machine learning practitioners in understanding the intersectional groups affected by their models, a database should be constructed detailing the intersectional identities, cultural contexts, and relevant model domains in which people may be affected. This can lead to fairer model development, for machine learning practitioners will be more adept at testing their model’s performance on intersectional groups.
@article{mickel2023importance, title = {The Importance of Multi-Dimensional Intersectionality in Algorithmic Fairness and AI Model Development}, author = {Mickel, Jennifer}, school = {The University of Texas at Austin}, journal = {Polymathic Scholars Honors Thesis}, year = {2023}, }

Evaluating the Social Impact of Generative AI Systems

Irene Solaiman, Zeerak Talat, William Agnew, and 29 more authors

In The Oxford Handbook of the Foundations and Regulation of Generative AI, 2023

Generative artificial intelligence (AI) systems across modalities, ranging from text, code, image, audio, and video, have broad social impacts, but there is little agreement on which impacts to evaluate or how to evaluate them. In this chapter, we present a guide for evaluating base generative AI systems (i.e. systems without predetermined applications or deployment contexts). We propose a framework of two overarching categories: what can be evaluated in a system independent of context and what requires societal context. For the former, we define seven areas of interest: stereotypes and representational harms; cultural values and sensitive content; disparate performance; privacy and data protection; financial costs; environmental costs; and data and content moderation labor costs. For the latter, we present five areas: trustworthiness and autonomy; inequality, marginalization, and violence; concentration of authority; labor and creativity; and ecosystem and environment. For each, we present methods for evaluations and the limitations presented by such methods.
@incollection{10.1093/oxfordhb/9780198940272.013.0025, author = {Solaiman, Irene and Talat, Zeerak and Agnew, William and Ahmad, Lama and Baker, Dylan K. and Blodgett, Su Lin and Chen, Canyu and Daumé, III, Hal and Dodge, Jesse and Duan, Isabella and Evans, Ellie and Friedrich, Felix and Ghosh, Avijit and Gohar, Usman and Hooker, Sara and Jernite, Yacine and Kalluri, Pratyusha Ria and Leidinger, Alina and Lusoli, Alberto and Lin, Michelle and Lin, Xiuzhu and Luccioni, Sasha and Mickel, Jennifer and Mitchell, Margaret and Newman, Jessica and Ovalle, Anaelia and Png, Marie-Therese and Singh, Shubham and Strait, Andrew and Struppek, Lukas and Subramonian, Arjun and Vassilev, Apostol}, isbn = {9780198940272}, title = {Evaluating the Social Impact of Generative AI Systems}, booktitle = {The Oxford Handbook of the Foundations and Regulation of Generative AI}, publisher = {Oxford University Press}, doi = {10.1093/oxfordhb/9780198940272.013.0025}, url = {https://doi.org/10.1093/oxfordhb/9780198940272.013.0025}, eprint = {https://academic.oup.com/book/0/chapter/544536706/chapter-ag-pdf/66118359/book_59908_section_544536706.ag.pdf}, }