research
My research interests lie in AI and algorithmic fairness and in natural language processing (NLP). The goal of my work, anchored in intersectionality and a multicultural, interdisciplinary perspective, is to
- understand the social impact of generative AI on users and society (e.g., social biases and representation)
- develop robust algorithms, frameworks, and evaluations for addressing and understanding the social impact of generative AI in varying contexts
2025
- More of the Same: Persistent Representational Harms Under Increased Representation
  Jennifer Mickel, Maria De-Arteaga, Leqi Liu, and Kevin Tian
  In Proceedings of the 39th Conference on Neural Information Processing Systems (NeurIPS), 2025
To recognize and mitigate the harms of generative AI systems, it is crucial to consider whether and how different societal groups are represented by these systems. A critical gap emerges when naively measuring or improving who is represented, as this does not consider how people are represented. In this work, we develop GAS(P), an evaluation methodology for surfacing distribution-level group representational biases in generated text, tackling the setting where groups are unprompted (i.e., groups are not specified in the input to generative systems). We apply this novel methodology to investigate gendered representations in occupations across state-of-the-art large language models. We show that, even though the gender distribution when models are prompted to generate biographies leads to a large representation of women, representational biases persist in how different genders are represented. Our evaluation methodology reveals that there are statistically significant distribution-level differences in the word choice used to describe biographies and personas of different genders across occupations, and we show that many of these differences are associated with representational harms and stereotypes. Our empirical findings caution that naively increasing (unprompted) representation may inadvertently proliferate representational biases, and our proposed evaluation methodology enables systematic and rigorous measurement of the problem.
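The GAS(P) pipeline itself is described in the paper; as a rough illustration of one ingredient it reports on, the sketch below (my own simplification, not the paper's implementation) runs a chi-squared test on the word-choice distributions of two sets of generated biographies. All texts and thresholds are invented for the example.

```python
# Illustrative sketch only (not the paper's GAS(P) methodology): test whether
# the word-choice distribution differs between two groups of generated
# biographies, using a chi-squared test over a shared vocabulary.
from collections import Counter
from scipy.stats import chi2_contingency

def word_counts(texts):
    """Count word occurrences across a list of generated texts."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    return counts

def distribution_difference(texts_a, texts_b, min_count=2):
    """Chi-squared test on word-frequency distributions of two text groups."""
    counts_a, counts_b = word_counts(texts_a), word_counts(texts_b)
    vocab = sorted(w for w in set(counts_a) | set(counts_b)
                   if counts_a[w] + counts_b[w] >= min_count)
    table = [[counts_a[w] for w in vocab], [counts_b[w] for w in vocab]]
    stat, p_value, _, _ = chi2_contingency(table)
    return stat, p_value

# Invented toy outputs for one occupation, grouped by gender of the persona.
bios_women = ["she is a caring and dedicated nurse who supports her patients",
              "she is a warm and caring nurse"]
bios_men = ["he is a skilled and ambitious nurse who leads his team",
            "he is a confident and skilled nurse"]
stat, p = distribution_difference(bios_women, bios_men)
print(f"chi-squared = {stat:.2f}, p = {p:.3f}")
```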
@inproceedings{mickel2025more, title = {More of the Same: Persistent Representational Harms Under Increased Representation}, author = {Mickel, Jennifer and De-Arteaga, Maria and Liu, Leqi and Tian, Kevin}, booktitle = {Proceedings of the 39th Conference on Neural Information Processing Systems (NeurIPS)}, year = {2025}, eprint = {2503.00333}, archiveprefix = {arXiv}, url = {https://neurips.cc/virtual/2025/loc/san-diego/poster/118054}, }

- Write Code that People Want to Use
  Stella Biderman, Jennifer Mickel, and Baber Abbasi
  In Championing Open-source DEvelopment in ML Workshop @ ICML25, 2025
“Research code” is a common, often self-effacing, term used to refer to the type of code that is commonly released alongside research papers. Research code is notorious for being fragile, poorly documented, and difficult for others to run or extend. In this position paper, we argue that, while research code seems to meet the short-term needs of research projects, in fact the practice hurts researchers by limiting the impact of their work and causing fewer people to build on their research. We explore the structural incentives and dynamics of the field that drive these behaviors. We argue that extensibility matters far more than strict reproducibility for research impact, and propose both pragmatic approaches for individual researchers and institutional reforms to encourage the development of more usable and maintainable research software.
@inproceedings{bidermanwrite, title = {Write Code that People Want to Use}, author = {Biderman, Stella and Mickel, Jennifer and Abbasi, Baber}, booktitle = {Championing Open-source DEvelopment in ML Workshop@ ICML25}, year = {2025}, url = {https://openreview.net/forum?id=oH0XhgzJt0}, }

- Global PIQA: Evaluating Physical Commonsense Reasoning Across 100+ Languages and Cultures
  Tyler A. Chang, Catherine Arnett, Abdelrahman Eldesokey, and 333 more authors
  arXiv preprint arXiv:2510.24081, 2025
To date, there exist almost no culturally-specific evaluation benchmarks for large language models (LLMs) that cover a large number of languages and cultures. In this paper, we present Global PIQA, a participatory commonsense reasoning benchmark for over 100 languages, constructed by hand by 335 researchers from 65 countries around the world. The 116 language varieties in Global PIQA cover five continents, 14 language families, and 23 writing systems. In the non-parallel split of Global PIQA, over 50% of examples reference local foods, customs, traditions, or other culturally-specific elements. We find that state-of-the-art LLMs perform well on Global PIQA in aggregate, but they exhibit weaker performance in lower-resource languages (up to a 37% accuracy gap, despite random chance at 50%). Open models generally perform worse than proprietary models. Global PIQA highlights that in many languages and cultures, everyday knowledge remains an area for improvement, alongside more widely-discussed capabilities such as complex reasoning and expert knowledge. Beyond its uses for LLM evaluation, we hope that Global PIQA provides a glimpse into the wide diversity of cultures in which human language is embedded.
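As a rough illustration of how a two-choice benchmark like this is typically scored, here is a short sketch (my assumption about the record fields, not the actual Global PIQA release format) that computes per-language accuracy against the 50% random-chance baseline mentioned above.

```python
# Illustrative sketch: per-language accuracy on a two-choice benchmark,
# compared against the 50% random-chance baseline. The record format and the
# toy item below are assumptions, not the actual Global PIQA data.
from collections import defaultdict

def score_by_language(examples, predict):
    """examples: dicts with 'language', 'prompt', 'choices' (two options),
    and 'label' (0 or 1). predict(prompt, choices) returns a choice index."""
    correct, total = defaultdict(int), defaultdict(int)
    for ex in examples:
        total[ex["language"]] += 1
        if predict(ex["prompt"], ex["choices"]) == ex["label"]:
            correct[ex["language"]] += 1
    return {lang: correct[lang] / total[lang] for lang in total}

# Toy usage with an invented item and a predictor that always picks option 0.
examples = [
    {"language": "sw", "prompt": "To cool a hot dish quickly, you should",
     "choices": ["spread it thin on a plate", "wrap it in a warm cloth"],
     "label": 0},
]
accuracies = score_by_language(examples, predict=lambda prompt, choices: 0)
print({lang: f"{acc:.0%} (chance = 50%)" for lang, acc in accuracies.items()})
```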
@article{chang2025global, title = {Global PIQA: Evaluating Physical Commonsense Reasoning Across 100+ Languages and Cultures}, author = {Chang, Tyler A. and Arnett, Catherine and Eldesokey, Abdelrahman and Sadallah, Abdelrahman Boda and Kashar, Abeer and Daud, Abolade and Olanihun, Abosede Grace and Mohammed, Adamu Labaran and Praise, Adeyemi and Sharma, Adhikarinayum Meerajita and Gupta, Aditi and Iyigun, Afitab and Simpl'icio, Afonso and Essouaied, Ahmed and Chorana, Aicha and Eppa, Akhil and Oladipo, Akintunde and Ramesh, Akshay and Dorkin, Aleksei and Kondoro, Alfred Malengo and Aji, Alham Fikri and cCetintacs, Ali Eren and Hanbury, Allan and Demb{\'e}l{\'e}, Alou and Niksarli, Alp and Arroyo, 'Alvaro and Bajand, Amin and Khanna, Amol and Chkhaidze, Ana and Condez, Ana Carolina and Mkhonto, Andiswa and Hoblitzell, Andrew and Tran, Andrew and Poulis, Angelos and Majumder, Anirban and Vacalopoulou, Anna and Wong, Annette Kuuipolani Kanahele and Simonsen, Annika and Kovalev, Anton and Ashvanth.S and Lana, Ayodeji Joseph and Kinay, Barkin and Alhafni, Bashar and Busole, Benedict Cibalinda and Ghanem, Bernard and Nathani, Bharti and DJuri'c, Biljana Stojanovska and Agbonile, Bola and Bergsson, Bragi and Fischer, Bruce Torres and Tutar, Burak and cCinar, Burcu Alakucs and Kane, Cade J. Kanoniakapueo and Udomcharoenchaikit, Can and Helwe, Chadi and Nerella, Chaithra Reddy and Liu, Chen Cecilia and Nwokolo, Chiamaka Glory and Espa{\~n}a-Bonet, Cristina and Amol, Cynthia and Lee, DaeYeop and Arad, Dana and Dzenhaliou, Daniil and Pugacheva, Daria and Choi, Dasol and Abolade, Daud Olamide and Liu, David and Semedo, David and Popoola, Deborah and Mataciunas, Deividas and Nyaboke, Delphine and Kumar, Dhyuthy Krishna and Gl'oria-Silva, Diogo and Tavares, Diogo and Goyal, Divyanshu and Lee, DongGeon and Anajemba, Ebele Nwamaka and Grace, Egonu Ngozi and Mickel, Elena and Tutubalina, Elena and Herranen, Elias and Anand, Emile and Habumuremyi, Emmanuel and Ajiboye, Emuobonuvie Maria and Yulianrifat, Eryawan Presma and Adenuga, Esther and Rudnicka, Ewa and Itiola, Faith Olabisi and Butt, Faran Taimoor and Thekkekara, Fathima and Haouari, Fatima and Tjiaranata, Filbert Aurelian and Laakom, Firas and Grasso, Francesca and Orabona, Francesco and Periti, Francesco and Solomon, Gbenga Kayode and Ngo, Gia Nghia and Udhehdhe-oze, Gloria and Martins, Gonccalo Vinagre and Challagolla, Gopi Naga Sai Ram and Son, Guijin and Abdykadyrova, Gulnaz and Einarsson, Hafsteinn and Hu, Hai and Saffari, Hamidreza and Zaidi, Hamza and Zhang, Haopeng and Shairah, Harethah Abu and Vuong, Harry and Kuulmets, Hele-Andra and Bouamor, Houda and Yu, Hwanjo and Debess, Iben Nyholm and Deveci, .Ibrahim Ethem and Hanif, Ikhlasul Akmal and Cho, Ikhyun and Calvo, Ines and Vieira, Ines and Manzi, Isaac and Daud, Ismail and Itzhak, Itay and Alekseenko, Iuliia and Belashkin, Ivan and Spada, Ivan and Zhelyazkov, Ivan and Brinton, Jacob and Isbarov, Jafar and vCibej, Jaka and vCuhel, Jan and Koco'n, Jan and Krito, Jauza Akbar and Purbey, Jebish and Mickel, Jennifer and Za, Jennifer and Kunz, Jenny and Jeong, Jihae and D'avalos, Jimena Tena and Lee, Jinu and Magalhaes, Joao and Yi, John and Kim, Jongin and Chataignon, Joseph and Imperial, Joseph Marvin and Thevakumar, Jubeerathan and Land, Judith and Jiang, Junchen and Kim, Jungwhan and Sirts, Kairit and Kamesh, R and Kamesh, V and Tshinu, Kanda Patrick and Kukk, K{\"a}triin and Ponkshe, Kaustubh and Huseynova, Kavsar and He, Ke and Buchanan, Kelly and Sarveswaran, Kengatharaiyer and Zaman, 
Kerem and Mrini, Khalil and Kyars, Kian and Kruusmaa, Krister and Chouhan, Kusum and Krishnakumar, Lainitha and S'anchez, Laura Castro and Moscoso, Laura Porrino and Choshen, Leshem and Sencan, Levent and Ovrelid, Lilja and Alazraki, Lisa and Ehimen-Ugbede, Lovina and Thevakumar, Luheerathan and Thavarasa, Luxshan and Malik, Mahnoor and Keita, Mamadou K. and Jangid, Mansi and Santis, Marco De and Garc'ia, Marcos and Suppa, Marek and D'Ciofalo, Mariam and Ojastu, Marii and Sikander, Maryam and Narayan, Mausami and Skandalis, Maximos and Mehak, Mehak and Bozkurt, Mehmet .Ilterics and Workie, Melaku Bayu and Velayuthan, Menan and Leventhal, Michael and Marci'nczuk, Michal and Potovcnjak, Mirna and Shafiei, Mohammadamin and Sharma, Mridul and Indoria, Mrityunjaya and Habibi, Muhammad Ravi Shulthan and Koli'c, Murat and Galant, Nada and Permpredanun, Naphat and Maugin, Narada and Correa, Nicholas Kluge and Ljubevsi'c, Nikola and Thomas, Nirmal and de Silva, Nisansa and Joshi, Nisheeth and Ponkshe, Nitish and Habash, Nizar and Udeze, Nneoma Chinemerem and Thomas, Noel and Ligeti-Nagy, No{\'e}mi and Coulibaly, Nouhoum Souleymane and Faustin, Nsengiyumva and Buliaminu, Odunayo Kareemat and Ogundepo, Odunayo and Fejiro, Oghojafor Godswill and Funmilola, Ogundipe Blessing and God'spraise, Okechukwu and Samuel, Olanrewaju and Oluwaseun, Olaoye Deborah and Akindejoye, Olasoji and Popova, Olga and Snissarenko, Olga and Chiemezie, Onyinye Anulika and Kınay, Orkun and Tursun, Osman and Moses, Owoeye Tobiloba and Joshua, Oyelade Oluwafemi and Fiyinfoluwa, Oyesanmi and Gamallo, Pablo and Fern'andez, Pablo Rodr'iguez and Arora, Palak and Valente, Pedro and Rupnik, Peter and Ekiugbo, Philip Oghenesuowho and Sahoo, Pramit and Prokopidis, Prokopis and Niau-Puhipau, Pua and Yahya, Quadri and Mignone, Rachele and Singhal, Raghav and Kadiyala, Ramyakrishna and Merx, Raphael and Afolayan, Rapheal and Rajalakshmi, Ratnavel and Ghosh, Rishav and Oji, Romina and Solis, Ron Kekeha and Guerra, Rui and Zawar, Rushikesh and Bashir, Sa'ad Nasir and Alzaabi, Saeed and Sandeep, Sahil and Batchu, Sailaja and Kantareddy, Sai Nithin Reddy and Pranida, Salsabila Zahirah and Buchanan, Sam and Rutunda, Samuel and Land, Sander and Sulollari, Sarah and Ali, Sardar and Sapkota, Saroj and Tautvai{\vs}as, Saulius and Sen, Sayambhu and Banerjee, Sayantani and Diarra, S{\'e}bastien and SenthilNathan.M and Lee, Sewoong and Shah, Shaan and Venkitachalam, Shankar and Djurabaeva, Sharifa and Ibejih, Sharon and Dutta, Shivanya Shomir and Gupta, Siddhant and Su'arez, Silvia Paniagua and Ahmadi, Sina and Sukumar, Sivasuthan and Song, Siyuan and Snegha, A and Sofianopoulos, Sokratis and Simon, Sona Elza and Benvcina, Sonja and Gvasalia, Sophie and More, Sphurti Kirit and Dragazis, Spyros and Kaufhold, Stephan P. 
and Suba.S and Alrashed, Sultan and Ranathunga, Surangika and Someya, Taiga and Pungervsek, Taja Kuzman and Haklay, Tal and Jibril, Tasi'u and Aoyama, Tatsuya and Abashidze, T B and Cruz, Terenz Jomar Dela and Blevins, Terra and Nikas, Themistoklis and Idoko, Theresa Dora and Do, Thu Mai and Chubakov, Tilek and Gargiani, Tommaso and Rathore, Uma and Johannesen, Uni and Ugwu, Uwuma Doris and Putra, Vallerie Alexandra and Kumar, Vanya Bannihatti and Jeyarajalingam, Varsha and Arzt, Varvara and Nedumpozhimana, Vasudevan and Ondrejova, Viktoria and Horbik, Viktoryia and Kummitha, Vishnu Vardhan Reddy and Dini'c, Vuk and Sewunetie, Walelign Tewabe and Wu, Winston and Zhao, Xiaojing and Diarra, Yacouba and Nikankin, Yaniv and Mathur, Yash and Chen, Yixi and Li, Yiyuan and Xavier, Yolanda and Belinkov, Yonatan and Abayomi, Yusuf Ismail and Alyafeai, Zaid and Shan, Zhengyang and Tam, Zhi Rui and Tang, Zilu and Naďov{\'a}, Zuzana and Abbasi, Baber and Biderman, Stella and Stap, David and Ataman, Duygu and Schmidt, Fabian and Gonen, Hila and Wang, Jiayi and Adelani, David Ifeoluwa}, journal = {arXiv preprint arXiv:2510.24081}, year = {2025}, url = {https://arxiv.org/pdf/2510.24081}, }
2024
- Evaluating the Social Impact of Generative AI Systems in Systems and Society
  Irene Solaiman, Zeerak Talat, William Agnew, and 28 more authors
  2024
Generative AI systems across modalities, ranging from text (including code), image, audio, and video, have broad social impacts, but there is no official standard for means of evaluating those impacts or for which impacts should be evaluated. In this paper, we present a guide that moves toward a standard approach in evaluating a base generative AI system for any modality in two overarching categories: what can be evaluated in a base system independent of context and what can be evaluated in a societal context. Importantly, this refers to base systems that have no predetermined application or deployment context, including a model itself, as well as system components, such as training data. Our framework for a base system defines seven categories of social impact: bias, stereotypes, and representational harms; cultural values and sensitive content; disparate performance; privacy and data protection; financial costs; environmental costs; and data and content moderation labor costs. Suggested methods for evaluation apply to listed generative modalities and analyses of the limitations of existing evaluations serve as a starting point for necessary investment in future evaluations. We offer five overarching categories for what can be evaluated in a broader societal context, each with its own subcategories: trustworthiness and autonomy; inequality, marginalization, and violence; concentration of authority; labor and creativity; and ecosystem and environment. Each subcategory includes recommendations for mitigating harm.
@misc{solaiman2024evaluatingsocialimpactgenerative, title = {Evaluating the Social Impact of Generative AI Systems in Systems and Society}, author = {Solaiman, Irene and Talat, Zeerak and Agnew, William and Ahmad, Lama and Baker, Dylan and Blodgett, Su Lin and Chen, Canyu and III, Hal Daumé and Dodge, Jesse and Duan, Isabella and Evans, Ellie and Friedrich, Felix and Ghosh, Avijit and Gohar, Usman and Hooker, Sara and Jernite, Yacine and Kalluri, Ria and Lusoli, Alberto and Leidinger, Alina and Lin, Michelle and Lin, Xiuzhu and Luccioni, Sasha and Mickel, Jennifer and Mitchell, Margaret and Newman, Jessica and Ovalle, Anaelia and Png, Marie-Therese and Singh, Shubham and Strait, Andrew and Struppek, Lukas and Subramonian, Arjun}, year = {2024}, eprint = {2306.05949}, archiveprefix = {arXiv}, primaryclass = {cs.CY}, url = {https://zeerak.org/papers/Evaluating_the_Social_Impact_of_Generative_AI_Systems_in_Systems_and_Society__preprint_.pdf}, }

- Racial/Ethnic Categories in AI and Algorithmic Fairness: Why They Matter and What They Represent
  Jennifer Mickel
  In Proceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency, 2024
Racial diversity has become increasingly discussed within the AI and algorithmic fairness literature, yet little attention is focused on justifying the choices of racial categories and understanding how people are racialized into these chosen racial categories. Even less attention is given to how racial categories shift and how the racialization process changes depending on the context of a dataset or model. An unclear understanding of who comprises the racial categories chosen and how people are racialized into these categories can lead to varying interpretations of these categories. These varying interpretations can lead to harm when the understanding of racial categories and the racialization process is misaligned from the actual racialization process and racial categories used. Harm can also arise if the racialization process and racial categories used are irrelevant or do not exist in the context they are applied. In this paper, we make two contributions. First, we demonstrate how racial categories with unclear assumptions and little justification can lead to varying datasets that poorly represent groups obfuscated or unrepresented by the given racial categories and models that perform poorly on these groups. Second, we develop a framework, CIRCSheets, for documenting the choices and assumptions in choosing racial categories and the process of racialization into these categories to facilitate transparency in understanding the processes and assumptions made by dataset or model developers when selecting or using these racial categories.
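CIRCSheets is a documentation framework rather than code; as a loose illustration of the kind of information it asks developers to record, the sketch below defines a hypothetical machine-readable record. The field names and example values are my assumptions, not the published template.

```python
# Hypothetical sketch in the spirit of CIRCSheets: a machine-readable record
# documenting the racial categories and racialization process behind a dataset
# or model. Field names and example values are illustrative assumptions.
from dataclasses import dataclass, field

@dataclass
class RacialCategoryDocumentation:
    dataset_or_model: str
    categories: list[str]          # racial/ethnic categories used
    justification: str             # why these categories were chosen
    racialization_process: str     # how people are assigned to the categories
    cultural_context: str          # context in which the categories are meaningful
    known_limitations: list[str] = field(default_factory=list)

doc = RacialCategoryDocumentation(
    dataset_or_model="hypothetical-biography-dataset",
    categories=["Black", "White", "Asian", "Hispanic/Latino"],
    justification="Inherited from an upstream census-style source.",
    racialization_process="Self-identification at data-collection time.",
    cultural_context="United States",
    known_limitations=["Multiracial and Indigenous identities are obfuscated."],
)
print(doc.categories)
```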
@inproceedings{mickel2024racial, author = {Mickel, Jennifer}, title = {Racial/Ethnic Categories in AI and Algorithmic Fairness: Why They Matter and What They Represent}, year = {2024}, isbn = {9798400704505}, publisher = {Association for Computing Machinery}, address = {New York, NY, USA}, url = {https://doi.org/10.1145/3630106.3659050}, doi = {10.1145/3630106.3659050}, booktitle = {Proceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency}, pages = {2484–2494}, numpages = {11}, keywords = {algorithmic fairness, race and ethnicity, racial categories, racialization}, location = {Rio de Janeiro, Brazil}, series = {FAccT '24}, }

- Intersectional Insights for Robust Models: Introducing FOG 😶🌫️ for Improving Worst Case Performance Without Group Information
  Jennifer Mickel
  Turing Scholars Honors Thesis, 2024
Standard training through empirical risk minimization (ERM) can result in seemingly well-performing models that reach high accuracy on average but achieve low accuracy on specific groups. Low group-specific accuracy is of particular concern when groups are underrepresented in the training data or when spurious correlations are present in the data. Furthermore, instances can be a part of multiple groups, as in the case of demographic groups. Previous approaches, such as group distributionally robust optimization (Group DRO), achieve high worst-group accuracy yet require group information. Group information is not always available due to legal, data quality, or cost constraints. Other approaches that do not require group information exist, but performance gaps between these approaches and Group DRO persist, and these approaches seldom consider overlapping groups. We develop a model development cycle and an algorithm, FOG, that improves worst-group performance without group information while accounting for overlapping groups. We first train a model using ERM and use the model features corresponding to the data to identify groups. We use these identified groups with Group DRO to train a new model. This process can be repeated to improve performance. Using our method, we find that we can improve the performance of the worst-performing group compared to ERM and to other algorithms that do not require group information, such as JTT.
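As a rough sketch of the cycle the abstract describes (train with ERM, infer pseudo-groups from model features, then retrain with Group DRO), here is a simplified PyTorch version. It is my illustration under assumed details (k-means clustering of features, an exponential-weights Group DRO loop), not the thesis implementation of FOG.

```python
# Simplified illustration of a FOG-style cycle (my assumptions, not the thesis
# code): (1) train with ERM, (2) cluster learned features into pseudo-groups,
# (3) retrain with a minimal Group-DRO-style objective over those pseudo-groups.
import torch
from sklearn.cluster import KMeans

def train_erm(model, loader, epochs=5, lr=1e-3):
    """Standard ERM training with cross-entropy loss; loader yields (x, y)."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
    return model

def infer_groups(features, n_groups=4):
    """Cluster penultimate-layer features (numpy array) into pseudo-groups."""
    return KMeans(n_clusters=n_groups, n_init=10).fit_predict(features)

def train_group_dro(model, loader, groups, n_groups, epochs=5, lr=1e-3, eta=0.1):
    """Minimal Group-DRO-style loop: exponential weights over pseudo-groups,
    minimizing the weighted per-group loss. loader yields (x, y, idx)."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss(reduction="none")
    groups = torch.as_tensor(groups)
    q = torch.ones(n_groups) / n_groups              # weights over pseudo-groups
    for _ in range(epochs):
        for x, y, idx in loader:
            g = groups[idx]                          # pseudo-group of each example
            per_example = loss_fn(model(x), y)
            group_loss = torch.stack([
                per_example[g == k].mean() if (g == k).any()
                else per_example.new_zeros(())
                for k in range(n_groups)
            ])
            q = q * torch.exp(eta * group_loss.detach())  # upweight worst groups
            q = q / q.sum()
            opt.zero_grad()
            (q * group_loss).sum().backward()
            opt.step()
    return model

# The cycle can be repeated: ERM -> infer_groups on extracted features ->
# train_group_dro, then extract features from the new model and repeat.
```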
@article{mickel2024intersectional, title = {Intersectional Insights for Robust Models: Introducing FOG 😶🌫️ for Improving Worst Case Performance Without Group Information}, author = {Mickel, Jennifer}, school = {The University of Texas at Austin}, journal = {Turing Scholars Honors Thesis}, year = {2024}, }
2023
- The Importance of Multi-Dimensional Intersectionality in Algorithmic Fairness and AI Model Development
  Jennifer Mickel
  Polymathic Scholars Honors Thesis, 2023
People are increasingly interacting with artificial intelligence (AI) systems and algorithms, but oftentimes, these models are embedded with unfair biases. These biases can lead to harm when an AI system’s output is implicitly or explicitly racist, sexist, or derogatory. If the output is offensive to a person interacting with it, it can cause the person emotional harm that may manifest physically. Alternatively, if a person agrees with the model’s output, the person’s negative biases may be reinforced, inciting the person to engage in discriminatory behavior. Researchers have recognized the harm AI systems can cause, and they have worked to develop fairness definitions and methodologies for mitigating unfair biases in machine learning models. Unfortunately, these definitions (typically binary) and methodologies are insufficient for preventing AI models from learning unfair biases. To address this, fairness definitions and methodologies must account for intersectional identities in multicultural contexts. The limited scope of fairness definitions allows models to develop biases against people with intersectional identities that are unaccounted for in the fairness definition. Existing frameworks and methodologies for model development are based in the US cultural context, which may be insufficient for fair model development in different cultural contexts. To assist machine learning practitioners in understanding the intersectional groups affected by their models, a database should be constructed detailing the intersectional identities, cultural contexts, and relevant model domains in which people may be affected. This can lead to fairer model development, as machine learning practitioners will be better equipped to test their models’ performance on intersectional groups.
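As a loose illustration of the proposed database, the sketch below shows what a single record relating intersectional identities, cultural context, and model domains might look like. The schema and example values are my assumptions, not content from the thesis.

```python
# Hypothetical sketch of one record in the database the thesis proposes:
# intersectional identities tied to a cultural context and to the model
# domains in which people may be affected. Field names and values are
# illustrative assumptions.
from dataclasses import dataclass

@dataclass
class IntersectionalGroupRecord:
    identity_axes: dict[str, str]   # e.g. {"gender": "...", "religion": "..."}
    cultural_context: str           # context in which this intersection is salient
    relevant_domains: list[str]     # model domains where harm may arise
    notes: str = ""

record = IntersectionalGroupRecord(
    identity_axes={"gender": "woman", "religion": "Muslim"},
    cultural_context="South Asia",
    relevant_domains=["content moderation", "hiring models"],
    notes="Not captured by binary or single-axis fairness definitions.",
)
print(record.relevant_domains)
```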
@article{mickel2023importance, title = {The Importance of Multi-Dimensional Intersectionality in Algorithmic Fairness and AI Model Development}, author = {Mickel, Jennifer}, school = {The University of Texas at Austin}, journal = {Polymathic Scholars Honors Thesis}, year = {2023}, }