Summary

Coloniality, the continuation of colonial harms beyond "official" colonization, has pervasive effects across society and scientific fields, including Natural Language Processing (NLP). We perform a quantitative survey of the geography of different types of NLP research using the ACL Anthology. Based on this evidence, we argue that combating coloniality in NLP requires not only changing current values but also active work to remove the accumulation of colonial ideals in our foundational data and algorithms. Read the paper for theoretical grounding, further quantitative details, and two qualitative case studies.

Regionality and Linguistic Diversity of NLP Over Time

Both the languages studied and the affiliations of researchers reflect Western Eurocentrism. Even the study of new languages can reflect the shifting interests of existing powers, a phenomenon called interest convergence. For example, using the interactive plots below, compare the amount of research on languages of the former Soviet Union (1960-1980) and of the Middle East (2000-2010) with the percentage of researchers affiliated with those regions.
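The kind of comparison the plots support can be sketched in a few lines of code. The toy records and the `regional_shares` helper below are purely illustrative assumptions, not the paper's actual pipeline or data; the real survey draws on the full ACL Anthology.

```python
# Illustrative sketch (hypothetical data): for a region and time window,
# compare the share of papers studying that region's languages with the
# share of papers whose authors are affiliated with that region.

# Invented toy records: (year, region of languages studied, author affiliation region)
papers = [
    (1965, "Soviet Union", "North America"),
    (1972, "Soviet Union", "Western Europe"),
    (1975, "Western Europe", "Western Europe"),
    (2004, "Middle East", "North America"),
    (2007, "Middle East", "Western Europe"),
    (2008, "Western Europe", "Middle East"),
]

def regional_shares(papers, region, start, end):
    """Return (share of papers studying `region`'s languages,
    share of papers with authors affiliated in `region`)
    among papers published in [start, end]."""
    window = [p for p in papers if start <= p[0] <= end]
    if not window:
        return 0.0, 0.0
    studied = sum(1 for _, lang, _ in window if lang == region) / len(window)
    affiliated = sum(1 for _, _, aff in window if aff == region) / len(window)
    return studied, affiliated

studied, affiliated = regional_shares(papers, "Soviet Union", 1960, 1980)
print(f"1960-1980: {studied:.0%} of papers study Soviet languages, "
      f"{affiliated:.0%} of authors are affiliated there")
```

On the toy data above, Soviet languages are studied in a majority of 1960-1980 papers while no authors in that window are affiliated with the region, the kind of gap the interactive plots surface at scale.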


From Data Bias to Deployment

Using Actor-Network Theory, our work surveys NLP literature to understand how earlier resources and inventions influence the downstream priorities of the field. We analyze how disparities in unlabeled data compound into labeled-data disparities, how the resulting datasets shape assumptions in model design, and ultimately how these models affect real people around the world. This process creates growing disparities across geographic borders, especially for colonized nations.

Unlabeled Data Production

Annotation and Evaluation

Modeling and Algorithm Design

NLP Deployment

Citation & Further Reading

If you find our work interesting, we encourage you to read the foundational works on coloniality in AI and NLP that it builds upon, especially Decolonial AI, Decolonising Speech and Language Technology, Decolonizing NLP for “Low-resource Languages”, and Ethical Considerations for Machine Translation of Indigenous Languages.

Acknowledgements

We are grateful to Azure Zhou, Caleb Ziems, Dan Jurafsky, Dora Zhao, Irene Solaiman, Michael Li, Myra Cheng, Omar Shaikh, Pooja Casula, Pratyusha Ria Kalluri, Sachin Pendse, Tolúlọpẹ́ Ògúnrẹ̀mí, Tony Wang, Yanzhe Zhang, and Zhehao Zhang for feedback and suggestions at different stages of this work.


The website template originates from Michaël Gharbi and was found via Ref-NeRF. Thanks to Yanzhe Zhang, Yanchen Liu, Seonghee Lee, Matthias Gerstgrasser, and the broader SALT lab for testing this site for responsiveness!