Common Flaws in Running Human Evaluation Experiments in NLP

Authors: Thomson, Craig; Reiter, Ehud; Belz, Anya
Dates: 2024-07-09; 2024-06-01
Citation: Thomson, C, Reiter, E & Belz, A 2024, 'Common Flaws in Running Human Evaluation Experiments in NLP', Computational Linguistics, vol. 50, no. 2, pp. 795–805. https://doi.org/10.1162/coli_a_00508
ISSN: 0891-2017
RIS: urn:997FD788D81418EA68853A3BF7F47D77
ORCID: /0000-0002-7548-9504/work/150764475
Handle: https://hdl.handle.net/2164/23764
Language: eng
Subject: QA76 Computer software
Type: Journal article
Repository record: https://abdn.elsevierpure.com/en/publications/bdae6575-848e-4639-ae80-0a7f6dbe0f02

Acknowledgements: We would first like to thank all authors who took the time to respond to our requests for information; we could not have done this work without their help! We would also like to thank all the people at the ReproHum partner labs who helped us carry out Phase 1 of the multi-lab multi-test study: Gavin Abercrombie, Jose M. Alonso-Moral, Mohammad Arvan, Anouck Braggaar, Jackie Cheung, Mark Cieliebak, Elizabeth Clark, Kees van Deemter, Tanvi Dinkar, Ondřej Dušek, Steffen Eger, Qixiang Fang, Mingqi Gao, Albert Gatt, Dimitra Gkatzia, Javier González-Corbelle, Dirk Hovy, Manuela Hürlimann, Takumi Ito, John D. Kelleher, Emiel Khramer, Filip Klubicka, Huiyuan Lai, Chris van der Lee, Yiru Li, Saad Mahamood, Margot Mieskes, Emiel van Miltenburg, Pablo Mosteiro, Malvina Nissim, Natalie Parde, Ondřej Plátek, Verena Rieser, Jie Ruan, Joel Tetreault, Antonio Toral, Xiaojun Wan, Leo Wanner, Lewis Watson, and Diyi Yang. Special thanks to Mohammad Arvan, Saad Mahamood, Emiel van Miltenburg, Natalie Parde, Barkavi Sundararajan, as well as the editor and anonymous reviewers for their very helpful suggestions for improving this paper. We also thank Cindy Robinson for providing information about errata in TACL.