
AURA - Aberdeen University Research Archive

 

Common Flaws in Running Human Evaluation Experiments in NLP

dc.contributor.author: Thomson, Craig
dc.contributor.author: Reiter, Ehud
dc.contributor.author: Belz, Anya
dc.contributor.institution: University of Aberdeen.Computing Science
dc.contributor.institution: University of Aberdeen.Computational Linguistics at Aberdeen
dc.date.accessioned: 2024-07-09T09:36:05Z
dc.date.available: 2024-07-09T09:36:05Z
dc.date.issued: 2024-06-01
dc.description: We would first like to thank all authors who took the time to respond to our requests for information; we could not have done this work without their help! We would also like to thank all the people at the ReproHum partner labs who helped us carry out Phase 1 of the multi-lab multi-test study: Gavin Abercrombie, Jose M. Alonso-Moral, Mohammad Arvan, Anouck Braggaar, Jackie Cheung, Mark Cieliebak, Elizabeth Clark, Kees van Deemter, Tanvi Dinkar, Ondřej Dušek, Steffen Eger, Qixiang Fang, Mingqi Gao, Albert Gatt, Dimitra Gkatzia, Javier González-Corbelle, Dirk Hovy, Manuela Hürlimann, Takumi Ito, John D. Kelleher, Emiel Krahmer, Filip Klubicka, Huiyuan Lai, Chris van der Lee, Yiru Li, Saad Mahamood, Margot Mieskes, Emiel van Miltenburg, Pablo Mosteiro, Malvina Nissim, Natalie Parde, Ondřej Plátek, Verena Rieser, Jie Ruan, Joel Tetreault, Antonio Toral, Xiaojun Wan, Leo Wanner, Lewis Watson, and Diyi Yang. Special thanks to Mohammad Arvan, Saad Mahamood, Emiel van Miltenburg, Natalie Parde, Barkavi Sundararajan, as well as the editor and anonymous reviewers for their very helpful suggestions for improving this paper. We also thank Cindy Robinson for providing information about errata in TACL.
dc.description.status: Peer reviewed
dc.format.extent: 11
dc.format.extent: 210807
dc.identifier: 283538319
dc.identifier: bdae6575-848e-4639-ae80-0a7f6dbe0f02
dc.identifier: 85185871462
dc.identifier.citation: Thomson, C, Reiter, E & Belz, A 2024, 'Common Flaws in Running Human Evaluation Experiments in NLP', Computational Linguistics, vol. 50, no. 2, pp. 795–805. https://doi.org/10.1162/coli_a_00508
dc.identifier.doi: 10.1162/coli_a_00508
dc.identifier.iss: 2
dc.identifier.issn: 0891-2017
dc.identifier.other: RIS: urn:997FD788D81418EA68853A3BF7F47D77
dc.identifier.other: ORCID: /0000-0002-7548-9504/work/150764475
dc.identifier.uri: https://hdl.handle.net/2164/23764
dc.identifier.url: https://abdn.elsevierpure.com/en/publications/bdae6575-848e-4639-ae80-0a7f6dbe0f02
dc.identifier.vol: 50
dc.language.iso: eng
dc.relation.ispartof: Computational Linguistics
dc.subject: QA76 Computer software
dc.subject.lcc: QA76
dc.title: Common Flaws in Running Human Evaluation Experiments in NLP
dc.type: Journal article

Files

Original bundle

Name: Thomson_etal_CL_Common_Flaws_in_VOR.pdf
Size: 205.87 KB
Format: Adobe Portable Document Format
