Common Flaws in Running Human Evaluation Experiments in NLP

Thomson, Craig; Reiter, Ehud; Belz, Anya

dc.contributor.author	Thomson, Craig
dc.contributor.author	Reiter, Ehud
dc.contributor.author	Belz, Anya
dc.contributor.institution	University of Aberdeen.Computing Science	en
dc.contributor.institution	University of Aberdeen.Computational Linguistics at Aberdeen	en
dc.date.accessioned	2024-07-09T09:36:05Z
dc.date.available	2024-07-09T09:36:05Z
dc.date.issued	2024-06-01
dc.description	We would first like to thank all authors who took the time to respond to our requests for information; we could not have done this work without their help! We would also like to thank all the people at the ReproHum partner labs who helped us carry out Phase 1 of the multi-lab multi-test study: Gavin Abercrombie, Jose M. Alonso-Moral, Mohammad Arvan, Anouck Braggaar, Jackie Cheung, Mark Cieliebak, Elizabeth Clark, Kees van Deemter, Tanvi Dinkar, Ondřej Dušek, Steffen Eger, Qixiang Fang, Mingqi Gao, Albert Gatt, Dimitra Gkatzia, Javier González-Corbelle, Dirk Hovy, Manuela Hürlimann, Takumi Ito, John D. Kelleher, Emiel Krahmer, Filip Klubicka, Huiyuan Lai, Chris van der Lee, Yiru Li, Saad Mahamood, Margot Mieskes, Emiel van Miltenburg, Pablo Mosteiro, Malvina Nissim, Natalie Parde, Ondřej Plátek, Verena Rieser, Jie Ruan, Joel Tetreault, Antonio Toral, Xiaojun Wan, Leo Wanner, Lewis Watson, and Diyi Yang. Special thanks to Mohammad Arvan, Saad Mahamood, Emiel van Miltenburg, Natalie Parde, Barkavi Sundararajan, as well as the editor and anonymous reviewers for their very helpful suggestions for improving this paper. We also thank Cindy Robinson for providing information about errata in TACL.	en
dc.description.status	Peer reviewed	en
dc.format.extent	11
dc.format.extent	210807
dc.identifier	283538319
dc.identifier	bdae6575-848e-4639-ae80-0a7f6dbe0f02
dc.identifier	85185871462
dc.identifier.citation	Thomson, C, Reiter, E & Belz, A 2024, 'Common Flaws in Running Human Evaluation Experiments in NLP', Computational Linguistics, vol. 50, no. 2, pp. 795–805. https://doi.org/10.1162/coli_a_00508	en
dc.identifier.doi	10.1162/coli_a_00508
dc.identifier.iss	2	en
dc.identifier.issn	0891-2017
dc.identifier.other	RIS: urn:997FD788D81418EA68853A3BF7F47D77
dc.identifier.other	ORCID: /0000-0002-7548-9504/work/150764475
dc.identifier.uri	https://hdl.handle.net/2164/23764
dc.identifier.url	https://abdn.elsevierpure.com/en/publications/bdae6575-848e-4639-ae80-0a7f6dbe0f02	en
dc.identifier.vol	50	en
dc.language.iso	eng
dc.relation.ispartof	Computational Linguistics	en
dc.subject	QA76 Computer software	en
dc.subject.lcc	QA76	en
dc.title	Common Flaws in Running Human Evaluation Experiments in NLP	en
dc.type	Journal article	en

Files

Original bundle

Name: Thomson_etal_CL_Common_Flaws_in_VOR.pdf
Size: 205.87 KB
Format: Adobe Portable Document Format

Collections

All research