Graduate student: | 邵越洋 Shao, Yue-Yang |
---|---|
Thesis title: | 評分者趨中效應指標表現效果探討 Effects of five indicators to detect the Rater’s Centrality |
Advisor: | 陳柏熹 Chen, Po-Hsi |
Degree: | Master |
Department: | 教育心理與輔導學系 Department of Educational Psychology and Counseling |
Year of publication: | 2018 |
Academic year of graduation: | 106 |
Language: | Chinese |
Number of pages: | 77 |
Chinese keywords: | 評分者趨中效應, 評分者嚴苛度, 評分樣本數 |
English keywords: | rater’s centrality, severity, rating samples |
DOI URL: | http://doi.org/10.6345/THE.NTNU.DEPC.029.2018.F02 |
Thesis type: | Academic thesis |
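The purpose of this study is to examine how well five different indicators perform in judging the degree of rater centrality and to offer recommendations for their use. The study has two parts, a simulation study and an empirical study. The simulation study has three independent variables: the five indicators, the rater's rating sample size, and the degree of rater severity. The dependent variables are the accuracy and sensitivity of each indicator and the precision with which the IRT model underlying each indicator estimates examinee ability. The empirical study analyzes whether the indicators give consistent results when detecting the degree of rater centrality, and discusses the cases in which they do not.
The results show that the accuracy ranking of the five indicators is the same regardless of rating sample size. The indicators perform better as the rating sample size increases, and all five perform better when rater severity is moderate than when the rater is lenient or severe. The two models also differ to some extent in their estimates of examinee ability. The choice of indicator for judging rater centrality should therefore depend on rating sample size and rater severity. When the rating sample is small (e.g., 50 ratees in this study), r_(measure,res) and r_(exp,res), both based on the MFRM, are recommended regardless of rater severity. When the rating sample is larger (e.g., 100 ratees in this study) and rater severity is moderate, r_(measure,res), r_(exp,res), r_(score,measure), SD, and ω_k', which is based on the MF-RC model, all perform well; if the rater is lenient or severe, r_(measure,res), r_(exp,res), and ω_k' are recommended. Future research could examine cut scores for the different indicators.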
This research examines how well four IRT-based indicators and one non-IRT indicator detect rater centrality. It has two parts: a simulation study and an empirical study. The simulation study uses three independent variables: the five indicators, the number of ratees, and the severity of raters. The dependent variables are Spearman's rank correlation coefficient, which is used to judge the performance of the indicators, and the RMSE for two IRT models, which is used to judge the accuracy of ability estimation. In the empirical study, the five indicators are used to judge which raters rank high in centrality, and inconsistencies among the conclusions drawn from different indicators are discussed.
The results show that the five indicators perform better as the number of rating samples grows, and that they also perform better when rater severity is moderate. In addition, the MFRM and the MF-RC model differ in their estimates of student ability.
Some recommendations for using the indicators to detect rater centrality follow. If the rating sample is small (e.g., 50), r_(measure,res) and r_(exp,res) are recommended. If the rating sample is large (e.g., 100) and rater severity is moderate, r_(measure,res), r_(exp,res), r_(score,measure), SD, and ω_k', which is based on the MF-RC model, all perform acceptably; however, when the rater is too lenient or too harsh, r_(measure,res), r_(exp,res), and ω_k' are recommended.
Future research could investigate cut scores for the indicators used to detect rater centrality.
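The two evaluation criteria named in the abstract, Spearman's rank correlation and RMSE, can be illustrated with a minimal sketch. It assumes that the rank correlation is taken between each rater's generating (true) centrality and the value an indicator assigns to that rater, and that RMSE compares generating and estimated abilities; the abstract does not spell out these details, so this is an illustration, not the thesis's procedure. All function names and the toy numbers are hypothetical.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rank correlation: the Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)


def rmse(true_values, estimates):
    """Root mean squared error between generating and estimated values."""
    squared = [(t - e) ** 2 for t, e in zip(true_values, estimates)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical toy data: generating centrality for five raters and the
# value some indicator assigned to each of them (assumed, not from the thesis).
true_centrality = [0.2, 0.5, 0.9, 1.4, 2.0]
indicator_value = [0.1, 0.7, 0.6, 1.5, 1.9]
rho = spearman_rho(true_centrality, indicator_value)  # approximately 0.9

# Hypothetical generating and estimated abilities for four examinees.
true_ability = [-1.0, 0.0, 0.5, 1.5]
estimated_ability = [-0.8, 0.1, 0.3, 1.6]
error = rmse(true_ability, estimated_ability)
```

Under these assumptions, a rank correlation closer to 1 means the indicator orders raters by centrality more nearly as the generating values do, and a smaller RMSE means the model recovers abilities more precisely.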