Graduate student: 戴禮 Nigel P. Daly
Thesis title: Investigating Local Item Dependence in the Vocabulary Levels Test (單字階層測驗之局部獨立性檢測)
Advisor: Liu, Yeu-Ting (劉宇挺)
Degree: Doctorate
Department: Department of English
Year of publication: 2019
Academic year of graduation: 107 (2018-2019)
Language: English
Pages: 193
Keywords: vocabulary testing, local item dependence, unidimensionality, Rasch model, latent trait, Vocabulary Levels Test
DOI: http://doi.org/10.6345/NTNU201901111
Document type: Academic thesis
The Vocabulary Levels Test (VLT) has been used as a placement test, a diagnostic test, and a benchmark for learning in pre-/post-test studies. Compared to other vocabulary size tests such as the VST and the Yes/No test, the VLT has received the most attention in research publications over the last 35 years, despite widespread suspicion of its item cluster format. Since each item cluster is composed of three items (definitions) and six answer options (words), and the three cluster items draw from the same set of answer options, it is suspected that answering one item can unfairly influence, or depend on, the answering of another item in the cluster. This type of Local Item Dependence (LID) is called item chaining and appears to be a flagrant violation of the basic assumption of Local Item Independence (LII) in both Classical Test Theory and Item Response Theory. If item chaining is pervasive throughout the test, it also challenges another fundamental assumption in test theory: unidimensionality, the test's capacity to measure only one trait, such as vocabulary knowledge. If both of these assumptions are substantially violated by LID, the test's reliability and validity are necessarily called into question.
The purpose of this dissertation is to investigate LID in a shortened version of the VLT (three levels instead of five) using a wider variety of triangulated Rasch modelling approaches to identify the existence and extent of LID in the VLT. Specifically, data were collected from 302 Taiwanese university students and graduates, and Winsteps was used to run two types of dimensionality tests (1. Principal Components Analysis of Residuals [PCAR] and 2. Yen's Q3 statistic, which identifies pairs of locally dependent items) on 20 different data levels:
1. three combined levels of the VLT2, 3, and 5 (1 data level)
2. each independent VLT level (3 data levels)
3. four ability groups versus combined VLT levels (4 data levels)
4. four ability groups versus three independent VLT levels (12 data levels).
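Yen's Q3 statistic used here is, in essence, the correlation between two items' Rasch residuals (observed minus model-expected responses); values around 0.3-0.7 were read in this study as weak-to-moderate dependence. A minimal Python sketch with invented ability and difficulty estimates and one artificially chained item pair (all names and values below are illustrative, not taken from the dissertation):

```python
import numpy as np

def rasch_prob(theta, b):
    """Rasch model expected score P(X=1) for each person-item pair."""
    return 1.0 / (1.0 + np.exp(-(theta[:, None] - b[None, :])))

def yen_q3(responses, theta, b):
    """Yen's Q3: item-by-item correlations of Rasch residuals."""
    resid = responses - rasch_prob(theta, b)   # observed minus expected
    return np.corrcoef(resid, rowvar=False)

# Simulate 300 test-takers on 5 items, then chain items 3 and 4:
# item 4 simply copies item 3's answer for 90% of test-takers.
rng = np.random.default_rng(1)
theta = rng.normal(0.0, 1.0, 300)           # person abilities (logits)
b = np.array([-1.0, -0.5, 0.5, 0.0, 0.0])   # item difficulties (logits)
X = (rng.random((300, 5)) < rasch_prob(theta, b)).astype(float)
copied = rng.random(300) < 0.9
X[copied, 4] = X[copied, 3]

q3 = yen_q3(X, theta, b)
# q3[3, 4] is large for the chained pair; independent pairs stay near 0.
```

The diagonal of the Q3 matrix is trivially 1; only off-diagonal pairs beyond the chosen cutoff are flagged as locally dependent.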
Two more analyses were also conducted: simulated data with non-random residuals factored out were compared to the empirical data, and items were grouped into three-item clusters to perform a Rasch analysis of testlets. In total, this study synthesized the results of 42 different analyses and qualitatively investigated the resulting problematic testlets using 1. response patterns of answer keys, distractors, and items left unanswered, and 2. word frequency and dispersion information from COCA, the largest and most up-to-date English language corpus currently available (Davies, 2008-).
Similar to previous research findings, the unidimensional Rasch analyses showed acceptable fit statistics, person and item reliability, and very little unexplained variance, especially when compared with the simulated data. The testlet analysis also did not uncover any obviously problematic testlets. However, the combination of the above 20 levels of analysis showed that more than a third of the testlets appeared either to 1. contain a pair of locally dependent (LD) items that were weakly to moderately dependent on each other (correlation of 0.3-0.7), and/or 2. contain items with substantive PCAR loadings (beyond +/- 0.3) on a dimension other than the Rasch dimension of vocabulary knowledge. Additional qualitative investigations were conducted to better understand and explain the Rasch statistical results. A subset of seven testlets that emerged from at least two of the above analyses were assumed to be the most likely candidates for problematic LID, and these were scrutinized more closely using qualitative procedures of checking item wording and word frequency.
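The PCAR check can be sketched as a principal components analysis of standardized Rasch residuals: once the Rasch dimension is removed, items that load together beyond +/- 0.3 on the first residual contrast point to a shared secondary dimension. A hedged illustration with invented data and a deliberately chained item pair (the chaining mechanism and all values are assumptions for the sketch, not the study's data):

```python
import numpy as np

def pcar_first_contrast(responses, theta, b):
    """Loadings on the first contrast of a PCA of standardized Rasch
    residuals (what remains after the Rasch dimension is removed)."""
    p = 1.0 / (1.0 + np.exp(-(theta[:, None] - b[None, :])))
    z = (responses - p) / np.sqrt(p * (1.0 - p))   # standardized residuals
    evals, evecs = np.linalg.eigh(np.corrcoef(z, rowvar=False))
    return evecs[:, -1] * np.sqrt(evals[-1])       # largest component

# Invented data: 300 persons, 5 items, items 3 and 4 chained (item 4
# copies item 3's answer for 90% of test-takers).
rng = np.random.default_rng(2)
theta = rng.normal(0.0, 1.0, 300)
b = np.array([-1.0, -0.5, 0.5, 0.0, 0.0])
p = 1.0 / (1.0 + np.exp(-(theta[:, None] - b[None, :])))
X = (rng.random((300, 5)) < p).astype(float)
copied = rng.random(300) < 0.9
X[copied, 4] = X[copied, 3]

loadings = pcar_first_contrast(X, theta, b)
flagged = np.where(np.abs(loadings) > 0.3)[0]   # the study's +/- 0.3 cutoff
# The chained items 3 and 4 load together on the first contrast.
```

Because the dependent pair dominates the residual correlation matrix, its two items carry nearly all of the first contrast, while independent items show loadings near zero.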
Although the statistical and qualitative procedures cannot conclusively show that the cause of LID is item chaining, the seven testlets share a number of characteristics that clearly create a problematic dynamic undermining the proper functioning of testlets. These characteristics include a pair of items whose difficulty measures differ considerably from the third item in the cluster, a configuration I have called a "2-vs-1 difficulty bundle"; in fact, 19 out of 30 testlets shared this configuration. When the bundled pair in a testlet was fairly close together in difficulty but far from the outlying third item, the Q3 LID analysis identified the pair as weakly or moderately locally dependent; this was the case for six testlets (20% of the total). And when this LID pair consisted of the first two items, with the first item more difficult than the second and much more difficult than the third, outlying item (with a quarter to one third of test-takers leaving the pair unanswered), the first item in the testlet was identified by the PCAR as negatively correlating with the Rasch dimension of vocabulary knowledge; this was the case for four testlets (13% of the total) in VLT3 and 5.
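One way to operationalize the "2-vs-1 difficulty bundle" described above is a simple spacing check on the three difficulty measures in a testlet; the 0.5- and 1.0-logit thresholds, and the function itself, are hypothetical illustrations rather than values from the dissertation:

```python
def bundle_type(difficulties, pair_gap=0.5, outlier_gap=1.0):
    """Classify a 3-item testlet as a '2-vs-1 difficulty bundle' when two
    difficulty measures sit close together (within pair_gap logits) and
    the third sits far from them (at least outlier_gap logits away);
    both thresholds are illustrative, not the study's."""
    lo, mid, hi = sorted(difficulties)
    if mid - lo <= pair_gap and hi - mid >= outlier_gap:
        return "easy pair + hard outlier"
    if hi - mid <= pair_gap and mid - lo >= outlier_gap:
        return "hard pair + easy outlier"
    return None  # difficulties are roughly evenly spaced

# Two hard items bundled above one much easier item: the configuration
# associated above with unanswered pairs in VLT3 and VLT5.
print(bundle_type([0.1, 1.4, 1.6]))   # hard pair + easy outlier
```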
A key issue that emerged from this investigation is item difficulty in a vocabulary diagnostic test like the VLT, an issue that has been variously ignored or treated as a "nuisance variable" by researchers (Culligan, 2015). Difficulty in this type of test has never, to the best of my knowledge, been overtly theorized; instead, it has been tacitly operationalized as a function of word frequency in a corpus. Despite some unargued claims to the contrary (Schmitt et al. [2001] for the VLT, and Beglar [2007] for the VST), the assumption is that the less frequent (i.e., less common) the word, the more difficult the word-item on the VLT. This study shows that this assumption is problematic for at least two reasons. First, the Schmitt et al. (2001) VLT versions are based on outdated and small corpora with inaccurate word frequency information for all the VLT levels, especially the lower-frequency VLT3 and VLT5 levels. This is primarily because word frequency information will necessarily be inconsistent and skewed for less common words when it comes from smaller corpora that contain a relatively small number of randomly sampled texts and do not account for dispersion (i.e., how many texts in the corpus contain the word). Second, and most importantly, difficulty measures are often uncorrelated with frequency information, even when dispersion is taken into account, which shows that a learner's second language (L2) lexicon does not mirror authentic English corpora, especially beyond the first 2000 words. Suggestions are given to help bridge the gap between frequency and difficulty.
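The frequency-difficulty mismatch described above can be probed with a rank correlation between corpus frequency and Rasch difficulty; the ten frequency/difficulty pairs below are invented for the sketch (SciPy's `spearmanr` is a standard choice for this):

```python
import numpy as np
from scipy.stats import spearmanr

# Invented example: COCA-style frequencies (per million words) and Rasch
# difficulty measures (logits) for ten hypothetical word items.
freq = np.array([120.0, 95.0, 60.0, 44.0, 30.0, 22.0, 15.0, 9.0, 4.0, 1.5])
difficulty = np.array([-1.8, -0.2, -1.1, 0.9, -0.4, 1.6, 0.1, -0.6, 2.0, 0.7])

rho, pval = spearmanr(freq, difficulty)
# Under the frequency assumption, rho should be strongly negative
# (rarer word -> harder item); here it is only moderately negative,
# the kind of mismatch the dissertation reports.
```

A rho near -1 would vindicate the frequency assumption; a weak or moderate rho, as in this invented example, is evidence that frequency alone is a poor proxy for item difficulty.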
Alderson, J. C. (2005). Diagnosing foreign language proficiency: The interface between learning and assessment. A&C Black.
Ambridge, B., Kidd, E., Rowland, C. F., & Theakston, A. L. (2015). The ubiquity of frequency effects in first language acquisition. Journal of Child Language, 42(2), 239–273.
Baghaei, P. (2010). A comparison of three polychotomous Rasch models for super-item analysis. Psychological Test and Assessment Modeling, 52(3), 313-322.
Baghaei, P., & Aryadoust, V. (2015). Modeling local item dependence due to common test format with a multidimensional Rasch model. International Journal of Testing, 15(1), 71-87.
Baldauf, R. (1982). The effects of guessing and item dependence on the reliability and validity of recognition based cloze tests. Educational and Psychological Measurement, 42, 855–867.
Bauer, L., & Nation, P. (1993). Word families. International Journal of Lexicography, 6(4), 253-279.
Beare, K. (2018, Oct.8). How many people learn English? Thought Co. Retrieved from https://www.thoughtco.com/how-many-people-learn-english-globally-1210367.
Beglar, D. (2010). A Rasch-based validation of the Vocabulary Size Test. Language Testing, 27(1), 101-118.
Biber, D., Johansson, S., Leech, G., Conrad, S., and Finegan, E. (1999). Longman grammar of spoken and written English. Longman.
BNC (The British National Corpus, version 3; BNC XML Edition). (2007). Distributed by Bodleian Libraries, University of Oxford, on behalf of the BNC Consortium. Retrieved from http://www.natcorp.ox.ac.uk/
Bolinger, D. (1976). Meaning and memory. Forum Linguisticum I: 1-14.
Bond, T. G., & Fox, C. M. (2007). Applying the Rasch Model. Fundamental measurement in the human sciences (2nd ed.). Mahwah, NJ: Lawrence Erlbaum Associates.
Bradlow E. T., Wainer, H., & Wang, X. (1999). A Bayesian random effects model for testlets. Psychometrika, 64(2), 153–168.
Bruton, A. (2009). The vocabulary knowledge scale: A critical analysis. Language Assessment Quarterly, 6(4), 288-297.
Bybee, J. (2006). From usage to grammar: The mind’s response to repetition. Language 82 (4): 711–733.
Bybee, J. (1995). Regular morphology and the lexicon. Language and Cognitive Processes, 10(5), 425-455.
Cameron, L. (2002). Measuring vocabulary size in English as an additional language. Language Teaching Research, 6(2), 145-173.
Castaneda, R. (2017). A model-building approach to assessing Q3 values for Local Item Dependence. Doctoral dissertation, UC Merced.
CECL (Centre for English Corpus Linguistics). (2019, Jan.27). Learner corpora around the world. Louvain-la-Neuve: Université catholique de Louvain. Retrieved from https://uclouvain.be/en/research-institutes/ilc/cecl/learner-corpora-around-the-world.html.
Coaley, K. (2009). An introduction to psychological assessment and psychometrics. London, England: Sage.
Cobb, T. (2007). Computing the vocabulary demands of L2 reading. Language Learning & Technology, 11(3), 38–63.
Cobb, T. (2008). Commentary: Response to McQuillan and Krashen. Language Learning & Technology, 12(1), 109–114.
Coulson, D. (2005). Recognition speed for basic L2 vocabulary. In A paper read at the Second JACET English Vocabulary Research Group Conference, Chuo University.
Coxhead, A. (2000). A new academic word list. TESOL Quarterly, 34(2), 213-238.
Croft, W. and Cruse, D. (2004). Cognitive Linguistics. Cambridge University Press.
Culligan, B. (2015). A comparison of three test formats to assess word difficulty. Language Testing, 32(4), 503-520.
Daller, H., Milton, J., & Treffers-Daller, J. (Eds.). (2007). Modelling and assessing vocabulary knowledge. Cambridge: Cambridge University Press.
Dang, T. N. Y., & Webb, S. (2016). Making an essential word list for beginners. In Nation, I. S. P., (Ed.) Making and using word lists for language learning and testing (pp. 153-167). John Benjamins, Amsterdam.
Davies, A. (2003). Three heresies of language testing research. Language Testing, 20(4), 355-368.
Davies, M. (2008-) The Corpus of Contemporary American English (COCA): 560 million words, 1990-present. Retrieved from https://corpus.byu.edu/coca/.
D'Anna, C. A., Zechmeister, E. B., & Hall, J. W. (1991). Toward a meaningful definition of vocabulary size. Journal of Reading Behavior, 23(1), 109-122.
De Ayala, R. J. (2010). Item Response Theory. In G. R. Hancock, R. O. Mueller, & L. M. Stapleton, L. M. (Eds.), The reviewer’s guide to quantitative methods in the social sciences. Routledge.
DeMars, C. E. (2006). Application of the bi-factor multidimensional item response theory model to testlet-based tests. Journal of Educational Measurement, 43, 145-168.
DeMars, C. (2010). Item response theory. Oxford University Press.
Ellis, N. C. (2002a). Frequency effects in language acquisition: A review with implications for theories of implicit and explicit language acquisition. Studies in Second Language Acquisition, 24, 143–188.
Ellis, N. C. (2002b). Reflections on frequency effects in language processing. Studies in Second Language Acquisition. 24, 297–339. Retrieved from https://doi.org/10.1017/s0272263102002140
Ellis, N. C., & Larsen‐Freeman, D. (2009). Constructing a second language: Analyses and computational simulations of the emergence of linguistic constructions from usage. Language Learning, 59(s1), 90-125.
Ellis, R. (1994). The study of second language acquisition. Oxford: Oxford University.
Embretson, S. E., & Reise, S. P. (2000). Item response theory for psychologists. Mahwah, NJ: Lawrence Erlbaum Associates.
Engelhard Jr, G. (2013). Invariant measurement: Using Rasch models in the social, behavioral, and health sciences. Routledge.
Fisher, R. A. (1922). On the mathematical foundations of theoretical statistics. Philosophical Transactions of the Royal Society A, 222, 309–368.
Gardner, D., & Davies, M. (2013). A new academic vocabulary list. Applied linguistics, 35(3), 305-327.
Gries, S. (2006). Some proposals towards more rigorous corpus linguistics. Zeitschrift für Anglistik und Amerikanistik, 54(2), 191-202.
Griffin, G. F., & Harley, T. A. (1996). List learning of second language vocabulary. Applied Psycholinguistics, 17: 443-460.
Gyllstad, H. (2013). Looking at L2 vocabulary knowledge dimensions from an assessment perspective. Challenges and potential solutions. In C. Bardel, C. Lindqvist & B. Laufer (Eds.), L2 vocabulary acquisition, knowledge and use. New perspectives on assessment and corpus analysis. Eurosla Monographs Series 2, 29–56. European Second Language Association.
Gyllstad, H., Vilkaitė, L., & Schmitt, N. (2015). Assessing vocabulary size through multiple-choice formats: Issues with guessing and sampling rates. ITL - International Journal of Applied Linguistics, 166(2), 278-306.
Hagell, P. (2014). Testing rating scale unidimensionality using the principal component analysis (PCA)/t-test protocol with the Rasch model: the primacy of theory over statistics. Open Journal of Statistics, 4(6), 456-465.
Hambleton, R., Swaminathan, H., & Rogers, H. (1991). Fundamentals of Item Response Theory. Newbury Park, California: Sage Publications, Inc.
Henriksen, B. (1999). Three dimensions of vocabulary development. Studies in second language acquisition, 21(2), 303-317.
Higa, M. (1965). The psycholinguistic concept of “difficulty” and the teaching of foreign language vocabulary. Language Learning, 15(3-4), 167-179.
Hirai, A. (2014). A Review of Four Studies on Measuring Vocabulary Knowledge. Vocabulary Learning and Instruction, 3(2), 85-92.
Hoey, M. (2005). Lexical Priming: A New Theory of Words and Language. Routledge.
Huang, H. Y., & Wang, W. C. (2012). Higher order testlet response models for hierarchical latent traits and testlet-based items. Educational and Psychological Measurement, 73, 491-511.
Huhta, A., Alderson, J. C., Nieminen L., & Ullakonoja, R. (2011). Diagnosing reading in L2—predictors and vocabulary profiles. Paper presented at the ACTFL CEFR Alignment Conference, Provo, UT.
Hulstijn, J. H. (2015). Language proficiency in native and non-native speakers: Theory and research (Vol. 41). Amsterdam: John Benjamins Publishing Company.
Hulstijn, J. H. (2011). Language proficiency in native and nonnative speakers: An agenda for research and suggestions for second-language assessment. Language Assessment Quarterly, 8(3), 229-249.
Jiang, N. (2004). Semantic transfer and its implications for vocabulary teaching in a second language. The Modern Language Journal, 88(3), 416-432.
Kadota, S. (2010). The Interface between lexical and sentence processing in L2: An empirical study of Japanese EFL learners. Retrieved from Kadota Shuhei, Kwansei Gakuin University, Japan. (Research No. 19520532).
Kamimoto, T. (2014). Local item dependence on the Vocabulary Levels Test revisited. Vocabulary Learning and Instruction, 3(2), 56-68.
Kim, D. (2007). Assessing the relative performance of local item dependence indexes (Doctoral dissertation, The University of Nebraska-Lincoln).
Kline, P. (2000). Handbook of psychological testing. London: Routledge.
Knowles, E. S., & Condon, C. A. (2000). Does the rose still smell as sweet? Item variability across test forms and revisions. Psychological Assessment, 12(3), 245-252.
Koprowski, M. (2005). Investigating the usefulness of lexical phrases in contemporary coursebooks. ELT Journal, 59(4), 322-332. Retrieved from https://doi.org/10.1093/elt/cci061.
Kreiner, S., & Christensen, K. B. (2007). Validity and objectivity in health-related scales: Analysis by graphical loglinear Rasch models. In M. von Davier & C.H. Carstensen (Eds.), Multivariate and mixture distribution Rasch models (pp. 329-346). New York, NY: Springer.
Kremmel, B., & Schmitt, N. (2016). Interpreting Vocabulary Test Scores: What Do Various Item Formats Tell Us About Learners’ Ability to Employ Words? Language Assessment Quarterly, 13(4), 377-392.
Kremmel, B., & Schmitt, N. (2018). Vocabulary levels test. In J. I. Liontas, M. DelliCarpini, & J. C. Riopel (Eds.), The TESOL encyclopedia of English language teaching. John Wiley & Sons, Inc.
Kučera, H., & Francis, W. N. (1967). Computational analysis of present-day American English. Dartmouth Publishing Group.
Lai, G., Chen, C., Tsai, R., & Tseng, W. (under review). Psychometric models for local dependency in the vocabulary levels test. Submitted to Psychometrika.
Laufer, B. (2017). From word parts to full texts: Searching for effective methods of vocabulary learning. Language Teaching Research, 21(1), 5-11.
Laufer, B. (2010). Form focused instruction in second language vocabulary learning. In R.Chacón-Beltrán, C. Abello-Contesse, M.M. Torreblanca-López, & M.D. López-Jiménez (Eds.), Further insights into non-native vocabulary teaching and learning (pp.15–27). Bristol: Multilingual Matters.
Laufer, B. (1998). The development of active and passive vocabulary in a second language: same or different? Applied Linguistics, 19, 255-271.
Laufer, B. (1997). What’s in a word that makes it hard or easy? Intralexical factors affecting the difficulty of vocabulary acquisition. In M. McCarthy, & N. Schmitt (Eds.), Vocabulary description, acquisition and pedagogy (pp. 140-155). Cambridge: Cambridge University Press.
Laufer, B., & Ravenhorst-Kalovski, G. C. (2010). Lexical threshold revisited: Lexical text coverage, learners' vocabulary size and reading comprehension. Reading in a foreign language, 22(1), 15-30.
Li, L., & MacGregor, L. J. (2010). Investigating the receptive vocabulary size of university-level Chinese learners of English: how suitable is the Vocabulary Levels Test? Language and Education, 24(3), 239-249.
Linacre, J. M. (2017). A users guide to Winsteps Rasch model computer program: Program manual 4.0.0. Chicago: Winsteps.
Linacre, J. M. (2017, Sept. 13). Maximum number of items for smaller sample sizes? [Msg 4]. Message posted to Rasch measurement Online Forum, http://raschforum.boards.net/thread/780/maximum-number-items-smaller-sample?page=1&scrollTo=3821.
Linacre, J. M. (2016, Jul 12). Testlets-local-independence order. Message posted to Rasch Measurement Forum. Retrieved from http://raschforum.boards.net/thread/517/testlets-local-independence-order
Lucke, J. F. (2005). “Rassling the hog”: The influence of correlated item error on internal consistency, classical reliability, and congeneric reliability. Applied Psychological Measurement, 29(2), 106-125.
Ludlow, L. H. (2002). Residuals: Trash or treasures. Popular Measurement, 4, 1-7.
Marais, I. (2013). Local Dependence. In Christensen, K. B., Kreiner, S., & Mesbah, M. (Eds.), Rasch models in health (pp. 111-130). John Wiley & Sons.
Marais, I., & Andrich, D. (2008a). Effects of varying magnitude and patterns of local dependence in the unidimensional Rasch model. Journal of Applied Measurement, 9(2), 1-20.
Marais, I., & Andrich, D. (2008b). Formalizing dimension and response violations of local independence in the unidimensional Rasch model. Journal of Applied Measurement, 9(3), 200-15.
Masrai, A., & Milton, J. (2015). Word difficulty and learning among native Arabic learners of EFL. English Language Teaching, 8(6), 1-10.
Matikainen, T. J. (2011). Semantic representation of L2 lexicon in Japanese university students (Doctoral dissertation). Temple University, Japan.
McLean, S., Kramer, B., & Beglar, D. (2015). The creation and validation of a listening vocabulary levels test. Language Teaching Research, 19(6), 741-760.
McLean, S., & Kramer, B. (2015). The creation of a new Vocabulary Levels Test. SHIKEN, 19(2), 1-11.
McNamara, T. F. (1996). Measuring second language performance. Addison Wesley Longman.
McNamara, T., & Knoch, U. (2012). The Rasch wars: The emergence of Rasch measurement in language testing. Language Testing, 29(4), 555-576.
McQuillan, J., & Krashen, S. (2008). Commentary: Can free reading take you all the way? A response to Cobb (2007). Language Learning & Technology, 12(1), 104–108.
Meara, P. (1992). EFL vocabulary tests. ERIC Clearinghouse.
Meara, P. (1996). The dimensions of lexical competence. In Brown, G., Malmkjr, K. and Williams, J., (Eds.), Performance and competence in second language acquisition. Cambridge: Cambridge University Press, 35–53.
Meara, P. (2005). Designing vocabulary tests for English, Spanish and other languages. In C. Butler, S. Christopher, M. Á. Gómez González, & S. M. Doval-Suárez (Eds.) The dynamics of language use (pp. 271–285). Amsterdam, John Benjamins Press.
Meara, P., & Jones, G. (1990). Eurocentres vocabulary size test 10KA. Zurich: Eurocentres.
Messick, S. (1989). Validity. In R. L. Linn (Ed), Educational measurement (3rd ed) (pp. 13–103). New York: Macmillan.
Messick, S. (1995). Validity of psychological assessment: Validation of inferences from persons’ responses and performances as scientific inquiry into score meaning. American Psychologist, 50, 741–749.
Millsap, R. E. (2012). Statistical approaches to measurement invariance. Routledge.
Milton, J. (2009). Measuring second language vocabulary acquisition (Vol. 45). Multilingual Matters.
Milton, J. (2013). Measuring the contribution of vocabulary knowledge to proficiency in the four skills. In C. Bardel, C. Lindqvist & B. Laufer (Eds.), L2 vocabulary acquisition, knowledge and use. New perspectives on assessment and corpus analysis. Eurosla Monographs Series 2, 57-78. European Second Language Association.
Milton, J., & Fitzpatrick, T. (Eds.). (2013). Dimensions of vocabulary knowledge. Palgrave Macmillan.
Mochizuki, M. (2002). Exploration of two aspects of vocabulary knowledge: Paradigmatic and collocational. Annual Review of English Language Education, 13, 121-129.
Mochizuki, M. (2012). Four empirical vocabulary test studies in the three dimensional framework. Vocabulary Learning and Instruction, 1(1), 44-52.
Nation, I. S. P. (2001). Learning vocabulary in another language. Ernst Klett Sprachen.
Nation, I. S. P. (2006). How large a vocabulary is needed for reading and listening? Canadian Modern Language Review, 63(1), 59-82.
Nation, I. S. P. (2012). Information on the BNC/COCA lists. Retrieved from http://www.victoria.ac.nz/lals/about/staff/paul-nation.
Nation, I. S. P. (2013). Learning vocabulary in another language. Second edition. Cambridge: Cambridge University Press.
Nation, I. S. P., & Beglar, D. (2007). A vocabulary size test. The Language Teacher, 31(7), 9-13.
Nation, P., & Gu, P. Y. (2007). Focus on vocabulary. Sydney: National Centre for English Language Teaching and Research.
Nation, I. S. P., & Webb, S. A. (2011). Researching and analyzing vocabulary. Heinle, Cengage Learning.
Novick, M.R. (1966). The axioms and principal results of classical test theory. Journal of Mathematical Psychology, 3(1), 1-18
Okamoto, M. (2015). Is corpus word frequency a good yardstick for selecting words to teach? Threshold levels for vocabulary selection. System, 51, 1-10.
Qian, D.D. (2002). Investigating the relationship between vocabulary knowledge and academic reading performance: An assessment perspective. Language Learning, 52, 513-536.
Read, J. (2000). Assessing vocabulary. Cambridge University Press.
Read, J. (1993). The development of a new measure of L2 vocabulary knowledge. Language testing, 10(3), 355-371.
Read, J. (1988). Measuring the vocabulary knowledge of second language learners. RELC Journal, 19(2), 12–25.
Richards, J. C. (1976). The role of vocabulary teaching. TESOL Quarterly, 10(1): 77-89.
Ronald, J., & Kamimoto, T. (2014). Confidence in word knowledge. In J. Milton and T. Fitzpatrick (Eds.), Dimensions of Vocabulary Knowledge, 154–72. Basingstoke: Palgrave Macmillan.
Schmitt, N. (2008). Instructed second language vocabulary learning. Language teaching research, 12(3), 329-363.
Schmitt, N. (2010). Researching vocabulary: A vocabulary research manual. Palgrave Macmillan.
Schmitt, N., Jiang, X., & Grabe, W. (2011). The percentage of words known in a text and reading comprehension. The Modern Language Journal, 95(1), 26-43.
Schmitt, N., & Meara, P. (1997). Researching vocabulary through a word knowledge framework: Word associations and verbal suffixes. Studies in Second Language Acquisition, 19, 17-36.
Schmitt, N., & Schmitt, D. (2014). A reassessment of frequency and vocabulary size in L2 vocabulary teaching. Language Teaching, 47(04), 484-503.
Schmitt, N., Schmitt, D., & Clapham, C. (2001). Developing and exploring the behaviour of two new versions of the Vocabulary Levels Test. Language Testing, 18(1), 55-88.
Schoonen, R., & Verhallen, M. (2008). The assessment of deep word knowledge in young first and second language learners. Language Testing, 25, 211-236.
Sevigny, P., & Ramonda, K. (2013). Vocabulary: What should we test? In JALT2012 Conference Proceedings. Tokyo: JALT.
Sinclair, J. (1991). Corpus, concordance and collocation. Oxford: Oxford University Press.
Smith Jr, E. V. (2002). Understanding Rasch measurement: Detecting and evaluating the impact of multidimensionality using item fit statistics and principal component analysis of residuals. Journal of Applied Measurement, 3, 205-231.
Stæhr, L. S. (2008). Vocabulary size and the skills of listening, reading and writing. Language Learning Journal, 36(2), 139-152.
Stewart, J., Batty, A. O., & Bovee, N. (2012). Comparing multidimensional and continuum models of vocabulary acquisition: An empirical examination of the vocabulary knowledge scale. TESOL Quarterly, 46(4), 695-721.
Stout, W. (1987). A nonparametric approach for assessing latent trait unidimensionality. Psychometrika, 52(4), 589-617.
Tennant, A. and Pallant, J. (2006). Unidimensionality Matters. Rasch Measurement Transactions, 20, 1048-1051.
Thorndike, E. L., & Lorge, I. (1944). The teacher's wordbook of 30,000 words. New York: Columbia University, Teachers College.
Tonzar, C., Lotto, L., & Job, R. (2009). L2 vocabulary acquisition in children: Effects of learning method and cognate status. Language Learning, 59(3), 623-646.
Traub, R. (1997). Classical Test Theory in historical perspective. Educational Measurement: Issues and Practice, 16 (4), 8-14.
Tulving, E., & Watkins, M. J. (1973). Continuity between recognition and recall. American Journal of Psychology, 86, 352-373.
Vermeer, A. (2001). Breadth and depth of vocabulary in relation to L1/L2 acquisition and frequency of input. Applied psycholinguistics, 22(2), 217-234.
Webb, S. A., & Sasao, Y. (2013). New directions in vocabulary testing. RELC Journal, 44(3), 263-277.
Webb, S., Sasao, Y., & Ballance, O. (2017). Developing and validating new versions of the Vocabulary Levels Test. International Journal of Applied Linguistics, 168(1), 22-69.
Wesche, M., & Paribakht, T. S. (1996). Assessing Second Language Vocabulary Knowledge: Depth Versus Breadth. Canadian Modern Language Review, 53(1), 13-40.
Wolfe, E. W., & Smith Jr., E. V. (2007a). Instrument development tools and activities for measure validation using Rasch models: Part I–Instrument development tools. Journal of Applied Measurement, 8, 97–123.
Wolfe E. W., & Smith Jr., E. V. (2007b). Instrument development tools and activities for measure validation using Rasch models: Part II–Validation activities. Journal of Applied Measurement, 8, 204–234.
Wolter, B. W. (2005). V-links: a new approach to assessing depth of word knowledge. Doctoral dissertation, University of Wales Swansea.
Wray, A. (2002). Formulaic Language and the Lexicon. Cambridge University Press.
Wright, B. D., Linacre, J. M., Gustafson, J. E., & Martin-Löf, P. (1994). Reasonable mean-square fit values. Rasch Measurement Transactions, 8(3), 370. Retrieved from http://www.rasch.org/rmt/rmt83b.htm
Wright, B. D., & Stone, M H. (2004). Making Measures. Chicago, IL: Phaneron Press.
Xing, P., & Fulcher, G. (2007). Reliability assessment for two versions of Vocabulary Levels Tests. System, 35(2), 182-191.
Yen, W. M. (1984). Effects of local item dependence on the fit and equating performance of the three-parameter logistic model. Applied Psychological Measurement, 8, 125-145.
Yen, W. M. (1993). Scaling performance assessments: Strategies for managing local item dependence. Journal of Educational Measurement, 30, 187-213.
Zhang, B. (2010). Assessing the accuracy and consistency of language proficiency classification under competing measurement models. Language Testing, 27(1), 119–140.
Zhang, Q., & Yang, H. (2015). Pacific Rim Objective Measurement Symposium (PROMS), 2014 Conference Proceedings. Berlin: Springer-Verlag.
Zimmerman, D. W., & Williams, R. H. (1965). Effect of chance success due to guessing on error of measurement in multiple-choice tests. Psychological Reports, 16, 1193–1196.