Author: 黃宏宇 (Hung-Yu Huang)
Thesis title: 階層結構試題反應模式及其在電腦適性測驗之應用 (The Hierarchical Structure Item Response Model and its Application to Computerized Adaptive Testing)
Advisors: 陳柏熹 (Chen, Po-Hsi); 王文中 (Wang, Wen-Chung)
Degree: Doctoral
Department: Department of Educational Psychology and Counseling (教育心理與輔導學系)
Year of publication: 2009
Graduating academic year: 97 (ROC calendar)
Language: English
Number of pages: 196
Keywords (Chinese): 階層結構試題反應模式、貝氏估計法、電腦適性測驗、題庫安全、新近選題法、試題反應理論
Keywords (English): hierarchical structure item response model, Bayesian estimation, computerized adaptive testing, test security, modern item selection rules, item response theory
Thesis type: Academic thesis
This study develops an item response model with hierarchically structured latent variables, termed the hierarchical structure item response model, applies it to computerized adaptive testing, and examines its effectiveness. The dissertation comprises three simulation studies. The first study uses Markov chain Monte Carlo estimation within a Bayesian framework to estimate model parameters and assess model fit; the results show that the model-fit indices developed in this study and the Bayesian DIC are suitable for diagnosing how well the model fits the data, and that Bayesian estimation provides good recovery of the model parameters. The second study derives computerized adaptive testing algorithms for the hierarchical structure item response model; the results show that the item selection and ability estimation procedures developed by modifying the adaptive testing algorithm for testlet models yield the best ability estimation performance. The third study modifies the traditional maximum-information item selection rule by adding a random component early in the test to control estimation error at the early stages; the results show that these modern item selection methods increase item pool usage and reduce item exposure rates and test mean overlap rates, supporting the claim that the new item selection rules can balance item bank security and measurement precision. Finally, the author offers several suggestions for future research and practical applications.
This study aims to construct item response theory (IRT) models with a higher-order latent trait structure within a multidimensional IRT (MIRT) framework and to implement these models in computerized adaptive testing (CAT) with various modern item selection rules, assessing their effectiveness through simulation studies. Sheng and Wikle (2008) proposed Bayesian multidimensional IRT models with a hierarchical structure (referred to in this study as the hierarchical structure item response model, HSIRM) and conducted several simulation studies to support their assumptions. However, certain questionable features of their simulation design made their findings unclear and left many questions unanswered. Unlike the original models proposed by Sheng and Wikle (2008), the HSIRM constructs its latent trait structure on the basis of factor analysis rather than principal components analysis. Because the original study is open to question, and because novel IRT models must be shown to be stable and reliable before they are implemented in a CAT environment, it is important to revise the proposed models and assess their estimation efficiency. Consequently, three separate studies of the HSIRM were conducted. The first study examined the Bayesian estimation method and Bayesian model checking techniques and then assessed model parameter recovery. The second study derived CAT algorithms for the HSIRM and evaluated the accuracy of overall and domain ability estimates under a variety of conditions. Finally, modern item selection methods were incorporated into the HSIRM-based CAT to better control item exposure and overlap rates.
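To make the hierarchical structure concrete, the following is a minimal sketch, under common 2PL assumptions, of the kind of two-level latent trait structure an HSIRM posits; the notation (\lambda_d for the loading of domain d on the overall ability, \varepsilon_{vd} for a domain-specific residual) is illustrative and is not necessarily the dissertation's:

P(y_{vdi} = 1 \mid \theta_{vd}) = \frac{\exp\{a_{di}(\theta_{vd} - b_{di})\}}{1 + \exp\{a_{di}(\theta_{vd} - b_{di})\}}, \qquad \theta_{vd} = \lambda_d \theta_v + \varepsilon_{vd}, \qquad \varepsilon_{vd} \sim N(0, \sigma_d^2),

where \theta_v is the overall (higher-order) ability of examinee v, \theta_{vd} is that examinee's ability on domain d, and a_{di} and b_{di} are the discrimination and difficulty of item i in domain d.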
In the first study, simulations were conducted to assess the performance of Bayesian model checking techniques, including the posterior predictive model checking (PPMC) method, the pseudo-Bayes factor (PsBF) approach, and the Bayesian deviance information criterion (DIC), and then to evaluate model parameter recovery by comparison with the true model. The data sets were generated from a unidimensional IRT (UIRT) model, a MIRT model with an identical latent trait (MIRT-I), and HSIRMs with a high ability correlation (HSIRM-H) and a low ability correlation (HSIRM-L), each in 1PL and 2PL forms. The analysis models were the 1PL- and 2PL-HSIRMs. Five indicators were incorporated into the PPMC procedure: the SD of the biserial correlations (Bis), the Bayesian chi-square test (BChi), the reproduced correlation matrix test (Rcor), the observed-score covariance between subtests test (Cov), and the identical latent trait correlation test (Id). The results suggest that, when implementing PPMC with the HSIRM, it is advisable to fit the data to the 1PL-HSIRM first and to screen out, according to the criteria that worked well, the inappropriate data sets generated from the UIRT, MIRT-I, and MIRT-S models, because the PPMC method performs better when fitting the 1PL-HSIRM. Among the relative model fit criteria, only the DIC consistently selected the correct model for the data; the PsBF always preferred the simplest model, regardless of which model was true. With respect to model parameter recovery, most estimates were unbiased, suggesting that Bayesian procedures such as MCMC can estimate HSIRM model parameters precisely. Finally, compared with two-stage approaches such as the two-stage CFA and averaging procedures, the HSIRM, as a one-stage approach, yielded the most accurate overall ability estimates. A further advantage of the one-stage method over the two-stage methods is that the HSIRM provides a standard error of measurement for each examinee's overall ability directly, as the standard deviation of the posterior distribution, whereas among the two-stage methods only the CFA approach can approximate the standard error of measurement, and only through an indirect formula. Most importantly, however, neither the two-stage CFA approach nor the averaging approach matched the test design structure used as the standard in this study; that is, the way a test was designed is the only appropriate way to analyze the corresponding data set.
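As a reminder of what the relative fit comparison is based on, the DIC of Spiegelhalter et al. (2002) and the posterior predictive p-value used in PPMC can be written in their standard, generic forms; these are textbook definitions, not formulas specific to this dissertation:

\mathrm{DIC} = \bar{D} + p_D, \qquad p_D = \bar{D} - D(\bar{\theta}), \qquad D(\theta) = -2 \log L(\mathbf{y} \mid \theta),

p_{\mathrm{ppp}} = \Pr\{\, T(\mathbf{y}^{\mathrm{rep}}, \theta) \ge T(\mathbf{y}, \theta) \mid \mathbf{y} \,\},

where \bar{D} is the posterior mean deviance, D(\bar{\theta}) is the deviance evaluated at the posterior means of the parameters, and T is a discrepancy measure such as the SD of the biserial correlations. Smaller DIC indicates better fit after penalizing effective complexity, and posterior predictive p-values near 0 or 1 signal misfit.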
In the second study, three HSIRM-based CAT algorithms were proposed: a multidimensional CAT, a unidimensional CAT, and an HSIRM-CAT approach. The two-stage methods, the UCAT-with-CFA and UCAT-with-averaging approaches, served as baselines for comparison with the one-stage methods. The results showed that, except for the unidimensional CAT approach, the one-stage methods always produced more accurate estimates of both overall and domain abilities than the two-stage methods, suggesting that the multidimensional CAT and HSIRM-CAT approaches are reliable enough to administer in a CAT context. Because it neglects the random effects of the subtests, the unidimensional CAT approach had difficulty estimating overall and domain abilities precisely, especially under the diverse factor loading setting and the 2PL-HSIRM condition. Of the two remaining methods, the HSIRM-CAT approach is recommended because the CAT is based on the same HSIRM that generated the item responses. A further advantage of the HSIRM-CAT approach is that it yields standard errors of measurement for the overall and domain ability estimates simultaneously after each adaptive item is administered, so that a fixed-precision stopping rule can be implemented if necessary, whereas the multidimensional CAT approach does not.
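For illustration only, the sketch below shows one step of a greatly simplified adaptive testing cycle, using a unidimensional 2PL model, EAP scoring, and maximum Fisher information item selection; all function and variable names are hypothetical, and the dissertation's HSIRM-CAT algorithms additionally track overall and domain abilities jointly rather than a single ability.

import numpy as np

def p_correct(theta, a, b):
    # 2PL probability of a correct response at ability theta
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def eap_estimate(responses, a_admin, b_admin, grid=np.linspace(-4.0, 4.0, 81)):
    # EAP ability estimate under a standard normal prior, computed on a
    # fixed quadrature grid from the items administered so far
    prior = np.exp(-0.5 * grid ** 2)
    likelihood = np.ones_like(grid)
    for y, a_i, b_i in zip(responses, a_admin, b_admin):
        p = p_correct(grid, a_i, b_i)
        likelihood *= p ** y * (1.0 - p) ** (1 - y)
    posterior = prior * likelihood
    posterior /= posterior.sum()
    theta_hat = float(np.sum(grid * posterior))
    se = float(np.sqrt(np.sum((grid - theta_hat) ** 2 * posterior)))
    return theta_hat, se  # posterior mean and posterior SD (standard error)

def select_next_item(theta_hat, a_pool, b_pool, administered):
    # choose the unused pool item with maximum Fisher information at theta_hat
    p = p_correct(theta_hat, a_pool, b_pool)
    info = a_pool ** 2 * p * (1.0 - p)
    info[list(administered)] = -np.inf
    return int(np.argmax(info))

In an operational CAT, these two steps alternate until a fixed test length or a fixed-precision stopping rule (a target value of the standard error) is reached.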
In the last study, the progressive (PG; Barrada, Olea, Ponsoda, & Abad, 2008; Revuelta & Ponsoda, 1998) and proportional (PP; Barrada et al., 2008; Segall, 2004a) methods were incorporated into the HSIRM-based CAT procedures to improve item pool security and measurement precision simultaneously, with the point Fisher information (PFI) method serving as the comparison. In addition, the Sympson and Hetter online freeze (SHOF; Chen, 2004, 2005) procedure and content balancing controls were implemented in the process. The results showed that the PG and PP methods can reduce item exposure rates and improve item pool usage, and that the effect becomes larger as the acceleration parameter increases. However, the exposure rate of each item could not be guaranteed to stay below a pre-specified level unless SHOF was implemented. As the acceleration parameter increased, the item overlap rate decreased for both the PG and PP methods, but the overall RMSE did not always increase. When the PG method improved measurement precision through a smaller acceleration parameter, the difference in overall RMSEs between the PFI and PG methods became much smaller. In sum, the HSIRM-CAT approach with both the PG and SHOF procedures can improve item bank security with little or no loss in measurement precision and can provide test information throughout the CAT, as evidenced by overall RMSEs equivalent to those of the PFI method together with a lower test mean overlap rate.
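For readers unfamiliar with the PG rule, the sketch below gives one common formulation of the progressive criterion: a random component dominates item selection early in the test and Fisher information dominates later, with the acceleration parameter governing how quickly the weight shifts. This follows the general description in Revuelta and Ponsoda (1998) and Barrada et al. (2008); the specific weight function and all names are illustrative assumptions rather than the dissertation's exact algorithm.

import numpy as np

def progressive_select(info, position, test_length, accel, rng, administered=()):
    # info: Fisher information of every pool item at the current ability estimate
    # position: 1-based serial position of the item about to be selected
    # accel: acceleration parameter; larger values keep more weight on the
    #        random component for longer before information takes over
    if test_length > 1:
        w = ((position - 1) / (test_length - 1)) ** accel
    else:
        w = 1.0
    random_part = rng.uniform(0.0, info.max(), size=info.shape)
    criterion = (1.0 - w) * random_part + w * info
    criterion[list(administered)] = -np.inf  # never readminister an item
    return int(np.argmax(criterion))

# usage sketch (info would come from the operational pool and ability estimate):
# rng = np.random.default_rng(0)
# next_item = progressive_select(info, position=3, test_length=20,
#                                accel=2.0, rng=rng, administered={5, 17})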
Finally, study limitations are noted and suggestions for future investigations are proposed.
Adams, R. J., Wilson, M., & Wang, W.-C. (1997). The multidimensional random coefficients multinomial logit model. Applied Psychological Measurement, 21, 1-23.
Adams, R. J., Wilson, M., & Wu, M. (1997). Multilevel item response models: An approach to error in variables regression. Journal of Educational and Behavioral Statistics, 22, 47-76.
Adams, R. J., & Wu, M. L. (Eds.). (2002). PISA 2000 technical report. Paris: OECD Publications.
Albert, J. H. (1992). Bayesian estimation of normal ogive item response curves using Gibbs sampling. Journal of Educational Statistics, 17, 251-269.
Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19(6), 716-723.
Andrich, D. (1978). A rating formulation for ordered response categories. Psychometrika, 43, 561-573.
Baker, F. B. (1998). An investigation of the item parameter recovery characteristics of a Gibbs sampling procedure. Applied Psychological Measurement, 22, 163-169.
Baker, F., & Kim, S.-H. (2004). Item response theory: Parameter estimation techniques (2nd ed.). New York: Marcel Dekker.
Barrada, J. R., Olea, J., Ponsoda, V., & Abad, F. J. (2008). Incorporating randomness in the Fisher information for improving item-exposure control in CATs. British Journal of Mathematical and Statistical Psychology, 61(2), 493-513.
Bayarri, S., & Berger, J. (2000). P-values for composite null models. Journal of the American Statistical Association, 95, 1127–1142.
Beguin, A. A., & Glas, C. A. W. (2001). MCMC estimation and some model-fit analysis of multidimensional IRT models. Psychometrika, 66(4), 541–562.
Birnbaum, A. (1968). Some latent trait models and their use in inferring an examinee’s ability. In F. M. Lord & M. R. Novick (Eds.), Statistical theories of mental test scores. Reading, MA: Addison-Wesley.
Bock, R. D., & Aitkin, M. (1981). Marginal maximum likelihood estimation of item parameters: Application of an EM algorithm. Psychometrika, 46, 443-459.
Bock, R. D., & Mislevy, R. J. (1982). Adaptive EAP estimation of ability in a microcomputer environment. Applied Psychological Measurement, 6, 431-444.
Bolt, D. M., Cohen, A. S., & Wollack, J. A. (2001). A mixture model for multiple choice data. Journal of Educational and Behavioral Statistics, 26(4), 381-409.
Bolt, D. M., Cohen, A. S., & Wollack, J. A. (2002). Item parameter estimation under conditions of test speededness: Application of a mixture Rasch model with ordinal constraints. Journal of Educational Measurement, 39(4), 331–348.
Bolt, D. M., & Lall, V. F. (2003). Estimation of compensatory and noncompensatory multidimensional item response models using Markov chain Monte Carlo. Applied Psychological Measurement, 29, 395-414.
Bond, T., & Fox, C. (2001). Applying the Rasch model: Fundamental measurement in the human sciences. Mahwah, NJ: Lawrence Erlbaum.
Bradlow, E. T., Wainer, H., & Wang, X. (1999). A Bayesian random effects model for testlets. Psychometrika, 64, 153-168.
Brown, W. (1910). Some experimental results in the correlation of mental abilities. British Journal of Psychology, 3, 296-322.
Cao, J., & Stokes, S. L. (2008). Bayesian IRT guessing models for partial guessing behaviors. Psychometrika, 73, 209-230.
Chang, S.-W., & Ansley, T. N. (2003). A comparative study of item exposure control methods in computerized adaptive testing. Journal of Educational Measurement, 40, 71-103.
Chang, H.-H., & Ying, Z. (1996). A global information approach to computerized adaptive testing. Applied Psychological Measurement, 20, 213-229.
Chen, S.-Y. (2004). Controlling item exposure on the fly in computerized adaptive testing. Paper presented at the annual meeting of the Taiwanese Psychological Association, Taipei, Taiwan.
Chen, S.-Y. (2005). Controlling item exposure and test overlap on the fly in computerized adaptive testing. Paper presented at the IMPS 2005 annual meeting of the Psychometric Society, Tilburg, The Netherlands.
Chen, S.-Y., & Ankenmann, R. D. (2004). Effects of practical constraints on item selection rules at the early stages of computerized adaptive testing. Journal of Educational Measurement, 41, 149-174.
Chen, S.-Y., Ankenmann, R. D., & Chang, H.-H. (2000). A comparison of item selection rules at the early stages of computerized adaptive testing. Applied Psychological Measurement, 24, 241-255.
Chen, S.-Y., Ankenmann, R. D., & Spray, J. A. (2003). The relationship between item exposure and test overlap in computerized adaptive testing. Journal of Educational Measurement, 40, 129–145.
Chen, S.-Y., & Lei, P.-W. (2005). Controlling item exposure and test overlap in computerized adaptive testing. Applied Psychological Measurement, 29, 204–217.
Chib, S., & Greenberg, E. (1995). Understanding the Metropolis-Hastings algorithm. American Statistician, 49, 327-335.
Cohen, A. S., & Bolt, D. M. (2005). A mixture model analysis of differential item functioning. Journal of Educational Measurement, 42, 133-148.
Davey, T., & Parshall, C. G. (1995). New algorithms for item selection and exposure control with computerized adaptive testing. Paper presented at the annual meeting of the American Educational Research Association, San Francisco, CA.
De Boeck, P., & Wilson, M. (Eds.). (2004). Explanatory item response models: A generalized linear and nonlinear approach. New York: Springer-Verlag.
De Boeck, P. (2008). Random item IRT models. Psychometrika, 73(4), 533-559.
de la Torre, J., & Douglas, J. A. (2004). Higher-order latent trait models for cognitive diagnosis. Psychometrika, 69, 333-353.
Embretson, S. E., & Reise, S. P. (2000). Item response theory for psychologists. Mahwah, NJ: Lawrence Erlbaum.
Flaugher, R. (2000). Item pools. In H. Wainer (Ed.), Computerized adaptive testing: A primer (2nd ed., pp. 37-59). Mahwah, NJ: Lawrence Erlbaum Associates.
Fischer, G. H. (1973). Linear logistic test model as an instrument in educational research. Acta Psychologica, 37, 359-374.
Fox, J.-P., & Glas, C. A. W. (2003). Bayesian modeling of measurement error in predictor variables using item response theory. Psychometrika, 68, 169-191.
Gelfand, A. E. (1996). Model comparison using sampling-based methods. In W. R. Gilks, S. Richardson, & D. J. Spiegelhalter (Eds.), Markov chain Monte Carlo in practice (pp. 145-161). Washington, DC: Chapman & Hall.
Gelfand, A. E., & Dey, D. K. (1994). Bayesian model choice: Asymptotics and exact calculations. Journal of the Royal Statistical Society, B, 56, 501-514.
Gelfand, A. E., & Smith, A. F. M. (1990). Sampling-based approaches to calculating marginal densities. Journal of the American Statistical Association, 85, 398-409.
Geisser, S., & Eddy, W. (1979). A predictive approach to model selection. Journal of the American Statistical Association, 74, 153-160.
Gelman, A., Carlin, J. B., Stern, H. S., & Rubin, D. B. (2003). Bayesian data analysis. New York: Chapman & Hall.
Geman, S., & Geman, D. (1984). Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12, 609-628.
Goegebeur, Y., De Boeck, P., Wollack, J. A., & Cohen, A. S. (2008). A speeded item response model with gradual process change. Psychometrika, 73, 65-87.
Gustafsson, J., & Balke, G. (1993). General and specific abilities as predictors of school achievement. Multivariate Behavioral Research, 28(4), 407-434.
Guttman, I. (1967). The use of the concept of a future observation in goodness-of-fit problems. Journal of the Royal Statistical Society B, 29, 83-100.
Hambleton, R. K., & Swaminathan, H. (1985). Item response theory: Principles and applications. Norwell, MA: Kluwer Academic Publishers.
Hattie, J. (1981). Decision criteria for determining unidimensional and multidimensional normal ogive models of latent trait theory. Armidale, Australia: The University of New England, Center for Behavioral Studies.
Hoijtink, H., & Molenaar, I. W. (1997). A multidimensional item response model: Constrained latent class analysis using the Gibbs sampler and posterior predictive checks. Psychometrika, 62, 171-189.
Hoogland, J. J., & Boomsma, A. (1998). Robustness studies in covariance structure modeling: An overview and a meta-analysis. Sociological Methods and Research, 26, 329–367.
Hsu, C.-L., & Chen, S.-Y. (2007). Controlling item exposure and test overlap in variable length computerized adaptive testing. Psychological Testing, 54(2), 403-428.
Ip, E. H.-S. (2000). Adjusting for information inflation due to local dependence in moderately large item clusters. Psychometrika, 65, 73-91.
Janssen, R., Tuerlinckx, F., Meulders, M., & De Boeck, P. (2000). A hierarchical IRT model for criterion-referenced measurement. Journal of Educational and Behavioral Statistics, 25(3), 285-306.
Johnson, D. E. (1998). Applied multivariate methods for data analysts. CA: Brooks/Cole Publishing Company.
Johnson, V. E., & Albert, J. H. (1999). Ordinal data modeling. New York: Springer-Verlag.
Jöreskog, K. G., & Sörbom, D. (2001). LISREL Version 8.51[Computer software]. Chicago: Scientific Software International.
Ju, Y. (2005). Item exposure control in a-stratified computerized adaptive testing. Unpublished master’s thesis, National Chung Cheng University, Chia-Yi, Taiwan.
Kang, T., & Cohen, A. S. (2007). IRT model selection methods for dichotomous items. Applied Psychological Measurement, 31, 331-358.
Kelloway, E. K. (1998). Using Lisrel for structural equation modeling: A researcher’s guide. Thousand Oaks: Sage Publications.
Klein Entink, R. H., Fox, J.-P., & van der Linden, W. J. (2009). A multivariate multilevel approach to the modeling of accuracy and speed of test takers. Psychometrika, 74(1), 21-48.
Kingsbury, G. G., & Zara, A. R. (1989). Procedures for selecting items for computerized adaptive tests. Applied Measurement in Education, 2, 359-375.
Kingsbury, G. G., & Zara, A. R. (1991). A comparison of procedures for content-sensitive item selection in computerized adaptive tests. Applied Measurement in Education, 4, 241-261.
Li, Y., Bolt, D. M., & Fu, J. (2006). A comparison of alternative models for testlets. Applied Psychological Measurement, 30(1), 3-21.
Masters, G. N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47, 149-174.
McBride, J. R., & Martin, J. T. (1983). Reliability and validity of adaptive ability tests in a military setting. In D. J. Weiss (Ed.), New Horizons in Testing (pp. 223-226). New York, NY: Academic Press.
McCulloch, C. E., & Searle, S. R. (2001). Generalized, linear, and mixed models. New York: Wiley.
McKinley, R. L., & Reckase, M. D. (1982). The use of the general Rasch model with multidimensional item response data (Research Report ONR 82-1). Iowa City, IA: American College Testing.
Nering, M. L., Davey, T., & Thompson, T. (1998). A hybrid method for controlling item exposure in computerized adaptive testing. Paper presented at the annual meeting of the Psychometric Society, Urbana, IL.
Newton, M. A., & Raftery, A. E. (1994). Approximate Bayesian inference by the weighted likelihood bootstrap (with discussion). Journal of the Royal Statistical Society, Series B, 56, 3-48.
O’Hagan, A. (1991). Discussion on posterior Bayes factors (by M. Aitkin). Journal of the Royal Statistical Society, Series B, 53, 136.
O’Hagan, A. (1995). Fractional Bayes factors for model comparison. Journal of the Royal Statistical Society, Series B, 57, 99-138.
Owen, R. J. (1969). A Bayesian approach to tailored testing (Research Report 69-92). Princeton, NJ: Educational Testing Service.
Owen, R. J. (1975). A Bayesian sequential procedure for quantal response in the context of adaptive mental testing. Journal of the American Statistical Association, 70, 351–356.
Parshall, C. G., Spray, J. A., Kalohn, J. C., & Davey, T. (2002). Practical considerations in computer-based testing. New York: Springer-Verlag.
Patz, R., & Junker, B. (1999). A straightforward approach to Markov chain Monte Carlo methods for item response models. Journal of Educational and Behavioral Statistics, 24, 146–178.
Press, S. J. (2003). Subjective and objective Bayesian statistics: Principles, models, and applications (2nd ed.). Hoboken, NJ: John Wiley & Sons.
Ponsoda, V., & Olea, J. (2003). Adaptive and tailored testing (including IRT and non-IRT application). In R. Fernandez-Ballesteros (Ed.), Encyclopaedia of psychological assessment (pp. 9-13). London: Sage Publications.
Raftery, A. E. (1996). Hypothesis testing and model selection. In W. R. Gilks, S. Richardson, & D. J. Spiegelhalter (Eds.), Markov chain Monte Carlo in practice (pp. 163-187). London: Chapman & Hall.
Raftery, A. E., & Lewis, S. M. (1996). Implementing MCMC. In W. R. Gilks, S. Richardson, & D. J. Spiegelhalter (Eds.), Markov chain Monte Carlo in practice (pp. 115-130). London: Chapman & Hall.
Raîche, G., Blais, J. G., & Magis, D. (2007). Adaptive estimators of trait level in adaptive testing: Some proposals. In D. J. Weiss (Ed.), Proceedings of the 2007 GMAC Conference on Computerized Adaptive Testing, June 7-8, 2007.
Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests. Copenhagen: Institute of Educational Research. (Expanded edition, 1980. Chicago: The University of Chicago Press.)
Reckase, M. D. (1985). The difficulty of test items that measure more than one ability. Applied Psychological Measurement, 9, 401–412.
Reckase, M. D. (1997). The past and future of multidimensional item response theory. Applied Psychological Measurement, 21, 25-36.
Reise, S. P., Widaman, K. F., & Pugh, R. H. (1993). Confirmatory factor analysis and item response theory: Two approaches for exploring measurement invariance. Psychological Bulletin, 114, 552-566.
Revuelta, J., & Ponsoda, V. (1998). A comparison of item exposure control methods in computerized adaptive testing. Journal of Educational Measurement, 35, 311-327.
Rijmen, F., Tuerlinckx, F., De Boeck, P., & Kuppens, P. (2003). A nonlinear mixed model framework for item response theory. Psychological Methods, 8, 185-205.
Rubin, D. B. (1984). Bayesianly justifiable and relevant frequency calculations for the applied statistician. Annals of Statistics, 12, 1151-1172.
SAS Institute (1999). SAS online doc (version 8) (software manual on CD-Rom). Cary, NC: SAS Institute Inc.
San Martín, E., del Pino, G., & De Boeck, P. (2006). IRT models for ability-based guessing. Applied Psychological Measurement, 30(3), 193-203.
Schmid, J., & Leiman, J. M. (1957). The development of hierarchical factor solutions. Psychometrika, 22, 53-61.
Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6, 461-464.
Segall, D. O. (1996). Multidimensional adaptive testing. Psychometrika, 61, 331-354.
Segall, D. O. (2004a). A sharing item response theory model for computerized adaptive testing. Journal of Educational and Behavioral Statistics, 29, 439–460.
Segall, D. O. (2004b). Computerized adaptive testing. In K. Kempf-Leonard (Ed.), The encyclopaedia of social measurement (pp. 429-438). San Diego, CA: Academic Press.
Segall, D. O., & Moreno, K. E. (1999). Development of the computerized adaptive testing version of the Armed Services Vocational Aptitude Battery. In F. Drasgow, & J. Olson-Buchanan (Eds.), Innovations in computerized assessment (pp. 35-65). Mahwah, NJ: Lawrence Erlbaum.
Sheng, Y., & Wikle, C. K. (2008). Bayesian multidimensional IRT models with a hierarchical structure. Educational and Psychological Measurement, 68(3), 413-430.
Shih, C.-L. (2007). A comparison of item selection strategies in computerized adaptive testing for testlet-based items and multidimensional items. Unpublished doctoral thesis, National Chung Cheng University, Chia-Yi, Taiwan.
Sinharay, S. (2005). Assessing fit of unidimensional item response theory models using a Bayesian approach. Journal of Educational Measurement, 42(4), 375-394.
Sinharay, S., & Johnson, M. S. (2003). Simulation studies applying posterior predictive model checking for assessing fit of the common item response theory models. Manuscript in preparation. A preliminary version retrieved November 1, 2004, from http://www.ets.org/research/newpubs.html
Sinharay, S., Johnson, M. S., & Stern, H. S. (2006). Posterior predictive assessment of item response theory models. Applied Psychological Measurement, 30(4), 298-321.
Skrondal, A., & Rabe-Hesketh, S. (2004). Generalized latent variable modeling: Multilevel, longitudinal and structural equation models. Boca Raton, FL: Chapman and Hall/CRC Press.
Smith, L. L., & Reise, S. P. (1998). Gender differences on negative affectivity: An IRT study of differential item functioning on the Multidimensional Personality Questionnaire Stress Reaction Scale. Journal of Personality and Social Psychology, 75(5), 1350-1362.
Smith, E. V., Jr., & Smith, R. M. (Eds.). (2004). Introduction to Rasch measurement: Theory, models and applications. Maple Grove, MN: JAM Press.
Spearman, C. (1910). Correlation calculated from faulty data. British Journal of Psychology, 3, 271-295.
Spiegelhalter, D. J., Best, N. G., Carlin, B. P., & van der Linde, A. (2002). Bayesian measures of model complexity and fit. Journal of the Royal Statistical Society, Series B, Methodological, 64, 583-616.
Spiegelhalter, D. J., Thomas, A., & Best, N. (2003). WinBUGS version 1.4 [Computer Program.]. Cambridge, UK: MRC Biostatistics Unit, Institute of Public Health.
Stocking, M. L., & Lewis, C. (1995). A new method for controlling item exposure in computerized adaptive testing (ETS Research Report RR-95-25). Princeton, NJ: Educational Testing Service.
Stocking, M. L., & Lewis, C. (1998). Controlling item exposure conditional on ability in computerized adaptive testing. Journal of Educational and Behavioral Statistics, 23, 57-75.
Stocking, M. L., & Swanson, L. (1993). A method for severely constrained item selection in adaptive testing. Applied Psychological Measurement, 17(3), 277-292.
Su, Y.-H. (2007). Simultaneous control over item exposure and test overlap in computerized adaptive testing for testlet-based items and multidimensional items. Unpublished doctoral thesis, National Chung Cheng University, Chia-Yi, Taiwan.
Swanson, L., & Stocking, M. L. (1993). A model and heuristic for solving very large item selection problems. Applied Psychological Measurement, 17(2), 151-166.
Sympson, J. B., & Hetter, R. D. (1985). Controlling item-exposure rates in computerized adaptive testing. In Proceedings of the 27th annual meeting of the Military Testing Association (pp. 973–977). San Diego, CA: Navy Personnel Research and Development Center.
Tanner, M. A., & Wong, W. H. (1987). The calculation of posterior distributions by data augmentation. Journal of the American Statistical Association, 82, 528-549.
Thomasson, G. L. (1995). New item response control algorithms for computerized adaptive testing. Paper presented at the annual meeting of the Psychometric Society, Minneapolis, MN.
van der Linden, W. J. (1998). Bayesian item-selection criteria for adaptive testing. Psychometrika, 63, 201-216.
van der Linden, W. J., & Glas, C. A. W. (Eds.). (2000). Computerized adaptive testing: Theory and practice. Boston, MA: Kluwer Academic Publishers.
van der Linden,W. J., & Hambleton, R. K. (Eds.). (1997). Handbook of modern item response theory. New York: Springer.
van der Linden, W. J., & Pashley, P. J. (2000). Item selection and ability estimation in adaptive testing. In W. J. van der Linden, & C. A. W. Glas (Eds.), Computerized adaptive testing: Theory and practice (pp. 1-25). Dordrecht, The Netherlands: Kluwer Academic Publishers.
Veerkamp, W. J. J., & Berger, M. P. F. (1997). Some new item selection criteria for adaptive testing. Journal of Educational and Behavioral Statistics, 22, 203-226.
Wainer, H. (Ed.). (1990). Computerized adaptive testing: A primer. Hillsdale, NJ: Lawrence Erlbaum Associates.
Wainer, H. (1995). Precision and differential item functioning on a testlet-based test: The 1991 Law School Admissions Test as an example. Applied Measurement in Education, 8, 157-186.
Wainer, H., Bradlow, E. T., & Du, Z. (2000). Testlet response theory: An analog for the 3-PL useful in adaptive testing. In W. J. van der Linden, & C. A. W. Glas (Eds.), Computerized adaptive testing: Theory and practice (pp. 245-269). Dordrecht, The Netherlands: Kluwer Academic Publishers.
Wainer, H., & Lukhele, R. (1997). How reliable are TOEFL scores? Educational and Psychological Measurement, 57, 741-758.
Wainer, H., & Kiely, G. L. (1987). Item clusters and computerized adaptive testing: A case for testlets. Journal of Educational Measurement, 24, 185-201.
Wainer, H., & Thissen, D. (1996). How is reliability related to the quality of test scores? What is the effect of local dependence on reliability? Educational Measurement: Issues and Practice, 15, 22-29.
Wainer, H., & Wang, X. (2000). Using a new statistical model for testlets to score TOEFL. Journal of Educational Measurement, 37, 203-220.
Wang, W.-C. (2004). Effects of anchor item methods on the detection of differential item functioning within the family of Rasch models. Journal of Experimental Education, 72, 221-261.
Wang, W.-C., & Chen, P.-H. (2004). Implementation and measurement efficiency of multidimensional computerized adaptive testing. Applied Psychological Measurement, 28(5), 295-316.
Wang, W.-C., Cheng, Y.-Y., & Wilson, M. R. (2005). Local item dependency for items across tests connected by common stimuli. Educational and Psychological Measurement, 65, 5-27.
Wang, W.-C., & Liu, C.-Y. (2007). Formulation and application of the generalized multilevel facets model. Educational and Psychological Measurement, 67, 583-605.
Wang, W.-C., & Wilson, M. R. (2005a). Assessment of differential item functioning in testlet-based items using the Rasch testlet model. Educational and Psychological Measurement, 65, 549-576.
Wang, W.-C., & Wilson, M. R. (2005b). The Rasch testlet model. Applied Psychological Measurement, 29, 126-149.
Wang, W.-C., & Wilson, M. R. (2005c). Exploring local item dependence using a random-effects facet model. Applied Psychological Measurement, 29, 296-318.
Wang, W.-C., Wilson, M. R., & Adams, R. J. (1997). Rasch models for multidimensionality between items and within items. In M. Wilson, G. Engelhard & K. Draney (Eds.), Objective measurement: Theory into practice (Volume 4, pp. 139-155). Norwood, NJ: Ablex.
Wang, W.-C., Wilson, M. R., & Adams, R. J. (2000). Interpreting the parameters of a multidimensional Rasch model. In M. Wilson, & G. Engelhard (Eds.), Objective measurement: Theory into practice (Volume 5, pp. 219-242). Norwood, NJ: Ablex.
Way, W. D. (1998). Protecting the integrity of computerized testing item pools. Educational Measurement: Issues and Practice, 17, 17–27.
Wright, B. D., & Masters, G. N. (1982). Rating scale analysis. Chicago: MESA Press.
Wu, M.-L. (2006). Controlling item exposure on the fly in computerized adaptive testing. Unpublished master’s thesis, National Chung Cheng University, Chia-Yi, Taiwan.
Wu, M.-L., & Chen, S.-Y. (2008). Investigating item exposure control on the fly in computerized adaptive testing. Psychological Testing, 55(1), 1-32.
Wu, M. L., Adams, R. J., & Wilson, M. R. (1998). ACER ConQuest: Generalised item response modeling software. Melbourne, Australia: Australian Council for Educational Research.