Standardized educational assessments: from framework to outcomes

Authors

Rodríguez, P., Soca, J., Castillo, M., & Luzardo, M.

DOI:

https://doi.org/10.18861/cied.2025.16.2.4015

Keywords:

educational assessment, large-scale assessment, standardized assessment, psychometrics, methodology

Abstract

This article analyzes standardized educational assessments (SEA) as valuable tools for measuring educational achievement and improving the quality of educational systems. It offers an updated overview of their development, implementation, and analysis methods. The research is based on a theoretical bibliographic review of recent studies addressing global and regional educational contexts, with particular emphasis on Latin America. The analysis identifies significant methodological advances, such as the incorporation of item response theory, computerized adaptive testing, and the use of artificial intelligence for item generation, all of which enhance the accuracy and relevance of measurement. However, it also identifies limitations, including the disconnect between the technical design of SEA and their practical application, the lack of technical training for teachers, and the underutilization of the resulting databases for educational research and informed decision-making. The findings highlight the need to design SEA that account for disadvantaged contexts and students with disabilities. The article concludes that maximizing the impact of SEA requires strengthening training in their interpretation and promoting their use in educational research, thereby contributing to more equitable and effective educational systems.
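As context for the psychometric models named in the abstract: item response theory relates an examinee's latent ability to the probability of answering an item correctly. A minimal sketch, using the standard two-parameter logistic (2PL) model (one of several IRT models treated in this literature; the notation below is generic, not the article's own):

$$P(X_{ij} = 1 \mid \theta_i) = \frac{1}{1 + e^{-a_j(\theta_i - b_j)}}$$

where $\theta_i$ is the ability of examinee $i$, $a_j$ the discrimination of item $j$, and $b_j$ its difficulty. Computerized adaptive testing builds on such models by selecting, at each step, the item whose parameters are most informative at the examinee's current ability estimate.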



Published

2025-07-10

How to Cite

Rodríguez, P., Soca, J., Castillo, M., & Luzardo, M. (2025). Standardized educational assessments: from framework to outcomes. Cuadernos de Investigación Educativa, 16(2). https://doi.org/10.18861/cied.2025.16.2.4015

Issue

16(2)

Section

Articles