Revisiting the Institutional English Proficiency Test (EPT): Evaluating Its Validity, Reliability, and CEFR Alignment
DOI: https://doi.org/10.24903/sj.v11i1.2331
Keywords: English Proficiency Test, Item-objective congruence, reliability, validity
Abstract
Background:
Institutions urgently need to develop their own standardized English proficiency tests, given that high-stakes tests remain unaffordable and inaccessible for some academic communities. The Center for Language Development of IAIN Madura therefore developed the EPT (English Proficiency Test) as its own institutional English test. However, no research has yet provided evidence that the EPT is standardized.
Methodology:
In this quantitative consequential study, four subject-matter experts (SMEs) judged the congruence between test items and their objectives to establish content validity (CV) using the item-objective congruence (IOC) formula, and assessed three aspects of language proficiency assessment to determine how well the EPT aligns with the Common European Framework of Reference for Languages (CEFR).
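The IOC computation can be sketched briefly. In the form of the index commonly applied in content-validation studies (following Rovinelli and Hambleton's approach), each expert rates an item +1 if it clearly measures the stated objective, 0 if unsure, and -1 if it clearly does not; the item's IOC is the mean of those ratings, with values at or above a threshold (often 0.75, sometimes 0.50) treated as acceptable. A minimal sketch, with hypothetical ratings from four SMEs (the exact variant of the formula used in this study is an assumption):

```python
def ioc(ratings):
    """Item-objective congruence: mean of expert ratings (+1, 0, -1)."""
    return sum(ratings) / len(ratings)

# Hypothetical ratings for one test item from four SMEs:
# three judge it congruent with its objective, one is unsure.
item_ratings = [1, 1, 1, 0]
score = ioc(item_ratings)
print(score)  # 0.75
```

An item scoring below the chosen threshold would typically be revised or replaced before the test is administered.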
Findings:
The results revealed that the EPT achieved content validity and reliability, with IOC indices above 0.75 for all three skills measured. Internal consistency and stability estimates also yielded high reliability coefficients. However, there is no evidence of alignment between the EPT and the CEFR, as the administrator did not follow the three validation frameworks: what is assessed (specification), how performance is interpreted (standardization), and how comparisons are made (standard setting).
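Internal consistency of this kind is conventionally estimated with Cronbach's alpha (the abstract does not name the coefficient, so that is an assumption here). A minimal sketch over a hypothetical respondents-by-items score matrix:

```python
from statistics import pvariance

def cronbach_alpha(scores):
    """Cronbach's alpha for a respondents x items matrix of scores."""
    k = len(scores[0])                          # number of items
    items = list(zip(*scores))                  # transpose: one column per item
    item_var = sum(pvariance(col) for col in items)
    total_var = pvariance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - item_var / total_var)

# Hypothetical dichotomous responses from five test takers on four items.
data = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
]
print(round(cronbach_alpha(data), 2))  # 0.86
```

Stability would instead be estimated by correlating scores from two administrations of the same test (test-retest reliability).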
Conclusion:
These findings indicate that the EPT should be improved by aligning it with the CEFR. Administrators should conduct a standard-setting study to map EPT scores onto the CEFR and establish the minimum scores (cut scores) required to reach each targeted CEFR level.
Originality:
This research fills a knowledge gap by evaluating the EPT's alignment with the CEFR and addresses the need for a standardized, affordable language test in Indonesia.
License
Copyright (c) 2026 Moh. Syafik, Eva Nikmatul Rabbianty, Yazid Basthomi, Ling Gan, Nurul Hadi

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Authors retain copyright and grant the journal right of first publication with the work simultaneously licensed under a Creative Commons Attribution 4.0 International License that allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.


