Revisiting the Institutional English Proficiency Test (EPT): Evaluating Its Validity, Reliability, and CEFR Alignment
DOI: https://doi.org/10.24903/sj.v11i1.2331
Keywords: English Proficiency Test, Item-objective congruence, reliability, validity
Abstract
Background:
Institutions urgently need to develop their own standardized English proficiency tests, given that high-stakes tests remain unaffordable and inaccessible for some academic communities. The Center for Language Development of IAIN Madura therefore developed the EPT (English Proficiency Test) as its own institutional English test. However, no research has yet provided evidence that the EPT is standardized.
Methodology:
In this quantitative consequential study, four subject-matter experts (SMEs) judged the congruence between test items and their objectives to establish content validity (CV) using the item-objective congruence (IOC) formula, and assessed three aspects of language proficiency assessment to determine how well the EPT aligns with the Common European Framework of Reference for Languages (CEFR).
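The IOC computation can be sketched briefly. In the form of the index commonly applied in content-validation studies (following Rovinelli and Hambleton's approach), each expert rates an item +1 if it clearly measures the stated objective, 0 if unsure, and -1 if it clearly does not; the item's IOC is the mean of those ratings, with values at or above a threshold (often 0.75, sometimes 0.50) treated as acceptable. A minimal sketch, with hypothetical ratings from four SMEs (the exact variant of the formula used in this study is an assumption):

```python
def ioc(ratings):
    """Item-objective congruence: mean of expert ratings (+1, 0, -1)."""
    return sum(ratings) / len(ratings)

# Hypothetical ratings for one test item from four SMEs:
# three judge it congruent with its objective, one is unsure.
item_ratings = [1, 1, 1, 0]
score = ioc(item_ratings)
print(score)  # 0.75
```

An item scoring below the chosen threshold would typically be revised or replaced before the test is administered.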
Findings:
The results revealed that the EPT achieved content validity and reliability, with IOC indices above 0.75 for all three skills measured. Internal consistency and stability estimates also yielded high reliability coefficients. However, there is no evidence of alignment between the EPT and the CEFR, as the administrator did not follow the three validation frameworks: what is assessed (specification), how performance is interpreted (standardization), and how comparisons are made (standard setting).
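Internal consistency of this kind is conventionally estimated with Cronbach's alpha (the abstract does not name the coefficient, so that is an assumption here). A minimal sketch over a hypothetical respondents-by-items score matrix:

```python
from statistics import pvariance

def cronbach_alpha(scores):
    """Cronbach's alpha for a respondents x items matrix of scores."""
    k = len(scores[0])                          # number of items
    items = list(zip(*scores))                  # transpose: one column per item
    item_var = sum(pvariance(col) for col in items)
    total_var = pvariance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - item_var / total_var)

# Hypothetical dichotomous responses from five test takers on four items.
data = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
]
print(round(cronbach_alpha(data), 2))  # 0.86
```

Stability would instead be estimated by correlating scores from two administrations of the same test (test-retest reliability).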
Conclusion:
These findings indicate that the EPT should be improved by aligning it with the CEFR. Administrators should conduct a standard-setting study to map EPT scores onto the CEFR and establish the minimum scores (cut scores) required to reach each targeted CEFR level.
Originality:
This research fills a knowledge gap by evaluating the EPT's alignment with the CEFR and addresses the need for a standardized, affordable language test in Indonesia.
License
Copyright (c) 2026 Moh. Syafik, Eva Nikmatul Rabbianty, Yazid Basthomi, Ling Gan, Nurul Hadi

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Authors retain copyright and grant the journal right of first publication with the work simultaneously licensed under a Creative Commons Attribution 4.0 International License that allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.


