Using GenAI to develop multiple-choice questions

In collaboration with Prof. Annemiek Snoeckx (Faculty of Medicine and Health Sciences, UAntwerp)

1. Multiple-choice questions in assessment and education

Examinations constitute an important instrument for evaluating students’ learning process and the knowledge and skills they have acquired. Depending on the intended learning outcomes, a range of assessment formats may be employed, including oral or written examinations (ECHO teaching tip, 2026). For assessing factual knowledge, conceptual understanding, and application, multiple-choice questions (MCQs) are particularly suitable. Consequently, they are widely used in both formative and summative assessment contexts.
MCQs can take various forms depending on their purpose. For instance, questions requiring justification or identifying the best or worst answer can assess understanding and critical thinking, whereas reordering or matching items are more suited to evaluating relationships and processes (Sabbe, 2013; NBME, 2024).

The use of MCQs offers several advantages. They enable the efficient assessment of large cohorts within a limited timeframe, while ensuring objectivity and broad coverage of the curriculum. Moreover, well-designed distractors can help differentiate between partial and complete understanding by revealing common misconceptions or gaps in knowledge. The digitalisation of assessment further enhances these efficiencies.

The quality of MCQs depends heavily on their construction. Each item consists of a stem, the correct answer, and several distractors. The stem should present the problem clearly, concisely, and unambiguously. Negative phrasing and irrelevant information are best avoided. The answer options should be grammatically consistent with the stem, plausible, substantively distinct, and of comparable length. In most cases, three answer options strike an effective balance between readability and reliability (Sabbe, 2013; NBME, 2024).

Developing high-quality MCQs is a time-intensive process, particularly when combined with the wide range of other academic responsibilities faced by teaching staff. As a result, exam questions are often reused over multiple years. While understandable, this practice carries certain risks: knowledge evolves, curricula change, and items may become outdated. Continuous development of new, up-to-date questions therefore remains essential. Emerging technologies, such as generative artificial intelligence (GenAI), offer potential support in this process (Artsi et al., 2024; UNESCO, 2023).

2. Artificial intelligence and large language models

Recent advances in GenAI have led to significant developments in natural language processing, particularly through the emergence of large language models (LLMs). These models, including systems such as ChatGPT, GPT-4, and Claude, are trained on vast corpora of text. They are capable of generating new content, answering questions, and structuring information in ways that often appear both human-like and accurate. The application of GenAI is attracting growing interest in knowledge-intensive domains such as education. It has the potential to support learning, improve access to knowledge, and reduce workload, thereby creating more space for personalised guidance. At the same time, these developments prompt critical reflection on how best to integrate human expertise with AI capabilities. In this sense, GenAI represents not only a technological innovation but also a driver of educational transformation (Hang et al., 2023).

This teaching tip explores how teaching staff can use LLMs to develop MCQs, outlines key considerations, and provides practical guidance for implementation.

3. What can LLMs offer for MCQ-based assessment?

LLMs can support educators in several ways (Artsi et al., 2024; UNESCO, 2023):

More efficient item development

One of the primary advantages of GenAI is speed. Whereas manually constructing MCQs is time-consuming, LLMs can rapidly generate multiple draft or sample items. Reduced time investment allows for the development of larger item pools, facilitating more frequent renewal of question banks and helping to ensure that assessment materials remain current.

Beyond generating new items, LLMs can also be applied to existing MCQ sets to:

Revise and restructure large item banks.
Adapt existing questions to different levels of difficulty.
Assemble multiple equivalent versions of an examination with comparable levels of difficulty.

Quality assurance

Evaluating the quality of MCQs requires both time and expertise, and ensuring consistency and comparability across items remains a significant challenge. Conducting such evaluations efficiently is particularly demanding when working with large item banks. Assessing MCQs in terms of item writing flaws (IWFs) is especially complex in extensive question sets. IWFs refer to errors or weaknesses in item construction, such as ambiguous wording, negatively phrased stems, or inconsistently structured response options. LLMs may, in the future, play a valuable role in systematically screening and standardising items, thereby supporting the validity and reliability of examinations. In addition, LLMs can assist in correcting grammatical and linguistic inaccuracies, reducing the need for manual post-editing.
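As an illustration of what systematic screening could look like, the sketch below flags a few of the flaws named above with simple text rules. It does not use an LLM; the `Item` class, the word list, and the length threshold are assumptions made for this example, and the flags are meant as prompts for human review rather than a verdict.

```python
import re
from dataclasses import dataclass


@dataclass
class Item:
    stem: str
    options: list[str]  # all answer options, correct answer included


# Illustrative word list only (an assumption); a real screen would follow an item-writing guide.
NEGATIVE_WORDS = re.compile(r"\b(not|except|never|incorrect|false)\b", re.IGNORECASE)


def screen_item(item: Item, max_length_ratio: float = 2.0) -> list[str]:
    """Return possible item-writing flaws for an educator to review."""
    flags = []
    if NEGATIVE_WORDS.search(item.stem):
        flags.append("negatively phrased stem")
    lengths = [len(option) for option in item.options]
    if lengths and min(lengths) > 0 and max(lengths) / min(lengths) > max_length_ratio:
        flags.append("options differ strongly in length")
    if len({option.strip().lower() for option in item.options}) < len(item.options):
        flags.append("duplicate options")
    return flags


# Made-up example item, used only to show the function in action.
example = Item("Which of these is NOT a prime number?",
               ["4", "7", "A number that is greater than ten"])
print(screen_item(example))
```

Rules like these catch only surface-level flaws; ambiguity and factual accuracy still need expert judgement.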

Greater variation and enhanced learning experiences

GenAI facilitates the generation of varied items tailored to different student groups or levels. This enables educators to design assessments that better align with diverse learner needs. For example, first-year students may be presented with more foundational knowledge questions, whereas master’s students may engage with complex case-based or interpretative tasks. In addition, GenAI opens up opportunities for more formative assessment by enabling the rapid generation of personalised practice questions with immediate feedback.

Enhanced feedback for students

Research indicates that MCQs are most effective for learning when students receive not only the correct answer but also explanations for why that answer is correct and why the alternatives are incorrect. In practice, however, such detailed feedback is rarely provided due to the time required to develop it. LLMs can support this process by automatically generating rationales for each answer option (Ch’en et al., 2025).

4. Challenges and considerations

Despite their potential, the use of LLMs also raises important considerations for educators (Law et al., 2025; Rincon et al., 2025; Ahmed et al., 2025).

General considerations

Limited currency: Many models have no access to real-time information, so recent developments may be missing from their output unless the model can consult up-to-date sources.
Reliability of the output: GenAI can sometimes produce inaccurate or unsubstantiated responses that nevertheless appear convincing, often referred to as 'hallucinations'. Critical review by the educator remains essential.
Referencing and transparency: LLMs rarely provide verifiable sources. This is particularly problematic in educational and assessment contexts, where accuracy and traceability are crucial.
Bias and ethics: Training data may contain biases, which can be reflected in model outputs.
Privacy and confidentiality: Sensitive or identifiable information about students or colleagues should not be shared.
Over-reliance: While GenAI can reduce workload, it should not replace the educator’s professional judgement and pedagogical decision-making.
Implementation challenges: The responsible and practical integration of GenAI into teaching remains an evolving area, with unresolved questions regarding logistics, cost, and staff development.

Considerations specific to MCQs

Scientific evidence: Research on GenAI-generated MCQs is still emerging. Existing studies are limited in number, often conducted in English, and typically situated in controlled environments rather than authentic educational contexts.
Difficulty level: Evidence suggests that AI-generated items tend to be less challenging than those developed by experienced educators, limiting their effectiveness in assessing higher-order cognitive skills such as application and analysis.
Discriminatory power: Relatedly, such items may be less effective at distinguishing between high- and low-performing students.
Quality of distractors: The plausibility of incorrect options remains a known limitation. Implausible distractors reduce the validity of assessment.
Rationales and feedback: Although LLMs can generate explanations, these are often superficial or insufficiently nuanced, limiting their current pedagogical value.
Role of the educator: LLMs are best used to generate templates or initial drafts. Comprehensive review, refinement, and validation by the educator remain indispensable.

5. Getting started with LLMs for MCQs

The importance of effective prompting

When using LLMs to develop MCQs, the quality of the prompt is critical. A prompt refers to the instruction provided to the model; the clearer, more specific, and better structured the prompt, the higher the quality of the output. This process, often referred to as prompt engineering, is an emerging field that examines how models can be guided to produce relevant, usable, and reliable outputs (Kiyak et al., 2024; Ch’en et al., 2025; Artsi et al., 2024).

Core principles for effective prompt use

Be highly specific

The more detail you include in your prompt, the more targeted the response will be. For example, specify the subject area, educational level, type of examination, and the expected output.

Provide context

Supply the model with a reference text or course material and instruct it to base questions and answers exclusively on that material. This increases relevance and reduces the likelihood of errors.
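The two principles above, being specific and providing context, can be combined in a reusable prompt template. The following is a minimal sketch: the function name, its parameters, and the exact wording of the instructions are assumptions for illustration, and the resulting text can be pasted into any chat-based LLM.

```python
def build_mcq_prompt(subject: str, level: str, question_type: str,
                     n_questions: int, source_text: str) -> str:
    """Assemble a specific, context-grounded prompt for MCQ generation."""
    return "\n".join([
        f"You are writing exam questions for a {level} course on {subject}.",
        f"Write {n_questions} {question_type} questions.",
        "Each question has one stem, one correct answer and two plausible distractors.",
        "Avoid negative phrasing and keep all options of comparable length.",
        "Base every question and answer exclusively on the source text below.",
        "",
        "SOURCE TEXT:",
        source_text.strip(),
    ])


# Placeholder values; replace them with your own course details and material.
prompt = build_mcq_prompt(
    subject="respiratory physiology",
    level="first-year bachelor",
    question_type="single-best-answer",
    n_questions=5,
    source_text="Gas exchange takes place in the alveoli.",
)
print(prompt)
```

Keeping the template in one place makes it easier to change a single element, such as the level or the question type, and compare the outputs.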

Break down complex tasks

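Split a complex task into smaller steps, for example by generating the question stems first, then the correct answers, and then the distractors. This ensures more controllable output. It also allows you to evaluate and adjust each step in the process.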
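Breaking a task down can be pictured as a chain of small prompts, each producing one piece that can be checked before the next step. In the sketch below, `llm` is a hypothetical placeholder for whatever model interface you use (it simply maps a prompt to a text response), and `fake_llm` is a stand-in so the example runs without any model access.

```python
from typing import Callable


def draft_item(llm: Callable[[str], str], source_text: str) -> dict[str, str]:
    """Run one small prompt per step so each output can be reviewed separately."""
    stem = llm(f"Write one question stem based only on this text:\n{source_text}")
    answer = llm("Give the correct answer to this question, based only on the text.\n"
                 f"Question: {stem}\nText: {source_text}")
    distractors = llm("Write five plausible but incorrect answer options.\n"
                      f"Question: {stem}\nCorrect answer: {answer}")
    return {"stem": stem, "answer": answer, "distractors": distractors}


# Stand-in for a real model call (an assumption), so the sketch runs offline.
def fake_llm(prompt: str) -> str:
    return f"[model output for: {prompt.splitlines()[0]}]"


draft = draft_item(fake_llm, "Gas exchange takes place in the alveoli.")
for step, text in draft.items():
    print(f"{step}: {text}")
```

In practice you would pause after each step to check the intermediate result, rather than running the chain unattended.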

Iterate and refine

Use the output from your first prompt as a basis and ask the model to generate improvements or alternatives. Specify your aims, for example: make the question more difficult, revise the distractors, and so on.

Request brief justification

Ask the model to briefly explain how it arrived at the answer and to check itself for missing assumptions or inconsistencies. This often yields more consistent responses.

Use external tools where possible

Some models allow integration with databases or search functions to retrieve recent or domain-specific information, for example, connection to scientific databases such as PubMed, or linkage to recent policy documents or guidelines, such as WHO publications.

Evaluate systematically

Compare the output, where possible, with standard questions or existing materials, and assess it against known criteria.

6. Step-by-step guide for developing MCQs with LLMs

Based on the principles above, teaching staff can follow a step-by-step approach. This helps to harness the benefits of GenAI while simultaneously keeping known challenges (such as the quality of distractors or overly simple questions) under control (Kiyak et al., 2024; Bhowmick et al., 2023).

Step 1. Select the content

Choose a learning outcome or a section of course material.

Step 2. Generate questions

Formulate a clear prompt specifying the topic, the level of the students, and the desired question type.
Ask the LLM to generate a set of possible question stems based on the input.
As an educator, check whether the questions are correct, relevant, and clearly formulated.

Step 3. Formulate correct answers

Ask the model to generate a correct answer for each question.
Check substantively whether this answer is accurate and aligns well with the learning outcome.

Step 4. Generate distractors

Ask the LLM to create multiple incorrect answer options that are plausible and closely related to the correct answer.
Evaluate the quality yourself: are they credible and not too obvious?
If you need three distractors, consider asking the LLM to provide five. You can then select the best distractors for the given question (see Step 6).
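When a model returns more distractors than needed, a small pre-filter can tidy the list before you make the final choice. The sketch below is one possible approach under stated assumptions: it removes blanks, repeats, and copies of the correct answer, and then prefers candidates closest in length to the correct answer. The function name and the length criterion are illustrative; plausibility itself still has to be judged by the educator.

```python
def shortlist_distractors(correct: str, candidates: list[str], keep: int = 3) -> list[str]:
    """Drop unusable candidates, then prefer those closest in length to the correct answer."""
    seen = {correct.strip().lower()}
    usable = []
    for candidate in candidates:
        key = candidate.strip().lower()
        if key and key not in seen:  # skip blanks, repeats and copies of the correct answer
            seen.add(key)
            usable.append(candidate.strip())
    # Length is only a rough proxy for "comparable" options, not for plausibility.
    usable.sort(key=lambda c: abs(len(c) - len(correct)))
    return usable[:keep]


# Made-up candidate list, used only to show the filtering behaviour.
shortlist = shortlist_distractors(
    "Alveoli",
    ["Bronchi", "alveoli", "Trachea", "Bronchi",
     "The pleural cavity surrounding the lungs", "Larynx"],
)
print(shortlist)
```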

Step 5. Add rationales

Ask the model to provide a brief explanation of why an answer is correct and why the other options are incorrect.
Use this as a basis for feedback, but always review it critically.

Step 6. Filter and refine

Select the questions that are substantively accurate, have the appropriate difficulty level, and meet the quality criteria.
Rephrase where necessary and remove weak items.

Step 7. Iterate and expand

Use the strongest questions as templates for new prompts.
Ask the model to create variants or examination sets with comparable levels of difficulty.

Step 8. Test and evaluate

Compare the GenAI-generated questions with existing exam questions.
Evaluate whether they meet your assessment objectives and whether they challenge students sufficiently.
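Once questions have been answered by students, for instance in a formative test, difficulty and discriminatory power can be quantified. This tip does not prescribe a method; the sketch below assumes classical item analysis, where difficulty is the proportion of students answering correctly and discrimination is the difference in that proportion between the highest- and lowest-scoring groups. The function name and data layout are illustrative, and the default group size of 27% is a common convention.

```python
def item_statistics(responses: list[list[int]],
                    group_fraction: float = 0.27) -> list[tuple[float, float]]:
    """Return (difficulty, discrimination) per item.

    responses[s][i] is 1 if student s answered item i correctly, else 0.
    """
    n_students = len(responses)
    n_items = len(responses[0])
    ranked = sorted(responses, key=sum, reverse=True)  # best total scores first
    group_size = max(1, round(n_students * group_fraction))
    upper, lower = ranked[:group_size], ranked[-group_size:]
    stats = []
    for i in range(n_items):
        difficulty = sum(row[i] for row in responses) / n_students
        discrimination = (sum(row[i] for row in upper)
                          - sum(row[i] for row in lower)) / group_size
        stats.append((difficulty, discrimination))
    return stats


# Tiny made-up data set: 4 students, 2 items; half the students per group.
scores = [[1, 1], [1, 0], [1, 1], [0, 0]]
print(item_statistics(scores, group_fraction=0.5))  # [(0.75, 0.5), (0.5, 1.0)]
```

A difficulty close to 1 indicates an item almost everyone answers correctly, and a discrimination near zero suggests the item does not separate stronger from weaker students. With small groups these figures are unstable and should be read with caution.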

7. What does the future hold?

The use of LLMs for developing MCQs is still in its infancy, and scientific research on the topic remains scarce. Moreover, the existing literature is largely based on English-language use, which does not transfer easily to Dutch-language higher education, and most studies were conducted in controlled research settings without a clear translation to everyday practice.

At the same time, models are developing rapidly. New generations of GenAI models (for example, OpenAI's reasoning models or DeepSeek-R1) already reason better and build up their answers step by step.

In addition, the first systems specifically designed to support educators in assessment development are emerging, such as Questgen and QuizRise (Hang et al., 2023).

The future therefore looks promising: AI is likely to offer growing opportunities to support educators in developing, revising, and evaluating MCQs. For teaching staff, this means that experimentation is worthwhile, but that critical evaluation and human validation remain indispensable.

8. Conclusion

LLMs can accelerate and enrich the development of MCQs, but their output should serve only as a starting point. Human oversight remains essential to ensure quality and didactic value. The technology is developing at a rapid pace, so experimentation and critical learning are crucial. Do not wait until GenAI is perfect, but begin using it today: experiment, try things out, evaluate critically, and progressively discover how it can support you.

Want to know more?

Strongly recommended reading

Ahmed A, Kerr E, O'Malley A. Quality assurance and validity of AI-generated single best answer questions. BMC Med Educ. 2025 Feb 25;25(1):300. doi: 10.1186/s12909-025-06881-w.
Artsi Y, Sorin V, Konen E, Glicksberg BS, Nadkarni G, Klang E. Large language models for generating medical examinations: systematic review. BMC Med Educ. 2024 Mar 29;24(1):354. doi: 10.1186/s12909-024-05239-y.
Bhowmick AK, Jagmohan A, Vempaty A, Dey P, Hall L, Hartman J, Kokku R, Maheshwari H. Automating question generation from educational text. arXiv [Preprint]. 2023 Sep 26 [cited 2025 Oct 11]. Available from: https://arxiv.org/abs/2309.15004
Ch'en PY, Day W, Pekson RC, Barrientos J, Burton WB, Ludwig AB, Jariwala SP, Cassese T. GPT-4 generated answer rationales to multiple choice assessment questions in undergraduate medical education. BMC Med Educ. 2025 Mar 4;25(1):333. doi: 10.1186/s12909-025-06862-z.
Hang CN, Tan CW, Yu P-D. MCQGen: A large language model-driven MCQ generator for personalized learning. IEEE Access. 2023;11:1-12. doi:10.1109/ACCESS.2024.3420709
Kıyak YS, Emekli E. ChatGPT prompts for generating multiple-choice questions in medical education and evidence on their validity: a literature review. Postgrad Med J. 2024 Oct 18;100(1189):858-865. doi: 10.1093/postmj/qgae065.
Law AK, So J, Lui CT, Choi YF, Cheung KH, Kei-Ching Hung K, Graham CA. AI versus human-generated multiple-choice questions for medical education: a cohort study in a high-stakes examination. BMC Med Educ. 2025 Feb 8;25(1):208. doi: 10.1186/s12909-025-06796-6.
National Board of Medical Examiners. NBME item-writing guide: Constructing written test questions for the health sciences. 6th ed. Philadelphia (PA): National Board of Medical Examiners; 2024. Available from: https://www.nbme.org
Rincón EHH, Jimenez D, Aguilar LAC, Flórez JMP, Tapia ÁER, Peñuela CLJ. Mapping the use of artificial intelligence in medical education: a scoping review. BMC Med Educ. 2025 Apr 12;25(1):526. doi: 10.1186/s12909-025-07089-8.
Sabbe E, Lesage E. Meerkeuzetoetsen: praktische handleiding voor leerkrachten en docenten. Antwerpen: Garant; 2012. 82 p.
United Nations Educational, Scientific and Cultural Organization (UNESCO). Guidance for generative AI in education and research. Paris: UNESCO; 2023. ISBN: 978-92-3-100612-8. Available from: https://doi.org/10.54675/EWZM9535

Background literature

Arif T, Asthana S, Collins-Thompson K. Generation and assessment of multiple-choice questions from video transcripts using large language models. In: Proceedings of the Eleventh ACM Conference on Learning @ Scale (L@S ’24). Atlanta (GA): ACM; 2024. p. 1–7. doi:10.1145/3657604.3664714
Başaranoğlu M, Akbay E, Erdem E. AI-generated questions for urological competency assessment: a prospective educational study. BMC Med Educ. 2025;25:611. doi:10.1186/s12909-025-07202-x
Cheung BHH, Lau GKK, Wong GTC, Lee EYP, Kulkarni D, Seow CS, et al. ChatGPT versus human in generating medical graduate exam multiple choice questions—A multinational prospective study (Hong Kong S.A.R., Singapore, Ireland, and the United Kingdom). PLoS One. 2023;18(8):e0290691. doi:10.1371/journal.pone.0290691
Demeester T, Beckmann L. Distractor generation for multiple-choice questions with predictive prompting and large language models. Commun Comput Inf Sci. 2025;2134:48–63.
Griot M, Vanderdonckt J, Yuksel D, Hemptinne C. Multiple choice questions and large language models: a case study with fictional medical data. arXiv preprint. arXiv:2406.02394. 2024.
Mistry NP, Saeed H, Rafique S, Le T, Obaid H, Adams SJ. Large language models as tools to generate radiology board-style multiple-choice questions. Acad Radiol. 2024;31(11):3872–8. doi:10.1016/j.acra.2024.06.046
Moore S, Nguyen HA, Chen T, Stamper J. Assessing the quality of multiple-choice questions using GPT-4 and rule-based methods. Carnegie Mellon University; 2023.
Safranek CW, Sidamon-Eristoff AE, Gilson A, Chartash D. The role of large language models in medical education: applications and implications. JMIR Med Educ. 2023;9:e50945. doi:10.2196/50945
Sawamura S, Kohiyama K, Takenaka T, Sera T, Inoue T, Nagai T. Potential of large language models in generating multiple-choice questions for the Japanese National Licensure Examination for Physical Therapists. Cureus. 2025;17(2):e79183. doi:10.7759/cureus.79183
Tomova M, Roselló Atanet I, Sehy V, Sieg M, März M, Mäder P. Leveraging large language models to construct feedback from medical multiple-choice questions. Sci Rep. 2024;14:27910. doi:10.1038/s41598-024-79245-x
Tran A, Angelikas K, Rama E, Okechukwu C, Macneil S. Generating multiple choice questions for computing courses using large language models. In: 2023 IEEE Frontiers in Education Conference (FIE). IEEE; 2023. doi:10.1109/FIE58773.2023.10342898
Verghese BG, Iyer C, Borse T, Cooper S, White J, Sheehy R. Modern artificial intelligence and large language models in graduate medical education: a scoping review of attitudes, applications & practice. BMC Med Educ. 2025;25:730. doi:10.1186/s12909-025-07321-5
Wu S, Koo M, Blum L, Black A, Kao L, Fei Z, Scalzo F, Kurtz I. Benchmarking open-source large language models, GPT-4, and Claude 2 on multiple-choice questions in nephrology. NEJM AI. 2024;1(2). doi:10.1056/AIdbp2300092
Elkins S, Kochmar E, Serban I, Cheung JCK. How useful are educational questions generated by large language models? arXiv preprint. arXiv:2304.06638. 2023.
