  • Research Note
  • Open access

The feasibility of using generative artificial intelligence for history taking in virtual patients

Abstract

Objective

This study aimed to design and develop a virtual patient program using generative Artificial Intelligence (AI) technology, providing medical students with opportunities to practice history taking with a chatbot. We evaluated the feasibility of this approach by analyzing the quality of the responses generated by the chatbot.

Results

Five expert reviewers participated in a pilot test, interacting with the chatbot to take the history of a patient presenting with a urinary problem using the Korean AI platform Naver HyperCLOVA X®. They evaluated the AI responses using a five-item questionnaire rated on a five-point Likert scale. The chatbot generated 96 pairs of questions and answers, totaling 1,325 words in 177 sentences. Discourse analysis of the scripts revealed that 34 of the words generated by the chatbot (2.6%) were deemed implausible; these were categorized into inarticulate answers, hallucinations, and missing important information. Participants rated the AI answers as relevant (M = 4.50 ± 0.32), valid (M = 4.20 ± 0.40), accurate (M = 4.10 ± 0.20), and succinct (M = 3.80 ± 0.51), but were neutral about their fluency (M = 3.20 ± 0.60). Using generative AI for history taking with virtual patients is feasible, but improvements are needed for more articulate and natural responses.


Introduction

There has been growing interest in using generative artificial intelligence (AI) technology, such as ChatGPT, for medical education [1], yet research and practice in this area are only beginning to emerge. Despite the potential of generative AI for use in various teaching and learning contexts in medical education, research and practice have primarily focused on its performance in assessments of medical knowledge, such as medical licensing examinations [2, 3], and its use for automatic item generation [4]. In this study, we designed and developed a Virtual Patient (VP) program using generative AI technology based on a Large Language Model (LLM) to provide students with opportunities to practice history taking by interacting with a chatbot.

Medical students have traditionally practiced and been assessed on doctor-patient interactions using standardized patients (SPs), which are an effective substitute for patient encounters in real clinical settings. Still, owing to resource constraints, medical students have limited opportunities to practice patient encounters with SPs outside of high-stakes assessments. VPs have therefore been used to supplement medical students’ patient encounters and are known to be effective for fostering clinical reasoning skills [5]. Despite the importance of interactive encounters for developing competencies in history taking [6], doctor-patient interaction in conventional VPs, which were developed with pre-LLM technology, significantly lacks realism: such VPs typically have students select from a list of predefined questions in a pull-down menu rather than generate their own questions [7]. Hence, there has been increasing attention on integrating generative AI into VPs to overcome these limitations.

Recent developments in generative AI, with LLMs and natural language processing capabilities, have made it a feasible tool for practicing history taking by allowing realistic and natural interactions with a chatbot [1]. Several studies have illustrated that generative AI can provide a simulated patient experience, offering mostly plausible answers and automated structured feedback while yielding a good user experience [8, 9, 10, 11]. Still, the application of generative AI to teaching medical students patient encounters, including history taking, is just on the horizon, and research is scant on whether the AI can generate responses suited to practicing history taking. In particular, research indicates that AI is prone to hallucinations and, in some cases, tends to produce socially desirable answers or biases [8, 9, 10]. Furthermore, current evidence regarding measurable educational outcomes of AI-powered interventions in medical education is scarce [12], and while chatbots have shown promise in specific clinical cases, their generalizability across various clinical scenarios remains less explored [13].

As previous studies indicate that integrating AI into VPs presents both opportunities and challenges, more research is needed to fully understand the effectiveness and limitations of these technologies in various educational contexts. In this study, we developed a VP prototype in which the student interacts with a chatbot to simulate a medical interview with a patient, and we conducted a preliminary study to evaluate its feasibility for practicing history taking by analyzing the quality of the responses generated by the chatbot.

Methods

Research design and procedures

We adopted generative AI technology using an LLM, on the Korean AI platform Naver HyperCLOVA X®, to enable the VP to handle free-form questions. In this VP, the student is given initial information about the patient (e.g., chief complaint) and begins interviewing the VP in an open chat interface. For this pilot test, we developed a VP with a urinary problem and evaluated the quality of the responses generated by the AI. Development involved iterative cycles of internal testing and fine-tuning of the AI responses with training data, which were scripts of medical interviews of fictitious patients written by expert medical faculty.
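The general pattern behind such a VP can be sketched as a case script embedded in a system prompt that constrains a chat model to stay in character. The sketch below is a minimal illustration under that assumption; the `call_llm` helper, the case facts, and the message format are hypothetical placeholders, not the HyperCLOVA X configuration or training scripts used in this study.

```python
# Minimal sketch of a prompt-constrained virtual patient. Case facts, message
# format, and the call_llm helper are hypothetical placeholders; they do not
# reproduce the HyperCLOVA X setup used in the study.

CASE_SCRIPT = """You are a middle-aged man visiting a clinic with a urinary problem.
Stay in character as the patient and answer in one or two short sentences.
Reveal case facts only when the student doctor asks about them.
Case facts (illustrative): frequent urination and weak stream for three months;
no fever; no blood in the urine; taking medication for high blood pressure."""

def call_llm(system_prompt: str, messages: list[dict]) -> str:
    """Hypothetical stand-in for a chat-completion API call; replace with a
    real request to the LLM platform of choice."""
    return "I've been going to the bathroom far too often lately, doctor."

def interview_turn(messages: list[dict], student_question: str) -> str:
    """Append the student's question, get the in-character reply, log both."""
    messages.append({"role": "user", "content": student_question})
    answer = call_llm(CASE_SCRIPT, messages)
    messages.append({"role": "assistant", "content": answer})
    return answer

messages: list[dict] = []
print(interview_turn(messages, "What brings you in today?"))
```

Fine-tuning with expert-written interview scripts, as done in this study, would then steer the model's answers toward this in-character style.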

We evaluated the quality of responses generated by the chatbot using expert reviewers. Five researchers, including three AI experts and two medical educators from a research group on AI in healthcare affiliated with an academic medical center in Korea, participated in this study. Each reviewer evaluated the VP as if they were a medical student taking the history of a patient with a urinary problem. This pilot test yielded five sets of question-answer pairs, which were retrieved from the AI platform and saved in Microsoft Excel® format for data analysis. The reviewers then evaluated the quality of the AI responses using an assessment tool developed by the authors (Supplementary Material 1). This questionnaire included five items (relevance, accuracy, fluency, succinctness, and validity of the AI responses in impersonating a patient) rated on a five-point Likert scale (1 = “strongly disagree,” 5 = “strongly agree”). Two experts in AI reviewed the questionnaire to establish the content validity of the items.

Data analysis

A discourse analysis was performed on the scripts of conversations with the chatbot. Discourse analysis is a research method for examining various forms of discourse by analyzing their meaning and structure either qualitatively or quantitatively [14]. A review of the relevant literature revealed a lack of research instruments for systematically analyzing discourse generated by chatbots. The validity of a coding scheme can be established using inductive and/or deductive approaches [15]; we used both, developing the initial categories of implausible chatbot answers from a review of the relevant literature and then checking whether the data collected in this study fit the initial scheme after reviewing all utterances generated by the chatbot. The authors coded the scripts independently, counting the frequencies of implausible answers and categorizing them. Any discrepancies in the analysis were discussed until consensus was reached.

Questionnaire responses were subjected to descriptive analysis, and the Intraclass Correlation Coefficient (ICC) was calculated to assess inter-rater reliability. IBM SPSS Statistics for Windows ver. 27.0 (IBM Corp., Armonk, NY, USA) was used for the data analysis.
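The study computed the ICC in SPSS; as a rough illustration of the same analysis, an intraclass correlation can be reproduced in Python with the pingouin library. The ratings below are invented for illustration (chosen only to echo the reported item means); just the five questionnaire items and the five-rater design come from the study.

```python
# Illustrative reproduction of the inter-rater reliability analysis in Python
# (the study itself used IBM SPSS 27). Ratings are invented for illustration.
import pandas as pd
import pingouin as pg  # pip install pingouin

items = ["relevance", "validity", "accuracy", "succinctness", "fluency"]
ratings = pd.DataFrame({
    "item":  [i for i in items for _ in range(5)],  # 5 items (targets)
    "rater": ["r1", "r2", "r3", "r4", "r5"] * 5,    # 5 expert reviewers
    "score": [5, 4, 5, 4, 5,   # relevance
              4, 4, 5, 4, 4,   # validity
              4, 4, 4, 4, 5,   # accuracy
              4, 3, 4, 4, 4,   # succinctness
              3, 3, 4, 2, 4],  # fluency (5-point Likert ratings)
})

icc = pg.intraclass_corr(data=ratings, targets="item",
                         raters="rater", ratings="score")
print(icc[["Type", "ICC", "CI95%"]])  # ICC1..ICC3k variants with 95% CIs
```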

Results

This pilot test produced a total of 96 pairs of questions and answers, with the chatbot generating 1,325 words in 177 sentences. Table 1 shows an excerpt from the actual scripts of the dialogue with the chatbot.

Table 1 Sample conversations between a student doctor and the chatbot impersonating a patient for practicing history taking (translated from Korean)

The results of the analysis of conversations with the chatbot are shown in Table 2. Among the 1,325 words generated by the chatbot, 34 (2.6%) were deemed implausible. These fell into three categories: (1) inarticulate answers, including fragmented, partial, or repetitive sentences (23 words, 1.7%); (2) hallucinations (7 words, 0.5%), i.e., nonsensical, inaccurate, or misleading information (e.g., saying he was not taking any medication despite previously stating he was); and (3) missing important information, such as not providing a complete response to the student doctor’s question (4 utterances, 0.3%). Since word counts cannot be performed on instances of ‘missing important information’, which are not stated by the chatbot, each sentence with missing important information was counted as one utterance.
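For transparency, the reported proportions follow directly from the category counts and the 1,325-word corpus; a minimal check using only the figures above:

```python
# Worked check of the reported proportions against the 1,325-word corpus.
TOTAL_WORDS = 1_325  # words generated by the chatbot across 96 Q-A pairs

implausible = {
    "inarticulate answers": 23,          # fragmented, partial, or repetitive
    "hallucinations": 7,                 # nonsensical/inaccurate/misleading
    "missing important information": 4,  # counted per utterance, not per word
}

for category, n in implausible.items():
    print(f"{category}: {n} ({n / TOTAL_WORDS:.1%})")  # 1.7%, 0.5%, 0.3%
total = sum(implausible.values())
print(f"total implausible: {total} ({total / TOTAL_WORDS:.1%})")  # 34 (2.6%)
```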

Table 2 Analysis of conversations with the chatbot impersonating a patient

Table 3 shows the results of the expert review of the quality of AI responses. Participants generally responded positively to the quality of the AI responses (M = 3.96, SD = 0.21). They felt the AI answers were relevant (M ± SD = 4.50 ± 0.32), valid (M ± SD = 4.20 ± 0.40), and accurate (M ± SD = 4.10 ± 0.20) in the context of a patient encounter. Still, participants felt that the AI responses were not as fluent as those of a real person (M = 3.20, SD = 0.60). Inter-rater reliability of participant responses was moderate, with ICC values ranging between 0.64 and 0.80.

Table 3 Analysis of expert evaluation of the quality of AI responses

Discussion

This study investigated the feasibility of using generative AI based on a Large Language Model for practicing history taking by evaluating the quality of AI responses. The AI responses were generally plausible: we found few instances of implausible answers generated by the chatbot, most of which (23 of 34) were inarticulate answers, with the remainder being hallucinations and missing important information. Similarly, participants gave overall positive ratings on the quality of the AI responses, although they noted a lack of fluency. Our study suggests that using generative AI to practice patient history taking is feasible and worth testing with learners. However, the program still needs improvement to offer more authentic, natural, and articulate responses, and its performance is expected to improve over time with enhanced functionality. Thus, we are refining the prototype to make the AI responses more fluent by providing more training data.

We are also developing an assessment system wherein students receive automated feedback by reviewing their full chat history, the important questions they missed, and their performance score based on a checklist developed by expert medical faculty; a sketch of this idea follows below. We believe this is an innovative approach to enhancing the design of VPs, as it simulates more natural conversation for taking a patient history than conventional formats. Moreover, this VP can benefit medical students by providing additional opportunities to practice taking patient histories without relying on SPs or real patients, and by enabling them to receive structured feedback on their performance through AI technology. The VP program can also be integrated into a virtual or augmented reality environment to enable students to take the VP’s history as part of the whole patient-encounter process.
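As a rough sketch of how such checklist-based automated feedback could work, the fragment below compares the questions a student asked against an expert checklist. The items and the keyword-matching rule are hypothetical stand-ins for the faculty-developed checklist and whatever matching logic the final system adopts.

```python
# Hypothetical sketch of checklist-based feedback on a history-taking session.
# Checklist items and keyword matching are illustrative placeholders for the
# expert-developed checklist described in the text.

CHECKLIST = {
    "onset and duration": ["when", "start", "how long"],
    "urinary frequency":  ["how often", "how many times", "frequency"],
    "hematuria":          ["blood", "color of urine"],
    "medication history": ["medication", "medicine", "taking any"],
}

def score_history(questions: list[str]) -> tuple[float, list[str]]:
    """Return the fraction of checklist items covered and the missed items."""
    asked = " ".join(q.lower() for q in questions)
    missed = [item for item, keywords in CHECKLIST.items()
              if not any(k in asked for k in keywords)]
    return 1 - len(missed) / len(CHECKLIST), missed

coverage, missed = score_history([
    "When did the problem start?",
    "How often do you urinate at night?",
])
print(f"coverage: {coverage:.0%}; missed items: {missed}")
# -> coverage: 50%; missed items: ['hematuria', 'medication history']
```

A production system would more plausibly use the LLM itself, rather than keyword matching, to judge whether each checklist item was addressed.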

Limitations

Limitations of the study need to be acknowledged, and they also warrant future research. First, this was an early-stage study evaluating the program’s feasibility as perceived by educators before implementation with medical students. Future studies should investigate the effectiveness of this program, including its usability and learner reactions, by having medical students interact with the chatbot in various scenarios and collecting feedback on both the interactions and the realism of the AI’s responses to identify areas needing improvement. Moreover, longitudinal studies are recommended in which students engage with the chatbot in different scenarios over time to assess learning outcomes and the evolution of the chatbot’s capabilities.

Second, this was a pilot study using a prototype of one clinical presentation (urinary problem), which may not be generalizable to other clinical presentations. Further research involving diverse clinical scenarios is therefore necessary to validate the applicability of the AI tool across various clinical contexts. As medical students need to develop competencies in taking patient histories for various clinical presentations, we recommend that the VP program be developed to cover a diverse range of clinical presentations and permutations [16]. In particular, further studies are warranted to inform the effective use of prompts and behavioral components of the chatbot that optimize its performance across various clinical contexts [8, 9, 10]. To that end, a program of design-based research (a formative research approach to educational designs, based on principles derived from prior research, that advances the testing and refinement of theories and educational practice [17]) is warranted to advance the theory and practice of using AI for VPs that promote student learning [18, 19, 20, 21]. Moreover, developing such programs consumes a great deal of resources; collaboration across medical schools is therefore recommended to share resources in building a comprehensive library of VP cases.

Third, several AI platforms are currently available, and this study used only one platform for the chatbot. As the quality of AI responses likely differs across platforms, future studies should investigate the generalizability of our findings to other platforms and how to improve their performance.

Data availability

Data are available from the corresponding author upon reasonable request.

References

  1. Gordon M, Daniel M, Ajiboye A, Uraiby H, Xu NY, Bartlett R, et al. A scoping review of artificial intelligence in medical education: BEME Guide No. 84. Med Teach. 2024;46(4):446–70.

  2. Kung TH, Cheatham M, Medenilla A, Sillos C, De Leon L, Elepaño C, et al. Performance of ChatGPT on USMLE: potential for AI-assisted medical education using large language models. PLOS Digit Health. 2023;2(2):e0000198.

  3. Mihalache A, Huang RS, Popovic MM, Muni RH. ChatGPT-4: an assessment of an upgraded artificial intelligence chatbot in the United States Medical Licensing Examination. Med Teach. 2024;46(3):366–72.

  4. Kiyak Y, Kononowicz A. Case-based MCQ generator: a custom ChatGPT based on published prompts in the literature for automatic item generation. Med Teach. 2024;46(8):18–20.

  5. Plackett R, Kassianos AP, Mylan S, Kambouri M, Raine R, Sheringham J. The effectiveness of using virtual patient educational tools to improve medical students’ clinical reasoning skills: a systematic review. BMC Med Educ. 2022;22:365.

  6. Keifenheim KE, Teufel M, Ip J, Speiser N, Leehr EJ, Zipfel S, et al. Teaching history taking to medical students: a systematic review. BMC Med Educ. 2015;15:159.

  7. Reiswich A, Haag M. Evaluation of chatbot prototypes for taking the virtual patient’s history. Stud Health Technol Inform. 2019;260:73–80.

  8. Cook DA, Overgaard J, Pankratz VS, Del Fiol G, Aakre CA. Virtual patients using large language models: scalable, contextualized simulation of clinician-patient dialog with feedback. J Med Internet Res. 2025. Epub ahead of print. PMID: 39854611.

  9. Holderried F, Stegemann-Philipps C, Herschbach L, Moldt JA, Nevins A, Griewatz J, et al. A generative pretrained transformer (GPT)-powered chatbot as a simulated patient to practice history taking: prospective, mixed methods study. JMIR Med Educ. 2024;10:e53961.

  10. Holderried F, Stegemann-Philipps C, Herrmann-Werner A, Festl-Wietek T, Holderried M, Eickhoff C, et al. A language model-powered simulated patient with automated feedback for history taking: prospective study. JMIR Med Educ. 2024;10:e59213.

  11. García-Torres D, Ripoll M, Peris C, Solves J. Enhancing clinical reasoning with virtual patients: a hybrid systematic review combining human reviewers and ChatGPT. Healthcare. 2024;12(22).

  12. Feigerlova E, Hani H, Hothersall-Davies E. A systematic review of the impact of artificial intelligence on educational outcomes in health professions education. BMC Med Educ. 2025;25:129.

  13. Frangoudes F, Hadjiaros M, Schiza E, Matsangidou M, Tsivitanidou O, Neokleous K. An overview of the use of chatbots in medical and healthcare education. In: Learning and Collaboration Technologies: Games and Virtual Environments for Learning (LCT 2021), Part II; 2021.

  14. Kärreman D, Levay C. The interplay of text, meaning and practice: methodological considerations on discourse analysis in medical education. Med Educ. 2017;51(1):72–80.

  15. Rourke L, Anderson T. Validity in quantitative content analysis. Educ Technol Res Dev. 2004;52(1):5–18.

  16. Cook DA. Creating virtual patients using large language models: scalable, global, and low cost. Med Teach. 2025;47(1):40–2.

  17. Dolmans DH, Tigelaar D. Building bridges between theory and practice in medical education using a design-based research approach: AMEE Guide No. 60. Med Teach. 2012;34(1):1–10.

  18. Hirumi A, Johnson T, Reyes RJ, Lok B, Johnsen K, Rivera-Gutierrez DJ, et al. Advancing virtual patient simulations through design research and interplay: part II: integration and field test. Educ Technol Res Dev. 2016;64(6):1301–35.

  19. Hirumi A, Kleinsmith A, Johnsen K, Kubovec S, Eakins M, Bogert K, et al. Advancing virtual patient simulations through design research and interplay: part I: design and development. Educ Technol Res Dev. 2016;64(4):763–85.

  20. Cook DA, Triola MM. Virtual patients: a critical literature review and proposed next steps. Med Educ. 2009;43(4):303–11.

  21. Cook DA, Erwin PJ, Triola MM. Computerized virtual patients in health professions education: a systematic review and meta-analysis. Acad Med. 2010;85(10):1589–602.


Acknowledgements

None.

Funding

This work was supported by TEL Innovation Development Grant 2022 of AMEE (Association for Medical Education in Europe).

Author information

Authors and Affiliations

Authors

Contributions

Conception and design of the work: YY, KK. Data collection, data analysis and interpretation: YY, KK. Drafting the article: KK. Critical revision of the article, and final approval of the version to be published: YY, KK.

Corresponding author

Correspondence to Kyong-Jee Kim.

Ethics declarations

Ethics approval and consent for participation

Ethical approval was provided by the IRB of Dongguk University (DGU IRB 20240027), and all methods were carried out in accordance with the Declaration of Helsinki (clinical trial number: not applicable). The requirement for signed informed consent was waived in accordance with the guidelines of the institutional review board of Dongguk University because the participants were non-vulnerable adults and no private or confidential information was collected. Participants were informed about the purpose and procedures of the study; participation was voluntary, and they could withdraw from the study at any point.

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary Material 1

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

Reprints and permissions

About this article


Cite this article

Yi, Y., Kim, KJ. The feasibility of using generative artificial intelligence for history taking in virtual patients. BMC Res Notes 18, 80 (2025). https://doi.org/10.1186/s13104-025-07157-8
