Artificial Intelligence Chatbot Accuracy for Patient Education in HPV Infection
Introduction/objective: Chatbots are artificial intelligence (AI) tools that use large language models to answer questions and simulate human-like conversation. Patients commonly use chatbots for self-education on medical issues, but chatbot accuracy for self-education regarding human papillomavirus (HPV) infection and recurrent respiratory papillomatosis (RRP) is unknown. This study analyzes chatbot answers to common patient questions regarding laryngeal HPV infection and RRP.
Methods: The authors posed eight questions mimicking real-world patient queries to three chatbot models (Claude, ChatGPT, and Gemini). Two fellowship-trained laryngologists, blinded to the source model, rated the eight answers from each model on a 5-point Likert scale for accuracy, comprehensiveness, and similarity to a response the rating physician would themselves offer. Descriptive and inferential statistical analyses were performed.
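For illustration only, the following is a minimal sketch, not the study's actual data or code, of how such blinded Likert ratings could be tabulated and summarized in Python with pandas; every score shown is a hypothetical placeholder.

```python
# Minimal sketch (hypothetical scores, not the study's data) of tabulating
# blinded Likert ratings: platforms x raters, with three rated aspects.
import pandas as pd

ratings = pd.DataFrame({
    "platform": ["Gemini", "ChatGPT", "Claude"] * 2,
    "rater":    [1, 1, 1, 2, 2, 2],
    "accuracy":          [5, 4, 4, 5, 4, 3],  # hypothetical Likert scores
    "comprehensiveness": [5, 4, 4, 4, 4, 4],  # hypothetical
    "similarity":        [5, 4, 3, 5, 4, 4],  # hypothetical
})

# Reshape to long format and compute mean and SD per platform.
long_form = ratings.melt(id_vars=["platform", "rater"], value_name="score")
print(long_form.groupby("platform")["score"].agg(["mean", "std"]))
```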
Results: Gemini had the highest overall mean score (4.7, SD 0.5) and Claude the lowest (4.0, SD 0.7). Gemini also scored highest on each individual aspect. Across all platforms, accuracy was the highest-scored aspect (4.4, SD 0.6), although the difference between aspects was not statistically significant. One-way ANOVA showed a significant overall difference among the three platforms (p < 0.0001). Post-hoc pairwise comparisons using the Tukey test were significant only when Gemini was compared with ChatGPT and with Claude (p < 0.0001 for both); the comparison between ChatGPT and Claude was not statistically significant.
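As a hedged illustration of the analysis reported above, a one-way ANOVA followed by post-hoc Tukey comparisons could be run in Python with SciPy and statsmodels as sketched below; the scores are randomly generated placeholders, not the study's data, and the group sizes are an assumption.

```python
# Minimal sketch (randomly generated placeholder scores, not the study's
# data) of the reported analysis: one-way ANOVA across platforms, followed
# by post-hoc Tukey HSD pairwise comparisons.
import numpy as np
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(0)

# Assumed 48 placeholder scores per platform (8 questions x 3 aspects x 2 raters).
gemini  = rng.integers(4, 6, size=48).astype(float)  # mostly 4-5
chatgpt = rng.integers(3, 6, size=48).astype(float)  # 3-5
claude  = rng.integers(3, 5, size=48).astype(float)  # 3-4

# One-way ANOVA: is there any overall difference among the three platforms?
f_stat, p_value = stats.f_oneway(gemini, chatgpt, claude)
print(f"ANOVA: F = {f_stat:.2f}, p = {p_value:.4g}")

# Tukey HSD: which specific pairs of platforms differ?
scores = np.concatenate([gemini, chatgpt, claude])
groups = ["Gemini"] * 48 + ["ChatGPT"] * 48 + ["Claude"] * 48
print(pairwise_tukeyhsd(scores, groups, alpha=0.05))
```

Tukey's HSD adjusts for multiple comparisons, which is why it, rather than repeated unadjusted t-tests, follows the omnibus ANOVA.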
Conclusion: Gemini had the highest overall and per-aspect scores for answer accuracy, comprehensiveness, and similarity to what physician raters would offer their patients. It was the only chatbot platform that differed significantly from the others in answering patient questions on HPV infection and RRP.