Preprint / Version 1

The sports nutrition knowledge of large language model (LLM) artificial intelligence (AI) chatbots: an assessment of accuracy, completeness, clarity, quality of evidence, and test-retest reliability

DOI:

https://doi.org/10.51224/SRXIV.512

Keywords:

Accuracy, Artificial Intelligence, Athlete, Chatbot, Completeness, Evidence Quality, Large Language Model (LLM), Nutrition, Sports nutrition, Reliability

Abstract

Generative artificial intelligence (AI) chatbots are increasingly used in many domains, including sports nutrition. Despite their growing popularity, there is limited evidence on the accuracy, completeness, clarity, evidence quality, and test-retest reliability of AI-generated sports nutrition advice. This study evaluated the basic and advanced models of ChatGPT, Gemini, and Claude across these metrics to determine their utility in providing sports nutrition information. Two experiments were conducted. In Experiment 1, chatbots were tested with simple and detailed prompts in two domains: sports nutrition for training and sports nutrition for racing. Interrater agreement was determined, and chatbot performance was assessed by measuring accuracy, completeness, clarity, evidence quality, and test-retest reliability. In Experiment 2, chatbot performance was evaluated by measuring the accuracy and test-retest reliability of the chatbots' answers to multiple-choice questions based on a sports nutrition certification exam. In Experiment 1, interrater reliability was good and accuracy ranged from 31% (ClaudePro) to 74% (Gemini1.5pro). Detailed prompts improved Claude's accuracy but had little impact on ChatGPT or Gemini. Completeness scores were highest for ChatGPT-4o; the other chatbots scored low to moderate. The quality of cited evidence was low for all chatbots when simple prompts were used but improved with detailed prompts. In Experiment 2, accuracy ranged from 61% (ClaudePro) to 89% (Claude3.5Sonnet). Test-retest reliability was acceptable across all metrics in both experiments. While generative AI chatbots show potential for providing sports nutrition guidance, their accuracy is moderate at best and inconsistent between models. Until significant advancements are made, athletes and coaches should consult registered dietitians for tailored nutrition advice.
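As an illustration of the two Experiment 2 metrics, the sketch below scores a chatbot's multiple-choice answers against an answer key (accuracy) and compares two repeated runs of the same questions (test-retest agreement). It is a minimal sketch and not the authors' analysis: the abstract does not state which reliability statistic was used, so Cohen's kappa is an assumption here, and the function names, the answer key, and both runs are hypothetical toy values.

```python
from collections import Counter


def accuracy(answers, key):
    """Fraction of answers that match the answer key."""
    if len(answers) != len(key):
        raise ValueError("answers and key must have the same length")
    return sum(a == k for a, k in zip(answers, key)) / len(key)


def cohen_kappa(first, second):
    """Chance-corrected agreement between two runs of categorical answers."""
    if len(first) != len(second):
        raise ValueError("both runs must have the same length")
    n = len(first)
    # Observed agreement: share of questions answered identically in both runs.
    observed = sum(a == b for a, b in zip(first, second)) / n
    # Expected agreement if the two runs were independent, from marginal counts.
    c1, c2 = Counter(first), Counter(second)
    expected = sum(c1[c] * c2[c] for c in set(c1) | set(c2)) / (n * n)
    if expected == 1.0:
        return 1.0  # both runs gave one identical answer throughout
    return (observed - expected) / (1 - expected)


# Hypothetical toy data: an 8-question answer key and two runs of one chatbot.
key = ["A", "B", "C", "D", "A", "B", "C", "D"]
run1 = ["A", "B", "C", "D", "A", "C", "C", "A"]
run2 = ["A", "B", "C", "D", "B", "C", "C", "A"]

print(f"accuracy, run 1: {accuracy(run1, key):.3f}")
print(f"accuracy, run 2: {accuracy(run2, key):.3f}")
print(f"test-retest kappa: {cohen_kappa(run1, run2):.3f}")
```

Reporting both numbers matters because they answer different questions: a chatbot can repeat the same wrong answers consistently (high agreement, low accuracy) or reach similar accuracy through different answers on each run.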


Posted

2025-02-05