Preprint / Version 1

Synthetic data for sharing and exploration in high performance sport

Considerations for application


  • John Warmenhoven, University of Technology Sydney
  • Franco Impellizzeri
  • Ian Shrier
  • Andrew Vigotsky
  • Lorenzo Lolli
  • Paolo Menaspà
  • Aaron Coutts
  • Maurizio Fanchini
  • Giles Hooker



open science, privacy, anonymity, data analysis, transparency


Synthetic data are alternative data sources generated using mathematical procedures to address specific issues in research and practice. They have emerging applications in clinical and medical contexts and may help overcome privacy issues and thereby support open science practice. The present study discusses the applicability of an established synthetic data generation process using sequential tree-based algorithms (the synthpop package in R) to athlete monitoring data in sport. We provide an educational primer and discussion of the potential application of these methods to issues in the sport and exercise sciences, applying synthpop in seven simulation examples based on a professional football dataset. Although sequential tree-based algorithms can create synthetic data from our reference dataset, we provide considerations for, and highlight limitations of, constructing synthetic data. To summarize, three types of models can be conceptualised for generating synthetic data: 1) models used for analysis of the original data (answering specific research questions); 2) models used to generate synthetic data; and 3) models that represent the true generation process for the original data. Misalignments in the specifications of these models might introduce biases that can compromise the utility of synthetic data, whatever the purpose. As synthetic data do not constitute a direct replacement for real data from conceptual and empirical standpoints, we believe that researchers embracing this practice must include sufficient documentation concerning the purpose of the synthetic data generation process, the predictors and model used, and the potential boundary conditions for using the synthetic data in future investigations in sports and other fields.
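For readers unfamiliar with sequential tree-based synthesis, the general idea can be sketched in a few lines of Python. This is a toy illustration only, not the synthpop implementation and not the analysis reported in the paper: the small regression tree, the function names (`build_tree`, `leaf_donors`, `synthesize`), the column names (`minutes`, `rpe`) and the simulated "original" data are all assumptions made for the example. The first variable in the visit sequence is resampled from its observed values; each later variable is drawn from the observed values that share a tree leaf with the synthetic record's already-generated predictors.

```python
import random

def sse(rows, target):
    """Sum of squared errors of the target around its mean (split criterion)."""
    ys = [r[target] for r in rows]
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)

def build_tree(rows, target, predictors, min_leaf=5, depth=3):
    """Grow a small regression tree; each leaf keeps its observed target values (donors)."""
    donors = [r[target] for r in rows]
    if depth == 0 or len(rows) < 2 * min_leaf:
        return {"donors": donors}
    best = None
    for p in predictors:
        values = sorted({r[p] for r in rows})
        for lo, hi in zip(values, values[1:]):
            cut = (lo + hi) / 2
            left = [r for r in rows if r[p] <= cut]
            right = [r for r in rows if r[p] > cut]
            if len(left) < min_leaf or len(right) < min_leaf:
                continue
            score = sse(left, target) + sse(right, target)
            if best is None or score < best[0]:
                best = (score, p, cut, left, right)
    if best is None:
        return {"donors": donors}
    _, p, cut, left, right = best
    return {"var": p, "cut": cut,
            "left": build_tree(left, target, predictors, min_leaf, depth - 1),
            "right": build_tree(right, target, predictors, min_leaf, depth - 1)}

def leaf_donors(tree, record):
    """Drop a record down the tree and return the donor values of the leaf it reaches."""
    while "donors" not in tree:
        tree = tree["left"] if record[tree["var"]] <= tree["cut"] else tree["right"]
    return tree["donors"]

def synthesize(rows, visit_sequence, n, rng):
    """Sequential synthesis: trees are fitted on the original rows,
    but predictions use the synthetic values generated so far."""
    first = visit_sequence[0]
    synthetic = [{first: rng.choice(rows)[first]} for _ in range(n)]
    for j in range(1, len(visit_sequence)):
        target, predictors = visit_sequence[j], visit_sequence[:j]
        tree = build_tree(rows, target, predictors)
        for rec in synthetic:
            rec[target] = rng.choice(leaf_donors(tree, rec))
    return synthetic

# Hypothetical "original" data: session duration (minutes) and a perceived
# exertion score that loosely depends on it. Not the paper's dataset.
rng = random.Random(42)
original = []
for _ in range(200):
    minutes = rng.uniform(30, 120)
    rpe = min(10.0, max(1.0, 3 + minutes / 30 + rng.gauss(0, 1)))
    original.append({"minutes": minutes, "rpe": rpe})

synthetic = synthesize(original, ["minutes", "rpe"], n=200, rng=rng)
```

In this sketch every synthetic value is a copy of some observed value, while the combination of values within a record is newly assembled. The choice of visit sequence and predictors is itself a modelling decision: a variable left out of `visit_sequence` cannot contribute to the relationships preserved in the synthetic data, which is one way the misalignment between generating and analysis models described above can arise.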





