Preprint / Version 1

Synthetic data for sharing and exploration in high performance sport

Considerations for application

Authors

  • John Warmenhoven, University of Technology Sydney
  • Franco Impellizzeri
  • Ian Shrier
  • Andrew Vigotsky
  • Lorenzo Lolli
  • Paolo Menaspà
  • Aaron Coutts
  • Maurizio Fanchini
  • Giles Hooker

DOI:

https://doi.org/10.51224/SRXIV.394

Keywords:

open science, privacy, anonymity, data analysis, transparency

Abstract

Synthetic data are alternative data sources generated using mathematical procedures to address specific issues in research and practice. They have emerging applications in clinical and medical contexts and may help overcome privacy issues and thereby support open science practice. The present study discusses the applicability of an established synthetic data generation process that uses sequential tree-based algorithms (the Synthpop package in R) to athlete monitoring data in sport. We provide an educational primer and a discussion of the potential application of these methods in the field of sports and exercise sciences, applying Synthpop in seven simulation examples based on a professional football dataset. Although sequential tree-based algorithms can create synthetic data from our reference dataset, we provide considerations for, and highlight limitations of, constructing synthetic data. To summarize, three types of models can be conceptualised for generating synthetic data: 1) models used for analysis of the original data (answering specific research questions); 2) models used to generate synthetic data; and 3) models that represent the true generation process for the original data. Misalignments in the specifications of these models might introduce biases that can compromise the utility of synthetic data, whatever the purpose. As synthetic data are not a direct replacement for real data from conceptual and empirical standpoints, we believe that researchers embracing this practice must include sufficient documentation of the purpose of the synthetic data generation process, the predictors and model used, and the potential boundary conditions for using the synthetic data in future investigations in sports and other fields.
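To make the idea of sequential tree-based synthesis concrete, the following is a minimal illustrative sketch in plain Python. It is not the Synthpop implementation and is not taken from the preprint. The first variable is resampled from its observed values, and each later variable is drawn from the donor values in the leaf of a small regression tree fitted on the variables synthesised before it. The toy dataset, the variable names (`age`, `load`, `rpe`) and all function names are hypothetical and chosen only for illustration.

```python
import random

def sum_sq(ys):
    """Sum of squared deviations from the mean."""
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys)

def best_split(rows, target, predictors, min_leaf):
    """Find the (predictor, threshold) split with the lowest within-node error."""
    best = None
    for p in predictors:
        values = sorted({r[p] for r in rows})
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2
            left = [r[target] for r in rows if r[p] <= thr]
            right = [r[target] for r in rows if r[p] > thr]
            if len(left) < min_leaf or len(right) < min_leaf:
                continue
            sse = sum_sq(left) + sum_sq(right)
            if best is None or sse < best[0]:
                best = (sse, p, thr)
    return best

def grow(rows, target, predictors, min_leaf=5, depth=3):
    """Grow a small tree; a leaf is the list of observed target values (donors)."""
    split = best_split(rows, target, predictors, min_leaf) if depth > 0 else None
    if split is None:
        return [r[target] for r in rows]
    _, p, thr = split
    left = [r for r in rows if r[p] <= thr]
    right = [r for r in rows if r[p] > thr]
    return (p, thr,
            grow(left, target, predictors, min_leaf, depth - 1),
            grow(right, target, predictors, min_leaf, depth - 1))

def draw(tree, row, rng):
    """Route a synthetic row to a leaf and sample one donor value from it."""
    while isinstance(tree, tuple):
        p, thr, left, right = tree
        tree = left if row[p] <= thr else right
    return rng.choice(tree)

def synthesise(data, order, n, seed=0):
    """Synthesise n rows, one variable at a time, following the given order."""
    rng = random.Random(seed)
    first = order[0]
    synth = [{first: rng.choice(data)[first]} for _ in range(n)]
    for i, var in enumerate(order[1:], start=1):
        tree = grow(data, var, order[:i])  # fitted on the ORIGINAL data
        for row in synth:
            row[var] = draw(tree, row, rng)
    return synth

# Toy data, randomly generated for illustration only (not the football dataset).
gen = random.Random(1)
toy = []
for _ in range(60):
    age = gen.randint(18, 35)
    load = round(300 + 10 * age + gen.gauss(0, 40))
    rpe = min(10, max(1, round(load / 100 + gen.gauss(0, 1))))
    toy.append({"age": age, "load": load, "rpe": rpe})

synthetic = synthesise(toy, order=["age", "load", "rpe"], n=100)
```

In this sketch the order in which variables are synthesised and the choice of predictors for each tree together define the generating model, which is the second of the three model types distinguished in the abstract. An analysis that depends on relationships the trees did not capture could give different answers on the synthetic data than on the original.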

Posted

2024-04-17