foundations of data science [editorial]

The American Institute of Mathematical Sciences, one of eight NSF-funded mathematical institutes, is supporting a new journal on data sciences called Foundations of Data Science, with editors-in-chief Ajay Jasra, Kody Law, and Vasileios Maroulas. Since I know them reasonably well (!), I have asked the editors for an editorial and they obliged by sending me the following. [Disclaimer: I have no direct involvement in this journal or with the AIMS. While I am supportive of this endeavour, and wish the editorial team the best in terms of submitted papers and scientific impact, I nonetheless remain philosophically in favour of PCI publishing policies. Note that the journal is free access for the next three years. After that, the fee increase is definitely steep, with subscriptions in the $500-$1000 range and open access levies around $800…]

Data Science has become a term fraught with hype, adulation, and sometimes skepticism within the scientific community. For many it is simply a collection of tools from statistics, mathematics, and computer science that enable continuing improvements in the interpretation of data, which has been a primary driver of science since at least as early as Kepler’s data-driven approach to defining the laws of planetary motion. To others, the phrase heralds the dawning of an entirely new subject, with the power to change our lives as we know them. This latter perspective has led governments and private companies to invest heavily in data science in order to gain a competitive edge over their rivals. In the popular media, the notions of artificial intelligence (AI) and machine learning are almost inseparable from data science: the ability of an algorithm to process data, and possibly prior information, in order to learn about quantities of interest related to the data, and possibly make decisions based upon that knowledge. Despite these contrasting opinions, what is clear is that Data Science is indispensable to the scientific community, and it is our responsibility to provide it with a concrete and rigorous foundation.

The recent trend driving the enormous interest in this field is the explosion of available data, which has led to what Jim Gray of Microsoft referred to as the emerging 4th paradigm of data-intensive science. Indeed, Mark Cuban predicted at SXSW 2017 that the world’s first trillionaire will be a person who has exploited AI. The first 2 paradigms of scientific discovery are experiment and theory, and computational science is the 3rd. Computational science and engineering (CSE) is driven largely by numerically simulating complex systems. Recently, uncertainty has become a requisite consideration in complex applications that have classically been treated deterministically, which has led to growing interest in uncertainty quantification (UQ). Statistics, and in particular Bayesian inference, provides a principled and well-defined approach to the integration of data and UQ, and research into computationally intensive inference has intensified on the verge of the 4th paradigm. In some domains, such as the geosciences, this is referred to as data assimilation, because the data are assimilated into an existing model approximating the underlying physical system. This stems from the more classical field of inverse problems, which entails an often deterministic and static approach to inferring parameters in a physical model given observations, while data assimilation leverages UQ, as well as ideas from signal processing and control theory, to extend these considerations to the online context. The term Bayesian inverse problems has recently been coined to refer to UQ via Bayesian inference for inverse problems. In either case, one infers or learns unknown parameters given the available information, both in the form of thousands of years of model development and in the form of abundant noisily observed data.
This intersection of the 3rd and 4th paradigms is fertile territory in the transition to data-intensive discovery, particularly where massive data sets are being generated by scientifically well-understood systems, and where an enormous amount of effort has already been invested in generating reliable computer software for simulating these systems. Areas such as data-centric science and engineering, and scientific machine learning, are beginning to see impact in their application to CSE: for example, to biological science and engineering in bioinformatics; more recently to materials science and engineering in materials informatics, i.e. the prediction and design of process-structure-property relationships; as well as to civil and aerospace engineering and manufacturing, in digital twinning and Industry 4.0. One notable growth area is the integration of numerical, statistical, and functional analysis with probability; another is reliable computer software for implementing the resulting methods and algorithms.
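The Bayesian inverse problem setting described above can be made concrete with a minimal sketch (an illustration of ours, not part of the editorial): infer an unknown parameter θ of a forward model G(θ) from noisy observations y = G(θ) + ε, here with a toy forward model and a random-walk Metropolis sampler standing in for the computationally intensive inference the editorial mentions.

```python
import math
import random

random.seed(0)

def forward(theta):
    """Toy forward model G(theta); in applications this would be,
    e.g., an expensive PDE solver."""
    return theta ** 2

# Synthetic data: true parameter 1.5, Gaussian observation noise sigma = 0.1
true_theta, sigma = 1.5, 0.1
data = [forward(true_theta) + random.gauss(0.0, sigma) for _ in range(20)]

def log_posterior(theta):
    """Gaussian likelihood plus a standard-normal prior on theta."""
    log_prior = -0.5 * theta ** 2
    log_lik = sum(-0.5 * ((y - forward(theta)) / sigma) ** 2 for y in data)
    return log_prior + log_lik

# Random-walk Metropolis: propose theta' = theta + noise, accept with
# probability min(1, posterior ratio).
theta, chain = 1.0, []
for _ in range(5000):
    prop = theta + random.gauss(0.0, 0.05)
    if math.log(random.random()) < log_posterior(prop) - log_posterior(theta):
        theta = prop
    chain.append(theta)

burned = chain[1000:]                      # discard burn-in
posterior_mean = sum(burned) / len(burned)
print(posterior_mean)                      # concentrates near 1.5
```

Data assimilation extends exactly this computation to the online setting, updating the posterior sequentially as new observations arrive instead of processing them in one batch.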

Pure 4th paradigm activity includes the wealth of interesting problems arising from the classical machine learning and AI communities, driven largely by data generated by, or made available through, the internet, as well as problems of the data-driven approach to model discovery, where one aims to derive a model directly from the data, either without relying on existing understanding or when no such understanding exists. This very powerful mode of inference is becoming increasingly realistic in the data age. Complex data problems require an interplay between disciplines to develop new theories and address interdisciplinary questions. For example, in recent years there have been many new theoretical developments combining ideas from topology and geometry with statistical and machine learning methods, for data analysis, visualization, and dimensionality reduction. Applications range from classification and clustering in fields such as action recognition, handwriting analysis, natural language processing, and biology, to the analysis of complex systems, for example those related to national defense and sensor networks. Specifically, techniques including persistent homology and manifold learning have helped to compress nonlinear point cloud data from a new, geometrically faithful point of view. In the realm of signal analysis, classification and clustering based on geometric and topological features of the phase space of the signal have been able to identify features that traditional methods may fail to detect. These new developments should not be seen as overshadowing more classical approaches to computationally intensive Bayesian inference, optimization, and control, but need to be integrated into a holistic theory and consequent methodologies.
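To give a flavour of the persistent homology mentioned above, here is a minimal sketch (ours, not the editorial’s) of the simplest case, dimension 0: as a distance threshold grows over a point cloud, connected components merge, and each merge records a "death" time. The death times coincide with the edge lengths of a minimum spanning tree, so a union-find pass over sorted pairwise distances suffices.

```python
import math
from itertools import combinations

def deaths_0d(points):
    """Death times of 0-dimensional homology classes of a point cloud.
    One component lives forever, so len(result) == len(points) - 1."""
    parent = list(range(len(points)))

    def find(i):  # union-find with path compression
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All pairwise distances, processed in increasing order (the filtration).
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(len(points)), 2)
    )
    deaths = []
    for d, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:           # two components merge at filtration value d
            parent[ri] = rj
            deaths.append(d)
    return deaths

# Two well-separated clusters: four small death times, then one large one,
# which is the topological signature of "two clusters" in the data.
cloud = [(0, 0), (0.1, 0), (0, 0.1), (5, 5), (5.1, 5), (5, 5.1)]
print(deaths_0d(cloud))
```

Higher-dimensional persistence (loops, voids) requires building a full simplicial filtration, but the principle is the same: track when topological features are born and die as the scale parameter varies.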

The future of data science, including the developing fields of data-centric science and engineering, and the opportunities following from the implementation of methods and algorithms on future supercomputers at exascale, relies crucially on a synergistic interplay between computer science, mathematics, statistics, and the science and engineering disciplines, as well as other areas such as manufacturing and Industry 4.0. There is a wealth of opportunity to discover new scalable mathematical and statistical approaches and to expand the current foundations of data analysis. The humble objective of our new journal, Foundations of Data Science (FoDS), is to bridge a gap in the literature by providing a venue that brings together research which transcends disciplinary boundaries, in a common pursuit of methodology, theory, and applications in data science. As we have outlined with a few examples above, it is our belief that there is great potential in particular at the intersections between different existing disciplines and sub-disciplines, and these areas of intersection, which may often be overlooked as impure by other mainstream journals, will be embraced by this journal. That is, we advocate a proactive approach to contributing to the evolving definition(s) of Data Science (and AI).

We are grateful to our exceptional editorial board of data science all-stars from across engineering, statistics, mathematics, and computer science. We are also grateful to AIMS director Shouchuan Hu for his support and for believing in our leadership potential, as well as to the diligent staff at AIMS for everything they do to help streamline the process of running a journal. Most importantly, we are grateful to our authors. There is an enormous amount of work being done in this active and growing area. You have many very strong options for submitting your work, and more are emerging each year, so we very much appreciate your support of FoDS. Thank you for reading our welcome message and joining us on this adventure! We conclude with a call to action:

Foundations of Data Science invites submissions focusing on advances in mathematical, statistical, and computational methods for data science. Results should significantly advance the current understanding of data science, through algorithm development, analysis, and/or computational implementation demonstrating the behavior and applicability of the algorithm. Expository and review articles are welcome. Fields covered by the journal include, but are not limited to, Bayesian statistics, high performance computing, inverse problems, data assimilation, machine learning, optimization, topological data analysis, spatial statistics, nonparametric statistics, uncertainty quantification, and data-centric engineering. Papers which focus on applications in science and engineering are also encouraged; however, the method(s) used should be applicable outside of one specific application domain.

Ajay Jasra (Editor-in-Chief)
Kody J. H. Law (Editor-in-Chief)
Vasileios Maroulas (Editor-in-Chief)