Brad Groff writes:
Thought you might find this post by Ferenc Huszar interesting. Commentary on how we create knowledge in machine learning research and how we resolve benchmark results with (belated) theory. Key passage:
You can think of “making a a deep learning method work on a dataset” as a statistical test. I would argue that the statistical power of experiments is very weak. We do a lot of things like early stopping, manual tweaking of hyperparameters, running multiple experiments and only reporting the best results. We probably all know we should not be doing these things when testing hypotheses. Yet, these practices are considered fine when reporting empirical results in ML papers. Many go on and consider these reported empirical results as “strong empirical evidence” in favour of a method.
Huszar’s post is interesting, and also interesting is this linked post by Yann LeCun. My only comment is that we have many different ways of deciding what methods to use, including:
– Mathematical theory,
– Computer simulations,
– Solutions to toy problems,
– Improved performance on benchmark problems,
– Cross-validation and external validation of predictions,
– Success as recognized in a field of application,
– Success in the marketplace.
None of these is enough on its own. Theory and simulations are only as good as their assumptions; results from toy problems and benchmarks don’t necessarily generalize to applications of interest; cross-validation and external validation can work for some sorts of predictions but not others; and subject-matter experts and paying customers can be fooled.
The very imperfections of each of these sorts of evidence gives a clue as to why it makes sense to care about all of them. We can’t know for sure so it makes sense to have many ways of knowing.
For further discussion of this point, see section 26.2 of this article.
The post Ways of knowing in computer science and statistics appeared first on Statistical Modeling, Causal Inference, and Social Science.