Reproducible Research for Pattern Recognition

Posted on Wed 22 July 2015 in courses

One of the key aspects of modern technological research lies in the use of personal computers (PCs), either for the simulation of known phenomena or for the evaluation of data collected from natural observations. Summaries of these data, organized in tables and figures, are attached to textual descriptions, leading to scientific publications. In current practice, the data sets, code and actionable software leading to those results are left out when articles are recorded and preserved. This situation slows down potential scientific development in at least two major ways: (1) re-using ideas from different sources normally implies re-developing the software that led to the original results, and (2) the reviewing process of candidate ideas is based on trust rather than on hard, verifiable evidence that can be thoroughly analyzed.

In this course, I introduce the concept of Reproducible Research (RR), a term that labels scientific work that provides not only a description of the effort leading to the stated conclusions, but also points to the data, software and instructions that allow readers to reproduce the author's results locally, with all required details and in a very short time. The promised gains of RR are substantial, but they do not come without a cost: in order to boost reproducibility, researchers need to (re-)organize themselves so that they are always doing RR. This course will walk students through tools and practical exercises in order to implement RR in their daily activities.
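
As a rough illustration of what "data, software and instructions" can mean in practice, the sketch below runs a toy experiment from a fixed seed and records a fingerprint of the data next to the result. It is only a minimal sketch under stated assumptions: the data is synthetic, the function names (`make_dataset`, `fingerprint`, `run_experiment`) are hypothetical, and it uses only the Python standard library rather than any tool covered in the course.

```python
import hashlib
import json
import platform
import random


def make_dataset(seed, n=200):
    """Generate a toy two-class, one-dimensional dataset from a fixed seed."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        label = rng.randint(0, 1)
        # Class 0 is centred on 0.0, class 1 on 2.0 (arbitrary toy values).
        value = rng.gauss(2.0 * label, 1.0)
        samples.append((value, label))
    return samples


def fingerprint(samples):
    """Hash the data so a reader can check they are using the same inputs."""
    payload = json.dumps(samples).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def run_experiment(seed):
    """Fit a nearest-mean classifier on one half, score it on the other."""
    data = make_dataset(seed)
    train, test = data[:100], data[100:]

    # Estimate one mean per class from the training half.
    means = {}
    for label in (0, 1):
        values = [v for v, l in train if l == label]
        means[label] = sum(values) / len(values)

    # Assign each test sample to the class with the closest mean.
    correct = sum(
        1 for v, l in test
        if min(means, key=lambda c: abs(v - means[c])) == l
    )

    # Everything needed to re-run and compare is kept with the result.
    return {
        "seed": seed,
        "data_sha256": fingerprint(data),
        "accuracy": correct / len(test),
        "python": platform.python_version(),
    }


if __name__ == "__main__":
    print(json.dumps(run_experiment(42), indent=2))
```

Running the script twice with the same seed should give identical records, which is the property a reader would rely on when checking a published number.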

Finally, I introduce students to the BEAT Platform: a web-based system for Reproducible Research. BEAT provides an all-in-one experience in RR: tools to graphically create workflows, write algorithms, and run, log and search for results in a socially interactive way. All the complexity of RR and computation is hidden behind an easy-to-use graphical web interface. Experiments designed inside the platform can be easily shared and reproduced in a matter of seconds.

Topics and Outline

The time spent on each topic will depend on student motivation and discussions. The minimum course time is ~10 hours. If required, the course can be given in two days (each with at least 5 hours of course time).

  • Introduction and Some Programming Background
    • The need for reproducibility
    • Database and protocols: how to do it
    • Tools for RR in the wild
  • Python and Bob
    • Building database packages (encoding protocols)
    • Using Python and Bob for basic Machine Learning
    • Putting it all together
  • Going social (BEAT platform)
    • The requirement for a web-based RR tool
    • The BEAT platform
    • Adapting your workflow to the platform
    • From running experiments to publication preparation using only a web-browser

Course Requirements

Participants should understand the basics of Pattern Recognition, Machine Learning and programming. Knowing the Python programming language is a plus. The following resources may be useful:

  • Theoretical:
    • Machine Learning Basics (e.g. Bishop's book)
  • Practical:
    • Dive into Python (free tutorial)
    • Numerical and Scientific Programming in Python (numpy, scipy)
    • Bob framework for Signal Processing, Machine Learning and Biometrics

Material

  • Syllabus: essentially, a copy of the above in LaTeX
  • Slides: from the last iteration of the course