
Reproducibility in Computational Research: Tools and Practices That Work

Concrete tools and practices for making computational research reproducible, from version control to environment management and documentation.

The Reproducibility Problem Is Real

Studies across multiple fields have shown that a significant fraction of computational results cannot be independently reproduced, even by the original authors. A 2016 Nature survey of over 1,500 researchers found that more than 70% had tried and failed to reproduce another scientist's experiments, and over half had failed to reproduce their own.

The causes are rarely fraud. They are mundane: undocumented dependencies, hardcoded file paths, forgotten preprocessing steps, random seeds that were not recorded, and software environments that shifted between runs.

The good news is that reproducible computational research is achievable with tools and practices that are already available. It requires discipline, not heroism.

The Three Pillars of Reproducibility

1. Version Control Everything

Code. Every script, notebook, and analysis program should be in a Git repository. Commit frequently with meaningful messages. Tag releases that correspond to published results.

What to commit:

  • Analysis scripts and programs
  • Configuration files and parameters
  • Pipeline definitions (Makefiles, Snakemake rules, Nextflow scripts)
  • Environment specifications (requirements.txt, environment.yml, Dockerfiles)
  • Documentation (README files, method descriptions)

What not to commit:

  • Large data files (use Git LFS, DVC, or external repositories)
  • Credentials and API keys
  • Generated outputs (these should be regenerable from the committed code and data)
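The "what not to commit" list can be enforced mechanically with a .gitignore file. A minimal sketch, with directory names that are illustrative rather than prescriptive:

```
# Generated outputs -- regenerable from committed code and data
results/
data/processed/

# Credentials and secrets -- never commit these
.env
*.pem

# Large raw data -- tracked with DVC or Git LFS instead
data/raw/
```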

Data. For datasets too large for Git, use data versioning tools:

  • DVC (Data Version Control) tracks data files alongside code in Git, storing the actual data in remote storage (S3, GCS, SSH)
  • Git LFS handles moderately large files (up to a few GB) within the Git workflow
  • Data repositories (Zenodo, Figshare, Dryad) with DOIs for published datasets
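The DVC workflow can be sketched in a few commands (the file and bucket names here are hypothetical):

```shell
# One-time setup inside an existing Git repository
dvc init
dvc remote add -d storage s3://my-bucket/dvc-store   # hypothetical bucket

# Track a large file: DVC writes a small .dvc pointer file for Git
# and moves the data itself into DVC's cache
dvc add data/raw/measurements.csv                    # hypothetical file
git add data/raw/measurements.csv.dvc data/raw/.gitignore
git commit -m "Track raw measurements with DVC"

# Upload the data to remote storage; collaborators run "dvc pull"
dvc push
```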

Environment. Record your software environment explicitly (more on this below).

2. Manage Your Environment

The most common reproducibility failure: "It works on my machine." Software environments drift over time. A package update changes a default parameter. A system library version affects numerical precision. An operating system upgrade breaks a compiled dependency.

Environment specification files capture what is installed:

  • requirements.txt or pyproject.toml for Python (pin versions: numpy==1.24.3, not just numpy)
  • renv.lock for R (renv captures the exact package versions)
  • environment.yml for Conda environments
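As an example, a pinned Conda environment.yml might look like the following sketch (the project name and package list are illustrative; the versions shown are ones mentioned elsewhere in this article or known stable releases):

```yaml
name: my-project
channels:
  - conda-forge
dependencies:
  - python=3.11.4      # pin the interpreter version too
  - numpy=1.24.3       # exact pins, not version ranges
  - pandas=2.0.3
```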

Containers go further by capturing the entire operating system environment:

  • Write a Dockerfile that builds your analysis environment from a known base image
  • Pin the base image version (FROM python:3.11.4-slim, not FROM python:latest)
  • Install dependencies from your pinned specification files
  • Test that your analysis runs inside the container
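The four steps above can be sketched as a Dockerfile (the pipeline command at the end is illustrative, assuming a Snakemake-based project):

```dockerfile
# Pinned base image: a specific tag, never "latest"
FROM python:3.11.4-slim

WORKDIR /app

# Install dependencies from the pinned specification first, so this
# layer is cached until requirements.txt actually changes
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the analysis code itself
COPY src/ src/
COPY Snakefile config.yaml ./

# Default command runs the full pipeline
CMD ["snakemake", "--cores", "all"]
```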

Virtual machines provide the most complete isolation but are heavyweight; they are most useful for archiving an exact environment for long-term reproducibility.

In practice, the sweet spot is containers plus pinned dependencies. This covers the vast majority of reproducibility failures without excessive overhead.

3. Automate the Workflow

If reproducing your results requires a human to remember and execute steps in the right order, reproducibility depends on that human's memory and attention. Automate instead.

Workflow managers (Make, Snakemake, Nextflow) define the steps, their dependencies, and how to execute them. Running snakemake or make all regenerates all results from raw data.
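A minimal Snakefile sketch illustrates the idea (rule names, scripts, and paths are hypothetical). Snakemake infers the dependency graph from matching input and output declarations, so it reruns only the steps whose inputs have changed:

```
# Target rule: the final outputs the pipeline should produce
rule all:
    input:
        "results/summary.csv"

rule preprocess:
    input:
        "data/raw/measurements.csv"
    output:
        "data/processed/clean.csv"
    shell:
        "python src/preprocess.py {input} {output}"

rule summarize:
    input:
        "data/processed/clean.csv"
    output:
        "results/summary.csv"
    shell:
        "python src/summarize.py {input} {output}"
```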

Continuous integration (GitHub Actions, GitLab CI) can automatically run your analysis pipeline when code changes, catching reproducibility breakdowns before they accumulate.

Literate programming (Jupyter Notebooks, R Markdown, Quarto) interleaves code, documentation, and results in a single document. This is excellent for analyses that need narrative explanation, but beware: notebooks can be executed out of order, creating hidden state. Use tools like nbstripout to clean notebook outputs before committing, and always verify that the notebook runs cleanly from top to bottom.
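Both notebook hygiene steps can be scripted rather than remembered (the notebook filename is hypothetical):

```shell
# Strip outputs automatically on commit (one-time setup per repository)
pip install nbstripout
nbstripout --install

# Verify the notebook runs cleanly from top to bottom before committing
jupyter nbconvert --to notebook --execute analysis.ipynb --output analysis.ipynb
```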

Practical Workflow

Here is a concrete workflow for reproducible computational research:

Project Setup

my-project/
    README.md                  # Project description, how to reproduce
    LICENSE                    # Data and code licensing
    Dockerfile                 # Environment definition
    Snakefile                  # Pipeline definition
    config.yaml                # Analysis parameters
    data/
        raw/                   # Immutable raw data (or DVC-tracked pointers)
        processed/             # Generated, gitignored
    src/                       # Analysis source code
    results/                   # Generated outputs, gitignored
    docs/                      # Methodology documentation

Development Loop

  1. Write or modify analysis code in src/
  2. Update Snakefile if the pipeline structure changes
  3. Update config.yaml if parameters change
  4. Run the pipeline: snakemake --use-singularity (or Docker)
  5. Inspect results
  6. Commit code, config, and pipeline changes to Git
  7. Track data changes with DVC if applicable

Before Publication

  1. Clean up code: remove dead code, add comments, ensure consistent style
  2. Verify the pipeline runs from scratch in a clean environment (fresh container, no cached intermediates)
  3. Write a comprehensive README with step-by-step reproduction instructions
  4. Archive the code and environment to a persistent repository (Zenodo for DOI assignment)
  5. Archive data to an appropriate repository
  6. Link code, data, and publication through DOIs
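Step 2, the from-scratch verification, can be as simple as rebuilding the container and rerunning the pipeline with no cached state (the image name is illustrative):

```shell
# Build the environment from the pinned Dockerfile, then run the full
# pipeline inside it with no cached intermediates
docker build -t my-analysis .
docker run --rm my-analysis
```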

Common Pitfalls

Random seeds. Any analysis involving randomness (simulations, bootstrapping, train/test splits) must record and set random seeds. Document them in the configuration file.
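A minimal sketch of this practice in Python, using only the standard library: the seed lives in a config dictionary (standing in for config.yaml) rather than being hardcoded at call sites, and a local random number generator avoids hidden global state. The function and parameter names are illustrative:

```python
import random

# Hypothetical config; in practice this would be loaded from config.yaml
CONFIG = {"seed": 42, "n_bootstrap": 3}

def bootstrap_means(data, config):
    """Draw bootstrap resamples with an explicit, recorded seed."""
    rng = random.Random(config["seed"])  # local RNG: no hidden global state
    means = []
    for _ in range(config["n_bootstrap"]):
        sample = rng.choices(data, k=len(data))
        means.append(sum(sample) / len(sample))
    return means

data = [1.0, 2.0, 3.0, 4.0, 5.0]
run1 = bootstrap_means(data, CONFIG)
run2 = bootstrap_means(data, CONFIG)
assert run1 == run2  # same seed, same results -- every run
```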

Floating-point non-determinism. Parallel computation can produce slightly different results across runs due to floating-point arithmetic ordering. Document expected precision and use tolerance-based comparisons in tests.
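A small illustration of why tolerance-based comparisons matter: summing the same numbers in a different order, as a parallel reduction might, can produce bitwise-different results that an exact equality test would flag as irreproducible:

```python
import math

# The same three numbers summed in different orders give bitwise-different
# floats, as can happen when a parallel reduction combines partial sums
a = 0.1 + 0.2 + 0.3   # evaluates left to right
b = 0.3 + 0.2 + 0.1   # reversed order

print(a == b)                             # exact comparison fails
print(math.isclose(a, b, rel_tol=1e-9))   # tolerance-based comparison passes
```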

Hidden dependencies. System libraries, environment variables, and filesystem state can affect results without appearing in your dependency specifications. Containers catch most of these.

Manual steps. "Then I manually adjusted the color scale in the figure" breaks reproducibility. Script everything, including figure generation.

Key takeaway: Reproducibility is not about perfection. It is about giving someone else (including future you) a reasonable chance of getting the same results. Version control your code and data. Specify your environment. Automate your workflow. Document what you did and why. These practices take effort upfront but save far more time in the long run.
