18 October 2023

In this meeting we asked participants to share their experiences about good (and bad) ways to structure a project.

Info

We are currently drafting Project structure and Writing code guidelines.

See the pull request for further details. Please contribute suggestions!

We had six in-person and eight online attendees. Everyone predominantly uses one or more of the following languages:

  • Matlab;
  • Python; and
  • R.

Naming files

The tidyverse style guide includes recommendations for naming files. One interesting recommendation in this guide is:

  • If files should be run in a particular order, prefix each file name with a number. For example:

    00_download.R
    01_clean.R
    02_summarise.R
    ...
    09_plot_figures.R
    10_generate_tables.R
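One practical reason for zero-padding the prefix is that most tools list file names in plain string order. The short Python sketch below illustrates this with the file names above; the unpadded names are hypothetical and included only for comparison.

```python
# File names from the example above, deliberately shuffled.
padded = [
    "10_generate_tables.R",
    "02_summarise.R",
    "00_download.R",
    "09_plot_figures.R",
    "01_clean.R",
]

# Hypothetical names without zero-padding, for comparison.
unpadded = ["10_generate_tables.R", "2_summarise.R", "9_plot_figures.R"]

# Plain string sorting keeps the zero-padded names in run order ...
padded_sorted = sorted(padded)

# ... but places "10_..." before "2_..." when the padding is missing.
unpadded_sorted = sorted(unpadded)

print(padded_sorted)
print(unpadded_sorted)
```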
    

Choosing a directory structure

A common starting point is one or more scripts in the root directory. But we can usually divide a project into several distinct steps or stages, and store the files necessary for each stage in a separate sub-directory.

Tip

Your project structure may change as the project develops. That's okay!

You might, e.g., realise that some files should be moved to a new, or different, sub-directory.

Packaging: Python and R allow you to bundle multiple code files into a "package". This makes it easier to use code that is split up into multiple files. It also makes it simpler to test and verify whether your code can be run on a different computer. To create a package, you need to provide some metadata, including a list of dependencies (packages or libraries that your code needs in order to run). When you install a Python or R package, the necessary dependencies are installed automatically too. You can test this out on, e.g., a virtual machine to verify that you've correctly listed all of the necessary dependencies.
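For Python, one way to see which dependencies an installed package declares in its metadata is the standard library's importlib.metadata module. The sketch below is only an illustration; the helper name declared_dependencies is hypothetical, and "pip" is used simply because it is often installed.

```python
from importlib.metadata import PackageNotFoundError, requires


def declared_dependencies(package_name):
    """Return the dependencies that an installed package declares.

    Returns None if the package is not installed, and an empty list if
    the package is installed but declares no dependencies.
    """
    try:
        dependencies = requires(package_name)
    except PackageNotFoundError:
        return None
    # requires() returns None when no dependencies are declared.
    return list(dependencies or [])


# Example usage; the result depends on what is installed locally.
print(declared_dependencies("pip"))
```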

Version control: the history of your repository may be extremely useful for you, but may contain things you don't want to make publicly available. One solution would be to know from the very start which files you will want to make available and which files you will not (e.g., sensitive data files), but this is not always possible. Another, more realistic, solution is to create a new repository, copy over all of the files that you want to make available, and record these files in a single commit. The public repository will not share the history of your project repository, and that's okay: the public repository's primary purpose is to serve as a snapshot, rather than a complete and unedited history.
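The "copy over the files you want to make available" step can be sketched in Python as follows. This is a minimal illustration, not a recommended tool: the function name copy_public_files and all file names are hypothetical, and the demonstration runs in a temporary directory.

```python
import shutil
import tempfile
from pathlib import Path


def copy_public_files(project_dir, public_dir, public_files):
    """Copy only the listed files (paths relative to project_dir)."""
    project_dir = Path(project_dir)
    public_dir = Path(public_dir)
    for relative_path in public_files:
        source = project_dir / relative_path
        target = public_dir / relative_path
        # Recreate the sub-directory structure in the public copy.
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(source, target)


# Demonstration with made-up files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp) / "project"
    (project / "code").mkdir(parents=True)
    (project / "code" / "analysis.py").write_text("print('hello')\n")
    (project / "sensitive-data.csv").write_text("id,value\n1,2\n")

    public = Path(tmp) / "public"
    copy_public_files(project, public, ["code/analysis.py"])

    # List what ended up in the public copy.
    snapshot_files = sorted(
        p.relative_to(public).as_posix()
        for p in public.rglob("*")
        if p.is_file()
    )

print(snapshot_files)
```

After copying, the new directory can be turned into the public repository by running git init, git add, and git commit inside it, which records the snapshot as a single commit.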

Locating files

A common concern is how to locate files in different sub-directories (e.g., loading code, reading data files, writing output files) without relying on absolute paths. For loading code, Python and Matlab allow the user to add directories to the search path (e.g., by modifying sys.path in Python, or calling addpath() in Matlab). But these are not ideal solutions.

  • As a general rule, prefer using relative paths instead of absolute paths.

  • Relative paths are defined relative to the current working directory. For example: sub-directory/file-name and ../other-directory.

  • Absolute paths are defined relative to the root drive or directory. For example: /Users/My Name/... and C:\Users\My Name\....

Absolute paths may not exist on other people's computers.

  • For R, the here package allows you to construct file paths relative to the top-level project directory. For example, if you have a data file in project/input-data/file-1.csv and a script file in project/analysis-code/read-input-data.R, you can locate the data file from within the script with the following code:

    library(here)
    data_file <- here("input-data/file-1.csv")
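For Python, a similar effect can be sketched with the standard library's pathlib module by walking up from the script's directory until a marker file is found. This is a rough analogue, not how the here package itself is implemented; the function name find_project_root and the choice of README.md as the marker are assumptions made for this example.

```python
import tempfile
from pathlib import Path


def find_project_root(start, marker="README.md"):
    """Walk upwards from `start` until a directory contains `marker`."""
    start = Path(start).resolve()
    for directory in [start, *start.parents]:
        if (directory / marker).exists():
            return directory
    raise FileNotFoundError(f"No {marker!r} found in or above {start}")


# Demonstration: build the example project in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp).resolve() / "project"
    (project / "analysis-code").mkdir(parents=True)
    (project / "input-data").mkdir()
    (project / "README.md").write_text("Example project\n")
    (project / "input-data" / "file-1.csv").write_text("x\n1\n")

    # In a real script, the starting point would be Path(__file__).parent.
    script_dir = project / "analysis-code"
    root = find_project_root(script_dir)
    data_file = root / "input-data" / "file-1.csv"

    found_root_name = root.name
    data_file_exists = data_file.exists()

print(found_root_name, data_file_exists)
```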

Tip

A general solution for any programming language is to break your code into functions, each of which accepts input and/or output file names as arguments (when required). This means that most of your code is entirely independent of your chosen project structure. You can then store/generate all of the file paths in a single file, or in each of your top-level scripts.
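As a minimal Python sketch of this idea, the function below knows nothing about the project structure; only the top-level code chooses the paths. The function name summarise, the "value" column, and the directory names are hypothetical.

```python
import csv
import tempfile
from pathlib import Path


def summarise(input_file, output_file):
    """Read the 'value' column of input_file and write its mean."""
    with open(input_file, newline="") as f:
        values = [float(row["value"]) for row in csv.DictReader(f)]
    mean = sum(values) / len(values)
    with open(output_file, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["mean"])
        writer.writerow([mean])
    return mean


# Top-level script: the only place that defines file paths.
# A temporary directory stands in for the project directory here.
with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp)
    input_file = project / "input-data" / "file-1.csv"
    output_file = project / "outputs" / "summary.csv"
    input_file.parent.mkdir()
    output_file.parent.mkdir()
    input_file.write_text("value\n1\n2\n3\n")

    mean_value = summarise(input_file, output_file)
    summary_text = output_file.read_text()

print(mean_value)
```

If the project structure changes later, only the paths in the top-level script need to be updated; summarise() stays the same.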

Peer review: get feedback on project structure

It can be helpful to get feedback from someone who isn't directly involved in the project. They may view the work from a fresh perspective, and be able to identify aspects that are confusing or unclear.

When inviting someone to review your work, you should identify specific questions or tasks that you would like the reviewer to address.

With respect to project structure, you may want to ask the reviewer to address questions such as:

  • Do the project directories suggest a clear structure or workflow?
  • Does each directory contain files that are clearly related to each other?
  • Do the names of each directory and each file seem reasonable?
  • Are there any files that you would consider renaming or moving?
  • Does the README.md file help you to navigate the project?

You could also ask the reviewer to look at a specific script or code file, and ask questions such as:

  • Should this code be divided into smaller functions?
  • Should this code be divided into multiple files?

Info

For further ideas about useful peer review activities, and how to incorporate them into your workflow, see the following paper:

Implementing code review in the scientific workflow: Insights from ecology and evolutionary biology, Ivimey-Cook et al., Journal of Evolutionary Biology 36(10):1347–1356, 2023.

Styling and formatting

We also discussed opinions about how to name functions, variables, files, etc.

For example, R allows you to use periods (.) in function and variable names, but the tidyverse style guide recommends only using lowercase letters, numbers, and underscores (_).
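A naming rule like this is simple enough to check automatically. The Python sketch below tests whether a name uses only lowercase letters, numbers, and underscores; requiring the first character to be a letter is an extra assumption of this sketch, and the example names are illustrative.

```python
import re

# Lowercase letters, numbers, and underscores only.
# Assumption of this sketch: the name must start with a letter.
ALLOWED_NAME = re.compile(r"[a-z][a-z0-9_]*")


def follows_naming_rule(name):
    """Return True if the whole name matches the allowed pattern."""
    return ALLOWED_NAME.fullmatch(name) is not None


names = ["day_one", "day_1", "DayOne", "day.one"]
results = {name: follows_naming_rule(name) for name in names}

for name, ok in results.items():
    print(name, "ok" if ok else "does not follow the rule")
```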

If you review other people's code, and have other people review your code, you might be surprised by the different styles and conventions that people use. When reviewing code, these differences can be somewhat distracting.

  • Agreeing on, and adhering to, a common style guide can avoid these issues and allow the reviewer to dedicate their attention to actually reading and reasoning about the code.

  • There are tools to automatically format your code ("formatters") and to warn about potential issues, such as unused variables ("linters"). Here are some commonly-used formatters and linters for different languages:

    Language  Style guide(s)                        Formatter          Linter
    R         tidyverse                             styler             lintr
    Python    PEP 8 / The Hitchhiker's Style Guide  black              ruff
    Julia     style guide                           JuliaFormatter.jl  Lint.jl

AI tools for writing and reviewing code

There are AI tools that you can use to write, format, and review code, but you will need to check whether the code is correct. For example, GitHub Copilot is a (commercial) tool that accepts natural-language descriptions and generates computer code.

Tip

Feel free to use AI tools as a way to get started, but don't simply copy-and-paste the code they give you without reviewing it.