
11 April 2024

In this meeting we asked participants to suggest specific tools, templates, packages, etc, that we should include in our Orientation guide. We used the high-level topics proposed in our previous meeting as a starting point.

Attendance: 7 in person, 9 online.

Git is my lab book updates

We have switched to Material for MkDocs, which greatly improves the user experience.

For example, use the search bar above (press F) to interactively search the entire website.

Purpose of the guide

We are aiming to keep the orientation guide short and simple, to avoid overwhelming newcomers.

James Ong: If we can agree on a structure, we can then get people to contribute to specific sections.

TK Le: could we schedule a one-hour meeting where everyone works on writing content for 30 minutes, and then reviews each other's content for the remaining 30 minutes?

Project organisation

Key message

A project's structure may need to change over time, and that's fine. What matters is that the structure is explained.

A common theme raised by a number of participants was deciding how to organise your files: dividing a project into cohesive parts, and explaining the relationships between these parts (i.e., how they interact or come together).

Useful tools mentioned in this conversation included:

  • Using git to track your files;

  • Using tools such as the here package for R to locate your files without resorting to hard-coded paths or changing the working directory;

  • Defining your project's workflow with a pipeline tool such as the targets package for R (see the sketch below).
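
As a minimal sketch of how these tools might fit together (the file names, target names, and the region column below are hypothetical):

```r
# In an analysis script: here::here() builds paths from the project root,
# so the script works regardless of the current working directory.
raw_cases <- read.csv(here::here("data", "cases.csv"))

# In _targets.R: a minimal pipeline with three targets.
library(targets)
tar_option_set(packages = c("readr", "dplyr"))

list(
  tar_target(cases_file, here::here("data", "cases.csv"), format = "file"),
  tar_target(cases, readr::read_csv(cases_file)),
  tar_target(case_counts, dplyr::count(cases, region))
)
```

Running targets::tar_make() builds this pipeline, and re-running it only rebuilds the targets whose inputs have changed.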

Info

We are drafting topical guides on these topics. See the online previews for the following guides:

If you have any suggestions or feedback, please let us know in the pull request!

Working collaboratively

Key message

Plan to work collaboratively from the outset. It is highly likely that someone else will end up using your code.

Nick Tierney: you are always collaborating with your future self!

One concern raised was how best to prepare your code for handover.

  • Several people reflected on their experiences of coming to terms with an existing project;
  • A common theme was the difficulty in understanding how the project was organised;
  • Even if the project structure is logical, it needs to be clearly explained in order for new users to understand it.

Pan: You need to think about this from the beginning, because more and more people will be trying to use existing models. I am writing a guideline about vaccination modelling, and I refer to readers as the "model user" (developers, modellers, end users). If we plan for others to use our model, we need to develop it in a way that makes it easier for people to use.

Reminder

We have developed a topical guide on how to use git for collaborative research.

Reviewing code and project structure

Key message

Feedback from other people can be extremely useful to identify sources of confusion.

The earlier that someone can review your code and project structure, the easier it should be to act on their feedback.

Saras Windecker mentioned that the Infectious Disease Ecology and Modelling (IDEM) team organised code review sessions that the team found really informative, but she also reminded everyone how hard it is to write guidelines that are consistent and broadly useful.

Question: was the purpose of these sessions to review code, or to review the project structure?

They were intended to review code, but team members found they had to review the project structure before they could begin to understand and improve the code.

Question: What materials, inputs, resources, etc., can we provide to people who are dealing with messy code?

Rob Moss reflected on his experience of picking up a within-host malaria RBC spleen model and how difficult it was to identify which parts of the model were complete and which needed further work. He gradually divided the code into separate files, and regularly ran model simulations to verify that the outputs were consistent with the original code.

Info

Rob is happy to share the original model code, discuss how it was difficult to understand, and to walk through how he restructured and documented it. If you're interested, send him an email.

How to structure your data

Key message

When data are shared, they often lack the documentation needed to make them easy to reuse.

Nick Tierney asked whether anyone had thoughts on how to structure their data. Consistent with our earlier discussion, he suggested that one of the most important things is to have a README that explains the project structure. He then shared one of his recent papers, Common-sense approaches to sharing tabular data alongside publication.

Question: do you advocate for data to be tidied (e.g., in "long" format)?

  • If you can share the raw data, then you should definitely do so, in addition to providing the analysed data (i.e., after cleaning and processing).
  • Tidy data can be part of the cleaning and processing, but doesn't have to be.
  • It's more about explaining what the original data were, and how they were cleaned, processed, and analysed.
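
As a small, hypothetical illustration of the "long" (tidy) layout mentioned above, the sketch below reshapes a wide table of case counts so that each row is a single region-year observation:

```r
library(tidyr)

# A hypothetical "wide" table: one column of case counts per year.
cases_wide <- data.frame(
  region = c("North", "South"),
  `2022` = c(120, 85),
  `2023` = c(140, 97),
  check.names = FALSE
)

# Reshape to a "long" (tidy) layout: one row per region-year observation.
cases_long <- pivot_longer(
  cases_wide,
  cols = c(`2022`, `2023`),
  names_to = "year",
  values_to = "cases"
)
```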

Managing confidential data

Key message

There are various ways to manage confidential data, each with pros and cons.

Michael Plank asked for ideas about how to manage confidential data when working with collaborators, to ensure that everyone is using the most recent version of the data. Obviously you don't want to commit the data files to your project repository, so the data files should be listed in your .gitignore file.

  • One option is to store the confidential data in a separate git repository, with tightly controlled access permissions. You can keep a local copy in a separate directory. If you create a symbolic link to this directory inside your project repository, and add this symbolic link to your .gitignore file, you can still use tools such as here to access the data (see the sketch after this list).

  • A second option, which was used by a number of teams that worked on COVID-19 projects for the Australian Department of Health, was to host the data on a secure data management platform (mediaflux). Every time new data were received, the data management groups would perform various quality checks and generate analysis-ready data files. They would then notify all of the teams, who would each download the latest data files as part of their computational pipeline.
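
A minimal sketch of the symbolic-link approach described in the first option above (all paths and file names here are hypothetical):

```r
# One-time setup: link the separately-cloned confidential data repository
# into the project, and make sure git never tracks the link.
file.symlink("~/confidential-data", "data-private")
cat("data-private\n", file = ".gitignore", append = TRUE)

# In analysis scripts, here::here() resolves paths from the project root,
# so the symlinked data can be read like any other project file.
linelist <- read.csv(here::here("data-private", "linelist.csv"))
```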

The most suitable solution probably depends on a combination of factors, including:

  • The infrastructure available to the lead researcher;
  • The technical skills of each collaborator; and
  • The data storage requirements of each data custodian.

Debugging

Key message

Debugging is an important skill, but good coding practices also make your code easier to test and debug.

A number of people suggested that the orientation guide should provide some information about how to debug your code.

Nick Tierney: I could go on a long rant about debugging, and why we should be teaching how to divide code into small functions that are easier to test!

We also discussed that there are various ways to debug your code, from printing debugging messages, to using an interactive debugger, to writing test cases that explain how the code should work.
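
As a brief illustration of these three approaches in R (the growth_rate() function below is invented for this example):

```r
growth_rate <- function(cases) {
  # Print-style debugging: report what the function received.
  message("growth_rate() called with ", length(cases), " observations")
  if (any(cases <= 0)) {
    # Interactive debugging: pause here and inspect the environment.
    browser()
  }
  diff(log(cases))
}

# A test case that documents how the function should behave.
stopifnot(isTRUE(all.equal(growth_rate(c(100, 110, 121)), c(log(1.1), log(1.1)))))
```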

Rob Moss: I've used regression tests and continuous integration (CI) to detect when I accidentally change the outputs of my code/model. For example, my SMC package for Python includes test cases that record simulation outputs, and after running the tests I check that the output files remain unchanged.
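
A minimal sketch of this kind of regression test, written here in R with testthat (Rob's package is in Python; run_simulation() and the reference file below are hypothetical):

```r
library(testthat)

test_that("simulation outputs are unchanged", {
  # Run the model and compare against previously recorded outputs.
  output <- run_simulation(seed = 42)
  reference <- read.csv(test_path("reference_output.csv"))
  expect_equal(output, reference)
})
```

If the outputs change for a legitimate reason, the reference file can be regenerated and the change reviewed before it is committed.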

Guidelines for using AI

Key message

Practices such as code review and testing are even more important for code that is created using generative AI.

Pan: speaking for junior students, a lot of students are now using ChatGPT for their coding, either to create the initial structure or to translate code from one programming language to another.

Question: Can we provide useful guidelines for those students?

James: this probably isn't something we will cover in the orientation guide. But perhaps we need some guidelines for generative AI use in the topical guides.

Testing your code and ensuring it is reproducible is even more important when using ChatGPT. We know it doesn't always give you correct code, so how can you decide whether what it's given you is useful? It would be great to have an example of code generated by ChatGPT that is incorrect or unnecessarily difficult to understand, and to show how we can improve that code.

A question for the community

Does anyone have any examples of code produced by ChatGPT that didn't function as intended?

Useful resources

The following resources were mentioned in this meeting: