9 May 2024¶
Presentation by TK Le¶
In this meeting TK Le gave a presentation about a series of COVID-19 modelling projects and how their experiences were impacted by the choice of programming languages, model structure, editors, tools, etc.
Attendance: 4 in person, 4 online.
Info
We welcome presentations about research projects and experiences that relate in any way to reproducibility and good computational research practices. Presentations can be short, informal, and free-form.
Please contact us if you have anything you might like to present!
Three projects, four models¶
This work began in 2022, and was based on code that was originally written by Eamon and Camelia. TK adapted the code to run new scenarios across three separate modelling projects:
- Modelling the impact of hybrid immunity on future COVID-19 waves;
- Cost-effective boosting allocations in the post-Omicron era of COVID-19 management; and
- Confidential scenario modelling for the Australian Department of Health & Aged Care.
The workflow was divided into a series of distinct models:
- An immunological model written in R and greta;
- An agent-based model of population transmission written in C++ (post-processing written in R);
- A clinical pathways model implemented in MATLAB; and
- A cost-effectiveness model implemented in R.
TK's primary activities involved implementing different vaccine schedules in the transmission model, and data visualisation of outputs from the transmission model and the clinical pathways, all of which they implemented in Python.
The multi-model structure¶
Key message
There isn't necessarily a single best way to structure a large project.
Question: Was it a benefit to have separate model implementations in different languages, with clearly defined data flows from one model to the next?
Conceptually yes, but this structure also involved a lot of trade-offs, including the sheer volume of data that had to be saved by one model and read into another. It was difficult to pick up and use the code as a whole. There were also related issues, such as getting greta to install successfully.
Nick Tierney: I know that greta can be difficult to install, and I can help people with this.
TK also noted that they found some minor inconsistencies between the models, such as whether a date was used to identify the start or end of its 24-hour interval.
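The start-versus-end ambiguity can be made concrete with a short sketch. This is not the projects' code: the function name `interval_for` and the two convention labels are hypothetical, chosen only to show that the same date label refers to intervals a full day apart under the two conventions.

```python
from datetime import date, datetime, timedelta


def interval_for(label: date, convention: str) -> tuple[datetime, datetime]:
    """Return the 24-hour interval that a date label refers to.

    convention="start": the label marks the start of the interval.
    convention="end": the label marks the end of the interval.
    (Both labels are hypothetical names for this illustration.)
    """
    midnight = datetime.combine(label, datetime.min.time())
    if convention == "start":
        return midnight, midnight + timedelta(days=1)
    if convention == "end":
        return midnight - timedelta(days=1), midnight
    raise ValueError(f"unknown convention: {convention!r}")


label = date(2022, 3, 1)
start_a, _ = interval_for(label, "start")
start_b, _ = interval_for(label, "end")
# Difference between the two interpretations of the same label.
offset = start_a - start_b
```

Stating the convention explicitly wherever dates cross from one model to the next (for example, in the name of the column or in the file's documentation) is one way to make this kind of mismatch visible early.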
Tools and platforms¶
Key message
Personal preferences can play a major role in deciding which tools are best for a project.
The various models were hosted in repositories on Bitbucket, GitHub, and the University of Melbourne's GitLab instance. TK noted that the only discernible differences between these platforms were how they handled authorisation and authentication.
TK also explored several different editors, some of which were language-specific:
- VS Code was the easiest for looking at, and searching through, the code;
- RStudio was the worst; and
- MATLAB was somewhat hard to use, but TK acknowledged they were also the least familiar with this language. They also experienced issues running MATLAB on high-performance computing platforms.
TK noted they had previous experience with Eclipse (an IDE for Java) and Visual Studio (which felt very "heavy").
Question: what were your favourite things in VS Code, and what made RStudio the worst?
It was easiest to open a project in VS Code. RStudio would always open up an entire project or previous workspace, rather than just opening a single file. RStudio also kept asking to update itself.
TK also strongly disliked the RStudio font, which was another source of friction. They tried installing an RStudio extension for VS Code, but weren't sure how well it worked.
Nick Tierney: R history is really annoying, but you can turn it off. I'm not sure why it's enabled by default; all of the RStudio recommendations involve turning off these kinds of things.
Rahmat Sagara: I'm potentially interested in using VS Code instead of RStudio.
Eamon Conway: the worst thing about VS Code is that debugging is very hard to set up.
Task management¶
Key message
Task management is hard, and switching to a new system during a project is extremely difficult.
TK reported trying to use GitLab issues to plan out what to do and how to do it, but found they weren't a good fit with their workflow. They then trialled Trello boards for different versions, but stopped using them due to a lack of time to manage the boards. In review:
- TK uses Todoist for personal stuff (and likes it), and uses TickTick for work tasks, although they find it ugly!
- Trello was their favourite so far, but plotting out an entire roadmap wasn't useful — there were major changes in the plan during the project, and it was hard to make time to update the roadmap.
- They are currently using TickTick and paper-based notes, but are always on the lookout for a better task manager.
- One advantage of online systems is that you can easily paste hyperlinks, emails, pictures, etc.
Rob Moss: we know that behaviour changes are hard to achieve, so it's not surprising that a large change was challenging to maintain — ideally we would make small, incremental changes, but this isn't always possible or useful.
Eamon Conway: I like the general idea of using task-tracking software, but I've settled on only using paper. It's always with me, it's always at home, and it's physically under my keyboard!
Ruarai Tobin: I use Notion with a single large Markdown file, you can paste screenshots.
Repository structure¶
Key message
There are many factors and competing concerns that influence the repository structure.
The repository structure changed a lot across these three projects.
In the beginning, the main challenge was separating out the different parts.
While this was achieved, it wasn't immediately obvious where a user was supposed to start — the file structure did not make it clear.
The README.md file did, however, include an explanation.
By the final project, the repository was divided into a number of directories, each of which was given a numeric prefix (e.g., 0_data and 4_post_processing).
However, this was also a little misleading:
- In order to run the code you start in the folder numbered 1;
- The post processing in 4_post_processing was interleaved between some of the other steps (it mostly contains plotting and visualisation code, but also some other stuff); and
- The structure also had to differ between running the code on Monash University's MASSIVE HPC platform, and on the University of Melbourne's Spartan HPC platform.
Question: is there an automated pipeline?
TK replied that the user had to run the code in each of the numbered folders in the correct (ascending) order, and that they wanted to make improvements for automating the dependent jobs on Spartan.
Eamon Conway: if you do ever automate it, we should share it with people (e.g., this community) because people may be able to learn from it when they want to use Spartan. I know how to use Slurm for job management and can help you automate it.
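One possible shape for this automation is sketched below, assuming that each step has its own Slurm batch script. It relies on two standard `sbatch` options: `--parsable`, which prints the job ID, and `--dependency=afterok:<jobid>`, which holds a job until the named job has finished successfully. The function names and script paths are hypothetical, and this is not the pipeline used in these projects.

```python
import subprocess
from pathlib import Path
from typing import Callable, Optional


def sbatch_command(script: Path, after: Optional[str] = None) -> list[str]:
    """Build an sbatch command; if `after` is given, wait for that job to succeed."""
    cmd = ["sbatch", "--parsable"]
    if after is not None:
        cmd.append(f"--dependency=afterok:{after}")
    cmd.append(str(script))
    return cmd


def run_sbatch(cmd: list[str]) -> str:
    """Submit a job and return its ID.

    With --parsable, sbatch prints "jobid" or "jobid;cluster".
    """
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout.strip().split(";")[0]


def submit_chain(
    scripts: list[Path],
    submit: Callable[[list[str]], str] = run_sbatch,
) -> list[str]:
    """Submit scripts in the given order, each depending on the previous job."""
    job_ids: list[str] = []
    for script in scripts:
        previous = job_ids[-1] if job_ids else None
        job_ids.append(submit(sbatch_command(script, previous)))
    return job_ids
```

On a cluster, calling `submit_chain([Path("step_a.sh"), Path("step_b.sh")])` (hypothetical script names) would queue both jobs immediately, with the second starting only if the first succeeds. Passing the scripts as an explicit list, rather than inferring the order from folder names, allows for cases where a step such as post-processing is interleaved with the others.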
Data visualisation¶
Key message
Producing good figures takes a lot of time, thought, and experimentation, and also a lot of coding.
It was extremely hard to decide what to show, and how to show it, in order to highlight key messages.
It was very easy to overwhelm the viewer with complicated plots and massive legends. For example, the scenarios involved three epidemic waves: how can you show the relationships between each pair of waves? It is relatively simple to build a 3D plot that shows these relationships, but the viewer can't really interpret the results.
Other activities¶
Key message
Following better practices would have required maybe 50% more time, but there wasn't funding — who will pay for this?
Dedicating time to other activities was not feasible. No one had time: these projects had fixed deadlines, and it was challenging to complete the work within them.
As explained above, data visualisation took longer than expected. And sometimes the code simply wouldn't run on high-performance computing platforms. For example, sometimes MATLAB just wouldn't load; there were intermittent failures for no apparent reason and with no useful error messages.
Activities that would have been nice to do, but were not undertaken, included:
- Testing (TK did not have time);
- Code review (nobody else had time); and
- Documentation (some was added afterwards, to accompany papers).
Rob Moss: we're very unlikely to get funding to explicitly cover these activities. Where possible, we need to allocate sufficient time in our budgets. Practising these skills on smaller scales can also help us to use them with less overhead in larger projects.
Version control and publishing¶
Key message
This can be challenging even with all of the tools and infrastructure in place.
Question: were all of the projects wrapped up into one central thing?
No, they're all separate. The first project was provided as a zip file attached to the paper. The second project is in a public git repository. The final project is ongoing and remains confidential; it is stored in a private repository on the University of Melbourne's GitLab instance.
Question: did the latest project build on the previous ones?
Yes, and this meant managing current and older versions of the code. For example, TK found a bug that caused a minor discrepancy between times reported in two different models (see The multi-model structure), but it wasn't possible to update the older code and regenerate the associated outputs.
Question: should we use git (and GitHub) only for publication, or during the work itself?
Eamon Conway: Use it from the beginning to track your work, and maybe have different privacy settings (confidential code and/or data).
Rob Moss: you can use a git repository for your own purposes during the work, and upload a snapshot to Figshare or Zenodo to obtain a DOI that you can cite in your paper.
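As a rough illustration of the snapshot step, the sketch below builds a `git archive` command that zips the repository contents at a tagged version, producing a file that could then be uploaded by hand. The function names, and the tag and project names in the usage note, are hypothetical. The upload itself is not shown.

```python
import subprocess


def archive_command(tag: str, project: str) -> list[str]:
    """Build a `git archive` command that zips the tree at `tag`."""
    return [
        "git", "archive",
        "--format=zip",
        f"--prefix={project}-{tag}/",    # top-level folder inside the zip
        f"--output={project}-{tag}.zip",  # file written to the current directory
        tag,
    ]


def make_snapshot(tag: str, project: str) -> None:
    """Write the zip file; must be run from inside the git repository."""
    subprocess.run(archive_command(tag, project), check=True)
```

For example, `make_snapshot("v1.0", "my-model")` would write `my-model-v1.0.zip`, assuming a tag named `v1.0` exists. Archiving a tag rather than the working directory means the snapshot matches a commit that can be identified later.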
Broader conclusions¶
Changing our behaviour and work habits is hard, and so is finding time to develop these skills. We need to practise these skills on small problems first, rather than on large projects (and definitely not when there are tight deadlines).
A question for the community
Should we organise an event to practise and develop these skills on small-scale problems?