8 August 2024¶
Orientation guide¶
Key message
The aim for the Orientation Guide is to provide a short overview of import concepts, useful tools and resources, and links to relevant tutorials and examples.
In this meeting we discussed how the Orientation Guide could best address the needs of new students and staff. We began by asking participants what skills, tools, and knowledge they've found to be particularly useful, and wish they'd discovered earlier.
Attendance: 5 in person, 2 online.
Core tools and recommended packages¶
Key message
There was strong interest in having opinionated recommendations for helpful software packages and libraries, based on our own experiences.
When we start out, we typically don't know what tools are available and how to choose between them. So having guidance and recommendations from more experienced members of our community can be valuable.
This harks back to TK's presentation and their reflections on choosing the best tools for a specific project or task. For example:
Question
Which editor should a new student use for their project?
We strongly recommend choosing an editor that can automatically format and check your code.
- For R, the best choice is almost surely RStudio.
- For Python there isn't necessarily a single best choice, but good options include PyCharm and Spyder.
- The orientation guide should list editors that are popular within our community.
Eamon suggested that in addition to linking to tutorials for installing common tools such as Python and R, the orientation guide should recommend helpful packages. For example:
-
For R, the
tidyr
package and the broader collection of "tidyverse" packages. -
For Python, Conda is probably the easiest way to install Python and scientific/numeric Python libraries on Windows and OS X (it also supports Linux).
Jacob: it would be nice to have a flowchart or diagram to help identify relevant tools and packages. For example, if you want to (a) analyse tabular data; and (b) use Python; then what package would you recommend? (Our answer was Polars).
Reproducible environments¶
Key message
Virtual environments allow you to install the packages that are required for a specific project, without affecting other projects.
This is useful for a number of reasons, including:
-
Being able to manage each project independently and in isolation;
-
Being able to use different versions of packages in different projects; and
-
Making it simpler to set up and run a project on a new computer.
Python provides built-in tools for virtual environments, and the University of Melbourne's Python Digital Skills Training includes a workshop on Python Virtual Environments.
For R, the renv
packages provides similar capabilities, and the Introduction to renv article provides a good overview.
Reproducible documents and notebooks¶
Key message
Reproducible document formats such as RMarkdown (for R) and Jupyter Notebooks (for Python) provide a way to combine code, data, text, and figures in self-contained and reproducible plain-text files.
For introductions and examples, see:
-
Nick Tierney's RMarkdown for Scientists;
-
The Jupyter Notebook documentation; and
-
The Quarto publishing system is similar to RMarkdown, but allows you to write code in Python, R, and/or Julia, and supports many different output formats.
If you use VS Code to write Quarto documents, when you edit a code block it will open it in Jupyter (for Python code) and this allows you to step through and debug the code to some degree.
Existing training courses and needs¶
Key message
There will be an Introduction to Polars workshop at the SPECTRUM 2024 annual meeting (23-25 September), led by Nefel Tellioglu and Julian Carlin.
We asked participants if they had found any training materials that were particularly useful.
Mackrina said that she is using Python in her PhD project, but previously only had experience with Matlab.
-
Mackrina completed several online Python courses, but found that an in-person training session offered by the University of Melbourne's Digital Skills Training was the most useful. They regularly run a wide range of training sessions, see the list of upcoming sessions.
-
Mackrina found that the pypfilt package made it much easier to write ordinary differential equation (ODE) models and run scenario simulations. Note: Rob is the
pypfilt
author and maintainer. This package is designed for scenario modelling, and model-fitting using Sequential Monte Carlo (SMC) methods.
Other participants chimed in with recommended resources and training needs:
-
Rob: The Software Carpentry provides a good range of lessons.
-
TK: I want to learn how to use Docker and containers, and how to (install and) use greta.
-
Eamon: I'm happy to provide assistance and guidance with using Stan to fit models using Hamiltonian Monte Carlo.
-
Jiahao: Python ODE solvers are not described nearly as well as Matlab's ODE solvers, so they are harder to use.
High-performance computing (HPC)¶
Using GPGPUs for high-performance computing
Jiahao asked: Does anyone in our community have experience with using GPGPUs?
In response to Jaihao's question, Eamon replied that he has found it to be near-impossible, due to a combination of:
- Difficulties in compiling compatible versions of the required libraries; and
- Figuring out how to adapt his code to make effective use of GPGPUs.
This initiated a broader discussion about improving the computational performance of our code and making use of high-performance computing (HPC) resources.
Computational performance was an issue that Nefel encountered when constructing an agent-based model of pneumococcal disease, and she found that code optimisation is a skill that takes time to learn.
We discussed several ways about using multiple CPU cores to make code run more quickly:
-
Using libraries that automatically make use of multiple CPU cores, such as future.apply for R, concurrent.futures for Python, and the Polars data-frame library.
-
Where we want to run large numbers of simulations, the easiest approach can often be for each simulation to only use one CPU core, and to run many simulations in parallel (e.g., on virtual machines that have many CPU cores). However, as TK pointed out, if each simulation uses a large amount of RAM, it may not be possible to run many simulations in parallel with this approach.
-
For larger scale problems, there are HPC platforms such as the University of Melbourne's Spartan and Monash University's MASSIVE. Using these platforms typically requires writing some bespoke code to define and schedule jobs.
Debugging¶
Key message
There was strong interest in running a debugging workshop at the upcoming SPECTRUM 2024 annual meeting (23-25 September). As TK and Nefel have shown in their presentations, skills like debugging are extremely valuable for projects with tight deadlines, but these projects are also the worst time in which to develop and practice these skills.
Info
Attendees confirmed their willingness to evaluate and provide feedback on workshop draft materials.
Rob reflected that many people struggle to effectively debug their code, and can end up wasting a lot of time. Since we all make mistakes when writing code, this can be an extremely valuable skill. This is particularly true when working on, e.g., modelling contracts with government (see, e.g., the recent presentations from TK and Nefel).
We discussed some general guidelines, such as:
-
Structuring your code so that it's easier to debug (e.g., small functions);
-
Avoid hard-coding numerical values, file names, etc, as much as possible;
-
Making use of breakpoints rather than
print()
statements; -
Checking that input arguments to a function satisfy necessary conditions;
-
Checking that output values from a function satisfy necessary conditions;
-
Failing early (e.g., raising exceptions in Python, calling
stop()
in R) rather than returning values that may not be valid.
Eamon: by learning how to debug code, I substantially improved how I write and modularise my code. My functions became smaller, and this helped me to make fewer mistakes.
For example, David Price and I encountered a bug in some C++ code where the function was correct, but made assumptions about the input arguments. These assumptions were initially satisfied, but as other parts of the code were updated, these assumptions were no longer true.
To address this, I often write if
statements at the top of a function to check these kinds of conditions, and stop if there are failures.
You can see examples of this in real-world code from popular packages.
James: I'm happy to provide an example of debugging within an R pipe. Learn Debugging might be a useful resource for our community.
Rob: failing early is good, rather than producing output and then having to discover that it's incorrect (which may not be obvious). Related skills include learning how to read a stack trace, and defensive programming (such as checking input arguments, as Eamon mentioned).
TK: it's really hard to change existing habits. And I'm not doing any coding in my projects right now. My most recent coding experiences were in COVID-19 projects (see TK's presentation) and the very tight deadlines didn't allow me the opportunity to develop and apply new skills.
Rob: everyone already debugs and tests their code to some degree, simply by writing and evaluating code line by line (e.g., in an interactive R or Python session) and by running functions with example arguments to check that they give sensible outputs. We just need to "nudge" these behaviour to make it more systematic and reproducible.