Real-world stories
Rob Moss: pypfilt (Python)
Last week (Tues 17 Sep, 2024) I encountered an error when running model simulations with my pypfilt Python library. I used the breakpoint technique described on the previous page, and identified the cause in less than 2 minutes.
Here is the original stack trace that made me aware of the error:
[Stack trace listing, 13 lines; content not preserved.]
The error occurred on line 1467 (highlighted above). This piece of code looks like:
[Listing: pypfilt/summary.py, lines 1458-1469; code not preserved.]
So I added a `breakpoint()` call on line 1464 and ran the simulations again:
[Listing: pypfilt/summary.py, lines 1458-1470; code not preserved.]
- I observed that `value` was a NumPy array.
- This meant that `isinstance(value, np.ndarray)` returned `True`.
- So the `save_dataset()` function on line 1460 was failing.
- I called `save_dataset()` and stepped into the function.
- This allowed me to identify the cause:
    - When the `value` array was created, it originally had columns called `"arrival_date"` and `"count"`;
    - The `"arrival_date"` column contains date values;
    - Dates must be converted into strings when saving data as an HDF5 dataset;
    - Metadata was attached to the `value` array, to indicate how to convert `"arrival_date"` values;
    - But the columns were then renamed to `"time"` and `"value"`;
    - The metadata was not updated, so the `"time"` values were not converted into strings; and
    - This caused `save_dataset()` to fail.
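The debugging steps above can be sketched with a small stand-in. This is not pypfilt's code: `save_dataset()` and `save_value()` below are hypothetical, simplified functions, a plain list stands in for the NumPy array, and the `debug` flag is an assumption added so that the example runs without pausing.

```python
def save_dataset(name, value, store):
    """Hypothetical stand-in: store a copy of a table-like value."""
    store[name] = list(value)


def save_value(name, value, store, debug=False):
    """Hypothetical stand-in for the code around the failing line."""
    if debug:
        # Pause here. At the (Pdb) prompt, `p type(value)` prints the type,
        # `n` runs the next line, `s` steps into a function call on the
        # current line, and `c` continues running.
        breakpoint()
    if isinstance(value, list):
        # The branch that would be stepped into with `s`.
        save_dataset(name, value, store)
    else:
        store[name] = value
    return store


store = save_value("forecasts", [1, 2, 3], {})
```

Calling `save_value(..., debug=True)` from a terminal drops into the debugger at that point. Setting the environment variable `PYTHONBREAKPOINT=0` turns every `breakpoint()` call into a no-op, which is handy if one is left in by accident.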
Here is the faulty code (in a different file!):
[Listing: pypfilt/obs.py, lines 450-462; code not preserved.]
And here is the fix:
[Listing: pypfilt/obs.py, lines 450-457; code not preserved.]
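A minimal sketch of this failure mode, and of one way to avoid it, using plain dictionaries in place of the NumPy array and the HDF5 file. All names here (`rename_columns()`, `to_saveable()`, `date_columns`) are hypothetical, and this is not the actual pypfilt fix; it only illustrates why metadata that refers to column names must be renamed together with the columns.

```python
import datetime

# Hypothetical table: column name -> list of values.
table = {
    "arrival_date": [datetime.date(2024, 9, 16), datetime.date(2024, 9, 17)],
    "count": [3, 5],
}
# Hypothetical metadata: names of the columns that hold dates.
date_columns = {"arrival_date"}


def rename_columns(table, date_columns, mapping):
    """Rename columns and apply the same renaming to the metadata."""
    new_table = {mapping.get(name, name): values for name, values in table.items()}
    new_date_columns = {mapping.get(name, name) for name in date_columns}
    return new_table, new_date_columns


def to_saveable(table, date_columns):
    """Convert date columns into ISO 8601 strings; leave other columns alone."""
    return {
        name: [v.isoformat() for v in values] if name in date_columns else values
        for name, values in table.items()
    }


mapping = {"arrival_date": "time", "count": "value"}
renamed, renamed_dates = rename_columns(table, date_columns, mapping)

# With the stale metadata, the "time" values are left as date objects.
stale = to_saveable(renamed, date_columns)
# With the updated metadata, the "time" values become strings.
fixed = to_saveable(renamed, renamed_dates)
```

Renaming the columns and the metadata in a single function makes it harder for the two to drift apart.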
Michael Lydeamore: ORCID publications (R)
I wrote the `get_publications_from_orcid()` function (shown below) to retrieve the publications associated with a specific ORCID ID, using the `rorcid` package.
Retrieving research publications
This is now available as the `publicationsscraper` R package.
The function has some error-catching built in:
- If the ORCID ID doesn't exist (Error 404) then the function returns this error to the user (the `tryCatch` part, lines 26-35);
- If DOIs aren't returned for a publication, they are filled with `NA` (using `purrr::map`, lines 50-63); and
- If the `works` frame is empty, or doesn't exist, the function returns an empty frame (handled by the `if` statement on line 36).
[Listing: get_publications_from_orcid(), lines 1-72; code not preserved.]
But I ran into an unusual error: this function failed for some ORCID IDs with valid works.
> get_publications_from_orcid("0000-0001-9379-0010")
Error in `dplyr::select()`:
! Can't subset columns that don't exist.
✖ Column `source.assertion-origin-name.value` doesn't exist.
▆
1. ├─global get_publications_from_orcid("0000-0001-9379-0010")
2. │ ├─dplyr::mutate(...)
3. │ ├─dplyr::select(...)
4. │ └─dplyr:::select.data.frame(...)
5. │ └─tidyselect::eval_select(expr(c(...)), data = .data, error_call = error_call)
6. │ └─tidyselect:::eval_select_impl(...)
7. │ ├─tidyselect:::with_subscript_errors(...)
8. │ │ └─rlang::try_fetch(...)
9. │ │ └─base::withCallingHandlers(...)
10. │ └─tidyselect:::vars_select_eval(...)
11. │ └─tidyselect:::walk_data_tree(expr, data_mask, context_mask)
12. │ └─tidyselect:::eval_c(expr, data_mask, context_mask)
13. │ └─tidyselect:::reduce_sels(node, data_mask, context_mask, init = init)
14. │ └─tidyselect:::walk_data_tree(new, data_mask, context_mask)
15. │ └─tidyselect:::as_indices_sel_impl(...)
16. │ └─tidyselect:::as_indices_impl(...)
17. │ └─tidyselect:::chr_as_locations(x, vars, call = call, arg = arg)
18. │ └─vctrs::vec_as_location(...)
19. └─vctrs (local) `<fn>`()
20. └─vctrs:::stop_subscript_oob(...)
21. └─vctrs:::stop_subscript(...)
22. └─rlang::abort(...)
To debug this, I made liberal use of the `browser()` command and analysed each part of the `pubs` object.
This quickly revealed the cause of the error:
For ORCID works that are not journal publications (e.g., conference proceedings), the `source.assertion-origin-name.value` column contains `NA`; in other words, these works have no authors defined.
And if there are no journal publications associated with the ORCID ID, the `source.assertion-origin-name.value` column does not exist!
The solution was to only select columns if they existed, by using `dplyr::matches()`:
all_pubs[[orcid_id]] <- pubs[[1]]$works |>
  dplyr::select(
    title = dplyr::matches("title.title.value"),
    DOI = dplyr::matches("external-ids.external-id"),
    authors = dplyr::matches("source.assertion-origin-name.value"),
    publication_year = dplyr::matches("publication-date.year.value"),
    journal_name = dplyr::matches("journal-title.value")
  )
and then adding any missing columns, filling them with a default value:
append_column_if_missing <- function(.data, column, default_value = NA) {
  # Add `column`, filled with `default_value`, only if it is not already present.
  current_columns <- colnames(.data)
  if (!column %in% current_columns) {
    return(
      .data |>
        dplyr::mutate(!!column := default_value)
    )
  }
  return(.data)
}

# ... |>
  append_column_if_missing("title", default_value = NA_character_) |>
  append_column_if_missing("DOI", default_value = NA_character_) |>
  append_column_if_missing("authors", default_value = NA_character_) |>
  append_column_if_missing("publication_year", default_value = NA_integer_) |>
  append_column_if_missing("journal_name", default_value = NA_character_) |>
# ...
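The same select-then-fill idea can be sketched outside dplyr. The following Python analogue is an illustration only, not part of the R package: it uses plain dictionaries instead of a data frame, `select_with_defaults()` and `COLUMNS` are hypothetical names, and the example record is made up.

```python
# Hypothetical mapping from output column name to the key used in the source data.
COLUMNS = {
    "title": "title.title.value",
    "authors": "source.assertion-origin-name.value",
    "publication_year": "publication-date.year.value",
    "journal_name": "journal-title.value",
}


def select_with_defaults(records, columns, default=None):
    """Keep only the wanted columns, filling any that are missing with a default."""
    return [
        {
            new_name: record.get(source_key, default)
            for new_name, source_key in columns.items()
        }
        for record in records
    ]


# A made-up work that has no author or journal fields.
works = [{"title.title.value": "An example talk", "publication-date.year.value": 2024}]
rows = select_with_defaults(works, COLUMNS)
```

Every row then has the same set of keys, so later steps never need to check whether a column exists.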