Real-world stories
Rob Moss: pypfilt (Python)
Last week (Tues 17 Sep, 2024) I encountered an error when running model simulations with my pypfilt Python library. I used the breakpoint technique described on the previous page, and identified the cause in less than 2 minutes.
Here is the original stack trace that made me aware of the error:
[Stack trace, 13 lines: contents not preserved]
The error occurred on line 1467 (highlighted above). This piece of code looks like:
[Listing: pypfilt/summary.py, lines 1458 to 1469: contents not preserved]
So I added a breakpoint() call on line 1464 and ran the simulations again:
[Listing: pypfilt/summary.py, lines 1458 to 1470: contents not preserved]
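As a rough illustration of this step (not the pypfilt code itself), the sketch below places a breakpoint() call just before a type check. The function save_value() and its return values are hypothetical stand-ins, and a plain list plays the role of the NumPy array.

```python
def save_value(value):
    """Hypothetical stand-in for the code around the failing line."""
    # Pause here and open the debugger (pdb by default), just before
    # the branch that decides how the value is saved.
    breakpoint()
    if isinstance(value, (list, tuple)):
        # In the story above, this is the branch that called save_dataset().
        return "dataset"
    return "attribute"
```

Calling save_value([1, 2, 3]) pauses at the breakpoint() line. At the (Pdb) prompt, p value prints the variable, p type(value) shows its type, n runs the next line, s steps into a function call, and c continues running. Setting the environment variable PYTHONBREAKPOINT=0 turns every breakpoint() call into a no-op.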
- I observed that value was a NumPy array.
- This meant that isinstance(value, np.ndarray) returned True.
- So the save_dataset() function on line 1460 was failing.
- I called save_dataset() and stepped into the function.
- This allowed me to identify the cause:
  - When the value array was created, it originally had columns called "arrival_date" and "count";
  - The "arrival_date" column contains date values;
  - Dates must be converted into strings when saving data as an HDF5 dataset;
  - Metadata was attached to the value array, to indicate how to convert "arrival_date" values;
  - But the columns were then renamed to "time" and "value";
  - The metadata was not updated, so the "time" values were not converted into strings; and
  - This caused save_dataset() to fail.
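The chain of events above can be sketched in a few lines of plain Python. This is a simplified stand-in, not the pypfilt implementation: the table is a dictionary of columns rather than a NumPy array, and the names make_table(), rename_columns(), and prepare_for_saving() are all hypothetical.

```python
import datetime


def make_table():
    """Return a small table and the metadata describing how to convert it."""
    columns = {
        "arrival_date": [datetime.date(2024, 9, 16), datetime.date(2024, 9, 17)],
        "count": [3, 5],
    }
    # Metadata maps a column name to the function that converts its values.
    metadata = {"arrival_date": datetime.date.isoformat}
    return columns, metadata


def rename_columns(columns, metadata, mapping, update_metadata):
    """Rename columns; optionally rename the matching metadata keys too."""
    renamed = {mapping.get(name, name): values for name, values in columns.items()}
    if update_metadata:
        metadata = {mapping.get(name, name): conv for name, conv in metadata.items()}
    return renamed, metadata


def prepare_for_saving(columns, metadata):
    """Apply the converters, and refuse to save any remaining date values."""
    converted = {}
    for name, values in columns.items():
        convert = metadata.get(name)
        if convert is not None:
            values = [convert(v) for v in values]
        if any(isinstance(v, datetime.date) for v in values):
            raise TypeError(f"column {name!r} contains unconverted dates")
        converted[name] = values
    return converted


mapping = {"arrival_date": "time", "count": "value"}
columns, metadata = make_table()

# The bug: columns are renamed, but the metadata still refers to "arrival_date".
stale = rename_columns(columns, metadata, mapping, update_metadata=False)

# The fix: rename the metadata keys together with the columns.
fresh = rename_columns(columns, metadata, mapping, update_metadata=True)
```

Here prepare_for_saving(*stale) raises a TypeError because no converter is registered for "time", while prepare_for_saving(*fresh) converts the dates into strings. Stepping into the saving function with the debugger is what exposes the mismatch between the column names and the metadata keys.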
Here is the faulty code (in a different file!):
[Listing: pypfilt/obs.py, lines 450 to 462: contents not preserved]
And here is the fix:
[Listing: pypfilt/obs.py, lines 450 to 457: contents not preserved]
Michael Lydeamore: ORCID publications (R)
I wrote the get_publications_from_orcid() function (shown below) to retrieve the publications associated with a specific ORCID ID, using the rorcid package.
Retrieving research publications
This is now available as the publicationsscraper R package.
The function has some error-catching built in:
- If the ORCID ID doesn't exist (Error 404) then the function returns this error to the user (the tryCatch part, lines 26-35);
- If DOIs aren't returned for a publication, they are filled with NA (using purrr::map, lines 50-63); and
- If the works frame is empty, or doesn't exist, the function returns an empty frame (handled by the if statement on line 36).
[Listing: get_publications_from_orcid(), lines 1 to 72: contents not preserved]
But I ran into an unusual error: this function failed for some ORCID IDs with valid works.
> get_publications_from_orcid("0000-0001-9379-0010")
Error in `dplyr::select()`:
! Can't subset columns that don't exist.
✖ Column `source.assertion-origin-name.value` doesn't exist.
     ▆
  1. ├─global get_publications_from_orcid("0000-0001-9379-0010")
  2. │ ├─dplyr::mutate(...)
  3. │ ├─dplyr::select(...)
  4. │ └─dplyr:::select.data.frame(...)
  5. │   └─tidyselect::eval_select(expr(c(...)), data = .data, error_call = error_call)
  6. │     └─tidyselect:::eval_select_impl(...)
  7. │       ├─tidyselect:::with_subscript_errors(...)
  8. │       │ └─rlang::try_fetch(...)
  9. │       │   └─base::withCallingHandlers(...)
 10. │       └─tidyselect:::vars_select_eval(...)
 11. │         └─tidyselect:::walk_data_tree(expr, data_mask, context_mask)
 12. │           └─tidyselect:::eval_c(expr, data_mask, context_mask)
 13. │             └─tidyselect:::reduce_sels(node, data_mask, context_mask, init = init)
 14. │               └─tidyselect:::walk_data_tree(new, data_mask, context_mask)
 15. │                 └─tidyselect:::as_indices_sel_impl(...)
 16. │                   └─tidyselect:::as_indices_impl(...)
 17. │                     └─tidyselect:::chr_as_locations(x, vars, call = call, arg = arg)
 18. │                       └─vctrs::vec_as_location(...)
 19. └─vctrs (local) `<fn>`()
 20.   └─vctrs:::stop_subscript_oob(...)
 21.     └─vctrs:::stop_subscript(...)
 22.       └─rlang::abort(...)
To debug this, I made liberal use of the browser() command and analysed each part of the pubs object.
This quickly revealed the cause of the error:
For ORCID works that are not journal publications (e.g., conference proceedings), the source.assertion-origin-name.value column contains NA. In other words, these works have no authors defined.
And if there are no journal publications associated with the ORCID ID, the source.assertion-origin-name.value column does not exist!
The solution was to only select columns if they existed, by using dplyr::matches():
all_pubs[[orcid_id]] <- pubs[[1]]$works |>
        dplyr::select(
          title = dplyr::matches("title.title.value"),
          DOI = dplyr::matches("external-ids.external-id"),
          authors = dplyr::matches("source.assertion-origin-name.value"),
          publication_year = dplyr::matches("publication-date.year.value"),
          journal_name = dplyr::matches("journal-title.value")
        )
and then adding any missing columns, filling them with a default value:
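# Add `column` to `.data`, filled with `default_value`, if it is not already present.
append_column_if_missing <- function(.data, column, default_value = NA) {
  current_columns <- colnames(.data)
  if (!column %in% current_columns) {
    return(
      .data |>
        # Namespaced so the helper works without attaching dplyr.
        dplyr::mutate(!!column := default_value)
    )
  }
  return(.data)
}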
# ... |>
append_column_if_missing("title", default_value = NA_character_) |>
append_column_if_missing("DOI", default_value = NA_character_) |>
append_column_if_missing("authors", default_value = NA_character_) |>
append_column_if_missing("publication_year", default_value = NA_integer_) |>
append_column_if_missing("journal_name", default_value = NA_character_) |>
# ...