A lot of incredibly important work has been done around data science workflows, most notably by Jenny Bryan. If you’re new to thinking about workflows, start with the STAT545 resources and Happy Git and GitHub for the useR. Jenny’s work got me thinking about my broader workflow.
As a consultant, I work with a ream of developing documents, datasets, requests, outputs and analyses – a collection of analytical ephemera I refer to as analytics objects. When looking at a big project, I’ve found it helpful to start mapping out how these objects interact, where they come from and how they work together.
Here’s a general concept map: individual projects vary a lot, but it’s a starting point.
Client request objects
My workflow tends to start with client requests and communications – everything from the initial “are you available, we have an idea” email to briefings, notes I’ve taken during meetings, documents I’ve been given.
At the start of the project this can be a lot of documents and it’s not always easy to know where they should sit or how they should be managed.
A sensible solution tends to develop over time, but this is a stage where it’s easy to lose or forget about important things if it all stays in your inbox. One thing I often do at the start of a project is a basic document curation in a simple Excel sheet, so I know what I’ve got, where it came from and what’s in it.
I don’t usually bother curating every email or set of meeting notes, but anything that looks like it may be important or could be forgotten about goes in the list.
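As a sketch of what such a curation log might look like, here it is in base R rather than Excel – the column names and example entries are my own illustration, not a fixed template:

```r
# A minimal document curation log sketched in base R.
# Columns and example entries are illustrative only.
curation_log <- data.frame(
  document = c("briefing_kickoff.docx", "meeting_notes_scoping.docx"),
  received = c("2018-03-02", "2018-03-09"),
  from     = c("client project lead", "own meeting notes"),
  contents = c("project scope and deliverables",
               "agreed definitions and deadlines"),
  stringsAsFactors = FALSE
)

# Write it out so it lives with the project, not in an inbox
write.csv(curation_log, "document_curation_log.csv", row.names = FALSE)
```

The point isn’t the tool – the same four columns work just as well as an Excel sheet.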
The next thing that happens is people give me data, I go and find data or some third party sends data my way.
There’s a lot of data flying about – sometimes it’s different versions of the same thing. Sometimes it’s supposed to be the same thing and it’s not.
It often comes with metadata attached (what’s in it, where it came from, who collected it, why) and supporting documents (survey instruments, sampling instructions etc.).
If I could go back and tell my early-career self one thing it would be this: every time someone gives you data, don’t rely on their documentation – make your own.
It may be short, it may be brief, it may simply contain references to someone else’s documentation. But take the time to go through it and make sure you know what you have and what you don’t.
For a more detailed discussion of how I handle this in a low-tech environment/team, see here. Version control systems and Rmarkdown are my strong preference these days – if you’re working with a team that has the capacity to manage them. Rmarkdown for building data dictionaries, metadata collections and other provenance information is brilliant. But even if you’re not, and need to rely on Excel files for notes, don’t skip this step.
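A data dictionary doesn’t need to be elaborate, either. As a rough sketch of the kind of thing I mean, base R can generate the mechanical skeleton (names, types, missingness), which you then annotate by hand – the `description` and `source` columns here are deliberately left blank for a human to fill in:

```r
# Sketch: auto-generate a data dictionary skeleton in base R.
# Generate the mechanical parts, then fill in meaning and
# provenance by hand (or in an Rmarkdown document).
make_dictionary <- function(df) {
  data.frame(
    variable    = names(df),
    type        = vapply(df, function(x) class(x)[1], character(1)),
    n_missing   = vapply(df, function(x) sum(is.na(x)), integer(1)),
    description = "",   # what the variable means -- human-written
    source      = "",   # where it came from -- human-written
    stringsAsFactors = FALSE
  )
}

# Example with a toy dataset
dict <- make_dictionary(data.frame(id = 1:3, score = c(2.5, NA, 4.1)))
```

The half that can be automated should be; the half that can’t is the half people skip, and it’s the half that matters.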
Next come the analysis and communications objects, which you’re probably familiar with.
Analysis and communications objects
(Warning: shameless R plug here)
The great thing about R is that it maps most of my analysis and communications objects for me. Using an R project as the basis for analysis means that the provenance of all transformed data, analyses and visualisations is baked in. Version control with GitHub means I’m not messing around with 17 Excel files all called some variation of final_analysis.xlsx.
Using Rmarkdown and Shiny for as much communication with the client as possible means that I’ve directly linked my reporting, client-bound visualisations and returned data to my analysis objects.
That said, R can’t manage everything (but they’re working on it). Sometimes you need functionality R can’t provide, and R can’t tell you where your data came from if you don’t tell it first. R can’t tell you if you’re scoping a project sensibly.
Collaboration around an Rmarkdown document is difficult when most of your clients are not R users at all. One workaround for me has been to:
- Export the Rmarkdown document as a Word document
- Have non-technical collaborators make changes and updates via tracked changes
- Depending on the stage of the project, input those changes back into R by hand or go forward with the Word document.
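The first step of that loop is a one-liner, assuming the rmarkdown package (and pandoc, which it relies on) is installed – the filename here is hypothetical:

```r
# Render an existing Rmarkdown report to a Word document so that
# non-R collaborators can comment via tracked changes.
# Assumes the rmarkdown package and pandoc are available;
# "report.Rmd" is a hypothetical filename.
library(rmarkdown)
render("report.Rmd", output_format = "word_document")
```

Setting `output: word_document` in the Rmd’s YAML header achieves the same thing via the Knit button.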
It’s not a perfect system by any means, but it’s the best I’ve got right now. (If you’ve got something better, I’d love to hear about it.)
Objects inform other objects
In a continuing engagement, your communications objects inform the client’s next requests, and so on. Not all of these objects are in use at any given time, and as they get updated – or as projects run long term – important things get lost or confused. Thinking about how all these objects work together helped my workflow tremendously.
The lightbulb moment for me came when I started thinking about all my analytics objects as strategically as Jenny Bryan proposes we think about our statistics workflow. When I do that, the project is better organised and managed from the start.