I’ve been thinking a lot of about workflow lately and how that differs from project to project. There are a few common states I move through with each project, however. I wanted to talk a more about how failure fits in that workflow. As I’ve mentioned before, I quite like failure. It’s a useful tool for a data scientist. And failure has an important place in a data scientist’s workflow in my view.
Here’s my basic workflow. Note the strong similarity in parts to Hadley Wickham’s data science workflow, which I think is an excellent discussion of the process. In this case, I wanted to talk more about an interactive workflow with a client, however, and how failure fits into that.
An interactive workflow
As a consultant, a lot of what I do is interactive with the client. This creates opportunities for better analysis. It also creates opportunities for failure. Let me be clear: some failures are not acceptable and are not in any way beneficial. Those are the failures that happen after ‘analysis complete’. All the failures that happen before that are an opportunity to improve, grow and cement a working relationship with a client. (Some, however, are hideously embarrassing, you want to avoid those in general.)
This workflow is specific to me: your mileage will almost certainly vary. The amount of variation from mine I have no opinion on I’m afraid. As usual, you should take what’s useful to you and jettison the rest.
My workflow starts with a client request. This request is often nebulous, vague and unformed. If that’s the case, then there’s a lot of work around getting the client’s needs and wants into a shape that will be (a) beneficial to the client and (b) achievable. That’s a whole other workflow to discuss on another day.
This is the stage I recommend documenting client requests in the form of some kind of work order so everyone’s on the same page. Some clients have better clarity around what they require than others. Having a document you can refer back to at handover saves a lot of time and difficulty when the client is working outside their domain knowledge. It also helps a lot at final stage validation with the client: it’s easy to then check off that you did what you set out to do.
Once the client request is in workable shape, it’s time to identify and select data sources. This may be client-provided, or it may be external or both. Pro tip: document everything. Every Excel worksheet, .csv, database – everything. Where did it come from, who gave it to you, how and when? I talk about how I do that in part here.
Next I validate the data: does it make sense, is it what I expected, what’s in there? Once that’s all done I need to validate my findings with the client’s expectations. Here’s where a failure is good. You want to pick up if the data is a load of crap EARLY. You can then manage client’s expectations around outcomes – and lay the groundwork for future projects.
If it’s a failure – back to data sourcing and validation. If it’s a pass, on to cleaning and transform, another sub-workflow by itself.
Analyse, Model, Visualise
This part of my workflow is very close to the iterative model proposed by Hadley Wickham that I linked to above. It’s fundamentally repetitive: try, catch problems and insights, repeat. I also like to note the difference between visualisation for finding insight and visualisation for communicating insight. These can be the same, but they’re often different.
Sometimes I find an insight in the statistics and use the visualisation to communicate it. Sometimes I find the insight in the visualisation, then validate with statistics and communicate the insight to the client with a chart. Pro tip: the more complex an idea, the easier it is to present it initially with a chart. Don’t diss the humble bar chart: it’s sometimes 90% of the value I add as a consultant, fancy multi-equation models not withstanding.
This process is full of failure. Code that breaks, models that don’t work, statistics that are not useful. Insights that seem amazing at first, but aren’t valid. Often the client likes regular updates at this point and that’s a reasonable accommodation. However! Be wary about communicating your latest excitement without internal validation. It can set up expectations for your client that aren’t in line with your final findings.
You know you’re ready to move out of this cycle when you run out of failures or time or both.
Communicate and validate
This penultimate stage often takes the longest, communication is hard. Writing some code and throwing a bunch of data in a model is relatively easy. It’s also the most important stage – it’s vital that we communicate in such a way that our client or domain experts can validate what we’re finding. Avoid at all costs the temptation to tech-speak. The client must be able to engage with what you’re saying.
If that all checks out, then great – analysis complete. If it doesn’t, we’re bounced all the way back to data validation. It’s a big failure – but that’s OK. It’s far more palatable than the failures that come after ‘analysis complete’.