Writing business reports

Congratulations, you’re a data scientist! You’ve got the tech skills, the maths skills, the stats skills… and now you need to write a business report, and nobody ever took five minutes to show you the basics.

That’s what this post is about: the basics of writing for business.

Business reporting is hugely varied – the most important thing is to work out who your audience is and write for them. This is never a one-size-fits-all situation, and the minute you try to make it one, you will end up with a low-quality product.

Here are some thoughts to help you on your way.


Business reports can be a lot more free-flowing and variable than you might think. The term ‘report’ encompasses a huge range of documents. Here I’ll talk about a formal, structured document – but remember this isn’t always the best approach. Tailor the document to the stakeholder, always!

There are a number of elements to the formal business report and I’ll pick out a few I think need better understanding.

Executive summary

Executive summaries come first: after the title page, but before your table of contents and introduction. This is the most important part of the report – the tl;dr (‘too long, didn’t read’).

Most reports suffer from tl;dr. The executive summary is your chance to make an impact in spite of that. It’s not the same as an introduction – it’s almost a mini report.

The executive summary should cover what you did, how you did it, what you found, and any limitations or caveats on that work.

Sometimes an executive summary can feel like a bunch of lists: things you did, things you thought, things you investigated, things you found out. If you need to list off a bunch of things – use bullet points to make that list easier on your reader.

Infographics are also a really useful way to convey information to an audience that doesn’t have the time or capacity to read the details. The advantage is that they get the surface-level overview – and know where to come back for detail when/if they need it. (For more info on infographics, see here.) Infographics are also really shareable, which means your work may get wider currency than it would otherwise.


Introduction

On the surface, this looks the same as an executive summary, and the two are often confused: they’re not the same thing!

The introduction provides the baseline for the report’s readers. It should include some context (why are we here?) and, in some circumstances, a basic rundown of results and methods is also relevant (what did we find out from being here and how did we find it out?). However, in shorter business reports, these may be best left to their own sections. It’s always a matter of judgement.

It should also include a road map for the report’s readers, so they know where the critical information lies. Something like: “Section 1 discusses the context of internet access in rural NSW, while Section 2 outlines the methods used in this study.”

The introduction comes after the table of contents and the executive summary.


The body of the report

This is the main action of your report. Its structure depends on what you’re doing and who you’re doing it for. You may need a methods section – or it may be best relegated to an appendix if you have a non-technical audience. You may need a section on context – or the audience may understand the context already. It all depends.

The trick to structuring your report is to think about what it is your audience needs to take away from reading it: what is the purpose here? What was the purpose of the work in the first place? Start there.

Structure is essentially a drafting problem – see some tips below for help on that.


Conclusion

A poor cousin of the executive summary, the conclusion can usually be reasonably brief compared to the rest of the report. It should outline what has been done and its outcome. This is not the section to introduce anything new!

Including analyses

We’re data scientists and analysts: of course there are going to be analyses. How do you decide which ones to include, though?

Be aware of the temptation to ‘throw everything in there’ to show everyone what a wonderfully thorough job you did. Does it add to the purpose and structure you defined above? If not, it stays out.

Brag to your Twitter pals about the fantastic model you built, though – at least you know they’ll care.

The finer points of tables and figures

Most of our analyses end up in reports as tables and figures. There are a couple of really simple conventions for using them that aren’t widely known. But like many simple things, they’re only simple if somebody has taken the time to tell you about them.

The following applies to a formal report; other formats may have different restrictions. Your company or organisation may have its own style guide, and that should always take precedence over the following:

  • Tables and figures should be numbered and labelled. When you refer to a specific one, treat it as a proper noun and capitalise it, e.g. “Table 1 shows some uninteresting statistics, while Figure 427 shows another bar chart.” Use the caption option in your chosen tool where you can – there’s a small sketch after this list.
  • Ideally tables and figures should have their own lists at the beginning of the report, like a table of contents. These are generally easy to insert and generate automatically.
  • In a table, column and row headings are often treated as proper nouns and capitalised – consider the overall style of your document.
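If you’re working in R Markdown (which I talk about later in this post), here’s a minimal sketch of how captions and numbering can be handled. The data and caption text are just placeholders, and the exact numbering behaviour depends on your output format:

```r
library(knitr)

# kable() attaches a caption to a table; bookdown-style output formats then
# number it automatically and can pick it up for a list of tables.
kable(head(mtcars[, c("mpg", "cyl", "hp")]),
      caption = "Some uninteresting statistics")

# For figures, the chunk option fig.cap plays the same role, e.g. a chunk
# header like {r access-chart, fig.cap = "Internet access in rural NSW"}.
```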

Colour is a whole series of posts in itself, but I use the following as rules of thumb:

  • If the company or organisation has a colour palette, use that for preference.
  • Be aware of colour vision deficiencies among your readers. Here are some thoughts on that, and a small sketch after this list.
  • Less is usually more when it comes to colour in charts and definitely when it comes to colour in tables.
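As a quick sketch of the second point: the viridis scales that ship with ggplot2 are one option designed with colour vision deficiencies in mind (your organisation’s palette still wins if it has one). The data here is just ggplot2’s built-in mpg set.

```r
library(ggplot2)

# A stacked bar chart using a discrete viridis fill scale, which stays
# readable under common colour vision deficiencies and in greyscale print.
ggplot(mpg, aes(class, fill = drv)) +
  geom_bar() +
  scale_fill_viridis_d() +
  labs(x = "Vehicle class", fill = "Drive type")
```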


Tips for better writing

Writing is, well, a very difficult skill to master. I should know: I was so bad at it when I started grad school that I bought my Ph.D. supervisor a box of red pens for Christmas one year because he had used so much red ink on my work!

Luckily for me he has a great sense of humour. Even luckier for me, he spent a lot of time helping me develop this skill that has stood me in good stead my whole career.

You won’t be a great technical writer overnight – and that’s OK. But when you’re just starting out or are unsure, here’s what helped me get better.

  • Whenever I’m trying to draft something difficult, I start out with bullet points. It helps me structure an argument without getting lost in the writing part.
  • I started out writing in present tense wherever possible. I used to mix my tenses a lot and it made it hard to understand my work. Where it was sensible/possible, sticking to present tense gave me one less thing to manage. You could do the same for past tense if that works better for your context.
  • Sentences should be reasonably short. I had so many ideas when I was starting out. They’d expand into subclause on subclause, then another one… and before I knew it I had sentences 10 lines long. To this day, an essential part of my writing pattern is going back through my writing and shortening as many sentences as I can.
  • Paragraphs should also be reasonably short in general. Got a paragraph that goes for half a page? Time to break it up!
  • A first draft is your best aid to writing well. Let me also be clear: a first draft is also generally a steaming pile of… let’s call it ‘junk’. It’s a good way of helping your brain organise its ideas. If you’re stuck, write a terrible first draft and bin it. The next will be better!
  • After you’ve spent a lot of time on a document, you need to put it down in order to see the flaws. Take breaks, go work on something else and then come back to it. You’ll be mildly horrified at everything you find that’s bad. But nobody else saw those errors before you did – great!
  • Another technique to find errors or bad writing is to read the draft backwards: sentence by sentence. This is time consuming, but it helps me find the details that need fixing.

What to write in?

When writing technical material, R Markdown rocks my world. When working collaboratively with non-technical people, I export to Word and then build the changes back into the R Markdown document. It’s not ideal, but it works.
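For what that looks like in practice, here’s a minimal sketch – the file name is hypothetical, and this is just one way to produce the Word copy:

```r
library(rmarkdown)

# Knit the same source to Word for collaborators who prefer tracked changes
# in Office; their edits then get folded back into the .Rmd by hand.
render("report.Rmd", output_format = "word_document")
```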

There’s a whole bunch of reasons why we write – so choose the tool that’s going to work best for you and your team, and go for it.

References and Citations

Definitely worth the time – both for your own knowledge and reference, and to build on the work of those whose tools you’re using. Remember to cite your packages too. I haven’t always done this in the past! But it’s important to acknowledge the open source contributors whose work we’re using – it’s now part of my routine.
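R makes the package part easy – citation() prints the reference (including a BibTeX entry) that a package’s authors ask you to use:

```r
# Citation details for a package, and for R itself.
citation("ggplot2")
citation()
```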

And we’re done

That’s a brief overview of some of the things that have helped me with writing for business. Good luck!

A consultant’s workflow

I’ve been thinking a lot about workflow lately and how it differs from project to project. There are a few common states I move through with each project, however. I wanted to talk a bit more about how failure fits into that workflow. As I’ve mentioned before, I quite like failure: it’s a useful tool for a data scientist, and in my view it has an important place in a data scientist’s workflow.

Here’s my basic workflow. Note the strong similarity in parts to Hadley Wickham’s data science workflow, which I think is an excellent discussion of the process. In this case, however, I wanted to talk more about an interactive workflow with a client and how failure fits into that.

[Figure: a flow chart of the workflow described below]

An interactive workflow

As a consultant, a lot of what I do is interactive with the client. This creates opportunities for better analysis. It also creates opportunities for failure. Let me be clear: some failures are not acceptable and are not in any way beneficial. Those are the failures that happen after ‘analysis complete’. All the failures that happen before that are an opportunity to improve, grow and cement a working relationship with a client. (Some, however, are hideously embarrassing; you want to avoid those in general.)

This workflow is specific to me: your mileage will almost certainly vary, and I’m afraid I have no opinion on how much. As usual, take what’s useful to you and jettison the rest.

My workflow starts with a client request. This request is often nebulous, vague and unformed. If that’s the case, then there’s a lot of work around getting the client’s needs and wants into a shape that will be (a) beneficial to the client and (b) achievable. That’s a whole other workflow to discuss on another day.

This is the stage where I recommend documenting client requests in the form of some kind of work order, so everyone’s on the same page. Some clients have better clarity around what they require than others. Having a document you can refer back to at handover saves a lot of time and difficulty when the client is working outside their domain knowledge. It also helps a lot at final-stage validation with the client: it’s then easy to check off that you did what you set out to do.

Once the client request is in workable shape, it’s time to identify and select data sources. These may be client-provided, external, or both. Pro tip: document everything. Every Excel worksheet, .csv, database – everything. Where did it come from, who gave it to you, how and when? I talk about how I do that in part here.
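One lightweight way to keep that log (not necessarily how I’d structure it for every project – the entries here are invented):

```r
library(tibble)

# A minimal data-source log: what arrived, from whom, when and how.
data_log <- tribble(
  ~file,             ~source,               ~received,    ~how,
  "sales_2018.xlsx", "client finance team", "2018-03-01", "email",
  "postcodes.csv",   "public ABS download", "2018-03-04", "web"
)

data_log
```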

Next I validate the data: does it make sense, is it what I expected, what’s in there? Once that’s all done, I need to validate my findings against the client’s expectations. Here’s where a failure is good: you want to find out EARLY if the data is a load of crap. You can then manage the client’s expectations around outcomes – and lay the groundwork for future projects.
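The checks themselves don’t need to be fancy. A sketch, with a toy data frame standing in for client data:

```r
# Entirely made-up data illustrating the kind of problems to look for.
df <- data.frame(
  id     = c(1, 2, 2, 4),
  amount = c(100, -5, NA, 250000),
  region = c("NSW", "VIC", "VIC", "nsw")
)

summary(df)            # ranges: is a negative amount plausible? is 250,000?
colSums(is.na(df))     # missing values by column
anyDuplicated(df$id)   # duplicate IDs that shouldn't be there
table(df$region)       # inconsistent coding ("NSW" vs "nsw")
```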

If it’s a failure, it’s back to data sourcing and validation. If it’s a pass, it’s on to cleaning and transformation – another sub-workflow in itself.

Analyse, Model, Visualise

This part of my workflow is very close to the iterative model proposed by Hadley Wickham that I linked to above. It’s fundamentally repetitive: try, catch problems and insights, repeat. I also like to note the difference between visualisation for finding insight and visualisation for communicating insight. These can be the same, but they’re often different.

Sometimes I find an insight in the statistics and use the visualisation to communicate it. Sometimes I find the insight in the visualisation, then validate it with statistics and communicate it to the client with a chart. Pro tip: the more complex an idea, the easier it is to present it initially with a chart. Don’t diss the humble bar chart: it’s sometimes 90% of the value I add as a consultant, fancy multi-equation models notwithstanding.

This process is full of failure. Code that breaks, models that don’t work, statistics that are not useful. Insights that seem amazing at first, but aren’t valid. Often the client likes regular updates at this point and that’s a reasonable accommodation. However! Be wary about communicating your latest excitement without internal validation. It can set up expectations for your client that aren’t in line with your final findings.

You know you’re ready to move out of this cycle when you run out of failures or time or both.

Communicate and validate

This penultimate stage often takes the longest: communication is hard, while writing some code and throwing a bunch of data into a model is relatively easy. It’s also the most important stage – it’s vital that we communicate in such a way that our client or domain experts can validate what we’re finding. Avoid at all costs the temptation to slip into tech-speak. The client must be able to engage with what you’re saying.

If that all checks out, then great – analysis complete. If it doesn’t, we’re bounced all the way back to data validation. It’s a big failure – but that’s OK. It’s far more palatable than the failures that come after ‘analysis complete’.

Interpreting Models: Coefficients, Marginal Effects or Elasticities?

I’ve spoken about interpreting models before. I think communicating results is the most important part of our work, but it’s one that’s often overlooked when discussing the how-to of data science. For that job, marginal effects and elasticities usually serve better than coefficients alone.

Model building, selection and testing are complex and nuanced. Communicating the model is sometimes harder, because a lot of the time your audience has no technical background whatsoever. Your stakeholders can’t go up the chain with, “We’ve got a model. And it must be a good model because we don’t understand any of it.”

Our stakeholders also have a limited attention span, so the explanation process is twofold: explain the model and do it fast.

For these reasons, I usually interpret models for my stakeholders with marginal effects and elasticities, not coefficients or log-odds. Coefficient interpretation differs between regressions depending on functional form, and if you have interactions or polynomials built into your model, then a coefficient is only part of the story. If you have a more complex model like a tobit, a conditional logit or something else again, the interpretation of coefficients is different for each one.

I don’t know about your stakeholders and reporting chains: mine can’t handle that level of complexity.

Marginal effects and elasticities are also different for each of these models but they are by and large interpreted in the same way. I can explain the concept of a marginal effect once and move on. I don’t even call it a “marginal effect”: I say “if we increase this input by a single unit, I expect [insert thing here]” and move on.

Marginal effects and elasticities are often variable over the range of your sample: they may be different at the mean than at the minimum or maximum, for example. If you have interactions and polynomials, they will also depend on covarying inputs. Some people see this as added layers of complexity.

In the age of data visualisation, I see it as an opportunity to chart these relationships and visualise how your model works for your stakeholders.

We all know they like charts!
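Here’s a minimal sketch of what I mean, using simulated data (all the numbers are my own invention): fit a logit with glm(), compute the average marginal effect of the input by hand, then chart how that effect varies across the input’s range.

```r
library(ggplot2)

set.seed(1)                                  # simulated data for illustration
n <- 2000
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-0.5 + 0.8 * x))    # a known relationship to recover

fit <- glm(y ~ x, family = binomial)

# Average marginal effect of a one-unit increase in x on P(y = 1):
# the logistic density at each fitted linear predictor, times the coefficient.
ame <- mean(dlogis(predict(fit, type = "link"))) * coef(fit)["x"]
ame

# The effect isn't constant, so chart it across the observed range of x.
grid <- data.frame(x = seq(min(x), max(x), length.out = 100))
grid$me <- dlogis(predict(fit, newdata = grid, type = "link")) * coef(fit)["x"]

ggplot(grid, aes(x, me)) +
  geom_line() +
  labs(y = "Marginal effect of x on P(y = 1)",
       title = "A marginal effect that varies over the range of the input")
```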

Productivity: In the Long Run, It’s Nearly Everything.

“Productivity … isn’t everything, but in the long run it’s nearly everything.” Paul Krugman, The Age of Diminished Expectations (1994)

So in the very long run, what’s the Australian experience? I recently did some work with the Department of Communications and the Arts on digital techniques and developments. Specifically, we were looking at the impacts that advances in fields like machine learning, artificial intelligence and blockchain may have on productivity in Australia. I worked with a great team at the department, led by Chief Economist Paul Paterson, and we’re looking forward to our report being published.

In the meantime, here’s the very long run on productivity down under.

[Chart: Australian productivity over the very long run]

Yield to Maturity: A Basic Interactive

The yield to maturity concept describes the approximate rate of return a bond generates if it’s held until its redemption date. It depends on a few things, including the coupon rate (the nominal interest rate), the face value of the bond, the price of the bond and the time until maturity.

The mathematics behind it can get a little confusing, so I’ve created a simple Shiny app that lets you manipulate the inputs and observe what happens. Bear in mind this is not a financial calculator; it’s an interactive for educational purposes. It also uses the approximate, not exact, yield to maturity of a bond, which is fine for our purposes.

I’ve mapped the yield up to a 30-year redemption and assumed a face value of $100. The coupon rate varies between 0% and 25%, and the current price of the bond can vary between $50 and $150. Mostly, the yield curve is very flat in this simplified approximation – but observe what happens when there is only a short time to maturity (0–5 years) and rates or prices are extreme. You can find the interactive directly here.
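For the curious, this kind of calculation usually rests on the textbook approximation below. I’m sketching it here as a plain R function, which may differ in detail from what the app does under the hood.

```r
# Approximate yield to maturity:
#   YTM ~ (C + (F - P) / n) / ((F + P) / 2)
# where C = annual coupon payment, F = face value, P = current price,
# n = years to maturity.
approx_ytm <- function(coupon_rate, price, years, face = 100) {
  coupon <- coupon_rate * face
  (coupon + (face - price) / years) / ((face + price) / 2)
}

# Example: 5% coupon, $100 face value, bought at $95, 10 years to maturity
approx_ytm(coupon_rate = 0.05, price = 95, years = 10)   # about 0.056, i.e. ~5.6%
```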



Remember, this is just an approximation. For a more accurate calculation, see here.

Does it matter in practice? Normal vs t distribution

One of the perennial discussions is normal vs t distributions: which do you use, when, why and so on. This is one of those cases where, for most sample sizes in a business analytics/data science context, it probably makes very little practical difference. Since that’s such a rare thing for me to say, I thought it was worth explaining.

Now I’m all for statistical rigour: you should use the right one at the right time for the right purpose, in my view. However, this can be one of those cases where if the sample size is large enough, it’s just not that big a deal.

The actual simulations I ran are very simple: just 10,000 draws each from the normal and t distributions, with the t varying across different degrees of freedom. Then I plotted the density of each on the same graph using ggplot in R. If you’d like to have a play around with the code, leave a comment to let me know and I’ll post it to GitHub.
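In the meantime, here’s a minimal sketch of what such a simulation can look like – the seed and the degrees of freedom are arbitrary choices for illustration:

```r
library(ggplot2)

set.seed(42)                     # arbitrary seed
n   <- 10000
dfs <- c(2, 5, 30)               # example degrees of freedom for the t

draws <- rbind(
  data.frame(value = rnorm(n), dist = "normal"),
  do.call(rbind, lapply(dfs, function(df) {
    data.frame(value = rt(n, df = df), dist = paste0("t, df = ", df))
  }))
)

ggplot(draws, aes(value, colour = dist)) +
  geom_density() +
  coord_cartesian(xlim = c(-5, 5)) +
  labs(title = "10,000 draws: normal vs t at different degrees of freedom",
       colour = NULL)
```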

Using Natural Language Processing for Survey Analysis

Surveys have a specific set of analysis tools for the quantitative part of the data you collect (Stata is my particular poison of choice in this context). However, often the interesting parts of the survey are the unscripted, “tell us what you really think” comments.

Certainly this has been true in my own experience. I once worked on a survey deployed to teachers in Laos regarding resources for schools and teachers. All our quantitative information came back and was analysed, but one comment (translated for me into English by a brilliant colleague) stood out. It read something to the effect of “this is very nice, but the hole in the floor of the second story is my biggest concern as a teacher”. It’s not something that would ever have been included outright in the survey, but a simple sentence told us a lot about the resources this school had access to.

Careful attention to detailed comments in small surveys is possible. But if you have thousands upon thousands of responses, this is far more difficult. Enter natural language processing.

There are a number of tools which can be useful in this context. This is a short overview of some that I think are particularly useful.

  • Word Clouds. These are easy to prepare and very simple, but can be a powerful way to communicate information. Like all data visualisation, there are the good and the bad. This is an example of a very simple word cloud, while this post by Fells Stats illustrates some more sophisticated methods of using the tool.

One possibility to extend on the simple “bag of words” concept is to divide your sample by groups and compare clouds. Or create your own specific dictionary of words and concepts you’re interested in and only cloud those.

Remember that stemming the corpus is critical. For example, “work”, “worked”, “working” and “works” all belong to the same stem. They should be treated as one; otherwise, if the word is particularly common, its variants are likely to swamp other themes.

Note that no word cloud should be constructed without removing “stop words” like the, and, a, I etc. Stop-word dictionaries vary – they can (and should) be tailored to the problem at hand.

  • Network Analysis. If you have a series of topics you want to visualise relationships for, you could try a network-type analysis similar to this. The concept may be particularly useful if you manually decide topics of interest and then examine relationships between them. In this case, the outcome is very much user-dependent/chosen, but may be useful as a visualisation.
  • Word Frequencies. Alone, simple tables of word frequencies are not always particularly useful. In a corpus of documents pertaining to education, noting that “learning” is a common term isn’t particularly illuminating. However, how do these frequencies change by group? Do teachers speak more about “reading” than principals? Do people in one geographical area or salary bracket have a particular set of high-frequency words compared to another? This is a basic exercise in feature/variable engineering. In this case, the usual data analysis tool kit applies (see here, here and here). Remember you don’t need to stop at high-frequency words: what about high-frequency phrases?
  • TF-IDF (term frequency–inverse document frequency) matrix. This may provide useful information and is the basis of many more complex analyses. TF-IDF downweights terms appearing in all documents/comments (“the”, “i”, “and” etc.) while upweighting rare words that may be of interest. See here for an introduction, and the sketch after this list for a minimal example.
  • Are the comments clustered across some lower-dimensional space? The k-means algorithm may provide some data-driven guidance there. This would be an example of “unsupervised machine learning”, a.k.a. “an algorithm everyone has been using for 25 years but we need to call it something cool”. This may not generate anything obvious at first – but who is in those clusters and why are they there?
  • Sentiment analysis will be useful, possibly both applied to the entire corpus and to subsets. For example, among those who discussed “work life balance” (and derivative terms) is the sentiment positive or negative? Is this consistent across all work/salary brackets? Are truck drivers more upbeat than bus drivers? Again, basic feature/variable engineering applies here. If you’re interested in this area, you could do a lot worse than learning from Julia Silge who writes interesting and informative tutorials in R on the subject.
  • Latent Dirichlet Allocation (LDA) and more complex topic analyses. Finally, latent Dirichlet allocation or other more complex topic models may be able to generate topics directly from the corpus. I think this would take a great deal of time for a new user and may have limited payoff, particularly if an early analysis suggests you already have a clear idea of which topics are worth investigating. It is, however, particularly useful when dealing with enormous corpora. This is a really basic rundown of the concept, and this is a little more complex but has useful information.
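To make a couple of those concrete – the frequency and TF-IDF items in particular – here’s a minimal sketch using the tidytext tools, with a toy set of invented “survey comments”:

```r
library(dplyr)
library(tidytext)
library(tibble)
library(SnowballC)

# A toy stand-in for free-text survey comments (entirely made up).
comments <- tibble(
  group = c("teacher", "teacher", "principal"),
  text  = c("The reading resources are working well",
            "We need more support for reading and writing",
            "Budget reporting takes too much of my time")
)

tokens <- comments %>%
  unnest_tokens(word, text) %>%             # one row per word
  anti_join(stop_words, by = "word") %>%    # drop "the", "and", "my", ...
  mutate(stem = wordStem(word))             # "working", "works" -> "work"

# Word (stem) frequencies by group
tokens %>% count(group, stem, sort = TRUE)

# TF-IDF: upweight stems that distinguish one group's comments from the rest
tokens %>%
  count(group, stem) %>%
  bind_tf_idf(stem, group, n) %>%
  arrange(desc(tf_idf))
```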

So that’s a brief rundown of some basic techniques you could try: there are plenty more out there – this is just the start. Enjoy!

Congratulations to the Melbourne Data Science Group!

Last week, I attended the Melbourne Data Science Initiative and it was definitely the highlight of my data science calendar this year! The event was superbly organised by Phil Brierley and his team. Events included tutorials on Machine Learning, Deep Learning and Business Analytics, and talks on feature engineering, big data and the need to invest in analytic talent, amongst others.

The speakers were knowledgeable and interesting, with everything covered from the hilarious building of a rhinoceros catapult (thanks to Eugene from Presciient – it’s possible I’ll never forget that one) to the paramount importance of the “higher purpose” in business analytics, as discussed by Evan Stubbs from SAS Australia and New Zealand.

If you’re in or around Melbourne and into data science at all, check out the group who put on this event here.