Going indy: define success

Deciding to go out on your own as an independent consultant can be both freeing and utterly terrifying.

Gone is the external validation from your workplace peers and supervisors (who’s going to tell you you’re a good data scientist now?).

Gone is the 9-5.

Gone are uncomfortable shoes – live the dream!

Working out how you’ll become independent is hard when you haven’t done it before. This is the first in a series of posts. Some of them are going to tell you how I did it, others how I should have done it. I’ll cover the basics of financials, defining your value proposition, networking and whatever else I can think of that might be useful.

But if I had my time over again, there’s one thing I’d do before any of that.

The first thing I think you should do if you’re going to go independent is define success. What does success look like to you?

If you don’t know, you’ll spend a lot of time floundering – maybe doing things that don’t make you happy. Maybe looking for that external validation that isn’t coming. Maybe comparing yourself to companies and people in ways that aren’t sensible. I did a lot of that in my time – hopefully you don’t have to!

So what does success look like for you?

Success might be financial

Financial success is generally regarded as pretty amazing, but what does it mean? For everybody who goes independent as a consultant it’s going to be different.

Some people might see financial success as being bought out by a large consulting masthead in five years’ time for a seven figure sum.

For others, financial success looks like the ability to pay your mortgage, your bills and have enough left over for the fun stuff – with the added bonus of having time to do the fun stuff.

Neither is bad or wrong: the right part of the spectrum is the one that’s working for you.

Success might be personal

Personal success has a few dimensions – the respect of your peers, the respect of clients, building something worthwhile. For some people, the freedom to take on pro bono work, mentor others and contribute to great causes is a big driver of going independent. For others, the time to spend with your kids is a measure of success.

Don’t let financial measures of success be your only indicator: the whole point of being independent is the freedom to define yourself and your business.

Success might be intellectual

One important bonus of being independent is the ability to choose the work you’d like to take on. To be frank, it can take a while to get to the point where those choices are out there. But once they are, the intellectual freedom to pursue projects that are interesting and push your boundaries can be another measure of success.

Success might just be space

I’m going to lay this one out there: data scientists are a pretty distinct bunch of people and for many that comes with a lot of pressure. Success for you might be the space just to be yourself without having to spend extortionate amounts of energy keeping up with the requirements of a corporate or academic world.

That doesn’t mean that at some point in the future you can’t take on the corporate or academic ladder: but the space just to be yourself with minimal overhead is a kind of success worth talking about. Lots of us need it for at least a while in our lives, and independent consulting can be one road to that kind of freedom.

Where to from here?

Once you’ve decided what big picture success looks like for your new consultancy, it’s time to get specific. In the next few weeks, I’ll talk about defining a data science offering and value proposition, defining and dealing with competition, what you need to get started, the financials and whatever else I can think of.

Got something you’d like me to cover or a question? Comments aren’t open because I have all the generic viagra and cialis I could ever need. But feel free to use the contact page or drop me a line on twitter here!

Writing business reports

Congratulations, you’re a data scientist! You’ve got the tech skills, the maths skills, the stats skills… and now you need to write a business report (or several) and nobody ever took five minutes to show you the basics.

That’s what this post is about: the basics of writing for business.

Business reporting is hugely varied – the most important thing is to work out who your audience is and write for them. This is never a one-size-fits-all situation and the minute you try to make it one, you will end up with a low-quality product.

Here are some thoughts to help you on your way.


Business reports can be a lot more free-flowing and variable than you might think. The term ‘report’ encompasses a huge range of documents. Here I’ll talk about a formal, structured document – but remember this isn’t always the best approach. Tailor the document to the stakeholder, always!

There are a number of elements to the formal business report and I’ll pick out a few I think need better understanding.

Executive summary

Executive summaries come first: after the title page, but before your table of contents and introduction. This is the most important part of the report – the tl;dr (tl;dr: too long, didn’t read).

Most reports suffer from tl;dr. The executive summary is your chance to make an impact in spite of that. It’s not the same as an introduction – it’s almost a mini report.

The executive summary should include what you did, how you did it, what you found out doing so and if there are any limitations or caveats on the doing.

Sometimes an executive summary can feel like a bunch of lists: things you did, things you thought, things you investigated, things you found out. If you need to list off a bunch of things – use bullet points to make that list easier on your reader.

Infographics are also a really useful way to convey information to an audience that doesn’t have the time or capacity to read the details. The advantage of this is they get the surface-level overview – and know where to come back for detail when/if they need it. (For more info on infographics see here.) Infographics are also really shareable which means your work may get wider currency than it would otherwise.


Introduction

On the surface, the introduction looks the same as an executive summary, and the two are often confused: they’re not the same thing!

The introduction provides the baseline for the report’s readers. It should include some context (why are we here?) and in some circumstances a basic run down of results and methods is also relevant (what did we find out from being here and how did we find it out?). However, in shorter business reports, these may be best left to their own sections. It’s always a matter of judgement.

It should also include a road map for the report’s readers, so they know where the critical information lies. Something like: “Section 1 discusses the context of internet access in rural NSW, while Section 2 outlines the methods used in this study.”

The introduction comes after the table of contents and the executive summary.


The main body

This is the main action of your report. Its structure depends on what you’re doing and who you’re doing it for. You may need a methods section – or it may be best relegated to an appendix if you have a non-technical audience. You may need a section on context – or the audience may understand the context already. It all depends.

The trick to structuring your report is to think about what it is your audience needs to take away from reading it: what is the purpose here? What was the purpose of the work in the first place? Start there.

Structure is essentially a drafting problem – see some tips below for help on that.


Conclusion

A poor cousin of the executive summary – this can usually be reasonably brief compared to the rest of the report. It should outline what has been done and its outcome. This is not the section to introduce anything new!

Including analyses

We’re data scientists and analysts: of course there are going to be analyses. How do you decide which ones to include, though?

Be aware of the temptation to ‘throw everything in there’ to show everyone what a wonderfully thorough job you did. Does it add to the purpose and structure you defined above? If not, it stays out.

Brag to your twitter pals about the fantastic model you built, though – at least you know they’ll care.

The finer points of tables and figures

Most of our analyses end up in reports as tables and figures. There are a couple of really simple points about using them that aren’t widely known. But like many simple things, they’re only simple if somebody has taken the time to tell you about them.

The following applies to a formal report, other formats may have different restrictions. Your company or organisation may have its own style guide and this should always take precedence over the following:

  • Tables and figures should be numbered and labelled. When referring to them specifically, they are proper nouns and should be capitalised, e.g. Table 1 shows some uninteresting statistics, while Figure 427 shows another bar chart. Use the caption option where you can in your chosen tool (there’s a sketch of one approach just after this list).
  • Ideally tables and figures should have their own lists at the beginning of the report, like a table of contents. These are generally easy to insert and generate automatically.
  • In a table, column and row headings are often treated as proper nouns and capitalised – consider the overall style of your document.
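
If you’re building the report in Rmarkdown (more on tools below), captions can be attached in code. A minimal sketch, assuming the knitr package and a summary table you’ve already built (regional_summary is a made-up name):

# 'regional_summary' is a hypothetical data frame of results
knitr::kable(regional_summary, caption = "Table 1: Internet access by region")
# For charts, set the chunk option fig.cap = "Figure 1: ..." on the chunk
# that draws the figure and knitr will attach the caption for you.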

Colour is a whole series of posts in itself, but I use the following as rules of thumb:

  • If the company or organisation has a colour palette, use that for preference.
  • Be aware of colour vision deficiencies among your readers. Here are some thoughts on that.
  • Less is usually more when it comes to colour in charts and definitely when it comes to colour in tables.


Drafting tips

Writing is, well, a very difficult skill to master. I should know: I was so bad at it when I started grad school that I bought my Ph.D. supervisor a box of red pens for Christmas one year because he had used so much red ink on my work!

Luckily for me he has a great sense of humour. Even luckier for me, he spent a lot of time helping me develop this skill that has stood me in good stead my whole career.

You won’t be a great technical writer overnight – and that’s OK. But when you’re just starting out or are unsure, here’s what helped me get better.

  • Whenever I’m trying to draft something difficult, I start out with bullet points. It helps me structure an argument without getting lost in the writing part.
  • I started out writing in present tense wherever possible. I used to mix my tenses a lot and it made it hard to understand my work. Where it was sensible/possible, sticking to present tense gave me one less thing to manage. You could do the same for past tense if that works better for your context.
  • Sentences should be reasonably short. I had so many ideas when I was starting out. They’d expand into subclause on subclause, then another one… and before I knew it I had sentences 10 lines long. To this day, an essential part of my writing pattern is going back through my writing and shortening as many sentences as I can.
  • Paragraphs should also be reasonably short in general. Got a paragraph that goes for half a page? Time to break it up!
  • A first draft is your best aid to writing well. Let me also be clear: a first draft is also generally a steaming pile of… let’s call it ‘junk’. It’s a good way of helping your brain organise its ideas. If you’re stuck, write a terrible first draft and bin it. The next will be better!
  • After you’ve spent a lot of time on a document, you need to put it down in order to see the flaws. Take breaks, go work on something else and then come back to it. You’ll be mildly horrified at everything you find that’s bad. But nobody else saw those errors before you did – great!
  • Another technique to find errors or bad writing is to read the draft backwards: sentence by sentence. This is time consuming, but it helps me find the details that need fixing.

What to write in?

When writing technical material, Rmarkdown rocks my world. I export to Word when working collaboratively with non-technical people and then build their changes back into the Rmarkdown document. It’s not ideal, but it works.
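
A minimal sketch of the front matter for an Rmarkdown file set up to knit straight to Word (the title is just a placeholder):

---
title: "Quarterly client report"
output: word_document
---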

There’s a whole bunch of reasons why we write – so choose the tool that’s going to work best for you and your team, and go for it.

References and Citations

Definitely worth the time – both for your own knowledge and reference, and to credit the work you’re building on. Remember to cite your packages too. I haven’t always done this in the past! But it’s important to acknowledge the open source contributors whose work we are using – it’s now part of my routine.
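
In R, the citation for any package is one function call away (ggplot2 here is just an example):

citation("ggplot2")   # prints the citation for the package, including a BibTeX entry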

And we’re done

That’s a brief overview of some of the things that have helped me with writing for business. Good luck!

Mapping analytics objects

A lot of incredibly important work has been done around data science workflows, most notably by Jenny Bryan. If you’re new to thinking about workflows, start with the incredible STAT545 resources and Happy Git and Github for the useR. Jenny’s work got me thinking about my broader workflow.

As a consultant, I work with a ream of developing documents, datasets, requests, outputs and analyses. A collection of analytical ephemera I refer to as analytics objects. When looking at a big project, I’ve found it helpful to start mapping out how these objects interact, where they come from and how they work together.

Here’s a general concept map: individual projects vary a lot. But it’s a start point.

A concept map with analytics objects.

Client request objects

My workflow tends to start with client requests and communications – everything from the initial “are you available, we have an idea” email to briefings, notes I’ve taken during meetings, documents I’ve been given.

At the start of the project this can be a lot of documents and it’s not always easy to know where they should sit or how they should be managed.

A sensible solution tends to develop over time, but this is a stage where it’s easy to lose or forget about certain important things if it all stays in your inbox. One thing I often do at the start of a project is basic document curation in a simple Excel sheet, so I know what I’ve got, where it came from and what’s in it.

I don’t usually bother curating every email or set of meeting notes, but anything that looks like it may be important or could be forgotten about goes in the list.

a picture of a spreadsheet

Data objects

The next thing that happens is that people give me data, I go and find data, or some third party sends data my way.

There’s a lot of data flying about – sometimes it’s different versions of the same thing. Sometimes it’s supposed to be the same thing and it’s not.

It often comes attached with metadata (what’s in it, where did it come from, who collected it, why) and documents that support that (survey instruments, sampling instructions etc.).

If I could go back and tell my early-career self one thing it would be this: every time someone gives you data, don’t rely on their documentation – make your own.

It may be short, it may be brief, it may simply contain references to someone else’s documentation. But take the time to go through it and make sure you know what you have and what you don’t.

For a more detailed discussion of how I handle this in a low-tech environment/team, see here. Version control systems and Rmarkdown are my strong preference these days – if you’re working with a team that has the capacity to manage these things, Rmarkdown is brilliant for building data dictionaries, metadata collections and other provenance information. But even if you’re not and need to rely on Excel files for notes, don’t skip this step.

Next come the analysis and communications objects, which you’re probably familiar with.

Analysis and communications objects

(Warning: shameless R plug here)

The great thing about R is that it maps most of my analysis and communications objects for me. Using an Rproject as the basis for analysis means that the provenance of all transformed data, analyses and visualisations is baked in. Version control with Github means I’m not messing around with 17 excel files all called some variation of final_analysis.xlsx.

Using Rmarkdown and Shiny for as much communication with the client as possible means that I’ve directly linked my reporting, client-bound visualisations and returned data to my analysis objects.

That said, R can’t manage everything (but they’re working on it). Sometimes you need functionality R can’t provide and R can’t tell you where your data came from if you don’t tell it first. R can’t tell you if you’re scoping a project sensibly.

Collaboration around an Rmarkdown document is difficult when most of your clients are not R users at all. One workaround for me has been to:

  • Export the Rmarkdown document as a Word document (a minimal sketch of this step follows the list)
  • Have non-technical collaborators make changes and updates via tracked changes
  • Depending on the stage of the project, either input those changes back into R by hand or go forward with the Word document.
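
For the export step, a minimal sketch, assuming your report lives in a file called report.Rmd (the file name is illustrative):

rmarkdown::render("report.Rmd", output_format = "word_document")
# Produces report.docx alongside the .Rmd, ready to send out for tracked changes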

It’s not a perfect system by any means, but it’s the best I’ve got right now. (If you’ve got something better, I’d love to hear about it.)

Objects inform other objects

In a continuing environment, your communications objects inform the client’s and so on. Not all of these are used at any given time, but sometimes as they get updated or if projects are long term, important things get lost or confused. Thinking about how all these objects work together helped my workflow tremendously.

The lightbulb moment for me was that I started thinking about all my analytics objects as strategically as Jenny Bryan proposes we think about our statistics workflow. When I do that, the project is better organised and managed from the start.

Where do things live in R? R for Excel Users

One of the more difficult things about learning a new tool is the investment you make while you’re learning things you already know in your current tool. That can feel like time wasted – it’s not, but it’s a very frustrating experience. One of the ways to speed up this part is to ‘translate’ concepts you know in your current tool into concepts for your new one.

In that spirit, here’s a brief introduction to where things live in R compared to Excel. Excel is a very visual medium – you can see and manipulate your objects all the time. You can do the same in R, it’s just that they are arranged in slightly different ways.

Where does it live infographic


Data

Data is the most important part. In Excel, it lives in the spreadsheet. In R, it lives in a data structure – commonly a data frame. In Excel you can always see your data.

Excel spreadsheet

In R you can too – go to the Environment window and click on the spreadsheet-looking icon. It will show your data in the viewer window, provided it’s an object that can be displayed that way (if you don’t have this option, your object may be a list rather than a data frame). You can’t manipulate the data like this, however – you need code for that. You can also use commands like head(myData) to see the first few lines, tail(myData) to see the last few and print(myData) to see the whole object.

R environment view

view of data in R
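
For example, with a data frame called myData (the name is just a placeholder for your own object):

head(myData)    # the first six rows
tail(myData)    # the last six rows
print(myData)   # the whole object in the console
View(myData)    # the same spreadsheet-style viewer you get from the Environment pane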


Code

Excel uses code to make calculations and create statistics – but it often ‘lives’ behind the object it produces. Sometimes it can make your calculation look like the original data and create confusion for your stakeholders (and for you!).

Excel formula

In R, code is used in a similar way to Excel, but it lives in a script – a .R file. This makes it easier to reuse and understand, and more powerful to manipulate. Using code in a script saves a lot of time and effort.

R script

Results and calculations

In Excel, results and calculations live in a worksheet in a workbook. They can be easy to confuse with the original data, it’s hard to check whether things are correct, and re-running analyses (you often re-run them!) is time consuming.

In R, if you give your result or analysis a name, it will be in the Environment, waiting for you – you can print it, copy it, change it, chart it, write it out to Excel for a coworker and recreate it any time you need with your script.

A result in R
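
As a minimal sketch (the object and column names are made up):

average_sale <- mean(myData$sale_amount)    # a named result, now sitting in the Environment
average_sale                                # print it any time
write.csv(data.frame(average_sale), "average_sale.csv", row.names = FALSE)  # hand it to a coworker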

That’s just a simple run down – there’s a lot more to R! But it helps a lot to know where everything ‘lives’ as you’re getting started. Good luck!

Closures in R

Put briefly, closures are functions that make other functions. Are you repeating a lot of code, but there’s no simple way to use the apply family or purrr to streamline the process? Maybe you could write your own closure. Closures enclose access to the environment in which they were created – so you can nest functions within other functions.

What does that mean exactly? Put simply, the function you create can see all the variables enclosed in the function that created it. That’s a useful property!

What’s the use case for a closure?

Are you repeating code, but instead of variables or data that’s changing – it’s the function instead? That’s a good case for a closure. Especially as your code gets more complex, closures are a good way of modularising and containing concepts.

What are the steps for creating a closure?

Building a closure happens in several steps:

  1. Create the output function you’re aiming for. This function’s input is the same as what you’ll give it when you call it. It will return the final output. Let’s call this the enclosed function.
  2. Enclose that function within a constructor function. This constructor’s input will be the parameters by which you’re varying the enclosed function in Step 1. It will output the enclosed function. Let’s call this the enclosing function.
  3. Realise you’ve got it all wrong, go back to step 1. Repeat this multiple times. (Ask me how I know..)
  4. Next you need to create the enclosed function(s) (the ones from Step 1) by calling the enclosing function (the one from Step 2).
  5. Lastly, call and use your enclosed functions.

An example

Say I want to calculate the mean, SD and median of a data set. I could write:
x <- c(1, 2, 3)
mean(x)
sd(x)
median(x)

That would definitely be the most efficient way of going about it. But imagine that your real use case is hundreds of statistics or calculations on many, many variables. This will get old, fast.

I’m calling those three functions each in the same way, but the functions are changing rather than the data I’m using. I could write a closure instead:

stat <- function(stat_name){
  function(x){
    stat_name(x)
  }
}

This is made up of two parts: function(x){} which is the enclosed function and stat() which is the enclosing function.

Then I can call my closure to build my enclosed functions:

mean_of_x <- stat(mean)
sd_of_x <- stat(sd)
median_of_x <- stat(median)

Lastly I can call the created functions (probably many times in practice):

mean_of_x(x)
sd_of_x(x)
median_of_x(x)

I can repeat this for all the statistics/outcomes I care about. This example is too trivial to be realistic – it takes about double the lines of code and is grossly inefficient! But it’s a simple example of how closures work.

More on closures

If you’re producing more complex structures, closures are very useful. See Jason’s post from Left Censored for a realistic bootstrap example – closures can streamline complex pieces of code, reducing mistakes and improving the process you’re trying to build. They take modularity to the next level.
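
To give a flavour of that (my own minimal sketch, not the example from that post): a closure that fixes the data and the statistic, returning a function that produces one bootstrap replicate each time it’s called.

make_bootstrap <- function(data, statistic){
  function(){
    statistic(sample(data, size = length(data), replace = TRUE))
  }
}

boot_mean <- make_bootstrap(x, mean)   # x from the example above
replicate(1000, boot_mean())           # 1000 bootstrapped means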

For more information in R see Hadley Wickham’s section on closures in Advanced R.


A Primer on Basic Probability

… and by basic, I mean basic. I sometimes find people come to me with questions and no one has ever taken the time to give them the most basic underpinnings in probability that would make their lives a lot easier. A friend of mine is having this problem and is on a limited time frame for solving it, so this is quick and dirty and contains both wild ad-lib on my part and swearing. When I get some more time, I’ll try and expand and improve, but for now it’s better than nothing.

YouTube explainer: done without a microphone, sorry – time limit again.

Slides I used:


I mentioned two links in the screencast. One was Allen Downey’s walkthrough with Python – you don’t need to know anything about Python to explore this one: well worth it. The other is Victor Powell’s visualisation of conditional probability. Again, worth a few minutes’ exploration.

Good luck! Hit me up in the comments section if you’ve got any questions, this was a super quick run through so it’s a summary at best.

Data Visualisation: Hex Codes, Pantone Colours and Accessibility

One of the things I find hardest about data visualisation is colouring. I’m not a natural artist, much preferring everything in gentle shades of monochrome. Possibly beige. Obviously, for any kind of data visualisation, this is limiting. Quite frankly, it’s the kind of comfort zone that needs setting on fire.

I’ve found this site really helpful: it’s a listing of the Pantone colours with both Hex and RGB codes for inserting straight into your visualisations. It’s a really useful correspondence if I’m working with someone (they can give me the Pantone colour numbers of their website or report palette – I just search the page).

One thing I’ve found, however, is that a surprising (to me) number of people have some kind of colour-based visual impairment. A palette that looks great to me may be largely meaningless to someone I’m working with. I found this out in one of those forehead slapping moments when I couldn’t understand why a team member wasn’t seeing the implications of my charts. That’s because, to him, those charts were worse than useless. They were a complete waste of his time.

Some resources I’ve found helpful in making my visualisations more accessible are the colourblind-friendly palettes discussed here and this discussion on R-Bloggers. The latter made me realise that up until now I’ve been building visualisations that were obscuring vital information for many users.

The things I think are important for building an accessible visualisation are:

  • Yes, compared to more subtle palettes, colour-blind friendly palettes look like particularly lurid unicorn vomit. They don’t have to look bad if you’re careful about combinations, but I’m of the opinion that prioritising accessibility for my users is more important than “pretty”.
  • Redundant encoding (discussed in the R-bloggers link above) is a great way of ensuring users can make out the information you’re trying to get across – there’s a sketch after this list. To make sure this is apparent in your scale, use a combination of scale_colour_manual() and scale_linetype_manual(). The latter works the same way as scale_colour_manual() but is not as well covered in the literature.
  • Consider reducing the information you’re putting into each chart, or using a combination of facets and multiple panels. The less there is to differentiate, the easier it can be on your users. This is a good general point and not limited to those with colourblindness.
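
Here’s a minimal sketch of redundant encoding in ggplot2 – the data frame is made up, and the hex codes are from the Okabe–Ito colourblind-friendly palette:

library(ggplot2)

# Made-up long-format data: three groups measured over four years
df <- data.frame(
  year  = rep(2013:2016, times = 3),
  value = c(2, 3, 5, 4,  1, 2, 2, 3,  4, 4, 6, 7),
  group = rep(c("A", "B", "C"), each = 4)
)

# Each group gets both its own colour and its own linetype, so the series
# stay distinguishable for readers with colour vision deficiencies
ggplot(df, aes(x = year, y = value, colour = group, linetype = group)) +
  geom_line() +
  scale_colour_manual(values = c(A = "#0072B2", B = "#D55E00", C = "#009E73")) +
  scale_linetype_manual(values = c(A = "solid", B = "dashed", C = "dotdash"))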

Using Natural Language Processing for Survey Analysis

Surveys have a specific set of analysis tools that are used for analysing the quantitative part of the data you collect (Stata is my particular poison of choice in this context). However, often the interesting parts of the survey are the unscripted, “tell us what you really think” comments.

Certainly this has been true in my own experience. I once worked on a survey deployed to teachers in Laos regarding resources for schools and teachers. All our quantitative information came back and was analysed, but one comment (translated for me into English by a brilliant colleague) stood out. It read something to the effect of “this is very nice, but the hole in the floor of the second story is my biggest concern as a teacher”. It’s not something that would ever have been included outright in the survey, but a simple sentence told us a lot about the resources this school had access to.

Careful attention to detailed comments in small surveys is possible. But if you have thousands upon thousands of responses, this is far more difficult. Enter natural language processing.

There are a number of tools which can be useful in this context. This is a short overview of some that I think are particularly useful.

  • Word Clouds. These are easy to prepare and very simple, but can be a powerful way to communicate information. Like all data visualisation, there are the good and the bad. This is an example of a very simple word cloud, while this post by Fells Stats illustrates some more sophisticated methods of using the tool.

One possibility for extending the simple “bag of words” concept is to divide your sample by groups and compare clouds. Or create your own specific dictionary of words and concepts you’re interested in and only cloud those.

Remember that stemming the corpus is critical. For example, “work”, “worked”, “working”, “works” all belong to the same stem. They should be treated as one or else they are likely to swamp other themes if they are particularly common.

Note that no word cloud should be constructed without removing “stop words” like the, and, a, I etc. Dictionaries vary – they can (and should) be tailored to the problem at hand.

  • Network Analysis. If you have a series of topics you want to visualise relationships for, you could try a network-type analysis similar to this. The concept may be particularly useful if you manually decide topics of interest and then examine relationships between them. In this case, the outcome is very much user-dependent/chosen, but may be useful as a visualisation.
  • Word Frequencies. Alone, simple tables of word frequencies are not always particularly useful. In a corpus of documents pertaining to education, the fact that “learning” is a common term isn’t of particular note. However, how do these frequencies change by group? Do teachers speak more about “reading” than principals? Do people in one geographical area or salary bracket have a particular set of high frequency words compared to another? This is a basic exercise in feature/variable engineering. In this case, the usual data analysis tool kit applies (see here, here and here). Remember you don’t need to stop at high frequency words: what about high frequency phrases?
  • TF-IDF (term frequency-inverse document frequency) matrix. This may provide useful information and is a basis of many more complex analyses. The TF-IDF downweights terms appearing in all documents/comments (“the”, “i”, “and” etc.) while upweighting rare words that may be of interest. See here for an introduction, and the sketch after this list for a simple implementation.
  • Are the comments clustered across some lower dimensional space? The k-means algorithm may provide some data-driven guidance there. This would be an example of “unsupervised machine learning” vis-à-vis “this is an algorithm everyone has been using for 25 years but we need to call it something cool”. This may not generate anything obvious at first – but who is in those clusters and why are they there?
  • Sentiment analysis will be useful, possibly both applied to the entire corpus and to subsets. For example, among those who discussed “work life balance” (and derivative terms) is the sentiment positive or negative? Is this consistent across all work/salary brackets? Are truck drivers more upbeat than bus drivers? Again, basic feature/variable engineering applies here. If you’re interested in this area, you could do a lot worse than learning from Julia Silge who writes interesting and informative tutorials in R on the subject.
  • Latent Dirichlet Allocation (LDA) and more complex topic analyses. Finally, latent Dirichlet allocation or other more complex topic analyses may be able to generate topics directly from the corpus. I think this would take a great deal of time for a new user and may have limited outcomes, particularly if an early analysis suggests you already have a clear idea of which topics are worth investigating. It is however particularly useful when dealing with enormous corpora. This is a really basic rundown of the concept. This is a little more complex, but has useful information.
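
To make a few of these concrete, here’s a minimal sketch using the tidytext approach (the survey comments and roles are made up; in practice you’d load your own data):

library(dplyr)
library(tidytext)
library(SnowballC)

# Hypothetical survey data: one free-text comment per respondent, plus their role
comments <- data.frame(
  role = c("teacher", "teacher", "principal"),
  text = c("The reading resources are working well",
           "We need more reading books and repairs to the floor",
           "Budget reporting takes too much of my working week"),
  stringsAsFactors = FALSE
)

word_counts <- comments %>%
  unnest_tokens(word, text) %>%           # one row per word
  anti_join(stop_words, by = "word") %>%  # drop stop words ("the", "and", ...)
  mutate(stem = wordStem(word)) %>%       # stem: "working"/"works" become "work"
  count(role, stem, sort = TRUE)          # word frequencies by group

# TF-IDF: upweight stems that distinguish one group's comments from the others
word_counts %>% bind_tf_idf(stem, role, n)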

So that’s a brief rundown of some basic techniques you could try: there are plenty more out there – this is just the start. Enjoy!

Tutorials and Guides: A curated list

This post is a curated list of my favourite tutorials and guides because “that one where Hadley Wickham was talking about cupcakes” isn’t the most effective search term. You can find my list of cheat sheets here. There are a lot of great resources on data science (I’ve included my top picks), so I don’t intend to reinvent the wheel here. This is just a list of my favourites all laid out so I can find them again or point other people in their direction when it comes up in conversation. I’ve also added a number of the “how to” type posts I’ve written on this blog as I often answer an enquiry in that format.

Data Science

Tutorials and videos: General

Puppets teach data science too

  • Render a 3D object in R. I have no idea where I would ever use this information in my practice, but it’s presented BY A PUPPET. Great fun.
  • DIY your data science. Another offering from the puppet circle on the data science venn diagram.



Work Flow

  • Guide to modern statistical workflow. Really great organisation of background material.
  • Tidy data, tidy models. Honestly, if there’s one thing on this list I wish had been around 10 years ago, this is it. The amount of time and accuracy to be saved using this method is phenomenal.
  • Extracting data from the web. You found the data, now what to do? Look here.

Linear Algebra



Machine learning

Data visualisation

Natural Language Processing

I’ll continue to update this list as I find things I think are useful or interesting.

Edit: actually, “that one where Hadley Wickham was talking about cupcakes” is surprisingly accurate as a search term.


Data Analysis: Enough with the Questions Already

We’ve talked a lot about data analysis lately. First we asked questions. Then we asked more. Hopefully when you’re doing your own analyses you have your own questions to ask. But sooner or later, you need to stop asking questions and start answering them.

Ideally, you’d really like to write something that doesn’t leave the reader with a keyboard imprint across their forehead due to analysis-induced narcolepsy. That’s not always easy, but here are some thoughts.

Know your story.

Writing up data analysis shouldn’t be about listing means, standard deviations and some dodgy histograms. Yes, sometimes you need that stuff – but mostly what you need is a compelling narrative. What is the data saying to support your claims?

It doesn’t all need to be there. 

You worked out that tricky bit of code and did that really awesome piece of analysis that led you to ask questions and… sorry, no one cares. If it’s not a direct part of your story, it probably needs to be consigned to telling your nerd friends on twitter – at least they’ll understand what you’re talking about. But keep it out of the write up!

How is it relevant?

Data analysis is rarely the end in and of itself. How does your analysis support the rest of your project? Does it offer insight for modelling or forecasting? Does it offer insight for decision making? Make sure your reader knows why it’s worth reading.

Do you have an internal structure?

Data analysis is about translating complex numerical information into text. A clear and concise structure for your analysis makes life much easier for the reader.

If you’re staring at the keyboard wondering if checking every social media account you ever had since high school is a valid procrastination option: try starting with “three important things”. Then maybe add three more. Now you have a few things to say and can build from there.

Who are you writing for?

Academia, business, government, your culture, someone else’s, fellow geeks, students… all of these have different expectations around communication. All of them are interested in different things. Try not to have a single approach for communicating analysis to different groups. Remember what’s important to you may not be important to your reader.

Those are just a few tips for writing up your analyses. As we’ve said before: it’s not a one-size-fits-all approach. But hopefully you won’t feel compelled to give a list of means, a correlation matrix and four dodgy histograms that fit in the space of a credit card. We can do better than that!