My favourite (data science) universe

I’ve started mapping out all the links, tutorials and packages I’ve been stashing away. Some are for personal use, some are just interesting and some I use for teaching other people.

Eventually, lists got to be a little clunky, so I started mapping out my favourite parts of the data science universe. It’s still a work in progress. There’s still plenty that isn’t filled in: here be dragons. You can see the full sized version, zoom in and out using the link below.


Going indy: define success

Deciding to go out on your own as an independent consultant can be both freeing and utterly terrifying.

Gone is the external validation from your workplace peers and supervisors (who’s going to tell you you’re a good data scientist now?).

Gone is the 9-5.

Gone are uncomfortable shoes – live the dream!

Working out how you’ll become independent is hard when you haven’t done it before. This is the first in a series of posts. Some of them are going to tell you how I did it, others how I should have done it. I’ll cover the basics of financials, defining your value proposition, networking and whatever else I can think of that might be useful.

But if I had my time over again, there’s one thing I’d do before any of that.

The first thing I think you should do if you’re going to go independent is define success. What does success look like to you?

If you don’t know, you’ll spend a lot of time floundering – maybe doing things that don’t make you happy. Maybe looking for that external validation that isn’t coming. Maybe comparing yourself to companies and people in ways that aren’t sensible. Maybe I did a lot of that in my time and hopefully you don’t have to!

So what does success look like for you?

Success might be financial

Financial success is generally regarded as pretty amazing, but what does it mean? For everybody who goes independent as a consultant it’s going to be different.

Some people might see financial success as being bought out by a large consulting masthead in five years’ time for a seven-figure sum.

For others, financial success looks like the ability to pay your mortgage, your bills and have enough left over for the fun stuff – with the added bonus of having time to do the fun stuff.

Neither is bad or wrong: the right part of the spectrum is the one that’s working for you.

Success might be personal

Personal success has a few dimensions – the respect of your peers, the respect of clients, building something worthwhile. For some people, the freedom to take on pro bono work, mentor others and contribute to great causes is a big drive to go independent. For others, the time to spend with your kids is a measure of success.

Don’t let financial measures of success be your only indicator: the whole point of being independent is the freedom to define yourself and your business.

Success might be intellectual

One important bonus of being independent is the ability to choose work you’d like to take on. To be frank, it can take a while to get to the point where choices are out there. But once they are, intellectual freedom to pursue projects that are interesting and push your boundaries can be another measure of success.

Success might just be space

I’m going to lay this one out there: data scientists are a pretty distinct bunch of people and for many that comes with a lot of pressure. Success for you might be the space just to be yourself without having to spend extortionate amounts of energy keeping up with the requirements of a corporate or academic world.

That doesn’t mean that at some point in the future you can’t take on the corporate or academic ladder: but the space just to be yourself with minimal overhead is a kind of success worth talking about. Lots of us need it for at least a while in our lives and independent consulting can be one road to that kind of freedom.

Where to from here?

Once you’ve decided what big picture success looks like for your new consultancy, it’s time to get specific. In the next few weeks, I’ll talk about defining a data science offering and value proposition, defining and dealing with competition, what you need to get started, the financials and whatever else I can think of.

Got something you’d like me to cover or a question? Comments aren’t open because I have all the generic viagra and cialis I could ever need. But feel free to use the contact page or drop me a line on Twitter here!

Writing business reports

Congratulations, you’re a data scientist! You’ve got the tech skills, the maths skills, the stats skills… and now you need to write a business report (or several) and nobody ever took five minutes to show you the basics.

That’s what this post is about: the basics of writing for business.

Business reporting is hugely varied – the most important thing is to consider who your audience is and write for them. This is never a one-size-fits-all situation and the minute you try to make it one, you will end up with a low-quality product.

Here are some thoughts to help you on your way.


Business reports can be a lot more free-flowing and variable than you might think. The term ‘report’ encompasses a huge range of documents. Here I’ll talk about a formal, structured document – but remember this isn’t always the best approach. Tailor the document to the stakeholder, always!

There are a number of elements to the formal business report and I’ll pick out a few I think need better understanding.

Executive summary

Executive summaries come first after title pages, but before your table of contents and introduction. This is the most important part of the report – the tl;dr (tl;dr: too long, didn’t read).

Most reports suffer from tl;dr. The executive summary is your chance to make an impact in spite of that. It’s not the same as an introduction – it’s almost a mini report.

The executive summary should include what you did, how you did it, what you found out doing so and if there are any limitations or caveats on the doing.

Sometimes an executive summary can feel like a bunch of lists: things you did, things you thought, things you investigated, things you found out. If you need to list off a bunch of things – use bullet points to make that list easier on your reader.

Infographics are also a really useful way to convey information to an audience that doesn’t have the time or capacity to read the details. The advantage of this is they get the surface-level overview – and know where to come back for detail when/if they need it. (For more info on infographics see here.) Infographics are also really shareable which means your work may get wider currency than it would otherwise.


The introduction looks the same as an executive summary on the surface, and many people treat the two as almost interchangeable: they’re not!

The introduction provides the baseline for the report’s readers. It should include some context (why are we here?) and in some circumstances a basic run down of results and methods is also relevant (what did we find out from being here and how did we find it out?). However, in shorter business reports, these may be best left to their own sections. It’s always a matter of judgement.

It should also include a road map for the report’s readers, so they know where the critical information lies. Something like: “Section 1 discusses the context of internet access in rural NSW while Section 2 outlines the methods used in this study.”

The introduction comes after the table of contents and the executive summary.


The body is the main action of your report. Its structure depends on what you’re doing and who you’re doing it for. You may need a methods section – or it may be best relegated to an appendix if you have a non-technical audience. You may need a section on context – or the audience may understand the context already. It all depends.

The trick to structuring your report is to think about what it is your audience needs to take away from reading it: what is the purpose here? What was the purpose of the work in the first place? Start there.

Structure is essentially a drafting problem – see some tips below for help on that.


The conclusion is a poor cousin of the executive summary – it can usually be reasonably brief compared to the rest of the report. It should outline what has been done and its outcome. This is not the section to introduce anything new!

Including analyses

We’re data scientists and analysts: of course there’s going to be analyses. How do you decide which ones though?

Be aware of the temptation to ‘throw everything in there’ to show everyone what a wonderfully thorough job you did. Does it add to the purpose and structure you defined above? If not, it stays out.

Brag to your Twitter pals about the fantastic model you built, though: at least you know they’ll care.

The finer points of tables and figures

Most of our analyses end up in reports as tables and figures. There are a couple of really simple points that are widely unknown about using them. But like many simple things, they’re only simple if somebody has taken the time to tell you about them.

The following applies to a formal report; other formats may have different restrictions. Your company or organisation may have its own style guide and this should always take precedence over the following:

  • Tables and figures should be numbered and labelled. When referring to them specifically, they are proper nouns and should be capitalised, e.g. Table 1 shows some uninteresting statistics, while Figure 427 shows another bar chart. Use the caption option where you can for your chosen tool.
  • Ideally tables and figures should have their own lists at the beginning of the report, like a table of contents. These are generally easy to insert and generate automatically.
  • In a table, column and row headings are often treated as proper nouns and capitalised – consider the overall style of your document.

Colour is a whole series of posts in itself, but I use the following as rules of thumb:

  • If the company or organisation has a colour palette, use that for preference.
  • Be aware of colour vision deficiencies among your readers. Here are some thoughts on that.
  • Less is usually more when it comes to colour in charts and definitely when it comes to colour in tables.


Writing well is a very difficult skill to master. I should know: I was so bad at it when I started grad school that I bought my Ph.D. supervisor a box of red pens for Christmas one year because he had used so much red ink on my work!

Luckily for me he has a great sense of humour. Even luckier for me, he spent a lot of time helping me develop this skill that has stood me in good stead my whole career.

You won’t be a great technical writer overnight – and that’s OK. But when you’re just starting out or are unsure, here’s what helped me get better.

  • Whenever I’m trying to draft something difficult, I start out with bullet points. It helps me structure an argument without getting lost in the writing part.
  • I started out writing in present tense wherever possible. I used to mix my tenses a lot and it made it hard to understand my work. Where it was sensible/possible, sticking to present tense gave me one less thing to manage. You could do the same for past tense if that works better for your context.
  • Sentences should be reasonably short. I had so many ideas when I was starting out. They’d expand into subclause on subclause then another one… and before you knew it I had sentences 10 lines long. To this day, an essential part of my writing pattern is going back through my writing and shortening as many sentences as I can.
  • Paragraphs should also be reasonably short in general. Got a paragraph that goes for half a page? Time to break it up!
  • A first draft is your best aid to writing well. Let me also be clear: a first draft is also generally a steaming pile of… let’s call it ‘junk’. It’s a good way of helping your brain organise its ideas. If you’re stuck, write a terrible first draft and bin it. The next will be better!
  • After you’ve spent a lot of time on a document, you need to put it down in order to see the flaws. Take breaks, go work on something else and then come back to it. You’ll be mildly horrified at everything you find that’s bad. But nobody else saw those errors before you did – great!
  • Another technique to find errors or bad writing is to read the draft backwards: sentence by sentence. This is time consuming, but it helps me find the details that need fixing.

What to write in?

When writing technical material Rmarkdown rocks my world. I export to Word when working collaboratively with non-technical people and then build back into the Rmarkdown document. It’s not ideal, but it works.

There’s a whole bunch of reasons why we write – so choose the tool that’s going to work best for you, your team and go for it.

References and Citations

Definitely worth the time – both for your own knowledge and reference but also to build on the work of those you’re using. Remember to cite your packages too. I haven’t always done this in the past! But it’s important to acknowledge the open source contributors whose work we are using – it’s now part of my routine.
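If you’ve never cited a package before, base R will hand you the citation. Here’s a minimal sketch: it uses the `stats` package only because it ships with every R installation, so swap in whichever packages your analysis actually loads.

```r
# Citation for R itself
r_cite <- citation()

# Citation for one package. "stats" is a stand-in here because it is
# always installed; replace it with the packages you really used.
pkg_cite <- citation("stats")
print(pkg_cite)

# BibTeX version, ready to paste into a .bib file or reference manager
bib <- toBibtex(r_cite)
print(bib)

# Worth recording the version you used alongside the citation
pkg_version <- as.character(packageVersion("stats"))
print(pkg_version)
```

Running this once at the end of a project and pasting the output into your reference list is usually enough.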

And we’re done

That’s a brief overview of some of the things that have helped me with writing for business. Good luck!

Mapping analytics objects

A lot of incredibly important work has been done around data science workflows, most notably by Jenny Bryan. If you’re new to thinking about workflows, start with the incredible STAT545 resources and Happy Git and GitHub for the useR. Jenny’s work got me thinking about my broader workflow.

As a consultant, I work with a ream of developing documents, datasets, requests, outputs and analyses. A collection of analytical ephemera I refer to as analytics objects. When looking at a big project, I’ve found it helpful to start mapping out how these objects interact, where they come from and how they work together.

Here’s a general concept map: individual projects vary a lot. But it’s a starting point.

A concept map with analytics objects.

Client request objects

My workflow tends to start with client requests and communications – everything from the initial “are you available, we have an idea” email to briefings, notes I’ve taken during meetings, documents I’ve been given.

At the start of the project this can be a lot of documents and it’s not always easy to know where they should sit or how they should be managed.

A sensible solution tends to develop over time, but this is a stage where it’s easy to lose or forget about certain important things if it all stays in your inbox. One thing I often do at the start of a project is a basic document curation in a simple Excel sheet so I know what I’ve got, where it came from and what’s in it.

I don’t usually bother curating every email or set of meeting notes, but anything that looks like it may be important or could be forgotten about goes in the list.

a picture of a spreadsheet
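The same kind of register can live in a plain CSV file if you’d rather keep it next to your code. This is only a sketch: the file names, dates and column names below are hypothetical placeholders, not a record of a real project.

```r
# A hypothetical document register: what I've got, where it came from
# and what's in it. Every value here is made up for illustration.
doc_register <- data.frame(
  document = c("initial_email.msg", "project_brief.docx", "kickoff_notes.md"),
  received = c("2018-03-01", "2018-03-05", "2018-03-08"),
  source   = c("client contact", "client project lead", "my own notes"),
  contents = c(
    "first request and rough timeline",
    "scope, deliverables, data sources",
    "decisions made in the kickoff meeting"
  ),
  stringsAsFactors = FALSE
)

# Save it somewhere sensible; tempdir() is used so the sketch runs anywhere
register_path <- file.path(tempdir(), "document_register.csv")
write.csv(doc_register, register_path, row.names = FALSE)

# Read it back to confirm nothing was lost on the way
check <- read.csv(register_path, stringsAsFactors = FALSE)
print(check)
```

A CSV opens in Excel for anyone who prefers that, and it can sit under version control with everything else.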

Data objects

The next thing that happens is people give me data, I go and find data or some third party sends data my way.

There’s a lot of data flying about – sometimes it’s different versions of the same thing. Sometimes it’s supposed to be the same thing and it’s not.

It often comes attached with metadata (what’s in it, where did it come from, who collected it, why) and documents that support that (survey instruments, sampling instructions etc.).

If I could go back and tell my early-career self one thing it would be this: every time someone gives you data, don’t rely on their documentation – make your own.

It may be short, it may be brief, it may simply contain references to someone else’s documentation. But take the time to go through it and make sure you know what you have and what you don’t.

For a more detailed discussion of how I handle this in a low-tech environment/team, see here. Version control systems and Rmarkdown are my strong preference these days – if you’re working with a team that has the capacity to manage these things. Rmarkdown for building data dictionaries, metadata collections and other provenance information is brilliant. But even if you’re not and need to rely on Excel files for notes, don’t skip this step.
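A first pass at your own documentation can be as simple as a table with one row per variable. Here’s a rough sketch of that idea: `make_data_dictionary()` is a hypothetical helper written for this example (not a function from any package), and the built-in `airquality` data stands in for whatever a client has sent you.

```r
# Hypothetical helper: one row per column, with the basics you'd want
# to know before trusting a dataset.
make_data_dictionary <- function(df) {
  data.frame(
    variable  = names(df),
    class     = vapply(df, function(x) class(x)[1], character(1)),
    n_missing = vapply(df, function(x) sum(is.na(x)), integer(1)),
    n_unique  = vapply(df, function(x) length(unique(x)), integer(1)),
    # First non-missing value, as text, just to show what the data looks like
    example   = vapply(df, function(x) as.character(x[!is.na(x)][1]), character(1)),
    row.names = NULL,
    stringsAsFactors = FALSE
  )
}

# airquality ships with R and has some missing values to find
dictionary <- make_data_dictionary(airquality)
print(dictionary)
```

The output is a starting point only: the columns worth adding by hand are the ones R can’t work out, such as units, who collected the data and why.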

Next comes the analysis and communications objects which you’re probably familiar with.

Analysis and communications objects

(Warning: shameless R plug here)

The great thing about R is that it maps most of my analysis and communications objects for me. Using an Rproject as the basis for analysis means that the provenance of all transformed data, analyses and visualisations is baked in. Version control with GitHub means I’m not messing around with 17 Excel files all called some variation of final_analysis.xlsx.

Using Rmarkdown and Shiny for as much communication with the client as possible means that I’ve directly linked my reporting, client-bound visualisations and returned data to my analysis objects.

That said, R can’t manage everything (but they’re working on it). Sometimes you need functionality R can’t provide and R can’t tell you where your data came from if you don’t tell it first. R can’t tell you if you’re scoping a project sensibly.

Collaboration around an Rmarkdown document is difficult when most of your clients are not R users at all. One workaround for me has been to:

  • Export the Rmarkdown document as a Word document
  • Have non-technical collaborators make changes and updates via tracked changes
  • Depending on the stage of the project, input all those changes back into R by hand or go forwards with the Word document.

It’s not a perfect system by any means, but it’s the best I’ve got right now. (If you’ve got better I’d love to hear about that.)
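The export step in that workaround can be scripted. The sketch below assumes the rmarkdown package and pandoc are installed, and it skips the render if they aren’t; the file name and the report contents are placeholders made up for the example.

```r
# Write a tiny placeholder Rmarkdown file so the sketch is self-contained
rmd_path <- file.path(tempdir(), "client_report.Rmd")
writeLines(c(
  "---",
  "title: \"Client report\"",
  "output: word_document",
  "---",
  "",
  "The cars data has `r nrow(cars)` rows."
), rmd_path)

# Render to Word only if the tools are available on this machine
if (requireNamespace("rmarkdown", quietly = TRUE) &&
    rmarkdown::pandoc_available()) {
  docx_path <- rmarkdown::render(
    rmd_path,
    output_format = "word_document",
    quiet = TRUE
  )
} else {
  docx_path <- NA_character_
  message("rmarkdown or pandoc not found: skipping the Word export")
}

print(docx_path)
```

The tracked changes still have to come back into the Rmarkdown file by hand, so this only automates the first bullet.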

Objects inform other objects

In a continuing environment, your communications objects inform the client’s and so on. Not all of these are used at any given time, but sometimes as they get updated or if projects are long term, important things get lost or confused. Thinking about how all these objects work together helped my workflow tremendously.

The lightbulb moment for me was that I started thinking about all my analytics objects as strategically as Jenny Bryan proposes we think about our statistics workflow. When I do that, the project is better organised and managed from the start.

Object not found: R

An infographic with some tips for managing the 'object not found' error in R.


Full text for those using screen readers:

R Error Frustration?

Object not found.

This means R couldn’t find something it went looking for – a function or a variable/data frame usually.

Have you tried?

  • Spelling errors. Some are obvious, some less so in a block of code e.g. lamdba for lambda. Tip: mark each place in your code block where the ‘unfound object’ is and then use “find” in the editor to make sure you’ve caught them all.
  • Where is your object defined? In which environment? Tip: draw a diagram that explains the relationships between your functions and then step through it line by line.
  • Is the object where R thinks it should be? Where did you tell R it was – a search path, a data frame or somewhere else? Can you physically check if the object is in that space?
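Here’s a small sketch of those three checks in code, using the lamdba/lambda typo from above. The object names are invented for the example.

```r
lambda <- 0.5

# 1. Spelling: the typo triggers the error. tryCatch() captures the
#    message instead of stopping the script.
msg <- tryCatch(
  lamdba * 2,
  error = function(e) conditionMessage(e)
)
print(msg)

# exists() answers "can R see an object with this name from here?"
print(exists("lambda"))
print(exists("lamdba"))

# 2. Environment: an object created inside a function only lives there
check_inside <- function() {
  local_value <- 42
  exists("local_value")
}
inside <- check_inside()
print(inside)
print(exists("local_value"))

# 3. Location: is the column really in the data frame you pointed R at?
print("mpg" %in% names(mtcars))
print("mgp" %in% names(mtcars))

# The search path lists the places R looks, in order
print(search())
```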

More visuals in rstats, please

Anyone who’s been reading along for a while will realise by now I’m an infographic maven (here, here, here and here to start with). If a post has an infographic attached to it, chances are the infographic was designed long before the post was. A few people have asked about them lately, so here’s my quick rundown.

Why the infographics?

Visuals cater to a very large proportion of people for whom large amounts of text are not ideal for consuming technical information – and by ‘large amounts’ I mean ‘more than a few paragraphs’.

This kind of content is also easier for users who are speaking English as a second language to access – the text is broken down into pieces, the visuals offer further information and there’s just less of it to have to parse. (Although I haven’t seen this done, I anticipate it’s also easier to translate an infographic of some kind than, say, a full-length vignette.)

While visuals are a hugely successful medium for this kind of technical content, that doesn’t mean we should toss out the vignettes and blog posts or that we should stop using them to convey information: this is very demonstrably a terrible idea! Vignettes and blog posts provide a vital understanding for detailed, difficult concepts. The more we have, the better.

But ideally, we’d also pay attention to providing visual information as well.

How to infographic?

For someone whose main role in life is as a developer or a data scientist, the prospect of “and now there’s one more thing you HAVE to do” is really not very helpful. Not every package or concept needs an infographic by any means. However, if you’re someone who’d like to communicate more with a wider audience, then maybe visuals of some kind are worth a shot.

That said, I have zero design skills. ZERO. My idea of an understandable colour wheel is gentle shades of monochrome. There are tools that allow you to build useful infographics without serious design skills.

My favourite is Canva, which I’ve been using for years. It has both paid and free versions. I used the free version for years quite happily, but recently upgraded to the premium content. Some content then has additional fees on top of that – but I avoid it quite easily. If you see an infographic from me, it was probably built in Canva. The platform goes beyond infographics. For example, the useR!2018 sponsorship prospectus was built in Canva (please interpret this as a plea for sponsorship and go take a look).

I’ve also used simple Excel or Powerpoint drawings + the magic of the screenshot. It’s hokey and doesn’t look that great, but if it’s getting the point across then I just roll with it. This is my alternative for flow charts, which Canva is not good at in my opinion. If anybody has a better idea, I’d love to hear it.

This blogpost has a number of other tools that I’m in the process of checking out.



tl;dr: visuals are good ways to teach a wider cross section of people things. You don’t need to have good design skills to try them out.



Post script: In fact, the post accompanying an infographic on this blog is usually just a slightly more detailed rehash of the infographic. Why? Because the post is acting as an accessibility device for the infographic – the post wasn’t the point at all. An infographic can be just a whole bunch of nothing for a non-sighted reader and alt-text only gets you so far. So the post repeats the information in a format and style that is compatible with a screen reader. (I use the alt text to tell the reader this is where the information is.)

Decoding error messages in R

Decoding error messages in R can be difficult for newcomers, which is why I’m working on helpPlease. In the meantime, it’s important to be able to understand R errors and warnings in more detail than simply ‘R says no’. So here’s a quick rundown:

Errors in R an infographic

R gives both errors and warnings

An error is “R says no”. It’s R’s way of telling you why the chunk of code is not possible to execute.

Warnings mean “R says OK sure but maybe you won’t like what you’re going to get”. It’s R’s way of telling you the code is behaving in a different way than you might reasonably expect.
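To see the difference in practice, here’s a minimal sketch. `classify()` is a hypothetical helper written for this post, not part of R; it just reports whether a piece of code ran cleanly, raised a warning or raised an error.

```r
# Hypothetical helper: run an expression and report what R said about it.
# Note that catching a warning with tryCatch() abandons the expression,
# so this is for labelling the outcome, not for getting the result.
classify <- function(expr) {
  tryCatch(
    {
      force(expr)
      "ran cleanly"
    },
    warning = function(w) "warning",
    error   = function(e) "error"
  )
}

print(classify(sqrt(4)))              # nothing to complain about
print(classify(as.numeric("three")))  # R says OK, but you get NA
print(classify(log("ten")))           # R says no

# With a warning you still get a value back...
value <- suppressWarnings(as.numeric("three"))
print(value)

# ...with an error you get nothing unless you handle it yourself
fallback <- tryCatch(log("ten"), error = function(e) NA_real_)
print(fallback)
```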

Decoding an error message

The error message typically comes in three parts. Here’s a common example from my code: I’ve tried to access a part of an array that doesn’t exist – my array has a column dimension of 5, so when R goes looking for the 100th column it’s understandably confused and just gives up.

R error message

There are three main parts to this message:

  1. The declaration that it is an Error
  2. The location of the error – it’s in the line of my code fit[5,100,]
  3. The problem this mistake in my code caused: the subscript is out of bounds, i.e. I asked R to go and retrieve a part of this array that did not exist.
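You can pull those three parts out of an error yourself. The sketch below rebuilds the situation with a stand-in array: the real `fit` isn’t shown in this post, so the dimensions other than the 5 columns are assumptions.

```r
# Stand-in for the array in the example: 10 rows, 5 columns, 3 slices.
# Only the column dimension of 5 comes from the post; the rest is assumed.
fit <- array(0, dim = c(10, 5, 3))

# Capture the error object rather than letting it stop the script
err <- tryCatch(fit[5, 100, ], error = function(e) e)

# 1. The declaration: is this an error?
print(inherits(err, "error"))

# 2. The location: the piece of code R was running when it gave up
print(deparse(conditionCall(err)))

# 3. The problem: what went wrong
print(conditionMessage(err))
```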

Decoding a warning message

Warning messages can be very variable in format, but there are often common elements. Here’s a common one that ggplot gives me:

ggplot2 warning message

Here I’ve asked ggplot2 to put a line chart together for me, but some of my data frame is missing. ggplot2 can still put the chart together, but it’s letting me know I have missing values.

While warning messages can be very variable, there are some common elements that turn up fairly regularly:

  1. The declaration of a warning
  2. The behaviour being warned about
  3. The piece of code that caused the warning
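Warnings can be picked apart in the same way. This sketch uses a base R warning from `log(-1)` rather than the ggplot2 one above, purely so it runs without any extra packages; `withCallingHandlers()` lets the code finish and hand back its result while we keep a copy of the warning.

```r
caught <- list()

result <- withCallingHandlers(
  log(-1),
  warning = function(w) {
    # Keep the warning object, then tell R we've dealt with it
    caught[[length(caught) + 1]] <<- w
    invokeRestart("muffleWarning")
  }
)

# The code still ran and returned something (not a very useful something)
print(result)

# 1. The declaration: it is a warning
print(inherits(caught[[1]], "warning"))

# 2. The behaviour being warned about
print(conditionMessage(caught[[1]]))

# 3. The piece of code that caused it
print(deparse(conditionCall(caught[[1]])))
```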

Now that you know what warnings and errors are and what’s in them: how do you find out what they mean?

Where can you find help?

There’s lots of information out there to help you decode your warning and error messages. Here are some that I use all the time:

  • Typing ? or ?? and the name of the function that’s going wrong in the console will give you help within R itself
  • Googling the error message, warning or package is often very useful
  • Stack Overflow or the RStudio community forums can be searched for other people’s (solved!) problems
  • The vignettes and examples for the package you’re using are a wealth of information
  • Blog posts that use the package or function you are using can be a very good step-by-step guide to preparing your data for the tool you’re trying to use
  • Building a reprex (a reproducible example) is a good way of getting ready to ask a question on Stack Overflow or the R community forums.
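A few of these can be sketched in code. The help calls below only run in an interactive session, because they open help pages. The second half is one way to prepare the data for a reproducible example using base R alone; the reprex package offers `reprex::reprex()` for the full job, but it isn’t assumed to be installed here.

```r
# Help from inside R (these open help pages, so only run interactively)
if (interactive()) {
  help("mean")                   # same as typing ?mean
  help.search("linear models")   # same as typing ??"linear models"
}

# For a reproducible example, share a small slice of data as code that
# anyone can paste into their console. mtcars is a stand-in for your data.
problem_data <- head(mtcars[, c("mpg", "cyl")], 3)
data_as_code <- paste(deparse(problem_data), collapse = "\n")
cat("problem_data <-", data_as_code, "\n")

# Check that the pasted code really rebuilds the same data
rebuilt <- eval(parse(text = data_as_code))
print(all.equal(rebuilt, problem_data))

# Include your R version and loaded packages with the question
print(sessionInfo())
```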

Good luck! And in the meantime, if you should come across an R message that could use explaining in plain text I’d really love to hear from you (especially if you’re new!).


excelTransition is designed as a series of ‘training wheels’ functions which allow you to create some outputs similar to those you’d already have created in Excel with a minimum amount of coding and time.

It’s a package designed for you to use and abandon quickly. One of the most costly things about learning a new tool is the time you spend learning to do simple things you can already do in your current tool. excelTransition will help you produce some (very) basic analyses in minimum time, leaving you more time to work on learning R.

It’s ideal for someone at the very beginning of their learning about programming. If you’re an experienced programmer, you may not need these ‘training wheels’ at all.

The package is currently under development and you can view it on GitHub.

Help Please – a package for new R users

Starting out in R can be quite overwhelming. There are lots of resources and people around who want to help, but navigating those can be hard and some of the R error messages and explainers within R itself can seem like something of a foreign language.

helpPlease is there to bridge the gap. The gap closes by itself over time, but let’s build a bridge and make life easier.

The package is at the proof-of-concept stage and you can see it on GitHub. If you have an error message that could benefit from being explained in plain language, a term used in R that could use a plain language explanation, encouragement for new users or a troubleshooting tip, we’d love to hear from you.