More visuals in rstats, please

Anyone who’s been reading along for a while will realise by now that I’m an infographic maven (here, here, here and here to start with). If a post has an infographic attached to it, chances are the infographic was designed long before the post was. A few people have asked about them lately, so here’s my quick rundown.

Why the infographics?

Visuals cater to a very large proportion of people for whom large amounts of text are not ideal for consuming technical information – and by ‘large amounts’ I mean ‘more than a few paragraphs’.

This kind of content is also easier for users who are speaking English as a second language to access – the text is broken down into pieces, the visuals offer further information and there’s just less of it to have to parse. (Although I haven’t seen this done, I anticipate it’s also easier to translate an infographic of some kind than, say, a full-length vignette.)

While visuals are a hugely successful medium for this kind of technical content, that doesn’t mean we should toss out the vignettes and blog posts or that we should stop using them to convey information: this is very demonstrably a terrible idea! Vignettes and blog posts provide a vital understanding for detailed, difficult concepts. The more we have, the better.

But ideally, we’d pay attention to providing visual information as well.

How to infographic?

For someone whose main role in life is as a developer or a data scientist, the prospect of “and now there’s one more thing you HAVE to do” is really not very helpful. Not every package or concept needs an infographic by any means. However, if you’re someone who’d like to communicate more with a wider audience, then maybe visuals of some kind are worth a shot.

That said, I have zero design skills. ZERO. My idea of an understandable colour wheel is gentle shades of monochrome. Fortunately, there are tools that allow you to build useful infographics without serious design skills.

My favourite is Canva, which I’ve been using for years. It has both paid and free versions. I used the free version for years quite happily, but recently upgraded to the premium content. Some content then has additional fees on top of that – but I avoid it quite easily. If you see an infographic from me, it was probably built in Canva. The platform goes beyond infographics. For example, the useR!2018 sponsorship prospectus was built in Canva (please interpret this as a plea for sponsorship and go take a look).

I’ve also used simple Excel or Powerpoint drawings + the magic of the screenshot. It’s hokey and doesn’t look that great, but if it’s getting the point across then I just roll with it. This is my alternative for flow charts, which Canva is not good at in my opinion. If anybody has a better idea, I’d love to hear it.

This blogpost has a number of other tools that I’m in the process of checking out.

tl;dr: visuals are good ways to teach a wider cross section of people things. You don’t need to have good design skills to try them out.

Post script: In fact, the post accompanying an infographic on this blog is usually just a slightly more detailed rehash of the infographic. Why? Because the post is acting as an accessibility device for the infographic: the post wasn’t the point at all. An infographic can be just a whole bunch of nothing for a non-sighted reader and alt-text only gets you so far. So the post repeats the information in a format and style that is compatible with a screen reader. (I use the alt text to tell the reader this is where the information is.)

On moving the box, not the whiskers

A lot of institutions and data scientists are keen to take the best of the best and make them better. They are working with elites in terms of intellectual ability, the socio-economic lottery that takes intelligence and operationalises it and all the other bits and pieces that go into that.

These institutions and data scientists are working with the whiskers of the population box plot. They’re taking the people on the upper edge of the distribution and they’re keen to push them further out: to achieve more, do more, create more. Bravo! This is important and should continue.

Box plot

However, there’s another group of people that I think need to learn data science skills in general and to learn how to code in particular. Australia has no national workforce plan, but we know and acknowledge that data is at the heart of our economy going forward.

In order to make the most of our future, we need a large number of people in the box to learn the skills that will give them access to a digital, data-driven economy. These people are not elites. They often do not believe they have a strong mathematics skill set and they don’t have PhDs. But we need them.

Data science in general and coding in particular is a useful, important skill set. There will always be space and need for elite data scientists, trained by elite institutions and mentors. But we also need to start thinking about how we’re going to move the box, not just the whiskers.

If you think about it, the productivity gains from moving a small proportion of the box upwards are enormous compared to moving just the whiskers.

R for Excel users

Moving over to R (or any other programming language) from Excel can feel very daunting. One of the big stumbling blocks, in my view, is having a mental understanding of how we store data in structures in R. You can view your data structures in R, but unlike Excel where it’s in front of your face, it’s not always intuitive to the user just starting out.

There’s lots of great information out there on the hows, whys and wherefores. Here’s a basic rundown of some of the common ways we structure our data in R and how they compare to what you’re already familiar with: Excel.

Homogeneous data structures


basic data structures infographic

Homogeneous in this case just means all the ‘bits’ inside these structures need to be of the same type. There are many types of data in R, but the basic ones you need to know when just starting out are:

  • Numbers. These come in two varieties:
    • Doubles – where you are wanting and using decimal points, for example 1.23 and 4.56.
    • Integers – where you don’t, for example 1, 2, 3.
  • Strings. This is basically just text data – made up of characters. For example, dog, cat, bird. R calls this type character.
  • Booleans. These take two forms: TRUE and FALSE. R calls this type logical.
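
If you’d like to see these types for yourself, here’s a minimal sketch using typeof() from base R. The variable names are made up for illustration. One thing that can surprise Excel users: a whole number typed on its own is still stored as a double, and you need the L suffix to ask for an integer.

```r
# Made-up example values, one per basic type
price   <- 1.23     # a number with a decimal point
count   <- 3L       # the L suffix asks R for an integer
whole   <- 3        # no suffix: still stored as a double
animal  <- "dog"    # text data
is_pet  <- TRUE     # a boolean

typeof(price)    # "double"
typeof(count)    # "integer"
typeof(whole)    # "double"
typeof(animal)   # "character"
typeof(is_pet)   # "logical"
```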

Homogeneous data structures are vectors, matrices and arrays. All the contents of these structures have to have the same type. They need to be numbers OR text OR booleans or other types – but no mixing.

Let’s go through them one-by-one:

  • Vectors. You can think of a vector like a column in a spreadsheet – there’s an arbitrary number of slots and data in each one. There’s a catch – the data types all have to be the same: all numbers, all strings, all booleans or other types. Base R has a good selection of options for working with this structure.
  • Matrices. Think of this one as the whole spreadsheet – a series of columns in a two dimensional arrangement. But! This arrangement is homogeneous – all types the same. Base R has you covered here!
  • Arrays. This is the n-dimensional equivalent of the matrix – a bundle of worksheets in the workbook if you will. Again, it’s homogeneous. The abind package is really useful for manipulating arrays. If you’re just starting out, you probably don’t need this yet!
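
The three structures above can be sketched in a few lines of base R. The names and numbers here are invented for illustration. The middle example shows what “no mixing” means in practice: R doesn’t throw an error, it quietly converts everything to a single type.

```r
# A vector: like one spreadsheet column
ages <- c(31, 45, 27)

# Mixing types is not an error: R converts everything to one type (here, text)
mixed <- c(1, "dog", TRUE)
typeof(mixed)    # "character"

# A matrix: like one worksheet, 2 rows by 3 columns, filled column by column
sheet <- matrix(1:6, nrow = 2, ncol = 3)
dim(sheet)       # 2 3

# An array: like a workbook holding two 2 x 3 worksheets
workbook <- array(1:12, dim = c(2, 3, 2))
dim(workbook)    # 2 3 2
```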

The advantage of homogeneous structures is that they can be faster to process – but you have to be using serious amounts of processing power for this to matter a lot. So don’t worry too much about that for now. The disadvantage is that they can be restrictive compared to some other structures we’ll talk about next.

Heterogeneous structures

Basic data structures heterogeneous

Heterogeneous data structures just mean that the content can be of a variety of types. This is a really useful property and makes these structures very powerful. There are two main forms, lists and data frames.

  • Lists. Like a vector, we can think about a list like a column from a spreadsheet. But unlike a vector, the content of the list can be any type.
  • Data frames. A data frame is really a list of columns, each the same length. Generally the content of each column is all the same type (like you’d expect in a spreadsheet) but that’s not necessarily the case. Data frames can have named columns (so can other structures) and you can access data using those names.
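
Here’s a small sketch of both structures in base R, using invented pet data. It shows a list holding three different types side by side, then a data frame whose columns are accessed by name.

```r
# A list can hold a number, a string and a boolean side by side
pet <- list(3, "dog", TRUE)
sapply(pet, typeof)   # "double" "character" "logical"

# A data frame: named columns, each column keeping its own type
pets <- data.frame(
  name   = c("Rex", "Tom", "Tweety"),
  legs   = c(4, 4, 2),
  indoor = c(FALSE, TRUE, TRUE)
)

pets$legs          # access a column by name: 4 4 2
pets[["indoor"]]   # the same idea, using list-style access
is.list(pets)      # TRUE: underneath, a data frame is a list of columns
```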

Data frames can be extended to quite complex structures. Data frames don’t have to be ‘flat’. Because you can make lists of lists, you can have data frames where one or more of the columns have lists in each slot. These are called nested data frames.
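
As a rough illustration of a column with a list in each slot, here’s one way to build one in base R. The owners and pets are made up, and this is only a sketch: there are other ways to create list columns.

```r
# Start with an ordinary, flat data frame
owners <- data.frame(owner = c("Ana", "Ben"))

# Add a column where each slot holds its own vector of pets
owners$pets <- list(c("dog", "cat"), "bird")

owners$pets[[1]]       # "dog" "cat"
length(owners$pets[[2]])   # 1
nrow(owners)           # still 2 rows, one per owner
```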

This and other properties make the data frame extremely powerful for manipulating data. There’s a whole series of operations and functions in R dedicated to manipulating data frames. Matrices and vectors can be converted into data frames: one way is the function as.data.frame(my_matrix).
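
Here’s a short sketch of that conversion. The matrix contents and column names are invented for illustration.

```r
# A small matrix with named columns
my_matrix <- matrix(1:6, nrow = 3,
                    dimnames = list(NULL, c("apples", "pears")))

# Convert it to a data frame
my_df <- as.data.frame(my_matrix)
class(my_df)    # "data.frame"
names(my_df)    # "apples" "pears"

# Once it's a data frame, columns can be used by name
my_df$total <- my_df$apples + my_df$pears
my_df$total     # 5 7 9
```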

The disadvantage of this structure is it can be slower to process – but if you’re at the stage of coding where you’re not sure if this matters to you, it probably doesn’t just now! R is set up to do a bunch of really useful things using data frames. This is the data structure probably most similar to an Excel sheet.

How do you know what structure you’re working with? If you have an object in R and you’re not sure if it’s a matrix, a vector, a list or a data frame, call str(object). It will tell you what you’re working with.
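
A quick sketch of what that looks like, with made-up objects. Alongside str(), base R also has yes/no checks such as is.matrix() and is.data.frame() if you’d rather ask a direct question.

```r
scores <- c(1.5, 2.5)
grid   <- matrix(1:4, nrow = 2)
table1 <- data.frame(id = 1:2, label = c("a", "b"))

# str() prints a compact summary of the structure and its contents
str(scores)
str(grid)
str(table1)

# Direct yes/no checks
is.matrix(grid)         # TRUE
is.data.frame(table1)   # TRUE
is.data.frame(grid)     # FALSE
```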

So that’s a really simple take on some simple data structures in R: quite manageable, because you already understand lots of these concepts from your work in Excel. It’s just a matter of translating them into a different environment.

Acknowledgement: Did you like the whole homogeneous/heterogeneous structure idea? That isn’t my idea – Hadley Wickham in Advanced R talks about it in much more detail.

Things I wish I’d noticed in grad school

Back in the day, I tended to get a little hyper-focussed on things. I’m sure someone, sometime, somewhere pointed this stuff out to me. But at the time it went over my head and I learned these things the hard way. Maybe my list of things I wish I’d noticed helps someone else.

  • Your professional contacts matter and it’s OK to ask for help. You’re not researching in a vacuum: the people around you want to help.
  • You need to look outside your department and university. There’s a bigger, wider world out there and while what’s going on inside your little world seems like it’s important: you need to be aware of what’s outside too.
  • Being methodologically/theoretically robust matters, yes. But learning when to let it go is going to be harder than learning the theory/methodology. No easy answers here, all you can do is make your decision and own it.
  • It doesn’t matter how much you read, you’re not going to be an expert across your whole field. Just be aware of the field and be an expert in what you’re doing right now. That’s OK.
  • Get a life. Really.

Tiny Coders

I’ve mentioned it before, but I run the local code club out here in rural Australia. We are using the Code Club curriculum, designed for kids aged 9-12. Due to our particular circumstances with transport and distance, our code club needs to offer fun and learning for the age range 5-8 as well. Some of our littles are finding the materials too challenging to be fun, so as of this week we are running two streams:

  • The “Senior Dev Team”: in time-honoured managerial tradition, I told them they could be senior devs with a badge, if they helped the littles. That’s right, more responsibility and nothing but a badge to show for it. The senior dev team is going to keep going with the regular code club projects and they are smashing them out. Seriously, all I need to do is get these kids a black t-shirt each and they’re regular programmers already.
  • The “red team”: these are our kids that are struggling with the projects we have been doing and not having fun because of it. We’ll be doing multistage projects with lots of optional end points for kids to stop and go play: these are really young kids sitting down to code after six hours of school, so for some of them 20 minutes is more than enough. For them, it’s enough that they learn that computers and code are fun and interesting. For the older/more capable kids in this group we’ll still be learning about loops and conditional statements and all the good stuff, but our projects will be pared back and more basic so they aren’t overwhelming.

Our first red team project is here: Flying Cat Instructions and on GitHub here.

Of course, none of this would be possible without an amazing team of dedicated parent and teacher volunteers, many of whom had very few computer skills before we started and NO coding skills. They’re as amazing as the kids.

Data Visualisation: Hex Codes, Pantone Colours and Accessibility

One of the things I find hardest about data visualisation is colouring. I’m not a natural artist, much preferring everything in gentle shades of monochrome. Possibly beige. Obviously, for any kind of data visualisation, this is limiting. Quite frankly, this is the kind of comfort zone that needs setting on fire.

I’ve found this site really helpful: it’s a listing of the Pantone colours with both Hex and RGB codes for inserting straight into your visualisations. It’s a really useful correspondence if I’m working with someone (they can give me the Pantone colour numbers of their website or report palette and I just search the page).
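
If you only have one of the two codes, base R can convert between them. Here’s a minimal sketch: the colour is an arbitrary example I picked for illustration, not one taken from the listing.

```r
# An arbitrary example colour as a hex code
brand_hex <- "#0F4C81"

# Hex to RGB: col2rgb() returns a matrix with red, green and blue rows
brand_rgb <- col2rgb(brand_hex)
brand_rgb["red", 1]     # 15
brand_rgb["green", 1]   # 76
brand_rgb["blue", 1]    # 129

# RGB back to hex: tell rgb() the values run from 0 to 255
rgb(15, 76, 129, maxColorValue = 255)   # "#0F4C81"
```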

One thing I’ve found, however, is that a surprising (to me) number of people have some kind of colour-based visual impairment. A palette that looks great to me may be largely meaningless to someone I’m working with. I found this out in one of those forehead slapping moments when I couldn’t understand why a team member wasn’t seeing the implications of my charts. That’s because, to him, those charts were worse than useless. They were a complete waste of his time.

Some resources I’ve found helpful in making my visualisations more accessible are the colourblind-friendly palettes discussed here and this discussion on R-Bloggers. The latter made me realise that up until now I’ve been building visualisations that were obscuring vital information for many users.

The things I think are important for building an accessible visualisation are:

  • Yes, compared to more subtle palettes, colour-blind friendly palettes look like particularly lurid unicorn vomit. They don’t have to look bad if you’re careful about combinations, but I’m of the opinion that prioritising accessibility for my users is more important than “pretty”.
  • Redundant encoding (discussed in the R-bloggers link above) is a great way of ensuring users can make out the information you’re trying to get across. To make sure this is apparent in your scale, use a combination of scale_colour_manual() and scale_linetype_manual(). The latter works the same as scale_colour_manual() but is not as well covered in the literature.
  • Consider reducing the information you’re putting into each chart, or using a combination of facets and multiple panels. The less there is to differentiate, the easier it can be on your users. This is a good general point and not limited to those with colourblindness.
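
To make the redundant encoding point concrete, here’s a minimal sketch. It assumes the ggplot2 package is installed, and the sales data is entirely made up. The two hex codes are an orange and a blue from a widely used colourblind-friendly set. Each region gets both its own colour and its own line type, so the chart still reads if the colours can’t be told apart.

```r
library(ggplot2)

# Made-up data: two regions measured over five months
sales <- data.frame(
  month  = rep(1:5, times = 2),
  units  = c(10, 12, 15, 14, 18, 8, 9, 13, 16, 17),
  region = rep(c("North", "South"), each = 5)
)

# Map region to BOTH colour and line type (redundant encoding)
p <- ggplot(sales, aes(x = month, y = units,
                       colour = region, linetype = region)) +
  geom_line() +
  scale_colour_manual(values = c(North = "#E69F00", South = "#0072B2")) +
  scale_linetype_manual(values = c(North = "solid", South = "dashed"))

p
```

Adding facet_wrap(~ region) to p would put each region in its own panel, which is one way of reducing how much each chart has to carry.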

Yes, you can: learn data science

Douglas Adams had it right in Dirk Gently’s Holistic Detective Agency. Discussing the mathematical complexity of the natural world, he writes:

… the mind is capable of understanding these matters in all their complexity and in all their simplicity. A ball flying through the air is responding to the force and direction with which it was thrown, the action of gravity, the friction of the air which it must expend its energy on overcoming, the turbulence of the air around its surface, and the rate and direction of the ball’s spin. And yet, someone who might have difficulty consciously trying to work out what 3 x 4 x 5 comes to would have no trouble in doing differential calculus and a whole host of related calculations so astoundingly fast that they can actually catch a flying ball.

If you can catch a ball, you are performing complex calculus instinctively. All we are doing in formal mathematics and data science is putting symbols and a syntax around the same processes you use to catch that ball.

Maybe you’ve spent a lot of your life believing you “can’t” or are “not good at” mathematics, statistics or whatever bugbear of the computational arts is getting to you. These are concepts we begin to internalise at a very early age and often carry through our lives.

The good news is yes you can. If you can catch that ball (occasionally at least!) then there is a way for you to learn data science and all the things that go with it. It’s just a matter of finding the one that works for you.

Yes you can.

Teaching kids to code

Kids coding is a topical issue, particularly given the future of employment. The jobs our children will be doing are different to the ones our parents did/are doing and to our own. Programming skills are one of the few things that the experts agree are important.

There are lots of great online resources already in place to help children learn the computer skills they will need in the future. You can start early, you can make it fun and it doesn’t have to cost you a fortune.

Let me be clear: this isn’t a parenting blog. I do have kids. I do program. I do have a kid that wants to learn to program (mostly I think because he thinks I’ll give him a free pass on other human-necessary skills such as creativity, interpersonal relationships and trying on sports day).

My personal parenting philosophy (if anyone cares) is that kids learn very well when you give them interesting tools to explore the world with. That might include programming, but for some kids it won’t. That’s OK. It doesn’t mean they’re never going to get a job: it just means they may prefer to climb trees because they’re kids. There’s a lot of learning to be had up a tree.

But part of providing interesting resources with which to explore the world is knowing where to find them. Here’s a run down of some resources broken down by age group. Yes, kids can start as early as preschool!

Preschool Age (4 +)

The best resources for kids this age are fun interactive apps. If it’s not fun, they won’t engage and frankly nobody wants to stand over a small child making them do something when they could be learning autonomously through undirected play. Here are my favourites:

  • Lightbot. This is a fun interactive app available on Android and Apple that teaches kids the basics of programming using icons rather than language-based code. It comes in both junior coding (4-8 years) and programming puzzles (9+) and my kids have had the apps for six months and enjoyed them.
  • Cargo-bot was recommended to me by a fellow programming-parent and I love the interface and the puzzles. My friends have had the app for a few months and young I. enjoys it a lot.
  • Flow isn’t a coding app. It’s an app that encourages visual motor planning development. Anyone that’s done any coding at all will know that visual motor planning is a critical skill for programming. First this then that. If I put this here then that needs to go there. Flow is a great game that helps kids develop this kind of planning. And that’s helpful not only for programming, but everything else too.

School Age Kids (9 +)

Once kids are comfortable reading and manipulating English as a language, they can move on to a language-based program. There are a few different ones available, some specifically designed for kids like Tynker and Scratch. For the kid that I have in this age bracket – taking into account his interests and temperament – I’m just going to go straight to Python or R for him. As with everything parenting: your mileage may vary and that’s OK.

Some resources for learning Python with kids include:

  • This great post from Geekwire. Really simple ideas to engage with your kid.
  • Python Tutorials for kids 13+ is a companion site to the For Dummies book Python for kids I’ve mentioned previously. We got the book from the library a month or so back and I’m thinking of shelling out the $$ to buy it and keep it here permanently.
  • The Invent with Python blog has some great discussion of the issue generally.

R doesn’t seem to have as many kid-friendly resources, but the turtle graphics package looks like it might be worth a try.

General Resources for Teaching Kids to Code

Advocates for programming have been beating this drum for a long time. I came across a number of useful posts while writing this one, so here they are for your reference:

Good luck and enjoy coding with your kid. And if your kid doesn’t want to learn code, enjoy climbing that tree instead!

Open datasets for analysis

So you’re a new data scientist and you’re exploring everything the internet has to offer (a lot). But having explored, you’re ready to try something on your own. Here is a (short) list of data sources you can tackle:

I’ll keep adding to the list as I come across interesting things.

Three things every new data scientist should know

Anyone who has spent any time in the online data science community knows that this kind of post is a genre all on its own. “N things you should know/do/be/learn/never do” is something that pops up in my Twitter feed several times a day. These posts range from useful ways to improve your own practice to clickbait listing reams of accomplishments that make Miss Bingley’s “accomplished young ladies” speech in Pride and Prejudice appear positively unambitious.

Miss Bingley’s pronouncement could easily be applied to data scientists everywhere:

“Oh! certainly,” cried his faithful assistant, “no [woman] can be really esteemed accomplished who does not greatly surpass what is usually met with. A woman must have a thorough knowledge of music, singing, drawing, dancing, and the modern languages, to deserve the word; and besides all this, she must possess a certain something in her air and manner of walking, the tone of her voice, her address and expressions, or the word will be but half-deserved.”

Swap out the references to women with “data scientist”, throw in a different skill set and there we have it:

“Oh! certainly,” cried his faithful assistant, “no data scientist can be really esteemed accomplished who does not greatly surpass what is usually met with. A data scientist must have a thorough knowledge of programming in every conceivable language that was, is or shall be, linear algebra, business acumen, obscure models only ever applied in obscure places, and whatever is “hot” this year, to deserve the title; and besides all this, she must possess a certain something in her air and manner of tweeting, the tone of her blogging, her linkedin profile and be a snappy dresser, or the title will be but half-deserved.”

Put like that, you’d be forgiven for not allowing the Miss Bingleys of the world to define you.

If I had a list of things to say to new data scientists, they wouldn’t have much to do with data science at all:

  1. You define yourself and your own practice. Not Twitter, not an online community, not blogs from people who may or may not know your work. Data science is an incredibly broad array of people, ideas and tools. Maybe you’re in the middle of it, maybe you’re on the edge. That’s OK, it’s all valuable.
  2. You’re more than a bot. This is an industry that is increasing automation every day. You add value to your organisation in ways that a bot never can. What is the value you add? Cultivate and grow it.
  3. The online community is a wonderful place full of people who want to help you grow your practice and potential. Dive in and explore: but remember that the advice and pronouncements are just that. They don’t always apply to you all the time. Take what’s useful today and put the rest aside until it’s useful later.

It’s a short list!