What’s a variable in R?

I had a great discussion with a bunch of 9-11 year olds about what a variable is the other day. It occurred to me that I was coding for years before I understood what was ‘under the hood’ so to speak.

Here’s a quick rundown of variables in R.

Variables

We use the word ‘variable’ to describe things that can change, or that could take multiple values. In Excel, typically this is a column with a heading giving the variable name.

In R, it might be a column in a data frame we access with a name or using subscripts. Or it might be a standalone object.

Using variables

We can access and use variables a few different ways in R:
– If it’s a stand alone object, we can just call it by name.
– If it’s a part of a data frame, we can use the $ notation
– Alternatively, we can use subscripts

But what’s going on under the hood?

A variable is just a type of object – you can think of an object in code as just a thing with a name.
– R puts the ‘thing’ in a part of the computer’s memory
– It labels that part of the memory with the variable’s name.
– We can later update that object in memory with new values, or point to the same object using a different name.

If we code fruit <- ‘apple’, then the computer puts ‘apple’ somewhere in its memory and points to that using the label fruit.

If we code awesomeR <- TRUE, the computer puts TRUE somewhere in its memory and points to that using the label awesomeR.

If we code x <- 45, the computer puts 45 somewhere in its memory and points to it using the label x.

An infographic about variables in R.

Mapping analytics objects

A lot of incredibly important work has been done around data science workflows, most notably by Jenny Bryan. If you’re new to thinking about workflows, start with the incredible STAT545 resources and Happy Git and Github for the useR. Jenny’s work got me thinking about my broader workflow.

As a consultant, I work with a ream of developing documents, datasets, requests, outputs and analyses. A collection of analytical ephemera I refer to as analytics objects. When looking at a big project, I’ve found it helpful to start mapping out how these objects interact, where they come from and how they work together.

Here’s a general concept map: individual projects vary alot. But it’s a start point.

A concept map with analytics objects.

Client request objects

My workflow tends to start with client requests and communications – everything from the initial “are you available, we have an idea” email to briefings, notes I’ve taken during meetings, documents I’ve been given.

At the start of the project this can be a lot of documents and it’s not always easy to know where they should sit or how they should be managed.

A sensible solution tends to develop over time, but this is a stage where it’s easy to lose or forget about certain important things if it all stays in your inbox. One thing I often do at the start of a project  is a basic document curation in a simple excel sheet so I know what I’ve got, where it came from and what’s in it.

I don’t usually bother curating every email or set of meeting notes, but anything that looks like it may be important or could be forgotten about goes in the list.

a picture of a spreadsheet

Data objects

The next thing that happens is people give me data, I go and find data or some third party sends data my way.

There’s a lot of data flying about – sometimes it’s different versions of the same thing. Sometimes it’s the supposed to be the same thing and it’s not.

It often comes attached with metadata (what’s in it, where did it come from, who collected it, why) and documents that support that (survey instruments, sampling instructions etc.).

If I could go back and tell my early-career self one thing it would be this: every time someone gives you data, don’t rely on their documentation- make your own.

It may be short, it may be brief, it may simply contain references to someone else’s documentation. But take the time to go through it and make sure you know what you have and what you don’t.

For a more detailed discussion of how I handle this in a low-tech environment/team, see here. Version control systems and R markdown are my strong preference these days- if you’re working with a team that has the capacity to manage these things. Rmarkdown for building data dictionaries, metadata collections and other provenance information is brilliant. But even if you’re not and need to rely on Excel files for notes, don’t skip this step.

Next comes the analysis and communications objects which you’re probably familiar with.

Analysis and communications objects

(Warning: shameless R plug here)

The great thing about R is that it maps most of my analysis and communications objects for me. Using an Rproject as the basis for analysis means that the provenance of all transformed data, analyses and visualisations is baked in. Version control with Github means I’m not messing around with 17 excel files all called some variation of final_analysis.xlsx.

Using Rmarkdown and Shiny for as much communication with the client as possible means that I’ve directly linked my reporting, client-bound visualisations and returned data to my analysis objects.

That said, R can’t manage everything (but they’re working on it). Sometimes you need functionality R can’t provide and R can’t tell you where your data came from if you don’t tell it first. R can’t tell you if you’re scoping a project sensibly.

Collaboration around and Rmarkdown document is difficult when most of your clients are not R users at all. One work around for me has been to:

  • Export the Rmarkdown document as a word document
  • Have non-technical collaborators make changes and updates via tracked changes
  • Depending on the stage of the project input all those back into R by hand or go forwards with the word document.

It’s not a perfect system by any means, but it’s the best I’ve got right now. (If you’ve got better I’d love to hear about that.)

Objects inform other objects

In a continuing environment, your communications objects inform the client’s and so on. Not all of these are used at any given time, but sometimes as they get updated or if projects are long term, important things get lost or confused. Thinking about how all these objects work together helped my workflow tremendously.

The lightbulb moment for me was that I started thinking about all my analytics objects as strategically as Jenny Bryan proposes we think about our statistics workflow. When I do that, the project is better organised and managed from the start.

Object not found: R

An infographic with some tips for managing the 'object not found' error in R.

 

Full text for those using screen readers:

R Error Frustration?

Object not found.

This means R couldn’t find something it went looking for – a function or a variable/data frame usually.

Have you tried?

  • Spelling errors. Some are obvious, some less so in a block of code e.g. lamdba for lambda. Tip: mark each place in your code block where the ‘unfound object’ is and then use “find” in the editor to make sure you’ve caught them all.
  • Where is your object defined? In which environment? Tip: draw a diagram that explains the relationships between your functions and then step through it line by line.
  • Is the object where R thinks it should be? Where did you tell R it was – a search path, a data frame or somewhere else? Can you physically check if the object is in that space?

Decoding error messages in R

Decoding error messages in R can be difficult for newcomers, that’s why I’m working on helpPlease. However, in the meantime, it’s important to be able to understand R errors and warnings in more detail than simply ‘R says no’. So here’s a quick rundown:

Errors in R an infographic

R gives both errors and warnings

An error is “R says no”. It’s R’s way of telling you why the chunk of code is not possible to execute.

Warnings mean “R says OK sure but maybe you won’t like what you’re going to get”. It’s R’s way of telling you the code is behaving in a different way than you might reasonably expect.

Decoding an error message

The error message typically comes in three parts. Here’s a common example from my code: I’ve tried to access a part of a array that doesn’t exist – my array has a column dimension of 5, so when R goes looking for a the 100th column it’s understandably confused and just gives up.

R error message

There are three main parts to this message:

  1. The declaration that it is an Error
  2. The location of the error – it’s in the line of my code fit[5,100,]
  3. The problem this mistake in my code caused: the subscript is out of bounds, i.e. I asked R to go an retrieve a part of this array that did not exist.

Decoding a warning message

Warning messages can be very variable in format, but there are often common elements. Here’s a common one that ggplot gives me:

ggplot2 warning message

Here I’ve asked ggplot2 to put a line chart together for me, but some of my data frame is missing. Ggplot2 can still put the chart together, but it’s letting me know I have missing values.

While warning messages can be very variable, there are some common elements that turn up fairly regularly:

  1. The declaration of a warning
  2. The behaviour being warned about
  3. The piece of code that caused the warning

Now that you know what warnings and errors are and what’s in them: how do you find out what they mean?

Where can you find help?

There’s lots of information out there to help you decode your warning and error messages. Here are some that I use all the time:

  • Typing ? or ?? and the name of the function that’s going wrong in the console will give you help within R itself
  • Googling the error message, warning or package is often very useful
  • Stack Overflow or the RStudio community forums can be searched for other people’s (solved!) problems
  • The vignettes and examples for the package you’re using are a wealth of information
  • Blog posts that use the package or function you are can be a very good step-by-step guide of how to prepare your data for the tool you’re trying to use
  • Building a reprex (a reproducible example) is a good way of getting ready to ask a question on Stack Overflow or the R community forums.

Good luck! And in the meantime, if you should come across an R message that could use explaining in plain text I’d really love to hear from you (especially if you’re new!).

Where do things live in R? R for Excel Users

One of the more difficult things about learning a new tool is the investment you make while you’re learning things you already know in your current tool. That can feel like time wasted – it’s not, but it’s a very frustrating experience. One of the ways to speed up this part is to ‘translate’ concepts you know in your current tool into concepts for your new one.

In that spirit, here’s a brief introduction to where things live in R compared to Excel. Excel is a very visual medium – you can see and manipulate your objects all the time. You can do the same in R, it’s just that they are arranged in slightly different ways.

Where does it live infographic

Data

Data is the most important part. In Excel, it lives in the spreadsheet. In R it lives in a data structure – commonly a data frame. In Excel you can always see your data.

Excel spreadsheet

In R you can too – go to the environment window and click on the spreadsheet-looking icon, it will give you your data in the viewer window if it’s an object that can be reproduced like that (if you don’t have this option, your object may be a list not a data frame). You can’t manipulate the data like this, however – you need code for that. You can also use commands like head(myData) to see the first few lines, tail(myData) to see the last few and print(myData) to see the whole object.

R environment view

view of data in R

Code

Excel uses code to make calculations and create statistics – but it often ‘lives’ behind the object it produces. Sometimes it can make your calculation look like the original data and create confusion for your stakeholders (and for you!).

Excel formula

In R code is used in a similar way to Excel, but it lives in a script, a .R file. This makes it easier to reuse, understand and more powerful to manipulate. Using code in a script saves a lot of time and effort.

R script

Results and calculations

In Excel, results and calculations live in a worksheet in a workbook. It can be easy to confuse with the original data, it’s hard to check if things are correct and re-running analyses (you often re-run them!) is time consuming.

In R, if you give your result or analysis a name, it will be in the Environment, waiting for you – you can print it, copy it, change it, chart it, write it out to Excel for a coworker and recreate it any time you need with your script.

A result in R

That’s just a simple run down – there’s a lot more to R! But it helps a lot to know where everything ‘lives’ as you’re getting started. Good luck!

R for Excel users

Moving over to R (or any other programming language) from Excel can feel very daunting. One of the big stumbling blocks, in my view, is having a mental understanding of how we store data in structures in R. You can view your data structures in R, but unlike Excel where it’s in front of your face, it’s not always intuitive to the user just starting out.

There’s lots of great information on the hows, whys and wherefores: here’s a basic rundown of some of the common ways we structure our data in R and how that compares to what you’re already familiar with: Excel.

Homogeneous data structures


basic data structures infographic

 

Homogeneous in this case just means all the ‘bits’ inside these structures need to be of the same type. There are many types of data in R, but the basic ones you need to know when just starting out for the first time are:

  • Numbers. These come in two varieties:
    • Doubles – where you are wanting and using decimal points, for example 1.23 and 4.56.
    • Integers- where you don’t, for example 1, 2, 3.
  • Strings. This is basically just text data – made up of characters. For example, dog, cat, bird.
  • Booleans. These take two forms TRUE and FALSE.

Homogeneous data structures are vectors, matrices and arrays. All the contents of these structures have to have the same type. They need to be numbers OR text OR booleans or other types – but no mixing.

Let’s go through them one-by-one:

  • Vectors. You can think of a vector like a column in a spreadsheet – there’s an arbitrary number of slots and data in each one. There’s a catch – the data types all have to be the same: all numbers, all strings, all booleans or other types. Base R has a good selection of options for working with this structure.
  • Matrices. Think of this one as the whole spreadsheet – a series of columns in a two dimensional arrangement. But! This arrangement is homogeneous – all types the same. Base R has you covered here!
  • Arrays. This is the n-dimensional equivalent of the matrix- a bundle of worksheets in the workbook if you will. Again, it’s homogenous. The abind package is really useful for manipulating arrays. If you’re just starting out, you probably don’t need this yet!

The advantage of homogeneous structures is that they can be faster to process – but you have to be using serious amounts of processing power for this to matter a lot. So don’t worry too much about that for now. The disadvantage is that they can be restrictive compared to some other structures we’ll talk about next.

 

Heterogeneous structures

Basic data structures heterogeneous

 

Heterogeneous data structures just mean that the content can be of a variety of types. This is a really useful property and makes these structures very powerful. There are two main forms, lists and data frames.

  • Lists. Like a vector, we can think about a list like a column from a spreadsheet. But unlike a vector, the content of the list can be any type.
  • Data frames. A data frame is really a list of lists. Generally the content of each sub-list (column of the data frame) is the same (like you’d expect in a spreadsheet) but that’s not necessarily the case. Data frames can have named columns (so can other structures) and you can access data using those names.

Data frames can be extended to quite complex structures. Data frames don’t have to be ‘flat’. Because you can make lists of lists, you can have data frames where one or more of the columns have lists in each slot, they’re called nested data frames.

This and other properties makes the data type extremely powerful for manipulating data. There’s a whole series of operations and functions in R dedicated to manipulating data frames. Matrices and vectors can be converted into data frames, one way is the function as.data.frame(my_matrix).

The disadvantage of this structure is it can be slower to process – but if you’re at the stage of coding where you’re not sure if this matters to you, it probably doesn’t just now! R is set up to do a bunch of really useful things using data frames. This is the data structure probably most similar to an Excel sheet.

How do you know what structure you’re working with? If you have an object in R and you’re not sure if it’s a matrix, or a vector, a list or a data frame call str(object). It will tell you what you’re working with.

So that’s a really simple take on some simple data structures in R: quite manageable, because you already understand lots of these concepts from your work in Excel. It’s just a matter of translating them into a different environment.

 

Acknowledgement: Did you like the whole homogeneous/heterogeneous structure idea? That isn’t my idea – Hadley Wickham in Advanced R talks about it in much more detail.

Closures in R

Put briefly, closures are functions that make other functions. Are you repeating a lot of code, but there’s no simple way to use the apply family or Purrr to streamline the process? Maybe you could write your own closure. Closures enclose access to the environment in which they were created – so you can nest functions within other functions.

What does that mean exactly? Put simply, it can see all the variables enclosed in the function that created it. That’s a useful property!

What’s the use case for a closure?

Are you repeating code, but instead of variables or data that’s changing – it’s the function instead? That’s a good case for a closure. Especially as your code gets more complex, closures are a good way of modularising and containing concepts.

What are the steps for creating a closure?

Building a closure happens in several steps:

  1. Create the output function you’re aiming for. This function’s input is the same as what you’ll give it when you call it. It will return the final output. Let’s call this the enclosed function.
  2. Enclose that function within a constructor function. This constructor’s input will be the parameters by which you’re varying the enclosed function in Step 1. It will output the enclosed function. Let’s call this the enclosing function.
  3. Realise you’ve got it all wrong, go back to step 1. Repeat this multiple times. (Ask me how I know..)
  4. Next you need to create the enclosed function(s) (the ones from Step 1) by calling the enclosing function (the one from Step 2).
  5. Lastly, call and use your enclosed functions.

An example

Say I want to calculate the mean, SD and median of a data set. I could write:
x <- c(1, 2, 3)
mean(x)
median(x)
sd(x)

That would definitely be the most efficient way of going about it. But imagine that your real use case is hundreds of statistics or calculations on many, many variables. This will get old, fast.

I’m calling those three functions each in the same way, but the functions are changing rather than the data I’m using. I could write a closure instead:

stat <- function(stat_name){

function(x){
stat_name(x)
}

}

This is made up of two parts: function(x){} which is the enclosed function and stat() which is the enclosing function.

Then I can call my closure to build my enclosed functions:

mean_of_x <- stat(mean)
sd_of_x <- stat(sd)
median_of_x <- stat(median)

Lastly I can call the created functions (probably many times in practice):

mean_of_x(x)
sd_of_x(x)
median_of_x(x)

I can repeat this for all the statistics/outcomes I care about. This example is too trivial to be realistic – it takes about double the lines of code and is grossly ineffficient! But it’s a simple example of how closures work.

More on closures

If you’re producing more complex structures, closures are very useful. See Jason’s post from Left Censored for a realistic bootstrap example – closures can streamline complex pieces of code reducing mistakes and improving the process you’re trying to build. It takes modularity to the next step.

For more information in R see Hadley Wickham’s section on closures in Advanced R.