Exploring Correlation and the Simple Linear Regression Model

I’ve been wanting to learn Shiny for quite some time, since it seems to me that it’s a fantastic tool for communicating data science concepts. So I created a very simple app which allows you to manipulate a data generation process from weak through to strong correlation and then interprets the associated regression slope coefficient for you.

Here it is!

I made it because, whilst we often teach simple linear regression and correlation as two intermeshed ideas, students at this level rarely have the opportunity to manipulate the concepts to see how they interact. This is easily fixable with a simple app in Shiny. If you want to start working in Shiny, then I highly recommend Oliver Keyes’ excellent start-up guide, which was extremely easy to follow for this project.
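If you'd like to poke at the same idea outside the app, here is a minimal sketch in plain Python. It is not the app's code (that lives in Shiny); the function names `simulate` and `summarise`, the sample size and the standard deviations are all my own assumptions for illustration. It generates data at a chosen correlation and shows that the fitted slope is the correlation rescaled by the ratio of the two standard deviations.

```python
import math
import random


def simulate(rho, n=2000, sd_x=1.0, sd_y=2.0, seed=1):
    """Draw n (x, y) pairs whose population correlation is rho (illustrative helper)."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        z1 = rng.gauss(0, 1)
        z2 = rng.gauss(0, 1)
        x = sd_x * z1
        # Mixing z1 and z2 this way gives corr(x, y) = rho in the population.
        y = sd_y * (rho * z1 + math.sqrt(1 - rho ** 2) * z2)
        pairs.append((x, y))
    return pairs


def summarise(pairs):
    """Return the sample correlation, the least-squares slope and sd(y)/sd(x)."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    r = sxy / math.sqrt(sxx * syy)
    slope = sxy / sxx
    sd_ratio = math.sqrt(syy / sxx)
    return r, slope, sd_ratio


for rho in (0.1, 0.5, 0.9):
    r, slope, sd_ratio = summarise(simulate(rho))
    print(f"target rho={rho:.1f}  r={r:.3f}  slope={slope:.3f}  r*sd_y/sd_x={r * sd_ratio:.3f}")
```

The last two columns always agree, because in simple linear regression the slope equals r multiplied by sd(y)/sd(x). Weak correlation therefore drags the slope towards zero even when the spread of the data stays the same.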

[Image: app view]

Teaching kids to code

Kids coding is a topical issue, particularly given the future of employment. The jobs our children will do are different from the ones our parents did (or are doing) and from our own. Programming is one of the few skills that the experts agree is important.

There are lots of great online resources already in place to help children learn the computer skills they will need in the future. You can start early, you can make it fun and it doesn’t have to cost you a fortune.

Let me be clear: this isn’t a parenting blog. I do have kids. I do program. I do have a kid that wants to learn to program (mostly I think because he thinks I’ll give him a free pass on other human-necessary skills such as creativity, interpersonal relationships and trying on sports day).

My personal parenting philosophy (if anyone cares) is that kids learn very well when you give them interesting tools to explore the world with. That might include programming, but for some kids it won’t. That’s OK. It doesn’t mean they’re never going to get a job: it just means they may prefer to climb trees because they’re kids. There’s a lot of learning to be had up a tree.

But part of providing interesting resources with which to explore the world is knowing where to find them. Here’s a rundown of some resources broken down by age group. Yes, kids can start as early as preschool!

Preschool Age (4 +)

The best resources for kids this age are fun interactive apps. If it’s not fun, they won’t engage and frankly nobody wants to stand over a small child making them do something when they could be learning autonomously through undirected play. Here are my favourites:

  • Lightbot. This is a fun interactive app available on Android and Apple that teaches kids the basics of programming using icons rather than language-based code. It comes in both junior coding (4-8 years) and programming puzzles (9+) versions. My kids have had the apps for six months and enjoyed them.
  • Cargo-bot was recommended to me by a fellow programming-parent and I love the interface and the puzzles. My friends have had the app for a few months and young I. enjoys it a lot.
  • Flow isn’t a coding app. It’s an app that encourages visual motor planning development. Anyone who’s done any coding at all will know that visual motor planning is a critical skill for programming. First this, then that. If I put this here, then that needs to go there. Flow is a great game that helps kids develop this kind of planning. And that’s helpful not only for programming, but for everything else too.

School Age Kids (9 +)

Once kids are comfortable reading and manipulating English as a language, they can move on to a language-based program. There are a few different ones available, some specifically designed for kids, like Tynker and Scratch. For the kid that I have in this age bracket, taking into account his interests and temperament, I’m just going to go straight to Python or R for him. As with everything parenting: your mileage may vary and that’s OK.

Some resources for learning Python with kids include:

  • This great post from Geekwire. Really simple ideas to engage with your kid.
  • Python Tutorials for kids 13+ is a companion site to the For Dummies book Python for kids, which I’ve mentioned previously. We got the book from the library a month or so back and I’m thinking of shelling out the $$ to buy it and keep it here permanently.
  • The Invent with Python blog has some great discussion of the issue generally.

R doesn’t seem to have as many kid-friendly resources, but the turtle graphics package looks like it might be worth a try.

General Resources for Teaching Kids to Code

Advocates for programming have been beating this drum for a long time. I came across a number of useful posts while writing this one, so here they are for your reference:

Good luck and enjoy coding with your kid. And if your kid doesn’t want to learn code, enjoy climbing that tree instead!

Using Natural Language Processing for Survey Analysis

Surveys have a specific set of analysis tools that are used for analysing the quantitative part of the data you collect (Stata is my particular poison of choice in this context). However, often the interesting parts of the survey are the unscripted, “tell us what you really think” comments.

Certainly this has been true in my own experience. I once worked on a survey deployed to teachers in Laos regarding resources for schools and teachers. All our quantitative information came back and was analysed, but one comment (translated for me into English by a brilliant colleague) stood out. It read something to the effect of “this is very nice, but the hole in the floor of the second storey is my biggest concern as a teacher”. It’s not something that would ever have been included outright in the survey, but a simple sentence told us a lot about the resources this school had access to.

Careful attention to detailed comments in small surveys is possible. But if you have thousands upon thousands of responses, this is far more difficult. Enter natural language processing.

There are a number of tools which can be useful in this context. This is a short overview of some that I think are particularly useful.

  • Word Clouds. These are easy to prepare and very simple, but can be a powerful way to communicate information. Like all data visualisation, there are the good and the bad. This is an example of a very simple word cloud, while this post by Fells Stats illustrates some more sophisticated methods of using the tool.

One possibility to extend on the simple “bag of words” concept is to divide your sample by groups and compare clouds. Or create your own specific dictionary of words and concepts you’re interested in and only cloud those.

Remember that stemming the corpus is critical. For example, “work”, “worked”, “working”, “works” all belong to the same stem. They should be treated as one or else they are likely to swamp other themes if they are particularly common.

Note that no word cloud should be constructed without removing “stop words” like “the”, “and”, “a”, “I” etc. Dictionaries vary: they can (and should) be tailored to the problem at hand.
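To make the stop word and stemming steps concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: the tiny stop word list, the example comments, and especially `crude_stem`, which is a toy suffix-stripper and not a real stemming algorithm such as Porter's. For real work you'd reach for a proper NLP library.

```python
import re
from collections import Counter

# A deliberately tiny, hand-picked stop word list (illustrative only).
STOP_WORDS = {"the", "and", "a", "i", "is", "to", "of", "in", "my", "we", "are"}

# Suffixes stripped by the toy stemmer, tried in this order.
SUFFIXES = ("ing", "ed", "s")


def crude_stem(word):
    """Strip one common suffix, keeping at least three letters (toy stemmer)."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def term_counts(comments):
    """Lowercase, tokenise, drop stop words, stem, then count what is left."""
    counts = Counter()
    for comment in comments:
        for token in re.findall(r"[a-z']+", comment.lower()):
            if token not in STOP_WORDS:
                counts[crude_stem(token)] += 1
    return counts


comments = [
    "I worked late and the work is never finished",
    "Working conditions are fine",
    "The roof works but the floor needs repairs",
]
counts = term_counts(comments)
print(counts["work"])  # 4: "worked", "work", "working" and "works" share one stem
print(counts.most_common(5))
```

The resulting counts are what you would feed to a word cloud. Without the stemming step, the four "work" variants would show up as four separate small words rather than one large one.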

  • Network Analysis. If you have a series of topics you want to visualise relationships for, you could try a network-type analysis similar to this. The concept may be particularly useful if you manually decide topics of interest and then examine relationships between them. In this case, the outcome is very much user-dependent/chosen, but may be useful as a visualisation.
  • Word Frequencies. Alone, simple tables of word frequencies are not always particularly useful. In a corpus of documents pertaining to education, noting that “learning” is a common term isn’t something of particular note. However, how do these frequencies change by group? Do teachers speak more about “reading” than principals? Do people in one geographical area or salary bracket have a particular set of high frequency words compared to another? This is a basic exercise in feature/variable engineering. In this case, the usual data analysis tool kit applies (see here, here and here). Remember you don’t need to stop at high frequency words: what about high frequency phrases?
  • TF-IDF (term frequency-inverse document frequency) matrix. This may provide useful information and is the basis of many more complex analyses. The TF-IDF downweights terms appearing in all documents/comments (“the”, “i”, “and” etc.) while upweighting rare words that may be of interest. See here for an introduction.
  • Are the comments clustered across some lower dimensional space? The k-means algorithm may provide some data-driven guidance there. This would be an example of “unsupervised machine learning”, otherwise known as “this is an algorithm everyone has been using for 25 years but we need to call it something cool”. This may not generate anything obvious at first, but who is in those clusters and why are they there?
  • Sentiment analysis will be useful, possibly both applied to the entire corpus and to subsets. For example, among those who discussed “work life balance” (and derivative terms) is the sentiment positive or negative? Is this consistent across all work/salary brackets? Are truck drivers more upbeat than bus drivers? Again, basic feature/variable engineering applies here. If you’re interested in this area, you could do a lot worse than learning from Julia Silge, who writes interesting and informative tutorials in R on the subject.
  • Latent Dirichlet Allocation (LDA) and more complex topic analyses. Finally, latent Dirichlet allocation or other more complex topic analyses may be able to generate topics directly from the corpus. I think this would take a great deal of time for a new user and may have limited outcomes, particularly if an early analysis suggests you already have a clear idea of which topics are worth investigating. It is, however, particularly useful when dealing with enormous corpora. This is a really basic rundown of the concept. This is a little more complex, but has useful information.
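Going back to the TF-IDF item in the list above, the weighting is simple enough to sketch by hand. This is a minimal plain-Python version, assuming the unsmoothed textbook form (term frequency multiplied by the log of total documents over documents containing the term). Real libraries usually add smoothing and normalisation, so their numbers will differ. The function name `tf_idf` and the example comments are my own inventions for illustration.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per tokenised document (unsmoothed sketch)."""
    n_docs = len(documents)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))

    weights = []
    for tokens in documents:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            # tf = share of this document; idf = log(N / document frequency)
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


docs = [
    "the school needs more books".split(),
    "the floor of the school has a hole".split(),
    "the teachers need training".split(),
]
weights = tf_idf(docs)

# Print the second comment's terms from most to least distinctive.
for term, weight in sorted(weights[1].items(), key=lambda kv: -kv[1]):
    print(f"{term:8s} {weight:.3f}")
```

In this toy corpus “the” appears in every comment, so its weight is exactly zero, while “hole” appears in only one and floats to the top. That is the downweighting and upweighting described above, and it is why TF-IDF can surface the one comment about the floor in amongst thousands of others.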

So that’s a brief rundown of some basic techniques you could try. There are plenty more out there; this is just the start. Enjoy!