I’ve been working as a data scientist for a little while now, and when I think back to my previous jobs in economics and econometric research, I realise that there are a few skills and practices that are common in data science which would have been really helpful. So I’m putting this listicle out there, in case it’s useful for anybody else.
The skills listed here have been included if, and only if, I reckon they’d be useful in applied economics.
There’s a bunch of other data science practices that might be interesting, but I haven’t listed them because they don’t pass the usefulness test. For example, permutation importance is much better than significance testing for evaluating a right-hand-side variable, but no economics journal would accept it, so it’s not on the list. Similarly, constructing complex SQL queries is bread and butter in data science, but my hunch is that it wouldn’t be that useful for day-to-day econometrics work, so I didn’t include it.
1. Source control for everything, all the time
This is the most valuable single thing that I wish I’d known about five years ago. I’d vaguely heard of git, but it’s often described as being great for team collaboration (which is true), so I didn’t think it was relevant for someone who mostly worked alone. But running git init is one of the first things I do when I start a solo project these days.
Here are the two problems that git fixes. The first is the situation where you find yourself thinking “I know this model was working, but then I tinkered with it a bit and now it’s broken and I can’t remember how to unbreak it”. One git tutorial suggests you think about it as being just like clicking Save when you’re playing a game. Whenever you’re happy with your progress on a project, you can make a git commit and then experiment freely.
Have you ever had a project folder that’s full of files with names like analysis_final_2_edited_Jan15.R, ad infinitum? That’s the second problem git fixes. It keeps snapshots of your project directories at moments in time, which you can restore or tidy away whenever you want. You never have to remember, months later, which of the “final” files was the one that actually got used.
You can also use the same setup for your articles. Git works best with plain text files, and if you’re an economist you’re most likely working in LaTeX, so you can efficiently track changes and work with coauthors by keeping the article itself in your repo.
There’s a 15-minute interactive tutorial over on GitHub, the cloud-based git hosting service, which will give you everything you need to start using git.
2. Cross validation
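Everyone who’s done basic econometrics knows how to do cross validation. In its simplest form, all it means is keeping a holdout set separate from the data you use to fit a model, so that you can evaluate the model’s performance on data it hasn’t seen. (The k-fold version repeats this over several different splits and averages the results.)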
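To make the holdout idea concrete, here is a minimal sketch in Python using NumPy. The data is simulated and the 25% split is an arbitrary choice for illustration, not a recommendation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data (an assumption for illustration): y = 1 + 2x + noise
n = 200
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=n)

# Shuffle the row indices, then set aside 25% of the rows as the holdout set.
idx = rng.permutation(n)
n_test = n // 4
test_idx, train_idx = idx[:n_test], idx[n_test:]

# Fit OLS (intercept plus slope) on the training rows only.
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X[train_idx], y[train_idx], rcond=None)

# Evaluate on rows the model has never seen.
resid = y[test_idx] - X[test_idx] @ beta
holdout_rmse = float(np.sqrt(np.mean(resid ** 2)))
print(f"coefficients: {beta.round(2)}, holdout RMSE: {holdout_rmse:.3f}")
```

With time series you would not shuffle like this. The usual approach is to hold out the most recent observations, so that the model is only ever tested on data from after the period it was fitted on.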
In data science you use exactly the same kind of technique. The difference is how you think about it. Cross validation in econometrics is used occasionally, and usually takes a back seat to significance testing, whereas data scientists use it constantly because they’ve learned to be a little paranoid about how well their models might generalise.
If econometricians took on some of the same healthy paranoia, their models would become more robust and reliable. It would also save them from the embarrassment of choosing a model because it has a high R², only to find that it falls to pieces when it’s used to make forecasts.
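The high-R² trap can be sketched in a few lines of Python. This is an illustration under made-up assumptions: the data is simulated from a linear relationship, the two candidate models are polynomials of degree 1 and 10, and the helper names (design, in_sample_r2, kfold_mse) are my own rather than anything from a library.

```python
import numpy as np

def design(x, degree):
    """Polynomial design matrix with columns 1, x, ..., x**degree."""
    return np.vander(x, degree + 1, increasing=True)

def in_sample_r2(x, y, degree):
    """R-squared of an OLS polynomial fit, measured on the fitting data itself."""
    X = design(x, degree)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    ss_res = np.sum((y - X @ beta) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

def kfold_mse(x, y, degree, k=5, seed=0):
    """Mean squared prediction error, averaged over k held-out folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    errors = []
    for test_idx in folds:
        train = np.ones(len(y), dtype=bool)
        train[test_idx] = False  # everything outside this fold is training data
        beta, *_ = np.linalg.lstsq(design(x[train], degree), y[train], rcond=None)
        pred = design(x[test_idx], degree) @ beta
        errors.append(np.mean((y[test_idx] - pred) ** 2))
    return float(np.mean(errors))

# Simulated data: the true relationship is linear.
rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=40)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=40)

for degree in (1, 10):
    print(f"degree {degree}: in-sample R2 = {in_sample_r2(x, y, degree):.3f}, "
          f"5-fold MSE = {kfold_mse(x, y, degree):.3f}")
```

Because the linear model is nested in the degree-10 one, the bigger model can never have a lower in-sample R². The cross-validated error is the number that tells you whether the extra terms help with data the model hasn’t seen, and on data like this they often don’t.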
3. R and Python
These are the two languages everybody uses, and for different reasons they’re worth investing time to learn if you haven’t already. As well as being much better than the commercial software most quant economists use, they’re both completely free.
R is a weird language with a lot of quirks. It has two big advantages over the competition. The first is that there’s a massive number of add-on packages that let you do lots of things quickly and easily. Here is a list of packages with an econometrics focus, and I particularly recommend the package called forecast if you work with time series. Second, a particular package called ggplot2 lets you produce high-quality graphs. In the same way that LaTeX is worth learning because it makes your articles look professional and slick, it’s worth spending some time getting into the ggplot2 headspace, because it makes your charts look beautiful.
Working in Python encourages you to write clean and reusable code, in the same way that EViews and MATLAB encourage you to write bad code. Once you’ve got the basics, Pro Python is a great resource to check out. The Python ‘data science stack’ is much simpler than R’s menagerie of packages: you just have pandas for manipulating data, NumPy for numerical operations, and scikit-learn for fitting models.
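As a rough sketch of how those three libraries divide the work, here is a toy example. The income and consumption figures are invented for illustration, and the regression itself is only there to show where each library fits in.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Invented annual figures, purely for illustration.
df = pd.DataFrame(
    {
        "income": [100.0, 104.0, 109.0, 115.0, 118.0, 124.0],
        "consumption": [80.0, 83.0, 86.5, 91.0, 93.5, 98.0],
    },
    index=pd.Index([2010, 2011, 2012, 2013, 2014, 2015], name="year"),
)

# NumPy does the maths (logs); pandas does the data handling
# (differencing by row and dropping the empty first year).
growth = np.log(df).diff().dropna()

# scikit-learn fits the model: consumption growth on income growth.
model = LinearRegression().fit(growth[["income"]], growth["consumption"])
print(f"intercept: {model.intercept_:.4f}, slope: {model.coef_[0]:.4f}")
```

Note that scikit-learn is built for prediction, so it reports coefficients but not standard errors or t-statistics.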
In five years’ time we’ll probably all be using Julia for everything, but for now it’s still under development.
One more thing
‘Data scientist’ sounds like a bullshit made-up job title, but it’s actually a real thing, he said defensively. It means your job is in applied computational statistics and quantitative modelling, with a sideline in basic software engineering, using techniques from statistical machine learning. So there.