We are all statisticians now
or should be, to a certain extent, if we take the words of Hal Varian, Google’s recently anointed numbers guru, to heart. The former economist (a very heavily maths-focused one at that) is frequently quoted as saying that statistician will be the next ‘sexy’ job (just as engineer was), but the full line, from McKinsey, goes much deeper:
I keep saying the sexy job in the next ten years will be statisticians. People think I’m joking, but who would’ve guessed that computer engineers would’ve been the sexy job of the 1990s? The ability to take data—to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it—that’s going to be a hugely important skill in the next decades, not only at the professional level but even at the educational level for elementary school kids, for high school kids, for college kids. Because now we really do have essentially free and ubiquitous data. So the complementary scarce factor is the ability to understand that data and extract value from it.
I think statisticians are part of it, but it’s just a part. You also want to be able to visualize the data, communicate the data, and utilize it effectively. But I do think those skills—of being able to access, understand, and communicate the insights you get from data analysis—are going to be extremely important. Managers need to be able to access and understand the data themselves.
I recently started working my way through Ben Fry’s Visualizing Data, and adding Fry’s process to Varian’s quote shows some of the deep changes people need to make in order to embrace the new numeracy. Visualizing Data is more about Fry’s Processing language and how to hook it up to datasets than it is about thinking visually or working through those datasets to find a pattern or evocative image, but it begins with a seven-step process:
ACQUIRE — Obtain the data, whether from a file on a disk or a source over a network.
PARSE — Provide some structure for the data’s meaning, and order it into categories.
FILTER — Remove all but the data of interest.
MINE — Apply methods from statistics or data mining as a way to discern patterns or place the data in mathematical context.
REPRESENT — Choose a basic visual model, such as a bar graph, list, or tree.
REFINE — Improve the basic presentation to make it clearer and more visually engaging.
INTERACT — Add methods for manipulating the data or controlling what features are visible.
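To make the seven steps a little more concrete, here is a minimal sketch in Python. It is not Fry’s code (his book uses Processing), and everything in it is an assumption made for illustration: the rainfall figures are invented, the function names are hypothetical, and a text bar chart stands in for a real visual model. INTERACT is reduced to parameters the reader can change.

```python
import csv
import io
from statistics import mean

# Hypothetical sample data, invented purely for illustration.
RAW = """city,year,rainfall_mm
Austin,2007,1190
Austin,2008,410
Boston,2007,1050
Boston,2008,1390
Denver,2007,380
Denver,2008,300
"""


def acquire():
    # ACQUIRE: an in-memory string stands in for a file or a network source.
    return io.StringIO(RAW)


def parse(source):
    # PARSE: give every field a name and a type.
    return [
        {
            "city": row["city"],
            "year": int(row["year"]),
            "rainfall_mm": float(row["rainfall_mm"]),
        }
        for row in csv.DictReader(source)
    ]


def filter_rows(rows, min_year):
    # FILTER: remove all but the years of interest.
    return [r for r in rows if r["year"] >= min_year]


def mine(rows):
    # MINE: a very small statistic, the mean rainfall per city.
    by_city = {}
    for r in rows:
        by_city.setdefault(r["city"], []).append(r["rainfall_mm"])
    return {city: mean(values) for city, values in by_city.items()}


def represent(summary, width=40, sort=True):
    # REPRESENT: a plain-text bar chart as the basic visual model.
    # REFINE: sort the bars and scale them to a common width.
    if not summary:
        return ""
    items = list(summary.items())
    if sort:
        items.sort(key=lambda kv: kv[1], reverse=True)
    top = max(summary.values())
    lines = []
    for city, value in items:
        bar = "#" * round(width * value / top)
        lines.append(f"{city:<8}{bar} {value:.0f}")
    return "\n".join(lines)


def explore(min_year=2007, sort=True):
    # INTERACT: the caller controls what is kept and how it is shown.
    rows = filter_rows(parse(acquire()), min_year)
    return represent(mine(rows), sort=sort)


if __name__ == "__main__":
    print(explore(min_year=2007))
```

Calling explore(min_year=2008) instead re-runs the same path over a narrower slice of the data, which is a crude version of letting the audience choose what they see.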
This does a nice job of highlighting that Varian’s charge calls for a mix of skills, for managers, practitioners, and interpreters alike. Some of the steps are naive, or described in a way that invites unhealthy simplisticism (simplicity is good; simplisticism, the thing we often get instead of simplicity, is reductive, which is always bad). MINE and REPRESENT are the steps where numbers emerge into something living and actionable. MINE, as defined by Fry, focuses on software rather than on the cognitive styles and elastic minds that generate insights and recognize patterns. Certainly software is needed, but the hypotheses and candidate patterns you validate with the software come from soft eyes, something I blogged about a while ago. Similarly, REPRESENT is posed as choosing from a list of standard data tropes. But hey, it’s a software book, and we all know Fry is more visual than that.
The real point is that this path shows a range of skills and validation even broader than what Varian points to. Even someone who only works with someone working with data needs to know, understand, and respect the technical underpinnings of the first two steps, which set up the infrastructure of the entire data exercise. As with software, you need to measure twice and cut once here, because this is the foundation of your inquiry and you won’t be able to change it quickly. Filter, mine, and represent are subjects for another book perhaps, but they put you in the land of Tufte and Orwell, as well as Flowing Data and statistics: a mix of simple communication, humanities, and the techniques of numbers.
The last step, INTERACT, is also pretty interesting. I love how Fry reminds people to let the data grow with the audience by adding some interactivity. Sure, you take the first crack at it, but letting your audience go deeper, create their own juxtapositions, or simply play with the data gets them more engaged and allows even more meaning to emerge from the data.
http://www.kipbot.com/blog/2008/03/05/dd-my-grad-school-footnote/