Archive for the 'analytics' Category

Nifty Cross-Channel Experience with B&N

B&N’s “pick me up” is a great cross-channel integration. I’m using my fantasy baseball drafts as a reason to finally learn a Mac OS X database program, specifically FileMaker Pro. According to bn.com, “The Missing Manual” for FMP appears to be available at the Park Slope store. I signed up to have someone reserve the book for me and here’s the confirmation:

bnpickmeup.png

In the next couple hours, I’m supposed to get an email telling me the book is there. I love how they took any possible confusion out of the process — ‘don’t come to the store’ till you get the email, give the email about an hour. I just did it a few minutes ago, so the only possible room for annoyance is if I just don’t get an email and I have no idea how to track the request. Still, it’s pretty cool.

====

UPDATE: In less than an hour, I got both my email and a text message. Pretty sweet.

bnpickmeupiphone.PNG

Video: “Pie charts suck so beware of them”

Nice Ignite talk by Alex Lundry, who, according to a quick Google hit, does a lot of market and political research and is a consultant to the GOP. The talk covers data viz, visual thinking, and some politics.

Silly Stat: ‘Kindle books outsold real books on Christmas day’ | Mobile Entertainment News

While I am a big fan of the Kindle, this story from Mobile Entertainment News - referencing Amazon’s statement that they sold more e-books on Christmas day than real books - is silly. I mean, didn’t anybody stop to think about the number?

Of course, more people bought e-Books than real books on Christmas Day. Who goes to Amazon on Christmas Day to order a real book? (On Christmas Day, amidst the egg nog, the coffee cake, and all the gifts, what kind of doof says “oh I’m going to need this real book in a couple days, better go order it.” None, aside from me.) OTOH, how many people who received a Kindle on Christmas day immediately go and buy some books? Every. Single. One. Of. Them.

#brainfail

We are all statisticians now

or should be to a certain extent, if we take recently anointed Google numbers guru Hal Varian’s words to heart. The former economist (a very heavy maths-focused one at that) is frequently quoted as saying that statistician will be the next ‘sexy’ job (just like engineer was), but the line, from McKinsey, goes much deeper:

I keep saying the sexy job in the next ten years will be statisticians. People think I’m joking, but who would’ve guessed that computer engineers would’ve been the sexy job of the 1990s? The ability to take data—to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it—that’s going to be a hugely important skill in the next decades, not only at the professional level but even at the educational level for elementary school kids, for high school kids, for college kids. Because now we really do have essentially free and ubiquitous data. So the complementary scarce factor is the ability to understand that data and extract value from it.

I think statisticians are part of it, but it’s just a part. You also want to be able to visualize the data, communicate the data, and utilize it effectively. But I do think those skills—of being able to access, understand, and communicate the insights you get from data analysis—are going to be extremely important. Managers need to be able to access and understand the data themselves.

I recently started working my way through Ben Fry’s Visualizing Data, and adding Fry’s process to Varian’s list of skills shows some of the deep changes people need to make in order to embrace the new numeracy. Visualizing Data is more about Fry’s Processing language and how to hook it to datasets than it is about thinking visually or how to work through those datasets to find a pattern or evocative image, but it begins with a seven-step process:

ACQUIRE — Obtain the data, whether from a file on a disk or a source over a network.

PARSE — Provide some structure for the data’s meaning, and order it into categories.

FILTER — Remove all but the data of interest.

MINE — Apply methods from statistics or data mining as a way to discern patterns or place the data in mathematical context.

REPRESENT — Choose a basic visual model, such as a bar graph, list, or tree.

REFINE — Improve the basic presentation to make it clearer and more visually engaging.

INTERACT — Add methods for manipulating the data or controlling what features are visible.

This does a nice job of highlighting that Varian’s charge is a mix of skills for managers, practitioners, and interpreters alike. Some of the steps are naive or described in a way that invites unhealthy simplisticism (simplicity == good; simplisticism, the thing we often get instead of simple, is reductive, which is always bad). MINEing and REPRESENTing are the steps where numbers emerge into something living and actionable. MINE, as defined by Fry, is focused on software, rather than cognitive styles and elastic minds, for the generation of insights and pattern recognition. Certainly software is needed, but the hypotheses and candidate patterns you validate with the software come from soft eyes, something I blogged about a while ago. Similarly, REPRESENT is posed as choosing from a list of standard data tropes. But hey, it’s a software book and we all know Fry is more visual than that.

The real point is that this path shows a range of skills and validation even broader than what Varian points to. Someone working with someone working with data needs to know, understand, and respect the technical underpinnings of the first two steps, which set up the infrastructure of your entire data exercise. As with software, you need to measure twice, cut once here because this is the infrastructure of your inquiry and you won’t be able to change it quickly. Filter, mine, represent are subjects for another book perhaps, but they put you in the land of Tufte, Orwell, as well as Flowing Data and statistics: a mix of simple communication, humanities, and the techniques of numbers.

The last step, INTERACT, was also pretty interesting. I love how Fry reminds people to let the data grow with the audience by giving some interactivity. Sure, you do the first crack at it, but letting your audience go deeper, create their own juxtapositions, or simply play with the data gets them more engaged and allows for even more meaning to emerge from the data.
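To make the first five steps concrete, here is a minimal sketch in Python (not Fry’s Processing, and not code from the book). The function names and the 5-point filter threshold are my own hypothetical choices; the only data used is the set of age-bracket percentages quoted in the fivethirtyeight post elsewhere on this page. REFINE and INTERACT are left out because they are about presentation and interactivity rather than the data itself.

```python
import csv
import io

# ACQUIRE: a real project would read a file or a network source. Here the
# "source" is an inline string holding the age-bracket shares (in percent)
# quoted in the GWU/Battleground poll post on this page.
RAW = """bracket,poll,census
18-34,17,26
35-44,12,17
45-64,40,38
65+,31,19
"""


def acquire():
    return io.StringIO(RAW)


def parse(handle):
    # PARSE: give the raw text structure (named fields, numeric types).
    return [
        {"bracket": r["bracket"], "poll": int(r["poll"]), "census": int(r["census"])}
        for r in csv.DictReader(handle)
    ]


def filter_rows(rows, min_gap=5):
    # FILTER: keep only brackets where poll and census differ by at least
    # min_gap points (the threshold is an arbitrary, hypothetical choice).
    return [r for r in rows if abs(r["poll"] - r["census"]) >= min_gap]


def mine(rows):
    # MINE: a deliberately simple statistic, the poll-minus-census gap.
    return {r["bracket"]: r["poll"] - r["census"] for r in rows}


def represent(gaps):
    # REPRESENT: one plain-text bar per bracket, '+' for over-representation
    # in the poll and '-' for under-representation.
    lines = []
    for bracket, gap in gaps.items():
        bar = ("+" if gap > 0 else "-") * abs(gap)
        lines.append(f"{bracket:>5} {gap:+3d} {bar}")
    return "\n".join(lines)


gaps = mine(filter_rows(parse(acquire())))
print(represent(gaps))
```

Even at this toy scale the split is visible: ACQUIRE and PARSE fix the shape of the data, while everything after that is a judgment call about what to keep and how to show it.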

http://www.kipbot.com/blog/2008/03/05/dd-my-grad-school-footnote/

Paul Krugman’s Rules of Research

From his Nobel talk slides:

picture-1.png

The meaning of the first one was not immediately apparent to me, so I found a longer version of the rules, where Krugman explains: “Pay attention to what intelligent people are saying, even if they do not have your customs or speak your analytical language.”

Applies to many, many things.

Psychology of Polls

This election has been weird for me with the polls. The biggest riddle for me has been: why, in a race that the media describe as a dead heat, has John McCain been throwing Hail Mary passes, acting in such a desperate fashion? The most frequent answer is that the media wants drama so that people follow the news more closely, which I can buy, but I’m still kind of baffled and annoyed about the highly democratic (all polls are worth reporting) approach to polling data.

Weird moment of NY Times coverage in today’s/yesterday’s online version (interesting: I just realized that the phrase “today’s paper” is really twisted on-line. Long time coming, that thought). On the front page is an article about the campaigns in battleground states with the graphic:
picture-2.png

The image is one of the worst case scenarios for Democrats and pretty far removed from current thinking: no toss-up states, no recognition of polling showing some of the reds leaning blue. So, what’s the editorial thinking driving that? Is the Times trying to panic its liberal readers into reading that article? What role does polling data play in the news? I’m confused.

That graphic is a very grim, 2004 version of the election. In fact, if we assume that most readers and viewers are becoming quite familiar with shades of red and blue for barely, weakly, and strongly dem or repub and that they expect some neutral color to indicate toss-up, this graphic gives no indication that there are any battleground states.

Then there’s the Times’s ongoing electoral map graphic which appears in the right column of the Politics section:
picture-4.png

It’s even more jarring when the two items appear together:
picture-3.png

Poll reportage has been tricky for several elections: exit polls on the east coast have been thought to influence west coast voting, exit polls created the mis-announcement of the Florida winner in 2000, landslide vote predictions make people nervous about voter turnout. I’m having a hard time figuring out what role poll reportage is supposed to play in the election.

Good line from NYT Book Review

“The plural of anecdote is not data”
- from a review of Friedman’s new book

Numerati Generation Gap: Nate Silver & Dan Rather

Fun interview by Dan Rather of fivethirtyeight’s Nate Silver:

Some interesting things to note:

  • it’s fun to look at Dan Rather’s bemused near-smirk. You can just hear him thinking “you dork, why don’t you stick to baseball stats”
  • the number of times Rather refers to complex statistical methods for either the baseball work or the fivethirtyeight work
  • the psychohistory line about “any one game doesn’t matter” but when you hit a critical mass of data, in polls or stats, you can “find nuggets of wisdom”
  • There’s a weird thing going on in this discussion about stats and polling where some very simple math is being turned into high science. If you spend a little time looking at Baseball Prospectus, it’s all algebra. There may be some underlying techniques in the crunching of the numbers, like regression, but the formulas are pretty simple. fivethirtyeight is largely a question of weighting polls, based on some historical data. It’s just not that complicated. Silver’s dissection of the GWU/Battleground poll is barely even a dissection — he just looked at the methodology and saw that they over-indexed older voters! I’m starting to find it frightening how innumerate people are . . . or is it how illogical they are given that it’s middle school math level?

    Barnacles, Butterflies, and . . . Buffoons?

    I’m reading Numerati, a fun read about the rising importance of data and modelling (and a healthy antidote to some of the extremes of Super Crunchers). In general, the book has a better, less fetishistic tone, one that acknowledges the power of what’s going on, but keeps it real:

    The only folks who can make sense of the data are crack mathematicians, computer scientists, and engineers. They know how to turn the bits of our lives into symbols . . . [he has a nice jag about using index cards to keep track of dietary patterns, and how inefficient that would be. It’s a bit of humanizing text, but I don’t feel like typing it.] The key to this process is to find similarities and patterns. We humans do this instinctively, it’s how we figured out, long ago, which plants to eat and how to talk. But while some of us were focusing on more specific challenges, others were thinking more symbolically. I picture early humans sitting around a fire. Some, naturally, are jousting for the biggest piece of meat or busy with mating rituals. But off to the side, a select few are toying with stones thinking “if each of these pebbles represent one mammoth, then this rock . . . ”

    Somehow, those paleo-nerds playing with the stones instead of mating or eating meat managed to survive long enough to pass on their genes until, millions of years later, they could become the Hari Seldons of the 21st century.

    The key thread of the first fourth of the book (which is where I am, according to the impossible-to-count progress dots on my Kindle) is how people are trying to turn data points into meaningful models of people. The first test cases are supermarkets, where discount programs and smart carts are being deployed to gather data points about people. One of the first things that emerges is that there are customers who do too good a job of taking advantage of sales and promotions. These people, called “barnacles” by the numerati and marketers who really never intended for people to take advantage of sales, are the people who watch the movies they rent on Netflix, rather than let them sit on the coffee table collecting dust, or the people who actually go to the gym and try to live up to their New Year’s resolution or lower their blood pressure. These barnacles should be “fired” by retailers, as they drag down profits.

    On the other side, you have “butterflies”: “customers who drop in at the store on occasion, spend good money, and then flit away, sometimes for months or years on end.” Since they’re unreliable, it’s a waste of time to lavish courteous, much less fawning, treatment on them.

    I suppose that means that the most desirable customers are buffoons . . . those who don’t scrutinize, price-seek or use the products and services they buy and those who are easily ensnared in a seller’s field of gravity.

    It’s kind of fun to watch marketing lurch between respecting the customer’s individuality and trying to model them into flippable switches.

    Everything is SABERMetrics, even politics

    As part of my poll-obsessing, I finally checked out fivethirtyeight, recommended to me by Alex. Short version is that Nate Silver, the author of the site, is also a leader of Baseball Prospectus. He is credited with creating the very powerful PECOTA system, which rethinks baseball statistics (mostly through pure intelligence, but there is some math that exceeds the AD&D level) and in the process creates a much better explanatory and predictive tool. (It also played no small part in helping to create fantasy baseball’s popularity and even helped baseball make a comeback when people thought the fast-paced, pre-felonious NBA was going to surpass America’s pastime.)

    fivethirtyeight is, and I don’t think this is oversimplifying, doing for political polling what PECOTA did for baseball stats: finding truths by refining, critiquing, and improving simplistic polling data. Today’s post on the site was one of those aha moments:

    I have gotten an increasing number of questions about the GWU/Battleground Poll, which presently gives John McCain a 2-point national lead, even as essentially every other current national poll shows Barack Obama with a lead of at least 5 points.

    Just because a poll is an outlier doesn’t necessarily mean that it’s doing something wrong. Pollsters may have legitimate reasons for having a different perspective on the election, and they may also occasionally produce odd results due to chance alone.

    In this case, however, the poll seems to be making a relatively fundamental mistake: it is not weighting by age.

    For months, I’ve been wondering why the hell some polls have been reporting a neck and neck race, while others show Obama steadily gaining ground. (Even stranger, why on earth is the always admirable John McCain pulling such silly stunts, throwing Hail Marys, if it’s a dead heat?) Finally, someone explains it, and oh how bizarrely simple it turns out to be.

    For those who are curious, here’s the weighting of the battleground poll in question:

    18-34 17%
    35-44 12%
    45-64 40%
    65+ 31%

    Compared to the US Census/2004 election data:

    18-34 26%
    35-44 17%
    45-64 38%
    65+ 19%

    Pretty clear. This poll massively overrepresents older voters who, at a local polling level, have been averse to Obama for a variety of reasons, and massively underrepresents the younger voters who Obama has targeted in campaign activities and who are likely to respond to the post Baby-boomer voice he’s cultivated.
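For anyone who wants to see how little math "weighting by age" involves, here is a rough Python sketch. The age shares are the two sets of percentages listed above. The per-bracket margins are HYPOTHETICAL numbers I made up purely to show the mechanics; they are not from the GWU/Battleground poll or from fivethirtyeight, and the function names are my own. The weight for each bracket is just the census share divided by the share in the poll’s sample, which is a simple form of post-stratification and not necessarily the exact procedure any particular pollster uses.

```python
# Age-bracket shares quoted above, in percent.
poll_share = {"18-34": 17, "35-44": 12, "45-64": 40, "65+": 31}
census_share = {"18-34": 26, "35-44": 17, "45-64": 38, "65+": 19}


def bracket_weights(sample, target):
    """Weight per bracket: target (census) share divided by sample share."""
    return {k: target[k] / sample[k] for k in sample}


def topline(margin_by_bracket, shares):
    """Share-weighted average of a per-bracket margin (shares in percent)."""
    total = sum(shares.values())
    return sum(margin_by_bracket[k] * shares[k] for k in shares) / total


# HYPOTHETICAL Obama-minus-McCain margins by bracket, in points.
# Invented for illustration only; not real polling data.
hypothetical_margin = {"18-34": 20, "35-44": 5, "45-64": 0, "65+": -15}

weights = bracket_weights(poll_share, census_share)
as_sampled = topline(hypothetical_margin, poll_share)
reweighted = topline(hypothetical_margin, census_share)

for bracket, w in weights.items():
    print(f"{bracket:>5}: weight {w:.2f}")
print(f"as sampled: {as_sampled:+.2f}, reweighted: {reweighted:+.2f}")
```

With these made-up margins, the same responses flip from a slight McCain lead under the poll’s age mix to an Obama lead under the census mix. Younger brackets get weights above 1 and the 65+ bracket gets a weight well below 1, and that is the whole trick.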

    So simple, no math. Can’t tell if I’m impressed at the baseball-stats freaks or disgusted at the innumeracy of the media, or even literate newspaper-reading people.
