The Big Knowledge

Thursday, March 13, 2014

You Don't Need a Dashboard

For at least as long as I've been working in data analytics, the clamoring from the datarati for Dashboards! Dashboards! Dashboards! has consistently risen year over year in pitch and volume. Spotfire and Tableau produce two really popular products, and my company has at least three different "dashboard-style" plugins. This week I also discovered Kibana, which frankly looks really awesome, because it's free and dead-easy to set up if you already have ElasticSearch (and ES is also dead-easy to set up; however, as I discovered the hard way, beware of multiple people screwing around with ElasticSearch prototyping on the same subnet, as the out-of-the-box configuration will make everybody's ElasticSearch node automagically join into one cluster transparently, with obviously undesirable results. Data integration!)

The most common thing I hear about dashboards is that people want them to "spot trends". And, as far as I can tell: No, you don't. Not really. One golden truth I have learned from working in data analytics is this: If you cannot pose a concrete question that you would like to answer or a concrete problem that you would like to solve, then you're wasting your money. Dashboards do neither of these things, at least not in the way most people use them, which is often as command center props. The process of formulating a question in a way that can be answered using data is not simple, and hence visualizing data in ways that are essentially static will not answer meaningful questions. And by "static" I don't mean "has no temporal aspect". Charts that show values over time are still static if the things being plotted cannot be changed easily and intuitively. And, if you spend a lot of time changing the visualizations and exploring different hypotheses, then you're not using a "dashboard"; think about it: a car dashboard is a thing you look at to get an immediate read on something, like your speed or fuel level. How often do you switch your speed readout from MPH to KPH? So, if that's you: congrats! You're a data scientist! You should stop thinking about dashboards and start thinking about a real, scalable data analysis platform.

Thursday, March 6, 2014

From the Archives

Note: This is a repost from my old blog, explaining my departure from academia; this came up in response to some recent discussions about a candidate who was deciding between attending CS grad school and getting a job, and how to dissuade them from the former, which led here:
http://anothersb.blogspot.co.uk/2014/02/goodbye-academia.html
Several people asked me to repost this in a place where they could read it, so here it is.

In case you've been coming here day after day, wondering why I have stopped posting, and where in the world is Carmen Sandiego, I thought I would update my (rapidly shrinking) fan base with my whereabouts: for almost a year, with the prospect of my funding drying up, and no publication in easy sight, I spent some time taking stock of my options, and deciding what I wanted to do with the rest of my life. And, the fact of the matter is: I was bored. I was unhappy. I really think single molecule biophysics is awesome, and fun, and there's great stuff happening. But I had discarded enough plastic pipette tips to last a lifetime, and was finding it increasingly difficult to care about the rate constant for phosphate release of E. coli RNA polymerase in the presence of XYZ. Additionally, I had grown to really love the Bay Area, and didn't want to leave my friends, my girlfriend, and my hang glider behind.

My options, as I saw it, were:

Keep on keepin' on, and try to find a professorship at the University of Wallamaloo, or wherever I could, and hope that things would be more fun and interesting as a mid-grade intellectual at a mid-grade university.
Back out, and try to find another postdoctoral appointment doing something completely different, and hope that, four years down the line, at 38, I wasn't totally burnt out on that as well. I considered computational evolution, or even getting a masters in EE or CS, and seeing where that took me.
Get a real job.

I talked to a lot of people, and received some encouragement from some corners, most notably from Ben Ovryn at AECOM who strongly encouraged me to not give up on academic science. Just as conspicuously, I received no such encouragement from my research advisor, who, when I discussed my options with him, basically shrugged.

I applied for biotech jobs, software jobs, and a professorship at City College of New York, because I thought it would be fun to live there and teach there. I considered taking a year off to write a book, or become a professional hang glider pilot, or both. I thought of opening an artisanal sandwich cart in San Francisco with a friend of mine, because, let's face it, you can't get a decent deli-style sandwich in the Bay Area. In the end, I had a job offer from a high-flying biotech startup that would have required me to move to the east coast, and a job offer to work at a friend's software company, and I chose the latter.

So, this is where I am. I'm currently employed as a Forward Deployed Engineer at Palantir Technologies in Palo Alto. The software is incredible, the people are amazingly smart and fun, and my group is mostly comprised of Ph.D.s who left science to try something else, and wound up here. I do a bit of everything: I project manage, I code, I do some outreach, I integrate data, and I sometimes look for new and interesting ways to use our product. I've been here for four months, and it's been pretty much non-stop excitement.

I've learned a lot. First, that there are a lot of really smart people out in the private sector, if you look in the right places. Scary smart people, the kind that academics will tell you don't work in the private sector. Second, I've realized that many of the dysfunctional relationships I had in the academic world were not actually due to my personality flaws, but were largely due to the peculiar culture that tolerates (and in some cases rewards) dysfunctional interpersonal relationships in academia. It's refreshing to work with people who are smart, engaged, enthusiastic, and who genuinely want to work together to create something worthwhile and powerful. I think, to some extent, the archetypal academic interaction is the pissing contest, where people jockey for status, because status is the only currency in the academic world. All other forms of interaction are subordinate to the pissing contest. It's refreshing to step away from that world. And, third, it took me two or three months out of academia to realize how really bored I was with what I was doing. It's not that I think it's intrinsically boring; it's just that it wasn't really driving me to do more and accomplish more, but I had had myself convinced that this was the way to go, that this was interesting because everybody else said it was. With some hindsight, I can see that if I had found it really that fascinating, I would have been eager to get up and get in there to do more. And there just wasn't that drive, and it was making me miserable.

So, I'm here, I'm finally liking what I'm doing, and I'm liking the people I'm doing it with. I'm getting up in the morning excited to come over here and face the challenges of the day. I'm still advising the grad student who's following up on my work a little bit, and I'm even doing some consulting for a biotech startup in Silicon Valley, just to stay in the game for fun. And, if I ever start to get bored with what I'm doing, I'll remember what it feels like, and I'll do something else. I don't know if I'll keep updating this blog again. Now that I've come back and gotten the long-overdue explanation out of the way, I may just post little sciencey tidbits here and there to amuse myself. We'll see.

Tuesday, January 14, 2014

The Civilized Soldier

"The civilized soldier when shot recognizes that he is wounded and knows that the sooner he is attended to the sooner he will recover. He lies down on his stretcher and is taken off the field to his ambulance, where he is dressed or bandaged. Your fanatical barbarian, similarly wounded, continues to rush on, spear or sword in hand; and before you have the time to represent to him that his conduct is in flagrant violation of the understanding relative to the proper course for the wounded man to follow—he may have cut off your head." --Sir John Ardagh, discussing the necessity of "dum dums" (or expanding bullets) for international warfare, at the Hague Convention of 1899, which subsequently banned their use for international conflicts. Via Wikipedia.

Monday, November 18, 2013

Reflections from Bouldering

I've been spending a lot of time bouldering recently, a few times a week. Besides strongly incentivizing me to lose 10 lbs, I have started to learn some interesting lessons, the hard way.

Indoor bouldering is like rock climbing, but the highest it gets is about 17 feet, the floors are padded about two feet thick, and there are no ropes. That means I can show up whenever I want, alone, and climb for as little or as much time as I want, and not need someone to belay me. It also means that the first few times you climb, it can be pretty unnerving because when you fall, you just fall, boom, onto the mat. In fact I noticed that as I grew more tired during climbing, if I thought I wasn't going to be able to make it to the top, I would frequently climb or jump down while I still had control, even if I had some power left, because I wanted to avoid being all the way at the top and having no strength left, forcing me to fall uncontrolled from the top, as opposed to falling in a controlled way from halfway up. But, after you fall from the top a few times, this turns out to be a mistake: falling from the top doesn't hurt. That's why they let you do it and don't get sued too often (although, the place is actually blanketed in cameras, in deference to our tort-happy society: just in case someone does something stupid and sues, they have you on record.)

But, the more interesting thing that I discovered was that the barrier to failure was often simply exhaustion rather than skill. And this has a particularly interesting consequence: often, the best next move is making the next move. As a beginner, your instinct is to stop at each hold, look around, and see where the next move is. Which hold can you reach without falling over? But, watching skilled climbers in the gym, I saw that they do it differently: first, they study the route before they start climbing. Then, once they're on their way up, they move gracefully and smoothly from one hold to another, and importantly, they keep moving. While you're stopped, looking around, your arms are growing tired, your tendons are aching, the skin on your fingers is starting to grate under the hand holds. And what I found, through brutal trial and error, was that I was much more consistently successful if I just kept moving. In an indoor bouldering gym, the holds are laid out somewhat logically, but also somewhat deviously, so that it's not always obvious what the solution is. But your brain moves pretty quickly, and without even realizing you're doing it, you're not even considering 9 out of the next 12 possible moves. The extra 5 seconds that it takes per move to decide amongst those three remaining moves is probably the difference between near-complete exhaustion and complete exhaustion. And complete exhaustion means failure. I'm a big fan of stopping and thinking about what you're doing, but the lesson is, when you're resource constrained and time is not on your side, don't think too hard. You might back yourself into a corner, but it's no worse than falling on your ass.

Friday, November 1, 2013

We Apologize for the Interruption in our Interrupted Programming

Not much call for blogging these days; most of the interesting Data Blogging Topics(tm) have been around the Snowden NSA leaks, and I've been trying to slog through a number of other things at work, so it's hard to find the energy to get invested in it. But, I wanted to stop by the blog and give you a brief update, to the assembled masses who may read this later (and, I have found out the hard way, blogs left unattended can come back and bite you in the ass.)

When the Snowden/NSA leaks first started coming out, the scope was pretty limited. Phone call metadata logging was the big topic, and my comments were primarily technical in nature. My decision to not express a particular opinion on the politics might have been construed as a tacit approval, or at least a lack of outrage, and I think the latter was probably not far off.

In the meantime, however, a lot more things have come out, such as the fact that the NSA has p0wnz0red the entire internet, and we've been eavesdropping on foreign heads of state and American citizens for essentially no reason. So, I wanted to update, for clarity, my feelings: this is odious, un-American, and a fundamental breach of the public trust. The excellent New Yorker article about Alan Rusbridger, the editor of The Guardian, indicates that we're still just seeing the tip of the iceberg; the only limiting factor is how fast the journalists can process and understand the documents they've been given*. If you want to read more, the incomparable Bruce Schneier is your go-to source, and I truly couldn't hope to add anything.

Sadly, in my search for hyperbole to compare this with, my mind goes back only as far as the buildup to the Iraq War, and it's hard for me to draw a comparison really: it's apples to oranges. The Iraq War was a breach of the public trust in a fundamental way, but it involved a lie which resulted in the deaths of over 100,000 humans. It's hard for me to draw a meaningful comparison there that doesn't minimize those deaths. But the consequences of fundamentally weakening the internet are hard to grasp in their full scope. The ripples from this tidal wave will continue to leave marks in the sand well into the next generation, and only time will tell.

*The article refers also to Rusbridger's memoir in which he interleaves his Herculean work publishing the Snowden leaks with his year-long struggle to master a particularly difficult work by Chopin. I immediately thought, "He must not have any children," but of course, we find out, he does. It is times like this that I am reminded of the late great David Foster Wallace's characterization of Wilhelm Leibniz in one of my favorite books ever written, "Everything and More: A Compact History of Infinity". He describes Leibniz, one of the inventors of the calculus, as "a lawyer/diplomat/courtier/philosopher for whom math was sort of an offshoot hobby", which he tags, in typical David Foster Wallace fashion, with a footnote, saying only, "Surely, we all hate people like this."

Wednesday, September 25, 2013

Less Talky, More Hacky

Light posting lately. I'm spending my time learning MongoDB, CouchDB, jQuery, Bootstrap, and node.js, in pursuit of various projects. Gotta get hip with the web technologies, dontchaknow.

I was an invited speaker at Sibos last week in Dubai, and spoke on money laundering and counter-fraud in commercial banking, and specifically how the same types of data come up over and over again, so it doesn't make sense to field a different platform for each problem. My talk was apparently well received.

Thursday, September 12, 2013

Theory Thursday: The Central Limit Theorem

Back when I was employed researching the mysteries of the universe by pipetting lots of stuff, I used to say that physics was the study of things that are pointy in the middle and small on the ends, so that we can ignore the ends. Essentially, the idea in much of science is to figure out how to boil a phenomenon down to a few variables that have reasonably well defined values, i.e., a mean or average value. All measured variables have a distribution, but not all distributions are pointy in the middle, and only for distributions that are pointy in the middle does it make sense to calculate the average value. As a classic counterexample, the power-law distribution doesn't have a pointy middle: it is tallest at the smallest values and trails off into a long tail.

You can still, of course, calculate the mean value of this distribution by summing over all the values and dividing by the number of values. But the point is that it won't mean much intuitively: "most" of the values won't be "around" the mean value. They're all over the place. So, if we want to be able to talk about measuring a "variable", we'd like it to have a peak in the middle. In particular, it would be really handy if our distribution was a Normal Distribution (aka, the famous Bell Curve, or Gaussian, after the legendary mathematician and physicist Carl Friedrich Gauss.)
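If you want to see the "mean doesn't mean much" problem for yourself, here's a rough sketch in Python. It draws samples from a Pareto distribution (one common power law) and compares the mean to the median. The shape parameter, sample size, and seed are arbitrary choices of mine for illustration, not anything special.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the run is repeatable

# Assumed, illustrative parameters: a shape of 1.1 gives a very heavy tail.
ALPHA = 1.1
N = 100_000

# random.paretovariate draws from a Pareto (power-law) distribution
# whose smallest possible value is 1.
samples = [rng.paretovariate(ALPHA) for _ in range(N)]

mean_value = statistics.fmean(samples)
median_value = statistics.median(samples)

# Fraction of samples that fall below the mean.
share_below_mean = sum(1 for x in samples if x < mean_value) / N

print(f"mean   = {mean_value:.2f}")
print(f"median = {median_value:.2f}")
print(f"share of samples below the mean = {share_below_mean:.1%}")
```

The mean gets dragged upward by a handful of enormous values, so the large majority of samples sit well below it. For a pointy-in-the-middle distribution, about half the samples would be below the mean.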

A normal distribution has a couple of very nice properties that make math a lot easier:

The mean, median, and mode are all the same.
It's mathematically tractable to work with and has a simple form.

Luckily for us, the Central Limit Theorem has our back. What it says, basically, is that if you add up a whole bunch of independent random variables, what you get out will probably* be pretty close to a normal distribution. And this is good news for people who like things high in the middle and flat on both ends**. Most real processes in the world are the result of a bunch of sub-processes, each of which has its own distribution. For instance, the average number of fish in a lake may depend on the average rainfall, the average temperature, the average number of fishermen, and the average amount of food, each of which in turn is affected by a number of other variables. When we mush these all together, things tend towards a normal distribution, which lets us deal with most natural processes in a tractable way mathematically, giving us a universe in which many things of interest have well defined average values, because they're peak-y.
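Here's a quick way to watch the mushing-together happen, sketched in Python. It sums up lopsided (exponential) random variables and checks one fingerprint of a normal distribution: about 68% of the values should land within one standard deviation of the mean. The choice of exponential variables, the 50 terms per sum, and the helper names are all my own assumptions for the sake of the example; this is a sanity check, not a proof.

```python
import random
import statistics

def sample_sums(n_terms, n_samples, rng):
    """Return n_samples values, each the sum of n_terms independent
    exponential random variables (a lopsided, non-normal distribution)."""
    return [
        sum(rng.expovariate(1.0) for _ in range(n_terms))
        for _ in range(n_samples)
    ]

def fraction_within_one_sd(values):
    """Fraction of values within one standard deviation of their mean."""
    mu = statistics.fmean(values)
    sd = statistics.pstdev(values)
    inside = sum(1 for v in values if abs(v - mu) <= sd)
    return inside / len(values)

rng = random.Random(0)  # fixed seed so the run is repeatable

single = sample_sums(1, 20_000, rng)   # one variable on its own
summed = sample_sums(50, 20_000, rng)  # 50 variables added together

frac_single = fraction_within_one_sd(single)
frac_summed = fraction_within_one_sd(summed)

print(f"single variable:  {frac_single:.1%} within one standard deviation")
print(f"sum of 50:        {frac_summed:.1%} within one standard deviation")
print("normal benchmark: about 68%")
```

On its own, the exponential variable misses the 68% benchmark by a wide margin. Add fifty of them together and the result lands close to it, even though nothing about the ingredients was bell-shaped.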

*Without getting too deep in the weeds, this is true assuming your distributions have both a finite mean and a finite variance. Some power-law distributions do not have a finite variance, because they have what's called a "fat tail": basically, they don't converge to zero fast enough, so there's lots of stuff way out towards infinity. If all your variables are like this, you're in trouble. Luckily for us, the real world is mostly composed of things that have finite variance.
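To see what "in trouble" looks like, here's a sketch in Python that swaps in the Cauchy distribution, a standard fat-tailed example with no finite mean or variance (my choice for illustration, since it's easy to sample; it isn't one of the distributions discussed above). It averages 100 draws at a time and measures the spread of those averages using the interquartile range. The helper names and sample sizes are assumptions of mine.

```python
import math
import random
import statistics

def draw_cauchy(rng):
    """Standard Cauchy draw via the inverse CDF; no finite mean or variance."""
    return math.tan(math.pi * (rng.random() - 0.5))

def draw_uniform(rng):
    """Well-behaved comparison: uniform on [-1, 1], finite variance."""
    return rng.uniform(-1.0, 1.0)

def iqr_of_averages(draw, n_terms, n_samples, rng):
    """Interquartile range of n_samples averages, each over n_terms draws."""
    averages = [
        sum(draw(rng) for _ in range(n_terms)) / n_terms
        for _ in range(n_samples)
    ]
    q1, _, q3 = statistics.quantiles(averages, n=4)
    return q3 - q1

rng = random.Random(7)  # fixed seed so the run is repeatable

uniform_iqr_1 = iqr_of_averages(draw_uniform, 1, 5_000, rng)
uniform_iqr_100 = iqr_of_averages(draw_uniform, 100, 5_000, rng)
cauchy_iqr_1 = iqr_of_averages(draw_cauchy, 1, 5_000, rng)
cauchy_iqr_100 = iqr_of_averages(draw_cauchy, 100, 5_000, rng)

print(f"uniform: spread of 1 draw = {uniform_iqr_1:.3f}, "
      f"of 100-draw averages = {uniform_iqr_100:.3f}")
print(f"cauchy:  spread of 1 draw = {cauchy_iqr_1:.3f}, "
      f"of 100-draw averages = {cauchy_iqr_100:.3f}")
```

For the uniform draws, averaging 100 at a time squeezes the spread down dramatically, which is the Central Limit Theorem doing its job. For the Cauchy draws, the averages are just as spread out as a single draw: collecting more data buys you nothing.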

**As opposed to Ohio, which is high in the middle and round on both ends.