The Big Knowledge: 2013

Monday, November 18, 2013

Reflections from Bouldering

I've been spending a lot of time bouldering recently, a few times a week. Besides strongly incentivizing me to lose 10 lbs, I have started to learn some interesting lessons, the hard way.

Indoor bouldering is like rock climbing, but the highest it gets is about 17 feet, the floors are padded about two feet thick, and there are no ropes. That means I can show up whenever I want, alone, and climb for as little or as much time as I want, and not need someone to belay me. It also means that the first few times you climb, it can be pretty unnerving because when you fall, you just fall, boom, onto the mat. In fact I noticed that as I grew more tired during climbing, if I thought I wasn't going to be able to make it to the top, I would frequently climb or jump down while I still had control, even if I had some power left, because I wanted to avoid being all the way at the top and having no strength left, forcing me to fall uncontrolled from the top, as opposed to falling in a controlled way from halfway up. But, after you fall from the top a few times, this turns out to be a mistake: falling from the top doesn't hurt. That's why they let you do it and don't get sued too often (although, the place is actually blanketed in cameras, in deference to our tort-happy society: just in case someone does something stupid and sues, they have you on record.)

But, the more interesting thing that I discovered was that the barrier to failure was often simply exhaustion rather than skill. And this has a particularly interesting consequence: often, the best next move is making the next move. As a beginner, your instinct is to stop at each hold, look around, and see where the next move is. Which hold can you reach without falling over? But, watching other skilled climbers in the gym, they do it differently: first, they study the route before they start climbing. Then, once they're on their way up, they move gracefully and smoothly from one hold to another, and importantly, they keep moving. While you're stopped, looking around, your arms are growing tired, your tendons are aching, the skin on your fingers is starting to grate under the hand holds. And what I found, through brutal trial and error, was that I was much more consistently successful if I just kept moving. In an indoor bouldering gym, the holds are laid out somewhat logically, but also somewhat deviously, so that it's not always obvious what the solution is. But your brain moves pretty quickly, and without even realizing you're doing it, you're not even considering 9 out of the next 12 possible moves. The extra 5 seconds that it takes per move to decide amongst those three remaining moves is probably the difference between near-complete exhaustion and complete exhaustion. And complete exhaustion means failure. I'm a big fan of stopping and thinking about what you're doing, but the lesson is, when you're resource constrained and time is not on your side, don't think too hard. You might back yourself into a corner, but it's no worse than falling on your ass.

Friday, November 1, 2013

We Apologize for the Interruption in our Interrupted Programming

Not much call for blogging these days; most of the interesting Data Blogging Topics(tm) have been around the Snowden NSA leaks, and I've been trying to slog through a number of other things at work, so it's hard to find the energy to get invested in it. But I wanted to stop by the blog and give a brief update to the assembled masses who may read this later (and, I have found out the hard way, blogs left unattended can come back and bite you in the ass.)

When the Snowden/NSA leaks first started coming out, the scope was pretty limited. Phone call metadata logging was the big topic, and my comments were primarily technical in nature. My decision to not express a particular opinion on the politics might have been construed as a tacit approval, or at least a lack of outrage, and I think the latter was probably not far off.

In the meantime, however, a lot more things have come out, such as the fact that the NSA has p0wnz0red the entire internet, and we've been eavesdropping on foreign heads of state and American citizens for essentially no reason. So, I wanted to update, for clarity, my feelings: this is odious, un-American, and a fundamental breach of the public trust. The excellent New Yorker article about Alan Rusbridger, the editor of The Guardian, indicates that we're still just seeing the tip of the iceberg; the only limiting factor is how fast the journalists can process and understand the documents they've been given*. If you want to read more, the incomparable Bruce Schneier is your go-to source, and I truly couldn't hope to add anything.

Sadly, in my search for hyperbole to compare this with, my mind goes back only as far as the buildup to the Iraq War, and it's hard for me to draw a comparison really: it's apples to oranges. The Iraq War was a breach of the public trust in a fundamental way, but it involved a lie which resulted in the deaths of over 100,000 humans. It's hard for me to draw a meaningful comparison there that doesn't minimize those deaths. But the damage from fundamentally weakening the internet is hard to grasp, in both its scope and its consequences. The ripples from this tidal wave will continue to leave marks in the sand well into the next generation, and only time will tell.

*The article refers also to Rusbridger's memoir in which he interleaves his Herculean work publishing the Snowden leaks with his year-long struggle to master a particularly difficult work by Chopin. I immediately thought, "He must not have any children," but of course, we find out, he does. It is times like this that I am reminded of the late great David Foster Wallace's characterization of Wilhelm Leibniz in one of my favorite books ever written, "Everything And More: A Compact History of Infinity". He describes Leibniz, one of the inventors of the calculus, as "a lawyer/diplomat/courtier/philosopher for whom math was sort of an offshoot hobby", which he tags, in typical David Foster Wallace fashion, with a footnote, saying only, "Surely, we all hate people like this."

Wednesday, September 25, 2013

Less Talky, More Hacky

Light posting lately. I'm spending my time learning MongoDB, CouchDB, jQuery, Bootstrap, and node.js, in pursuit of various projects. Gotta get hip with the web technologies, dontchaknow.

I was an invited speaker at Sibos last week in Dubai, and spoke on money laundering and counter-fraud in commercial banking, and specifically how the same types of data come up over and over again, so it doesn't make sense to field a different platform for each problem. My talk was apparently well received.

Thursday, September 12, 2013

Theory Thursday: The Central Limit Theorem

Back when I was employed researching the mysteries of the universe by pipetting lots of stuff, I used to say that physics was the study of things that are pointy in the middle and small on the ends, so that we can ignore the ends. Essentially, the idea in much of science is to figure out how to boil a phenomenon down to a few variables that have reasonably well defined values, i.e., a mean or average value. All measured variables have a distribution, but not all distributions are pointy in the middle, and only for distributions that are pointy in the middle does it make sense to calculate the average value. As a classic counterexample, the power-law distribution doesn't have a pointy middle:

You can still, of course, calculate the mean value of this distribution by summing over all the values and dividing by the number of values. But the point is that it won't mean much intuitively: "most" of the values won't be "around" the mean value. They're all over the place. So, if we want to be able to talk about measuring a "variable", we'd like it to have a peak in the middle. In particular, it would be really handy if our distribution was a Normal Distribution (aka, the famous Bell Curve, or Gaussian, after the legendary mathematician and physicist Carl Friedrich Gauss.)
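To put rough numbers on "most of the values won't be around the mean", here is a small simulation sketch. It is an illustration only: the Pareto exponent of 1.2, the bell curve centered on 10, and the helper name `summarize` are all assumptions picked for the demo, not anything measured.

```python
import random
import statistics

def summarize(samples):
    """Return (mean, median, share of samples that fall below the mean)."""
    mean = statistics.fmean(samples)
    median = statistics.median(samples)
    share_below = sum(x < mean for x in samples) / len(samples)
    return mean, median, share_below

rng = random.Random(42)  # fixed seed so the run is repeatable
N = 100_000

# Assumed exponent: small enough that the tail is fat (finite mean, infinite variance).
power_law = [rng.paretovariate(1.2) for _ in range(N)]
# Assumed bell curve: mean 10, standard deviation 2.
bell_curve = [rng.gauss(10, 2) for _ in range(N)]

for name, samples in (("power law", power_law), ("bell curve", bell_curve)):
    mean, median, share_below = summarize(samples)
    print(f"{name}: mean={mean:.2f} median={median:.2f} below mean={share_below:.0%}")
```

For the bell curve, the mean and median land on top of each other and about half the samples sit on either side. For the power law, a handful of enormous values drag the mean far above the median, so the large majority of samples sit below the "average".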

A normal distribution has a couple of very nice properties that make math a lot easier:

The mean, median, and mode are all the same.
It's mathematically tractable to work with and has a simple form.

Luckily for us, the Central Limit Theorem has our back. What it says, basically, is that if you take a whole bunch of independent random variables and add them together, what you get out will probably* be pretty close to a normal distribution. And this is good news for people who like things high in the middle and flat on both ends**. Most real processes in the world are the result of a bunch of sub-processes, each of which has its own distribution. For instance, the average number of fish in a lake may depend on the average rainfall, the average temperature, the average number of fishermen, and the average amount of food, each of which in turn is affected by a number of other variables. When we mush these all together, things tend towards a normal distribution, which lets us deal with most natural processes in a tractable way mathematically, giving us a universe in which many things of interest have well defined average values, because they're peak-y.
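The "mushing together" is easy to try out. The sketch below is an illustration under assumptions: it uses flat uniform(0, 1) random numbers as the building blocks (about as un-pointy as a distribution gets), and the function name `share_within_one_sigma` is made up for the demo. It adds up more and more of them and checks how much of the result falls within one standard deviation of the mean, which for a true normal distribution is about 68.3%.

```python
import random

def share_within_one_sigma(n_terms, n_samples, rng):
    """Sum n_terms uniform(0, 1) draws, n_samples times, and return the
    fraction of sums that land within one standard deviation of the mean."""
    mean = n_terms * 0.5               # a uniform(0, 1) draw has mean 1/2
    sigma = (n_terms / 12) ** 0.5      # and variance 1/12; variances add
    hits = 0
    for _ in range(n_samples):
        total = sum(rng.random() for _ in range(n_terms))
        if abs(total - mean) <= sigma:
            hits += 1
    return hits / n_samples

rng = random.Random(0)  # fixed seed so the run is repeatable
shares = {n: share_within_one_sigma(n, 20_000, rng) for n in (1, 2, 30)}
for n, share in shares.items():
    print(f"{n:>2} terms summed: {share:.3f} within one sigma (normal: 0.683)")
```

A single uniform variable gives roughly 0.58, two summed give roughly 0.65, and thirty summed are already very close to the normal value.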

*Without getting too deep in the weeds, this is true assuming your distributions have both a finite mean and a finite variance. Some power-law distributions do not have a finite variance, because they have what's called a "fat tail": basically, they don't converge to zero fast enough, so there's lots of stuff way out towards infinity. If all your variables are like this, you're in trouble. Luckily for us, the real world is mostly composed of things that have finite variance.

**As opposed to Ohio, which is high in the middle and round on both ends.

Wednesday, September 11, 2013

Fingerprint Scanners and Network Privacy Effects

Yesterday, I had some snark for the assertion that Apple using biometric identification in a consumer product amounted to them taking your fingerprints "against your will". I also considered the ethical aspects of whether your neighbor's privacy choices affect yours. But from a technical perspective, I find myself still very interested in Jacob Appelbaum's assertion that this will have an impact on overall privacy (or, specifically, his privacy) via "network effects," and found myself thinking through what this might mean. What follows is a probably overly pedantic analysis of the idea of privacy network effects in general.

First, let's define what a "network effect" is in this context: technically, network effects of technology are ways in which the adoption of a technology by someone else makes that technology more or less valuable for me. As an example of a positive network effect, e-mail is more valuable if more people use it, because I can reach more people using e-mail. An example of a negative network effect is traffic or network congestion: the more people who use cars, the more traffic I have to contend with. I think, technically, we would construe a network effect on privacy for the iPhone fingerprint scanner to be one in which adoption of the device by others reduces @ioerror's privacy if he uses the same device. However, I think we can safely conclude that @ioerror won't be using an iPhone 5S, or if he does, he'll use a sharpie to disable the fingerprint scanner. So, more broadly construed, we might consider network effects in which other people's adoption of the iPhone 5S reduces @ioerror's privacy, or even more generally, reduces the privacy of other people who don't use the phone in general.

It's important to distinguish this from simple consumer choice: there may be an overall reduction in privacy because of people's choice to use the iPhone 5S fingerprint scanner, but they may make that choice entirely based on considerations of convenience. This is an important distinction because, in the absence of network effects, it means that there's effectively no moral angle to the fingerprint scanner: the very fact that a large market exists for such devices means that community standards accept such choices as valid*.

There are a few mechanisms by which we can imagine privacy network effects being propagated. I think it's clear from context that the case that @ioerror is worried about is the normalization of biometric identification: including fingerprint scanners in phones which lots of people use will make people more complacent about fingerprint scanners in general. Is there evidence for this? There are certainly a lot of cases of the public accepting lower privacy standards for specific purposes. For instance, when TSA imposed full-body scanning at airports, a lot of people shrugged and walked through the scanners. But, there was no obviously identifiable network effect: we didn't start to see full-body scanners replace metal detectors at federal buildings or schools (although it may be too soon to tell.) It may be the case that there are downstream effects: have we seen a profusion of metal detectors in public places (ball games, emergency rooms, schools) in general? Probably; I can't find statistics on this, but casual observation strongly suggests it. Is there a case to be made that this is due to normalization of security technology into our everyday lives? Again, very possibly. But is that due to a network effect, or simply due to government policy and heightened media attention? That is much harder to establish.

Perhaps a more compelling example is the profusion of sites that now let you log in using your Facebook ID instead of tracking logins on their own. As such logins become more common, it's easier to shrug at the (very real) privacy considerations of linking your Facebook account to each additional site. An important difference between these two cases is cost: metal detectors and full body scanners are expensive. Software is cheap. Which leads us to a second potential mechanism for network effects: By including such devices in their mass produced phones, Apple will effectively bring the cost of such devices down to the point where other phone manufacturers may start using them, and it may come to the point where it is difficult to buy a smart phone without one. This, I think, is a much more easily demonstrated mechanism of network effect. However, both are highly indirect: the adoption of the technology by party A does not directly impact party B's privacy: it's only through a very indirect set of policy, economic, and attitude changes that such effects could be propagated, and it's far from clear that these effects are even close in magnitude to the simple market demand for such devices.

Then, of course, it bears questioning: how would fingerprint scanners actually impact our privacy? First, there's what I call The Strong Hypothesis:

I've updated the iPhone fingerprint scanner diagram with an important detail Apple left out. pic.twitter.com/k63flx8y6j
— Dylan Tweney (@dylan20) September 10, 2013

The Strong Hypothesis is that the NSA will gather fingerprints en masse from iPhones and other devices, then use them to create a national database. Six months ago I would have rated this tinfoil-hat-silly. But, of course, the revelations of the last few weeks make a lot of us look pretty silly for thinking that way, so it's no longer possible to simply disregard that possibility out of hand.

The Weak Hypothesis is that, for instance, the FBI will be able to subpoena your fingerprints from Apple in order to compare them against fingerprints they've collected, when previously they would have had no way of getting your fingerprints short of hauling you in. Whether or not this could happen depends a lot on how the technology is managed, and it seems more likely than not that Apple will store the fingerprint data on the device in a way that precludes remote access. But Apple has done stupider things before, and trusting their commitment to privacy is probably not a good strategy.

So, to close the loop, a network-effects-privacy-impact might look something like this: Apple's introduction of the fingerprint reader to the iPhone 5S lowers cost and social barriers to similar devices, and we start seeing fingerprint scanners not just on phones, but on laptops, cars, at the airport, and even at the gym. Oh, wait...

*The same argument doesn't necessarily hold for things like cigarettes though: sale and consumption of cigarettes imposes externalities on people who don't consume them, in the form of second-hand smoke, and increased public healthcare costs. Even if the sale of cigarettes is evidence that the community approves of cigarette consumption in and of itself, the effects on others have highly complicating moral effects. This is why it's important to establish whether there are network effects in deciding whether there's a moral aspect.

Tuesday, September 10, 2013

Fingerprint Panic!

So, it sounds like the shiniest new iPhone will have a fingerprint scanner for security. Bruce Schneier, naturally, has some interesting and relevant technical considerations which he voices, in particular, about what happens if Apple decides to store fingerprint data in the cloud. But it seems like there should be a secure way to do this: nobody is realistically going to actually store an image of the fingerprint, not even on the phone itself. Instead, they'll store a hash of some set of metrics derived from the fingerprint. If you add salt to the hash, and the salt is stored on the phone, then you still need an unlocked physical device for the hash to be useful at all; the cloud-stored version is useless, just like a salted password file.
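As a rough sketch of the salted-hash idea in the paragraph above: the code below is illustrative only and says nothing about how Apple actually stores fingerprints. The function names, the byte string standing in for "metrics derived from the fingerprint", and the iteration count are all assumptions. It also assumes the derived metrics come out byte-for-byte identical on every scan, which real fingerprint readings do not, so a practical system would need something more forgiving than an exact hash comparison.

```python
import hashlib
import hmac
import secrets

ITERATIONS = 100_000  # assumed work factor, chosen only for illustration

def enroll(template):
    """Return (salt, digest). The salt is meant to stay on the phone."""
    salt = secrets.token_bytes(16)
    digest = hashlib.pbkdf2_hmac("sha256", template, salt, ITERATIONS)
    return salt, digest

def matches(template, salt, digest):
    """Re-derive the digest from template and salt, compare in constant time."""
    candidate = hashlib.pbkdf2_hmac("sha256", template, salt, ITERATIONS)
    return hmac.compare_digest(candidate, digest)

# Hypothetical stand-in for the metrics derived from a fingerprint.
template = b"ridge-count=14;whorl=left;minutiae=37"

salt_on_phone, digest_in_cloud = enroll(template)

# Right finger plus the phone's salt: the stored digest can be checked.
print(matches(template, salt_on_phone, digest_in_cloud))
# Right finger but a guessed salt: the cloud copy alone confirms nothing.
print(matches(template, secrets.token_bytes(16), digest_in_cloud))
```

The design point is the split: the digest can live anywhere, but without the salt that never leaves the device it cannot be matched against a candidate fingerprint.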

On the other end of the spectrum, there's this:

Allow me to suggest, for the paranoid, a few practical steps to be taken here to foil The Man:

Don't buy an iPhone. Problem solved.
If you absolutely must play Angry Birds, turn off the fingerprint reader and use a passcode. Better yet, use a non-numeric passcode.
If you don't believe that turning off the fingerprint scanner will foil the NSA's backdoor into your phone, try using a Sharpie to color over the fingerprint scanner window.

With respect to the network effects: what do you care if someone chooses convenience over privacy? People do it all the time, with their Safeway club card, their credit card, their choice to go through the scanner instead of get a pat-down at the airport, etc. Privacy is a very personal decision. Some people crave it, and it's their constitutional right which I support staunchly. But lamenting the "network" or "societal" effects of other people choosing security, convenience, fame, or money over privacy makes you little different than a church pastor denouncing the gay lifestyle because of the effect it will have on children. Society as a whole is constantly making decisions about their personal trade offs of privacy versus convenience, and you can always go Galt and peace out to a cabin in the woods if the unwashed masses refuse to hear your speech. You might want to give up tweeting if you go that route though: you give up far more privacy in practical terms through Twitter, Google, and Facebook than you would from a fingerprint scanner on your phone.

Thursday, September 5, 2013

What I Did With My Summer Vacation

This was my third year attending the Burning Man festival in the Black Rock Desert in Nevada, although "attending" isn't really what you do at Burning Man; you are there to experience it. It is a 60,000 person city that springs into existence for a week, with interactive art, events, and parties, all of which is either burned or disassembled at the end of the week. Like a sand mandala, it will never be there in the same way again. This year I took very few photos (if you'd like to see some incredible photos, see these by Jim Urquhart, a Reuters photographer who I met at last year's burn). But I spent a lot of time in quiet contemplation and introspection, a lot of time meeting new people, a lot of time dancing alone and with others, and a lot of time hugging people and making new connections. A short list of things I learned this year:

Not all of the pieces of my personality are well suited for all purposes. I can be impatient, overly solution-focused at the expense of the feelings of others, and intimidating. I often beat myself up over these "flaws", but I realized that these are not, in fact, flaws. Every part of my personality, my humanity, is important, is a part of who I am, and is valuable. Sometimes, you bring some parts of your personality to bear in certain situations, and sometimes you let other parts take a back seat, because it's the best place to get you from A to B. But that doesn't mean that those parts of your personality are bad, they're simply the wrong tools. Next time you think that by not saying something, you're being untrue to your authentic self, ask yourself whether you're really being inauthentic, or whether you're just choosing to utilize a different skill (patience, forbearance, acceptance) rather than the one that's easiest and comes most naturally to you. And, then, when you're out with your friends later, speak unvarnished truth, be funny, and abrasive, and a bit loony. They love you for it. That's why they're your friends.
Solid state hardware can, in fact, succumb to dust and heat and fail in the desert, the same way things with moving parts can.
Don't try to write your greatest hits anthology while you're on your second album. Don't think about how you're going to tell your story to people while you're still in the middle of it. Experience it mindfully and fully, and absorb it. Let the emotion, the joy, the sadness, the frustration come to you, experience them without judgment, and remember them. Wait until it's time to tell the story to put them into context. You never know what's around the next bend that could change the whole thing.
PVC is light, but it's not an excellent structural building material, and can buckle and fail easily and spectacularly under strain, especially in high heat. Always over-engineer anything built with PVC.
Open yourself to spontaneity. As Woody Allen said, half of life is just showing up.
You've got a lot of love to give. Give freely of it, and it will come back to you in the form of joy. At the burn, it's easy to connect with others, because people are open to the experience. It's harder back here in Palo Alto, but less hard than you might think.

L'Shana Tova.

Saturday, August 24, 2013

I Will Show You Fear In a Handful of Dust

No posting until after Labor Day. I'll be off the grid in the desert for a little over a week: no phone, no e-mail, no SMS, and no internet. Although I will be bringing a small Raspberry Pi-based server with me, so technically, there will be WiFi, but only on the local intranet that I'm setting up, with no uplink to the outside world. See you when I get back.

Thursday, August 22, 2013

Theory Thursday, with bonus Metaphysics: Gödel's Incompleteness Theorem

Gödel's Incompleteness Theorem is one of those things that makes pop science writers go gaga. And, I'm not going to be contrarian: it's profound, it's meta, it's paradoxical, and it has a lot to say about what we can accomplish as human beings. But, at its core, it's really a programming problem, albeit one with very interesting implications*.

In the 1920s, in the shadow of The Great War, David Hilbert set out to do something monumental, something that would bring together mathematicians from all over the world, in a grand challenge: to formulate all of mathematics rigorously, and provably, starting from a single set of axioms. It came to be known as Hilbert's Program, and the name has an unintentional level of acuity: it really boils down to a system of encodable rules which one can use to generate mathematically true statements. Then, by utilizing these rules in a fully deterministic fashion, you can prove mathematical theorems. By iterating with those rules, you could theoretically enumerate all possible mathematically true statements: it's a programming language for generating All Mathematical Laws!
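To make "a programming language for generating statements" concrete, here is a toy sketch. It is not Hilbert's system or anything close to real mathematics: it swaps in the MIU puzzle from Douglas Hofstadter's Gödel, Escher, Bach, where the only axiom is the string "MI" and four rewrite rules produce new "theorems". The function names are made up for the illustration.

```python
from collections import deque
from itertools import islice

AXIOM = "MI"

def successors(s):
    """Yield every string reachable from s by applying one rule once."""
    if s.endswith("I"):
        yield s + "U"                        # rule 1: xI  -> xIU
    if s.startswith("M"):
        yield s + s[1:]                      # rule 2: Mx  -> Mxx
    for i in range(len(s) - 2):
        if s[i:i + 3] == "III":
            yield s[:i] + "U" + s[i + 3:]    # rule 3: III -> U
    for i in range(len(s) - 1):
        if s[i:i + 2] == "UU":
            yield s[:i] + s[i + 2:]          # rule 4: UU  -> (dropped)

def theorems():
    """Enumerate theorems breadth-first from the axiom, each exactly once."""
    seen = {AXIOM}
    queue = deque([AXIOM])
    while queue:
        s = queue.popleft()
        yield s
        for t in successors(s):
            if t not in seen:
                seen.add(t)
                queue.append(t)

first_theorems = list(islice(theorems(), 200))
print(first_theorems[:5])
print("MU" in first_theorems)
```

Run long enough, the generator reaches every string the rules can derive. What it cannot do is tell you that a string will never show up: "MU" is famously underivable in this system, but you learn that by reasoning about the rules from the outside, not by waiting for the enumeration to finish.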

Except...for that "theoretically" bit. Turns out, it's not true. Kurt Gödel spent a lot of time trying to prove the completeness and self-consistency of such systems, but eventually wound up proving just the opposite: he demonstrated that you could construct the following types of statements in this mathematical language:

Statements about the provability of other statements (e.g., Statement X is unprovable)
Statements that reference themselves (e.g., This sentence has N characters.)
By combining the two, you can construct statements that reference their own provability, e.g.:

This theorem is unprovable.

This is similar in construction to the classic form of Russell's Paradox ("Does the barber who shaves everyone who doesn't shave themselves shave himself?"), but with a twist: In this case, if the statement is provable, it's a paradox. However, if it is actually unprovable, there's no paradox. Therefore, it is both unprovable and true. Hence, Gödel showed that the system is incapable of proving a demonstrably true statement, and cannot be complete. As you might guess, there's a lot of overlap between this finding and computability theory, and it can even be expressed in terms of the latter, which brings it strongly into the realm of computer science.
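Self-reference by itself is harmless, and a program can check it. As a small aside (an illustration, with a made-up helper name and the assumption that N is written in digits), the sketch below takes the earlier example sentence, "This sentence has N characters.", and searches for the values of N that make it true.

```python
def sentence(n):
    """Build the self-describing sentence for a candidate count n."""
    return f"This sentence has {n} characters."

# Keep the candidates for which the sentence's claim about itself is true.
true_counts = [n for n in range(1, 1000) if len(sentence(n)) == n]
print(true_counts)                # → [32]
print(sentence(true_counts[0]))   # → This sentence has 32 characters.
```

Here the self-referential claim can be settled mechanically by counting. Gödel's move was to build a statement whose claim about itself concerns provability, which the system cannot settle from the inside.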

But, one of the things that has always stuck out to me about the Incompleteness Theorem is what it has to say about what it is we can possibly know about the world, and what this means about our perception of reality. I was raised in a moderately religious household, and eventually, in the course of my own explorations, became very observant, went to Jewish summer camp, studied the Torah, and tried, to the best of my ability, to puzzle out what it was God wanted out of us. But, over the years, the relentlessly empirical life of the academic scientist took a toll on my faith, at first subtly, but like a river grinding out the Grand Canyon, its effect was profound over the 10 years I spent as a physicist. And, one day, I woke up and realized: there was no room in The Mechanical Universe for an omniscient, omnipotent being**. Plus, the psychological and sociological reasons for us to invent such a being were too clear to ignore. It was just too facile a solution to the difficult problem of What Do I Do With My Life.

But even so, I've never been able to call myself an atheist, because, in spite of all my book larnin', I know what I don't know. On the one hand, at the most granular scale, almost everything I've ever learned about the physical world is premised on an unprovable hypothesis: that the things that we observed yesterday are a good guide to what we will observe tomorrow. It's unprovable, because the only evidence that we have for the inductive principle is the inductive principle itself: it has worked well in the past, so it should continue to work. I have no problem using the inductive principle as a guide, as a way to decide which insurance to buy, or how to spot a mu meson. But in matters of spiritual life or death, can we really count on something so flimsy?

And then, on the grand, cosmic scale, we have the Incompleteness Theorem: I know that there are things out there that are true, but can't be systematically proven. In fact, Gödel proved that there's an infinite number of them. If that's true, I can prove and prove and prove to my heart's content, but I can never be sure that I haven't failed to observe something which could turn my entire understanding around. There's just too much truth out there, and not enough time.

So, what are we to make of the universe? Nothing is provable, and there's an infinity of unprovable true things that we'll never know. Given all that, can we really rule out, conclusively, the existence of something outside of this mortal coil? And so ends my catechism.

*However, while it's really cool, it doesn't have something to say about everything: Carl Woese once asked me to write a paper on what the implications of the incompleteness theorem were for biological evolution, and Douglas Hofstadter happened to be giving a colloquium that week. I went down early before the colloquium and accosted him over coffee to ask him his thoughts on the matter. He considered it briefly, and then said, "I don't think there are any."

**The Mechanical Universe used to be shown on PBS in Chicago when I was growing up, and I found every single episode, from Newton's Laws to Relativity, to be absolutely mesmerizing. It had a strong impact on my decision to study physics.

Wednesday, August 21, 2013

Is Earlybird or X-Pro II better for your wads of cash?

I was at a conference once on technology and law enforcement, and there was a gentleman presenting some sort of Twitter-monitoring technology on behalf of IBM. Because I was at the booth next to him, I got to hear his pitch dozens of times, describing a specific scenario in which a small time drug dealer was coordinating with clients over Twitter. As the day went on, the size of the fish he had caught got bigger and bigger, and the tone of his pitch more and more frenzied, until finally he jumped the shark and told someone that "nowadays all crime is coordinated over Twitter." Hah, Twitter is so 2012! In 2013, all crime is coordinated over Instagram:

Aspiring rapper's Instagram photos lead to largest gun bust in New York City history

Also in today's news: Supreme Court Justices don't use e-mail, which probably makes it difficult for them to opine on things like net neutrality, whether metadata qualifies for Fourth Amendment protections, and a host of other technologically relevant issues that are currently defining our society. Technological innovation is occurring at an ever faster rate, and has clearly passed the rate at which Supreme Court justices die, which means that this will be a permanent state of affairs. They may begin to understand e-mail around the time when they are asked to rule on the legality of remote-scanning your flying car's hyperdigicube transponder.

Monday, August 19, 2013

The Three P's of Big Data: Plumbing, Payment, and Politics

In a panel discussion at Georgetown Law School, I discussed some of the barriers to actually using data, at scale, in a large organization, to drive change. I lump these barriers into three categories, which I call The Three P's: Plumbing, Payment, and Politics. We're a long way from being able to just throw all the information into a computer, ask it a question, and get an answer out, and the things that stand between here and there are my raison d'etre. This weekend, the Sunday New York Times dipped its toe into Big Data commentary, with a contrarian take titled Is Big Data an Economic Big Dud?, in which the author asks, bluntly, What's taking so Goddamn long?

[T]he economy is, at best, in the doldrums and has stayed there during the latest surge in Web traffic. The rate of productivity growth, whose steady rise from the 1970s well into the 2000s has been credited to earlier phases in the computer and Internet revolutions, has actually fallen. The overall economic trends are complex, but an argument could be made that the slowdown began around 2005 — just when Big Data began to make its appearance.

This is a worthy topic of discussion, but the article is a mish-mash, lumping things like streaming video, text messaging, Cisco, and analytics companies under the title "Big Data", and asking why this "Big Data" hasn't made an impact on the economy (while tacitly admitting that the economy has been in such flux since 2005 that asking what the impact of any particular development is amounts to asking why the wind direction didn't change when you passed gas.) In some sense, the very fact that the New York Times has covered it means that it may have already peaked, because once the Sunday Times becomes aware of a technological development, it's reached the level that geriatric crossword-puzzle enthusiasts might actually be dimly aware of it. Witness:

Other economists believe that Big Data’s economic punch is just a few years away, as engineers trained in data manipulation make their way through college and as data-driven start-ups begin hiring.

(Emphasis mine.) I don't know where the Times gets their information about high tech hiring, but judging from this sentence, it's probably from the classifieds page of the New York Times.

Paul Krugman, as usual, has a more nuanced and interesting set of questions:

OK, we've been here before; there was a lot of skepticism about the Internet too — and I was one of the skeptics. In fact, there was skepticism about information technology in general; Robert Solow quipped that “You can see the computer age everywhere but in the productivity statistics”. But here’s another instance where economic history proved very useful. Paul David famously pointed out (pdf) that much the same could have been said of electricity, for a long time; the big productivity payoffs to electrification didn't come until after around 1914.
Why the delays? In the case of electricity, it was all about realizing how to take advantage of the technology, which meant reorganizing work. A steam-age factory was a multistory building with narrow aisles; that was to minimize power loss when you were driving machines via belts attached to overhead shafts driven by a steam engine in the basement. The price of this arrangement was cramped working spaces and great difficulty in moving stuff around. Simply replacing the shafts and belts with electric motors didn’t do much; to get the big payoff you had to realize that having each machine powered by its own electric motor let you shift to a one-story, spread-out layout with wide aisles and easy materials handling.

I think it's reasonable to suggest that Big Data™ writ large may never (and probably won't) have the same impact on the world as electricity or The Intarwebz. But the question of what may be slowing its impact is something on which I am well qualified to comment. I will make more extensive posts about this in the coming weeks and months, but in terms of the three P's, the plumbing pieces are still in their infancy, and are hugely labor intensive: getting data from unstructured sources, like documents, images, and video, is nearly impossible without significant cost and data-specific tuning. And, even making full use of structured data (databases, XML, etc.) is painful because of the amount of context and integration required.

More importantly, though, the politics of data-driven decision making are subject to the politics of any paradigm-shifting change, particularly ones that have the impact to show that the Chief Executive Officer is wearing no clothes. Large organizations have been Doing It Wrong™ for so long now that convincing them to take the implications of their data seriously is an uphill battle against politics and entrenched interests. In theory, the market should correct and allow data-driven decisions to lead if they provide a competitive edge. But in practice, pop-culture management books and pop-science provide managers with confirmation bias that their gut instincts are all they need to run a giant organization, data be damned. And, even when a CEO wants to make a large scale change, he has to push it through using hired thugs just to get any traction.

Essentially, we're still at the infancy of the Big Data movement, both in terms of its ease of use, and its cultural adoption. The potential to monetize how we use data for decision making is huge, but there are still significant obstacles to leveraging its potential.

Friday, August 16, 2013

Data Sharing Done Right

By my hackweek teammate Andy I: http://www.marketsforgood.org/beyond-alphabet-soup-5-guidelines-for-data-sharing

Read this if you ever plan to share data publicly.

I Do Not Think This Word Means What You Think It Means

Poor Peter Shih. His rant about the things he hates about San Francisco (ugly girls, homeless people, and not being able to park his car) generated so much derision that he had to take it down, and clarify that it was "satire". This is a common defense of bigots and jerks everywhere. Here's the thing: satire is a tool used by the weak against the powerful. It's a way that someone with no power can even the odds, through wit and insight. Here, Shih is the powerful: he's a YCombinator-backed male with a car and a job making fun of women, the homeless, and bicyclists. If it were funny, he'd at least be able to claim he was poking fun at himself. But it's not funny. There isn't even a hint of self-deprecation. A vitriolic stream of misogyny and self-entitlement isn't called "satire", it's called being a jerk. And so ends my catechism.

Tuesday, August 13, 2013

Getting (Real) Things Done

I'm a technologist. I like technology. Specifically, I like bringing real change to real problems, using technology. I spent 10 years as an academic, engaged in, as my father liked to say, the examination of the frontiers of human knowledge using a magnifying glass. But, by far, the thing I loved more than anything else as an academic was to build things, both machines and software. Instead of relying on paper notebooks, I rolled my own web-based lab notebook, in a combination of object-oriented Perl and JavaScript (this was a decade ago, and to this day, my colleagues insist on endorsing me for Perl on LinkedIn, which I think is about as valuable as being endorsed for trepanning.) In the end, one of the reasons I left academia was precisely because I liked building things a lot more than I liked discovering things of marginal value, and one of the reasons I love my job is because the things I build have immense value.

I put all this forward as a form of credentialing, though, to give you my bona-fides as a techno-optimist. Because what I really want to say is: the vast majority of the problems in our world will not be solved by apps, hackathons, or data jams. They can only be solved by talking to people.

Let's be honest: most engineers and nerds hate talking to people. I met my wife through a dating website, because I am gripped with terror at the idea of introducing myself to a woman in a social setting. I book my restaurant reservations using Open Table because restaurant hosts at upscale restaurants are some of the most odious people to talk to on the planet. So, faced with a problem, we will opt to write software to solve it. Witness, for instance, this article about StopBeef.com, a “matchmaking app for murder mediators”, as PandoDaily puts it:

A high profile shooting on Mother’s Day that injured 19 people got hacker Travis Laurendine thinking about how homicide was impacting the city. The National Civic Day of Hacking was fast approaching, so he decided to try “hacking the murder rate” for the New Orleans version of the event.

Laurendine was friends with Nicholas through the rap community, and he turned to him for advice. “I was like: you know this inside and out, what do you think would help?” Laurendine said.

Nicholas told Laurendine about a recent mediation experience he’d had. Two guys were beefing and carried guns. The conflict had escalated over time, and it was starting to look like one would try to take the other out. But Nicholas knew both of them, and one of them asked him to help settle the issue.

An intervention was staged, and the two beefers showed up with their entourages to Nicholas’ restaurant. “They both really wanted to stop the beef,” Nicholas remembers. “I told them it didn’t make sense for them to die over nothing.” The entourages agreed, and with their friends (and a local celebrity) asking them to stop, the beefers were able to end the disagreement without looking like they had backed down.

When Laurendine asked Nicholas for hacker app ideas, Nicholas relayed this story and said, “You need to do something like that! You can hit this app, log in, tell them who you’re beefing with, and someone from the neighborhood can come spoil.”

…

After submitting the stopbeef app to the government, Laurendine was picked as one of the 15 top “Champions of Change” nationally in civic hacking. He’s headed to the White House to receive recognition tomorrow.

I don't mean this as a dig against StopBeef.com, which I'm sure has provided some really valuable outcomes, and may have even saved lives (although the linked article doesn't cite any specific success stories.) But, it's pretty clear that the key element here is people talking to people. And I have to wonder: if Laurendine et al had spent the time learning about the community, talking to people, doing volunteer work with local tutoring organizations or Boys and Girls clubs, or finding a way to actually engage people directly, would it have had a higher impact than the equivalent amount of work spent trying to get their CSS to look just-so?

Social problems are particularly prone to this kind of wishful thinking. When I first started looking for volunteer opportunities when I was in graduate school, I reasoned that there must be dozens of organizations looking for someone to build them a web site, or a database, or maintain their PCs, and that I could bring my highly valuable skill set to bear on these problems, to great plaudits. What I in fact discovered is that volunteer organizations need people to do things for other people, not for machines. And, the less interesting and high profile the work, the more they need it. Habitat for Humanity had so many church groups, schools, and companies calling on them to volunteer that the local chapter didn't even bother to call me back. The San Francisco Food Bank has a several week waiting list if you want to come in and help sort canned goods. Even the Martin De Porres soup kitchen in my neighborhood sometimes has too many people doing dishes when I come in to volunteer. In grad school, I wound up driving a van for a program called Mothers Too Soon. On Thursday nights, in the winter, I would pick up 15 – 17 year old African American girls from the outlying areas of Champaign-Urbana, and drive them and their young children to a local church. There, I would sit and read journal articles for a couple of hours while they had dinner and counseling, and then would drive them back home. It wasn't particularly fun, but that was the point: nobody else wanted to do it, which was why there was a need. The only skills I had that it utilized were a driver's license and a willingness to help out.

But, techno-optimism bedevils the corporate world just as much. 99% of my job could be done from the comfort of my home office. I can video teleconference, Webex, call, SSH, and pretty much do all of my job functions without ever so much as standing up. But I hop on a southbound train from San Francisco to Palo Alto every morning at 7:19AM like clockwork, and I fly back and forth to DC and New York (as I am doing right now) several times a month. And I do it because, in spite of all the things I can do with just a computer and a wifi connection, nothing important ever gets accomplished unless I'm there in person. You can commit code to the git repo for years on end, but it's not until a person is actually using it that you've accomplished something.

For example, I have occasionally planned trips to client sites not because I needed to get something done there, but because being there, in person, forces the client to make sure I have something to do when I get there. So, I tell them that I'm coming in to configure the networking, and they need to have the servers racked by Wednesday. The truth is that, if they just plug in the Remote Access Console, I can configure the rest of it from Palo Alto, without the need to fly anywhere, or even put my pants on, for that matter. But if I did that, I could spend the next month or more calling, e-mailing, and sending carrier pigeons, for all the good it would do me; the servers would get racked when the IT team had nothing else to do, which is not a condition that has ever prevailed since the invention of the abacus (“Abacus help-desk hours are between 2AM and 4AM on the first new moon of the harvest season. If you require assistance outside of these hours, please consult our rice-paper scroll for self-help and Frequently Asked Questions.”) Is this a huge waste of jet fuel, money, and 3-ounce bottles of shaving cream? Yes. But it's less wasteful than flushing three months of a six month software pilot down the drain while I wait for the servers to get racked. And, while I'm there, I can stop by the analysts' desks and chat with them about their work, their workflow, and what kinds of actual, real-world problems keep them up at night. I can take the clients' project lead out to lunch and let them dish face-to-face about who's trying to block the roll-out, and what kinds of changes we might want to make to turn their opinion around. I can ask for an introduction to the head of a different department or someone who works at a different firm, and see if they're having the same problem, and whether we can try to solve it for them too.
And I can hit up my college friends who live in New York and DC, toast heartily to our health, and hear about their latest project, whether or not it has any bearing on my work, just to expand my mind a bit.

Software and the internet have changed our world, and in most ways, I think it's been for the greater good. People like to throw digs at Facebook for its triviality, but I landed my current job because I connected with someone on Facebook who turned out to be friends with someone who I had met years before and not seen since. So, we reconnected, and started talking, and when my postdoc was over, he suggested I interview. Not only that, but in the ensuing years, he's become one of my best friends, and was the officiant at my wedding. I can confidently say that my life would be less rich and less happy without Facebook. But the good that Facebook, and JDate, and GMail did for me was that they brought me into contact with people who mattered. The internet has lowered the barriers to human interaction, but it is always the human interaction itself that makes the world a better place. Next time you want to write an app to solve a problem, ask yourself: could I do more by volunteering, or community organizing, or even just reading and learning about this problem, than I could by debugging jQuery? Or am I secretly serving my own interests by doing something I enjoy or learning a new marketable skill, or maybe just avoiding the fact that, for better or for worse, the problems in our world are really, really hard?

Tuesday, August 6, 2013

Something You Will Never Hear in a Spy Movie

"Just run it through one or two databases. No need to go overboard."

Thursday, July 25, 2013

Jonathan Swift Slow Claps From His Grave

Congratulations are in order for Holman Jenkins of the Wall Street Journal: he has penned a truly devastating double-Swiftian satire of both the NSA call data warehousing program and the Wall Street Journal editorial page's disdain for civil liberties: while the rest of us were discussing whether the program goes too far, and trying to carefully weigh the balance between privacy and security, Jenkins kicks it up a notch and suggests that the problem with the program is that not enough government agencies have access to the data of every phone call made by everyone (Metadata Liberation Movement, WSJ, July 23, 2013).

How do we know that this must be satire? He does a pretty good job keeping it on the level, making it seem completely serious, but there are a few dead giveaways, howlers that would just be too stupid to put into print if it weren't.

For one thing, he starts with the following quote:

"It is not rational to give up massive amounts of privacy and liberty to stay marginally safer from a threat that, however scary, endangers the average American far less than his or her daily commute," writes Conor Friedersdorf in the Atlantic, expressing a common view [about counter-terrorism surveillance].

A perfectly reasonable sentiment. But he starts to hint in the very next paragraph:

Another kind of loss of liberty comes when our tax dollars are spent on useless programs.

A nice setup: this is the Wall Street Journal editorial page, so we expect every reasonable paragraph to be followed with "...and taxes are too high." At this point, you think he's going to run straight up the middle, accuse the Obama administration of malfeasance for funding a hugely bloated program, and suggest that, in the name of liberty, freedom, and apple pie, we should cut the whole damn thing, and return the money to the oil companies they took it from. But that's exactly when he feints to the left!

The biggest problem, then, with metadata surveillance may simply be that the wrong agencies are in charge of it. One particular reason why this matters is that the potential of metadata surveillance might actually be quite large but is being squandered by secret agencies whose narrow interest is only looking for terrorists.

I know what you're saying: "This might be silly, but it's not patently asinine." True. But Jenkins is just getting started. His prime, leading example of an important problem plaguing everyday Americans that can be solved by giving MOAR METADATA to every law enforcement agency in the country? Wait for it...highway serial killers.

Highway serial killers are enough of a problem that the FBI formed a task force devoted to them, its Highway Serial Killers Initiatives. Instead of finding a suspect and trying to tie him to bodies, could metadata help us quickly find suspects based on the locations of bodies?

How would this Magical Metadata help us solve this scourge of highway serial killers? Who knows? Jenkins doesn't even offer a theory. It's not clear that he actually even knows what the word "metadata" means, apart from "something the government wants very, very badly, so we should give it to them." Other scourges worth giving the government the entire list of people you have ever called? Anticipating traffic jams. At some point, he has apparently shifted from talking about phone call metadata to some other kind of data entirely, but it's not clear when that happened. He still must be talking about something related to government surveillance data, because otherwise, it would sound like he is simply advocating using data to try to predict traffic jams, an idea which is about as original as using data to try to predict the weather. He seems to be under the misapprehension that prepending the term meta- to these observations transforms them from "stringing together this week's trending twitter topics into barely coherent sentences" to "stunningly original insight."

It's all downhill from there. He marches straight off into stupid-town, saying things like, "'Big data' is only as good as the algorithms used to find out things worth finding out," and "Our guess is that big data techniques would pop up way too many false positives at first, and only considerable learning and practice would allow such techniques to become a useful tool," which seems simultaneously ignorant and vapid. But that's just Jenkins pulling your leg again. Here, he's making fun of the history of the government sinking vast sums of money into black box "algorithms" to look for terrorism and then trying to bury it. Surely, he couldn't have written such an article without having familiarized himself with the topic in question. For that matter, why does the WSJ editorial board need to make "guesses"? Why not let someone knowledgeable write the column?

But he saves the real zinger for the end: after trumpeting highway serial killers as proof that we need more people knowing when you called your grandma, he caps it off with this:

Most of all, it would allow these techniques to be put to work on solving problems that are actual problems for most Americans, which terrorism isn't.

Thank you sir. Your contributions to the art of satire will be duly noted for generations hence.

Monday, July 22, 2013

Test Data

Quora has a super-handy post for data nerds listing big open source data sets you can use for testing:
http://www.quora.com/Big-Data/What-kinds-of-large-datasets-open-to-the-public-do-you-analyze-the-mostly

I would add to that the Enron e-mail corpus, which is great for testing anything against e-mails:
http://www.cs.cmu.edu/~enron/

There are "only" about 600,000 emails total, and of those, only about 300,000 are unique, but for an unstructured data set, it's good, and it's the gold standard.

Update:
Another useful list of large unstructured data sets:
http://www.quora.com/Where-can-I-get-large-corpora-open-to-the-public

On Metadata and the Right to Privacy

One of the interesting points raised by the NSA call data warehousing disclosures is whether the collection of metadata about phone calls is materially different from collection of the actual content of the calls themselves, and if so, how. It's easy to see why you wouldn't want people listening in on your calls. Knowing who you called and when is obviously materially interesting and important, but it seems to fit into the theoretical framework somewhat differently. I'm trying to get a better understanding of the Right to Privacy in the abstract, because it's a tricky beast; for instance, is the Right to Privacy a fundamental human right? Is it constitutionally supported? What harm comes to someone who is being surveilled without their knowledge? If there's no harm, and they never find out, what's the basis for the claimed right, and what's actually being infringed? I've been working through one of the seminal works on the right to privacy in the US, Warren and Brandeis' Harvard Law Review article The Right to Privacy, and interestingly, it actually addresses the question of metadata somewhat directly. Warren and Brandeis propose that the right to privacy doesn't actually derive from any damages that we incur due to having our privacy breached, but instead derives from intellectual property rights. They discuss the publication of information about a collection of gems, as opposed to the sale of the gems themselves:

Suppose a man has a collection of gems or curiosities which he keeps private : it would hardly be contended that any person could publish a catalogue of them, and yet the articles enumerated are certainly not intellectual property in the legal sense, any more than a collection of stoves or of chairs...To deprive a man of the potential profits to be realized by publishing a catalogue of his gems cannot per se be a wrong to him. The possibility of future profits is not a right of property which the law ordinarily recognizes; it must, therefore, be an infraction of other rights which constitutes the wrongful act, and that infraction is equally wrongful, whether its results are to forestall the profits that the individual himself might secure by giving the matter a publicity obnoxious to him, or to gain an advantage at the expense of his mental pain and suffering. If the fiction of property in a narrow sense must be preserved, it is still true that the end accomplished by the gossip-monger is attained by the use of that which is another's, the facts relating to his private life, which he has seen fit to keep private. Lord Cottenham stated that a man "is that which is exclusively his..."

One of the downsides of this argument, as applied to government intrusions into privacy, is that it is very firmly based in 19th century notions of privacy from the paparazzi press of the era, and, as Warren and Brandeis put it, "the right to be let alone." If the government is collecting data and keeping it secret, such concerns evaporate. But, the interesting point is that, in their view, you don't need to show that you suffered any damage from a violation of privacy; the information which I care about is mine, and there's no need to quantify its value in order to show that you misused it. Just as if you took a used paper coffee cup from me without my permission, it's theft, irrespective of the value.

In the Big Data era, we are constantly sloughing information in a myriad of ways: see, for instance, this article about collection of information about shoppers' habits based on their mobile device WiFi signal. (The shopper outrage is a bit hard to grok; you have a device in your pocket which is constantly broadcasting, and it's not doing it on its own, you had to buy it, and turn the WiFi on. I think if people want to buy and use technology they don't understand, they're going to run this risk, and should take some personal responsibility.) Warren and Brandeis seem to imagine a penumbra of information that follows us around, and within a certain sphere, that information belongs to us, but outside of which, it is fair game (much like Rabbi Jeremiah's fledgling bird.) But with the amount of data being thrown off increasing both in range and volume, does the sphere stay the same size, or do our expectations simply have to change about what's fair game?

Thursday, July 11, 2013

Theory Thursday: Computing Really Cheaply

How cheaply can you run a computer? How much energy does it require to run a computation? The big fans in last week's Theory Thursday suggest: a lot. But, in the 60s, when everybody else was trying to figure out how quickly we could compute, Rolf Landauer was at IBM trying to figure out how slowly you could compute. It had been assumed, up until that point, that computation cost energy: if you want to take a bunch of numbers, and figure something out, you had to put in some energy to get an answer out. And, that energy was lost to entropy (i.e. heat). Those fans in last week's Theory Thursday were carrying all of that heat away. From an information theory standpoint, imagine a basic logic circuit, such as an AND gate:

The gate takes in two bits (A and B) and has a single output (O). That means that we lose a bit of information in this transformation, which has to be released as entropy (based on last week's energy-information equivalence.) But, what if you could perform computations using only gates that looked like this:

These gates all take two input bits, and produce two output bits. Landauer and others not only demonstrated that such gates could be used to do Turing complete computation, they envisioned a hypothetical "pinball machine" model of computing: bits are represented by spheres, and gate operations are carried out by arrangements of walls and elastic collisions between spheres:

Since the collisions are assumed to be completely elastic, the gates don't use any energy, and the computation is lossless.
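To make the contrast between the two kinds of gates concrete, here is a minimal sketch in Python. The post's own gate diagrams aren't reproduced here, so the two-in, two-out gate below is my assumption: a controlled-NOT (CNOT), picked only because it is a simple, well-known reversible gate, not because it is necessarily the one pictured. The helper names are made up for illustration.

```python
from itertools import product

def and_gate(a, b):
    """Irreversible gate: two input bits in, one output bit out."""
    return a & b

def cnot_gate(a, b):
    """Reversible gate (illustrative choice): pass a through, flip b when a is 1."""
    return a, a ^ b

# Every possible pair of input bits: (0,0), (0,1), (1,0), (1,1).
inputs = list(product((0, 1), repeat=2))

def is_reversible(gate):
    """A gate is reversible if no two distinct inputs share an output."""
    return len({gate(a, b) for a, b in inputs}) == len(inputs)

# AND collapses several inputs onto the same output, so the inputs
# cannot be recovered from the output alone.
and_outputs = {and_gate(a, b) for a, b in inputs}

# CNOT sends each input to its own distinct output, so nothing is lost.
cnot_outputs = {cnot_gate(a, b) for a, b in inputs}

# Applying CNOT twice gets you back where you started.
round_trips = all(cnot_gate(*cnot_gate(a, b)) == (a, b) for a, b in inputs)

print("AND reversible? ", is_reversible(and_gate))
print("CNOT reversible?", is_reversible(cnot_gate))
print("CNOT undoes itself?", round_trips)
```

The point of the sketch is only the counting: four possible inputs squeezed into two possible outputs means something got thrown away, while four inputs mapped onto four outputs means it didn't.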

Even more interesting, Landauer later showed that even with lossy computing (as with two-to-one bit gates), it was possible to carry out computations with input energies approaching zero, as long as you're willing to wait long enough for the answer. This is what is called "reversible computing," since it depends on reversible reactions, in the thermodynamic sense. Without getting too deep into the thermodynamics, this is somewhat analogous to running a marathon: the faster you want to get there, the more total energy you have to expend, because humans aren't "reversible". It takes more energy to run from point A to point B than to walk there. Landauer hypothesized about theoretical molecular computers that depended on reactions that could be biased from reversibility towards irreversibility by applying more energy, making them run faster, at higher expense, or, conversely, running at arbitrarily low energy, for arbitrarily longer compute times.
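For a sense of scale on what a lost bit costs, here is a rough back-of-the-envelope sketch. The formula isn't spelled out in this post; I'm using the standard statement of Landauer's bound, that erasing one bit at temperature T dissipates at least kT ln 2 of energy, where k is Boltzmann's constant. The 300 K room temperature and the function name are my own assumptions for illustration.

```python
import math

# Boltzmann's constant in joules per kelvin (exact value in the 2019 SI).
BOLTZMANN_J_PER_K = 1.380649e-23

def landauer_limit_joules(temperature_kelvin, bits_erased=1):
    """Minimum energy dissipated by erasing bits, per the bound k*T*ln(2) per bit."""
    return bits_erased * BOLTZMANN_J_PER_K * temperature_kelvin * math.log(2)

room_temp_kelvin = 300.0  # assumed room temperature
per_bit = landauer_limit_joules(room_temp_kelvin)

# A reversible gate erases zero bits, so this particular bound drops to zero.
reversible_cost = landauer_limit_joules(room_temp_kelvin, bits_erased=0)

print(f"Erasing one bit at {room_temp_kelvin} K costs at least {per_bit:.3e} J")
print(f"Erasing zero bits costs at least {reversible_cost} J")
```

That works out to a few zeptojoules per bit, which is a floor and not an estimate of what any real chip burns; the fans are there for other reasons too.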

And that's your Theory Thursday for this week!

Wednesday, July 10, 2013

Won't Somebody Please Think of the Poor Financial Institutions?

Oh noes, more government intrusion! The CFPB is collecting "anonymous information about at least 10 million" consumer credit card transactions. Luckily, the Liberty Sleuths(tm) at Breitbart are already sharpening their pitchforks to save us. The financial services industry is right to be outraged at the burden it imposes on them, as well as the security and privacy risks. On the other hand, if you would like to follow them into this dark alley over here, they have a fine selection of wares that they would be happy to part with for a small fee.

Wednesday, July 3, 2013

Theory Thursday: Knowledge is Power

Welcome to the first installment of Theory Thursday! (The first installment is coming to you on Wednesday because Thursday is 4th of July. Give yourself a moment to adjust to the cognitive dissonance, then continue reading.) Sometimes, when considering the world of Big Data, it is instructive to consider the world of Very Very Small Data: how do the bits of information around us add up to the terabytes on our hard drives and, more importantly, what constraints does the physical world place on us? Once upon a time, I was a Warrior of the Physicist Clan, and it was my clan's destiny to gaze to the heavens and to the minutiae, and ask what the cosmic rules were which bind the two together. I hope to share some of that wonder with you here.

So, let us start with the mundane, and proceed from there to the sublime: You are the operations manager for a thriving startup in San Francisco, and as such, figuring out how to heat the building on those frigid June mornings is of paramount importance. You conceive of a grand plan to leverage Big Data to heat the building:
Step One: Start with information about the location of every air molecule in the building.
Step Two: Stand by the front door. When a colder-than-average molecule approaches the door from the inside, open the door very quickly to let it out, and then shut the door again. When a warmer-than-average molecule approaches from the outside, do the same to let it in.
Step Three: Profit! You have just raised the temperature of the building using nothing more than Big Data! The devs are happy in their shorts and flip flops, and you have saved the company huge amounts of money on heating bills, allowing them to buy Red Bull and paleo snack packs for the micro-kitchen for another three months before they burn through the rest of their funding.

This little thought experiment is often called "Maxwell's Demon", because of its original publication by James Clerk Maxwell in 1871. It was raised as a paradox, of sorts, because it appears to violate the 2nd Law of Thermodynamics, which states that the entropy of a closed system should never decrease. The demon effectively takes a disordered system, and sorts it into an ordered system of "hot molecules" inside, and "cold molecules" outside, making it appear to violate the 2nd Law. However, there's a way out of the apparent paradox. Notice what the demon started with: Information about the location of every air molecule in the building. The demon converted pure information into pure energy. This is because information is interchangeable with energy. When Claude Shannon coined the concept of "information entropy", he did so knowing full well the implications, that the two concepts are, in fact, one and the same. Entropy, the thermodynamic concept, is in fact simply a measurement of the number of accessible states that a system can attain; a molecular chain with 20 links can have more possible configurations than one with 10, by an exponential amount, and hence has far higher entropy. By the exact same token, a computer with 20 bits of storage can store an exponentially larger number of possible binary strings than one with only 10 bits, leading us to speak of it having a higher capacity for entropy. The measurement of entropy is simply an accounting trick, and it's the same for computers as it is for molecules.
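The 10-bit versus 20-bit comparison is easy to check by counting. The sketch below is just that accounting, with function names I made up; the one thing it adds beyond the paragraph above is the usual convention of quoting entropy as the logarithm of the number of states, which is why the state count grows exponentially with the number of bits while the entropy in bits grows linearly.

```python
import math

def num_states(bits):
    """Number of distinct binary strings a register of this many bits can hold."""
    return 2 ** bits

def entropy_in_bits(states):
    """Entropy of a system with equally likely states, as log base 2 of the count."""
    return math.log2(states)

small = num_states(10)
large = num_states(20)

# How many times more configurations the 20-bit register has.
ratio = large // small

for bits in (10, 20):
    states = num_states(bits)
    print(f"{bits} bits -> {states} possible strings, "
          f"entropy capacity {entropy_in_bits(states)} bits")
print(f"The 20-bit register has {ratio} times as many states")
```

Doubling the storage squares the number of configurations, and the same bookkeeping would apply to the 10-link and 20-link molecular chains if each link had two possible orientations.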

So, where's the resolution in the demon paradox? There are many subtleties that have arisen over the years about how the demon makes measurements, and whether it can open the door in a way that doesn't waste more energy than it captures, but fundamentally, we can understand the answer by realizing that we forgot about a step:

Step Zero: Measure the location of every air molecule in the building.

This step, gathering the information used in step one, requires us to expend energy, to measure (and record!) the location of every molecule. In essence, we are converting that energy into information, and then using that information to convert it back into energy (in the form of building heat). The 2nd Law tells us that we will never be able to do this in a way that gets out more energy from Step Two than we started with in Step Zero, because in each of these conversion steps, we create excess entropy, and lose energy, driving the universe that much closer to heat death.

This relationship between energy and information is one of the most fundamental and befuddling, and we'll revisit it many times on Theory Thursday. Compare, for instance, a 50 MHz CPU heat sink and fan:

With the equivalent hardware for cooling a 4 GHz CPU:

The fact that the latter crunches at 80 times the speed of the former is directly responsible for the fact that it requires much more cooling: processing information generates entropy, which must be dissipated in the form of heat...or does it? We'll revisit this topic when we discuss reversible computing, or, How to Compute On The Cheap (If You Don't Mind Waiting For Your Answers).

Monday, July 1, 2013

The World's Worst Chinese Restaurant Menu

Photos taken at the "Chang'an True Taste Restaurant" in Xi'an on my recent trip to China.


Wednesday, June 26, 2013

Data is Oil and the API is...Whatever Oil Comes Out Of!

How APIs Help Us Comprehend The Infinite Concept Of Data

"The API has emerged as the means for connecting software and services." "The API", as opposed to...hacking into someone's system and extracting the data in pure binary form? I don't mean to nitpick; APIs are real, and they're important. Badly designed APIs (or even a lack thereof) are a horror, and can be a true economic drain. Ask anybody whose data is locked away in a 00s vintage "data warehouse." Or ask Joshua Bloch, the guru of Java APIs. But when you say something like:

By itself, data is irrelevant. The enterprise model has demonstrated that software on-premise has limited value when isolated in silos. But connect it with APIs, and transformations can occur that just were not possible before.

You're not saying anything about APIs. You're saying something about data integration. If you remove the words "with APIs" from the sentence, the meaning is unchanged, since there's no other way to connect data. Data integration is the secret sauce. The API is just the pipe.

Tuesday, June 25, 2013

Crowdsourcing Regulation

A few interesting recent regulatory moves:
- California state regulators try to ban ridesharing services such as Lyft as an unlicensed taxi service
- New York City bans AirBnB as an unlicensed hotel
- Several states attempt to ban Tesla from selling direct-to-consumer cars because...well, it's not clear why except that car dealerships are threatened and have a lot of friends in high places.

These all have in common 20th century regulation colliding at high speed with 21st century commerce. As another example, the American Psychological Association has long made it illegal to provide psychological counseling unless the doctor is licensed in the state where the patient lives, making telemedicine a difficult proposition.

For the most part, these regulations are well-intentioned (with the exception of the Tesla direct sales ban.) They are intended to prevent consumers from getting ripped off in an imperfect marketplace, and to impose minimum standards of safety, cleanliness, and service on service industries. And, pre-Internet, they made a lot of sense. How was I supposed to know if this pink-mustache car that I'm climbing into is going to take me for a ride?

But a funny thing happened on the way to ubiquitous informational awareness. You can't book an AirBnB or a Lyft or buy a Tesla direct without an internet connection. And, if you have an internet connection, you can in fact find out whether these services are reputable and safe. There's no information asymmetry anymore for these services. Is it possible that Yelp and other review services have actually solved the problem of crowdsourcing regulation?

Still, there's a need for enforcement. How do we verify bad actors and punish them without a licensure model and an army of inspectors? Enter: ADA compliance lawsuits. The Americans with Disabilities Act imposes a number of requirements on small businesses to ensure accessibility for those with disabilities, such as adequate disabled parking and accessible toilets. And, instead of employing inspectors, it empowers disabled individuals to sue for non-compliance, and receive damages. Libertarian think tanks find this practice appalling, and call it frivolous. But in my mind, this is a Libertarian dream come true: no messy government inspectors or agencies, we simply empower individuals with market-driven incentives, and they ensure appropriate compliance. The explosion of cell phones and connected mobile devices ensures that the consumers of internet-based services have the tools to ensure compliance on their own. For instance, services like Uber and Lyft routinely record the routes their drivers take, and so could any consumer, using a GPS enabled smartphone. If they feel the route they took was inefficient and they were overcharged, they have all the evidence they need in their pocket to prove it. If they feel the driver was unsafe, or the car was filthy, they can easily take a photo or video and instantly file a report. When everybody is an inspector, who needs inspectors?

There's no doubt that the profusion of government regulation in this sphere is in part due to entrenched interests. Taxi cabs don't want competition from ridesharing, and hotels don't want competition from house-sharing. But the existing regulatory regimes aren't laudable, they're laughable, and they're long overdue for an overhaul. Everybody would benefit from the expanded competition and vastly expanded compliance information that the public could provide.

Reports of Kindle Death Rays Have Been Exaggerated

Apparently the FAA has heard my complaints about the Kindle Death Rays:
FAA moving toward easing electronic device use

Friday, June 21, 2013

Should We Rely On Incompetence to Safeguard Our Civil Liberties? Or: How to Build a Better Call Trap

Robert Mueller's testimony today on the NSA phone monitoring (F.B.I. Director Warns Against Dismantling Surveillance Program) had some fascinating tidbits. First, there's this (emphasis mine):

Testifying before the Senate Judiciary Committee, Mr. Mueller addressed a proposal to require telephone companies to retain calling logs for five years — the period the N.S.A. is keeping them — for investigators to consult, rather than allowing the government to collect and store them all. He cautioned that it would take time to subpoena the companies for numbers of interest and get the answers back.

“The point being that it will take an awful long time,” Mr. Mueller said.

“In this particular area, where you’re trying to prevent terrorist attacks, what you want is that information as to whether or not that number in Yemen is in contact with somebody in the United States almost instantaneously so you can prevent that attack,” he said. “You cannot wait three months, six months, a year to get that information, be able to collate it and put it together. Those are the concerns I have about an alternative way of handling this.”

Mr. Mueller did not explain why it would take so long for telephone companies to respond to a subpoena for calling data linked to a particular number, especially in a national security investigation.

I can tell you why it would take so long in one word: incentives. The NSA and FBI are incentivized to build a system that actually works efficiently and effectively. The phone companies, if faced with regulatory requirements to retain records, are incentivized to do it cheaply. Let's do some back of the envelope math here:

- The average person probably makes 5 - 10 phone calls/text messages a day on their mobile device.
- Wikipedia tells us that there are about 300,000,000 mobile phones in the US.
- That comes out to about 3 trillion phone calls in 5 years. Let's say a single carrier handles maybe 1/5 of that traffic, or 600 billion calls they have to retain.
- Assume metadata on a single call (from, to, duration, date, time, and maybe IMEI) takes up 1 kilobyte of data.
- Then the carrier is required to keep a rolling log of about 600 terabytes of call data (600 billion calls at 1 kilobyte each).
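For anyone who wants to check the envelope, here is the same arithmetic as a short Python sketch. Every constant is one of the rough guesses from the list above (not measured data), and the function name is made up for illustration.

```python
# All inputs are the rough assumptions from the list above, not measured data.
PHONES = 300_000_000        # mobile phones in the US
YEARS = 5                   # proposed retention period
CARRIER_DIVISOR = 5         # one carrier handles roughly 1/5 of the traffic
BYTES_PER_RECORD = 1_000    # about 1 kilobyte of metadata per call


def retained_terabytes(calls_per_day: int) -> float:
    """Size of one carrier's rolling call log, in terabytes."""
    total_calls = PHONES * calls_per_day * 365 * YEARS
    carrier_calls = total_calls // CARRIER_DIVISOR
    return carrier_calls * BYTES_PER_RECORD / 1e12


for rate in (5, 10):
    print(f"{rate} calls/day -> {retained_terabytes(rate):,.1f} TB retained")
```

At 5 calls a day this comes out to roughly 550 terabytes, and at 10 calls a day to about 1.1 petabytes, so the estimate in the list sits at the low end of the range. Either way it's the same order of magnitude, which is all a back of the envelope calculation is good for.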

As bad as this sounds, it's not actually that big a deal. Facebook handles about this much data each day. And using horizontally scalable key-value stores, like Cassandra or MongoDB, you can easily store the data and return the results in near real time, as long as you're willing to throw enough commodity hardware at it. But that's the real issue: the willingness. Verizon, AT&T, these guys don't really want to be in the business of storing call log data and providing it to the government. It doesn't make them any money. So they would simply throw it onto a disk, making it unsearchable, and tell the government, "Sorry, your request will return in 3 - 6 weeks." You could in theory legislate that they return the results faster, but you can't actually legislate that people build competent technology infrastructure. Failure is a more likely scenario than compliance.

With all that said, though, the fundamental question in my mind is this: What is the real difference between the NSA storing the data and the phone carriers storing it and producing it on request? I think this is an interesting philosophical question, and as a civil libertarian, not one I take lightly. The process is essentially the same:

Case 1: The FBI asks Verizon for calls relating to X, and they get an answer back.
Case 2: The FBI asks the NSA for calls relating to X, and they get an answer back.

Going through Verizon for the request may make it take longer, and that may be a good thing, if you're worried about abuse of the data. But, should we really be relying on incompetence as a safeguard against abuse? Frankly, incompetence is often the only thing that stands between us and abuse by corporations and the government. People who ascribe all things to vast complex conspiracies fail to appreciate the true depths of human fallibility and incompetence, in most cases. But, if the question is one of principle (legal, moral, or otherwise), it's worth asking ourselves: if we'd be comfortable with Case 1, why are we fundamentally less comfortable with Case 2?

Transparency Drives up the Value of Secrets

I do almost all of my magazine reading on airplanes in the 20 minutes at takeoff and landing when I can't use my laptop or eReader (because dangerous Kindle rays have been implicated in over a dozen fatal airplane crashes in just the past 12 months). Two items in this month's New Yorker caught my interest (emphasis mine). The first by James Surowiecki:

The consequences of being caught [insider trading] have never been higher...but hard-pressed fund managers continue to be tempted. Competition in the investing world is fierce: there are now nearly eight thousand hedge funds, and on average they have underperformed the stock market for nine of the past ten years. Whatever your supposed market-beating strategy is, someone else is probably duplicating it, and everyone is desperate to find an informational edge. There was a time when big investors could come by that edge quasi-legally, as companies leaked information to select investors and analysts. [c.f. the Facebook IPO -ed] But in 2000 the S.E.C. passed a rule called Regulation F.D., which required companies to disclose material information publicly or not at all.

And, from The President and The Press: "Obama said that he would make 'no apologies' for zealous press-leak investigations, since unauthorized disclosures of secrets jeopardized the lives of soldiers and the spies he sent in danger's way."

The common thread between these two articles is the rising value of secret information. The internet has given us a huge amount of information, which can be utilized by almost anybody to help them understand market trends, world changing events, and communities. Governments are increasingly being pressured to put everything online, and not just DMV forms, but raw data in computer readable formats, as at www.data.gov ("Empowering People"). And others are using that data to create ever more sophisticated analyses and make them available to others, such as Open Secrets.

In a society where everybody knows everything, trading on secret information becomes impossible. The problem is that trading on insider information is a really, really lucrative way to make Money for Nothing and your Checks for Free. Surowiecki suggests that the solution (to the insider trading problem) is to simply disclose yet more and more. But companies have very little incentive to deter insider trading, especially when the 'tips' come from third party sources (like expert networks.) Moreover, this strategy will simply drive the value of the secrets up further, and drive up the lengths people will go to in order to obtain them. The explosion in top-secret data (as well as the explosion in leaking of that data to the press) is based on the same principle. The internet makes it possible for everyone to know everything that's publicly available at the speed of light, so where is my next Huge Scoop going to come from? It's got to come from a secret source.

Hypothesis: The demand for fraud in the world is pretty much a constant; if you squeeze it via regulation, prosecution, or disclosure, it simply increases the value until supply reaches demand, at least to first order.

Thursday, June 20, 2013

A Public Service Announcement for Tech Bloggers

If you look at the web site of an enterprise software company and declare that "nobody understands what they do" after three minutes of reading, here's a hint: it's probably because you cover mobile-social-smartphone-cloud-game consoles, and have no idea what the challenges are of running an enterprise with dozens of databases and terabytes of data spanning 10 years and multiple architectures. One of my problems with blogging has always been, whenever I try to post on a topic, the acute awareness of how little I actually know. Most bloggers are blissfully unburdened by the Dunning-Kruger effect.