Thursday, July 25, 2013

Jonathan Swift Slow Claps From His Grave



Congratulations are in order for Holman Jenkins of the Wall Street Journal: he has penned a truly devastating double-Swiftian satire of both the NSA call data warehousing program and the Wall Street Journal editorial page's disdain for civil liberties: while the rest of us were discussing whether the program goes too far, and trying to carefully weigh the balance between privacy and security, Jenkins kicks it up a notch and suggests that the problem with the program is that not enough government agencies have access to the data of every phone call made by everyone (Metadata Liberation Movement, WSJ, July 23, 2013).

How do we know that this must be satire?  He does a pretty good job keeping it on the level, making it seem completely serious, but there are a few dead giveaways, howlers that would just be too stupid to put into print if it weren't.

For one thing, he starts with the following quote:


"It is not rational to give up massive amounts of privacy and liberty to stay marginally safer from a threat that, however scary, endangers the average American far less than his or her daily commute," writes Conor Friedersdorf in the Atlantic, expressing a common view [about counter-terrorism surveillance].

A perfectly reasonable sentiment.  But he starts to hint in the very next paragraph:


Another kind of loss of liberty comes when our tax dollars are spent on useless programs.

A nice setup: this is the Wall Street Journal editorial page, so we expect every reasonable paragraph to be followed with "...and taxes are too high."  At this point, you think he's going to run straight up the middle, accuse the Obama administration of malfeasance for funding a hugely bloated program, and suggest that, in the name of liberty, freedom, and apple pie, we should cut the whole damn thing, and return the money to the oil companies they took it from.  But that's exactly when he feints to the left!


The biggest problem, then, with metadata surveillance may simply be that the wrong agencies are in charge of it. One particular reason why this matters is that the potential of metadata surveillance might actually be quite large but is being squandered by secret agencies whose narrow interest is only looking for terrorists.

I know what you're saying: "This might be silly, but it's not patently asinine."  True.  But Jenkins is just getting started.  His prime, leading example of an important problem plaguing everyday Americans that can be solved by giving MOAR METADATA to every law enforcement agency in the country?  Wait for it...highway serial killers.

Highway serial killers are enough of a problem that the FBI formed a task force devoted to them, its Highway Serial Killers Initiatives. Instead of finding a suspect and trying to tie him to bodies, could metadata help us quickly find suspects based on the locations of bodies?

How would this Magical Metadata help us solve this scourge of highway serial killers?  Who knows?  Jenkins doesn't even offer a theory.  It's not clear that he even knows what the word "metadata" means, apart from "something the government wants very, very badly, so we should give it to them."  Other scourges worth handing the government the entire list of people you have ever called?  Anticipating traffic jams.  At some point, he has apparently shifted from talking about phone call metadata to some other kind of data entirely, but it's not clear when that happened.  He still must be talking about something related to government surveillance data, because otherwise, it would sound like he is simply advocating using data to try to predict traffic jams, an idea about as original as using data to try to predict the weather.  He seems to be under the misapprehension that prepending the term meta- to these observations transforms them from "stringing together this week's trending twitter topics into barely coherent sentences" to "stunningly original insight."

It's all downhill from there.  He marches straight off into stupid-town, saying things like, "'Big data' is only as good as the algorithms used to find out things worth finding out," and "Our guess is that big data techniques would pop up way too many false positives at first, and only considerable learning and practice would allow such techniques to become a useful tool," which seems simultaneously ignorant and vapid.  But that's just Jenkins pulling your leg again.  Here, he's making fun of the history of the government sinking vast sums of money into black box "algorithms" to look for terrorism and then trying to bury it.  Surely, he couldn't have written such an article without having familiarized himself with the topic in question.  For that matter, why does the WSJ editorial board need to make "guesses"?  Why not let someone knowledgeable write the column?

But he saves the real zinger for the end: after trumpeting highway serial killers as proof that we need more people knowing when you called your grandma, he caps it off with this:

Most of all, it would allow these techniques to be put to work on solving problems that are actual problems for most Americans, which terrorism isn't.


Thank you, sir.  Your contributions to the art of satire will be duly noted for generations hence.

Monday, July 22, 2013

Test Data

Quora has a super-handy post for data nerds listing big open source data sets you can use for testing:
http://www.quora.com/Big-Data/What-kinds-of-large-datasets-open-to-the-public-do-you-analyze-the-mostly

I would add to that the Enron e-mail corpus, which is great for testing anything against e-mails:
http://www.cs.cmu.edu/~enron/

There are "only" about 600,000 e-mails in total, and of those, only about 300,000 are unique, but for an unstructured data set, it's the gold standard.
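If you want a quick way to start poking at it, here's a minimal sketch (Python, standard library only) that walks an extracted copy of the corpus and tallies the most prolific senders.  The maildir path is just my assumption about where you unpacked the tarball; point it at wherever your copy actually lives.

# Minimal sketch: walk a local, extracted copy of the Enron corpus and parse
# each message with Python's standard email library. The "maildir" location is
# an assumption about where the tarball was unpacked; adjust as needed.
from pathlib import Path
from email import policy
from email.parser import BytesParser

CORPUS_ROOT = Path("maildir")  # assumed path to the extracted corpus

def iter_messages(root):
    """Yield a parsed message for every readable file under the corpus root."""
    parser = BytesParser(policy=policy.default)
    for path in root.rglob("*"):
        if not path.is_file():
            continue
        try:
            with path.open("rb") as f:
                yield parser.parse(f)
        except Exception:
            continue  # a few files are malformed; skip them rather than crash

if __name__ == "__main__":
    senders = {}
    for msg in iter_messages(CORPUS_ROOT):
        senders[msg["From"]] = senders.get(msg["From"], 0) + 1
    for sender, count in sorted(senders.items(), key=lambda kv: -kv[1])[:10]:
        print(count, sender)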

Update:
Another useful list of large unstructured data sets:
http://www.quora.com/Where-can-I-get-large-corpora-open-to-the-public

On Metadata and the Right to Privacy

One of the interesting points raised by the NSA call data warehousing disclosures is whether the collection of metadata about phone calls is materially different from the collection of the actual content of the calls themselves, and if so, how.  It's easy to see why you wouldn't want people listening in on your calls.  Knowing who you called and when is obviously interesting and important, but it seems to fit into the theoretical framework somewhat differently.  I'm trying to get a better understanding of the Right to Privacy in the abstract, because it's a tricky beast; for instance, is the Right to Privacy a fundamental human right?  Is it constitutionally supported?  What harm comes to someone who is being surveilled without their knowledge?  If there's no harm, and they never find out, what's the basis for the claimed right, and what's actually being infringed?  I've been working through one of the seminal works on the right to privacy in the US, Warren and Brandeis' Harvard Law Review article The Right to Privacy, and interestingly, it addresses the question of metadata somewhat directly.  Warren and Brandeis propose that the right to privacy doesn't actually derive from any damages that we incur due to having our privacy breached, but instead derives from something akin to intellectual property rights.  They discuss the publication of information about a collection of gems, as opposed to the sale of the gems themselves:

Suppose a man has a collection of gems or curiosities which he keeps private: it would hardly be contended that any person could publish a catalogue of them, and yet the articles enumerated are certainly not intellectual property in the legal sense, any more than a collection of stoves or of chairs...To deprive a man of the potential profits to be realized by publishing a catalogue of his gems cannot per se be a wrong to him. The possibility of future profits is not a right of property which the law ordinarily recognizes; it must, therefore, be an infraction of other rights which constitutes the wrongful act, and that infraction is equally wrongful, whether its results are to forestall the profits that the individual himself might secure by giving the matter a publicity obnoxious to him, or to gain an advantage at the expense of his mental pain and suffering. If the fiction of property in a narrow sense must be preserved, it is still true that the end accomplished by the gossip-monger is attained by the use of that which is another's, the facts relating to his private life, which he has seen fit to keep private. Lord Cottenham stated that a man "is entitled to be protected in the exclusive use and enjoyment of that which is exclusively his..."

One of the downsides of this argument, as applied to government intrusions into privacy, is that it is very firmly based in 19th-century notions of privacy from the paparazzi press of the era, and, as Warren and Brandeis put it, "the right to be let alone."  If the government is collecting data and keeping it secret, such concerns evaporate.  But the interesting point is that, in their view, no damage needs to be shown to have been suffered for a violation of privacy to matter; the information I care about is mine, and there's no need to quantify its value in order to show that you misused it.  It's just as if you took a used paper coffee cup from me without my permission: it's theft, irrespective of the cup's value.

In the Big Data era, we are constantly sloughing off information in a myriad of ways: see, for instance, this article about the collection of information on shoppers' habits based on their mobile devices' WiFi signals.  (The shopper outrage is a bit hard to grok; you have a device in your pocket which is constantly broadcasting, and it's not doing it on its own: you had to buy it and turn the WiFi on.  I think if people want to buy and use technology they don't understand, they're going to run this risk, and should take some personal responsibility.)  Warren and Brandeis seem to imagine a penumbra of information that follows us around: within a certain sphere, that information belongs to us, but outside of it, it is fair game (much like Rabbi Jeremiah's fledgling bird).  But with the amount of data we throw off increasing in both range and volume, does the sphere stay the same size, or do our expectations about what's fair game simply have to change?

Thursday, July 11, 2013

Theory Thursday: Computing Really Cheaply

How cheaply can you run a computer?  How much energy does it require to run a computation?  The big fans in last week's Theory Thursday suggest: a lot.  But in the 1960s, when everybody else was trying to figure out how quickly we could compute, Rolf Landauer was at IBM trying to figure out how slowly you could compute.  It had been assumed, up until that point, that computation cost energy: if you wanted to take a bunch of numbers and figure something out, you had to put in some energy to get an answer out.  And that energy was lost to entropy (i.e., heat).  Those fans in last week's Theory Thursday were carrying all of that heat away.  From an information theory standpoint, imagine a basic logic circuit, such as an AND gate:

The gate takes in two bits (A and B) and has a single output (O).  That means that we lose a bit of information in this transformation, which has to be released as entropy (based on last week's energy-information equivalence).  But what if you could perform computations using only gates that looked like this:


These gates all take two input bits, and produce two output bits.  Landauer, Ed Fredkin, Tommaso Toffoli, and others not only demonstrated that computation built entirely from reversible gates can be Turing complete, but also envisioned a hypothetical "pinball machine" model of computing: bits are represented by spheres, and gate operations are carried out by arrangements of walls and elastic collisions between spheres:



Since the collisions are assumed to be completely elastic, the gates don't use any energy, and the computation is lossless.
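To make the information-loss point concrete, here's a tiny Python sketch contrasting the AND gate above with the controlled-NOT (CNOT) gate, which I'm using here as a standard example of a two-in, two-out reversible gate (CNOT by itself isn't enough to build a whole computer; for that you need three-bit gates like Toffoli's, but it illustrates the key property): the reversible gate's outputs uniquely determine its inputs, so nothing ever has to be erased.

def and_gate(a, b):
    return a & b        # two bits in, one bit out: (0,0), (0,1) and (1,0) all map to 0

def cnot(a, b):
    return a, a ^ b     # two bits in, two bits out: flip the second bit exactly when a == 1

if __name__ == "__main__":
    inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]

    # Four distinct inputs collapse to two distinct outputs: a bit of information is lost.
    print("AND :", {pair: and_gate(*pair) for pair in inputs})

    # Four distinct inputs give four distinct outputs: the mapping is invertible.
    print("CNOT:", {pair: cnot(*pair) for pair in inputs})

    # CNOT happens to be its own inverse: applying it twice recovers the input,
    # which is exactly what it means for a gate to be reversible.
    assert all(cnot(*cnot(a, b)) == (a, b) for a, b in inputs)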

Even more interestingly, Landauer's IBM colleague Charles Bennett later showed that, as long as the logic itself is reversible, computations can be carried out with energy expenditures approaching zero, provided you're willing to wait long enough for the answer.  This is what is called "reversible computing," since it depends on reversible reactions, in the thermodynamic sense.  Without getting too deep into the thermodynamics, this is somewhat analogous to running a marathon: the faster you want to get there, the more total energy you have to expend, because humans aren't "reversible".  It takes more energy to run from point A to point B than to walk there.  Bennett imagined theoretical molecular computers that depend on reactions which can be biased from reversibility towards irreversibility by applying more energy, making them run faster, at higher expense, or, conversely, run at arbitrarily low energy, for arbitrarily long compute times.
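To get a feel for that tradeoff, here's a back-of-the-envelope toy model in Python (my own rough sketch, not anything lifted from Landauer's or Bennett's papers): drive each step of the computation forward with a small free-energy drop dE, and detailed balance makes the backward hop rate smaller than the forward rate by a factor of exp(-dE/kT).  The dissipation per completed step is just dE, so as dE shrinks toward zero, the energy cost per step vanishes, but so does the net forward speed.

import math

K_B = 1.380649e-23   # Boltzmann constant, J/K
T = 300.0            # roomish temperature, K
kT = K_B * T

# Toy model: each computational step is a thermally driven hop, biased forward
# by a free-energy drop dE. Net forward speed (in units of the bare hop rate)
# is 1 - exp(-dE/kT); dissipation per completed step is simply dE.
for multiple in (10, 3, 1, 0.3, 0.1, 0.01):
    dE = multiple * kT
    net_speed = 1 - math.exp(-dE / kT)
    print(f"dE = {multiple:>5} kT   net speed = {net_speed:6.3f} of max   "
          f"dissipation = {dE:.2e} J per step")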

And that's your Theory Thursday for this week!

Wednesday, July 10, 2013

Won't Somebody Please Think of the Poor Financial Institutions?

Oh noes, more government intrusion!  The CFPB is collecting "anonymous information about at least 10 million" consumer credit card transactions.  Luckily, the Liberty Sleuths(tm) at Breitbart are already sharpening their pitchforks to save us.  The financial services industry is right to be outraged at the burden this imposes on them, as well as the security and privacy risks.  On the other hand, if you would like to follow them into this dark alley over here, they have a fine selection of wares that they would be happy to part with for a small fee.

Wednesday, July 3, 2013

Theory Thursday: Knowledge is Power

Welcome to the first installment of Theory Thursday!  (The first installment is coming to you on Wednesday, because Thursday is the 4th of July.  Give yourself a moment to adjust to the cognitive dissonance, then continue reading.)  Sometimes, when considering the world of Big Data, it is instructive to consider the world of Very Very Small Data: how do the bits of information around us add up to the terabytes on our hard drives and, more importantly, what constraints does the physical world place on us?  Once upon a time, I was a Warrior of the Physicist Clan, and it was my clan's destiny to gaze to the heavens and to the minutiae, and ask what the cosmic rules were that bind the two together.  I hope to share some of that wonder with you here.

So, let us start with the mundane, and proceed from there to the sublime: You are the operations manager for a thriving startup in San Francisco, and as such, figuring out how to heat the building on those frigid June mornings is of paramount importance.  You conceive of a grand plan to leverage Big Data to heat the building:
Step One: Start with information about the location and velocity of every air molecule in the building.
Step Two: Stand by the front door.  When a colder-than-average molecule approaches the door from the inside, open the door very quickly to let it out, and then shut the door again.  When a warmer-than-average molecule approaches from the outside, do the same to let it in.
Step Three: Profit!  You have just raised the temperature of the building using nothing more than Big Data!  The devs are happy in their shorts and flip flops, and you have saved the company huge amounts of money on heating bills, allowing them to buy Red Bull and paleo snack packs for the micro-kitchen for another three months before they burn through the rest of their funding.
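Here is a toy numerical sketch of that door-keeping scheme, with every number invented for illustration (this is bookkeeping, not real gas dynamics): each "molecule" is just a speed, and the mean speed of the air inside stands in for its temperature.

import random

random.seed(1)

# Made-up molecular speeds in m/s; the outside air starts slightly colder.
inside = [random.gauss(500, 100) for _ in range(2_000)]
outside = [random.gauss(480, 100) for _ in range(2_000)]

def mean(xs):
    return sum(xs) / len(xs)

print(f"before: inside {mean(inside):.1f} m/s   outside {mean(outside):.1f} m/s")

for _ in range(5_000):
    threshold = mean(inside)  # the doorman's notion of an "average" molecule right now
    if random.random() < 0.5 and inside:
        # A molecule approaches the door from the inside: open up only if it's cold.
        i = random.randrange(len(inside))
        if inside[i] < threshold:
            outside.append(inside.pop(i))
    elif outside:
        # A molecule approaches from the outside: open up only if it's warm.
        j = random.randrange(len(outside))
        if outside[j] > threshold:
            inside.append(outside.pop(j))

print(f"after:  inside {mean(inside):.1f} m/s   outside {mean(outside):.1f} m/s")

The air inside gets warmer without anyone burning any fuel, which is exactly the puzzle.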

This little thought experiment is often called "Maxwell's Demon", because of its original publication by James Clerk Maxwell in 1871.  It was raised as a paradox, of sorts, because it appears to violate the 2nd Law of Thermodynamics, which states that the entropy of a closed system never decreases.  The demon effectively takes a disordered system, and sorts it into an ordered system of "hot molecules" inside, and "cold molecules" outside, making it appear to violate the 2nd Law.  However, there's a way out of the apparent paradox.  Notice what the demon started with: information about the location and velocity of every air molecule in the building.  The demon converted pure information into pure energy.  This is because information is interchangeable with energy.  When Claude Shannon coined the concept of "information entropy", he did so understanding full well the implications: that the two concepts are, in fact, one and the same.  Entropy, the thermodynamic concept, is simply a measurement of the number of accessible states that a system can attain; a molecular chain with 20 links can have more possible configurations than one with 10, by an exponential amount, and hence has far higher entropy.  By the exact same token, a computer with 20 bits of storage can store an exponentially larger number of possible binary strings than one with only 10 bits, leading us to speak of it having a higher capacity for entropy.  The measurement of entropy is simply an accounting trick, and it's the same for computers as it is for molecules.
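That accounting is simple enough to spell out in a few lines of Python: the maximum entropy of n two-state things (bits in a register, links in a chain) is log2 of the number of accessible states, i.e. n bits, and Shannon's formula gives exactly that number back when every state is equally likely.

import math

def shannon_entropy_bits(probs):
    """H = -sum(p * log2(p)), in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# n two-state units have 2**n accessible states; maximum entropy is log2 of that count.
for n in (10, 20):
    states = 2 ** n
    print(f"{n} units -> {states:>9,} states -> max entropy {math.log2(states):.0f} bits")

# Shannon's formula agrees: a uniform distribution over 2**20 states carries
# exactly 20 bits of entropy -- the same accounting whether those states live
# in a molecule or on a hard drive.
uniform = [1 / 2 ** 20] * 2 ** 20
print(f"uniform over 2**20 states: {shannon_entropy_bits(uniform):.1f} bits")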

So, where's the resolution in the demon paradox?  There are many subtleties that have arisen over the years about how the demon makes measurements, and whether it can open the door in a way that doesn't waste more energy than it captures, but fundamentally, we can understand the answer by realizing that we forgot about a step:

Step Zero: Measure the location and velocity of every air molecule in the building.

This step, gathering the information used in Step One, requires us to expend energy to measure (and record!) the location and velocity of every molecule.  In essence, we are converting that energy into information, and then using that information to convert it back into energy (in the form of building heat).  The 2nd Law tells us that we will never be able to do this in a way that gets more energy out of Step Two than we put in at Step Zero, because in each of these conversion steps, we create excess entropy, and lose energy, driving the universe that much closer to heat death.
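How lopsided is that exchange?  Here's a back-of-the-envelope check in Python, with every building-specific number (volume, target warm-up, bits recorded per molecule) invented purely for illustration.  Even if we charge only the Landauer minimum of kT·ln(2) per bit of demon memory (the unavoidable cost of eventually wiping those records), and nothing at all for operating the door, Step Zero costs orders of magnitude more energy than the heat we'd gain in Step Two.

import math

K_B = 1.380649e-23              # Boltzmann constant, J/K
T = 300.0                       # K
LANDAUER_J_PER_BIT = K_B * T * math.log(2)   # minimum cost per bit of record that must eventually be erased

# Assumptions, all invented for illustration:
BUILDING_VOLUME_M3 = 1_000      # a smallish office building
BITS_PER_MOLECULE = 100         # position + velocity at some finite precision
DELTA_T = 5.0                   # how much warmer we want the devs, in K

# Rough physical properties of air near room temperature and 1 atm:
AIR_NUMBER_DENSITY = 2.5e25     # molecules per m^3
AIR_MASS_DENSITY = 1.2          # kg per m^3
AIR_SPECIFIC_HEAT = 1005.0      # J per kg per K

molecules = AIR_NUMBER_DENSITY * BUILDING_VOLUME_M3
bookkeeping_cost = molecules * BITS_PER_MOLECULE * LANDAUER_J_PER_BIT
heating_payoff = AIR_MASS_DENSITY * BUILDING_VOLUME_M3 * AIR_SPECIFIC_HEAT * DELTA_T

print(f"molecules to track        : {molecules:.1e}")
print(f"Landauer cost of Step Zero: {bookkeeping_cost:.1e} J")
print(f"heat gained in Step Two   : {heating_payoff:.1e} J")
print(f"cost / payoff             : {bookkeeping_cost / heating_payoff:,.0f}x")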

This relationship between energy and information is as fundamental as it is befuddling, and we'll revisit it many times on Theory Thursday.  Compare, for instance, the heat sink and fan for a 50 MHz CPU:
With the equivalent hardware for cooling a 4 GHz CPU:

The latter crunches numbers at 80 times the clock speed of the former, and that is exactly why it needs so much more cooling: processing information generates entropy, which must be dissipated in the form of heat...or does it?  We'll revisit this topic when we discuss reversible computing, or, How to Compute On The Cheap (If You Don't Mind Waiting For Your Answers).

Monday, July 1, 2013