Author Archive

Tropical rank in the snow

Friday, February 22nd, 2013

This week I have been at the Dagstuhl castle in Germany for a workshop on “Communication complexity, linear optimization, and lower bounds for the nonnegative rank of matrices.”  I always like coming to Dagstuhl, and this time it has been especially nice meeting many new people from the neighboring communities of combinatorial optimization and matrix theory.

A tradition of the Computational Complexity blog, the grand-daddy of the TCS blogs, is to post the Dagstuhl ”nerd shot“—the Dagstuhl group photo. So here you go.

The nerd shot

(more…)

Reduce and Repeat

Wednesday, September 26th, 2012

A favorite open problem recently returned to my head. For every boolean function \(f:\{0,1\}^n \rightarrow \{0,1\}\) is the deterministic query complexity at most the square of the bounded-error quantum query complexity, that is \(D(f) \le Q(f)^2\)?  We know a function where this square gap is realized, the OR function which Grover’s algorithm can solve in \(O(\sqrt{n})\) quantum queries but for which deterministically in the worst case you have to query the entire input to tell if there is a 1.  We also know that for partial functions (functions not defined over all of \(\{0,1\}^n\)) quantum and randomized query complexity can be exponentially far apart, for example for Simon’s problem.

Currently we know that \(D(f) \le Q(f)^6\) for total functions, a relation which has seen no improvement since the classic 1998 paper of Beals et al. that first showed a polynomial relationship.  It is time for some improvement, no?

Every time this problem comes back to me, I end up spending a couple of days on it.  The proof of this result, and similar statements, all use a well-defined proof paradigm we might call  reduce and repeat.  It always seems that if one can simply apply reduce and repeat with some other complexity measures, in a smarter way, improvement will follow.

You might use reduce and repeat when you lose your keys somewhere in your house: Pick a room you have not searched yet, and look there. If you find your keys, well, then you are done. If not, you have reduced the number of rooms remaining to search by one. So if searching any room takes you at most 10 minutes you will know that in time at most #rooms*10 minutes you will have found your keys or found that they are lost.  Well, progress is progress.

The algorithm for deterministic query complexity is similar, except that instead of searching a room we make a few queries, and instead of looking for keys, we are looking for a certificate that the function value is 1 (defined formally below).  If the round of queries reveals a one certificate, then we know the function value is 1.  If they don’t, then the reduce step argues that the new function restricted to the query values we have seen is closer to the constant function.

As there can be exponential gaps for partial functions, we must somewhere use the fact that the function is total.  Many potential proof ideas are spoiled by asking: “Where am I making use of the fact that the function is total?”  Thus we will pay careful attention in these proofs to how totality is used.

(more…)

Revealing the impossible

Monday, August 27th, 2012

“Is the task impossible or does it only seem that way? The beauty is: Nobody knows.”

Someone talking about solving the Traveling Salesman Problem in polynomial time? No, though I wouldn’t blame you for thinking that. This is from a new book of wire walker extraordinaire Philippe Petit, “Cheating the Impossible: Ideas and recipes from a rebellious high-wire artist“. It is a short TED book, available as an Amazon single. Here is the related TED talk, though I found the book better than the talk (and does not take much longer to read).

(more…)

The non-negative rank of Euclidean distance matrices

Thursday, August 9th, 2012

Here is the open question we mentioned a few weeks back.

Question: Consider an \(n\)-by-\(n\) matrix \(M_n\) where \(M_n(i,j) = (i-j)^2\) for \(1 \le i,j \le n\).  What is \(\mathrm{rk}_+(M_n)\) the non-negative rank of \(M_n\)?

Note that the normal rank of \(M_n\) is \(3\) for any \(n\).  It was conjectured (and even claimed to be proven) that the nonnegative rank of \(M_n\) is \(\Omega(n)\).  Pavel Hrubeš recently (article behind paywall) gave a surprising answer.

Answer (Hrubeš): The non-negative rank of \(M_n\) is \(O(log n)\).

This is tight by results of Beasley and Laffey (article behind paywall).

Pavel’s proof is quite simple and elegant—simple enough to go through here. It is a proof by induction.  In the base case, note that the rank of \(M_1\) is \(1\). For the induction step he shows that \(\mathrm{rk}_+(M_{2n}) \le \mathrm{rk}_+(M_n) + 2\).  This will give the theorem.

How this works is best illustrated by looking at a small example.

\[
M_6 =
\begin{bmatrix}
0&1&4&9&16&25\\
1&0&1&4&9&16\\
4&1&0&1&4&9\\
9&4&1&0&1&4\\
16&9&4&1&0&1\\
25&16&9&4&1&0
\end{bmatrix}
\]

Note that this matrix has a lot of structure.  It is constant on diagonals, and symmetric. Moreover, the top left \(3\)-by-\(3\) block and bottom right \(3\)-by-\(3\) blocks are the same and are equal to \(M_3\).

Now we are going to rearrange the rows and columns. This will not change the nonnegative rank. Instead of labeling the rows and columns by \(1,2,3,4,5,6\) we will label them as \(1,2,3,6,5,4\). This gives the new matrix \(R_6\) (think of \(R\) as rearranged) where
\[
R_6 =
\begin{bmatrix}
0&1&4&25&16&9\\
1&0&1&16&9&4\\
4&1&0&9&4&1\\
25&16&9&0&1&4\\
16&9&4&1&0&1\\
9&4&1&4&1&0
\end{bmatrix}
\]

If we write this in terms of \(3\)-by-\(3\) blocks this is
\[
R_6 =
\begin{bmatrix}
M_3 & B_3 \\
B_3 & M_3
\end{bmatrix}
\]
Moreover, if you look at it, \(B_3\) is entrywise larger than \(M_3\). Thus we can express \(B_3 = M_3 + C_3\) where \(C_3\) is non-negative and hopefully has small non-negative rank! It turns out that in general \(C_n=B_n-M_n\) is rank one (and so is non-negative rank one as well).  Let’s check that in our small example.

\[
C_3=
\begin{bmatrix}
25 & 15 & 5 \\
15 & 9 & 3 \\
5 & 3 & 1
\end{bmatrix}
= \begin{bmatrix}
5 \\ 3 \\ 1
\end{bmatrix}
\begin{bmatrix}
5 & 3 & 1
\end{bmatrix}
\]

Thus we can write
\[
R_6 = \begin{bmatrix}
M_3 & M_3 \\
M_3 & M_3
\end{bmatrix}
+
\begin{bmatrix}
0 & C_3 \\
C_3 & 0
\end{bmatrix}
\]
where the first matrix has non-negative rank \(\mathrm{rk}_+(M_3)\) and the second has non-negative rank \(2\). This gives \(\mathrm{rk}_+(M_6)=\mathrm{rk}_+(R_6) \le \mathrm{rk}_+(M_3) +2\) as we wanted to show.

The general case proceeds in the same fashion.  But now let me explain things in a slightly different, more geometrical way.  We want to express all the entries of \(M_{2n}\) in terms of entries of \(M_n\), plus possibly some correction. To do this, let us define a mapping \(f: \{1, \ldots, 2n\} \rightarrow \{1, \ldots, n\}\). Then we want to write \((i-j)^2 = (f(i) – f(j))^2 + c_{i,j}\) where the corrections \(c_{i,j}\) hopefully form a matrix of low non-negative rank (actually rank one).

Imagine the numbers \(1, \ldots, n\) laid out on the \(x\)-axis. The mapping \(f\) will be conditional reflection in the line \(x=n+1/2\).  That is, if \(1 \le i \le n\) then \(f(i)=i\) and if \(i > n+1/2\) then \(f(i)= i – 2(i-(n+1/2))=2n+1-i\).

In the example where \(n=6\) we conditionally reflect in the line \(x=3.5\). Note that this sends \(4 \rightarrow 3, 5 \rightarrow 2, 6 \rightarrow 1\).  This is what the rearrangement of the rows and columns above did—we put column \(i > n+1/2\) into position \(n+f(i)\).

Clearly, if \(1 \le i,j \le n\) then \((i-j)^2 = (f(i)-f(j))^2\) and also as reflection preserves distance if \(n+1/2 < i,j \le 2n\) then \((i-j)^2 = (f(i)-f(j))^2\). This is why the two diagonal blocks of \(R_6\) are equal to \(M_3\).

The interesting case is when, say, \(i \le n\) and \(j > n+1/2\).  In this case, you can check that \(|f(i)-f(j)| \le |i – j|\), so the points get closer together. This means that \((i-j)^2 = (f(i) – f(j))^2 + c_{i,j}\) where \(c_{i,j}\) is non-negative. A good start.

What is \(c_{i,j}\)? Let’s do the computation. We are going to use the fact that \((a-b)^2 – (a+b)^2 = -4 ab\).

\[\begin{align*}
c_{i,j} &= (i-j)^2 - (i-(2n+1-j))^2 \\
&= ((i - (n+1/2)) - (j - (n+1/2)))^2 -
((i - (n+1/2)) + (j - (n+1/2)))^2 \\
&= -4 (i- (n+1/2)) (j - (n+1/2)) \\
&= 4 (n+1/2 -i) (j-(n+1/2)).
\end{align*} \]

Remember that \(i < n+1/2\) so now we get that \(c_{i,j}\) is the product of two positive numbers one which only depends on \(i\) and the other which only depends on \(j\) — thus the matrix \(C(i,j)=c_{i,j}\) is rank one!

That’s the end of the proof.  I think there is more potential in this proof technique.  Essentially the same proof idea is used by Fiorini, Rothvoß, and Tiwary to show that the non-negative rank of the slack matrix of a regular \(n\)-vertex polygon is \(O(log n)\).  This has apparently been known for some time—the authors say it is implicit in the work of Ben-Tal and Nemirovski from 1999, though I cannot see this.  The Fiorini et al. paper is highly readable and gives an explicit non-negative factorization of the slack matrix very similar in spirit to the above proof.

What is the slack matrix of an \(n\)-vertex polygon? It is an \(n\)-by-\(n\) matrix with rows labeled by vertices and columns labeled by faces. The \((i,j)\) entry is the distance of the \(i\)th vertex to the \(j\)th face. The same idea of conditional reflection can be used to map the vertices on one half of the polygon to those on the other half, and the same for faces.  This sets up the same kind of inductive proof, where the correction term again turns out to be rank one.

Now the question of the non-negative rank of the “standard” linear Euclidean distance matrix has been resolved.  Are there linear Euclidean distance matrices that have non-negative rank \(\Omega(n)\)?  In general, a linear Euclidean distance matrix is of the form \(M(i,j) = (a_i-a_j)^2\) for some numbers \(a_1, \ldots, a_n\). Pavel’s proof shows that if \(a_1, \ldots, a_n \ge 0\) are contained in an arithmetic progression of length \(m\) then the non-negative rank of \(M\) is at most \(\log(m)\).  So if a linear Euclidean distance matrix is going to have large non-negative rank it is going to have to contain large numbers.  I spoke with Pavel about this and he suggested the matrix where formed by \(a_i=2^{2^i}\) as an interesting candidate to have large non-negative rank.

The research rollercoaster

Wednesday, August 8th, 2012

In the last post, I mentioned an open problem. Last week, I solved it.

I had the idea of something to try for a few days and finally got around to it on Monday. I didn’t think it would work, but thought it would be interesting to see why it failed. By late afternoon, nothing had failed and I became very excited.

For me, this is one of the best moments in research. It is a walk along a cliff edge—the exhilirating height of discovery and danger that at any moment a hole can appear that sends you and your proof tumbling down. Sometimes I just sit on the cliff and enjoy the view—purposefully delaying writing up the proof because I know it can all come crashing down, and I want to prolong those moments of joy a little while longer.

In this case, the initial verifications passed. I worked out a small example that also checked out. I was becoming fairly convinced.

I thought about canceling my plans for the evening to write up the proof straight away. But again I thought why not wait and enjoy these moments of new proof a little while longer.

I got back home around 11pm and started in on the proof, typing everything up. As darkness lifted in the early morning hours, the proof was done and there were no bugs. For this problem, I could write a program to check the algorithm, which also worked. At this point all doubts had evaporated.

It was late. I was tired. I was just putting the final touches on the writeup, adding some references so I could send it to some colleagues.  I was googling to get the bibliographical data. Then a paper appeared in the list of search results that I had not seen before.  I clicked on it. It solved the same problem! I went to bed depressed.

That is the research rollercoaster.

Limitations of the linear programming approach to TSP

Monday, July 9th, 2012

In a previous post, we solved a small instance of the traveling salesman problem (TSP) around Singapore’s hawker centers using linear programming.  Today, we will talk about limitations of the linear programming approach to TSP, namely a recent result showing a super-polynomial lower bound on the number of linear constraints needed to characterize the set of valid tours.  The paper is “Linear vs. Semidefinite Extended Formulations: Exponential Separation and Strong Lower Bounds” by Fiorini, Massar, Pokutta, Tiwary, and de Wolf (hereafter referred to as FMPTW) and shared the best paper award at the STOC 2012 conference.

A key tool in this lower bound is another concept we have discussed on this blog, the non-negative rank.  Believe it or not, those posts were written to prepare for this one!  Even so, this post will be more technical than usual.  Before we get started, let me thank Ronald de Wolf for his comments on this post, and for explaining this work to me when he visited CQT at the end of last year.

To put the FMPTW work in context, let me first recap the approach we took to finding the shortest tour through Singapore’s hawker centers. We considered variables \(x_{i,j}\) meant to represent the presence or absence of an edge between city \(i\) and \(j\) in a tour. We then brainstormed linear constraints on the \(x_{i,j}\) satisfied by true tours—starting out with basic constraints like all edge weights should be non-negative and the total weight of edges incident to any city should be two. Finding the shortest “tour” with respect to these linear constraints can be done efficiently, as it is a linear program. We saw, however, that the solution was not a true tour but had problems like non-integral weight edges and small cycles.  To combat this, we did several iterations of adding more linear constraints in an attempt to rub out spurious solutions.  A natural question is: when will this process stop? How many linear constraints are needed to describe the set of valid tours? The work of FMPTW gives the first super-polynomial lower bound on the number of linear constraints needed to characterize the set of valid tours: in fact, they show a bound of \(2^{n^{1/4}}\) for tours on \(n\) cities. (Ronald tells me they have now improved this to \(2^{\sqrt{n}}\).)

I want to clear up one thing that might be confusing with this introduction to their result. In the hawker center problem, we did not develop a set of linear constraints to characterize all valid tours on 18 cities—we just needed to eliminate false solutions near the optimal tour for our particular problem. To do this, we chose the constraints adaptively, meaning that when the linear program returned a spurious solution, we scratched our heads and identified a fault of this solution (like having a small cycle) and added a linear constraint to eliminate this fault.  For a different configuration of hawker centers, our
final linear program might still return a spurious solution. The FMPTW result, on the other hand, speaks about the number of linear inequalities needed to characterize the set of valid tours. Such a characterization is not dependent on the objective function (the configuration of the cities), and so would universally allow you to solve a TSP problem by optimizing with respect to these inequalities.

The paper actually deals with more general linear programming formulations called extended formulations that allow the introduction of extra variables. In addition to the variable \(x \in R^{n \choose 2}\) whose \((i,j)\) entry represents the “strength” of the edge between cities \(i\) and \(j\) in a tour
(and ideally would be either 0 or 1), we are now also allowed some extra variables \(y \in R^k\) that don’t enter in the objective function.  An extended formulation for TSP expresses the convex hull of the set of valid tours by the set \(\{x : \exists y \ge 0 : Ax +By = b\}\). The size of an extended formulation is the number of constraints, that is the number of rows in \(A\) and \(B\).

While it may be surprising that allowing extra variables can make a difference, it can. For example, in solving the hawker tour problem a key role was played by subtour inequalities to eliminate small cycles—inequalities of the form \(\sum_{i<j \in S} x_{i,j} \le |S|-1\) for a set \(S \subset [n]\). We did not add all the subtour inequalities, though, because there are exponentially many! The extended formulation size of the linear programming relaxation of TSP with all subtour inequalities, however, is still polynomial, in fact \(O(n^3)\).

The connection between extended formulation size and non-negative rank goes back to a beautiful result by Yannakakis over 20 years ago. We first need a couple of definitions. For our purposes a (convex) polytope \(P\) is a bounded set defined by linear inequalities \(P = \{x : Ax \le b\}\).  A slack matrix \(S\) for \(P\) is a matrix with rows labeled by constraints (i.e. rows of \(A\), which we denote by \(A_i\) for the \(i\)th row) and columns labeled by points in \(P\) whose convex hull is all of \(P\). The \((i,j)\) entry corresponding to a row labeled by \(A_i\) and column by a point \(v_j\) is \(S(i,j)=b_i – A_i v_j\).  That is, the \((i,j)\) entry of \(S\) indicates the slack of \(v_j\) with respect to the \(i^{th}\) constraint.  As \(v_j \in P\) it must be the case that \(A_iv_j \le b_i\), so all the entries of the slack matrix are non-negative.  An entry is \(0\) exactly when \(v_j\) satisfies the corresponding constraint with equality.

Note that there can be many different slack matrices for a polytope \(P\) as we have some freedom in choosing \(A,b\) and the points in \(P\). The extension complexity of \(P\) is the size of a smallest extended formulation defining \(P\). Yannakakis showed the following.

Theorem 1 (Yannakakis) Let \(P\) be a polytope. For any slack matrix \(S\) of \(P\), the non-negative rank of \(S\) is equal to the extension complexity of \(P\).

Thus in particular the non-negative rank of all slack matrices of \(P\) is the same.

Showing lower bounds on non-negative rank is very difficult, and we have few strong lower bounds for concrete examples. For example, even the following is open:

Question 2: Consider an \(n\)-by-\(n\) matrix \(M\) where \(M(i,j) = (i-j)^2\) for \(i,j \in [n]\). The rank of \(M\) is \(3\). What is the non-negative rank of \(M\)?

It is conjectured that the non-negative rank is \(\Omega(n)\), yet the best lower bound is \(\Omega(\log n)\).

One of the most interesting lower bounds on non-negative rank is for the unique disjointness matrix
from communication complexity. Consider a \(2^n\)-by-\(2^n\) matrix \(M\) indexed by \(n\)-bit strings. If \(x \cap y = \emptyset\) then \(M(x,y)=1\). If \(|x \cap y| =1\) then \(M(x,y)=0\). No matter how you fill out the rest of \(M\) with non-negative numbers, the non-negative rank will be \(2^{\Omega(n)}\). This lower bound follows from the key lemma in Razborov’s \(\Omega(n)\) lower bound on the randomized communication complexity of the disjointness function.

As unique disjointness provides a non-trivial family of matrices for which there is strong lower bound on the non-negative rank, it would be great to embed it in the slack matrix for an interesting combinatorial problem. This is exactly what FMPTW do.

Instead of directly working with TSP, they instead work with the correlation polytope, the convex hull of \(\{ bb^t : b \in \{0,1\}^n\}\). While not as famous as TSP, this is another important and well studied polytope.  FMPTW provide a reduction to show that lower bounds on the extension complexity of the correlation polytope provide (slightly weaker) lower bounds on extension complexity of the TSP polytope.  We will just focus on showing the lower bound on the extension complexity of the correlation polytope.

Now what matrices can appear as slack matrices of the correlation polytope? Here is a more general, and I think simpler, construction than that given by FMPTW to show a useful class of matrices that are slack matrices for the correlation polytope.

Lemma 3: Let \(p(z) = a + bz+cz^2\) be a single variate degree 2 polynomial that is non-negative on non-negative integers. Consider the matrix \(M(x,y) = p(| x \cap y)|)\) for \(x, y \in \{0,1\}^n\). Then \(M\) is a submatrix of a slack matrix for the correlation polytope.

Proof:
As \(p\) is non-negative on non-negative integers, \(-bz -cz^2 \le a\) is a valid inequality for integers \(z \ge 0\). Note that \(\langle xx^t , yy^t \rangle = | x \cap y|^2\) and \(\langle \mathrm{diag}(x), yy^t \rangle = |x \cap y|\) for \(x, y \in \{0,1\}^n\). Thus \(\langle – b \cdot \mathrm{diag}(x) – c\cdot xx^t, yy^t \rangle \le a\) is a valid inequality, whose slack is \(p(|x \cap y|)\). Note that the columns of \(M\) are labeled by vertices of the correlation polytope \(yy^t\) for \(y \in \{0,1\}^n\) and likewise the constraints are labeled by \(xx^t\) for \(x \in \{0,1\}^n\).

The matrix used by FMPTW is \(F(x,y) = p(|x \cap y|)\) where \(p(z) = (z-1)^2\). By Lemma 3 this matrix is a slack matrix for the correlation polytope and moreover \(F\) is an instance of unique disjointness. Thus we get an exponential lower bound on the non-negative rank of \(F\), and hence the extension complexity of the correlation polytope.

The way that FMPTW came up with this matrix was through—of all things—quantum communication complexity!  That story will have to wait for a future post.

A hawker tour

Monday, March 5th, 2012

My father is coming to visit Singapore next week.  Of course one thing I have to do is take him to sample some of the great food at the hawker centres here.  Say we are very ambitious and want to visit 18 of the best hawker centres on the island…in order to intelligently engage in the raging debate on which stall has the best char kway teow.  As my Dad’s stay is limited, we want the shortest tour visiting all these hawker centres.

Here is a google map of the 18 chosen hawker centres.  What do you think the shortest tour is?

18 of the best Singaporean hawker centres: ABC brickhouse, Amoy St, Changi Village, Chinatown Complex, Chomp Chomp, East Coast Lagoon, Geylang Serai, Ghim Moh, Golden Mile Complex, Hong Lim, Lau Pa Sat, Maxwell Rd, Newton, Old Airport Rd, Tampines Roundhouse, Tekka Centre, Tiong Bahru, Whampoa

(more…)

Irrationality of reviewing

Thursday, February 23rd, 2012

I have to finish reviewing two papers today.  Posting about the irrationality of reviewing seems like a great way to procrastinate.

Why do we review papers?  As pointed out in the boycott against Elsevier, with the big commercial publishers, reviewers are doing free work for very profitable companies.  Could it actually be that we are more likely to do reviews because they are not paid?

In Predictably Irrational, Dan Ariely distinguishes actions we undertake as part of social norms versus market norms.  When we help a friend move a sofa, we would probably be offended if he offered to pay us something at the end—this action takes place in the social realm, not the market realm.  Ariely did experiments (with college students of course) where he found subjects worked harder when asked to do a simple task as a favor to the experimenter than when they were paid some nominal amount.  Similarly, he relates the story of a daycare center that found that the tardiness of parents picking up their kids increased when the center started imposing fines for lateness.  Before the fines, parents avoided being late as a courtesy to the teachers who they were keeping from going home.  With fines this moved to a market interaction, where you could pay to be late.

Reviews largely work in the realm of social norms.  Many review requests I receive are from colleagues I know well, making them hard to turn down.  It is much easier to say no to form letters from people I do not know, and it would be similarly easier to say no if it became a (low) paid job rather than a favor.

While on the topic of reviewing irrationality, I remember an old post of Lance Fortnow that when evaluating results in a paper, sometimes less is more.  He related the story of a student who had a paper rejected from STOC, then removed one of the two main theorems and won best student paper at FOCS.

Being on a pop economics kick, I just finished reading Thinking, Fast and Slow by Daniel Kahneman, who discusses this very “less is more” phenomenon.  This book was very enjoyable, full of interesting examples of how real human behavior differs from that of the Homo Economicus usually assumed by economists.

Kahneman describes an experiment where subjects are asked to put a price on a set of dinnerware.   Set A has 24 plates and bowls.  Set B is set A with the addition of 8 cups and 8 saucers, a few of which are broken.

In single evaluation, where subjects are presented only one of set A or set B, set A was valued at $33 while set B was valued at $23.  In joint evaluation, where both sets are presented simultaneously, subjects acknowledge that set B can only be better than set A, and valued set B at $32 and set A at $30.

I think the implication for paper submission is clear: if in doubt, submit multiple versions and force your reviewers to behave rationally with joint evaluation.

An open revolution

Wednesday, February 8th, 2012

We are in the midst of an open revolution in science. This is a revolution towards doing research in the open, using online tools to share unfinished ideas and allow massive collaboration, but also an open revolt against the traditional model of publishing those finished ideas, in particular against the publishing giant Elsevier.

In his book Reinventing Discovery, Michael Nielsen (co-author of The Book in quantum information) is an enthusiastic chronicler of the rise of open science, and inspiring advocate for its success.  Some of his examples were familiar to me, like the arXiv or the polymath project.  But there were also some neat examples I had not heard of, like the Galaxy Zoo, where anyone can help classify galaxies, FoldIt, where users compete video game style to come up with low-energy configurations of proteins, or Kasparov vs. The World, where 50,000 people democratically played chess against Kasparov, doing much better than expected.

One example I was surprised not to find in the book was the stack exchange family of question and answer websites, mentioned previously on this blog.  Sites like math overflow or cs theory stack exchange, where people can ask research-level questions and expect to get informative answers in a matter of hours, have quickly become very useful research tools.

The end of Nielsen’s book is a call to arms, with suggestions of what scientists, programmers, citizens can do to develop open science, and also some fantasies of what open science might look like in the future.  One such idea I found quite cool was a sort of stack exchange meets eHarmony service.  This tool would automatically pair questions with respondents, based on the respondents expertise and questions they have successfully answered previously.  As Nielsen describes it, every morning a researcher would receive an email with a list of ten requests related to their expertise, which they could then choose to answer or not.

What I personally find the most exciting in the open science revolution is the development of open education.  The Khan academy deserves a lot of credit for innovation here, showing that students often prefer a short video that can be paused and replayed to a live lecture.  They now have over \(10^8\) downloads and 2600 videos.

Sebastian Thrun, who co-taught the Stanford online artificial intelligence class last semester that attracted 160,000 students, recently gave a talk about the experience which I think is well worth watching.  Thrun says that this experienced changed his life, and the talk is surprisingly heartfelt and emotional.  He also has some harsh words for the current university teaching model: “Miraculously, professors teach today exactly the same way they did 1000 years ago.  University has been, surprisingly, the least innovative of all places in society.”  Nielsen also addresses this point in discussing the low rate of participation of researchers in Wikipedia.

Thrun further says that after teaching 160,000 students (of which 23,000 completed the course), he can’t go back to the old way of teaching the 200 in his normal Stanford classes.  He gave up tenure at Stanford in April of last year and now has began a startup, udacity, focused on producing online classes.  Two classes are on tap for the first offering, CS101 going from zero programming experience to programming a web search engine, and CS373, programming some of the algorithms under the hood of a driverless car.

Putting a face on non-negative rank

Monday, January 16th, 2012

Last time we talked about non-negative rank in the context of communication complexity.
Non-negative rank also has a more practical face, with applications in data analysis and
machine learning.  Today we will talk about its application to the representation of faces, a nice
change of pace as it involves lots of pictures.

The usual way of representing a digital image is to store values for each pixel indicating the
color of that pixel.  For this example, we will work with grayscale images of 5000 faces from the
faces in the wild dataset.  To get an idea, here are the first 100 images from the dataset.

First 100 images in the dataset

Each face is a \(32 \times 32\) pixel grayscale image.  Each pixel is represented by a number form
0 to 255, with 0 indicating total absence of light (black) and 255 full intensity light (white).  Thus
in total it takes \(32 \times 32 \times 8=8192\) bits, or a kilobyte, to represent each picture.

Representing images by their pixel values is very general, and we can represent any image this way.
Say that for our application, however, we are going to be working exclusively with images of faces.
This information already substantially reduces the space of possible images.  Indeed, if we were
to randomly choose the intensity of each pixel value, the image would probably not look at all
like a face, but like TV static  (though people do have a knack for seeing faces in
unusual places!).

A random grayscale image

If we also allow some reduction in quality of the image (lossy compression), we can have vastly more
efficient representations.  We will see recognizable approximations of these faces using only
100 bytes, a factor of ten reduction in information!

The main idea behind this is to use a different set of basis images.  In the normal pixel representation,
the basis images are single pixels, one for each point of the image.  We could instead use a different
set of basis images that are specialized to working with faces.  For example, our basis images could be
prototypical faces.  Or, we could come up with a small set of images of eyes, noses, mouths, etc. and
express each face by combining the most appropriate features.  To faithfully represent any image,
we would still need at least \(32 \times 32\) basis images, the same number as originally used in the
pixel representation.  Our hope, however, is that by using basis images carefully chosen to work with
faces, with only 100 basis images we can still reasonably approximate any face in our
dataset—that is, the contribution of the other \(32 \times 32 -100\) remaining basis images will be
very small.

Of course, we don’t want to have to construct this basis this by hand, but want some
automatic way of choosing a good set of basis images for our dataset.  Principal components analysis and
non-negative matrix factorization are two ways of doing this.  In the case of principal components analysis, the
basis images are known as eigenfaces.  Each face is represented as a linear combination of
these eigenfaces.  The eigenfaces are chosen in such a way that the first \(k\) eigenfaces can represent
the dataset with less error than any other choice of \(k\) basis images.  (The particular error measure
here is the average squared difference in pixel intensity between the original and the approximation.)
Thus if we want an approximate representation of our faces in terms of only 100 basis images, taking
the top 100 eigenfaces seems like a good bet.

This is not the end story, however.  With principal components analysis, the faces in our dataset are
represented as linear combinations of the eigenfaces, so in particular coefficients can be negative.
This means that we can subtract faces.  Have you ever heard anyone describe a face as Brad Pitt minus a bit of George Clooney?  (I get that all the time.)  Probably not—such a superposition is difficult to visualize, and it seems
unlikely that our brains process faces in this way.  It is much more common to hear that you have your father’s eyes and the mouth of your mother.  (And the curly hair of the mailman?)  This addition of disjoint parts is much easier to visualize.

Non-negative matrix factorization tends to give rise to basis images that more closely correspond to
parts of faces.  In a non-negative matrix factorization we again represent each face as a linear
combination of basis images, but now all coefficients in the linear combination are non-negative—we
can no longer subtract faces.  We can have such a non-negative factorization as all the data we
started out with—all the pixel values—are non-negative numbers.  When we restrict ourselves to 100
basis images, we are looking for the best approximation of all the images by a matrix whose
non-negative rank is 100.  In principal components analysis, when we restrict ourselves to 100
basis images, we are looking at he best approximation of all the images by a rank 100 matrix.

Illustration of matrix factorization

Here is a pictorial representation of matrix factorization.  We represent our dataset as a big \(5000 \times 1024\)
matrix.  The rows are indexed by faces, and each row gives the pixel intensities of the \(32 \times 32\)
image written as a long vector—we stack the rows of the image together, beginning with the first
row of the image, followed by the second row, etc.  In this example, in the \(i^{th}\) image the
\(j^{th}\) pixel value is 42.  Representing the faces as linear combinations of 100 basis images
amounts to (approximately) factorizing the matrix as the product of a \(5000 \times 100\) matrix
holding the coefficients telling the contribution of each basis image to a face, and
a \(100 \times 1024\) matrix giving the 100 basis images.  In the case of a non-negative factorization
the coefficient matrix will be non-negative.
You can judge the final results for yourself.  Here are the 100 basis images for our dataset arising from
principal components analysis and non-negative matrix factorization.

Top 100 eigenfaces

 

Top 100 non-negative basis images

 

And the final product, the reconstruction of the faces in terms of the 100 basis images.  First, from PCA.

Faces in terms of top 100 eigenfaces

And from the non-negative factorization.

Reconstructed faces from rank 100 non-negative matrix factorization

You will probably agree that the recovered images from the non-negative factorization are more
recognizable.  Unfortunately, there is a price to pay for this improvement.  For a \(m\)-by-\(n\) matrix
(in our case a 5000-by-1024 matrix), all the eigenfaces can be found in about \(\max\{m,n\}^3\) many steps.  On my laptop this took about 21 seconds.  Finding the best approximate non-negative factorization, on the other hand, is an NP-hard problem (shown by Vavasis).  In practice, a non-negative factorization is usually found by iterative methods that are not guaranteed to converge to the optimal solution.  I did 500 iterations of the algorithm suggested in this paper by Lee and Seung, which took about 5 minutes.

Another application of these techniques is to recommender systems.  You know, if you like this
you might also like that.  Matrix factorization played a big role in the winning algorithm for the
Netflix prize, a contest to give an improve the recommendation algorithm of Netflix.
Here the data can also be represented as a big matrix, now with rows indexed by users instead of
faces, and columns indexed by movies.  An entry of the matrix gives the user rating for that movie—if
the user has rated it.  The big difference between the setting of recommender systems and that
described above is that now there are lots of holes in the matrix, places where the value is unknown.
Indeed, the whole point is to fill in these holes and predict user ratings.  But, just as here we used
matrix factorization to come up with basis images which could well approximate our faces, in
recommender systems matrix factorization can be used to come up with features of
movies that do a good job of distinguish their preference among users.  In other words, instead
of coming up with factors by hand like, level of action, includes a big star, or humor and evaluating
the movie on each factor, these factors and evaluations are automatically generated just like the eigenfaces, and can often be similarly difficult to interpret.  This survey article by members of the Netflix prize winning team is a nice place to learn more.