Evelyn Lamb: Hello, and welcome to My Favorite Theorem, a math podcast with no quiz at the end. I’m Evelyn Lamb, one of your hosts. I’m a freelance math and science writer in Salt Lake City, Utah, and this is your other host.
Kevin Knudson: Hi, I’m Kevin Knudson, professor of mathematics at the University of Florida. How are you doing, Evelyn?
EL: I’m doing okay. I was trying to think of something interesting to talk to you about at the beginning of this and life isn’t very interesting, but in a good way. So that’s good.
KK: Yeah. Yeah. You know, monotony is underrated, isn’t it?
KK: We just had our 29th wedding anniversary on Sunday.
KK: Thank you. That’s our big news. We went out and sat at a restaurant for the first time in nearly a year. So we were very excited
EL: Excellent. I hope that, you know, this podcast is going to go on for decades and decades, and someone is going to be catching up on old episodes, you know, in 15 years and say, “Why do they keep talking about all these very boring things that they’re doing for the first time in a year?” So that’ll be great. You know, last time it was haircuts, this time it’s restaurants.
KK: That’s right. That’s right. Yeah.
EL: But whether they are saying that or not, they will be very excited that we are talking today to Rekha Thomas, and Rehka, would you like to introduce yourself?
Rekha Thomas: Hi, thanks for having me on the show. So I am a professor of math at the University of Washington in Seattle. And I come originally from optimization. My PhD is actually in operations research. But I have worked in a math department ever since I graduated. And my work lies somewhere at the intersection of optimization, applied algebraic geometry and combinatorics. And I very much like problems that have an applied background, or a motivation. Not necessarily because I work in applied things, but because I very much like the problem to be motivated by something real. And in the last 10 years or so I’ve been working quite a bit in the mathematics behind questions that come from computer vision. And this has been especially fun, but optimization in general does have that applied side to it. So I like problems like that.
KK: Right. You know, I talked to my colleagues over an industrial and systems engineering, where they do a lot of that work all the time. And I think, “You guys are basically mathematicians. Why don’t you come over here?” Heavy overlap there. Yeah.
EL: Well, and I just want to say, how we came to invite Rehka onto the podcast.
EL: I think it’s pretty cool. So one of her students sent in an email to our little submission form on Kevin’s website, saying that he thought that his linear algebra teacher was great and would be a great guest for My Favorite Theorem. And this is especially remarkable because this is in the semester, we’re all teaching has been online and stuff. And I just think to be connecting with students enough when you’re doing Zoom teaching, that they actually reach out to a podcast to get you as a guest on a podcast, just you must have really made an impression. So I think that’s very cool.
RT: Thank you. I was very honored by that piece of news. I thought, “Okay. We finally made contact.” That was great. But especially for an undergraduate to take that initiative to write to you was very touching.
EL: Yeah. So you know, not to build things up too much, but I’m sure that this is going to be a great episode.
EL: So with that introduction done, what is your favorite theorem?
RT: Yeah, so I announced my favorite theorem in that class and told them about your podcast. So when you invited me, I thought, “Oops, is that really my favorite theorem? Is that what I really want to speak about?” And I think yes, so I’m going to stick with that theorem. And this is a theorem from linear algebra. It was a linear algebra class. And I think it’s one that’s not very well known to pure mathematicians. So it goes by the name of Eckart-Young theorem or the Eckart-Young-Mirsky theorem, but apparently the history is more complicated. I’ve been reading about this so I can tell you about it if we get to that. So the theorem is basically says the following. So if you have a matrix, say a real matrix of size N by N, and we know its singular value decomposition [SVD] — which is a very special decomposition of the matrix as a sum of rank-one matrices — then the closest rank-K matrix to the given matrix is just made up of summing the first K rank-ones in that decomposition. So if I wanted the closest rank-one matrix, I just take the first rank-one in the singular value decomposition of A. If I wanted the closest rank-two matrix, I take the first plus the second rank-one matrices in the singular value decomposition, and so on. So there is this neatly arranged set of rank-one matrices that add up to give you A, and if you truncate that sum anywhere at the K-th spot, you get precisely the closest rank-K matrix to the given matrix. This is the theorem.
KK: Closest in what sense? What’s the metric?
RT: Yeah, so that’s a very good question. So closest in either Frobenius norm or spectral norm. So these are standard norms on matrices. Frobenius norm is just, you think of your matrix as a long vector, where you just take every row, let’s say, and concatenate it to make make a long vector, and then take the usual Euclidean norm. Okay. So the sum of Ai,j squared, square root. [Maybe a little easier to read in math notation: (Σ Ai,j2 )1/2.] Spectral norm is the largest singular value, so it is the biggest stretch that the matrix can make on a unit vector.
RT: So in either norm this works.
KK: That’s pretty remarkable.
EL: Yeah. So just doing a little bit of mathematical free association, is this, can this in any way be thought of as like a Taylor’s Theorem, like, you’re kind of saying like, Okay, if you want your approximation to be this good, you go out this far, if you want it to be this good, you go out this far, maybe that’s — I don’t I don’t know if that’s a good analogy.
RT: No, I think it is. So I think there are many constructs in mathematics like this, right? Like Fourier series is another one, where you have a break down into pieces, and then you take whichever part you want. Taylor series is similar. The SVD is special in the sense that the breakdown, this breakdown into rank-one matrices is actually tailored to the matrix itself. Like, for example, as opposed to, say, in Fourier series, where the the basic functions that we are trying to write the function as a combination of, they are fixed, right? It is always cos(θ) + i sin(θ), or cos(Nθ) + i sin(Nθ). So it’s not particularly tailored to the function. It’s just a fixed set of bases, and you’re trying to write any function as a combination of those basis functions. But in the SVD, the basis that you construct, the factorization that you get, is tailored to the actual data that’s sitting inside the matrix. So it’s very, very nice and is incredibly powerful. So it’s similar, and yet, I think, slightly different.
KK: Right. In other words, yeah, that’s a good explanation. Because, as you say, with Taylor series, you’re choosing a basis for a subspace of the space of all smooth functions or whatever. Whereas here, you’re taking a particular matrix. Does it matter what — so if you change the basis of your vector space, do you get a different SVD?
RT: No. So what the SVD is — there is a very nice geometric way to think of the SVD, which may answer that question better. So the SVD is sort of a reflection of how the matrix works as an operator. So what it’s telling you is if you take the unit sphere in the domain, so let’s say we have an N by M matrix, so the domain is Rn, take the unit sphere, under the map A, the sphere goes to an ellipsoid.
RT: In general, a hyperellipsoid, right? And this ellipsoid has semi-axes. The singular values are the lengths of those semi-axes. Oh, so it tells you, yeah, the length of the semi axes, and the unit vectors in the directions of those semi-axes are one set of basis vectors. So that’s one set of singular vectors, and the pre-images in the domain are the other set of singular vectors. So if you’re willing to change basis to the special bases, one in the domain, one in the codomain, then your matrix essentially behaves just as diagonal matrix. It just scales coordinates by these singular values. So it’s a generalization of diagonalization in some ways, but one that works for any matrix. You don’t need any special condition. So they’re very canonical bases that are tailored to the action of the matrix.
KK: I’m learning a lot today. So I have to say, so before we started recording, I was mentioning that I think students need to take more linear algebra. But the truth is, I need to take more linear algebra.
RT: I think we all do.
EL: Yeah, I mean, I am really realizing how long it’s been since I thought seriously about linear algebra. So this is fun. And it goes really well with our episode from a few months ago. I believe that was Tai-Danae Bradley who chose a singular value decomposition as her favorite theorem. So we’ve got, you know, a little chaser that you can have after this episode, if you want to catch up on that. So Rekha, you said that there was a bit of a complicated history with this theorem. So do you want to talk a little bit about that?
RT: Yeah, sure, I’d be happy to. So I always knew this theorem as the Eckart-Young theorem. I only recently learned that it had Mirsky attached to it. But then in trying to prepare for this podcast, I started looking at some of the history of the singular value decomposition. And there’s a very nice SIAM review article by Stewart, written in 1993, about the history of the singular value decomposition. And according to him, singular value decomposition should be attributed, or the first version of it, is due to Beltrami and Jordan from 1873 — so the Eckart Yang theorem is from 1936, so almost 60 years before — and they were looking at more special cases, they were looking at square real matrices that are non singular, perhaps. So they, you know, people were interested in that special case. Then there were several other people, like Sylvester walked on it. Schmidt from Gram-Schmidt, he worked on it, Hermann Weyl worked on it. So from 1873 to 1912, this went on. And this article says that this approximation theorem that I mentioned, the Eckart-Young theorem, is really due to Schmidt from Gram-Schmidt fame. And he was interested not so much in matrices, but he was studying integral equations, where you have both symmetric and asymmetric kernels, non-symmetric kernels. And he wrote down this approximation theorem. So really, Stuart claims that that this theorem should actually be attributed to Schmidt. And then in 1936, Eckart and Young actually wrote down the SVD for general rectangular matrices. So that is in that paper, for sure. And they seem to have rediscovered this approximation theorem. So, that is my understanding of the history of how it is, but I did not know this till two days ago. And I’m not really a bonafide historian in any way. So, but this is what I’ve understood. It’s an interesting story.
EL: Yeah, that sounds like an interesting article. I mean, I guess in the history of math, there are an uncountable number of places where unravelling back to where the idea first appeared is more complicated than you think.
KK: And also, this sort of gets at your interests more generally, how you like things to sort of come from an actual application. Well, if this really came from integral equations, right, that’s really an application.
KK: So it’s working on lots of levels for you.
RT: That’s right. So he, of course, did this in infinite-dimensional vector spaces. And approximation will allow you to approximate an operator as opposed to a matrix, right? And apparently, that really elevated this whole theory from just a theoretical tool to something that’s actually widely used. It became much more of a practical tool. And I guess, in modern day, the SVD, and versions of the SVD in exists in all kinds of mathematical sciences. So in signal processing, and fluid dynamics, all sorts of places. So it’s, in some sense, one of our biggest exports from the math world, and yet we don’t quite teach it normally, to math people. So yeah.
KK: Right. So was this like a love at first sight thereom? Was this the sort of thing that came up a lot in your work, and that’s why you’re now so enamored?
RT: So I did learn of at first in the process of writing a paper. I did not know about this theorem before, maybe about 10 years ago. But I think this theorem sort of perfectly fits me, which is why I love this theorem. So for different reasons, right? So first of all, it’s an optimization problem, it’s about minimizing distance from a matrix to the set of rank-whatever matrices. So it’s an optimization problem. The space that you’re trying to minimize to, which is the space of rank at most k matrices, that is an algebraic variety. So it can be written as the set of solutions to polynomial equations. So there’s the applied algebraic geometry side, or at least the algebraic geometry side. And it’s not a very simple variety. It’s actually a complicated variety. So it’s an interesting one. And lastly, this problem is sort of a prototypical problem in many, many applications. So a lot of statistical estimation problems are of this flavor. You have a model, which, let’s say is our rank-K variety, so the rank being some measure of complexity. And then you have an observation that you have in the field with instruments, and it tends to be noisy, so it’s not on the model. So that’s your observed matrix. And now you’re trying to find the maximum likelihood estimate, or the closest true object, that fits the observation. So this is a very standard problem that comes up in many, many applications. So in some sense, I feel it really lives at this intersection of optimization, algebraic geometry and applications, which is sort of what I do.
KK: That’s you.
RT: Yeah. So that’s one reason that I think this theorem is so cool. And the other thing is, I think it’s a very rare instance of an optimization problem where the object, the observed matrix, knows the answer in its DNA. It doesn’t need to know the the landscape that it’s trying to get to, which is your space of matrices of rank at most K. it doesn’t need to know anything about that variety. Just inside its own DNA, it has this SVD. And from the SVD, it knows exactly where to go. So this is completely unusual in an optimization problem, right? Like even if you’re minimizing, say, a univariate function over the interval [0,1], I really need to know the interval [0,1], to figure out what the minimum value is. The function doesn’t know it. But this is, I think, kind of a gem in that sense. You don’t need to know the constraint set. And then lastly, it appears all over the place in applications. So in things like image compression, you know, the Netflix problem is sort of a version of this, distance realization, you know, things that would come up in areas like molecular modeling, or protein folding, and so on. It’s all — many of these problems can be thought of as low rank approximation to a given matrix.
KK: Very cool. Yeah, I’m now I’m thinking about that variety. It’s not a Grassmannian. But it’s sort of like, is it stratified by Grassmannians? Let’s not go down this path.
RT: It’s just — yeah, it’s a set of solutions to the equations that you get by setting all the rank, whatever minus to zero.
RT: So the matrix cannot be rank more than K. So you set all the K+1 by K+1-minus to zero.
KK: That’s complicated.
RT: Yeah, yeah. It’s a complicated variety with singularities and so on. Right.
KK: So another thing we like to do on this podcast is ask our guests to pair their theorem with something. So what pairs well with this theorem?
RT: So I thought about this. This, to me was a very interesting challenge that you posed. So one thing that I always think about when I teach this theorem or teach the SVD in general, is there’s sort of a layer cake analogy. Okay, so I have always drawn this layer cake picture in my class. But when I started thinking about what I should tell you on the podcast, I thought, “Okay, it’s not quite a layer cake, like you would buy in a store.” But it’s sort of like a layer cake. So there is a layer cake analogy going on here. And that is simply we can think of a matrix. So if you have, say, an M by N matrix, let’s start by just thinking of it as a rectangular chessboard lying on the floor. And then every entry of the matrix is creating, let’s say, a building on each square. So you have, you know, the buildings have different heights, depending on the entry there. And then in that sense, what we’re doing is, if you think of what a rank-one matrix is in picture that I’m trying to draw, then what it is, would be this sort of city block with skyscrapers that we’ve constructed, where when you look from East Avenue or West Avenue, you see one skyline. And then you see, like, various up and down versions of the same skyline as you look across. And then similarly, if you stand on North Avenue or South Avenue, you see one skyline. And then you see these up and down versions of that skyline as you go in the other direction. So that’s a rank-one matrix. And then the whole matrix is built of these puzzle pieces, if you like. They’re all rank-one matrices. And what we’re doing is sort of, take different puzzle pieces, you know, the first puzzle piece captures the most amount of energy in the matrix, then the first and second, the next amount, and so on. So that’s sort of one geometric thing. And so thinking of a pairing, that’s just a geometric thing, not not exactly a pairing. But in my mind, another way to think about the whole thing is you could think of your matrices as, say, living in a universe. We have this M by N, universe. And then each of these landscapes, these, you know, matrices of rank, at most K, they form landscapes inside this universe. They’re nested one inside the other. And you could almost think of your matrix as sort of a flying object. And if it needs to make an emergency landing on one of these landscapes, it knows exactly where to land. It doesn’t need any, you know, radio control from the ground, right? There’s no air traffic control on the ground who needs to tell this matrix where to land. So that’s sort of my geometric vision of what is happening. I love geometry. So I always try to make pictures like this. But the closest physical phenomenon that I was thinking that maybe we could match with this is with the way migratory birds work, right? Like these migratory birds, they have sort of an inherent genetic compass in their head that tells you where they should land. So Florida being one of the biggest places.
RT: And that’s exactly before they fly over the Gulf, right, where there’s a long stretch of water so they know exactly where the end of the land is, or where the beginning of the land is when they fly in the other direction. So I think that’s some amount of this sort of DNA information that’s in their head. So there’s either genetic information — of course, they also use celestial signals like the sun and the stars and so on. But yeah, so that that, that to me was the best pairing I could come up with, just thinking of matrices as having this inbuilt computer inside them.
EL: I love that!
KK: I do too. As an avid birdwatcher, I’m really into this pairing a lot. This is really nice. Yeah, and luckily for the matrices, it doesn’t get messed up. You know, I mean, I get various — I subscribe to Audubon and things like that. And I just read an article recently about how light pollution is really a problem for migratory birds. Especially, you know, they fly over New York City. You think they’re not there, but they are, and you can catch it on radar data and all of that. And it’s really become a problem for them. And especially with climate change, they’re getting messed up on all these things — not to get off into birdwatching, but this is a really excellent pairing. I love it.
EL: Yeah, well, I know Kevin is quite the birdwatcher. I have only recently been getting a little more into it. And so I will think about flying matrices the next time I go and look at some birds. I recently discovered a new birdwatching place not too far away from where we live. And there have been some great blue herons nesting there. They’re probably leaving soon, but they were there for the spring. And so that was cool as a very beginning birdwatcher to suddenly have your first serious like, “I’m going to watch birds at this place” have these nesting great blue herons at them. It really raises the bar for subsequent birding outings.
KK: They’re impressive birds, too. I mean, I’ve seen them. We have lots of them here, of course, I mean, they’re walking around campus sometimes. But yeah, they’ll catch up like a big fish swallow it whole. And it’s really pretty remarkable. So yeah. All right. Well, Rekha, so we always like to give our guests a chance to plug anything they’re working on. Where can we find you online?
RT: Oh, so I have a basically just by webpage. I’m not a social media person at all.
KK: Good for you.
RT: That’s basically where you can find me.
KK: You’re part of a wonderful department. I’ve visited there several times.
KK: Great department, great city. And we’ll be sure to like the your homepage. Anyway, thanks for joining us. I learned a lot today. This has been great.
EL: Yeah. And can I just, I don’t do a little bit of tooting our own horn was saying that I love this podcast because we get to do things like now every time I decompose these matrices I’m going to think about migratory birds. And, you know, it’s just, like, building all these little connections. I love it. Thanks for joining us.
RT: Thank you.
On this episode of the podcast, we were excited to talk to Rekha Thomas, a mathematician at the University of Washington, about the Eckart-Young-Mirsky theorem from linear algebra. Here are some links you might find interesting after you listen to the show:
Our episode with Tai-Danae Bradley, whose favorite theorem is related to Thomas’s
Stewart’s article about the history of singular value decomposition