Some things I didn't happen to know about adjoints, transposes, dual spaces, and elementary linear algebra
This is not really intended to be a technical post. While it contains some technical elements, they are relatively basic; this is more of a retrospective about the messy process of learning things. Maybe if you happen to think in a similar way to me, this will provide you a couple of really valuable lenses, though.
I managed to get an entire math BS and CS MS from pretty good schools while understanding surprisingly few things in math, because:
- some classes were not that good, and at the time I didn’t really know how to take the initiative to get around that. In those cases it was also quite possible to get a good grade without understanding anything, which, even when I wanted to learn, seemed kind of like the only option
- in the good classes, it’s possible for the explanations your professor has to just not be the one that happens to fit your way of thinking, or you can develop one method of thinking about things that works well enough for the purposes of the class but somehow isn’t really complete, and it’s generally kind of hard to fix this even when you know it has happened
I don’t think there was a simple explanation of my not wanting to know math enough, not liking it enough, or not trying hard enough. I think I understood a few things pretty deeply, sometimes considerably more deeply than a class demanded, and sometimes in a way that I don’t think was directly taught to me. It’s more like (at least within “math”) there was a weak correlation between the things I took classes in and the things I happened to learn. I am sure that I will write more about this and eventually separate this observation into its own post.
Anyway, relatively early in a standard abstract linear algebra course that most math majors will take, people bring up the concept of the dual space $V^*$ of a vector space $V$, which is the space of all linear functionals on $V$ (i.e., linear functions $V \rightarrow \mathbb{R}$, if $V$ was over the reals, which let’s say it was). Concretely, suppose you have a basis $\{e_n\}$ of vectors of $V$; one basis of $V^*$ is the basis of functionals $\{f_n\}$, where $f_i(e_j) = 1$ if $i = j$ and 0 otherwise.
This induces the concept of the dual linear map: given a linear transformation $A: V \rightarrow W$, there is a canonical map $A^*: W^* \rightarrow V^*$, defined by $A^*(g)(v) = g(A(v))$. (Type-checking here, $g$ is a linear functional on $W$.) An elementary result about this map is that its matrix with respect to the dual bases (such as $\{f_n\}$ for $V^*$) is the transpose of the matrix of $A$.
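For reference, here is a sketch of the mechanical version of that proof. The notation is introduced only for this sketch: $\{e'_i\}$ is a basis of $W$, $\{g_i\}$ is the corresponding dual basis of $W^*$, and $A_{ij}$ are the matrix entries of $A$, so that $Ae_j = \sum_k A_{kj} e'_k$.

$$
\begin{aligned}
A^*(g_i)(e_j) &= g_i(Ae_j) = g_i\Big(\sum_k A_{kj}\, e'_k\Big) = A_{ij}, \\
A^*(g_i) &= \sum_j A^*(g_i)(e_j)\, f_j = \sum_j A_{ij}\, f_j .
\end{aligned}
$$

The second line uses the fact that any functional $h$ on $V$ expands as $h = \sum_j h(e_j) f_j$. So the coefficient of $f_j$ in $A^*(g_i)$, which is the $(j, i)$ entry of the matrix of $A^*$, is $A_{ij}$, and that is exactly the statement that the matrix is the transpose.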
I remember that when I learned this, it seemed generally pretty cool to have tools to operate on that level of abstraction. But I could only prove the statement mechanically, by sort of chasing the action of a basis linear functional on a basis vector, and I couldn’t develop that much intuition, since I had a bit of trouble keeping all of the parts in working memory. Moreover, I didn’t really see the point of this statement. This concept, I think in most standard treatments (definitely mine), doesn’t really come back up in an obvious way for a while, leading you to think that you can safely write it off as an abstract curiosity, until of course you realize that it was pretty important if you want to use certain powerful tools and was probably taught for a reason.
A different professor of mine at a later time liked to speak loosely of row vectors as being functions, and column vectors as being actual vectors. I thought that this was kind of pointless, since the same additions and multiplications occur when you take a dot product no matter how you care to notate the vectors, and also all of linear algebra should be valid whether you use the notation you use or you take the transpose of everything. While those things are true, I now see a lot of value in his way of thinking, because part of the power you gain from an extremely general theory like linear algebra is that you can sometimes get insight for free by just reinterpreting the terms in an expression. I also realized that I didn’t have a lot of practice doing exactly that, and seeing whether I could glean any insight into what was happening when I did all matrix operations with the transpose of what I would usually do.
As a result of trying to do that for a few minutes, I made a few extremely elementary (i.e., they follow directly from definitions) observations about linear transformations which, again, I managed to not make for a bunch of years doing other semi-advanced mathematics:
- The i’th column of a matrix A is the vector that A sends $e_i$ to. (I knew this one; I had to know at least one of these or else I couldn’t have done anything at all.)
- The i’th row of a matrix A is a formula which tells you how much the different coordinates of an input vector affect its image’s component along $e_i$. (This is very natural, thinking of the row as a linear functional, but I completely didn’t know this one.)
- You can of course do “transpose” linear algebra, but now using both intuitions: if you have a row vector v and you want to compute vA, you can think of v as a functional which maps the columns of A to a bunch of different numbers, or you can think of this as working the same way as computing $A^Tv^T$, however you wanted to think of that.
- If you compute a quantity AB, its i’th column is just A times the i’th column of B. (I also knew this one.) Using the transpose logic, it’s also true that its j’th row is just the j’th row of A times B.
- More generally, given the above two points, every time you compute a quantity AB = C, there are multiple different ways of interpreting the expression, and therefore the result: A can be a transformation acting on B or vice versa, and B’s columns can be transformed by A to produce the columns of C, or they can be functionals which act on the rows of A to produce the rows of C, etc…
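The observations in this list are easy to check numerically. What follows is a minimal plain-Python sketch, not a definitive implementation; the helpers `matmul` and `column` and the particular matrices are made up for illustration, and matrices are stored as lists of rows.

```python
# Illustrative sketch: matrices are lists of rows; all names are hypothetical.

def matmul(X, Y):
    """Product of two matrices given as lists of rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))]
            for i in range(len(X))]

def column(X, j):
    """The j'th column of X, as a flat list."""
    return [row[j] for row in X]

A = [[1, 2],
     [3, 4],
     [5, 6]]        # 3x2: maps R^2 -> R^3
B = [[1, 0, 2],
     [-1, 3, 1]]    # 2x3: maps R^3 -> R^2

# Column view: A sends the basis vector e_0 to its 0th column.
e0 = [[1], [0]]
image_of_e0 = column(matmul(A, e0), 0)

# Row view: row 1 of A is the functional that computes component 1 of Ax.
x = [7, -2]
component_1 = sum(a * xi for a, xi in zip(A[1], x))
Ax = column(matmul(A, [[xi] for xi in x]), 0)

C = matmul(A, B)
# Column i of AB is A applied to column i of B.
col_check = all(
    column(C, i) == column(matmul(A, [[b] for b in column(B, i)]), 0)
    for i in range(len(B[0])))
# Row j of AB is row j of A (as a 1x2 row vector) times B.
row_check = all(C[j] == matmul([A[j]], B)[0] for j in range(len(A)))

print(image_of_e0, component_1, Ax, col_check, row_check)
```

Both `col_check` and `row_check` come out `True`, which is just the "two ways to read $AB = C$" point from the last bullet, in code.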
Now here is the way I prefer to think about that dual statement: if you just lay out a row vector $v$, a matrix $A$, and a column vector $w$, and reinterpret the terms as above (i.e., $v$ is a functional on the image of $A$), then the result follows automatically. For example, $v(Aw) = (vA)w = (A^Tv^T)^Tw$. It’s clear that $A^T$ is the matrix acting on $v$ here if you were to think of $v$ as a column vector, and what you get can clearly be interpreted as a functional that acts on $w$, so it’s natural to think of $A^T$ as mapping a functional represented by $v$ to some other functional, etc. But I’d say you should just start by thinking of $v$, a functional, as a row vector, and just encode this statement about the canonical dual map and the transpose matrix as being a simple statement about one of the ways you can parse an expression like $vAw$.
This all works basically if you are convinced that matrix multiplication is associative. Actually, if you are convinced about the canonicalness of the dual map, this also works as a decent way of convincing you that matrix multiplication is associative.
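The three parses of $vAw$ can be compared directly. This is a small sketch under the same conventions as the prose (row vector $v$, matrix $A$, column vector $w$); the helper names and the numbers are invented for the example.

```python
# Illustrative sketch: matrices are lists of rows; all names are hypothetical.

def matmul(X, Y):
    """Product of two matrices given as lists of rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))]
            for i in range(len(X))]

def transpose(X):
    """Transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*X)]

A = [[2, 0, 1],
     [1, -1, 3]]       # 2x3: maps R^3 -> R^2
v = [[4, 5]]           # 1x2 row vector: a functional on R^2
w = [[1], [2], [3]]    # 3x1 column vector in R^3

# Parse 1: apply A to w, then evaluate the functional v on the result.
functional_of_image = matmul(v, matmul(A, w))

# Parse 2: pull v back to a functional vA on R^3, then evaluate it on w.
pulled_back = matmul(v, A)
pullback_on_w = matmul(pulled_back, w)

# Parse 3: the same pullback, computed as A^T acting on v as a column vector.
pulled_back_via_transpose = transpose(matmul(transpose(A), transpose(v)))
transpose_on_w = matmul(pulled_back_via_transpose, w)

print(functional_of_image, pullback_on_w, transpose_on_w)  # [[60]] each time
```

The only fact doing any work here is associativity of matrix multiplication; the transpose shows up purely because of the choice to write the functional as a column.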
Later in my life, I spent a particularly long time passively confused about adjoints. For later context:
- Given an operator $A: V \rightarrow W$, an inner product on $V$, and an inner product on $W$, the adjoint operator $A^*: W \rightarrow V$ satisfies the relation $\langle Ax, y \rangle = \langle x, A^*y \rangle$ for all $x \in V$ and $y \in W$ (the left-hand inner product is the one on $W$, the right-hand one is the one on $V$).
- An elementary result about adjoints is that with respect to orthonormal bases, the matrix of $A^*$ is the transpose of the matrix of $A$ (the conjugate transpose, if you are working over the complex numbers). The proof of this in, e.g., Axler is not very intuitive to me.
I somehow took two classes in linear algebra, one in functional analysis, and one in differential topology without ever hearing the word “adjoint” at all, and it was only while semi-casually perusing Axler a couple of years later that I noticed this hole. Although I could manipulate adjoints algebraically, the definition seemed to me to be somewhat convoluted, and I had no geometric intuition for what it was supposed to be doing. I tried for a short time again last year and was able to build a small amount of it by characterizing the effect on rotation matrices and in the rank-one case, but I didn’t know how to be sure that my intuition was comprehensive. I also knew from the fact that “the matrix of an adjoint is the transpose” that there must be some natural connection to dual spaces, but since my intuition about those was also pretty weak I was semi-subconsciously averse to pursuing that line of inquiry, preferring to look for a different characterization. (One funny thing that happened is that I was searching the internet for an intuitive frame again, encountered a 5-year-old Reddit comment from a guy who made some connections that I thought were insightful but got stuck at the connection to the dual/transpose, and I felt a strong sense of connection to him before realizing it was me.)
I vacillated between trying the “just grind problems and eventually an intuition will emerge” strategy (which has the pro that it’s tractable and gives your mind a lot of impetus to process the topic subconsciously, but the con of possibly being really inefficient in the exact sort of way mathematicians hate) and the “search around for the correct characterization that will make this obvious for me” strategy (which has obvious pros, and the con of possibly never really bearing fruit if that characterization does not exist or hasn’t been articulated anywhere). In general it’s hard to say whether a better way of thinking of something does or does not exist. It seems that math is probably the field where (at all the levels I’ve accessed, which admittedly is not most of them at this point in history) there is almost always a more general, more elegant way of framing things that is worth investing the effort to understand, but that still doesn’t mean someone wrote it down in a way that’s accessible to you in particular.
Anyway, since I now had this easier intuition about dual vectors, it was natural to try to apply it to adjoints; due to the Riesz Representation Theorem it’s doubly natural to use this “one of the vectors in this inner product is a row vector/functional” sort of intuition, and indeed the result is basically tautological once you think of things this way. If your inner product is the standard dot product, all of the equations are literally the same, i.e., $\langle Ax, y \rangle = (Ax)^Ty = x^TA^Ty = \langle x, A^T y \rangle$, and if your inner product is something else, this derivation gets slightly messier because you have to think about what it does to both x’s space and y’s space, but this approach will still get you the correct answer. (This observation, or something very closely related to it, was also made here.)
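To make the “something else” case concrete, here is a sketch of how the same manipulation goes in coordinates. It assumes the two inner products are given by symmetric positive definite matrices, $\langle x, x' \rangle_V = x^TM_Vx'$ and $\langle y, y' \rangle_W = y^TM_Wy'$; the names $M_V$ and $M_W$ are notation introduced only for this sketch.

$$
\begin{aligned}
\langle Ax, y \rangle_W &= (Ax)^T M_W\, y = x^T A^T M_W\, y \\
&= x^T M_V \left( M_V^{-1} A^T M_W\, y \right) = \langle x,\; M_V^{-1} A^T M_W\, y \rangle_V .
\end{aligned}
$$

So under these assumptions the matrix of the adjoint is $M_V^{-1} A^T M_W$, which collapses to plain $A^T$ when both $M_V$ and $M_W$ are the identity, i.e., when the coordinates come from orthonormal bases.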
In the end, even in the age of LLMs, the “keep mining MathOverflow hoping to find a gem” strategy was the one that actually resolved this in my case, with this post. Honestly, I didn’t need or understand most of this answer, but it pointed me to a completely elementary observation that I also did not know to make: a finite-dimensional operator can be decomposed into the sum of several rank-one operators, so just characterize the behavior of adjoints of those, which I had done, and then use linearity to prove/intuit the rest. The value of the post was in presenting this point to me in the exact context I needed for the exact frame of mind I had about the question at the time, which seems kind of hard to systematically search for.
(For completeness, loosely speaking, in the rank one case, we care only about our projection onto some specific line, which then gets mapped by our transformation to a line in the image; the constraint on the adjoint is that it maps the line in the image back to the line in our domain. The constraint that we care “only” about our position on these lines is what gives us the complementary subspace properties of the adjoint, and this carries over neatly when we think of rank-n transformations.)
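The rank-one argument can be sketched in code for the standard dot product. One simple decomposition (chosen here for illustration, and certainly not the only one) writes $A$ as a sum of pieces “column $j$ of $A$ times $e_j^T$”. The adjoint of a rank-one map $x \mapsto u\,(v \cdot x)$ is $y \mapsto v\,(u \cdot y)$, so taking the adjoint of each piece just swaps the two vectors. All helper names below are hypothetical.

```python
# Illustrative sketch for the standard dot product; all names are hypothetical.

def outer(u, v):
    """Matrix of the rank-one map x -> u * (v . x)."""
    return [[ui * vj for vj in v] for ui in u]

def add(X, Y):
    """Entrywise sum of two matrices of the same shape."""
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]

def total(mats):
    """Sum of a non-empty list of matrices of the same shape."""
    result = mats[0]
    for M in mats[1:]:
        result = add(result, M)
    return result

def apply(X, x):
    """Matrix X applied to the vector x."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in X]

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

A = [[1, 2],
     [3, 4],
     [5, 6]]            # 3x2: maps R^2 -> R^3
n_cols = len(A[0])
columns = [[row[j] for row in A] for j in range(n_cols)]
basis = [[1 if i == j else 0 for i in range(n_cols)] for j in range(n_cols)]

# A as a sum of rank-one pieces: (column j of A) times (e_j)^T.
pieces = [outer(columns[j], basis[j]) for j in range(n_cols)]
A_rebuilt = total(pieces)

# Adjoint of each rank-one piece swaps the two vectors; linearity does the rest.
A_adjoint = total([outer(basis[j], columns[j]) for j in range(n_cols)])

x, y = [7, -2], [1, 0, 4]
lhs = dot(apply(A, x), y)           # <Ax, y>
rhs = dot(x, apply(A_adjoint, y))   # <x, A* y>
print(A_adjoint, lhs, rhs)
```

Summing the swapped pieces reproduces the transpose of `A`, and the two inner products agree (both are 95 for these inputs).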
This is a much less thought-out addendum, but maybe it helps someone else. I also didn’t really cover inner products with any formality during my education, so I’ll make some obvious statements here that I had to discover for myself at different points:
- Ignoring complex values and the conjugate condition, bilinearity and symmetry of $\langle v, w \rangle$ is exactly equivalent to there being a symmetric matrix $A$ such that $\langle v, w \rangle = v^TAw$ (and positive definiteness of the inner product corresponds to positive definiteness of $A$). This has at least a vague relation to most statements involving symmetric matrices.
- Likewise, a lot of things of the form $x^TAx$, like you might find in optimization theory or a machine learning survey course, can be interpreted as the norm-squared of $x$ with respect to some inner product, at least when $A$ is symmetric positive definite.
- Just as dot products are a sum of products of corresponding entries, so you can do this with infinite sequences or even continuous spectra of numbers, as long as you handle limits/convergence/etc. correctly so that it’s sensible to talk about the result. This makes $\langle f, g \rangle = \int_a^b f(x)g(x)\,dx$ a very natural generalization of what we were already doing in finite-dimensional vector spaces, and this is a very straightforward and central example of what is meant when we do linear algebra in infinite dimensions.
- (I personally think that while proving things gets mechanically harder and more terms need to be introduced to keep track of subtleties, conceptually things don’t get that much harder (certainly not in proportion to how much longer the corresponding text gets) when you go to infinite dimensions, and that I shouldn’t have been so scared of the premise when I was younger. I think of this as mostly using the exact same set of mental processes modulo a few holes you just need to be aware of. Maybe the view from an even higher level of abstraction is different.)
- (Also, “handle limits/convergence/etc. sensibly” is why we cared so much about operations on sequences, various forms of integrability, Hölder’s inequality, etc. in that functional analysis class…)
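Two of these statements are easy to poke at numerically. The sketch below is only illustrative: the matrix, the vectors, the functions, and the helper names `bilinear` and `l2_inner` are all made up, and the integral is approximated with a midpoint Riemann sum, which is itself just a scaled dot product of sampled values.

```python
# Illustrative sketch; the matrix, vectors, and helper names are hypothetical.

def bilinear(x, A, y):
    """The bilinear form x^T A y for a matrix A given as a list of rows."""
    return sum(x[i] * A[i][j] * y[j]
               for i in range(len(x)) for j in range(len(y)))

A = [[2, 1],
     [1, 3]]     # symmetric and positive definite
v, w = [1, -2], [3, 1]

# Symmetry of A makes the form symmetric in its two arguments.
symmetric = bilinear(v, A, w) == bilinear(w, A, v)
# x^T A x read as the squared norm of v in the inner product defined by A.
norm_sq = bilinear(v, A, v)

def l2_inner(f, g, a, b, n=1000):
    """Midpoint Riemann-sum approximation of the integral of f*g over [a, b]."""
    dx = (b - a) / n
    xs = [a + (k + 0.5) * dx for k in range(n)]
    # A dot product of sampled values, scaled by the sample spacing.
    return sum(f(x) * g(x) for x in xs) * dx

# <f, g> for f(x) = x and g(x) = x^2 on [0, 1]; the exact integral is 1/4.
approx = l2_inner(lambda x: x, lambda x: x * x, 0.0, 1.0)
print(symmetric, norm_sq, approx)
```

As `n` grows, the sampled dot product approaches the integral, which is the sense in which $\int_a^b f(x)g(x)\,dx$ is “the same thing” as a finite-dimensional dot product.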