Last week, DeepMind announced their new AlphaEvolve software along with a number of new results improving optimization problems in a number of areas. One of those results is a new, smaller, tensor decomposition for multiplying two 4 × 4 matrices in 48 scalar multiplications, improving on the previous best decomposition which needed 49 multiplications (Strassen 1969). While the DeepMind paper was mostly careful about stating this, when writing the introduction, the authors mistakenly/confusingly wrote:
For 56 years, designing an algorithm with rank less than 49 over any field with characteristic 0 was an open problem. AlphaEvolve is the first method to find a rank-48 algorithm to multiply two 4 × 4 complex-valued matrices
This gives the impression that this new result is the way to multiply 4 × 4 matrices over a field with the fewest number of field multiplications. However, it's not even the case that Strassen's 1969 paper gave a faster way to multiply 4 × 4 matrices. In fact, Winograd's 1968 paper on computing inner products faster could be applied to 4 × 4 matrix multiplication to get a formula using only 48 multiplications. Winograd uses a trick

which changes two multiplications of ab's into one multiplication of ab's and two multiplications involving only a's or only b's. The latter multiplications can be pre-computed and saved when calculating all the innerproducts in a matrix multiplication
# i=1..4, 4*2 = 8 multiplications
p[i] := -A[i, 1]*A[i, 2] - A[i, 3]*A[i, 4];
# 4*2 = 8 multiplications
q[i] := -B[1, i]*B[2, i] - B[3, i]*B[4, i];
# i=1..4, j=1..4, 4*4*2 = 32 multiplications
C[i,j] = p[i] + q[j]
+ (A[i,1]+B[2,j])*(A[i,2]+B[1,j])
+ (A[i,3]+B[4,j])*(A[i,4]+B[3,j])
It is simple to verify that C[i,j] = A[i,..] . B[..,j] and that the above formula calculates all of the entries of C=A.B with 8+8+32=48 multiplications.
So, if Winograd achieved a formula of 48 multiplications in 1968, why is the DeepMind result still interesting? Well, the relative shortcoming of Winograd's method is that it only works if the entries of A and B commute, while tensor decomposition formulas for matrix multiplication work even over noncommutative rings. That seems very esoteric, but everyone's favorite noncommutative ring is the ring of n × n matrices. So if a formula applies to a matrix of matrices, that means it gives a recursive formula for matrix multiplication - an tensor decomposition formulas do just that.
The original 1969 Strassen result gave a tensor decomposition of 2 × 2 matrix multiplication using only 7 scalar multiplications (rather than the 8 required by the inner product method) and leads to a recursive matrix multiplication algorithm using O(N^log[2](7)) scalar multiplications. Since then, it has been proved that 7 is the smallest rank tensor decompostion for the 2 × 2 case and so researchers have been interested in the 3x3 and larger cases. Alexandre Sedoglavic at the University of Lille catalogs the best tensor decompositions at https://fmm.univ-lille.fr/ with links to a Maple file with an evaluation formula for each.
The previous best 4 × 4 tensor decomposition was based on using the 2 × 2 Strassen decomposition recursively (2 × 2 matrix of 2 × 2 matrices) which led to 7 2 × 2 matrix multiplications each requiring 7 scalar multiplications, for a total of 49. The new DeepMind result reduces that to 48 scalar multiplications, which leads to a recursive algorithm using O(Nlog[4](48)) scalar multiplications: O(N2.7925) vs. O(N2.8074) for Strassen. This is a theoretical improvment over Strassen but in 2024 the best known multiplication algorithm has much lower complexity: O(N2.3713) (see https://arxiv.org/abs/2404.16349). Now, there might be some chance that the DeepMind result could be used in practical implementations, but its reduction in the number of multiplications comes at the cost of many more additions. Before doing any optimizations I counted 1264 additions in the DeepMind tensor, compared to 318 in the unoptimized 4×4 Strassen tensor (which can be optimized to 12*7+4*12=132). Finally, the DeepMind tensor decomposition involves 1/2 and sqrt(-1), so it cannot be used for fields of characteristic 2, or fields without sqrt(-1). Of course, the restriction on characteristic 2 is not a big deal, since the DeepMind team found a 4 × 4 tensor decomposition of rank 47 over GF2 in 2023 (see https://github.com/google-deepmind/alphatensor).
Now if one is only interested in 4 × 4 matrices over a commutative ring, the Winograd result isn't even the best possible. In 1970, Waksman https://ieeexplore.ieee.org/document/1671519 demonstrated an improvement of Winograd's method that used only 46 multiplications as long as the ring allowed division by 2. Waksman's method has since been improved by Rosowski to remove the divisions see https://arxiv.org/abs/1904.07683. Here's that nice compact straight-line program that computes C = A · B in 46 multiplications.
And, here attached are five Maple worksheets that create the explicit formulas for the 4 × 4 matrix methods mentioned in this post, and verify their operation count, and that they are correct.
Strassen_444.mw Winograd.mw DM_444.mw Waksman.mw Rosowski.mw
If you are interested in some of the 2022 results, I posted some code to my GitHub account for creating formulas from tensor decompositions, and verifying them on symbolic matrix multiplications: https://github.com/johnpmay/MapleSnippets/tree/main/FastMatrixMultiplication
The clear followup question to all of this is: does Maple use any of these fast multiplication schemes? And the answer to that is mostly hidden in low level BLAS code, but generally the answer is: No, the straightforward inner-product scheme for multiplying matrices optimized memory access thus usually ends up being the fastest choice in most cases.