In our previous article we described a packed representation for sparse polynomials is designed for scalability and high performance. The expand and divide commands in Maple 14 use this representation internally to multiply and divide polynomials with integer coefficients, converting to and from Maple's generic data structure described here. In this post I want to show you how these algorithms work and why they are fast. It's a critical stepping stone for our next topic, which is parallelization.
Consider the problem of multiplying two sparse polynomials f = f1 + f2 + ... + fn and g = g1 + g2 + ... + gm in distributed form. In general we must compute all products fi ⋅ gj and combine like terms to produce the result. This typically involves sorting, although a hash table could be used instead.
The problem with sorting all the fi ⋅ gj is that memory is slow. The table below should give you some idea of what to expect. If we generate a large dataset and sort it we will inevitably be working in memory, not in cache. Likewise, if we insert all the terms into a hash table and too many of them are distinct, we will also run up against the memory bottleneck. Large hash tables are especially slow because of their inherent random access.
The challenge then is to combine all the terms fi ⋅ gj in cache and write only the final result to main memory. This is accomplished by an algorithm of Johnson (1974), which is illustrated at the top of this post. Johnson's algorithm requires f and g to be sorted. It uses a binary heap to merge the largest outstanding fi ⋅ gj in each iteration, producing a new term of the result each time. This requires space for only #f monomials and pointers into g.
This order of magnitude reduction in "working storage" means that Johnson's algorithm will almost always run in cache. Another useful property is that it generates terms of the product one at a time in descending order, so we can divide polynomials with no intermediate result. Recall that the classical algorithm to divide f by g creates an intermediate polynomial p := f, computes a new term of the quotient qi := lt(p)/lt(g), and subtracts p := p - qig in each iteration. If the division is sparse then the polynomial p may be large and once again we are working in memory. There is also a complexity problem: since each merge is O(#p) the resulting algorithm can be O(#q#g).
Johnson's division algorithm solves these problems by using a heap to merge Σ qi g. This requires only O(#q) working storage instead of O(#f + #q #g), and produces a complexity of O(#f + #q #g log #q), which is optimal for #q < #g. Our work has produced a better complexity of O(#f + #q #g log min(#q,#g)), which is the cost to multiply the quotient and the divisor and subtract their product from the dividend. It is this algorithm that has been added to Maple.
The new code is substantially faster when the divisor is small and the quotient is large, as is often the case for polynomial gcd problems where we compute g = gcd(a,b) and divide a/g and b/g. It is also much faster when the problem is sparse. The division below runs about 250 times faster in Maple 14.
f := randpoly(x, degree=10^5, terms=1000):
g := randpoly(x, degree=10^5, terms=1000):
p := expand(f*g):
For more details see our paper on sparse polynomial division.