building up sets and removing redundant elements

roman_pearce's picture

"I've seen this element before..."

Often we are faced with the problem of building up sets incrementally, by removing pieces one at a time from a larger whole. The bottlenecks in this case are usually:
1) adding a small set X to a large set S (copies S and X, making this ~O(|S|+|X|))
2) removing elements of the large set S from the small set X (binary search: |X|*log(|S|))

A classic example of this is a breadth-first-search. We start at one vertex of a graph and in each iteration we add the set of new neighbors X to the set of vertices S that have already been found. We can make this more useful by making the program return the sets of new neighbors found in each iteration, that is, the sets of vertices that are distance 1, 2, 3, etc. from the initial vertex.

Here is a simple example of a graph:

1--2--3
|     |   
4--5--6
   |  |
7--8--9

which I will store as a table mapping vertices (e.g.: 1) to a set of neighbors (e.g.: {2,4}).

G := table([1={2,4}, 2={1,3}, 3={2,6}, 4={1,5}, 5={4,6,8}, 6={3,5,9}, 7={8}, 8={5,7,9}, 9={6,8}]);

Starting from vertex 1, the BFS should return [{1}, {2,4}, {3,5}, {6,8}, {7,9}], corresponding to vertices that are distance [0,1,2,3,4] away.

How would you program this in Maple ? I like to use the following trick:

zap := proc(v) procname(v) := NULL; v end proc:
forget(zap):

This procedure returns any object that it has not seen before, but assigns object=NULL to its own remember table so that if the object is encountered again nothing is returned. If S is a set, the result of map(zap, S) is the set of objects in S that are new. If you use this in your code, you should always take care to include the second line, forget(zap). Otherwise you will find yourself tracking down mysterious bugs, the cause of which are a topic in itself.

As an aside, what useful operation will the following do ?

zap := proc(v) procname(v) := NULL; v end proc:
forget(zap):
L := [1,2,1,1,3,2,4,3,1,2];
map(zap, L);

And for the experts in the house, how might you use the following variant ?

zap := proc(v) procname(v) := false; true end proc:
forget(zap);

Returning to breadth-first-search, here is some simple yet efficient code that takes a graph G as encoded above and a vertex v to start from. We store the current new neighbors in a set X and write them to a table S.

BFS := proc(G, v)
  local S, X, i, zap, d;
   zap := proc(v) procname(v) := NULL; v end proc;
   forget(zap);
   S := table();
   X := map(zap, {v});
   for i from 0 while nops(X) > 0 do
     S[i] := X;
     X := {seq(op(map(zap, G[i])), i=X)};
   end do;
   d := i-1;
   [seq(S[i], i=0..d)];
end proc:

You can play with this code to make it more efficient. For example, you can use another seq in place of op(map(zap, G[i])) and use lists everywhere instead of sets to completely avoid sorting (all elements are unique). These types of optimizations are important for large problems. Assuming you know, you should always ask yourself "what will the kernel do ?" That's the surest way to fast code.

Comments

JacquesC's picture

Interesting post

Your variant is interesting as it puts a 'true' wherever something new is encountered. That leads me to the following variant:
zap2 := proc(v) procname(v) := 1; 0 end proc:
forget(zap2):
zap := proc(v) zap2(v) := zap2(v) + 1; end proc:

roman_pearce's picture

clever

That is a neat, clever trick. I'm going to file that one away. For what it's worth, the true/false version is for selectremove, in case you want the duplicates.

counter

Here's a small module that uses that technique to count stuff.

Counter := module()
export count, tally, reset;
local zap;
    zap   := proc(v) procname(v) := 1; 0; end proc:
    reset := proc() forget(zap); end proc;
    count := proc(v) zap(v) := zap(v)+1; end proc:
    tally := proc(v) zap(v) := zap(v); end proc;
end module:

neat, but

The zap function is a clever idea, however, using it in BFS appears to be slower than using a set based approach. How large are the graphs that you are dealing with? I've tested with graphs up to 100000 nodes. Here is the set based procedure:

BFS3 := proc(G, v)
local S, U, X, i, j;
    S := table();
    X := {v};
    U := {};
    for i while X <> {} do
        S[i] := X;
        U := U union X;
        X := `union`(seq(G[j], j=X)) minus U;
    end do;
    [seq(S[j], j=1..i-1)];
end proc:

Here is a comparison of the times and memory usage. BFS1 is your procedure, BFS2 is essentially the same, but with lists instead of sets. n is the number of nodes in the graph, m is the number of children at each node. Graph is generated randomly.

                     +---- Timing (sec) -----+---- Memory (MiB) --------+
n=1000, m=10         |  Maple    |   Total   |   Used     |    Alloc    | GC times
---------------------+-----------+-----------+------------+-------------+---------
BFS1                 |     0.03  |     0.02  |     0.53   |      0.00   |    0
BFS2                 |     0.02  |     0.02  |     0.57   |      0.50   |    0
BFS3                 |     0.01  |     0.01  |     0.32   |      0.31   |    0
---------------------+-----------+-----------+------------+-------------+---------
n=1000, m=100                                                                     
---------------------+-----------+-----------+------------+-------------+---------
BFS1                 |     0.10  |     0.10  |     3.14   |      0.00   |    0
BFS2                 |     0.10  |     0.10  |     3.16   |      2.87   |    0
BFS3                 |     0.02  |     0.02  |     1.36   |      1.37   |    0
---------------------+-----------+-----------+------------+-------------+---------
n=10000, m=10                                                                     
---------------------+-----------+-----------+------------+-------------+---------
BFS1                 |     0.28  |     0.29  |     4.83   |      1.00   |    0
BFS2                 |     0.27  |     0.28  |     4.79   |      3.50   |    0
BFS3                 |     0.07  |     0.08  |     4.36   |      3.62   |    0
---------------------+-----------+-----------+------------+-------------+---------
n=10000, m=100                                                                    
---------------------+-----------+-----------+------------+-------------+---------
BFS1                 |     1.63  |     1.77  |    32.07   |     20.50   |    0
BFS2                 |     1.63  |     1.71  |    32.04   |     27.49   |    0
BFS3                 |     0.29  |     0.46  |    25.98   |     26.37   |    0
---------------------+-----------+-----------+------------+-------------+---------

much better here

This time I tried testing with a nonrandom graph, using

SeqGraph := proc(n::posint, m::posint)
local i,G,k;
    G := table();
    for i to n do
        G[i] := {seq(modp(k-1,n)+1, k=i..i+m)}
    end do;
    eval(G);
end proc:

The results were significantly different:

                     +---- Timing (sec) -----+---- Memory (MiB) --------+
n=10^2, m=10         |  Maple    |   Total   |   Used     |    Alloc    | GC times
---------------------+-----------+-----------+------------+-------------+---------
BFS1                 |     0.00  |     0.01  |     0.14   |      0.06   |    0
BFS2                 |     0.01  |     0.01  |     0.15   |      0.12   |    0
BFS3                 |     0.01  |     0.00  |     0.02   |      0.00   |    0
---------------------+-----------+-----------+------------+-------------+---------
n=10^3, m=10                                                                      
---------------------+-----------+-----------+------------+-------------+---------
BFS1                 |     0.03  |     0.03  |     0.59   |      0.19   |    0
BFS2                 |     0.02  |     0.03  |     0.63   |      0.56   |    0
BFS3                 |     0.01  |     0.01  |     0.33   |      0.31   |    0
---------------------+-----------+-----------+------------+-------------+---------
n=10^4, m=10                                                                      
---------------------+-----------+-----------+------------+-------------+---------
BFS1                 |     0.26  |     0.26  |     5.26   |      2.81   |    0
BFS2                 |     0.25  |     0.26  |     5.40   |      3.87   |    0
BFS3                 |     0.39  |     0.43  |    20.29   |     25.75   |    0
---------------------+-----------+-----------+------------+-------------+---------
n=10^5, m=10                                                                      
---------------------+-----------+-----------+------------+-------------+---------
BFS1                 |     3.64  |     3.74  |    48.71   |     22.06   |    1
BFS2                 |     3.58  |     3.69  |    48.90   |     27.68   |    1
BFS3                 |    48.73  |    85.43  |  1919.24   |    338.19   |   50

The set based approach is substantially worse for this case.

roman_pearce's picture

structured linear systems

Thanks for the times, I found them very interesting.

I am using this on structured linear systems, which may be very large and the BFS will do potentially O(n) steps, as in your example above. The reason sets were fast on the random graphs is because the BFS didn't do many steps. With random edges going everywhere you can cover the graph in I think O(log[m](n)) steps (n=10^5, m=10 took 8 steps). In that case, the incremental cost of adding a vertex doesn't matter.

an ingenious idea

That makes sense.

Thanks for sharing this technique, it is quite ingenious. A neat addition to the toolbox.

Special-Case Improvement

Your technique can be "improved" for the special case in which the set elements consist of integers in a known range. The toy graphs have this quality. I don't expect that your real application, or most uses, do. The motivation was to eliminate the call to zap, a non-builtin function, in the inner loop. The technique is to use an rtable with indices in the known range of integers and initialized to the index values (Z[1] = 1, Z[2] = 2, etc.). Elements are zapped by assigning their entries to NULL.

The new code is procedure BFS4. To make the comparison as fair as possible, I've included BFS22, which uses the double seq optimization that you mentioned.

BFS1 := proc(G, v)
local S, X, i, j, zap;
    zap := proc(v) procname(v) := NULL; v end proc;
    forget(zap);
    S := table();
    X := map(zap, {v});
    for i while nops(X) > 0 do
        S[i] := X;
        X := {seq(op(map(zap, G[j])), j=X)};
    end do;
    [seq(S[j], j=1..i-1)];
end proc:

BFS22 := proc(G, v)
local S, X, i, j, k, zap;
    zap := proc(v) procname(v) := NULL; v end proc;
    forget(zap);
    S := table();
    X := [zap(v)];
    for i while X <> [] do
        S[i] := X;
        X := [seq(seq(zap(k), k=G[j]), j=X)];
    end do;
    [seq(S[j], j=1..i-1)];
end proc:

BFS4 := proc(G, v, n)
local S, X, i, j, k, Z;
    Z := rtable(1..n, i->i);
    S := table();
    X := {v};
    Z[v] := NULL;
    for i while X <> {} do
        S[i] := X;
        X := {seq(seq(Z[k], k=G[j]), j=X)};
        for k in X do
            Z[k] := NULL;
        end do;
    end do;
    [seq(S[j], j=1..i-1)];
end proc:

Here is the comparison using graphs created by SeqGraph

                     +---- Timing (sec) -----+---- Memory (MiB) --------+
n=10^3, m=10         |  Maple    |   Total   |   Used     |    Alloc    | GC times
---------------------+-----------+-----------+------------+-------------+---------
BFS1                 |     0.03  |     0.03  |     0.59   |      0.12   |    0
BFS22                |     0.02  |     0.02  |     0.41   |      0.31   |    0
BFS4                 |     0.02  |     0.01  |     0.26   |      0.19   |    0
---------------------+-----------+-----------+------------+-------------+---------
n=10^4, m=10                                                                      
---------------------+-----------+-----------+------------+-------------+---------
BFS1                 |     0.26  |     0.26  |     5.26   |      2.81   |    0
BFS22                |     0.22  |     0.23  |     3.20   |      1.94   |    0
BFS4                 |     0.16  |     0.16  |     2.32   |      1.06   |    0
---------------------+-----------+-----------+------------+-------------+---------
n=10^5, m=10                                                                      
---------------------+-----------+-----------+------------+-------------+---------
BFS1                 |     3.61  |     5.44  |    48.72   |     21.87   |    1
BFS22                |     2.84  |     3.04  |    28.50   |     18.12   |    0
BFS4                 |     1.97  |     2.08  |    23.63   |     11.69   |    0

Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.
}