The Shortest Common Supersequence Problem

Andreas Westling <anwe9331@csd.uu.se>

1 Definitions

2 The problem

The problem is to find a shortest common supersequence (SCS), which is a common supersequence of minimal length. There could be more than one SCS for a given problem.

2.1 Example

Σ = {a, b, c}
S1 = bcb
S2 = baab
S3 = babc

One shortest common supersequence is babcab. Others of the same length are babacb, baabcb, bcaabc, bacabc and baacbc.
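
The example can be checked mechanically. The following is a minimal Python sketch, not the ANSI-C program from this project; the helper names (is_subsequence, is_common_supersequence, shortest_length_brute_force) are my own assumptions. It confirms that the listed strings are common supersequences and finds the minimal length by exhaustive search, which is only feasible for tiny instances like this one.

```python
from itertools import product

def is_subsequence(s, t):
    """True if s can be obtained from t by deleting symbols."""
    it = iter(t)
    return all(ch in it for ch in s)

def is_common_supersequence(candidate, sequences):
    """True if every input sequence is a subsequence of candidate."""
    return all(is_subsequence(s, candidate) for s in sequences)

def shortest_length_brute_force(sequences, alphabet):
    """Try every string over the alphabet, shortest first (tiny inputs only)."""
    length = max(len(s) for s in sequences)
    while True:
        for letters in product(alphabet, repeat=length):
            if is_common_supersequence("".join(letters), sequences):
                return length
        length += 1

sequences = ["bcb", "baab", "babc"]
listed = ["babcab", "babacb", "baabcb", "bcaabc", "bacabc", "baacbc"]

print(all(is_common_supersequence(c, sequences) for c in listed))  # True
print(shortest_length_brute_force(sequences, "abc"))               # 6
```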

3 Techniques

Dynamic programming: Requires too much memory unless the number of input sequences is very small.
Branch and bound: Requires too much time unless the alphabet is very small.
Majority merge: The best known heuristic when the number of sequences is large compared to the alphabet size. [1]
Greedy (repeatedly take two sequences and replace them by their optimal shortest common supersequence until a single string is left): Worse than majority merge. [1]
Genetic algorithms: There are indications that they might do better than majority merge. [1]
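
For two sequences the dynamic programming approach is cheap, and it is also the building block the greedy technique needs. The following is a minimal Python sketch under my own naming (scs_pair is a hypothetical helper, not taken from the project code). The table has one dimension per input sequence, which is why memory use explodes as the number of sequences grows.

```python
def scs_pair(a, b):
    """Return one shortest common supersequence of the two strings a and b."""
    m, n = len(a), len(b)
    # dp[i][j] = length of a shortest common supersequence of a[i:] and b[j:]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m, -1, -1):
        for j in range(n, -1, -1):
            if i == m:
                dp[i][j] = n - j          # only b is left
            elif j == n:
                dp[i][j] = m - i          # only a is left
            elif a[i] == b[j]:
                dp[i][j] = 1 + dp[i + 1][j + 1]
            else:
                dp[i][j] = 1 + min(dp[i + 1][j], dp[i][j + 1])
    # Walk the table from the start to rebuild one optimal supersequence.
    out = []
    i = j = 0
    while i < m and j < n:
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif dp[i + 1][j] <= dp[i][j + 1]:
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    out.extend(a[i:])
    out.extend(b[j:])
    return "".join(out)
```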

4 Implemented heuristics

4.1 The trivial solution

The trivial solution is at most |Σ| times the optimal solution length, since no common supersequence can be shorter than the longest input sequence. It is obtained by concatenating all characters of Σ and repeating that string as many times as the length of the longest input sequence. That is, if Σ = {a, b, c} and the longest input sequence is of length 4, we get abcabcabcabc.
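
As a minimal Python sketch (the function name trivial_supersequence is my own, and the alphabet is assumed to be given as a string of distinct symbols), the trivial solution is a one-liner:

```python
def trivial_supersequence(sequences, alphabet):
    """Repeat the whole alphabet once per position of the longest input sequence."""
    longest = max(len(s) for s in sequences)
    return "".join(alphabet) * longest

print(trivial_supersequence(["bcb", "baab", "babc"], "abc"))  # abcabcabcabc
```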

4.2 Majority merge heuristic

The Majority merge heuristic builds up a supersequence from the empty sequence (S) in the following way:
  WHILE there are non-empty input sequences
    s <- The most frequent symbol at the start of non-empty input-sequences.
    Add s to the end of S.
    Remove s from the beginning of each input sequence that starts with s.
  END WHILE
Majority merge performs very well when the number of sequences is large compared to the alphabet size.
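
The loop above can be sketched in Python as follows. This is an illustration, not the project's ANSI-C code. The pseudocode does not say how ties are broken; choosing the smallest symbol among the most frequent ones is my own assumption, made only to keep the result deterministic.

```python
from collections import Counter

def majority_merge(sequences):
    """Build a common supersequence by repeatedly emitting the most frequent first symbol."""
    remaining = [s for s in sequences if s]
    result = []
    while remaining:
        counts = Counter(s[0] for s in remaining)
        best = max(counts.values())
        # Assumed tie-break: the smallest symbol among the most frequent ones.
        symbol = min(sym for sym, c in counts.items() if c == best)
        result.append(symbol)
        # Strip the chosen symbol from every sequence that starts with it.
        remaining = [s[1:] if s[0] == symbol else s for s in remaining]
        remaining = [s for s in remaining if s]
    return "".join(result)

print(majority_merge(["bcb", "baab", "babc"]))  # baabcb
```

On the example from section 2.1 this happens to produce an optimal supersequence of length 6; in general the result is not guaranteed to be optimal.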

5 My approach - Local search

My approach was to apply a local search heuristic to the SCS problem and compare it to the Majority merge heuristic to see if it might do better in the case when the alphabet size is larger than the number of sequences.

Since the length of a valid supersequence may vary, and any change to the supersequence may give an invalid string, a direct representation of a supersequence as a feasible solution is not an option.

I chose to view a feasible solution (S) as a sequence of mappings x1...xSl, where Sl is the sum of the lengths of all sequences and each xi maps to a sequence number and an index within that sequence.

That means, if L = {{s1,1...s1,m1}, {s2,1...s2,m2}, ..., {sn,1...sn,mn}} is the set of input sequences and L(i) is the ith sequence, the mappings are represented like this:

xi → {k, l}, where k ∈ L and l ∈ L(k)

To be sure that any solution is valid we need to introduce the following constraints:

1. Every symbol in every sequence has exactly one xi mapped to it.
2. If xi → ss,k and xj → ss,l (symbols k and l of the same sequence s) and k < l, then i < j.
3. If xi → ss,k and xj → ss,l and k > l, then i > j.

The second and third constraints enforce that the order of the symbols within each sequence is preserved, but not their positions in S. If we have two mappings xi and xj, then we may only exchange them if they map to different sequences.
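
To make the representation concrete, here is a minimal Python sketch. It assumes a solution is stored as a list of (sequence number, index) pairs with 0-based numbering, unlike the 1-based notation above; the names is_valid_solution and to_string are hypothetical.

```python
def is_valid_solution(solution, sequences):
    """Check the three constraints for a list of (sequence number, index) pairs."""
    # Constraint 1: every symbol of every sequence is mapped exactly once.
    expected = {(k, l) for k, seq in enumerate(sequences) for l in range(len(seq))}
    if len(solution) != len(expected) or set(solution) != expected:
        return False
    # Constraints 2 and 3: within each sequence the indices appear in increasing order.
    next_index = [0] * len(sequences)
    for k, l in solution:
        if l != next_index[k]:
            return False
        next_index[k] += 1
    return True

def to_string(solution, sequences):
    """Translate the mappings into the symbols they point to."""
    return "".join(sequences[k][l] for k, l in solution)

sequences = ["ab", "c"]
solution = [(0, 0), (1, 0), (0, 1)]
print(is_valid_solution(solution, sequences), to_string(solution, sequences))  # True acb
```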

5.1 The initial solution

There are many ways to choose an initial solution. As long as the order within each sequence is preserved, it is valid. I chose not to randomize the solution but to try two very different types of solution and compare them.

The first one is to create an initial solution by simply concatenating all the sequences.

The second one is to interleave the sequences one symbol at a time. That is, start with the first symbol of every sequence, then, in the same order, take the second symbol of every sequence, and so on.
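
Both initial solutions can be sketched in a few lines of Python. As an assumption of this sketch, a solution is a list of 0-based (sequence number, index) pairs, and the function names are my own.

```python
def concatenated_solution(sequences):
    """All mappings of the first sequence, then all of the second, and so on."""
    return [(k, l) for k, seq in enumerate(sequences) for l in range(len(seq))]

def interleaved_solution(sequences):
    """First symbol of every sequence, then second symbol of every sequence, and so on."""
    solution = []
    longest = max((len(seq) for seq in sequences), default=0)
    for l in range(longest):
        for k, seq in enumerate(sequences):
            if l < len(seq):          # shorter sequences simply run out
                solution.append((k, l))
    return solution

def to_string(solution, sequences):
    """Translate the mappings into the symbols they point to."""
    return "".join(sequences[k][l] for k, l in solution)

sequences = ["bcb", "baab", "babc"]
print(to_string(concatenated_solution(sequences), sequences))  # bcbbaabbabc
print(to_string(interleaved_solution(sequences), sequences))   # bbbcaababbc
```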

5.2 Local change and the neighbourhood

A local change is done by exchanging two mappings in the solution.
One way of doing the iteration is to go through the mappings in order, from x1 to xSl, and do the best exchange for each mapping.
Another way is to try to exchange the mappings in the order they are defined by the sequences. That is, first exchange s1,1, then s2,1. That is what we do.

There are two variants I have tried.

In the first one, if a single mapping exchange does not yield a better value I return, otherwise I go on.

In the second one, I do one exchange separately for each sequence, that is, as many exchanges as there are sequences, so that a symbol in each sequence has a chance to move. I keep the exchange that gives the best value; if that value is worse than the value from the previous step of the algorithm I return, otherwise I go on.

A symbol may move any number of positions to the left or to the right as long as the exchange does not change the order of the original sequences.

The neighbourhood in the first variant consists of the valid exchanges that can be made for the symbol. In the second variant it consists of the valid exchanges of each symbol, taken after the previous symbol has been exchanged.
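
One possible reading of the exchange rule, sketched in Python: two mappings may be swapped only if neither of them jumps over another mapping of its own sequence. The names valid_exchanges and exchange, and the 0-based (sequence number, index) pairs, are assumptions of this sketch and not taken from the program.

```python
def valid_exchanges(solution, i):
    """Positions j whose mapping can be swapped with the mapping at position i."""
    result = []
    for j in range(len(solution)):
        if j == i:
            continue
        lo, hi = min(i, j), max(i, j)
        seq_lo, seq_hi = solution[lo][0], solution[hi][0]
        # The mapping at lo jumps right over lo+1..hi,
        # the mapping at hi jumps left over lo..hi-1.
        blocked = (any(solution[p][0] == seq_lo for p in range(lo + 1, hi + 1))
                   or any(solution[p][0] == seq_hi for p in range(lo, hi)))
        if not blocked:
            result.append(j)
    return result

def exchange(solution, i, j):
    """Return a copy of the solution with the mappings at i and j swapped."""
    swapped = list(solution)
    swapped[i], swapped[j] = swapped[j], swapped[i]
    return swapped

solution = [(0, 0), (1, 0), (0, 1), (1, 1)]
print(valid_exchanges(solution, 1))  # [0, 2]
```

Under this reading, the neighbourhood of a mapping is exactly the list returned by valid_exchanges.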

5.3 Evaluation

Since the length of the solution is always constant, it has to be compressed before the real length of the solution can be obtained.

The solution S, which consists of mappings, is converted to a string by using the symbols each mapping points to. A new, initially empty, solution T is created. Then this algorithm is performed:

  T = {}
  FOR i = 1 TO Sl
    found = FALSE
    FOR j = 1 TO |L|
       IF L(j) is non-empty AND first symbol in L(j) = the symbol xi maps to THEN
          Remove first symbol from L(j)
          found = TRUE
       END IF
    END FOR
    IF found = TRUE THEN
      Add the symbol xi maps to to the end of T
    END IF
  END FOR

Sl is, as before, the sum of the lengths of all sequences. L is the set of all sequences and L(j) is sequence number j.

The value of the solution S is obtained as |T|.
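
The compression step can be sketched in Python as follows. This is an illustration rather than the project's ANSI-C code; the function name compress is my own, and it takes the solution already converted to its string of symbols.

```python
def compress(symbols, sequences):
    """Drop every symbol of the solution string that advances no input sequence."""
    remaining = list(sequences)   # working copies; the inputs are left untouched
    kept = []
    for symbol in symbols:
        found = False
        for j, seq in enumerate(remaining):
            if seq and seq[0] == symbol:
                remaining[j] = seq[1:]   # this sequence consumes the symbol
                found = True
        if found:
            kept.append(symbol)
    return "".join(kept)

sequences = ["bcb", "baab", "babc"]
print(compress("bcbbaabbabc", sequences))  # bcbaabc (concatenated start, value 7)
print(compress("bbbcaababbc", sequences))  # bcaabc  (interleaved start, value 6)
```

For the example from section 2.1, the interleaved initial solution already compresses to a string of length 6, while the concatenated one compresses to length 7.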

6 Program code

ANSI-C with Makefile

7 Results

Increasing number of sequences - SCS length
Increasing number of sequences - CPU time
Increasing alphabet size - SCS length
Increasing alphabet size - CPU time

Explanations:

Interleaved, Initial: Just the length/time value for the interleaved initial solution.
Concatenated, Initial: Just the length/time value for the concatenated initial solution.
Interleaved, One: The length of the solution/time when a local change is defined as only one exchange and the initial solution is interleaved.
Interleaved, All: The length of the solution/time when a local change is defined as one exchange for each input sequence and the initial solution is interleaved.
Concatenated, One: The length of the solution/time when a local change is defined as only one exchange and the initial solution is concatenated.
Concatenated, All: The length of the solution/time when a local change is defined as one exchange for each input sequence and the initial solution is concatenated.

8 Conclusions, Comparisons, Problems for Future Research

As can be seen from the result graphs, "Interleaved, All" and "Concatenated, All" perform the best. I am quite happy with the fact that the algorithm actually managed to beat Majority Merge when the alphabet size is large compared to the number of sequences. It also performed very well when the number of sequences is large compared to the alphabet size, although not as well as Majority Merge. Both "Interleaved, All" and "Concatenated, All" are much worse than Majority Merge in terms of CPU time, but they are very fast compared to exhaustive search. I did not include exhaustive search in any graph or anywhere else on this page because I found it impossible to compute in reasonable time except for very small problems.

9 Major Links on the topic

Branke and Middendorf's paper on SCS and Genetic Algorithms
Page with good definitions

10 Bibliography

  1. J. Branke, M. Middendorf (1996) Searching for Shortest Common Supersequences by Means of a Heuristic-Based Genetic Algorithm, Proceedings of the 2nd Nordic Workshop on Genetic Algorithms and Their Applications, pp. 105-114
