Hide
Need a quick chart or graph for your blog? Try our reStructured Text renderer. Join Siafoo Now or Learn More

'A' Longest Common Subsequence (LCS) implementation Atom Feed 1

I found myself wanting to do a Linux style diff of a graph (directed acyclic) in C++. In digging around I found the Longest Common Subsequence algorithm (dating back to 1974!). I decided to understand it by implementing a version in C++. This is a C++ template implementation which works under some constraints.

1   Contents

1.1   Introduction

I ran into the Longest Common Subsequence algorithm first while looking for how diff worked. I read a bit, then jumped right in, trying to avoid Google search results that pointed to C++ implementations since I wanted to write my own.

Why did I want to write an LCS implementation? Well, it turns out that having found "a" common subsequence, one can display diff-like output based on that common subsequence algorithm's results. Basically: solving LCS gives you diff.

I read a bit on Wikipedia, a UCI course note from 1996 and from a site called algorithmist.

First some definitions. A sequence X is defined as an ordered set {}.

A subsequence is basically a subset of an ordered sequence. Say a sequence S which is {}.

A common subsequence is a subsquence common to two sequences. For example, if we have two sequences (of characters/text) X={A,B,C,D,B,A,G,Z,C,D,C,A} and Y={B,C,A,D,C,G,Z} we can easily see these subsequences:

BCA CDC BCAD

In this case by inspection I can see two longest ones:

A B C   D B A     G Z C D C A
  B C       A D C G Z

Which gives S={B,C,A,G,Z} and:

A B C   D B A   G Z C D C A
  B C A D     C G Z

Which gives S={B,C,D,G,Z}

1.2   Chapter 1 Logic/Algorithm

At first analysis, if we have two sequences for which we are seeking a longest common subsequence we can find the solution in the following way. This is the brute force method. If matches then we move on to the next. If not, we compare with . We keep proceeding through Y comparing to until we find a match, for my example above, at , or we get to the end of Y.

When we get to the match point, we increment the X position and proceed from where we were in Y. Using the example above, we found == so we go to and start comparing with . This time we will get to the end of Y in our example.

If we get to the end of Y without a match then we move on to the next X, in our example, and reset the comparison to the last place that we found a match in Y + 1, which in this case was and go to the next one, .

In our example we will once again hit the end of Y without a match; this time we move to . Here we find a match because == == 'C'.

We increment our position in both sequences to and . Proceeding as before we should reach matching and matching .

At that point, both increment to the end without any more matches and we have found:

    A   B C   D B A     G Z C D C A
B C A D   C             G Z

or S={A,C,G,Z} for our first potential "longish" subsequence.

Then we move to starting at and do it all again form which immediately match. We proceed to and that also matches . Then to which matches .

The algorithm I came up with is implemented in brute force example for LCS.

Now, to understand the potential cost of this it is clear that we go through Y at least n times (where n is the number of items in X). So that is O(n*m) (m is the number of items in Y). However, at each point we do not match, we could potentially rewind multiple times. Doing a lot of testing with no with sequences like X={A,B,C,D} and Y={D} I saw some extra looping which I stopped with some if statements but these cost time. In any case, what you see is the final version and it still will cost more comparisons than the ideal LCS algorithm.

What if we take every subsequence of X and look for it in Y? Then we have the longest subsequence, X itself, then start knocking off one character at a time and check again. This truly brute force algorithm is provided in find ALL common subsequences and then sort for longest.

A critical insight into this problem is to realize that it can be broken down into sub-problems. Let us define a suffix as the first part of a sequence. For example, above suffixes would be A, AB, ABC, ABCD & etc.

Now, suppose two sequences start with the same element. Then to find their LCS we need to only compare the remainder. In other words, if the first item matches then a LCS is the item that matched plus a LCS of X = {} with Y = {}.

If the first element does not match, then obviously it is not part of an LCS and can be ignored. But then how do we proceed? We must check X = {} against the Y = {} and vice-versa to see which comparison yields a longer LCS.

This solution is analogous to the description in the Wikipedia reference except that there it is more formally defined and he is removing the last element instead of the first.

My version leads to a recursive solution like recursive solution for LCS problem.

1.3   Chapter 2 Implementation

I went through a series of implementations for fun and exploration. The first and most basic was a C version for char* comparisons. The second was a C++ version which used memoization to keep track of the comparisons.

1.3.1   Section 2.1 C version for char*

The first code I implemented was to demonstrate the memoization algorithm from the UCI article pretty much as-is (longest common subsequence algorithm in C for char*). It is implemented in C.

1.3.2   Section 2.2 C++ version (first try)

The next thing I decided to do was think about this in terms of the sequence being any object. As a result I would need to think more about indices and what it meant to be at the end of a set of objects and the size of the object collection. That was too much to think about so I decided to make an LCS C++ template class that could be used for c++ containers or at least some containers. The code is here LCS C++ template classes and test/example code is here LCS C++ template TEST.

1.4   Chapter 3 What else?

At this point comes the meat of my problem. I want to compare two graphs. I've not read any literature about that (yet) but am considering the following way of doing things. At each node in the graph there are edges from that node to the nodes below. What makes a node in one graph equal to a node in the other graph? The edges must match and the nodes they point to must match.

These are an example of graphs I might want to compare:

Suppose we have two graphs like the above, G1 and G2. Functionally we are asking is G1 == G2. This is the same as asking if G1(N1) == G2(N1). At this point I make a rule to simplify my life. If the top nodes match and each edge to the next nodes match and those nodes match (not their children but just the nodes themselves) then I say that G1(N1) IS equal to G2(N1). I do not insist that ALL descendants match. Furthermore, I do not insist that p1_1 and p1_2 are in a particular order, just that they match some edge in G2(N1). For example if G(N1 - p1_1 -> N21) == G(N1 - p1_2 -> N21) then this means that p1_1 and p1_2 are equivalent.

In my case, this lead to looking at the longest common subsequence (LCS) of the edges. This would allow diff-like output in the case where one or more edges DO match but others do not match or are missing. This would provide more information than my current comparison performs because my current comparison stops upon encountering an edge that is out of sequence without being able to check if maybe the NEXT one would have matched.

What I am working with and how I do the comparison between the two graphs at the moment is to grab the list of nodes. For each node, I grab the list of edges and compare them until I fail to find a match. At that point the whole process stops. I want to be able to continue to check other nodes and other edges without stopping.

To proceed, the meaning of a node matching is simple: the characteristics of the node are compared and if they are equal, we proceed with the next step. [Note to self: For optimization I may decide to keep some of this info around as well. There are many repeated nodes in the graphs I am comparing and if node N1 has been compared to N1 before, there is no need to compare it's characteristics all over again.] Next compare the edges. I approach this as an LCS problem. Each edge_list G1(N1)(p1_1, p1_2, p1_3, etc) will be compared to G2(N1)(p1_1, p1_2, p1_3, etc) the same way we approach the longest common sub-sequence problem. Then we go on to the next node in our node_list and do the same thing.

1.5   Update 2011-12-23

I am busy with a new job. When I started this work I was unemployed and keeping busy with this and a number of other things. Now my efforts are to learn more about the things I need for the new job. So I'm leaving this where it is and making it public.