Sequence alignment

In my previous post I illustrated the Levenshtein edit distance by comparing the opening paragraphs of Finnegans Wake by James Joyce and a parody by Adam Roberts.

In this post I’ll show how to align two sequences using the sequence alignment algorithms of Needleman-Wunsch and Hirschberg. These algorithms can be used to compare any sequences, though they are more often used to compare DNA sequences than impenetrable novels and parodies.

I’ll be using Gang Li’s implementation of these algorithms, available on github. I believe the two algorithms are supposed to produce the same results, that Hirschberg’s algorithm is a more space-efficient implementation of the Needleman-Wunsch algorithm, though the two algorithms below produce slightly different results. I’ll give the output of Hirschberg’s algorithm.

Li’s alignment code uses lists of characters for input and output. I wrote a simple wrapper to take in strings and output strings.

    from alignment import Needleman, Hirschberg

    def compare(str1, str2):
        seq1 = list(str1)
        seq2 = list(str2)
        for algorithm in [Needleman(), Hirschberg()]:
            a, b = algorithm.align(seq1, seq2)

The code inserts vertical bars to indicate spaces added for alignment. Here’s the result of using the Needleman-Wunsch algorithm on the opening paragraphs of Finnegans Wake and the parody Finnegans Ewok.

    |||riverrun, past Ev|e| and Adam'||||s,
    mov|i|er|un, past ||new and |||||hopes,

    from swe|rv||e of shore|||| to bend of
    from s||tr|ike of |||||back to bend of

    b|||ay, brings us by a commodius
    |jeday, brings us by a commodius

    vic|u||s of recirculation back to
    |||lucas of recirculation back to

    H|owth Ca||stle|||| and E|nvi||r|ons.
    |fo||||||rest||moon and |en||dor.||||

I mentioned in my previous post that I could compare the first four paragraphs easily, but I had some trouble aligning the fifth paragraphs. The fifth paragraphs of each version start out quite simiar:

    Bygme||ster Fi|nnega||n, of the 
    Bygm|onster ||Ann||akin, of the 
    Stutte||r|||||||ing Hand, f|re|emen'|s
    ||||||Throatchokin| Hand, for|cemen|’s
    mau-rer, lived in the broadest way
    mau-rer, lived in the broadest way
    immarginable in his rushlit
    immarginable in his rushlit
    toofar-|||back for messuages before
    toofar| — back for messuages before 

but then Roberts takes the liberty of skipping over a large section of the original. This is what I suspected by looking at the two texts, but Hirschberg’s algorithm makes the edits obvious by showing two long sequences of vertical bars, one about 600 characters long and another about 90 characters long.