Computer Theory and Application Laboratory

Algorithm Engineering for All-Pairs Suffix-Prefix Matching


Algorithm Engineering for All-Pairs Suffix-Prefix Matching

Jihyuk Lim and Kunsoo Park


Introduction

All-pairs suffix-prefix matching is an important part of DNA sequence assembly where it is the most time-consuming part of the whole assembly. Although there are algorithms for all-pairs suffix-prefix matching which are optimal in the asymptotic time complexity, they are slower than SOF and Readjoiner which are state-of-the-art algorithms used in practice.
In this paper we present an algorithm for all-pairs suffix-prefix matching that uses a simple data structure for storing input strings and advanced algorithmic techniques for matching, which together lead to fast running time in practice. Our algorithm is much faster than SOF and Readjoiner in real datasets and random datasets.

 


Download

The code for this work can be downloaded here.


Datasets

We compare the performances of SOF, Readjoiner, and our algorithm for the whole process of APSP matching. The labels of Our1 and Our2 mean our algorithm with option output = 2 and output = 3, respectively. For SOF, option −o 2 is used. Our2, SOF with −o 2 and Readjoiner return the same output, which includes all overlaps for each pair of input strings. Our1 solves the problem of APSP matching as it is defined (i.e., it returns the longest overlap for each pair of input strings).  In our algorithm we set c1 to 16 and c2 to 8 in all experiments.

We used two types of datasets which are real and random. The five real datasets are the complete EST databases of Citrus clementina, Citrus sinensis, Citrus trifoliata, C. elegans, and Atta cephaloates.

The random datasets are generated by a program that gives k strings such that the lengths of the strings follow a normal distribution with mean z and standard deviation o and the characters of the strings follow a uniform distribution over the alphabet, where k, z and o are parameters given by the user. The alphabet is again {A,C, G, T}. We generated two datasets rnd1 and rnd2, where rnd1 has 300000, 1000, and 150 as k, z, and o, respectively, and rnd2 has 1000000, 500, and 100.


 

Contact us at: jhlim@theory.snu.ac.kr