Documents
Poster
FM-Directories: Extending the Burrows-Wheeler Transform for String Labeled Vertex Graphs of (Almost) Arbitrary Topology
- Citation Author(s):
- Submitted by:
- Unsal Ozturk
- Last updated:
- 28 February 2023 - 12:54pm
- Document Type:
- Poster
- Document Year:
- 2023
- Event:
- Presenters:
- Unsal Ozturk
- Paper Code:
- 249
- Categories:
- Log in to post comments
We introduce an extension of the Burrows-Wheeler transform supporting exact pattern matching on a string graph $G(V,E,N)$ and yielding polynomial time queries.
Vertex $v_i\in V$ is labeled with a non-empty string $N_i\in N$. We use two special characters $\sigma_0,\sigma_1$ respectively denoting the beginning and the end of a string associated with a vertex, a vertex permutation $\pi:\{ 1,...,|V|\} \rightarrow \{1,...,|V|\}$ and uniquely decodable codewords $a_{1},...,a_{|V|}$ in ascending lexicographic order over a different alphabet. The string encoding $\textrm{RE}_\textrm{FMD (G)$ is then defined as the concatenation $\sigma_0 N_{1} \sigma_1 a_{\pi^{-1}(1)} (...) \sigma_0 N_{{|V|}} \sigma_1 a_{\pi^{-1}(|V|)}\$$. The FM-Directory is constructed by building an FM-index over $\textrm{RE}_\textrm{FMD}(G)$ accompanied by three tables: one to translate the Burrows-Wheeler rank of $\sigma_0$ to its $\textrm{RE}_\textrm{FMD}(G)$ rank, one to translate the $\textrm{RE}_\textrm{FMD}(G)$ rank of $\sigma_1$ to its Burrows-Wheeler rank, and one to store incoming neighbors for all $v\in V$. The exact match of a query $Q$ is computed as follows: characters are matched as they would be in an FM-index; however, at any point during the matching, if the character $\sigma_0$ is encountered in the range after matching any suffix $Q_s$ of $Q$, the ranks of the suffix array range of $\sigma_0$ are translated to $\textrm{RE}_\textrm{FMD}(G)$, which gives the indices of vertices with prefix $Q_s$ in their string labels; then the vertex indices of the union of their incoming neighbors are translated into the Burrows-Wheeler ranks of $\sigma_1$; then sub-queries are forked on one or more possibly non-consecutive suffix array ranges of $\sigma_1$, and this process is repeated recursively until $Q$ is exhausted.
Polynomial query times can be obtained in the best case by minimizing the number of consecutive suffix array ranges after the rank translation step. To do so, we enumerate the sets of unions of incoming neighbors of vertices whose string labels start with a given prefix, and force these sets to appear consecutively through in the suffix array through $a_{\pi^{-1}(i)}$.
It can be proven that computing such $\pi$ is \textsc{NP-Complete} by reduction from \textsc{Consecutive-Block-Minimization}, and that it is 1.5-approximable using Christofides' algorithm. Depending on the practical use case, weaker approximations might still prove useful. Overall, for a given permutation $\pi$, the FM-Directory can be constructed in
$O\left(|V|\log|V| + \sum_{i=1}^{|V|} |N_i| \right)$ time and takes $O(|V|\log|V| + |E| + F)$ space where $F$ is the size of an FM-index with input string of length $ 2|V| + |V|\log|V| + \sum_{i=1}^{|V|} |N_i|$. Finally, an algorithm with worst case time complexity $O(|Q|^2 + |Q||V|^2)$ of matching $Q$ in a $G$ admitting a permutation $\pi$ yielding completely consecutive suffix array ranges can be constructed.