Sorry, you need to enable JavaScript to visit this website.

FM-Directories: Extending the Burrows-Wheeler Transform for String Labeled Vertex Graphs of (Almost) Arbitrary Topology

Citation Author(s):
Unsal Ozturk, Paolo Ribeca, Marco Mattavelli
Submitted by:
Unsal Ozturk
Last updated:
28 February 2023 - 12:54pm
Document Type:
Poster
Document Year:
2023
Event:
Presenters:
Unsal Ozturk
Paper Code:
249
Categories:
 

We introduce an extension of the Burrows-Wheeler transform supporting exact pattern matching on a string graph $G(V,E,N)$ and yielding polynomial time queries.

Vertex $v_i\in V$ is labeled with a non-empty string $N_i\in N$. We use two special characters $\sigma_0,\sigma_1$ respectively denoting the beginning and the end of a string associated with a vertex, a vertex permutation $\pi:\{ 1,...,|V|\} \rightarrow \{1,...,|V|\}$ and uniquely decodable codewords $a_{1},...,a_{|V|}$ in ascending lexicographic order over a different alphabet. The string encoding $\textrm{RE}_\textrm{FMD (G)$ is then defined as the concatenation $\sigma_0 N_{1} \sigma_1 a_{\pi^{-1}(1)} (...) \sigma_0 N_{{|V|}} \sigma_1 a_{\pi^{-1}(|V|)}\$$. The FM-Directory is constructed by building an FM-index over $\textrm{RE}_\textrm{FMD}(G)$ accompanied by three tables: one to translate the Burrows-Wheeler rank of $\sigma_0$ to its $\textrm{RE}_\textrm{FMD}(G)$ rank, one to translate the $\textrm{RE}_\textrm{FMD}(G)$ rank of $\sigma_1$ to its Burrows-Wheeler rank, and one to store incoming neighbors for all $v\in V$. The exact match of a query $Q$ is computed as follows: characters are matched as they would be in an FM-index; however, at any point during the matching, if the character $\sigma_0$ is encountered in the range after matching any suffix $Q_s$ of $Q$, the ranks of the suffix array range of $\sigma_0$ are translated to $\textrm{RE}_\textrm{FMD}(G)$, which gives the indices of vertices with prefix $Q_s$ in their string labels; then the vertex indices of the union of their incoming neighbors are translated into the Burrows-Wheeler ranks of $\sigma_1$; then sub-queries are forked on one or more possibly non-consecutive suffix array ranges of $\sigma_1$, and this process is repeated recursively until $Q$ is exhausted.
Polynomial query times can be obtained in the best case by minimizing the number of consecutive suffix array ranges after the rank translation step. To do so, we enumerate the sets of unions of incoming neighbors of vertices whose string labels start with a given prefix, and force these sets to appear consecutively through in the suffix array through $a_{\pi^{-1}(i)}$.
It can be proven that computing such $\pi$ is \textsc{NP-Complete} by reduction from \textsc{Consecutive-Block-Minimization}, and that it is 1.5-approximable using Christofides' algorithm. Depending on the practical use case, weaker approximations might still prove useful. Overall, for a given permutation $\pi$, the FM-Directory can be constructed in
$O\left(|V|\log|V| + \sum_{i=1}^{|V|} |N_i| \right)$ time and takes $O(|V|\log|V| + |E| + F)$ space where $F$ is the size of an FM-index with input string of length $ 2|V| + |V|\log|V| + \sum_{i=1}^{|V|} |N_i|$. Finally, an algorithm with worst case time complexity $O(|Q|^2 + |Q||V|^2)$ of matching $Q$ in a $G$ admitting a permutation $\pi$ yielding completely consecutive suffix array ranges can be constructed.

up
0 users have voted: