Linear-time minimization of Wheeler DFAs

Wheeler DFAs (WDFAs) are a sub-class of finite-state automata which is playing an important role in the emerging field of \emph{compressed data structures}: as opposed to general automata, WDFAs can be stored in just $\log\sigma + O(1)$ bits per edge, $\sigma$ being the alphabet's size, and support optimal-time pattern matching queries on the substring closure of the language they recognize. An important step to achieve further compression is minimization. When the input $\mathcal A$ is a general deterministic finite-state automaton (DFA), the state-of-the-art is represented by the classic Hopcroft's algorithm, which runs in $O(|\mathcal A|\log |\mathcal A|)$ time. This algorithm stands at the core of the only existing minimization algorithm for Wheeler DFAs, which inherits its complexity. In this work, we show that the minimum WDFA equivalent to a given input WDFA can be computed in linear $O(|\mathcal A|)$ time.
When run on de Bruijn WDFAs built from real DNA datasets, an implementation of our algorithm reduces the number of nodes from 14\% to 51\% at a speed of more than 1 million nodes per second.

Linear-time minimization of Wheeler DFAs.pdf

Linear-time minimization of Wheeler DFAs.pdf (256)

Thumbs Up

Comments

proof of minimality

Permalink Submitted by Dominik Koeppl on 20 March 2022 - 12:15am

From your well-done explanations in your presentation, it is believable to me that the proposed algorithm produces an equivalent automaton with possibly smaller size. However, is it easily understandable why the algorithm also produces the smallest such automaton? In particular, I fear that the order in which pairs of nodes are collapsed could influence the resulting automaton, but obviously I should be wrong ;-)

Minimality

Permalink Submitted by Nicola Cotumaccio on 31 March 2022 - 12:45pm

The order in which pairs are collapsed does not matter, because then the collapsed automaton is obtained by the standard construction used in DFA minimization: pick two nodes u and v to be collapsed, remove the edges leaving and reaching u or v, remove u and v, add a new single node {u, v} (the collapsed one), add an edge labeled a between z and {u, v} if and only if there exists an edge labeled a from z to u or v, add an edge labeled a between {u, v} and z if and only if there exists an edge labeled a between u or v and z. The reason why the algorithm yields the smallest such automaton is that in a Wheeler DFA we have u < v if and only if all strings readable from the initial state to u are co-lexicographically smaller than all strings readable from the initial state to v, which implies that only consecutive states can be collapsed, otherwise this property would be lost. Lastly, collapsed states should be Nerode-equivalent for the same reason why in standard DFA minimization one only collapses Nerode equivalent states.

CITE

Documents

Presentation Slides

Linear-time minimization of Wheeler DFAs