Emerging: Big Data

BRIDGING THE GAP: A SELF-LEARNING MODEL USING IMPLICIT KNOWLEDGE FOR CHINESE SPELLING CORRECTION

Chinese Spelling Correction (CSC) is a challenging and essential task in natural language processing. In this study, we introduce a new CSC method that addresses three areas overlooked in prior studies. First, we use an Implicit Knowledge Extraction Network to overcome the limitations of conventional methods that rely on explicit knowledge alone. Second, we use KL divergence to limit the effect of incorrect characters on semantic understanding, keeping the meaning consistent.
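
The KL-divergence idea in the abstract above can be illustrated with a minimal sketch. It assumes two discrete probability distributions over the same candidate characters, one predicted from a clean sentence and one from its misspelled version. The function name `kl_divergence` and the example distributions are hypothetical; the abstract does not say how the paper computes or weights this term.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """Return KL(p || q) for two discrete distributions of equal length.

    eps guards against log(0) when q assigns zero mass to an outcome.
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0  # terms with p_i = 0 contribute nothing
    )

# Hypothetical predictions over two candidate characters at one position.
clean_prediction = [0.5, 0.5]
noisy_prediction = [0.9, 0.1]
penalty = kl_divergence(clean_prediction, noisy_prediction)
```

A value near zero means the two predictions agree; a training objective could add such a term as a penalty so that a spelling error shifts the predicted distribution as little as possible.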

Bridging_the_Gap__A_Self_learning_Model_using_Implicit_Knowledge_for_Chinese_Spelling_Correction__icassp_.pdf (290)

Categories: Emerging: Big Data

13 Views

Recursive Prefix-Free Parsing for Building Big BWTs

[DCC23] Recursive PFP video 2.pptx (442)

Categories: Emerging: Big Data

75 Views

Model Compression for Data Compression: Neural Network Based Lossless Compressor Made Practical

main-update.pdf (690)

Categories: Emerging: Big Data

74 Views

Compressed Input Data Format of Quantum Annealing Emulator

Recently, Quantum Annealing (QA) has attracted attention as an efficient algorithm for combinatorial optimization problems. In QA, the input data can become large, and reducing its size is important for acceleration by hardware emulation because the usable memory size and bandwidth are limited. This paper proposes a compression method for the sparse input matrices of a QA emulator. The proposed method exploits the sparseness of the coefficient matrix and the recurrence of identical values.
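
The two properties named in the abstract above (sparseness and repeated values) can be sketched as follows. This is an illustrative layout only, not the format proposed in the paper: nonzero entries are stored as coordinates, and each value is replaced by an index into a table of distinct values. The names `compress_sparse` and `decompress_sparse` are hypothetical.

```python
def compress_sparse(matrix):
    """Store only nonzero entries, with values replaced by table indices."""
    values = []      # distinct nonzero values, in first-seen order
    index_of = {}    # value -> position in `values`
    entries = []     # (row, column, value_index) for each nonzero entry
    for r, row in enumerate(matrix):
        for c, v in enumerate(row):
            if v == 0:
                continue  # sparseness: zeros are never stored
            if v not in index_of:
                index_of[v] = len(values)
                values.append(v)
            entries.append((r, c, index_of[v]))  # repeated values share one slot
    n_cols = len(matrix[0]) if matrix else 0
    return len(matrix), n_cols, values, entries

def decompress_sparse(n_rows, n_cols, values, entries):
    """Rebuild the dense matrix from the compressed representation."""
    matrix = [[0] * n_cols for _ in range(n_rows)]
    for r, c, k in entries:
        matrix[r][c] = values[k]
    return matrix

# A small coefficient matrix in which every nonzero entry has the same value.
coefficients = [[0, 2, 0], [2, 0, 2], [0, 2, 0]]
packed = compress_sparse(coefficients)
```

When few distinct values occur, each value index needs far fewer bits than the value itself, which is where the saving over a plain coordinate list would come from.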

DCC_detail_slide.pdf (455)

Categories: Emerging: Big Data

25 Views

CSTs for Terabyte-Sized Data

[DCC22]RePFP_Submitted.pptx (562)

Categories: Emerging: Big Data

77 Views

Applying Practical Parallel Grammar Compression to Large-scale Data

Re-pair is a grammar-based compression algorithm. It achieves high compression ratios on text, graphs, and trees. Although Re-pair runs in linear time, in practice it is slower than other general-purpose compression algorithms, which is an obstacle to applying it to large-scale data. In this paper, we present Parallel Re-pair, a practical implementation that enables parallel processing of Re-pair. In Parallel Re-pair, the input is divided into blocks and Re-pair is executed on them in multiple threads. The threads share a dictionary, so the output can be a single context-free grammar (CFG).
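
For readers unfamiliar with Re-pair, the sketch below shows the basic sequential idea: repeatedly replace the most frequent adjacent pair of symbols with a new nonterminal until no pair occurs twice. It is a naive quadratic-time illustration, not the linear-time algorithm and not the parallel implementation described in the abstract. The names `repair` and `expand` and the tuple encoding of nonterminals are assumptions made for this example.

```python
from collections import Counter

def repair(text):
    """Naive Re-pair: returns (final sequence, grammar rules)."""
    seq = list(text)
    rules = {}  # nonterminal -> (left symbol, right symbol)
    while True:
        # Overlapping occurrences (as in "aaa") are counted too, so counts
        # are an upper bound on the number of replacements actually made.
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:
            break
        nonterminal = ("R", len(rules))  # tuples cannot clash with characters
        rules[nonterminal] = pair
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(nonterminal)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return seq, rules

def expand(symbol, rules):
    """Expand a symbol back into the string it derives."""
    if symbol in rules:
        left, right = rules[symbol]
        return expand(left, rules) + expand(right, rules)
    return symbol

sequence, grammar = repair("abababab")
restored = "".join(expand(s, grammar) for s in sequence)
```

On `"abababab"` this produces two rules (one for `ab`, one for the doubled nonterminal) and a final sequence of two symbols.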

Applying Practical Parallel Grammar Compression to Large-scale Data - final.pdf (558)

Categories: Emerging: Big Data

42 Views

Chunk Content is not Enough: Chunk-Context Aware Resemblance Detection for Deduplication Delta Compression

With the growing popularity of cloud storage, identifying and removing duplicate data across users is becoming more critical for service providers. Data resemblance detection has therefore attracted the attention of many researchers as a way to find redundancy among similar data. It first uses feature extraction to detect data chunks with high similarity, and then treats them as candidates for removing redundancy.
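
The content-based step described above (extract features, then compare chunks by their features) can be sketched as below. This is a generic min-hash style illustration and does not implement the chunk-context method the title refers to. The function names, the window size, and the number of features are all assumptions chosen for the example.

```python
import hashlib

def features(chunk: bytes, window: int = 8, k: int = 4):
    """Return the k smallest hashes over all sliding windows of the chunk."""
    hashes = {
        int.from_bytes(
            hashlib.blake2b(chunk[i:i + window], digest_size=8).digest(), "big"
        )
        for i in range(max(1, len(chunk) - window + 1))
    }
    return set(sorted(hashes)[:k])

def resemblance(a: bytes, b: bytes, window: int = 8, k: int = 4) -> float:
    """Jaccard similarity of the two feature sets, in [0, 1]."""
    fa = features(a, window, k)
    fb = features(b, window, k)
    return len(fa & fb) / len(fa | fb)

base = b"The quick brown fox jumps over the lazy dog. " * 4
edited = base.replace(b"lazy", b"idle", 1)
score = resemblance(base, edited)
```

Chunks whose score exceeds some threshold would then be treated as delta-compression candidates; the threshold itself is a tuning choice outside this sketch.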

v8-card-dcc.pdf (509)

Categories: Emerging: Big Data, Other

177 Views

HOLZ: High-Order Entropy Encoding of Lempel-Ziv Factor Distances

We propose a new representation of the offsets of the Lempel-Ziv (LZ) factorization based on the co-lexicographic order of the text's prefixes. The selected offsets tend to approach the k-th order empirical entropy. Our evaluations show that this choice is superior to the rightmost and bit-optimal LZ parsings on datasets with small high-order entropy.
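
As background for the abstract above, the sketch below computes a naive LZ factorization and encodes each factor as a (distance, length) pair, always choosing the rightmost earlier occurrence. This corresponds to the rightmost parsing used as a baseline in the abstract, not to the co-lexicographic representation that HOLZ proposes. The function names are hypothetical and the quadratic search is for clarity only.

```python
def lz_factorize(text):
    """Greedy LZ factorization with rightmost sources.

    Returns a list whose items are either a literal character or a
    (distance, length) pair pointing back into the already parsed text.
    """
    factors, i, n = [], 0, len(text)
    while i < n:
        best_len, best_src = 0, -1
        for j in range(i):  # candidate source positions, left to right
            length = 0
            while i + length < n and text[j + length] == text[i + length]:
                length += 1  # sources may overlap the factor itself
            if length > 0 and length >= best_len:
                best_len, best_src = length, j  # ties go to the larger j
        if best_len == 0:
            factors.append(text[i])
            i += 1
        else:
            factors.append((i - best_src, best_len))
            i += best_len
    return factors

def lz_decode(factors):
    """Invert lz_factorize by copying character by character."""
    out = []
    for f in factors:
        if isinstance(f, str):
            out.append(f)
        else:
            distance, length = f
            for _ in range(length):
                out.append(out[-distance])
    return "".join(out)

parsed = lz_factorize("abababa")  # ['a', 'b', (2, 5)]
```

The distances in such a parsing are the values an offset encoder has to compress; different choices among equally long sources give different distance sequences with different entropy.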

dcc22holz.s.pdf (528)

Categories: Emerging: Big Data

52 Views

Computing Lexicographic Parsings

We give memory-friendly algorithms that compute the compression scheme lexparse in linear time.

lexparse_dcc.pdf (477)

Categories: Emerging: Big Data

79 Views

PHONI: Streamed Matching Statistics with Multi-Genome References

Computing the matching statistics of patterns with respect to a text is a fundamental task in bioinformatics, but a formidable one when the text is a highly compressed genomic database. Bannai et al. gave an efficient solution for this case, which Rossi et al. recently implemented, but it uses two passes over the patterns and buffers a pointer for each character during the first pass. In this paper, we simplify their solution and make it streaming, at the cost of slowing it down slightly.
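
To make the term concrete: the matching statistics of a pattern with respect to a text give, for every pattern position, the length of the longest substring starting there that also occurs somewhere in the text. The sketch below computes them by brute force on plain strings. It is a definition-level illustration only and has nothing in common with the compressed-index, streaming approach the abstract describes; the function name is hypothetical.

```python
def matching_statistics(pattern, text):
    """For each i, length of the longest prefix of pattern[i:] found in text."""
    lengths = []
    for i in range(len(pattern)):
        length = 0
        # Extend the match one character at a time while it still occurs.
        while i + length < len(pattern) and pattern[i:i + length + 1] in text:
            length += 1
        lengths.append(length)
    return lengths

ms = matching_statistics("abcx", "zabcab")  # [3, 2, 1, 0]
```

Here `abc` occurs in the text but `abcx` does not, so the first entry is 3, and the final `x` never occurs, so the last entry is 0.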

dcc21phoni.s.pdf

slides (649)

Categories: Emerging: Big Data

95 Views
