
Chunk Content is not Enough: Chunk-Context Aware Resemblance Detection for Deduplication Delta Compression

Citation Author(s):
Xuming Ye, Xiaoye Xue, Wenlong Tian, Ruixuan Li, Weijun Xiao, Zhiyong Xu, Yaping Wan
Submitted by:
Wenlong Tian
Last updated:
7 March 2022 - 9:22pm
Document Type:
Presentation Slides
Document Year:
2022
Presenters:
Xuming Ye
Paper Code:
117

With the growing popularity of cloud storage, identifying and removing duplicate data across users is becoming more critical for service providers. Resemblance detection has therefore attracted the attention of many researchers as a way to detect redundancy among similar data. It first uses feature extraction to find data chunks with high similarity, and then treats them as candidates for removing redundancy. However, the features in existing resemblance detection methods, such as "N-transform" and "Finesse," depend only on the content of a chunk itself and ignore the fact that similar chunks tend to co-occur, a property called chunk-context. As a result, a minor modification to a chunk can seriously degrade how well it is detected as resembling other chunks. This paper proposes a novel chunk-context aware resemblance detection algorithm, called CARD, to mitigate this issue. CARD introduces a BP neural network-based chunk-context aware model and uses an N-sub-chunk shingles-based initial feature extraction strategy. It integrates the internal structure of each data chunk's content with the context information for feature extraction, so the impact of small changes in data chunks is significantly reduced. To evaluate its performance, we implement a CARD prototype and conduct extensive experiments using real-world data sets. The results show that CARD can detect up to 75.03% more redundant data and accelerate resemblance detection by 5.6 to 17.8 times compared with state-of-the-art resemblance detection approaches.
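The abstract does not spell out how the N-sub-chunk shingle features are computed, so the following is only a minimal Python sketch of the general idea and not the CARD implementation. It assumes a chunk is split into a fixed number of equal-sized sub-chunks, that each shingle is a run of adjacent sub-chunks hashed with BLAKE2b, and that resemblance is scored as the Jaccard similarity of the shingle hashes. The names `sub_chunk_shingles` and `resemblance`, and the parameters `n_sub` and `width`, are hypothetical. The chunk-context model (the BP neural network) is not covered here.

```python
import hashlib


def sub_chunk_shingles(chunk: bytes, n_sub: int = 8, width: int = 2) -> list[int]:
    """Hash every run of `width` adjacent sub-chunks (assumed scheme, not CARD's)."""
    # Ceiling division so that at most n_sub sub-chunks are produced.
    size = max(1, -(-len(chunk) // n_sub))
    subs = [chunk[i:i + size] for i in range(0, len(chunk), size)]
    features = []
    for i in range(max(1, len(subs) - width + 1)):
        shingle = b"".join(subs[i:i + width])
        digest = hashlib.blake2b(shingle, digest_size=8).digest()
        features.append(int.from_bytes(digest, "big"))
    return features


def resemblance(a: list[int], b: list[int]) -> float:
    """Jaccard similarity between two shingle-feature lists."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


# Deterministic 4096-byte chunk with non-repeating content.
base = b"".join(hashlib.sha256(bytes([i])).digest() for i in range(128))

# Flip one byte in the first sub-chunk to simulate a minor modification.
edited = bytearray(base)
edited[0] ^= 0xFF
edited = bytes(edited)

base_features = sub_chunk_shingles(base)
edited_features = sub_chunk_shingles(edited)
print(resemblance(base_features, edited_features))
```

In this sketch a one-byte edit changes only the shingles that cover the affected sub-chunk, so most features still match. A feature computed over the whole chunk content would change entirely, which is the sensitivity to small modifications that the abstract describes.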
