Chunk Content is not Enough: Chunk-Context Aware Resemblance Detection for Deduplication Delta Compression
- Submitted by:
- Wenlong Tian
- Last updated:
- 7 March 2022 - 9:22pm
- Document Type:
- Presentation Slides
- Document Year:
- 2022
- Presenters:
- Xuming Ye
- Paper Code:
- 117
With the growing popularity of cloud storage, identifying and removing duplicate data across users is becoming more critical for service providers. Resemblance detection has therefore attracted attention from many researchers as a way to find redundancy among similar data. It uses feature extraction to first detect data chunks with high similarity and then treats them as candidates for redundancy removal. However, the features in existing resemblance detection methods, such as "N-transform" and "Finesse," depend only on the content of a chunk itself and ignore the fact that similar chunks tend to co-occur, a property called chunk-context. As a result, a minor modification to a chunk can seriously degrade the ability of these methods to detect its resemblance to other chunks. This paper proposes a novel chunk-context aware resemblance detection algorithm, called CARD, to mitigate this issue. CARD introduces a BP neural network-based chunk-context aware model and uses an N-sub-chunk shingles-based initial feature extraction strategy. By integrating the internal structure of each data chunk's content with its context information during feature extraction, CARD significantly reduces the impact of small changes in data chunks. To evaluate its performance, we implement a CARD prototype and conduct extensive experiments using real-world data sets. The results show that CARD can detect up to 75.03% more redundant data and accelerate resemblance detection operations by 5.6 to 17.8 times compared with state-of-the-art resemblance detection approaches.
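To make the idea of an initial feature built from sub-chunk shingles more concrete, here is a minimal Python sketch. It is not the CARD implementation: the abstract does not give the algorithm's details, so the function names (`sub_chunk_shingle_features`, `cosine`), the parameters (`n_sub`, `shingle`, `dim`), the fixed-size sub-chunk split, and the SHA-1 bucket hashing are all assumptions chosen for illustration. The BP neural network that models chunk-context is omitted entirely.

```python
import hashlib
import math


def sub_chunk_shingle_features(chunk: bytes, n_sub: int = 8,
                               shingle: int = 2, dim: int = 16) -> list:
    """Hypothetical initial feature vector built from shingles of sub-chunks.

    Assumption: the chunk is cut into about `n_sub` equal-size sub-chunks and
    every run of `shingle` consecutive sub-chunks is hashed into one of `dim`
    buckets. This is an illustrative stand-in, not the paper's method.
    """
    if not chunk:
        return [0.0] * dim
    size = -(-len(chunk) // n_sub)  # ceiling division: bytes per sub-chunk
    subs = [chunk[i:i + size] for i in range(0, len(chunk), size)]
    vec = [0.0] * dim
    # Slide a window of `shingle` consecutive sub-chunks over the chunk.
    for i in range(len(subs) - shingle + 1):
        window = b"".join(subs[i:i + shingle])
        digest = hashlib.sha1(window).digest()
        vec[int.from_bytes(digest[:4], "big") % dim] += 1.0
    total = sum(vec)
    # Normalize so the vector does not depend on the number of shingles.
    return [v / total for v in vec] if total else vec


def cosine(a: list, b: list) -> float:
    """Cosine similarity between two feature vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


# Compare a chunk with a copy whose first byte has been overwritten.
original = bytes(range(256)) * 4
modified = b"\xff" + original[1:]
print(cosine(sub_chunk_shingle_features(original),
             sub_chunk_shingle_features(modified)))
```

In this sketch a one-byte overwrite touches only the shingles that contain the affected sub-chunk, so most of the feature vector is unchanged. Because the sub-chunks are fixed-size, an insertion or deletion would shift every later boundary; a practical design would need to handle that case, and the context-aware model described in the abstract would operate on top of such initial features.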