
Chunk Content is not Enough: Chunk-Context Aware Resemblance Detection for Deduplication Delta Compression

Citation Author(s):
Xuming Ye, Xiaoye Xue, Wenlong Tian, Ruixuan Li, Weijun Xiao, Zhiyong Xu, Yaping Wan
Submitted by:
Wenlong Tian
Last updated:
7 March 2022 - 9:22pm
Document Type:
Presentation Slides
Document Year:
2022
Presenters:
Xuming Ye
Paper Code:
117

With the growing popularity of cloud storage, identifying and removing duplicate data across users is becoming more critical for service providers. Resemblance detection, which finds redundancy among similar data, has therefore attracted considerable research attention. It first uses feature extraction to detect data chunks with high similarity, and then treats them as candidates for redundancy removal. However, the features in existing resemblance detection methods, such as "N-transform" and "Finesse," depend only on the content of the chunk itself and ignore the fact that similar chunks tend to co-occur, a property called chunk-context. As a result, a minor modification to a chunk can seriously degrade how well it can be matched by resemblance detection. This paper proposes a novel chunk-context aware resemblance detection algorithm, called CARD, to mitigate this issue. CARD introduces a BP neural network-based chunk-context aware model and uses an N-sub-chunk shingles-based initial feature extraction strategy. It integrates the internal structure of each data chunk's content with the context information for feature extraction, so that the impact of small changes in data chunks is significantly reduced. To evaluate its performance, we implement a CARD prototype and conduct extensive experiments using real-world data sets. The results show that CARD can detect up to 75.03% more redundant data and accelerate resemblance detection by 5.6 to 17.8 times compared with state-of-the-art resemblance detection approaches.
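
The abstract does not spell out how the N-sub-chunk shingles are computed, so the following is only a minimal sketch of the general idea and not the CARD implementation. It assumes fixed-size sub-chunks, BLAKE2 hashing, shingles of k consecutive sub-chunk hashes, and a hashed count vector of size dim as the initial feature; the function names and all parameter values are hypothetical. The BP neural network context model is not sketched here.

```python
import hashlib
import math
import random


def sub_chunk_hashes(chunk: bytes, n: int = 16) -> list:
    """Split a chunk into about n fixed-size sub-chunks and hash each one.

    Fixed-size splitting is an assumption made for this sketch.
    """
    size = max(1, math.ceil(len(chunk) / n))
    parts = [chunk[i:i + size] for i in range(0, len(chunk), size)]
    return [
        int.from_bytes(hashlib.blake2b(p, digest_size=8).digest(), "big")
        for p in parts
    ]


def shingle_features(chunk: bytes, n: int = 16, k: int = 2, dim: int = 64) -> list:
    """Map every run of k consecutive sub-chunk hashes (a shingle) to one of
    dim slots and count the hits, giving a fixed-length feature vector."""
    hashes = sub_chunk_hashes(chunk, n)
    vec = [0.0] * dim
    for i in range(len(hashes) - k + 1):
        shingle = b"".join(h.to_bytes(8, "big") for h in hashes[i:i + k])
        digest = hashlib.blake2b(shingle, digest_size=8).digest()
        vec[int.from_bytes(digest, "big") % dim] += 1.0
    return vec


def cosine(a: list, b: list) -> float:
    """Cosine similarity of two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


if __name__ == "__main__":
    rng = random.Random(0)
    original = bytes(rng.getrandbits(8) for _ in range(8192))
    modified = bytearray(original)
    modified[4000] ^= 0xFF  # a one-byte, in-place modification
    similarity = cosine(shingle_features(original),
                        shingle_features(bytes(modified)))
    print(f"similarity after a one-byte edit: {similarity:.3f}")
```

In this sketch a one-byte in-place edit changes a single sub-chunk hash and therefore at most k shingles, so most of the feature vector is preserved. Fixed-size sub-chunks would not tolerate insertions or deletions that shift the following bytes; that limitation belongs to this simplification and says nothing about CARD itself.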
