CloudClean: An AI-Driven Adaptive Framework for Intelligence Cloud Storage Governance and Near-Duplicate Detection with AWS S3 Integration
Nandhini S, Gokul M, Prithish kumar S, Megavarshini A.J, Mr. A. Arockia Selvaraj
Index Terms—Cloud Storage Governance, Near-Duplicate Detection, MinHash LSH, TF-IDF Semantic Classification, Policy-Driven Framework, AWS S3 Integration, HoloClean Comparison, Data Entropy
Abstract—The exponential growth of cloud-hosted enterprise data presents acute challenges for organisations dependent on scalable and
cost-effective storage infrastructure. Repositories accumulate substantial volumes of redundant, obsolete, and structurally irrelevant
data—a phenomenon termed data entropy—through repeated uploads, uncontrolled versioning, and unmanaged replication. CloudClean
is a comprehensive twelve-module, policy-driven cloud storage optimisation framework integrating: (i) a Policy-Driven Governance
Engine, (ii) pattern matching against 30+ regular expressions, (iii) MinHash LSH near-duplicate detection, (iv) a Governance Decision Engine
(RETAIN/WARN/DELETE), (v) a Semantic Classifier (Module 9) using NLP keyword embeddings and TF-IDF cosine scoring, and (vi)
AWS S3 real-cloud integration (Module 12) supporting LocalStack, real AWS S3 via boto3, and offline demo modes. Validated on a
2,400-file enterprise simulation, CloudClean achieves 97.9% classification accuracy, 98.1% near-duplicate detection precision, 3.2%
Jaccard estimation error, and 41.5% maximum storage recovery, outperforming all seven baseline systems, including HoloClean
(80.0%). The framework scales to 5,000 files, satisfies ISO 27001 audit requirements, and executes real-time governance on live AWS
S3 buckets.
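To make the "Jaccard estimation error" figure concrete, the following is a minimal sketch of how a MinHash signature approximates the Jaccard similarity of two files. It is an illustration under stated assumptions, not the CloudClean implementation: the function names, the character 5-gram shingling, the 128 hash functions, and the use of seeded BLAKE2b hashes are all hypothetical choices, and the LSH banding step that groups candidate pairs is omitted.

```python
import hashlib


def shingles(text, k=5):
    """Return the set of character k-grams of a lowercased, whitespace-normalised text."""
    text = " ".join(text.lower().split())
    if len(text) < k:
        return {text}
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def _hash(seed, shingle):
    """Deterministic 64-bit hash of a shingle under a given seed (assumed hash family)."""
    data = f"{seed}:{shingle}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(shingle_set, num_perm=128):
    """For each seeded hash function, keep the minimum hash value over the set."""
    return [min(_hash(seed, s) for s in shingle_set) for seed in range(num_perm)]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree; an estimate of Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def exact_jaccard(set_a, set_b):
    """Exact Jaccard similarity, used here only to measure the estimation error."""
    return len(set_a & set_b) / len(set_a | set_b)


if __name__ == "__main__":
    doc_a = "Quarterly sales report for the northern region, final version"
    doc_b = "Quarterly sales report for the northern region, final version (copy)"
    sh_a, sh_b = shingles(doc_a), shingles(doc_b)
    est = estimated_jaccard(minhash_signature(sh_a), minhash_signature(sh_b))
    print("estimated:", est, "exact:", exact_jaccard(sh_a, sh_b))
```

The absolute difference between the estimated and exact values is one way to read an estimation error; how the abstract's 3.2% figure was computed is not stated here.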
"CloudClean: An AI-Driven Adaptive Framework for Intelligence Cloud Storage Governance and Near-Duplicate Detection with AWS S3 Integration", JETNR - JOURNAL OF EMERGING TRENDS AND NOVEL RESEARCH (www.JETNR.org), ISSN: 2984-9276, Vol. 4, Issue 4, page no. c51-c54, April-2026, Available: https://rjpn.org/JETNR/papers/JETNR2604268.pdf
Volume 4
Issue 4
April-2026
Pages: c51-c54
Paper Reg. ID: JETNR_233791
Published Paper Id: JETNR2604268
Research Area: Science and Technology
Country: Coimbatore, Tamil Nadu, India
Publisher: RJPN (IJPublication) Janvi Wave