Paper Title

CloudClean: An AI-Driven Adaptive Framework for Intelligence Cloud Storage Governance and Near-Duplicate Detection with AWS S3 Integration

Authors

Nandhini S, Gokul M, Prithish kumar S, Megavarshini A.J, Mr. A. Arockia Selvaraj

Keywords

Cloud Storage Governance, Near-Duplicate Detection, MinHash LSH, TF-IDF Semantic Classification, Policy-Driven Framework, AWS S3 Integration, HoloClean Comparison, Data Entropy

Abstract

The exponential growth of cloud-hosted enterprise data presents acute challenges for organisations that depend on scalable, cost-effective storage infrastructure. Through repeated uploads, uncontrolled versioning, and unmanaged replication, repositories accumulate substantial volumes of redundant, obsolete, and structurally irrelevant data, a phenomenon termed data entropy. CloudClean is a comprehensive twelve-module, policy-driven cloud storage optimisation framework whose components include: (i) a Policy-Driven Governance Engine, (ii) pattern matching with 30+ regular expressions, (iii) MinHash LSH near-duplicate detection, (iv) a Governance Decision Engine (RETAIN/WARN/DELETE), (v) the Module 9 Semantic Classifier, which uses NLP keyword embeddings and TF-IDF cosine scoring, and (vi) the Module 12 AWS S3 real-cloud integration, which supports LocalStack, real AWS S3 via boto3, and offline demo modes. Validated on a 2,400-file enterprise simulation, CloudClean achieves 97.9% classification accuracy, 98.1% near-duplicate detection precision, 3.2% Jaccard estimation error, and 41.5% maximum storage recovery, outperforming all seven baseline systems, including HoloClean (80.0%). The framework scales to 5,000 files, satisfies ISO 27001 audit requirements, and executes real-time governance on live AWS S3 buckets.
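
To make the "Jaccard estimation error" figure in the abstract concrete, the following is a minimal sketch of how MinHash signatures approximate Jaccard similarity and how LSH banding groups likely near-duplicates. It is not the authors' implementation: the function names, the character-shingle size `k=5`, the 128 hash functions, the 32 bands, and the use of seeded BLAKE2b hashes are all illustrative assumptions.

```python
import hashlib


def shingles(text, k=5):
    """Return the set of character k-shingles of a whitespace-normalised text."""
    text = " ".join(text.lower().split())
    if len(text) < k:
        return {text}
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def _seeded_hash(seed, token):
    """64-bit hash of a token under a given seed (stands in for one permutation)."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(items, num_perm=128):
    """MinHash signature: the minimum hash of the set under each seeded hash."""
    return [min(_seeded_hash(seed, it) for it in items) for seed in range(num_perm)]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree; this estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def exact_jaccard(set_a, set_b):
    """Exact Jaccard similarity, used here only to measure the estimation error."""
    return len(set_a & set_b) / len(set_a | set_b)


def lsh_band_keys(signature, bands=32):
    """Split a signature into bands; files sharing any band key become candidates."""
    rows = len(signature) // bands
    return {(b, tuple(signature[b * rows:(b + 1) * rows])) for b in range(bands)}


if __name__ == "__main__":
    doc_a = "Quarterly sales report for the northern region, final version."
    doc_b = "Quarterly sales report for the northern region, final version 2."
    set_a, set_b = shingles(doc_a), shingles(doc_b)
    sig_a, sig_b = minhash_signature(set_a), minhash_signature(set_b)
    print("exact    :", exact_jaccard(set_a, set_b))
    print("estimated:", estimated_jaccard(sig_a, sig_b))
    print("candidate pair:", bool(lsh_band_keys(sig_a) & lsh_band_keys(sig_b)))
```

The absolute difference between the exact and estimated values is one way to read an estimation error; how the reported 3.2% was measured is not stated in the abstract.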

How To Cite

"CloudClean: An AI-Driven Adaptive Framework for Intelligence Cloud Storage Governance and Near-Duplicate Detection with AWS S3 Integration", JETNR - JOURNAL OF EMERGING TRENDS AND NOVEL RESEARCH (www.JETNR.org), ISSN: 2984-9276, Vol. 4, Issue 4, pp. c51-c54, April 2026. Available: https://rjpn.org/JETNR/papers/JETNR2604268.pdf

Issue

Volume 4 Issue 4, April-2026

Pages : c51-c54

Other Publication Details

Paper Reg. ID: JETNR_233791

Published Paper Id: JETNR2604268

Downloads: 00022

Research Area: Science and Technology

Country: Coimbatore, Tamil Nadu, India

Published Paper PDF: https://rjpn.org/JETNR/papers/JETNR2604268

Published Paper URL: https://rjpn.org/JETNR/viewpaperforall?paper=JETNR2604268

About Publisher

ISSN: 2984-9276 | IMPACT FACTOR: 9.87 Calculated By Google Scholar | ESTD YEAR: 2023

An international, scholarly, open-access, peer-reviewed, refereed journal. Impact Factor 9.87, calculated by Google Scholar and Semantic Scholar. AI-powered research tool; multidisciplinary, monthly, multilanguage journal; indexed in all major databases and metadata services; citation generator.

Publisher: RJPN (IJPublication) Janvi Wave

Article Preview