Open access peer-reviewed chapter

Reducing Database Storage Space by Eliminating Duplicate Records

Written By

Eugene S. Valeriano

Submitted: 30 January 2024 Reviewed: 30 January 2024 Published: 17 July 2024

DOI: 10.5772/intechopen.1004398


Abstract

Reducing the storage space of relational database management systems (DBMSs) such as Microsoft SQL Server (MS SQL) and MySQL is very challenging nowadays. A DBMS is vital to any local area network (LAN)-based application, including Web-based and mobile applications, for storing and managing data. These data leave traces on servers and are retained on storage devices for a very long time. Due to heavy program usage and potential user error, the data eventually need to be examined for abnormalities and integrity. The chapter discusses common duplicate detection algorithm approaches. In addition, the Levenshtein Algorithm was implemented to detect duplicate records in the database, alongside a method for matching records’ multiple fields and Rule-Based Technique approaches. This topic contributes knowledge on data cleansing, reduces the storage space used by applications, and helps to maximize the storage space of data centers.

Keywords

  • application storage
  • data
  • deduplication
  • edit distance
  • knowledge based
  • records

1. Introduction

Today’s data-intensive digital environment presents a significant challenge for data administrators. In order to provide users with helpful services, cloud servers and local servers receive information from various client transactions and store it periodically. In this context, maintaining repositories with “dirty” data, that is, data containing replicas and nonstandardized representations, raises problems that go far beyond purely technical issues such as the data management system’s overall speed or performance; additional technical work, management effort, and even cultural shifts are required to address these issues [1].

Finding duplicated data kept in a database, collection of documents, or dataset is known as duplicate record detection. There are two different approaches to detecting duplicates: block-level duplicate detection, which separates a file into chunks and saves a single copy of each chunk, and file-level duplicate detection, which saves only a single copy of each file [2]. Additionally, there are two primary approaches to the architecture of a duplicate detection solution: the target-based approach, which relies on the data storage device or service to handle duplicate detection while keeping the user unaware of any duplicate detection that takes place, and the source-based approach, in which detection happens at the user side before the data are moved to another location [3]. A number of scholars have previously tackled the challenge of eliminating redundant chunks of data before transmitting them to a distant backup location. Source-side global and local chunk-level deduplication [4, 5, 6] are two common deduplication techniques.

Removing duplicate records begins with detecting and locating them. Locating approximately duplicate records in a database is a vital step in data integration and cleansing. This is especially challenging across the numerous Web databases, particularly when a duplicate entry can be identified using only a portion of the record’s fields. Using exact matching techniques, records that are exactly the same can be detected [7]. The commonly used DISTINCT function can also be applied to find matching data in records, helping to locate redundant data, cleanse the dataset, and reduce the storage space used by an application [8].


2. Common algorithm approaches to find duplicate data in the list

These common approaches play a vital role in finding duplicate records in a database and can help reduce the storage space used by many applications.

2.1 Sorting and comparing

Sorting is an important process that arranges the items in a list in a predetermined order for several purposes. Scientists commonly utilize sorting to convey the results of numerical computations that include estimated accuracy levels. Sorting is also a commonly used technique in string processing, for example, to find the longest common prefix in a set of strings or the longest repeated substring in a given string. Sorting algorithms appear throughout the literature and in applications such as the creation of work schedules, and they can be implemented in many different ways [9].

This approach sorts the list, at the time complexity of the chosen sorting algorithm, and then compares adjacent elements to find duplicates. If a duplicate is found, the algorithm can terminate early; in the worst case, it traverses to the end of the list of records.
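
As a minimal illustration of this approach (an assumed sketch, not part of the chapter’s implementation), the PHP snippet below sorts a list and compares adjacent elements, stopping as soon as a duplicate is found.

<?php
// Sort-and-compare duplicate check: O(n log n) for the sort,
// then a single linear pass over adjacent elements.
function hasDuplicateSorted(array $values): bool
{
    sort($values);                              // PHP's built-in sort
    for ($i = 1; $i < count($values); $i++) {
        if ($values[$i] === $values[$i - 1]) {
            return true;                        // early termination on the first duplicate
        }
    }
    return false;                               // worst case: traversed the whole list
}

var_dump(hasDuplicateSorted([5, 3, 9, 3, 7]));  // bool(true)
var_dump(hasDuplicateSorted([1, 2, 4, 8]));     // bool(false)
?>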

There are many different sorting algorithms, and each of them has a different processing complexity. The different algorithms for sorting are listed in Table 1 along with their Average and Worst Case time complexity [10, 11, 12].

Sort Algorithm | Time Complexity (Average) | Time Complexity (Worst Case)
Selection | O(n²) | O(n²)
Insertion | O(n²) | O(n²)
Merge | O(n log n) | O(n log n)
Quick | O(n log n) | O(n²)
Bubble sort | O(n²) | O(n²)
Comparison | O(n²) | O(n²)

Table 1.

List of the sorting algorithm with time complexity.

2.2 Hashing

The hashing algorithm works by storing entries in a hash table as it traverses the list. For each element, it first checks whether the element already exists in the hash table; if it does, a duplicate has been found and the search can stop. The average-case time complexity is O(n).
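
For illustration, the following PHP sketch (an assumption, not the chapter’s implementation) uses an associative array as a hash table to find the first duplicate in a single average-case O(n) pass.

<?php
// Hash-based duplicate check: average O(n) time, O(n) extra space.
function firstDuplicateHashed(array $values)
{
    $seen = [];                       // associative array acting as a hash table
    foreach ($values as $value) {
        $key = (string) $value;       // normalize the value to a string key
        if (isset($seen[$key])) {
            return $value;            // duplicate found; stop searching
        }
        $seen[$key] = true;
    }
    return null;                      // no duplicates
}

echo firstDuplicateHashed(['apple', 'pear', 'apple']);  // apple
?>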

Applications including backup systems, Web caches, search engine databases, data stream management systems, and even cryptographic contexts use hashing to detect duplicate records in their streaming data [13]. Hashing algorithms are also used in cryptography, and the most commonly used ones are listed in Table 2 together with their usage descriptions [14].

Algorithm | Usage
MD5 | The Message Digest algorithm, currently in its fifth iteration. MD5 produces 128-bit outputs.
SHA-1 | The standard Secure Hash Algorithm; SHA-0 was the first version to be developed. SHA-1 produces 160-bit outputs and is one of the primary algorithms that began to replace MD5 after vulnerabilities were discovered in MD5.
SHA-2 | A family of hashing algorithms comprising SHA-224, SHA-256, SHA-384, and SHA-512.
NTLM | The NT LAN Manager algorithm, used for password hashing during authentication.
LANMAN | The Microsoft LAN Manager hashing algorithm, used by legacy Windows systems to store passwords.

Table 2.

List of the common hashing algorithms.

2.3 Bitwise manipulation

All of the data that a computer handles, such as the words “yes” and “no,” can be represented by bits. A 0 and a 1 can portray a switch that is turned off or on, and a bit can likewise be characterized as true or false, alive or dead, yes or no, and so on [15]. Additionally, you can record the presence of numbers using a bitwise manipulation technique if the range of numbers is restricted. This is particularly effective when working with integers in a narrow range.
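
As a hedged sketch of this idea (assuming non-negative integers within a known, narrow range), the PHP code below marks each value as a bit in an integer bitmask and reports a duplicate when that bit is already set.

<?php
// Bitmask duplicate check for small non-negative integers (here 0..62,
// so the mask fits in a 64-bit PHP integer).
function hasDuplicateBitmask(array $values): bool
{
    $mask = 0;
    foreach ($values as $v) {
        $bit = 1 << $v;            // bit representing this value's presence
        if ($mask & $bit) {
            return true;           // bit already set: the value was seen before
        }
        $mask |= $bit;             // mark the value as seen
    }
    return false;
}

var_dump(hasDuplicateBitmask([3, 7, 1, 7]));  // bool(true)
?>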

2.4 Floyd’s tortoise and hare

Cycle detection is the algorithmic problem in computer science of identifying a cycle in iterated function values [16]. The algorithm can be adapted to find duplicates in an array or list in O(n) time. It is particularly effective when the list contains n + 1 integers drawn from the range 1 to n, so that at least one value must repeat. Floyd’s cycle-finding method is often referred to as the tortoise and the hare algorithm [17]. The algorithm uses a two-pointer approach in which the pointers move through the sequence at different speeds [18].
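
The following PHP sketch (an illustrative assumption, not the chapter’s code) applies the tortoise-and-hare idea to an array of n + 1 integers in the range 1 to n, treating each value as a pointer to the next index.

<?php
// Floyd's tortoise and hare: finds the duplicate in an array of n + 1
// integers drawn from 1..n in O(n) time and O(1) extra space.
function findDuplicateFloyd(array $nums): int
{
    $slow = $nums[0];
    $fast = $nums[0];
    do {                                    // phase 1: detect the cycle
        $slow = $nums[$slow];
        $fast = $nums[$nums[$fast]];
    } while ($slow !== $fast);

    $slow = $nums[0];                       // phase 2: find the cycle entrance
    while ($slow !== $fast) {
        $slow = $nums[$slow];
        $fast = $nums[$fast];
    }
    return $slow;                           // the duplicated value
}

echo findDuplicateFloyd([1, 3, 4, 2, 2]);   // 2
?>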

2.5 Counting sort

A fundamental sorting method in computer science is counting sort, which has O(N + K) time complexity, where N is the number of elements and K is the largest value among those N elements. After counting occurrences, if any count exceeds 1, the corresponding value occurs more than once in the list [19].
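
A hedged PHP sketch of this counting approach (assuming non-negative integers no larger than a known maximum K) is shown below.

<?php
// Counting-based duplicate check: O(N + K) time, where K is the
// largest value among the N non-negative integers.
function duplicatesByCounting(array $values): array
{
    $counts = array_fill(0, max($values) + 1, 0);
    foreach ($values as $v) {
        $counts[$v]++;                     // tally each value
    }
    $duplicates = [];
    foreach ($counts as $value => $count) {
        if ($count > 1) {                  // any count above 1 marks a duplicate
            $duplicates[] = $value;
        }
    }
    return $duplicates;
}

print_r(duplicatesByCounting([4, 2, 7, 2, 9, 7]));  // 2 and 7
?>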

2.6 Binary search trees

Binary search trees (BSTs) are a basic and extensively researched data structure. You can end the search early if inserting a value into the BST reveals that the value is already present. The maximum and expected number of comparisons made during a search are the two complexity metrics typically associated with them [20].
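
To illustrate (a minimal sketch, not taken from the chapter), the PHP code below inserts values into a BST and reports when an insertion encounters a value that is already present.

<?php
// Minimal binary search tree node.
class BstNode
{
    public $value;
    public $left = null;
    public $right = null;
    public function __construct($value)
    {
        $this->value = $value;
    }
}

// Inserts $value into the tree rooted at $root (passed by reference);
// returns true if the value was already present.
function bstInsert(&$root, $value)
{
    if ($root === null) {
        $root = new BstNode($value);
        return false;
    }
    if ($value === $root->value) {
        return true;                       // duplicate found; the search can stop here
    }
    return $value < $root->value
        ? bstInsert($root->left, $value)
        : bstInsert($root->right, $value);
}

$root = null;
foreach ([8, 3, 10, 3] as $v) {
    if (bstInsert($root, $v)) {
        echo "Duplicate found: $v\n";      // prints "Duplicate found: 3"
    }
}
?>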


3. Rule-based techniques

Using rules to determine whether two records are identical or not is one way to identify duplicate records in a database. Rule-based procedures can be thought of as distance-based techniques in which the distance between two records is either 0 or 1 [21]. Additionally, researchers have suggested using a rule-based method to cluster records that refer to the same real-world entity [22, 23].

Nevertheless, if this notion is expanded into rules that encode domain-specific equivalence logic, two people with identical names may, for example, be assumed to be the same individual [24] (a minimal rule-based sketch is given after the list below). The following are additional methods for identifying duplicate records that have multiple fields:

  • Notation Techniques

  • Probabilistic Matching Models

  • Supervised and Semi-supervised Learning Approach

  • Active-Learning-Based Techniques

  • Distance-Based Techniques

  • Unsupervised Learning
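
As a hedged illustration of a rule-based check (the field names, rules, and records are assumptions, not the chapter’s rules), the PHP sketch below declares two person records equivalent only when their normalized names match exactly and their birth years agree.

<?php
// A toy rule-based matcher: two records are the "same" entity (distance 0)
// only if every rule holds; otherwise they are different (distance 1).
// The field names and rules here are illustrative assumptions.
function sameEntity(array $a, array $b): bool
{
    $rules = [
        fn($x, $y) => strcasecmp(trim($x['name']), trim($y['name'])) === 0,
        fn($x, $y) => $x['birth_year'] === $y['birth_year'],
    ];
    foreach ($rules as $rule) {
        if (!$rule($a, $b)) {
            return false;                  // any failed rule means not a duplicate
        }
    }
    return true;
}

$r1 = ['name' => 'Jose Cruz',  'birth_year' => 1990];
$r2 = ['name' => 'jose cruz ', 'birth_year' => 1990];
var_dump(sameEntity($r1, $r2));            // bool(true)
?>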


4. Levenshtein distance algorithm

A frequently occurring cause of discrepancies in the database input is the typographic representation of textual data. Because of this, string comparison algorithms are usually used in duplication detection to handle typographic variances. Levenshtein distance, also known as edit distance in the context of duplicate record identification, is one method used for matching fields with string data under character-based similarity metrics [21].
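
PHP, the language used later in this chapter, provides a built-in levenshtein() function; the short sketch below (illustrative, not the chapter’s code) shows the edit distance between two typographical variants of the same field value.

<?php
// Edit distance between two typographical variants of the same name:
// one substitution in the first name and one in the surname.
$a = 'Jonathan Smith';
$b = 'Jonathon Smyth';

echo levenshtein($a, $b);   // 2
?>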

4.1 Goal

To use the Levenshtein Distance Algorithm for record duplication detection that can reduce database storage space by eliminating duplicate records in the databases.

4.2 Data preparation

The Products (Walmart and Amazon) and Bibliography (Database Systems and Logic Programming (DBLP) and Google Scholar) collections [25] are benchmark datasets for evaluating the algorithm. Each dataset comprises .csv files with a known number of duplicate records. The data were imported into an MS SQL Server database for use and analysis in the experiments conducted.

A variety of studies have evaluated algorithms and techniques for removing redundant data from relational databases [26, 27, 28, 29]. The Products dataset has the fields id, title, category, brand, model no, and price, with a total of 24,628 records, of which 1154 are duplicates. It combines product information from Walmart and Amazon.

The Bibliography dataset has five fields, namely ID, title, authors, venue, and year, with 66,879 records, of which 5347 are identified duplicates. These data come from the combined bibliographic records of the Database Systems and Logic Programming (DBLP) collection and Google Scholar [25]. Table 3 lists the datasets used to assess how well the Levenshtein distance technique identifies duplicate records in SQL databases.

Dataset | Size | Duplicates
Products | 24,628 | 1154
Bibliography | 66,879 | 5347

Table 3.

List of datasets used in the experiment for record duplication detection.

4.3 Implementation design flowchart

Figure 1 illustrates the process of detecting duplicate records in the database. As shown in the flowchart, the Products and Bibliography benchmark datasets were used to evaluate duplicate record detection in the database using the Levenshtein Distance Algorithm.

Figure 1.

Flowchart of the implementation design.

4.4 Levenshtein distance algorithm implementation

The Hypertext Preprocessor (PHP) scripting language is widely used and particularly suitable for website development. PHP was used to implement duplicate record detection in the database using the Levenshtein Distance Algorithm, following the pseudocode in Figure 2.

Figure 2.

The pseudocode for Levenshtein distance algorithm duplicate record detection.

where:

$strlen1 and $strlen2 hold the lengths of the two strings;

$max holds the larger of $strlen1 and $strlen2;

$lev1 holds the edit distance returned by the lev function using the Levenshtein Distance Algorithm; and

$percentage holds the percentage, or threshold similarity, of the two strings in each row of the table.
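
A minimal PHP sketch consistent with the variable descriptions above is given below; it assumes the similarity percentage in Figure 2 is computed as ((max − distance) / max) × 100, which may differ in detail from the original pseudocode.

<?php
// Similarity of two field values as a percentage of the longer string,
// based on the Levenshtein edit distance.
function similarityPercentage(string $s1, string $s2): float
{
    $strlen1 = strlen($s1);                 // length of the first string
    $strlen2 = strlen($s2);                 // length of the second string
    $max     = max($strlen1, $strlen2);     // longer of the two lengths
    if ($max === 0) {
        return 100.0;                       // two empty strings are identical
    }
    $lev1 = levenshtein($s1, $s2);          // edit distance between the strings
    return (($max - $lev1) / $max) * 100;   // similarity in percent
}

// Rows whose similarity meets or exceeds the threshold are flagged as duplicates.
$threshold = 75;
echo similarityPercentage('Samsung Galaxy S4', 'Samsung Galaxy S 4') >= $threshold
    ? "duplicate\n" : "distinct\n";
?>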


5. Calculating database storage space

The method used to calculate storage space follows the formula from the research paper of Ref. [30]. By determining the heap size, the storage space used in the database before and after duplicate removal can be computed and compared.

The data dictionary used to compute the storage space is given in Table 4:

No. | Variables | Description
1. | Num_Rows | The number of rows in a table
2. | Num_Cols | The total number of columns (fixed-length and variable-length) in a table
3. | Fixed_Data_Size | The total byte size of all fixed-length columns
4. | Var_Data_Size | The total byte size of the variable-length data
5. | Num_Var_Cols | The number of variable-length columns
6. | Max_Var_Size | The maximum total byte size of all variable-length columns
7. | Null_Bitmap | The bytes reserved to manage column nullability
8. | Row_Size | The total row size in a table
9. | Rows_Per_Page | The number of rows per page
10. | Num_Pages | The number of pages in each table
11. | Heap size | The volume of space required to store the data in the heap
12. | Num_Duplicate | The total number of detected duplicate records

Table 4.

Data dictionary to calculate a relational database’s storage capacity.

Formula:

Num_Rows = Num_Rows - Num_Duplicate

Null_Bitmap = 2 + ((Num_Cols + 7) / 8)

Var_Data_Size = 2 + (Num_Var_Cols × 2) + Max_Var_Size

Row_Size = Fixed_Data_Size + Var_Data_Size + Null_Bitmap + 4

Rows_Per_Page = 8096 / (Row_Size + 2)

Num_Pages = Num_Rows / Rows_Per_Page

Heap size (bytes) = 8192 × Num_Pages
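
The formula above can be expressed directly in code. The following PHP sketch estimates the heap size before and after removing the detected duplicates; the column-size parameter values shown are placeholders, not the chapter’s actual table definitions or measurements, and the rounding steps follow the usual heap-size estimation convention.

<?php
// Heap size estimate following the formula above (sizes in bytes).
function heapSizeBytes(
    int $numRows, int $numDuplicates, int $numCols,
    int $fixedDataSize, int $numVarCols, int $maxVarSize
): float {
    $numRows     = $numRows - $numDuplicates;
    $nullBitmap  = 2 + (int) (($numCols + 7) / 8);
    $varDataSize = 2 + ($numVarCols * 2) + $maxVarSize;
    $rowSize     = $fixedDataSize + $varDataSize + $nullBitmap + 4;
    $rowsPerPage = (int) (8096 / ($rowSize + 2));
    $numPages    = (int) ceil($numRows / $rowsPerPage);
    return 8192 * $numPages;
}

// Placeholder column sizes for the Products table; real values may differ.
$before = heapSizeBytes(24628, 0,    6, 12, 5, 600);
$after  = heapSizeBytes(24628, 1154, 6, 12, 5, 600);
printf("Reduced by %.2f MB\n", ($before - $after) / (1024 * 1024));
?>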

Table 5 compares the database storage space of the improved and unimproved datasets used in the experiment. Based on the storage space computation provided by Ref. [30], the Products dataset occupies 32.27 MB on disk and was reduced by a total of 8.45 MB at the 75% threshold, while the Bibliography dataset was reduced by a total of 4.88 MB at the 75% threshold.

Dataset | Improved | Unimproved | Reduced | Threshold | Percent
Products | 23.82 MB | 32.27 MB | 8.45 MB | 75% | 26.18%
Bibliography | 82.74 MB | 87.62 MB | 4.88 MB | 75% | 5.57%

Table 5.

An overview of the outcomes of reducing duplicate records in the database to free up storage space.


6. Conclusions

To cope with data duplication in relational databases, duplicate detection usually relies on string comparison techniques. Using the approaches and techniques discussed in this chapter, duplicate record detection with the Levenshtein Distance Algorithm, together with the rule-based approaches, helped reduce the storage space used by database applications.


Acknowledgments

The researcher would like to extend his profound gratitude to Dr. Noel J. Petero, Vice President for Academic Affairs of the Tarlac Agricultural University for his insights and recommendations and for supporting the researcher in all his endeavors.

The researcher would like to express his sincerest appreciation to Dr. Arnold R. Lorenzo and Dr. Ruben A. Parazo of the Tarlac Agricultural University for their invaluable advice and sharing their expertise that led to realization of the research.

To the researcher’s colleagues for the words of encouragement and unwavering support, which in one way or another led to improving the work.

To the researcher’s better half Kathleen B. Valeriano and his sons Gino and Ethan, for the undying love, encouragement, financial and moral support which served as my driving force to finish this endeavor.

Above all, to the Holy and Triune God, for giving enough knowledge, strength, inspiration, love, guidance, and blessings.


Conflict of interest

The author declared no conflict of interest.

References

  1. 1. de Carvalho MG, Laender AHF, Goncalves MA, da Silva AS. A genetic programming approach to record deduplication. IEEE Transactions on Knowledge and Data Engineering. 2012;24(3):399-412. DOI: 10.1109/TKDE.2010.234
  2. 2. Karunakaran D, Rangaswamy R. A Method for Duplicate Record Detection by Exploration and Exploitation of Optimization Algorithm. 2013. [Online]. Available from: http://www.lifesciencesite.com
  3. 3. Harnik D, Pinkas B, Shulman-Peleg A. Side Channels in Cloud Services, the Case of Deduplication in Cloud Storage. 2010. [Online]. Available from: http://www.geek.com/articles/news/dropbox-reaches
  4. 4. Tan Y, Jiang H, Feng D, Tian L, Yan Z, Zhou G. SAM: A semantic-aware multi-tiered source de-duplication framework for cloud backup. In: Proceedings of the International Conference on Parallel Processing. 2010. pp. 614-623. DOI: 10.1109/ICPP.2010.69
  5. 5. Bhagwat D, Eshghi K, Long DDE, Lillibridge M. Extreme Binning: Scalable, Parallel Deduplication for Chunk-Based File Backup. 2009
  6. 6. Li K, editor. Fifth International Conference on Bio-Inspired Computing: Theories and Applications (BIC-TA 2010), September 23-26, 2010, Changsha, China, and September 8-10, 2010, Liverpool, UK. IEEE; 2010
  7. 7. Bharambe D, Jain S, Jain A. International Journal of Emerging Technology and Advanced Engineering A Survey: Detection of Duplicate Record. 2012. [Online]. Available from: www.ijetae.com
  8. 8. Valeriano ES. Improving data integrity within relational database using distinct function. International Journal of Knowledge Engineering. 2019;5(2):76-81. DOI: 10.18178/ijke.2019.5.2.122
  9. 9. Adesina O, Aremu DR, Adesina OO, Makinde OE, Ajibola O, Agbo-Ajala OO. A Comparative Study of Sorting. 2013. [Online]. Available from: https://www.researchgate.net/publication/288825600
  10. 10. Suleiman Al-Kharabsheh K, Mahmoud AlTurani I, AlTurani AMI, Zanoon NI. Review on Sorting Algorithms a Comparative Study. 2013
  11. 11. Elkahlout AH, Maghari AYA. A Comparative Study of Sorting Algorithms Comb, Cocktail and Counting Sorting; 2008. [Online]. Available from: www.irjet.net
  12. 12. Alkharabsheh K et al. Review on Sorting Algorithms a Comparative Study; 2013. [Online]. Available from: https://www.researchgate.net/publication/259911982
  13. 13. Géraud R, Lombard-Platet M, Naccache D. Quotient hash tables efficiently detecting duplicates in streaming data. In: Proceedings of the ACM Symposium on Applied Computing. Association for Computing Machinery; 2019. pp. 582-589. DOI: 10.1145/3297280.3297335
  14. 14. Rountree D. Cryptography. In: Security for Microsoft Windows System Administrators. Elsevier; 2011. pp. 29-69. DOI: 10.1016/B978-1-59749-594-3.00002-8
  15. 15. Nafiah R, Kurniawan W, Setiawan J, Umam K. Bit manipulation: Conditional statement using bit-wise operators with C++. International Journal on Informatics for Development. 2020;9(1):9. DOI: 10.14421/ijid.2020.09102
  16. 16. Rastogi R, Rungta M, Srivastava S, Shankar Yadav M, Rastogi M. A New Approach to Detect Cycle in a Directed Graph Using Linked List. Conference paper; August 2013. [Online]. Available from: https://www.researchgate.net/publication/336825351
  17. 17. Bajramovic A. Floyd’s Tortoise and Hare Algorithm: Finding a Cycle in a Linked List
  18. 18. Rungta S, Srivastava S, Yadav US, Rastogi R. International Journal of Information Technology Bharati Vidyapeeth’s Institute of Computer Applications and Management (BVICAM). 2014
  19. 19. ICCIT 2019: 22nd International Conference on Computer and Information Technology, December 18-20, 2019. Southeast University, Bangladesh; 2019
  20. 20. Gagie T. New ways to construct binary search trees. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). 2003;2906:537-543. DOI: 10.1007/978-3-540-24587-2_55
  21. 21. Elmagarmid AK, Ipeirotis PG, Verykios VS. Duplicate record detection: A survey. IEEE Transactions on Knowledge and Data Engineering. 2007;19(1):1-16
  22. 22. Wang YR, Madnick E. The Inter-Database Instance Identification Problem in Integrating Autonomous Systems
  23. 23. Lim E-P, Srivastava J, Prabhakar S, Richardson J. Entity Identification in Database Integration; 1996
  24. 24. Hernández MA, Stolfo SJ. Real-world data is dirty: Data cleansing and the merge/purge problem. Data Mining and Knowledge Discovery. 1998;2(1):9-37. DOI: 10.1023/A:1009761603038
  25. 25. Naumann F. Hasso Plattner Institut. [Online]. Available from: https://hpi.de/naumann/projects/repeatability/datasets.html [Accessed: January 25, 2024]
  26. 26. Tejada S, Knoblock CA, Minton S. Learning object identification rules for information integration. Information Systems. 2001;26(8):607-633
  27. 27. El-Shishtawy T. A Hybrid Algorithm for Matching Arabic Names. Benha University, Faculty of Computers and Informatics; 2003. pp. 1-15
  28. 28. Rehman M. Duplicate Record Detection for Database Cleansing. 2009. DOI: 10.1109/ICMV.2009.43
  29. 29. Skandar A, Rehman M, Anjum M. An efficient duplication record detection algorithm for data cleansing. International Journal of Computers and Applications. 2015;127(6):28-37. DOI: 10.5120/ijca2015906401
  30. 30. Yue L. The comparison of storage space of the two methods used in SQL server to remove duplicate records. In: Procedia Computer Science. Elsevier B.V.; 2018. pp. 691-698. DOI: 10.1016/j.procs.2018.04.313
