Data deduplication is a crucial step in managing large-scale datasets, ensuring data quality, and optimizing storage. The data deduplication tool is designed to remove redundant data from text datasets to improve model training quality and efficiency.

The pipeline consists of four stages, which the dedup.py script executes in order:

  1. MinHash generation

  2. Duplicate pairs generation

  3. Duplicate graph construction

  4. Generation of the final list of duplicates

If needed, you can run each stage of the pipeline individually. See Manual Instructions for more details.

To optimize the deduplication process, we leverage the datasketch library for efficient similarity detection and incorporate enhancements to address memory usage and runtime performance. Our approach implements a producer-consumer model to parallelize I/O operations, which are the primary bottleneck in runtime. Additionally, memory consumption is minimized by retaining only a single representative document from each group of duplicates in memory at any given time, ensuring efficient resource utilization during processing.
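As a rough sketch of this producer-consumer pattern (not the actual dedup.py implementation), a producer process can read documents from disk and feed them through a queue to several consumer processes; the shard paths, queue size, and worker count below are placeholders.

    import multiprocessing as mp

    def producer(paths, queue, n_consumers):
        # Read documents from disk and push them onto the shared queue.
        for path in paths:
            with open(path, "r", encoding="utf-8") as f:
                for line in f:
                    queue.put(line)
        # Tell each consumer that no more work is coming.
        for _ in range(n_consumers):
            queue.put(None)

    def consumer(queue, results):
        processed = 0
        while True:
            doc = queue.get()
            if doc is None:
                break
            # Placeholder for per-document work, e.g. MinHash computation.
            processed += 1
        results.put(processed)

    if __name__ == "__main__":
        paths = ["shard_00.jsonl", "shard_01.jsonl"]  # hypothetical input shards
        queue, results = mp.Queue(maxsize=10_000), mp.Queue()
        n_consumers = 4
        workers = [mp.Process(target=consumer, args=(queue, results)) for _ in range(n_consumers)]
        for w in workers:
            w.start()
        producer(paths, queue, n_consumers)
        for w in workers:
            w.join()
        print(sum(results.get() for _ in range(n_consumers)), "documents processed")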

Before You Begin

Ensure the Cerebras Model Zoo and its dependencies are installed if you’re running on a Cerebras Wafer-Scale cluster (specifically the prerequisites listed here).

If you’d like to run this locally, do the following:

    virtualenv <env_name>
    source <env_name>/bin/activate
    pip install -r requirements.txt

Run the Script

To run the deduplication pipeline, execute the following command:

python dedup.py --dataset_name <name-of-the-dataset> --input_dir <path-to-input-directory> --jsonl_key <jsonl-key-of-the-dataset> --format <format-of-the-dataset> --output_dir <name-of-the-output-directory>

The process generates an output directory containing compressed .jsonl.zst files. By default, each compressed file is 16 MB in size, and the jsonl_key is text.

(Optional) Manual Instructions

While the dedup.py script runs the deduplication pipeline end-to-end, you can also run each stage individually if needed.

MinHash Generation

MinHash generation can be a very slow process. We recommend running it separately before creating a MinHashLSH index.

To calculate a MinHash object for each document, we strip and lowercase its text and remove punctuation, consecutive spaces, newlines, and tabs. Afterwards, we construct a list of 13-grams that are later used as features to create a document signature, which is added to the MinHashLSH index.

(More details about MinHash can be found at Identifying and Filtering Near-Duplicate Documents.)

We also apply NFC normalization and filter out short documents before yielding them.
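For illustration, here is a minimal sketch of this preprocessing and signature step using the datasketch library; the word-level shingling, num_perm value, and exact normalization order are assumptions for the example rather than the precise settings used by to_hash.py.

    import re
    import string
    import unicodedata

    from datasketch import MinHash

    def normalize(text):
        # NFC normalization, strip, lowercase, drop punctuation, collapse whitespace.
        text = unicodedata.normalize("NFC", text).strip().lower()
        text = text.translate(str.maketrans("", "", string.punctuation))
        return re.sub(r"\s+", " ", text)

    def thirteen_grams(text):
        # Word-level 13-grams used as features for the document signature.
        words = text.split()
        return {" ".join(words[i:i + 13]) for i in range(max(len(words) - 12, 1))}

    def minhash_signature(text, num_perm=128):
        m = MinHash(num_perm=num_perm)
        for gram in thirteen_grams(normalize(text)):
            m.update(gram.encode("utf-8"))
        return m

Signatures produced this way are what gets inserted into the MinHashLSH index in the next stage.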

For custom datasets, you also need to specify the jsonl_key and the format of the dataset. By default, the jsonl_key is text and the format is jsonl.

Here is the format of the command to run MinHash generation:

python to_hash.py --dataset_name <dataset-name> --input_dir <input-dir> --output_dir <output-dir> --job_id <job-id> --jsonl_key <jsonl-key> --format <format-of-the-dataset>

To reduce the total processing time, multiple jobs can be run in parallel for each corpus by using multiple job IDs, starting from 0. By default, the script expects a single job, but you can replicate the script across multiple machines, and it will chunk the list of files to be processed equally across jobs for MinHash generation.

This assumes that the dataset is stored in a common location that is accessible by all machines.
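As a hypothetical illustration, a job could pick its share of the shared file list by striding over it with its job ID; the actual chunking logic in to_hash.py may differ.

    import glob

    # Hypothetical shared dataset location visible to every machine.
    all_files = sorted(glob.glob("/shared/dataset/*.jsonl"))
    job_id, num_jobs = 0, 4  # this machine runs job 0 of 4
    files_for_this_job = all_files[job_id::num_jobs]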

Duplicate Pairs Generation

In this step, we build a MinHashLSH index and query it to locate near duplicates.

(More reading here: Chapter 3, Mining of Massive Datasets.)

By default, we use a Jaccard similarity threshold of 0.8 to determine whether a pair of documents should be considered a duplicate, but you can adjust it to your needs.

python generate_duplicate_pairs.py --input_dir <output-directory-from-previous-step> --out_file <output-directory>/duplicates/duplicate_pairs.txt
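For illustration, the sketch below builds a MinHashLSH index with datasketch and queries it for candidate pairs; the document IDs and signatures are placeholders, num_perm must match the value used during MinHash generation, and the real pair-generation logic lives in generate_duplicate_pairs.py.

    from datasketch import MinHash, MinHashLSH

    # Placeholder signatures; in practice these come from the MinHash generation step.
    signatures = {}
    for doc_id, text in [("doc-0", "an example document"), ("doc-1", "an example document")]:
        m = MinHash(num_perm=128)
        for token in text.split():
            m.update(token.encode("utf-8"))
        signatures[doc_id] = m

    # Index with the default Jaccard similarity threshold of 0.8.
    lsh = MinHashLSH(threshold=0.8, num_perm=128)
    for doc_id, m in signatures.items():
        lsh.insert(doc_id, m)

    duplicate_pairs = set()
    for doc_id, m in signatures.items():
        for candidate in lsh.query(m):
            if candidate != doc_id:
                duplicate_pairs.add(tuple(sorted((doc_id, candidate))))
    print(duplicate_pairs)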

Duplicate Graph Construction & Search for Connected Components

After locating duplicate pairs, we need to find connected components containing documents that are duplicates of each other.

For illustration, consider these pairs: (A, B), (A, C), (A, E).

We are going to form a cluster of (A, B, C, E) and keep only one document from the component.

We evaluated the performance and memory consumption of networkx, graphtool, and networkit. networkit offered the most efficient implementation, as it is designed to work with large graphs and provides strong parallelism.

Below is an example command for constructing a graph from document pairs:

python generate_connected_components.py --input_dir <output-directory-from-previous-step>/duplicates --out_file <output-directory>/duplicates/connected_components.pickle
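For illustration, the sketch below clusters the (A, B), (A, C), (A, E) example above with networkit; generate_connected_components.py operates on document IDs read from the duplicate-pairs file rather than these placeholder names.

    import networkit as nk

    doc_ids = ["A", "B", "C", "D", "E"]
    index = {doc: i for i, doc in enumerate(doc_ids)}
    pairs = [("A", "B"), ("A", "C"), ("A", "E")]

    graph = nk.Graph(len(doc_ids))
    for u, v in pairs:
        graph.addEdge(index[u], index[v])

    components = nk.components.ConnectedComponents(graph)
    components.run()
    for component in components.getComponents():
        docs = [doc_ids[i] for i in component]
        # Keep one representative document per cluster, mark the rest as duplicates.
        print("keep", docs[0], "drop", docs[1:])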

Generate Final List of Duplicates

In this step, we generate the final deduplicated dataset. We dump the original dataset (minus the duplicates) into fixed-size files in the jsonl.zst format; the file size is configurable in deduplicate_dataset.py and defaults to 16 MB.

Below is an example command for generating the final deduplicated dataset:

python deduplicate_dataset.py --input_file <output-directory-from-previous-step>/duplicates/connected_components.pickle --input_dir <input-dir> --output_dir <final-output-dir> --format <format-of-dataset>
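For illustration, here is a minimal sketch of writing documents to fixed-size .jsonl.zst shards with the zstandard package; the shard naming is hypothetical, and the size cap here counts uncompressed bytes for simplicity, whereas deduplicate_dataset.py controls the size of the compressed files.

    import json
    import zstandard as zstd

    def write_shards(documents, output_prefix, max_bytes=16 * 1024 * 1024):
        cctx = zstd.ZstdCompressor()
        shard, written = 0, 0
        fh = open(f"{output_prefix}_{shard}.jsonl.zst", "wb")
        writer = cctx.stream_writer(fh)
        for doc in documents:
            line = (json.dumps({"text": doc}) + "\n").encode("utf-8")
            # Roll over to a new shard once the (uncompressed) size cap is reached.
            if written and written + len(line) > max_bytes:
                writer.close()
                fh.close()
                shard, written = shard + 1, 0
                fh = open(f"{output_prefix}_{shard}.jsonl.zst", "wb")
                writer = cctx.stream_writer(fh)
            writer.write(line)
            written += len(line)
        writer.close()
        fh.close()

    write_shards(["first kept document", "second kept document"], "deduped")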