Learn how to set up and run a deduplication pipeline on the Cerebras platform.
The `dedup.py` script executes the entire deduplication pipeline and writes the deduplicated output as compressed `.jsonl.zst` files. By default, each compressed file is 16 MB in size, and the `jsonl_key` is `text`.
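For quick inspection of the output, here is a minimal sketch that reads one of these compressed files with the `zstandard` package; the shard path is a placeholder, not a path the pipeline necessarily produces:

```python
import io
import json

import zstandard as zstd  # third-party package: pip install zstandard

# Placeholder path to one of the compressed shards produced by the pipeline.
shard_path = "output/example_shard_0.jsonl.zst"

with open(shard_path, "rb") as fh:
    # Stream-decompress the shard and read it line by line as JSONL.
    reader = zstd.ZstdDecompressor().stream_reader(fh)
    for line in io.TextIOWrapper(reader, encoding="utf-8"):
        record = json.loads(line)
        # Documents live under the default jsonl_key, "text".
        print(record["text"][:80])
```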
While the `dedup.py` script runs the deduplication pipeline end-to-end, you can also run each stage individually if needed.
You can specify the `jsonl_key` as well as the format of the dataset. By default, the `jsonl_key` is set to `text` and the format to `jsonl`.
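For illustration, this is how a document would be looked up under the default `jsonl_key`; the record below is made up:

```python
import json

# One record in the default `jsonl` format; the document text lives under
# the key named by `jsonl_key` ("text" unless you override it).
line = '{"text": "An example document.", "meta": {"source": "example"}}'

jsonl_key = "text"  # change this if your dataset stores documents under another key
document = json.loads(line)[jsonl_key]
print(document)
```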
We use `networkit`, which offered the most efficient implementation, as it is designed to work with large graphs and provides a high degree of parallelism.
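As an illustration (separate from the pipeline's own command shown below), the following sketch builds a `networkit` graph from duplicate document pairs and extracts the connected components; the pair list and document count are made-up inputs:

```python
import networkit as nk  # third-party package: pip install networkit

# Made-up inputs: duplicate pairs of document ids and the total document count.
# In the real pipeline, these come from the earlier duplicate-pair generation stage.
duplicate_pairs = [(0, 3), (3, 7), (2, 5)]
num_docs = 10

# One node per document, one edge per duplicate pair.
graph = nk.Graph(num_docs)
for u, v in duplicate_pairs:
    graph.addEdge(u, v)

# Each connected component is a cluster of near-duplicate documents;
# keeping a single document per component removes the rest as duplicates.
cc = nk.components.ConnectedComponents(graph)
cc.run()
print(cc.getComponents())  # e.g. [[0, 3, 7], [2, 5], [1], [4], ...]
```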
Below you can find an example command on how to construct a graph from document pairs:
The final deduplicated dataset is written out in the `jsonl.zst` format, whose file size is configurable in the `deduplicate_dataset.py` file. By default, we create files of 16 MB.
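To illustrate this output format (this is not the pipeline's implementation), the sketch below writes JSONL records into roughly 16 MB `jsonl.zst` shards with the `zstandard` package; the file names and the record generator are assumptions, and the shard size is measured on the uncompressed text for simplicity:

```python
import json

import zstandard as zstd  # third-party package: pip install zstandard

# Made-up deduplicated records; the real pipeline streams these from the
# deduplication stage instead.
documents = ({"text": f"document number {i}"} for i in range(100_000))

# Target shard size. Note: the pipeline's 16 MB default refers to the
# compressed files; this sketch cuts shards by uncompressed size instead.
max_shard_bytes = 16 * 1024 * 1024
compressor = zstd.ZstdCompressor()


def flush(buffer, shard_idx):
    # Compress the buffered JSONL lines and write them as one shard.
    data = "".join(buffer).encode("utf-8")
    with open(f"example_shard_{shard_idx}.jsonl.zst", "wb") as fh:
        fh.write(compressor.compress(data))


shard_idx, buffer, buffered_bytes = 0, [], 0
for doc in documents:
    line = json.dumps(doc) + "\n"
    buffer.append(line)
    buffered_bytes += len(line)
    if buffered_bytes >= max_shard_bytes:
        flush(buffer, shard_idx)
        shard_idx, buffer, buffered_bytes = shard_idx + 1, [], 0

if buffer:
    flush(buffer, shard_idx)
```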
Below you can find an example command on how to generate the final deduplicated dataset: