Overview of the is unique rule
The is unique rule identifies duplicate rows in a table and marks them as data quality issues. It can be applied to one or multiple columns and supports both exact and fuzzy duplicate detection.

How matching works
Exact matching (100% sensitivity)
When the sensitivity slider is set to 100%, the rule looks for exact duplicates:
Rows are marked as duplicate issues only if these combined values are identical
Results are fully deterministic and repeatable across runs
This mode is recommended for IDs, keys, or standardized fields where values should match exactly.
Fuzzy matching (60–99% sensitivity)
The rule uses a similarity threshold parameter to detect near-duplicates based on semantic similarity.
Selected columns are combined into a single string per row
Instead of exact equality, cosine similarity between vector embeddings is evaluated
Small differences such as typos, abbreviations, or formatting variations can be matched
The threshold determines how similar items must be to be considered duplicates (e.g., threshold=0.90 requires 90% similarity)
Explanation of the approximate neighbors algorithm for fuzzy matching
To efficiently detect similar values at scale, the rule uses an approximate nearest neighbors (ANN) algorithm. Instead of comparing every row with every other row, the algorithm:
Creates compact vector embeddings of the combined strings
Quickly narrows down likely duplicate candidates using ANN indexing
Compares only those candidates in detail
Practical guidance
Use 100% sensitivity for strict uniqueness requirements and stable results
Use fuzzy matching when duplicates may differ slightly due to human input or inconsistent formatting
If exact reproducibility is critical, prefer exact matching or higher sensitivity values
Notes
At 100% sensitivity, duplicate detection is fully deterministic and produces stable, reproducible results
Fuzzy matching (< 100%) uses an approximate nearest neighbors approach and may lead to slightly different issue counts on large tables
Use exact matching for IDs and keys, and fuzzy matching for human-maintained data such as names or addresses