Open
Description
I see that the algorithm is based on the MMDS book (Mining of Massive Datasets) by Ullman et al. However, your implementation seems to use a fixed THRESHOLD value of 0.5, whereas the book describes the threshold as a value you choose to define when two documents should be regarded as a "similar pair". From section 3.4.3:
Choose a threshold t that defines how similar documents have to be in order for them to be regarded as a desired “similar pair.” Pick a number of bands b and a number of rows r such that br = n, and the threshold t is approximately (1/b)^(1/r). If avoidance of false negatives is important, you may wish to select b and r to produce a threshold lower than t; if speed is important and you wish to limit false positives, select b and r to produce a higher threshold.
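To make the quoted relationship concrete, here is a minimal sketch of how b and r could be derived from a chosen threshold instead of hard-coding one. The function names `approx_threshold` and `choose_bands` are hypothetical and not part of this repository, and the signature length n = 100 in the usage note is only an assumed example value.

```python
from typing import Tuple


def approx_threshold(b: int, r: int) -> float:
    """Approximate similarity threshold for b bands of r rows: (1/b)^(1/r)."""
    return (1.0 / b) ** (1.0 / r)


def choose_bands(n: int, t: float) -> Tuple[int, int]:
    """Pick (b, r) with b * r == n whose approximate threshold is closest to t.

    Hypothetical helper: it only considers exact factorizations of n and
    does not bias the choice toward fewer false negatives or false positives.
    """
    best_gap = None
    best_pair = (1, n)
    for b in range(1, n + 1):
        if n % b != 0:
            continue  # b must divide n so that b * r == n
        r = n // b
        gap = abs(approx_threshold(b, r) - t)
        if best_gap is None or gap < best_gap:
            best_gap = gap
            best_pair = (b, r)
    return best_pair
```

For example, with an assumed signature length of n = 100 and a desired threshold of 0.5, `choose_bands(100, 0.5)` picks 20 bands of 5 rows, whose approximate threshold is about 0.55. To favor fewer false negatives, as the book suggests, one could instead take the closest pair whose threshold falls below t.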