You have a bunch of strings presented on the screen as options, and you have a separate input string. You would like to find which options best match the input. There are plenty of algorithms that can score each match for you, such as Levenshtein ratio, Jaccard distance, Ratcliff/Obershelp, … But once you have a score for each option, how do you actually select the result or results? Do you just pick the highest one? What if that’s not the right thing to do because none of the scores is convincing? Say you had the following scores:
A - 30%
B - 35%
C - 10%
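Do you think B is the correct answer? Maybe you should ask for more information instead. But what is the right threshold? Is 64% a good match? These are some of the questions I have been thinking about lately. I think I found a reasonably workable approach that gives you two knobs to play with: a two-pass normalizing filter. The idea is that the first pass keeps recall high and the second pass increases precision. The steps are simple:
1. First pass: keep every candidate whose score is at or above the low threshold.
2. Normalize: divide each remaining score by the highest remaining score, so the best candidate becomes 1.0.
3. Second pass: keep every candidate whose normalized score is at or above the high threshold.
4. If exactly one candidate is left, choose it; if several are left, disambiguate further between them.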
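A minimal Python sketch of the two-pass filter may make the idea concrete. The function name `two_pass_filter` and its parameter names are hypothetical, and the normalization is assumed to be a simple division by the best surviving score, which is what the worked examples suggest.

```python
from typing import Dict, List


def two_pass_filter(
    scores: Dict[str, float], low: float = 0.3, high: float = 0.8
) -> List[str]:
    """Return the candidates that survive both passes (illustrative helper)."""
    # Pass 1 (recall): keep every candidate at or above the low threshold.
    survivors = {item: s for item, s in scores.items() if s >= low}
    if not survivors:
        return []

    # Normalization (assumed): rescale so the best surviving score becomes 1.0.
    best = max(survivors.values())
    if best <= 0:
        return []  # avoids dividing by zero when low is set to 0 or below
    normalized = {item: s / best for item, s in survivors.items()}

    # Pass 2 (precision): keep only candidates close to the best one.
    return [item for item, s in normalized.items() if s >= high]
```

A caller would treat a one-element result as the chosen item, a longer result as a set to disambiguate, and an empty result as no match.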
Here are two example scenarios and the output each produces.
Low: 0.3+ / High: 0.8+
“1 high confidence, 2 mid, 1 low”
ITEM  CONFIDENCE
A     0.8
B     0.3
C     0.3
D     0.1
Result of 1st pass with threshold: 0.3+ is [A,B,C]
Normalization step
ITEM  CONFIDENCE
A     1.0
B     0.375
C     0.375
Result of 2nd pass with threshold: 0.8+ is [A]
Chosen item is A
“2 mid confidence, 2 low”
ITEM  CONFIDENCE
A     0
B     0.3
C     0.3
D     0.1
Result of 1st pass with threshold 0.3+ is [B,C]
After normalization
ITEM  CONFIDENCE
B     1
C     1
Result of 2nd pass with threshold: 0.8+ is [B,C]
Disambiguate further between [B, C]
Threshold: 0.8+
“1 high confidence, 2 mid, 1 low”
ITEM  CONFIDENCE
A     0.8
B     0.3
C     0.3
D     0.1
Result of 1st pass with threshold: 0.5+ is [A]
Chosen item is A
“2 mid confidence, 2 low”
ITEM  CONFIDENCE
A     0
B     0.3
C     0.3
D     0.1
Result of 1st pass with threshold 0.5+ is []
Nothing is selected.
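For comparison, the single-threshold baseline can be sketched in a few lines of Python. The name `single_pass_filter` is hypothetical, and the default of 0.5 is taken from the result lines above; on these two inputs a threshold of 0.8 gives the same output.

```python
from typing import Dict, List


def single_pass_filter(scores: Dict[str, float], threshold: float = 0.5) -> List[str]:
    """Keep every candidate at or above one fixed threshold (illustrative baseline)."""
    # No normalization: each score is judged in isolation, not relative to the best.
    return [item for item, s in scores.items() if s >= threshold]
```

Because each score is judged on its own, the second scenario returns nothing, even though B and C are clearly the leading candidates.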
First published on 2023-12-06