Approximate String Matching in Python |
|
Approximate string matching is a technique that lets one discover similarities between strings. By using these one can correct or emulate different kinds of errors: typing errorrs, speling errors, closely related vocabularies (colour color), genetic mutations (GAG ACT), abbreviations (McScot, MacScot). The Apse.Approx python extension distributed from this page provides a python based API to Jarkko Hietaniemi's string approximation library written in C. Requirements: python, SWIG and a C compiler. Similar packages:
When to use it: the most notable difference between Apse.Approx and the packages above is that the matching can be controlled by specifying the maximum number of errors that can be tolerated. Thus one can specify the either the maximum number of overall edits or individually, the number of deletions, substitutions or inserts that may be acceptable within a match. In some circumstances the concept of number of edits in a word can be significantly easier to interpret and work with than a floating point similarity score. More info: README, DOCS Typical usage: import Apse pattern = "ABCDE" words = [ "AXCDX", "BAXX", "AXBXDE", "DCBA", "BACDE" ] # allow at most 3 edits ap = Apse.Approx(pattern, edit=3) #info on the matching parameters print print "Matching Info", ap.info() print for word in words: dist = ap.dist(word) match = bool(ap.match(word)) print "Pattern %6s is at distance %d and the match is %s" % ( word, dist, match ) will produce the following output:
Matching Info ({'edit': 3}, 'ABCDE')
Pattern AXCDX is at distance 2 and the match is True
Pattern BAXX is at distance 4 and the match is False
Pattern AXBXDE is at distance 2 and the match is True
Pattern DCBA is at distance 4 and the match is False
Pattern BACDE is at distance 1 and the match is True
Download: source code (18Kb) License: LGPL Feedback: ialbert@mailblocks.com |