Approximate String Matching in Python

Approximate string matching is a technique that lets one discover similarities between strings. By using these one can correct or emulate different kinds of errors: typing errorrs, speling errors, closely related vocabularies (colour color), genetic mutations (GAG ACT), abbreviations (McScot, MacScot). The Apse.Approx python extension distributed from this page provides a python based API to Jarkko Hietaniemi's string approximation library written in C.

Requirements: python, SWIG and a C compiler.

Similar packages:

  • The standard library includes the difflib.py module that has the get_close_matches() function that can compute similarity scores between strings.
  • Pure python implementations of a good number of string approximation methods can be found in the Febrl system, in the stringcmp.py and encode.py modules.
  • There is python extension based on the agrep C library called AGREPY

When to use it: the most notable difference between Apse.Approx and the packages above is that the matching can be controlled by specifying the maximum number of errors that can be tolerated. Thus one can specify the either the maximum number of overall edits or individually, the number of deletions, substitutions or inserts that may be acceptable within a match. In some circumstances the concept of number of edits in a word can be significantly easier to interpret and work with than a floating point similarity score. More info: README, DOCS

Typical usage:

import Apse

pattern = "ABCDE"
words   = [ "AXCDX", "BAXX", "AXBXDE", "DCBA", "BACDE" ]

# allow at most 3 edits 
ap = Apse.Approx(pattern, edit=3)

#info on the matching parameters
print 
print "Matching Info", ap.info()

print
for word in words:
   dist  = ap.dist(word)
   match = bool(ap.match(word))
   print "Pattern %6s is at distance %d and the match is %s" % ( word, dist, match )
			

will produce the following output:

	      
Matching Info ({'edit': 3}, 'ABCDE')


Pattern  AXCDX is at distance 2 and the match is True
Pattern   BAXX is at distance 4 and the match is False
Pattern AXBXDE is at distance 2 and the match is True
Pattern   DCBA is at distance 4 and the match is False
Pattern  BACDE is at distance 1 and the match is True		
		

Download: source code (18Kb)

License: LGPL

Feedback: ialbert@mailblocks.com

<<< go back to main