http://www.zorba-xquery.com/modules/data-cleaning/character-based-string-similarity
Description
Before using any of the functions below please remember to import the module namespace:
import module namespace simc = "http://www.zorba-xquery.com/modules/data-cleaning/character-based-string-similarity";
This library module provides character-based string similarity functions that view strings as sequences of characters, generally computing a similarity score that corresponds to the cost of transforming one string into another. These functions are particularly useful for matching near-duplicate strings in the presence of typographical errors. The logic contained in this module is not specific to any particular XQuery implementation.
Author
Bruno Martins and Diogo Simões
XQuery version and encoding
xquery version "1.0" encoding "utf-8";
Namespaces
| simc | http://www.zorba-xquery.com/modules/data-cleaning/character-based-string-similarity |
| ver | http://www.zorba-xquery.com/options/versioning |
Function Summary
| edit-distance($s1 as xs:string, $s2 as xs:string) as xs:integer | Returns the edit distance between two strings. |
| jaro-winkler($s1 as xs:string, $s2 as xs:string, $prefix as xs:integer, $fact as xs:double) as xs:double | Returns the Jaro-Winkler similarity coefficient between two strings. |
| jaro($s1 as xs:string, $s2 as xs:string) as xs:double | Returns the Jaro similarity coefficient between two strings. |
| needleman-wunsch($s1 as xs:string, $s2 as xs:string, $score as xs:integer, $penalty as xs:integer) as xs:double | Returns the Needleman-Wunsch distance between two strings. |
| smith-waterman($s1 as xs:string, $s2 as xs:string, $score as xs:integer, $penalty as xs:integer) as xs:double | Returns the Smith-Waterman distance between two strings. |
Functions
edit-distance#2
declare function simc:edit-distance(
$s1 as xs:string,
$s2 as xs:string
) as xs:integer
Returns the edit distance between two strings.
This distance, also referred to as the Levenshtein distance, is defined as the minimum number
of edits needed to transform one string into the other, with the allowable edit operations
being insertion, deletion, or substitution of a single character.
Example usage : simc:edit-distance("FLWOR", "FLOWER")
The function invocation in the example above returns : 2
Parameters
| $s1 as xs:string | The first string. |
| $s2 as xs:string | The second string. |
Returns
| xs:integer | The edit distance between the two strings. |
Examples
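The following is a minimal sketch of how the function might be used to find near-duplicate strings, assuming the module is available to the XQuery processor. The candidate list, the threshold of 2, and the `match` element name are hypothetical values chosen for illustration; only `simc:edit-distance` comes from this module.

```xquery
xquery version "1.0";
import module namespace simc = "http://www.zorba-xquery.com/modules/data-cleaning/character-based-string-similarity";

(: Hypothetical inputs: a reference string and some candidate spellings. :)
let $reference := "FLOWER"
let $candidates := ("FLWOR", "FLOWERS", "XQUERY")
(: Keep the candidates that are at most 2 single-character edits away. :)
for $c in $candidates
let $d := simc:edit-distance($c, $reference)
where $d le 2
order by $d
return <match distance="{$d}">{$c}</match>
```

Given the documented result of 2 for "FLWOR" and "FLOWER", that pair passes the filter. Raising or lowering the threshold trades recall against precision.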
jaro-winkler#4
declare function simc:jaro-winkler(
$s1 as xs:string,
$s2 as xs:string,
$prefix as xs:integer,
$fact as xs:double
) as xs:double
Returns the Jaro-Winkler similarity coefficient between two strings.
This similarity coefficient corresponds to an extension of the Jaro similarity coefficient that weights or
penalizes strings based on their similarity at the beginning of the string, up to a given prefix size.
Example usage : simc:jaro-winkler("DWAYNE", "DUANE", 4, 0.1)
The function invocation in the example above returns : 0.8577777777777778
Parameters
| $s1 as xs:string | The first string. |
| $s2 as xs:string | The second string. |
| $prefix as xs:integer | The number of characters to consider when testing for equal prefixes in the strings. |
| $fact as xs:double | The weighting factor to consider when the input strings have equal prefixes. |
Returns
| xs:double | The Jaro-Winkler similarity coefficient between the two strings. |
Examples
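As an illustration, the sketch below ranks a few candidate names against a reference name by their Jaro-Winkler coefficient. The candidate names other than "DUANE" are hypothetical; the prefix size 4 and the factor 0.1 are simply the values used in the documented example, not recommended settings.

```xquery
xquery version "1.0";
import module namespace simc = "http://www.zorba-xquery.com/modules/data-cleaning/character-based-string-similarity";

(: Hypothetical inputs: a reference name and some candidate names. :)
let $name := "DWAYNE"
for $candidate in ("DUANE", "DWIGHT", "WAYNE")
(: Prefix size 4 and factor 0.1 are taken from the documented example. :)
let $score := simc:jaro-winkler($name, $candidate, 4, 0.1)
order by $score descending
return concat($candidate, " ", $score)
```

The most similar candidate, according to the coefficient, comes first in the result sequence.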
jaro#2
declare function simc:jaro(
$s1 as xs:string,
$s2 as xs:string
) as xs:double
Returns the Jaro similarity coefficient between two strings.
This similarity coefficient is based on the number of transposed characters and on a
weighted sum of the percentage of matched characters held within the strings. The higher
the Jaro value is, the more similar the strings are. The coefficient is
normalized such that 0 equates to no similarity and 1 is an exact match.
Example usage : simc:jaro("FLWOR Found.", "FLWOR Foundation")
The function invocation in the example above returns : 0.5853174603174603
Parameters
| $s1 as xs:string | The first string. |
| $s2 as xs:string | The second string. |
Returns
| xs:double | The Jaro similarity coefficient between the two strings. |
Examples
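Because the coefficient is normalized to the range from 0 to 1, it can be compared against a fixed cut-off. The sketch below assumes a hypothetical threshold of 0.5 and hypothetical result labels; a suitable threshold depends on the data being matched.

```xquery
xquery version "1.0";
import module namespace simc = "http://www.zorba-xquery.com/modules/data-cleaning/character-based-string-similarity";

(: Hypothetical cut-off; tune it for the data at hand. :)
let $threshold := 0.5
let $score := simc:jaro("FLWOR Found.", "FLWOR Foundation")
return
  if ($score ge $threshold)
  then "likely duplicate"
  else "distinct"
```

With the documented result of 0.5853174603174603 for this pair, the first branch would be taken.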
needleman-wunsch#4
declare function simc:needleman-wunsch(
$s1 as xs:string,
$s2 as xs:string,
$score as xs:integer,
$penalty as xs:integer
) as xs:double
Returns the Needleman-Wunsch distance between two strings.
The Needleman-Wunsch distance is similar to the basic edit distance metric, adding a
variable cost adjustment to the cost of a gap (i.e., an insertion or deletion) in the
distance metric.
Example usage : simc:needleman-wunsch("KAK", "KQRK", 1, 1)
The function invocation in the example above returns : 0
Parameters
| $s1 as xs:string | The first string. |
| $s2 as xs:string | The second string. |
| $score as xs:integer | The score value. |
| $penalty as xs:integer | The penalty value. |
Returns
| xs:double | The Needleman-Wunsch distance between the two strings. |
Examples
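To see how the gap cost influences the result, the same pair of strings can be compared under several penalty values. This is only a sketch: the penalty values 2 and 3 and the `result` element name are hypothetical, and no outputs are claimed beyond the documented result for a penalty of 1.

```xquery
xquery version "1.0";
import module namespace simc = "http://www.zorba-xquery.com/modules/data-cleaning/character-based-string-similarity";

(: Compare the documented pair under increasing gap penalties. :)
for $penalty in (1, 2, 3)
return
  <result penalty="{$penalty}">{
    simc:needleman-wunsch("KAK", "KQRK", 1, $penalty)
  }</result>
```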
smith-waterman#4
declare function simc:smith-waterman(
$s1 as xs:string,
$s2 as xs:string,
$score as xs:integer,
$penalty as xs:integer
) as xs:double
Returns the Smith-Waterman distance between two strings.
Example usage : simc:smith-waterman("ACACACTA", "AGCACACA", 2, 1)
The function invocation in the example above returns : 12
Parameters
| $s1 as xs:string | The first string. |
| $s2 as xs:string | The second string. |
| $score as xs:integer | The score value. |
| $penalty as xs:integer | The penalty value. |
Returns
| xs:double | The Smith-Waterman distance between the two strings. |