http://www.zorba-xquery.com/modules/data-cleaning/hybrid-string-similarity
Description
Before using any of the functions below please remember to import the module namespace:
import module namespace simh = "http://www.zorba-xquery.com/modules/data-cleaning/hybrid-string-similarity";
This library module provides hybrid string similarity functions, which combine the properties of character-based and token-based string similarity functions. The logic contained in this module is not specific to any particular XQuery implementation, although the module requires the trigonometric functions of XQuery 3.0 or a math extension function such as sqrt($x as numeric) for computing the square root.
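As a minimal sketch of how the import and a call fit together in one query, the following uses two of the functions documented on this page with the same arguments as their documented examples. The wrapper element names (`similarity`, `edit-distance`, `jaro`) are hypothetical and chosen only for this illustration.

```xquery
xquery version "3.0";

(: Import the module under the prefix used throughout this page. :)
import module namespace simh = "http://www.zorba-xquery.com/modules/data-cleaning/hybrid-string-similarity";

(: Two spellings of the same organization name; tokens are split on runs of spaces. :)
let $a := "The FLWOR Foundation"
let $b := "FLWOR Found."
return
  (: The element names below are illustrative, not part of the module. :)
  <similarity>
    <edit-distance>{ simh:soft-cosine-tokens-edit-distance($a, $b, " +", 0) }</edit-distance>
    <jaro>{ simh:soft-cosine-tokens-jaro($a, $b, " +", 1) }</jaro>
  </similarity>
```

If the module behaves as its documented examples state, the two elements should contain 0.408248290463863 and 0.5 respectively.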
Author
Bruno Martins and Diogo Simões
XQuery version and encoding
xquery version "3.0" encoding "utf-8";
Namespaces
| math | http://www.w3.org/2005/xpath-functions/math |
| set | http://www.zorba-xquery.com/modules/data-cleaning/set-similarity |
| simc | http://www.zorba-xquery.com/modules/data-cleaning/character-based-string-similarity |
| simh | http://www.zorba-xquery.com/modules/data-cleaning/hybrid-string-similarity |
| simp | http://www.zorba-xquery.com/modules/data-cleaning/phonetic-string-similarity |
| simt | http://www.zorba-xquery.com/modules/data-cleaning/token-based-string-similarity |
| ver | http://www.zorba-xquery.com/options/versioning |
Function Summary
| monge-elkan-jaro-winkler($s1 as xs:string, $s2 as xs:string, $prefix as xs:integer, $fact as xs:double) as xs:double | Returns the Monge-Elkan similarity coefficient between two strings, using the Jaro-Winkler similarity function to discover token identity. |
| soft-cosine-tokens-edit-distance($s1 as xs:string, $s2 as xs:string, $r as xs:string, $t as xs:integer) as xs:double | Returns the cosine similarity coefficient between sets of tokens extracted from two strings. |
| soft-cosine-tokens-jaro-winkler($s1 as xs:string, $s2 as xs:string, $r as xs:string, $t as xs:double, $prefix as xs:integer?, $fact as xs:double?) as xs:double | Returns the cosine similarity coefficient between sets of tokens extracted from two strings. |
| soft-cosine-tokens-jaro($s1 as xs:string, $s2 as xs:string, $r as xs:string, $t as xs:double) as xs:double | Returns the cosine similarity coefficient between sets of tokens extracted from two strings. |
| soft-cosine-tokens-metaphone($s1 as xs:string, $s2 as xs:string, $r as xs:string) as xs:double | Returns the cosine similarity coefficient between sets of tokens extracted from two strings. |
| soft-cosine-tokens-soundex($s1 as xs:string, $s2 as xs:string, $r as xs:string) as xs:double | Returns the cosine similarity coefficient between sets of tokens extracted from two strings. |
Functions
monge-elkan-jaro-winkler#4
declare function simh:monge-elkan-jaro-winkler(
$s1 as xs:string,
$s2 as xs:string,
$prefix as xs:integer,
$fact as xs:double
) as xs:double
Returns the Monge-Elkan similarity coefficient between two strings, using the Jaro-Winkler similarity function to discover token identity.
Example usage: simh:monge-elkan-jaro-winkler("Comput. Sci. and Eng. Dept., University of California, San Diego", "Department of Computer Scinece, Univ. Calif., San Diego", 4, 0.1)
The function invocation in the example above returns: 0.992
Parameters
| $s1 as xs:string | The first string. |
| $s2 as xs:string | The second string. |
| $prefix as xs:integer | The number of characters to consider when testing for equal prefixes with the Jaro-Winkler metric. |
| $fact as xs:double | The weighting factor to consider when the input strings have equal prefixes with the Jaro-Winkler metric. |
Returns
| xs:double | The Monge-Elkan similarity coefficient between the two strings. |
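To illustrate what a Monge-Elkan coefficient measures, here is a minimal sketch of the textbook definition: for each token of the first string, take its best score against any token of the second string, then average those scores. This is not the module's source code. The helper `local:monge-elkan`, the single-space tokenizer, and the pluggable `$sim` parameter (standing in for Jaro-Winkler) are all assumptions made for this example, and both inputs are assumed to contain at least one token.

```xquery
xquery version "3.0";

(: Hypothetical helper: textbook Monge-Elkan with a pluggable token similarity.
   $sim stands in for Jaro-Winkler; the whitespace tokenizer is an assumption. :)
declare function local:monge-elkan(
  $s1 as xs:string,
  $s2 as xs:string,
  $sim as function(xs:string, xs:string) as xs:double
) as xs:double {
  let $tokens1 := tokenize(normalize-space($s1), " ")
  let $tokens2 := tokenize(normalize-space($s2), " ")
  (: For each token of $s1, keep its best score against any token of $s2. :)
  let $best :=
    for $a in $tokens1
    return max(for $b in $tokens2 return $sim($a, $b))
  (: The coefficient is the mean of those best scores. :)
  return sum($best) div count($tokens1)
};

(: Toy similarity used only for this demonstration: 1 for identical tokens, else 0. :)
let $exact := function($a as xs:string, $b as xs:string) as xs:double {
  if ($a eq $b) then 1e0 else 0e0
}
return local:monge-elkan("San Diego California", "University of California San Diego", $exact)
```

With the toy similarity, every token of the first string has an identical counterpart in the second, so the query evaluates to 1. Under this definition the measure is asymmetric: swapping the two arguments changes the average, because the extra tokens "University" and "of" then have no match.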
soft-cosine-tokens-edit-distance#4
declare function simh:soft-cosine-tokens-edit-distance(
$s1 as xs:string,
$s2 as xs:string,
$r as xs:string,
$t as xs:integer
) as xs:double
Returns the cosine similarity coefficient between sets of tokens extracted from two strings. The tokens from each string are weighted according to their occurrence frequency (i.e., according to the term-frequency heuristic from Information Retrieval). The Edit Distance similarity function is used to discover token identity, and tokens having an edit distance below a given threshold are considered as matching tokens.
Example usage: simh:soft-cosine-tokens-edit-distance("The FLWOR Foundation", "FLWOR Found.", " +", 0)
The function invocation in the example above returns: 0.408248290463863
Parameters
| $s1 as xs:string | The first string. |
| $s2 as xs:string | The second string. |
| $r as xs:string | A regular expression forming the delimiter character(s) which mark the boundaries between adjacent tokens. |
| $t as xs:integer | A threshold for the similarity function used to discover token identity. |
Returns
| xs:double | The cosine similarity coefficient between the sets of tokens extracted from the two strings. |
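The term-frequency weighting and token matching described above can be sketched as follows. This is an illustration under stated assumptions, not the module's implementation: the helper `local:soft-cosine` and its `$same` parameter are hypothetical, and the demonstration passes plain string equality, which corresponds to an edit-distance threshold of 0 only if tokens at exactly the threshold count as matching.

```xquery
xquery version "3.0";

declare namespace math = "http://www.w3.org/2005/xpath-functions/math";

(: Hypothetical helper: term-frequency cosine where $same decides token identity. :)
declare function local:soft-cosine(
  $s1 as xs:string,
  $s2 as xs:string,
  $r as xs:string,
  $same as function(xs:string, xs:string) as xs:boolean
) as xs:double {
  let $tokens1 := tokenize($s1, $r)
  let $tokens2 := tokenize($s2, $r)
  let $all := ($tokens1, $tokens2)
  (: Vocabulary: keep a token only if no earlier token matches it. :)
  let $vocab :=
    for $tok at $pos in $all
    where every $earlier in subsequence($all, 1, $pos - 1)
          satisfies not($same($earlier, $tok))
    return $tok
  (: Term frequency of each vocabulary entry in the two token sequences. :)
  let $freq1 := for $v in $vocab return count($tokens1[$same(., $v)])
  let $freq2 := for $v in $vocab return count($tokens2[$same(., $v)])
  (: Cosine: dot product divided by the product of the vector lengths. :)
  let $dot := sum(for $f at $i in $freq1 return $f * $freq2[$i])
  let $norm1 := math:sqrt(sum(for $f in $freq1 return $f * $f))
  let $norm2 := math:sqrt(sum(for $f in $freq2 return $f * $f))
  return $dot div ($norm1 * $norm2)
};

(: Exact equality stands in for "edit distance within a threshold of 0". :)
let $exact := function($a as xs:string, $b as xs:string) as xs:boolean { $a eq $b }
return local:soft-cosine("The FLWOR Foundation", "FLWOR Found.", " +", $exact)
```

Only "FLWOR" is shared, so the dot product is 1 and the vector lengths are the square roots of 3 and 2. The result is 1 divided by the square root of 6, about 0.408, which agrees with the value documented for the example usage.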
soft-cosine-tokens-jaro-winkler#6
declare function simh:soft-cosine-tokens-jaro-winkler(
$s1 as xs:string,
$s2 as xs:string,
$r as xs:string,
$t as xs:double,
$prefix as xs:integer?,
$fact as xs:double?
) as xs:double
Returns the cosine similarity coefficient between sets of tokens extracted from two strings. The tokens from each string are weighted according to their occurrence frequency (i.e., according to the term-frequency heuristic from Information Retrieval). The Jaro-Winkler similarity function is used to discover token identity, and tokens having a Jaro-Winkler similarity above a given threshold are considered as matching tokens.
Example usage: simh:soft-cosine-tokens-jaro-winkler("The FLWOR Foundation", "FLWOR Found.", " +", 1, 4, 0.1)
The function invocation in the example above returns: 0.45
Parameters
| $s1 as xs:string | The first string. |
| $s2 as xs:string | The second string. |
| $r as xs:string | A regular expression forming the delimiter character(s) which mark the boundaries between adjacent tokens. |
| $t as xs:double | A threshold for the similarity function used to discover token identity. |
| $prefix as xs:integer? | The number of characters to consider when testing for equal prefixes with the Jaro-Winkler metric. |
| $fact as xs:double? | The weighting factor to consider when the input strings have equal prefixes with the Jaro-Winkler metric. |
Returns
| xs:double | The cosine similarity coefficient between the sets of tokens extracted from the two strings. |
soft-cosine-tokens-jaro#4
declare function simh:soft-cosine-tokens-jaro(
$s1 as xs:string,
$s2 as xs:string,
$r as xs:string,
$t as xs:double
) as xs:double
Returns the cosine similarity coefficient between sets of tokens extracted from two strings. The tokens from each string are weighted according to their occurrence frequency (i.e., according to the term-frequency heuristic from Information Retrieval). The Jaro similarity function is used to discover token identity, and tokens having a Jaro similarity above a given threshold are considered as matching tokens.
Example usage: simh:soft-cosine-tokens-jaro("The FLWOR Foundation", "FLWOR Found.", " +", 1)
The function invocation in the example above returns: 0.5
Parameters
| $s1 as xs:string | The first string. |
| $s2 as xs:string | The second string. |
| $r as xs:string | A regular expression forming the delimiter character(s) which mark the boundaries between adjacent tokens. |
| $t as xs:double | A threshold for the similarity function used to discover token identity. |
Returns
| xs:double | The cosine similarity coefficient between the sets of tokens extracted from the two strings. |
soft-cosine-tokens-metaphone#3
declare function simh:soft-cosine-tokens-metaphone(
$s1 as xs:string,
$s2 as xs:string,
$r as xs:string
) as xs:double
Returns the cosine similarity coefficient between sets of tokens extracted from two strings. The tokens from each string are weighted according to their occurrence frequency (i.e., according to the term-frequency heuristic from Information Retrieval). The Metaphone phonetic similarity function is used to discover token identity, which is equivalent to saying that this function returns the cosine similarity coefficient between sets of Metaphone keys.
Example usage: simh:soft-cosine-tokens-metaphone("ALEKSANDER SMITH", "ALEXANDER SMYTH", " +")
The function invocation in the example above returns: 1.0
Parameters
| $s1 as xs:string | The first string. |
| $s2 as xs:string | The second string. |
| $r as xs:string | A regular expression forming the delimiter character(s) which mark the boundaries between adjacent tokens. |
Returns
| xs:double | The cosine similarity coefficient between the sets of Metaphone keys extracted from the two strings. |
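The equivalence stated in the description (matching tokens phonetically is the same as comparing their keys) can be sketched by mapping every token to a key first and then computing an ordinary cosine over the keys. The helper `local:key-cosine` and its `$key` parameter are hypothetical. The sketch does not implement Metaphone: it passes the standard `upper-case#1` function as a stand-in key, and a real phonetic key function would take its place.

```xquery
xquery version "3.0";

declare namespace math = "http://www.w3.org/2005/xpath-functions/math";

(: Hypothetical helper: cosine over the keys that $key produces for each token. :)
declare function local:key-cosine(
  $s1 as xs:string,
  $s2 as xs:string,
  $r as xs:string,
  $key as function(xs:string) as xs:string
) as xs:double {
  let $keys1 := for $tok in tokenize($s1, $r) return $key($tok)
  let $keys2 := for $tok in tokenize($s2, $r) return $key($tok)
  (: After mapping to keys, token identity reduces to plain string equality. :)
  let $vocab := distinct-values(($keys1, $keys2))
  let $freq1 := for $k in $vocab return count($keys1[. eq $k])
  let $freq2 := for $k in $vocab return count($keys2[. eq $k])
  let $dot := sum(for $f at $i in $freq1 return $f * $freq2[$i])
  let $squares1 := sum(for $f in $freq1 return $f * $f)
  let $squares2 := sum(for $f in $freq2 return $f * $f)
  return $dot div math:sqrt($squares1 * $squares2)
};

(: Stand-in key: upper-casing makes tokens that differ only in case collide. :)
local:key-cosine("Aleksander Smith", "ALEKSANDER SMITH", " +", upper-case#1)
```

Both strings map to the same two keys, so the query evaluates to 1. A phonetic key would additionally collapse spelling variants that sound alike, which is how the documented example can reach 1.0 for "ALEKSANDER SMITH" and "ALEXANDER SMYTH".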
soft-cosine-tokens-soundex#3
declare function simh:soft-cosine-tokens-soundex(
$s1 as xs:string,
$s2 as xs:string,
$r as xs:string
) as xs:double
Returns the cosine similarity coefficient between sets of tokens extracted from two strings. The tokens from each string are weighted according to their occurrence frequency (i.e., according to the term-frequency heuristic from Information Retrieval). The Soundex phonetic similarity function is used to discover token identity, which is equivalent to saying that this function returns the cosine similarity coefficient between sets of Soundex keys.
Example usage: simh:soft-cosine-tokens-soundex("ALEKSANDER SMITH", "ALEXANDER SMYTH", " +")
The function invocation in the example above returns: 1.0
Parameters
| $s1 as xs:string | The first string. |
| $s2 as xs:string | The second string. |
| $r as xs:string | A regular expression forming the delimiter character(s) which mark the boundaries between adjacent tokens. |
Returns
| xs:double | The cosine similarity coefficient between the sets of Soundex keys extracted from the two strings. |