http://www.zorba-xquery.com/modules/data-cleaning/hybrid-string-similarity

Description

Before using any of the functions below please remember to import the module namespace:

import module namespace simh = "http://www.zorba-xquery.com/modules/data-cleaning/hybrid-string-similarity";

This library module provides hybrid string similarity functions, which combine the properties of character-based and token-based string similarity functions. The logic in this module is not specific to any particular XQuery implementation, although the module requires the trigonometric functions of XQuery 3.0 or a math extension function such as sqrt($x as numeric) for computing the square root.

Author

Bruno Martins and Diogo Simões

XQuery version and encoding

xquery version "3.0" encoding "utf-8";

Namespaces

math  http://www.w3.org/2005/xpath-functions/math
set   http://www.zorba-xquery.com/modules/data-cleaning/set-similarity
simc  http://www.zorba-xquery.com/modules/data-cleaning/character-based-string-similarity
simh  http://www.zorba-xquery.com/modules/data-cleaning/hybrid-string-similarity
simp  http://www.zorba-xquery.com/modules/data-cleaning/phonetic-string-similarity
simt  http://www.zorba-xquery.com/modules/data-cleaning/token-based-string-similarity
ver   http://www.zorba-xquery.com/options/versioning

Function Summary

monge-elkan-jaro-winkler($s1 as xs:string, $s2 as xs:string, $prefix as xs:integer, $fact as xs:double) as xs:double

Returns the Monge-Elkan similarity coefficient between two strings, using the Jaro-Winkler similarity function to discover token identity.

soft-cosine-tokens-edit-distance($s1 as xs:string, $s2 as xs:string, $r as xs:string, $t as xs:integer) as xs:double

Returns the cosine similarity coefficient between sets of tokens extracted from two strings.

soft-cosine-tokens-jaro-winkler($s1 as xs:string, $s2 as xs:string, $r as xs:string, $t as xs:double, $prefix as xs:integer?, $fact as xs:double?) as xs:double

Returns the cosine similarity coefficient between sets of tokens extracted from two strings.

soft-cosine-tokens-jaro($s1 as xs:string, $s2 as xs:string, $r as xs:string, $t as xs:double) as xs:double

Returns the cosine similarity coefficient between sets of tokens extracted from two strings.

soft-cosine-tokens-metaphone($s1 as xs:string, $s2 as xs:string, $r as xs:string) as xs:double

Returns the cosine similarity coefficient between sets of tokens extracted from two strings.

soft-cosine-tokens-soundex($s1 as xs:string, $s2 as xs:string, $r as xs:string) as xs:double

Returns the cosine similarity coefficient between sets of tokens extracted from two strings.

Functions

monge-elkan-jaro-winkler#4

declare function simh:monge-elkan-jaro-winkler(
    $s1 as xs:string,
    $s2 as xs:string,
    $prefix as xs:integer,
    $fact as xs:double
) as xs:double

Returns the Monge-Elkan similarity coefficient between two strings, using the Jaro-Winkler similarity function to discover token identity.
Example usage : monge-elkan-jaro-winkler("Comput. Sci. and Eng. Dept., University of California, San Diego", "Department of Computer Scinece, Univ. Calif., San Diego", 4, 0.1)
The function invocation in the example above returns : 0.992

Parameters

  • $s1 as xs:string
    The first string.
  • $s2 as xs:string
    The second string.
  • $prefix as xs:integer
    The number of characters to consider when testing for equal prefixes with the Jaro-Winkler metric.
  • $fact as xs:double
    The weighting factor to consider when the input strings have equal prefixes with the Jaro-Winkler metric.

Returns

  • xs:double

    The Monge-Elkan similarity coefficient between the two strings.

Examples
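
To illustrate the computation outside XQuery, the following is a minimal Python sketch of the textbook Monge-Elkan scheme with Jaro-Winkler as the inner token similarity. It is not the module's implementation: the helper names (jaro, jaro_winkler, monge_elkan) and the whitespace tokenization are assumptions made for this sketch, so its scores may differ from the values returned by the XQuery function.

```python
def jaro(s1: str, s2: str) -> float:
    """Textbook Jaro similarity between two strings, in [0, 1]."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    # Characters match if equal and no further apart than this window.
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    matched1 = [False] * len(s1)
    matched2 = [False] * len(s2)
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched characters that appear in a different order.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len(s1) + matches / len(s2)
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str, prefix: int, fact: float) -> float:
    """Jaro score boosted by the length of the common prefix (capped at prefix)."""
    score = jaro(s1, s2)
    common = 0
    for a, b in zip(s1[:prefix], s2[:prefix]):
        if a != b:
            break
        common += 1
    return score + common * fact * (1.0 - score)


def monge_elkan(s1: str, s2: str, prefix: int, fact: float) -> float:
    """Average, over the tokens of s1, of the best match found among the tokens of s2."""
    tokens1, tokens2 = s1.split(), s2.split()  # assumption: whitespace tokenization
    if not tokens1 or not tokens2:
        return 0.0
    best = [max(jaro_winkler(a, b, prefix, fact) for b in tokens2) for a in tokens1]
    return sum(best) / len(best)


print(monge_elkan("Univ. Calif., San Diego", "University of California, San Diego", 4, 0.1))
```

In this formulation the measure is asymmetric: swapping the two arguments averages over the tokens of the other string and can give a different score.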

soft-cosine-tokens-edit-distance#4

declare function simh:soft-cosine-tokens-edit-distance(
    $s1 as xs:string,
    $s2 as xs:string,
    $r as xs:string,
    $t as xs:integer
) as xs:double

Returns the cosine similarity coefficient between sets of tokens extracted from two strings. The tokens from each string are weighted according to their occurrence frequency (i.e., weighted according to the term-frequency heuristic from Information Retrieval). The Edit Distance similarity function is used to discover token identity, and tokens having an edit distance below a given threshold are considered as matching tokens.
Example usage : soft-cosine-tokens-edit-distance("The FLWOR Foundation", "FLWOR Found.", " +", 0 )
The function invocation in the example above returns : 0.408248290463863

Parameters

  • $s1 as xs:string
    The first string.
  • $s2 as xs:string
    The second string.
  • $r as xs:string
    A regular expression forming the delimiter character(s) which mark the boundaries between adjacent tokens.
  • $t as xs:integer
    A threshold for the similarity function used to discover token identity.

Returns

  • xs:double

    The cosine similarity coefficient between the sets of tokens extracted from the two strings.
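
The description above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the module's implementation: the function names are hypothetical, a token of the second string is rewritten to its closest token of the first string when the edit distance is at most the threshold (an inclusive comparison, chosen so that a threshold of 0 means exact matching), and Python's re.split stands in for XQuery tokenization.

```python
import math
import re
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,               # deletion
                               current[j - 1] + 1,            # insertion
                               previous[j - 1] + (ca != cb)))  # substitution
        previous = current
    return previous[-1]


def soft_cosine_edit_distance(s1: str, s2: str, delimiter: str, threshold: int) -> float:
    """Cosine similarity of term-frequency vectors with fuzzy token identity."""
    tokens1 = [t for t in re.split(delimiter, s1) if t]
    tokens2 = [t for t in re.split(delimiter, s2) if t]
    if not tokens1 or not tokens2:
        return 0.0
    vocabulary = list(dict.fromkeys(tokens1))  # distinct tokens of s1, in order

    def canonical(token: str) -> str:
        # Assumption of this sketch: map a token to its closest s1 token
        # when that token lies within the threshold, else keep it unchanged.
        closest = min(vocabulary, key=lambda v: levenshtein(token, v))
        return closest if levenshtein(token, closest) <= threshold else token

    tf1 = Counter(tokens1)
    tf2 = Counter(canonical(t) for t in tokens2)
    dot = sum(weight * tf2[token] for token, weight in tf1.items())
    norm1_sq = sum(w * w for w in tf1.values())
    norm2_sq = sum(w * w for w in tf2.values())
    return dot / math.sqrt(norm1_sq * norm2_sq)


print(soft_cosine_edit_distance("The FLWOR Foundation", "FLWOR Found.", " +", 0))
```

With a threshold of 0 only the token "FLWOR" is shared, so the sketch computes 1 / sqrt(3 * 2), approximately 0.408248, which agrees with the value documented for the example invocation.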

soft-cosine-tokens-jaro-winkler#6

declare function simh:soft-cosine-tokens-jaro-winkler(
    $s1 as xs:string,
    $s2 as xs:string,
    $r as xs:string,
    $t as xs:double,
    $prefix as xs:integer?,
    $fact as xs:double?
) as xs:double

Returns the cosine similarity coefficient between sets of tokens extracted from two strings. The tokens from each string are weighted according to their occurrence frequency (i.e., weighted according to the term-frequency heuristic from Information Retrieval). The Jaro-Winkler similarity function is used to discover token identity, and tokens having a Jaro-Winkler similarity above a given threshold are considered as matching tokens.
Example usage : soft-cosine-tokens-jaro-winkler("The FLWOR Foundation", "FLWOR Found.", " +", 1, 4, 0.1 )
The function invocation in the example above returns : 0.45

Parameters

  • $s1 as xs:string
    The first string.
  • $s2 as xs:string
    The second string.
  • $r as xs:string
    A regular expression forming the delimiter character(s) which mark the boundaries between adjacent tokens.
  • $t as xs:double
    A threshold for the similarity function used to discover token identity.
  • $prefix as xs:integer?
    The number of characters to consider when testing for equal prefixes with the Jaro-Winkler metric.
  • $fact as xs:double?
    The weighting factor to consider when the input strings have equal prefixes with the Jaro-Winkler metric.

Returns

  • xs:double

    The cosine similarity coefficient between the sets of tokens extracted from the two strings.

Examples
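
For reference, the roles of $prefix and $fact can be read off the usual textbook definition of the Jaro-Winkler metric, stated below. The symbols are chosen here for illustration, and whether the module follows this definition in every detail is an assumption.

```latex
\[
  \mathrm{sim}_{jw}(a, b) \;=\; \mathrm{sim}_{j}(a, b) \;+\; \ell \cdot p \cdot \bigl(1 - \mathrm{sim}_{j}(a, b)\bigr)
\]
% sim_j : the Jaro similarity of the two tokens
% \ell  : length of the common prefix of a and b, capped at $prefix
% p     : the weighting factor $fact
```

For instance, with $prefix = 4 and $fact = 0.1, two tokens that share a four-character prefix and have a Jaro similarity of 0.8 would score 0.8 + 4 * 0.1 * 0.2 = 0.88 under this definition.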

soft-cosine-tokens-jaro#4

declare function simh:soft-cosine-tokens-jaro(
    $s1 as xs:string,
    $s2 as xs:string,
    $r as xs:string,
    $t as xs:double
) as xs:double

Returns the cosine similarity coefficient between sets of tokens extracted from two strings. The tokens from each string are weighted according to their occurrence frequency (i.e., weighted according to the term-frequency heuristic from Information Retrieval). The Jaro similarity function is used to discover token identity, and tokens having a Jaro similarity above a given threshold are considered as matching tokens.
Example usage : soft-cosine-tokens-jaro("The FLWOR Foundation", "FLWOR Found.", " +", 1 )
The function invocation in the example above returns : 0.5

Parameters

  • $s1 as xs:string
    The first string.
  • $s2 as xs:string
    The second string.
  • $r as xs:string
    A regular expression forming the delimiter character(s) which mark the boundaries between adjacent tokens.
  • $t as xs:double
    A threshold for the similarity function used to discover token identity.

Returns

  • xs:double

    The cosine similarity coefficient between the sets of tokens extracted from the two strings.

Examples
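
As background for choosing the threshold $t, the Jaro similarity is commonly defined as shown below. This is the standard textbook formulation with symbols chosen for illustration; it is assumed, not confirmed, that the module computes the same quantity.

```latex
\[
  \mathrm{sim}_{j}(a, b) =
  \begin{cases}
    0 & \text{if } m = 0, \\[4pt]
    \dfrac{1}{3}\left( \dfrac{m}{|a|} + \dfrac{m}{|b|} + \dfrac{m - t}{m} \right) & \text{otherwise.}
  \end{cases}
\]
% m : number of matching characters (equal characters no further apart
%     than floor(max(|a|, |b|) / 2) - 1 positions)
% t : half the number of matching characters that appear in a different order
```

As a worked example, "MARTHA" and "MARHTA" have m = 6 and t = 1, giving (1 + 1 + 5/6) / 3, approximately 0.944. Under this definition the two tokens would count as matching for a threshold of 0.9 but not for a threshold of 1.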

soft-cosine-tokens-metaphone#3

declare function simh:soft-cosine-tokens-metaphone(
    $s1 as xs:string,
    $s2 as xs:string,
    $r as xs:string
) as xs:double

Returns the cosine similarity coefficient between sets of tokens extracted from two strings. The tokens from each string are weighted according to their occurrence frequency (i.e., weighted according to the term-frequency heuristic from Information Retrieval). The Metaphone phonetic similarity function is used to discover token identity, which is equivalent to saying that this function returns the cosine similarity coefficient between sets of Metaphone keys.
Example usage : soft-cosine-tokens-metaphone("ALEKSANDER SMITH", "ALEXANDER SMYTH", " +" )
The function invocation in the example above returns : 1.0

Parameters

  • $s1 as xs:string
    The first string.
  • $s2 as xs:string
    The second string.
  • $r as xs:string
    A regular expression forming the delimiter character(s) which mark the boundaries between adjacent tokens.

Returns

  • xs:double

    The cosine similarity coefficient between the sets of Metaphone keys extracted from the two strings.

soft-cosine-tokens-soundex#3

declare function simh:soft-cosine-tokens-soundex(
    $s1 as xs:string,
    $s2 as xs:string,
    $r as xs:string
) as xs:double

Returns the cosine similarity coefficient between sets of tokens extracted from two strings. The tokens from each string are weighted according to their occurrence frequency (i.e., weighted according to the term-frequency heuristic from Information Retrieval). The Soundex phonetic similarity function is used to discover token identity, which is equivalent to saying that this function returns the cosine similarity coefficient between sets of Soundex keys.
Example usage : soft-cosine-tokens-soundex("ALEKSANDER SMITH", "ALEXANDER SMYTH", " +")
The function invocation in the example above returns : 1.0

Parameters

  • $s1 as xs:string
    The first string.
  • $s2 as xs:string
    The second string.
  • $r as xs:string
    A regular expression forming the delimiter character(s) which mark the boundaries between adjacent tokens.

Returns

  • xs:double

    The cosine similarity coefficient between the sets of Soundex keys extracted from the two strings.

Examples
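
The equivalence described above (cosine similarity between sets of Soundex keys) can be sketched in Python as follows. This is an illustration under stated assumptions rather than the module's code: the function names are hypothetical, the key computation follows the classic American Soundex rules, which may differ in detail from the module's phonetic implementation, and Python's re.split stands in for XQuery tokenization.

```python
import math
import re
from collections import Counter

# Classic Soundex digit groups; letters not listed (vowels, H, W, Y) carry no digit.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for letter in letters:
        CODES[letter] = digit


def soundex(word: str) -> str:
    """Four-character American Soundex key, or '' if the word has no letters."""
    letters = [c for c in word.upper() if c.isalpha()]
    if not letters:
        return ""
    key = letters[0]
    last = CODES.get(letters[0], "")
    for ch in letters[1:]:
        code = CODES.get(ch, "")
        if code and code != last:
            key += code
        if ch not in "HW":  # H and W do not separate letters with the same code
            last = code
    return (key + "000")[:4]


def soft_cosine_soundex(s1: str, s2: str, delimiter: str) -> float:
    """Cosine similarity between the term-frequency vectors of Soundex keys."""
    keys1 = Counter(soundex(t) for t in re.split(delimiter, s1) if t)
    keys2 = Counter(soundex(t) for t in re.split(delimiter, s2) if t)
    if not keys1 or not keys2:
        return 0.0
    dot = sum(weight * keys2[key] for key, weight in keys1.items())
    norm1_sq = sum(w * w for w in keys1.values())
    norm2_sq = sum(w * w for w in keys2.values())
    return dot / math.sqrt(norm1_sq * norm2_sq)


print(soundex("ALEKSANDER"), soundex("ALEXANDER"), soundex("SMITH"), soundex("SMYTH"))
print(soft_cosine_soundex("ALEKSANDER SMITH", "ALEXANDER SMYTH", " +"))
```

Both names reduce to the keys A425 and S530 in this sketch, so the two key vectors are identical and the cosine is 1.0, in line with the documented result for the example invocation.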
