http://www.zorba-xquery.com/modules/data-cleaning/character-based-string-similarity

Description

Before using any of the functions below please remember to import the module namespace:

import module namespace simc = "http://www.zorba-xquery.com/modules/data-cleaning/character-based-string-similarity";

This library module provides character-based string similarity functions that view strings as sequences of characters, generally computing a similarity score that corresponds to the cost of transforming one string into another. These functions are particularly useful for matching near duplicate strings in the presence of typographical errors. The logic contained in this module is not specific to any particular XQuery implementation.

Author

Bruno Martins and Diogo Simões

XQuery version and encoding

xquery version "1.0" encoding "utf-8";

Namespaces

simc    http://www.zorba-xquery.com/modules/data-cleaning/character-based-string-similarity
ver     http://www.zorba-xquery.com/options/versioning

Function Summary

edit-distance($s1 as xs:string, $s2 as xs:string) as xs:integer

Returns the edit distance between two strings.

jaro-winkler($s1 as xs:string, $s2 as xs:string, $prefix as xs:integer, $fact as xs:double) as xs:double

Returns the Jaro-Winkler similarity coefficient between two strings.

jaro($s1 as xs:string, $s2 as xs:string) as xs:double

Returns the Jaro similarity coefficient between two strings.

needleman-wunsch($s1 as xs:string, $s2 as xs:string, $score as xs:integer, $penalty as xs:integer) as xs:double

Returns the Needleman-Wunsch distance between two strings.

smith-waterman($s1 as xs:string, $s2 as xs:string, $score as xs:integer, $penalty as xs:integer) as xs:double

Returns the Smith-Waterman distance between two strings.

Functions

edit-distance#2

declare function simc:edit-distance(
    $s1 as xs:string,
    $s2 as xs:string
) as xs:integer

Returns the edit distance between two strings. This distance, also referred to as the Levenshtein distance, is defined as the minimum number of edits needed to transform one string into the other, where the allowed edit operations are the insertion, deletion, or substitution of a single character.
Example usage : edit-distance("FLWOR", "FLOWER")
The function invocation in the example above returns : 2

Parameters

  • $s1 as xs:string
    The first string.
  • $s2 as xs:string
    The second string.

Returns

  • xs:integer

    The edit distance between the two strings.

Examples
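
A minimal sketch of how the function might be used to filter near-duplicate strings. The candidate list and the threshold of 2 are hypothetical values chosen for illustration; only the pair "FLWOR" / "FLOWER" comes from the description above, which gives its distance as 2.

```xquery
import module namespace simc = "http://www.zorba-xquery.com/modules/data-cleaning/character-based-string-similarity";

(: Hypothetical target, candidates, and threshold, for illustration only. :)
let $target := "FLOWER"
let $candidates := ("FLWOR", "FLOWERS", "TOWER", "XQUERY")
for $c in $candidates
let $d := simc:edit-distance($c, $target)
(: Keep only candidates within two single-character edits of the target. :)
where $d le 2
order by $d
return <match distance="{$d}">{$c}</match>
```

A small threshold such as this one tends to suit short strings with typographical errors; longer strings may need a threshold that scales with their length.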

jaro-winkler#4

declare function simc:jaro-winkler(
    $s1 as xs:string,
    $s2 as xs:string,
    $prefix as xs:integer,
    $fact as xs:double
) as xs:double

Returns the Jaro-Winkler similarity coefficient between two strings. This similarity coefficient corresponds to an extension of the Jaro similarity coefficient that weights or penalizes strings based on their similarity at the beginning of the string, up to a given prefix size.
Example usage : jaro-winkler("DWAYNE", "DUANE", 4, 0.1 )
The function invocation in the example above returns : 0.8577777777777778

Parameters

  • $s1 as xs:string
    The first string.
  • $s2 as xs:string
    The second string.
  • $prefix as xs:integer
    The number of characters to consider when testing for equal prefixes in the strings.
  • $fact as xs:double
    The weighting factor to consider when the input strings have equal prefixes.

Returns

  • xs:double

    The Jaro-Winkler similarity coefficient between the two strings.

Examples
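
A minimal sketch that evaluates the Jaro and Jaro-Winkler coefficients side by side for the pair used in the description above, so that the effect of the prefix adjustment can be inspected. The element names in the result are hypothetical; the arguments 4 and 0.1 are the ones from the example usage.

```xquery
import module namespace simc = "http://www.zorba-xquery.com/modules/data-cleaning/character-based-string-similarity";

let $a := "DWAYNE"
let $b := "DUANE"
return
  <scores>
    <!-- Plain Jaro coefficient, with no prefix adjustment. -->
    <jaro>{ simc:jaro($a, $b) }</jaro>
    <!-- Same pair, comparing prefixes of up to 4 characters with a weighting factor of 0.1. -->
    <jaro-winkler>{ simc:jaro-winkler($a, $b, 4, 0.1) }</jaro-winkler>
  </scores>
```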

jaro#2

declare function simc:jaro(
    $s1 as xs:string,
    $s2 as xs:string
) as xs:double

Returns the Jaro similarity coefficient between two strings. This similarity coefficient is based on the number of transposed characters and on a weighted sum of the percentage of matched characters held within the strings. The higher the Jaro value is, the more similar the strings are. The coefficient is normalized such that 0 equates to no similarity and 1 is an exact match.
Example usage : jaro("FLWOR Found.", "FLWOR Foundation")
The function invocation in the example above returns : 0.5853174603174603

Parameters

  • $s1 as xs:string
    The first string.
  • $s2 as xs:string
    The second string.

Returns

  • xs:double

    The Jaro similarity coefficient between the two strings.

Examples
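
A minimal sketch that ranks a few candidate strings by their Jaro coefficient against a reference string, most similar first. The candidates other than "FLWOR Found." are hypothetical and serve only as an illustration.

```xquery
import module namespace simc = "http://www.zorba-xquery.com/modules/data-cleaning/character-based-string-similarity";

(: Hypothetical reference string and candidates, for illustration only. :)
let $name := "FLWOR Foundation"
for $c in ("FLWOR Found.", "FLWOR Fdn", "Flower Foundation")
let $sim := simc:jaro($c, $name)
(: Higher coefficients mean more similar strings, so sort in descending order. :)
order by $sim descending
return concat($c, " : ", string($sim))
```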

needleman-wunsch#4

declare function simc:needleman-wunsch(
    $s1 as xs:string,
    $s2 as xs:string,
    $score as xs:integer,
    $penalty as xs:integer
) as xs:double

Returns the Needleman-Wunsch distance between two strings. The Needleman-Wunsch distance is similar to the basic edit distance metric, but it applies a variable cost adjustment to each gap (i.e., an insertion or deletion).
Example usage : needleman-wunsch("KAK", "KQRK", 1, 1)
The function invocation in the example above returns : 0

Parameters

  • $s1 as xs:string
    The first string.
  • $s2 as xs:string
    The second string.
  • $score as xs:integer
    The score value.
  • $penalty as xs:integer
    The penalty value, applied to the cost of a gap (an insertion or deletion).

Returns

  • xs:double

    The Needleman-Wunsch distance between the two strings.

Examples
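
A minimal sketch that evaluates the same pair of strings under several gap penalties, to show how the $penalty argument can be varied. The penalties 2 and 3 and the result element name are hypothetical; the description above only documents the call with a score of 1 and a penalty of 1, for which it gives the result 0.

```xquery
import module namespace simc = "http://www.zorba-xquery.com/modules/data-cleaning/character-based-string-similarity";

(: Keep the score fixed at 1 and vary only the gap penalty. :)
for $penalty in (1, 2, 3)
return
  <nw penalty="{$penalty}">{
    simc:needleman-wunsch("KAK", "KQRK", 1, $penalty)
  }</nw>
```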

smith-waterman#4

declare function simc:smith-waterman(
    $s1 as xs:string,
    $s2 as xs:string,
    $score as xs:integer,
    $penalty as xs:integer
) as xs:double

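Returns the Smith-Waterman distance between two strings. The Smith-Waterman algorithm is a local-alignment variant of Needleman-Wunsch: it scores the best-matching pair of substrings rather than aligning the two strings end to end.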
Example usage : smith-waterman("ACACACTA", "AGCACACA", 2, 1)
The function invocation in the example above returns : 12

Parameters

  • $s1 as xs:string
    The first string.
  • $s2 as xs:string
    The second string.
  • $score as xs:integer
    The score value.
  • $penalty as xs:integer
    The penalty value.

Returns

  • xs:double

    The Smith-Waterman distance between the two strings.
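
Examples

A minimal sketch of a complete query that calls the function with the arguments from the description above, which gives the result for these arguments as 12. The wrapping element and its attribute names are hypothetical and only label the inputs.

```xquery
import module namespace simc = "http://www.zorba-xquery.com/modules/data-cleaning/character-based-string-similarity";

let $s1 := "ACACACTA"
let $s2 := "AGCACACA"
return
  (: Score of 2 and penalty of 1, as in the example usage. :)
  <sw s1="{$s1}" s2="{$s2}">{
    simc:smith-waterman($s1, $s2, 2, 1)
  }</sw>
```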
