http://www.zorba-xquery.com/modules/data-cleaning/set-similarity
Description
Before using any of the functions below, import the module namespace:
import module namespace set = "http://www.zorba-xquery.com/modules/data-cleaning/set-similarity";
This library module provides similarity functions for comparing sets of XML nodes (e.g., sets of XML elements, attributes or atomic values). These functions are particularly useful for matching near-duplicate sets of XML nodes. The logic contained in this module is not specific to any particular XQuery implementation.
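As a minimal sketch of how a query might combine the import with one of the functions documented below (the input sequences are made up for illustration, and running it assumes a processor that has this module installed):

```xquery
xquery version "1.0";

import module namespace set = "http://www.zorba-xquery.com/modules/data-cleaning/set-similarity";

(: Illustrative inputs only; any two sequences of items can be compared. :)
let $s1 := ("a", "b", "c")
let $s2 := ("b", "c", "d")
return set:jaccard($s1, $s2)
```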
Author
Bruno Martins
XQuery version and encoding
xquery version "1.0" encoding "utf-8";
Namespaces
| set | http://www.zorba-xquery.com/modules/data-cleaning/set-similarity |
| ver | http://www.zorba-xquery.com/options/versioning |
Function Summary
| deep-intersect($s1, $s2) as item()* | Returns the intersection between two sets, using the deep-equal() function to compare the XML nodes from the sets. |
| deep-union($s1, $s2) as item()* | Returns the union between two sets, using the deep-equal() function to compare the XML nodes from the sets. |
| dice($s1, $s2) as xs:double | Returns the Dice similarity coefficient between two sets of XML nodes. |
| distinct($s) as item()* | Removes exact duplicates from a set, using the deep-equal() function to compare the XML nodes from the set. |
| jaccard($s1, $s2) as xs:double | Returns the Jaccard similarity coefficient between two sets of XML nodes. |
| overlap($s1, $s2) as xs:double | Returns the overlap coefficient between two sets of XML nodes. |
Functions
deep-intersect#2
declare function set:deep-intersect(
$s1 as ,
$s2 as
) as item()*
Returns the intersection between two sets, using the deep-equal() function to compare the XML nodes from the sets.
Example usage : deep-intersect ( ( "a", "b", "c") , ( "a", "a",
The function invocation in the example above returns : ("a")
Parameters
$s1: The first set.
$s2: The second set.
Returns
item()*: The intersection of both sets.
Examples
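The deep-equal()-based intersection described above can be sketched in plain XQuery 1.0. The helper `local:deep-intersect` is hypothetical and is not the module's source. It assumes that the result keeps the order of the first set and drops repeated items.

```xquery
xquery version "1.0";

(: Hypothetical helper, not the module's implementation. :)
declare function local:deep-intersect($s1 as item()*, $s2 as item()*) as item()* {
  for $x at $i in $s1
  (: keep $x if it has a deep-equal match in $s2 ... :)
  where (some $y in $s2 satisfies deep-equal($x, $y))
    (: ... and no deep-equal item occurred earlier in $s1 :)
    and not(some $p in subsequence($s1, 1, $i - 1) satisfies deep-equal($x, $p))
  return $x
};

(: Yields the element <k>1</k> followed by the string "a". :)
local:deep-intersect((<k>1</k>, "a", "a", "b"), ("a", <k>1</k>, "c"))
```

Because deep-equal() compares content rather than node identity, the two separately constructed `<k>1</k>` elements count as the same member.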
deep-union#2
declare function set:deep-union(
$s1 as ,
$s2 as
) as item()*
Returns the union between two sets, using the deep-equal() function to compare the XML nodes from the sets.
Example usage : deep-union ( ( "a", "b", "c") , ( "a", "a",
The function invocation in the example above returns : ("a", "b", "c",
Parameters
$s1: The first set.
$s2: The second set.
Returns
item()*: The union of both sets.
Examples
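To illustrate why a deep-equal()-based union differs from the built-in `union` (`|`) operator, which compares nodes by identity, here is a small sketch. The helper `local:deep-union` is hypothetical and only approximates the behaviour described above.

```xquery
xquery version "1.0";

(: Hypothetical helper, not the module's implementation. :)
declare function local:deep-union($s1 as item()*, $s2 as item()*) as item()* {
  let $all := ($s1, $s2)
  for $x at $i in $all
  (: keep only the first occurrence of each deep-equal item :)
  where not(some $p in subsequence($all, 1, $i - 1) satisfies deep-equal($x, $p))
  return $x
};

let $a := <x>1</x>
let $b := <x>1</x>
return (
  count($a | $b),                   (: 2: two distinct node identities :)
  count(local:deep-union($a, $b))   (: 1: the two elements are deep-equal :)
)
```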
dice#2
declare function set:dice(
$s1 as ,
$s2 as
) as xs:double
Returns the Dice similarity coefficient between two sets of XML nodes.
The Dice coefficient is defined as twice the shared information between the input sets (i.e., the size of the intersection) over the sum of the cardinalities of the input sets.
Example usage : dice ( ( "a", "b",
The function invocation in the example above returns : 0.4
Parameters
$s1: The first set.
$s2: The second set.
Returns
xs:double: The Dice similarity coefficient between the two sets.
Examples
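The definition above, 2·|A ∩ B| / (|A| + |B|), can be worked through by hand in plain XQuery 1.0. This sketch is restricted to atomic values, so `distinct-values()` stands in for the module's deep-equal()-based comparison. The inputs are invented and contain no duplicates, so the result does not depend on whether cardinalities are counted before or after deduplication.

```xquery
xquery version "1.0";

(: Illustrative only: atomic values, no module import needed. :)
let $s1 := ("a", "b", "c")
let $s2 := ("b", "c", "d", "e", "f")
let $d1 := distinct-values($s1)
let $d2 := distinct-values($s2)
let $shared := count($d1[. = $d2])   (: size of the intersection: 2 :)
return (2 * $shared) div (count($d1) + count($d2))   (: 4 div 8 = 0.5 :)
```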
distinct#1
declare function set:distinct(
$s as
) as item()*
Removes exact duplicates from a set, using the deep-equal() function to compare the XML nodes from the set.
Example usage : distinct ( ( "a", "a", ) )
The function invocation in the example above returns : ("a", )
Parameters
$s: A set.
Returns
item()*: The input set without exact duplicates (i.e., the distinct nodes from the input set).
Examples
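One way to picture deep-equal()-based deduplication is the sketch below. The helper `local:distinct` is hypothetical, and keeping the first occurrence of each item is an assumption. Unlike the built-in `distinct-values()`, which atomizes its input, this version returns the nodes themselves.

```xquery
xquery version "1.0";

(: Hypothetical helper, not the module's implementation. :)
declare function local:distinct($s as item()*) as item()* {
  for $x at $i in $s
  (: drop $x if a deep-equal item appeared earlier in the sequence :)
  where not(some $p in subsequence($s, 1, $i - 1) satisfies deep-equal($x, $p))
  return $x
};

(: Yields the element <k>1</k> followed by the string "a". :)
local:distinct((<k>1</k>, <k>1</k>, "a", "a"))
```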
jaccard#2
declare function set:jaccard(
$s1 as ,
$s2 as
) as xs:double
Returns the Jaccard similarity coefficient between two sets of XML nodes.
The Jaccard coefficient is defined as the size of the intersection divided by the size of the union of the input sets.
Example usage : jaccard ( ( "a", "b",
The function invocation in the example above returns : 0.25
Parameters
$s1: The first set.
$s2: The second set.
Returns
xs:double: The Jaccard similarity coefficient between the two sets.
Examples
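The definition above, |A ∩ B| / |A ∪ B|, can be checked by hand with a short plain XQuery 1.0 sketch. It is restricted to atomic values, so `distinct-values()` replaces the deep-equal()-based comparison, and the inputs are invented for illustration.

```xquery
xquery version "1.0";

(: Illustrative only: atomic values, no module import needed. :)
let $s1 := ("a", "b", "c")
let $s2 := ("b", "c", "d")
let $d1 := distinct-values($s1)
let $d2 := distinct-values($s2)
let $intersection := count($d1[. = $d2])           (: b, c: 2 :)
let $union := count(distinct-values(($s1, $s2)))   (: a, b, c, d: 4 :)
return $intersection div $union                    (: 2 div 4 = 0.5 :)
```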
overlap#2
declare function set:overlap(
$s1 as ,
$s2 as
) as xs:double
Returns the overlap coefficient between two sets of XML nodes.
The overlap coefficient is defined as the shared information between the input sets (i.e., the size of the intersection) over the size of the smaller input set.
Example usage : overlap ( ( "a", "b",
The function invocation in the example above returns : 1.0
Parameters
$s1: The first set.
$s2: The second set.
Returns
xs:double: The overlap coefficient between the two sets.
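The definition above, |A ∩ B| / min(|A|, |B|), can be worked through with a short plain XQuery 1.0 sketch. It is restricted to atomic values with invented inputs, and it does not guard against an empty input set, which would cause a division by zero.

```xquery
xquery version "1.0";

(: Illustrative only: atomic values, no module import needed. :)
let $s1 := ("a", "b", "c", "d")
let $s2 := ("b", "c")
let $d1 := distinct-values($s1)
let $d2 := distinct-values($s2)
let $shared := count($d1[. = $d2])   (: size of the intersection: 2 :)
return $shared div min((count($d1), count($d2)))   (: 2 div 2 = 1 :)
```

When one set is entirely contained in the other, as here, the coefficient reaches its maximum of 1 regardless of how much larger the other set is.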