http://www.zorba-xquery.com/modules/data-cleaning/set-similarity

Description

Before using any of the functions below please remember to import the module namespace:

import module namespace set = "http://www.zorba-xquery.com/modules/data-cleaning/set-similarity";

This library module provides similarity functions for comparing sets of XML nodes (e.g., sets of XML elements, attributes or atomic values). These functions are particularly useful for matching near-duplicate sets of XML nodes. The logic contained in this module is not specific to any particular XQuery implementation.

Author

Bruno Martins

XQuery version and encoding

xquery version "1.0" encoding "utf-8";

Namespaces

set    http://www.zorba-xquery.com/modules/data-cleaning/set-similarity
ver    http://www.zorba-xquery.com/options/versioning

Function Summary

deep-intersect($s1, $s2) as item()*

Returns the intersection between two sets, using the deep-equal() function to compare the XML nodes from the sets.

deep-union($s1, $s2) as item()*

Returns the union between two sets, using the deep-equal() function to compare the XML nodes from the sets.

dice($s1, $s2) as xs:double

Returns the Dice similarity coefficient between two sets of XML nodes.

distinct($s) as item()*

Removes exact duplicates from a set, using the deep-equal() function to compare the XML nodes from the set.

jaccard($s1, $s2) as xs:double

Returns the Jaccard similarity coefficient between two sets of XML nodes.

overlap($s1, $s2) as xs:double

Returns the overlap coefficient between two sets of XML nodes.

Functions

deep-intersect#2

declare function set:deep-intersect(
    $s1 as ,
    $s2 as 
) as item()*

Returns the intersection between two sets, using the deep-equal() function to compare the XML nodes from the sets.
Example usage : deep-intersect ( ( "a", "b", "c") , ( "a", "a", ) )
The function invocation in the example above returns : ("a")

Parameters

  • $s1 as
    The first set.
  • $s2 as
    The second set.

Returns

  • item()*

    The intersection of both sets.

Examples
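
The intersection described above can be sketched in plain XQuery 1.0. The function local:deep-intersect below is a hypothetical stand-in written for illustration, not the module's source; it assumes that an item belongs to the intersection when some item of the second set is deep-equal() to it, and that each matching item is reported once.

```xquery
xquery version "1.0";

(: Hypothetical sketch, not the module's source. :)
(: Keeps each item of $s1 that has a deep-equal() match in $s2,
   skipping items already seen earlier in $s1. :)
declare function local:deep-intersect($s1 as item()*, $s2 as item()*) as item()* {
  for $a at $i in $s1
  where (some $b in $s2 satisfies deep-equal($a, $b))
    and (every $p in subsequence($s1, 1, $i - 1)
         satisfies not(deep-equal($a, $p)))
  return $a
};

local:deep-intersect(("a", "b", <x/>), ("a", "a", <x/>))
```

Under these assumptions the query yields the string "a" followed by the element <x/>; "b" is dropped because nothing in the second sequence is deep-equal to it.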

deep-union#2

declare function set:deep-union(
    $s1 as ,
    $s2 as 
) as item()*

Returns the union between two sets, using the deep-equal() function to compare the XML nodes from the sets.
Example usage : deep-union ( ( "a", "b", "c") , ( "a", "a", ) )
The function invocation in the example above returns : ("a", "b", "c", )

Parameters

  • $s1 as
    The first set.
  • $s2 as
    The second set.

Returns

  • item()*

    The union of both sets.

Examples
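
A minimal sketch of the union described above, in plain XQuery 1.0. The name local:deep-union is hypothetical and the body is an illustration rather than the module's source: it concatenates both inputs and keeps the first occurrence of every item, using deep-equal() for the comparison.

```xquery
xquery version "1.0";

(: Hypothetical sketch, not the module's source. :)
(: Concatenates both sequences and keeps the first occurrence of
   each item, compared with deep-equal(). :)
declare function local:deep-union($s1 as item()*, $s2 as item()*) as item()* {
  let $s := ($s1, $s2)
  for $a at $i in $s
  where every $p in subsequence($s, 1, $i - 1)
        satisfies not(deep-equal($a, $p))
  return $a
};

local:deep-union(("a", "b"), ("a", <x/>))
```

Under these assumptions the query yields "a", "b" and the element <x/>; the second "a" is discarded as a duplicate.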

dice#2

declare function set:dice(
    $s1 as ,
    $s2 as 
) as xs:double

Returns the Dice similarity coefficient between two sets of XML nodes. The Dice coefficient is defined as twice the shared information between the input sets (i.e., the size of the intersection) over the sum of the cardinalities of the input sets.
Example usage : dice ( ( "a", "b", ) , ( "a", "a", "d") )
The function invocation in the example above returns : 0.4

Parameters

  • $s1 as
    The first set.
  • $s2 as
    The second set.

Returns

  • xs:double

    The Dice similarity coefficient between the two sets.

Examples
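
The definition above (twice the size of the intersection over the sum of the set sizes) can be worked through with a small self-contained sketch. All local: functions are hypothetical helpers, not the module's source, and the sketch assumes that duplicates are removed from each input before counting.

```xquery
xquery version "1.0";

(: Hypothetical helpers for illustration; not the module's source. :)

(: Keeps the first occurrence of each item, compared with deep-equal(). :)
declare function local:distinct($s as item()*) as item()* {
  for $a at $i in $s
  where every $p in subsequence($s, 1, $i - 1)
        satisfies not(deep-equal($a, $p))
  return $a
};

(: Counts the distinct items of $s1 that also occur in $s2. :)
declare function local:shared($s1 as item()*, $s2 as item()*) as xs:integer {
  count(
    for $a in local:distinct($s1)
    where some $b in $s2 satisfies deep-equal($a, $b)
    return $a
  )
};

(: Dice: twice the intersection size over the sum of both set sizes. :)
declare function local:dice($s1 as item()*, $s2 as item()*) as xs:double {
  xs:double(2 * local:shared($s1, $s2))
    div (count(local:distinct($s1)) + count(local:distinct($s2)))
};

local:dice(("a", "b", "c"), ("a", "a", "d"))
```

Here the first set has 3 distinct items, the second has 2, and they share 1, so the sketch evaluates to 2 * 1 / (3 + 2) = 0.4.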

distinct#1

declare function set:distinct(
    $s as 
) as item()*

Removes exact duplicates from a set, using the deep-equal() function to compare the XML nodes from the set.
Example usage : distinct ( ( "a", "a", ) )
The function invocation in the example above returns : ("a", )

Parameters

  • $s as
    A set.

Returns

  • item()*

    The set provided as input without the exact duplicates (i.e., returns the distinct nodes from the set provided as input).

Examples
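
Duplicate removal based on deep-equal() can be sketched as follows. The function local:distinct is a hypothetical illustration, not the module's source; it assumes the first occurrence of each item is the one that is kept.

```xquery
xquery version "1.0";

(: Hypothetical sketch, not the module's source. :)
(: Keeps an item only if no earlier item in $s is deep-equal() to it. :)
declare function local:distinct($s as item()*) as item()* {
  for $a at $i in $s
  where every $p in subsequence($s, 1, $i - 1)
        satisfies not(deep-equal($a, $p))
  return $a
};

local:distinct(("a", "a", <x/>, <x/>, <y/>))
```

Under these assumptions the query yields "a", <x/> and <y/>. The two <x/> elements are distinct nodes, but deep-equal() compares them by structure and content, so only the first survives.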

jaccard#2

declare function set:jaccard(
    $s1 as ,
    $s2 as 
) as xs:double

Returns the Jaccard similarity coefficient between two sets of XML nodes. The Jaccard coefficient is defined as the size of the intersection divided by the size of the union of the input sets.
Example usage : jaccard ( ( "a", "b", ) , ( "a", "a", "d") )
The function invocation in the example above returns : 0.25

Parameters

  • $s1 as
    The first set.
  • $s2 as
    The second set.

Returns

  • xs:double

    The Jaccard similarity coefficient between the two sets.

Examples
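
The definition above (intersection size divided by union size) can be worked through with a small self-contained sketch. All local: functions are hypothetical helpers, not the module's source; the sketch assumes duplicates are removed before counting and computes the union size as the sum of both set sizes minus the intersection size.

```xquery
xquery version "1.0";

(: Hypothetical helpers for illustration; not the module's source. :)

(: Keeps the first occurrence of each item, compared with deep-equal(). :)
declare function local:distinct($s as item()*) as item()* {
  for $a at $i in $s
  where every $p in subsequence($s, 1, $i - 1)
        satisfies not(deep-equal($a, $p))
  return $a
};

(: Counts the distinct items of $s1 that also occur in $s2. :)
declare function local:shared($s1 as item()*, $s2 as item()*) as xs:integer {
  count(
    for $a in local:distinct($s1)
    where some $b in $s2 satisfies deep-equal($a, $b)
    return $a
  )
};

(: Jaccard: intersection size over union size. :)
declare function local:jaccard($s1 as item()*, $s2 as item()*) as xs:double {
  let $n := local:shared($s1, $s2)
  let $union := count(local:distinct($s1)) + count(local:distinct($s2)) - $n
  return xs:double($n) div $union
};

local:jaccard(("a", "b", "c"), ("a", "a", "d"))
```

Here the sets share 1 item and their union has 3 + 2 - 1 = 4 items, so the sketch evaluates to 1 / 4 = 0.25.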

overlap#2

declare function set:overlap(
    $s1 as ,
    $s2 as 
) as xs:double

Returns the overlap coefficient between two sets of XML nodes. The overlap coefficient is defined as the shared information between the input sets (i.e., the size of the intersection) over the size of the smaller input set.
Example usage : overlap ( ( "a", "b", ) , ( "a", "a", "b" ) )
The function invocation in the example above returns : 1.0

Parameters

  • $s1 as
    The first set.
  • $s2 as
    The second set.

Returns

  • xs:double

    The overlap coefficient between the two sets.

Examples
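
The definition above (intersection size over the size of the smaller set) can be worked through with a small self-contained sketch. All local: functions are hypothetical helpers, not the module's source, and the sketch assumes duplicates are removed from each input before counting.

```xquery
xquery version "1.0";

(: Hypothetical helpers for illustration; not the module's source. :)

(: Keeps the first occurrence of each item, compared with deep-equal(). :)
declare function local:distinct($s as item()*) as item()* {
  for $a at $i in $s
  where every $p in subsequence($s, 1, $i - 1)
        satisfies not(deep-equal($a, $p))
  return $a
};

(: Counts the distinct items of $s1 that also occur in $s2. :)
declare function local:shared($s1 as item()*, $s2 as item()*) as xs:integer {
  count(
    for $a in local:distinct($s1)
    where some $b in $s2 satisfies deep-equal($a, $b)
    return $a
  )
};

(: Overlap: intersection size over the size of the smaller set. :)
declare function local:overlap($s1 as item()*, $s2 as item()*) as xs:double {
  xs:double(local:shared($s1, $s2))
    div min((count(local:distinct($s1)), count(local:distinct($s2))))
};

local:overlap(("a", "b", "c"), ("a", "a", "b"))
```

Here the second set reduces to 2 distinct items, both of which occur in the first, so the sketch evaluates to 2 / min(3, 2) = 1 as an xs:double.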
