Skip to content

RFC: Define (==) for SortedSets#264

Open
kmsquire wants to merge 2 commits intomasterfrom
kms/fix/sorted_set_equality
Open

RFC: Define (==) for SortedSets#264
kmsquire wants to merge 2 commits intomasterfrom
kms/fix/sorted_set_equality

Conversation

@kmsquire
Copy link
Copy Markdown
Member

@kmsquire kmsquire commented Jan 29, 2017

  • Based on definition for Base.Set
  • Also defined <, <=, ⊊, and ⊈ (also based on Base.Set

@kmsquire
Copy link
Copy Markdown
Member Author

@StephenVavasis, here are a few definitions to bring more parity with Base.Set.

@codecov-io
Copy link
Copy Markdown

codecov-io commented Jan 29, 2017

Codecov Report

Merging #264 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master     #264      +/-   ##
==========================================
+ Coverage   95.86%   95.86%   +<.01%     
==========================================
  Files          30       30              
  Lines        2223     2493     +270     
==========================================
+ Hits         2131     2390     +259     
- Misses         92      103      +11
Impacted Files Coverage Δ
src/sorted_set.jl 94.31% <100%> (+0.16%)
src/heaps.jl 100% <ø> (ø)
src/accumulator.jl 100% <ø> (ø)
src/circular_buffer.jl 100% <ø> (ø)
src/tokens.jl 100% <ø> (ø)
src/multi_dict.jl 96.66% <ø> (ø)
src/heaps/mutable_binary_heap.jl 93.16% <ø> (+0.5%)
src/sorted_dict.jl 98.94% <ø> (+0.62%)
src/default_dict.jl 86.15% <ø> (+0.78%)
src/ordered_dict.jl 88.48% <ø> (+1.51%)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 605d470...36e88ec. Read the comment docs.

@StephenVavasis
Copy link
Copy Markdown
Contributor

Yes, this looks good. Please go ahead and merge.

@KristofferC
Copy link
Copy Markdown
Contributor

Is a corresponding hash function needed?

@StephenVavasis
Copy link
Copy Markdown
Contributor

Actually, now that I think about this, there are some pitfalls here that could trip up a user.

Suppose S = {"ABC", "abc", and "def"}, and the usual lex comparison is used to define S.

Suppose that T = {"ABC", "DEF", and "GHI"} and a case-insensitive lex comparison is used to define T.

In this somewhat pathological example, the test S==T that appears in this pull request will return true because the two sets have the same length, and each item of S appears inside T according to the order-object used for T, so S is a subset of T.

The sorted containers have the feature of a user-specified ordering of keys, which means that two keys can be equal for the purpose of the container even though they are unequal according to the rest of Julia. This means that the sorted containers do not necessarily play nicely with other aspects of Julia.

@kmsquire
Copy link
Copy Markdown
Member Author

kmsquire commented Jan 30, 2017

@StephenVavasis, thanks for the review and example!

It is complicated when dealing with something like case-insensitivity (or other criteria where two keys can be equal according to the ordering, but not with ==).

A few thoughts:

  • Alternative data structures (not yet implemented) that might help clarify some situations. For example, a SortedHashSet might use a tree for traversal, and a hash table for testing membership. (But there still may be ambiguous situations.)
  • It would be useful to have these functions work properly in the common case, and either have reasonable defaults, or fail fast for edge cases.
  • For example, when working with two SortedSet
    1. We could require that the Orderings be identical
    2. We could detect when Orderings are not identical, and check key equality more strictly
    3. We could give a reasonable default behavior, but suggest to the user how to redefine (==) for specific types if they want different behavior.

I'm leaning towards ii. and iii.

@KristofferC, I guess a hash function should also be defined. We should probably define a FrozenSet (and FrozenDict, etc.) at some point, to encourage users not to mutate the set or dict after inserting it in another set or dict.

@StephenVavasis
Copy link
Copy Markdown
Contributor

I think (ii) and (iii) should be fine. I think the following will work: The implementation of == should first compare order objects; if they are equal then the current code is fine. If they are unequal, then the two sorted sets should be converted to Set (I think the built-in constructor in set.jl will handle this without change since it can already build a Set from an iterable), and then the two resulting Base.Set items should be compared.

There is already an isequal() function inside sorted_set.jl. I don't recall exactly what is the default relationship between == and isequal(), but in any case, we should make them consistent.

With regard to hashing a SortedDict, there is no way to be consistent with both (ii) and (iii). However, attaining consistency with (iii) is relatively straightforward, namely, first convert the SortedSet to a Set, and then hash the resulting Set.

On a different note: when I first wrote SortedSet, there was no AbstractSet available, but now I see that it is available in 0.5. So SortedSet should really be a subtype of AbstractSet. In 0.5 there don't seem to be any default methods for AbstractSet, but maybe these are coming in future releases.

@kmsquire
Copy link
Copy Markdown
Member Author

kmsquire commented Feb 1, 2017

I was just reading up on the ]C++ behavior with regard to sorted containers [1,2]. Java seems to

There, comparison in sorted containers (e.g., standard sets and maps) is entirely based on equivalence, rather than equality, which, using Julia terms, means only using the Ordering, and testing !(isless(a,b) || isless(b,a)).

I'm wondering if we should make the same choice. Java seems to have done so.

[1] http://stackoverflow.com/questions/8217588/equality-evaluation-in-associative-containers-stl
[2] Effective STL, by Scott Meyers, Item 19

@StephenVavasis
Copy link
Copy Markdown
Contributor

Kevin, what you have described is already how the sorted containers work. However, it doesn't seem to clarify how to deal with == between two distinct sorted containers, or how to hash the sorted containers.

@kmsquire
Copy link
Copy Markdown
Member Author

kmsquire commented Feb 2, 2017

@StephenVavasis, I was thinking that this would provide an alternative to the current implementation (of equality). But I think you're right, I don't think it clarifies much, if anything.

@kmsquire kmsquire force-pushed the kms/fix/sorted_set_equality branch from 08fa8b4 to da5c2ef Compare February 2, 2017 08:27
@kmsquire
Copy link
Copy Markdown
Member Author

kmsquire commented Feb 3, 2017

@StephenVavasis wrote:

The implementation of == should first compare order objects; if they are equal then the current code is fine. If they are unequal, then the two sorted sets should be converted to Set [...], and then the two resulting Base.Set items should be compared.

This means that

  1. When comparing two SortedSets with the same Ordering, the comparison will use the Ordering to determine "equality" (==equivalence).
  2. When comparing SortedSets with different Orderings, (or a SortedSet and a Set), the comparison will use hash equality.

Some ideas:

  1. For consistency, when comparing two SortedSets with different Orderings, we could insist on equivalence.
    • This would require a different test than the one copied in this PR from Base (basically, we would need to iterate over the zip(set1, set2) and ensure all elements were equivalent, according to both Orderings)
  2. If we do insist on 1., then comparing, say, a Set and an OrderedSet could require both equivalence and hash equality (except that the "equivalence" check is then redundant). Or we could just make them incomparable by throwing an error.
  3. Contrarily, issubset should consider only the "set" component of OrderedSet, which would allow asking if a Set were a subset of a SortedSet (and vice versa).

Thoughts?

@kmsquire kmsquire force-pushed the kms/fix/sorted_set_equality branch from da5c2ef to 041dabd Compare February 5, 2017 06:42
* Based on definition for base.Set
* Also defined <, <=, ⊊, and ⊈
@kmsquire kmsquire force-pushed the kms/fix/sorted_set_equality branch from 041dabd to 36e88ec Compare February 17, 2017 05:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants