Add SortedVector type and tests#290
Add SortedVector type and tests#290dpsanders wants to merge 5 commits intoJuliaCollections:masterfrom
Conversation
|
This functionality is already available in the SortedSet container. Furthermore, SortedSet will in general be more efficient because insertion and deletion to SortedSet require O(log n) operations, whereas insertion and deletion in SortedVector require O(n) operations. In addition, Base already contains the functions: issorted, searchsorted_first, and searchsorted_last to deal with sorted vectors. If you believe that, despite what is already available. there is a use-case for the new SortedVector, then possibly you could document the use-case. Also, you might consider following the same interface provided by SortedSet and the other sorted containers. For SortedSet, the second type parameter is an order object rather than a function. |
|
Thanks, I see your point. One possible answer is the overhead for storing the set, e.g. julia> s = SortedSet([7, 5, 1])
SortedSet([1, 5, 7],
Base.Order.ForwardOrdering())
julia> Base.summarysize(s)
424
julia> v = SortedVector([7, 5, 1])
SortedVector([1, 5, 7])
julia> Base.summarysize(v)
32I understand that the rotations etc. needed to balance the set can be expensive, but it's not very clear to me how to benchmark things like that which modify the structure being tested. |
|
Also, in my application, I will need to have multiple values that compare as being equal (with a non-standard comparison that only compares the first element of a tuple). This doesn't seem to be possible using |
|
My previous statement that SortedSet implements the same functionality as SortedVector is incorrect-- SortedSet{K} does not allow multiple copies of the same data item whereas SortedVector{K} allows multiplicities. Rather, SortedVector{K} could be implemented as SortedDict{K,Int}, where the Int value field contains the multiplicity. If you'd like a reasonable benchmark for a sorted set that includes insertion and/or deletion, you could code up the Sieve of Eratosthenes algorithm for finding all the prime numbers between 1 and, say, 10^7 to compare SortedSet to SortedVector. |
|
I can't see from the docs how to use a non-standard ordering, e.g. |
|
If you want a structure that will preserve multiple insertions of keys that are distinct under == but test as equals under the sort order, then your best bet would be SortedMultiDict{K,Void}. The SortedMultiDict container allows multiple keys that test as equal by the sort order. You can implement a test based on the first entry either using the Lt type from Base and then use |
|
Thanks. One of the reasons I need a sorted vector or related is that I will frequently need to remove all of those entries that are larger than a certain value. Is that functionality available / easy to implement in |
|
In principle, it is possible to chop off an entire subtree of the tree and then rebalance, but it would be a lot of code to write. (Have a look at the |
|
OK thanks. I guess that would be a high complexity operation? (O(number deleted items) or similar?) |
|
Currently, to delete k items from a sorted multidict would require O(k*log n) operations. If new special-purpose code were written for deleting a block of k consecutive items, then I think this would require O(k + log n) operations. To delete the last k items from a vector requires only O(1) time (since it corresponds to resize). |
Just for curiosity: if you run those tests, but use |
|
I guess the slow part is moving the data around, which
|
Codecov Report
@@ Coverage Diff @@
## master #290 +/- ##
==========================================
- Coverage 95.9% 95.36% -0.54%
==========================================
Files 31 32 +1
Lines 2244 2265 +21
==========================================
+ Hits 2152 2160 +8
- Misses 92 105 +13
Continue to review full report at Codecov.
|
kmsquire
left a comment
There was a problem hiding this comment.
I made a few comments about a couple of minor issues.
There has been some discussion about having the output of sort() be a SortedVector type, so this is a nice start to testing that idea!
Depending on how it's used, there are probably more efficient representations as the size grows large (e.g., something like the blocked implementation of Deques here).
Anyway, after fixing the issues I mentioned, I think this is nice to have. Improvements could come as someone needs them or has time to implement them.
| by::F | ||
|
|
||
| function SortedVector(data::Vector{T}, by::F) | ||
| new(sort(data), by) |
There was a problem hiding this comment.
I think you need to pass by to sort() here.
Should also pass through rev and alg as parameters to sort().
There was a problem hiding this comment.
(And for that matter, rev could/should probably be a separate field here as well)
There was a problem hiding this comment.
Good catch, thanks.
rev seems to be redundant with by.
There was a problem hiding this comment.
rev isn't redundant with by: sort takes both arguments. Would be better to make them keyword arguments, to match sort.
It would also be useful to provide a way to sort the data in-place, and to declare that the input is already sorted. Not sure what's the best way to do this; maybe just a sorted keyword argument.
There was a problem hiding this comment.
I think the constructor shouldn't call sort. You end up needing SortedVector(v, dont_sort=true), instead of simply writing SortedVector(v), SortedVector(sort(v)), SortedVector(sort!(v)) as needed.
To support different orderings, SortedVector should have a Base.Order.Ordering type parameter.
There was a problem hiding this comment.
If the constructor doesn't call sort, you can end up with a non-sorted SortedVector. Or am I missing something? Or just call issorted first to see if it needs sorting? (But is that much faster than just calling sort if it's already sorted?)
There was a problem hiding this comment.
Ah, I understand the point. But it should definitely not be possible to create the object without any sorting at all.
|
|
||
|
|
||
| getindex(v::SortedVector, i::Int) = v.data[i] | ||
| length(v::SortedVector) = length(v.data) |
There was a problem hiding this comment.
These and a few functions below can be simplified using @delegate
| push!(v, (1, "x")) | ||
| @test v.data == [(1, "x"), (2, "b"), (3, "z"), (4, "y"), (5, "a"), (6, "z")] | ||
|
|
||
| end |
There was a problem hiding this comment.
Please add a testset using
v = SortedVector([1,2,3], x->-x)There was a problem hiding this comment.
Should also test all methods defined for the type. Can you also test that an error is thrown when calling setindex!?
|
Cc: @nalimilan |
|
Thanks for Ccing me, @kmsquire. Indeed However, I'm not sure it would be OK to deprecate |
nalimilan
left a comment
There was a problem hiding this comment.
It would also be useful to define issorted, IndexStyle, and maybe other functions.
| by::F | ||
|
|
||
| function SortedVector(data::Vector{T}, by::F) | ||
| new(sort(data), by) |
There was a problem hiding this comment.
rev isn't redundant with by: sort takes both arguments. Would be better to make them keyword arguments, to match sort.
It would also be useful to provide a way to sort the data in-place, and to declare that the input is already sorted. Not sure what's the best way to do this; maybe just a sorted keyword argument.
| A `SortedVector` behaves like a standard Julia `Vector`, except that its elements are stored in sorted order, with an optional function `by` that determines the sorting order in the same way as `Base.searchsorted`. | ||
| """ | ||
| immutable SortedVector{T, F<:Function} | ||
| data::Vector{T} |
There was a problem hiding this comment.
Should use AbstractVector everywhere (here, via a type parameter).
There was a problem hiding this comment.
What is the use case for using an AbstractVector?
There was a problem hiding this comment.
I don't know, one could use a DataArray, a CategoricalArray... Is there any reason not to support it?
There was a problem hiding this comment.
I see. I was just worried that there are some subtypes of AbstractVector that wouldn't support the necessary operations.
There was a problem hiding this comment.
AFAICT you only use functions in the AbstractArray interface, right? Anyway, even if a method fails for some peculiar types, it cannot be worse than if you only accept Vector.
There was a problem hiding this comment.
If I understand correctly, there will be a big performance hit for making the inner container an AbstractVector. The compiler won't be able to ascertain the actual type of the inner container at compile-time so there will be a lot more run-time dispatching. To avoid this problem, you would need to make both the data type and the inner container type parameters of SortedVector.
There was a problem hiding this comment.
Making the container type a parameter is what @nalimilan was suggesting.
| @@ -0,0 +1,52 @@ | |||
| """ | |||
| A `SortedVector` behaves like a standard Julia `Vector`, except that its elements are stored in sorted order, with an optional function `by` that determines the sorting order in the same way as `Base.searchsorted`. | |||
There was a problem hiding this comment.
Wrap at 92 chars. Also, better refer to sort than to searchsorted. The docstring should also mention explicitly what happens on insert.
| length(v::SortedVector) = length(v.data) | ||
|
|
||
|
|
||
| function insert!{T}(v::SortedVector{T}, i::Int, x::T) |
There was a problem hiding this comment.
Shouldn't use ::T, just leave push! do the conversion automatically if possible. Same below.
Though this method should probably throw an error, since it doesn't respect the contract of inserting x at position i.
|
|
||
|
|
||
| function insert!{T}(v::SortedVector{T}, i::Int, x::T) | ||
| push!(v.data, i, x) |
|
|
||
| include("priorityqueue.jl") | ||
|
|
||
| include("sorted_vector.jl") |
There was a problem hiding this comment.
Have a look at the other file names in the package.
There was a problem hiding this comment.
Funny, my comment was based on priorityqueue.jl, which is the only example shown in the diff, and apparently also the only exception.
| push!(v, (1, "x")) | ||
| @test v.data == [(1, "x"), (2, "b"), (3, "z"), (4, "y"), (5, "a"), (6, "z")] | ||
|
|
||
| end |
There was a problem hiding this comment.
Should also test all methods defined for the type. Can you also test that an error is thrown when calling setindex!?
|
|
||
|
|
||
| SortedVector{T,F}(data::Vector{T}, by::F) = SortedVector{T,F}(data, by) | ||
| SortedVector{T}(data::Vector{T}) = SortedVector{T,typeof(identity)}(data, identity) |
There was a problem hiding this comment.
You can get rid of this function by adding by=identity to the previous one.
|
|
||
| function push!{T}(v::SortedVector{T}, x::T) | ||
| i = searchsortedfirst(v.data, x, by=v.by) | ||
| insert!(v.data, i, x) |
There was a problem hiding this comment.
I believe this should just check that the pushed item is greater than the current last item in the vector. Otherwise it doesn't implement the contract that pop! returns the last thing you push!ed.
|
The documentation for pop! says that it returns the last element in the structure, not the last element added to the structure. To me push! just means "add this element to the data structure"; I don't see why it should be in the last place. |
|
Many thanks for the detailed review @nalimilan. I don't think I'll have time to do all that any time soon though. |
|
@nalimilan said
I agree. I had a longer response with a more detailed proposal for what specifically to do with |
|
bump |
|
Actually, |
|
Sorry, dropped the ball on this one. |
|
I am interested in reviving this, since it would be nice to have a vector-like container that simply indicates that the contents are sorted. However, from the long discussion above it seems there was some resistance to the idea. I wonder if the package maintainers would indicate under what conditions they would consider merging something like this, otherwise I can just start another package for it. |
|
Made a package for this, comments welcome: |
Since I needed a SortedVector type, here it is.
The hardest part was writing the binary search.
A simpler implementation would be to just to push the new element onto the end and then use
sort!, but that seemed to be slightly slower in my not-extensive-nor-rigorous tests.