Niklas Hambüchen | 14 Jul 13:20 2013

ordNub

tldr: nub is abnormally slow, we shouldn't use it, but we do.

As you might know, Data.List.nub is O(n²). (*)

As you might not know, almost *all* practical Haskell projects use it,
and that in places where an Ord instance is given, e.g. happy, Xmonad,
ghc-mod, Agda, darcs, QuickCheck, yesod, shake, Cabal, haddock, and 600
more (see https://github.com/nh2/haskell-ordnub).

I've taken the Ord-based O(n * log n) implementation from yi using a Set:

  ordNub :: (Ord a) => [a] -> [a]
  ordNub l = go empty l
    where
      go _ []     = []
      go s (x:xs) = if x `member` s then go s xs
                                    else x : go (insert x s) xs

and put benchmarks on
http://htmlpreview.github.io/?https://github.com/nh2/haskell-ordnub/blob/1f0a2c94a/report.html
(compare `nub` vs `ordNub`).

`ordNub` is not only in a different complexity class, but even seems to
perform better than nub for very small numbers of actually different
list elements (that's the numbers before the benchmark names).

(The benchmark also shows some other potential problem: Using a state
monad to keep the set instead of a function argument can be up to 20
times slower. Should that happen?)

(Continue reading)

Clark Gaebel | 14 Jul 13:31 2013
Picon
Picon

Re: ordNub

Similarly, I've always used:

import qualified Data.HashSet as S

nub :: Hashable a => [a] -> [a]
nub = S.toList . S.fromList

And i can't think of any type which i can't write a Hashable instance, so this is extremely practical.

On Jul 14, 2013 7:24 AM, "Niklas Hambüchen" <mail <at> nh2.me> wrote:
tldr: nub is abnormally slow, we shouldn't use it, but we do.


As you might know, Data.List.nub is O(n²). (*)

As you might not know, almost *all* practical Haskell projects use it,
and that in places where an Ord instance is given, e.g. happy, Xmonad,
ghc-mod, Agda, darcs, QuickCheck, yesod, shake, Cabal, haddock, and 600
more (see https://github.com/nh2/haskell-ordnub).

I've taken the Ord-based O(n * log n) implementation from yi using a Set:

  ordNub :: (Ord a) => [a] -> [a]
  ordNub l = go empty l
    where
      go _ []     = []
      go s (x:xs) = if x `member` s then go s xs
                                    else x : go (insert x s) xs


and put benchmarks on
http://htmlpreview.github.io/?https://github.com/nh2/haskell-ordnub/blob/1f0a2c94a/report.html
(compare `nub` vs `ordNub`).

`ordNub` is not only in a different complexity class, but even seems to
perform better than nub for very small numbers of actually different
list elements (that's the numbers before the benchmark names).

(The benchmark also shows some other potential problem: Using a state
monad to keep the set instead of a function argument can be up to 20
times slower. Should that happen?)

What do you think about ordNub?

I've seen a proposal from 5 years ago about adding a *sort*Nub function
started by Neil, but it just died.


(*) The mentioned complexity is for the (very common) worst case, in
which the number of different elements in the list grows with the list
(alias you don't have an N element list with always only 5 different
things inside).

_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe <at> haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe
_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe <at> haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe
Niklas Hambüchen | 14 Jul 13:39 2013

Re: ordNub

One of my main points is:

Should we not add such a function (ord-based, same output as nub, 
stable, no sorting) to base?

As the package counting shows, if we don't offer an alternative, people 
obviously use it, and not to our benefit.

(Not to say it this way:
We could make the Haskell world fast with smarter fusion, strictness 
analysis and LLVM backends.
Or we could stop using quadratic algorithms.)
Clark Gaebel | 14 Jul 13:54 2013
Picon
Picon

Re: ordNub

Oops sorry I guess my point wasn't clear.

Why ord based when hashable is faster? Then there's no reason this has to be in base, it can just be a free function in Data.HashSet. If stability is a concern then there's a way to easily account for that using HashMap.

  - Clark

On Jul 14, 2013 7:48 AM, "Niklas Hambüchen" <mail <at> nh2.me> wrote:
One of my main points is:

Should we not add such a function (ord-based, same output as nub,
stable, no sorting) to base?

As the package counting shows, if we don't offer an alternative, people
obviously use it, and not to our benefit.

(Not to say it this way:
We could make the Haskell world fast with smarter fusion, strictness
analysis and LLVM backends.
Or we could stop using quadratic algorithms.)
_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe <at> haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe
Brandon Allbery | 15 Jul 06:46 2013
Picon

Re: ordNub

On Sun, Jul 14, 2013 at 7:54 AM, Clark Gaebel <cgaebel <at> uwaterloo.ca> wrote:

Oops sorry I guess my point wasn't clear.

Why ord based when hashable is faster? Then there's no reason this has to be in base, it can just be a 

Did the point about "stable" fly overhead? 

--
brandon s allbery kf8nh                               sine nomine associates
allbery.b <at> gmail.com                                  ballbery <at> sinenomine.net
unix, openafs, kerberos, infrastructure, xmonad        http://sinenomine.net
_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe <at> haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe
Clark Gaebel | 15 Jul 14:43 2013
Picon
Picon

Re: ordNub

Apologies. I was being lazy. Here's a stable version:

  import qualified Data.HashSet as S

  hashNub :: (Ord a) => [a] -> [a]
  hashNub l = go S.empty l
    where
      go _ []     = []
      go s (x:xs) = if x `S.member` s then go s xs
                                    else x : go (S.insert x s) xs

Which, again, will probably be faster than the one using Ord, and I
can't think of any cases where I'd want the one using Ord instead. I
may just not be creative enough, though.

  - Clark

On Mon, Jul 15, 2013 at 12:46 AM, Brandon Allbery <allbery.b <at> gmail.com> wrote:
> On Sun, Jul 14, 2013 at 7:54 AM, Clark Gaebel <cgaebel <at> uwaterloo.ca> wrote:
>>
>> Oops sorry I guess my point wasn't clear.
>>
>> Why ord based when hashable is faster? Then there's no reason this has to
>> be in base, it can just be a
>
> Did the point about "stable" fly overhead?
>
> --
> brandon s allbery kf8nh                               sine nomine associates
> allbery.b <at> gmail.com                                  ballbery <at> sinenomine.net
> unix, openafs, kerberos, infrastructure, xmonad        http://sinenomine.net
John Lato | 16 Jul 03:46 2013
Picon

Re: ordNub

In my tests, using unordered-containers was slightly slower than using Ord, although as the number of repeated elements grows unordered-containers appears to have an advantage.  I'm sure the relative costs of comparison vs hashing would affect this also.  But both are dramatically better than the current nub.

Has anyone looked at Bart's patches to see how difficult it would be to apply them (or re-write them)?


On Mon, Jul 15, 2013 at 8:43 PM, Clark Gaebel <cgaebel <at> uwaterloo.ca> wrote:
Apologies. I was being lazy. Here's a stable version:

  import qualified Data.HashSet as S

  hashNub :: (Ord a) => [a] -> [a]
  hashNub l = go S.empty l
    where
      go _ []     = []
      go s (x:xs) = if x `S.member` s then go s xs
                                    else x : go (S.insert x s) xs

Which, again, will probably be faster than the one using Ord, and I
can't think of any cases where I'd want the one using Ord instead. I
may just not be creative enough, though.


  - Clark

On Mon, Jul 15, 2013 at 12:46 AM, Brandon Allbery <allbery.b <at> gmail.com> wrote:
> On Sun, Jul 14, 2013 at 7:54 AM, Clark Gaebel <cgaebel <at> uwaterloo.ca> wrote:
>>
>> Oops sorry I guess my point wasn't clear.
>>
>> Why ord based when hashable is faster? Then there's no reason this has to
>> be in base, it can just be a
>
> Did the point about "stable" fly overhead?
>
> --
> brandon s allbery kf8nh                               sine nomine associates
> allbery.b <at> gmail.com                                  ballbery <at> sinenomine.net
> unix, openafs, kerberos, infrastructure, xmonad        http://sinenomine.net

_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe <at> haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe <at> haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe
Ivan Lazar Miljenovic | 16 Jul 04:31 2013
Picon

Re: ordNub

On 16 July 2013 11:46, John Lato <jwlato <at> gmail.com> wrote:
> In my tests, using unordered-containers was slightly slower than using Ord,
> although as the number of repeated elements grows unordered-containers
> appears to have an advantage.  I'm sure the relative costs of comparison vs
> hashing would affect this also.  But both are dramatically better than the
> current nub.
>
> Has anyone looked at Bart's patches to see how difficult it would be to
> apply them (or re-write them)?

If I understand correctly, this function is proposed to be added to
Data.List which lives in base... but the proposals here are about
using either Sets from containers or HashSet from
unordered-containers; I thought base wasn't supposed to depend on any
other package :/

>
>
>
> On Mon, Jul 15, 2013 at 8:43 PM, Clark Gaebel <cgaebel <at> uwaterloo.ca> wrote:
>>
>> Apologies. I was being lazy. Here's a stable version:
>>
>>   import qualified Data.HashSet as S
>>
>>   hashNub :: (Ord a) => [a] -> [a]
>>   hashNub l = go S.empty l
>>     where
>>       go _ []     = []
>>       go s (x:xs) = if x `S.member` s then go s xs
>>                                     else x : go (S.insert x s) xs
>>
>> Which, again, will probably be faster than the one using Ord, and I
>> can't think of any cases where I'd want the one using Ord instead. I
>> may just not be creative enough, though.
>>
>>
>>   - Clark
>>
>> On Mon, Jul 15, 2013 at 12:46 AM, Brandon Allbery <allbery.b <at> gmail.com>
>> wrote:
>> > On Sun, Jul 14, 2013 at 7:54 AM, Clark Gaebel <cgaebel <at> uwaterloo.ca>
>> > wrote:
>> >>
>> >> Oops sorry I guess my point wasn't clear.
>> >>
>> >> Why ord based when hashable is faster? Then there's no reason this has
>> >> to
>> >> be in base, it can just be a
>> >
>> > Did the point about "stable" fly overhead?
>> >
>> > --
>> > brandon s allbery kf8nh                               sine nomine
>> > associates
>> > allbery.b <at> gmail.com
>> > ballbery <at> sinenomine.net
>> > unix, openafs, kerberos, infrastructure, xmonad
>> > http://sinenomine.net
>>
>> _______________________________________________
>> Haskell-Cafe mailing list
>> Haskell-Cafe <at> haskell.org
>> http://www.haskell.org/mailman/listinfo/haskell-cafe
>
>
>
> _______________________________________________
> Haskell-Cafe mailing list
> Haskell-Cafe <at> haskell.org
> http://www.haskell.org/mailman/listinfo/haskell-cafe
>

--

-- 
Ivan Lazar Miljenovic
Ivan.Miljenovic <at> gmail.com
http://IvanMiljenovic.wordpress.com
Francesco Mazzoli | 14 Jul 13:45 2013
Picon

Re: ordNub

At Sun, 14 Jul 2013 07:31:05 -0400,
Clark Gaebel wrote:
> Similarly, I've always used:
> 
> import qualified Data.HashSet as S
> 
> nub :: Hashable a => [a] -> [a]
> nub = S.toList . S.fromList
> 
> And i can't think of any type which i can't write a Hashable instance, so
> this is extremely practical.

Well, the above is not stable while Niklas’ is.  But I guess that’s not
the point of your message :).

I’ve always avoided “nub” too, and FWIW I’d like a constrained version
too—maybe avoiding Data.Set so that it could live in Data.List.  I think
Ord would be much better than Hashable, since it is 1. in “base” 2. much
more established and understood.

Although if you find yourself using “nub” too much you’re probably doing
something wrong...

Francesco

_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe <at> haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe
Ketil Malde | 16 Jul 14:46 2013

Re: ordNub


Francesco Mazzoli <f <at> mazzo.li> writes:

>> import qualified Data.HashSet as S
>> 
>> nub :: Hashable a => [a] -> [a]
>> nub = S.toList . S.fromList

> Well, the above is not stable while Niklas’ is.  But I guess that’s not
> the point of your message :).

We could also implement Data.BloomFilter.nub, which removes equal
elements probabilistically (with a small but non-zero chance of removing
some unique elements) :-)

-k
--

-- 
If I haven't seen further, it is by standing in the footprints of giants

_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe <at> haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe
Joey Adams | 15 Jul 03:54 2013
Picon

Re: ordNub

On Sun, Jul 14, 2013 at 7:31 AM, Clark Gaebel <cgaebel <at> uwaterloo.ca> wrote:

Similarly, I've always used:

import qualified Data.HashSet as S

nub :: Hashable a => [a] -> [a]
nub = S.toList . S.fromList

And i can't think of any type which i can't write a Hashable instance, so this is extremely practical.

This won't yield results lazily (e.g. nub (repeat 'x') = _|_ instead of 'x' : _|_), but Niklas' ordNub will.  His ordNub can be translated directly to HashSet and still have the stability and laziness properties.

A difficulty with putting ordNub in Data.List is that it depends on containers, which is outside of the base package.  Some options:

 * Move the implementation of Set to base.

 * Implement a lean version of Set in base that only provides 'insert' and 'member'.

 * Define ordNub in Data.Set instead.

Adding a Hashable-based nub to base would be even more problematic, since you'd need Hashable in base.
_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe <at> haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe
Conrad Parker | 15 Jul 04:24 2013

Re: ordNub

On 15 July 2013 09:54, Joey Adams <joeyadams3.14159 <at> gmail.com> wrote:
> On Sun, Jul 14, 2013 at 7:31 AM, Clark Gaebel <cgaebel <at> uwaterloo.ca> wrote:
>>
>> Similarly, I've always used:
>>
>> import qualified Data.HashSet as S
>>
>> nub :: Hashable a => [a] -> [a]
>> nub = S.toList . S.fromList
>>
>> And i can't think of any type which i can't write a Hashable instance, so
>> this is extremely practical.
>
> This won't yield results lazily (e.g. nub (repeat 'x') = _|_ instead of 'x'
> : _|_), but Niklas' ordNub will.  His ordNub can be translated directly to
> HashSet and still have the stability and laziness properties.
>
> A difficulty with putting ordNub in Data.List is that it depends on
> containers, which is outside of the base package.  Some options:
>
>  * Move the implementation of Set to base.
>
>  * Implement a lean version of Set in base that only provides 'insert' and
> 'member'.
>
>  * Define ordNub in Data.Set instead.
>
> Adding a Hashable-based nub to base would be even more problematic, since
> you'd need Hashable in base.

Right, I suggest the following community course of action:

1a) add ordNub to Data.Set
1b) add ordNub to Data.Hashable
(1 day)

2) make a libraries <at>  proposal to include a stripped-down Data.Set-like
balanced binary tree implementation to base.
(2 weeks)

3) bikeshed about the name, eg.:
  * is "nub" really intuitive? how about "uniq", like in
perl/ruby/underscore.js?
  * but "uniq" in unix only removes _adjacent_ duplicates, confusing!
  * how about "distinct"? "sole"? "unique"? "azygous"?
(7 weeks)

4) Failing consensus on technical grounds (that the stripped-down
Data.Set implementation is overkill for one library function), agree
that anyone who really cares should just use the version from
containers or hashable. Only newbs and textbook authors actually use
base anyway, and it's impossible to change the language definition.
Prelude will continue to fulfil its role of avoiding success at all
costs, quadratic or otherwise.

(Please, let's have both 1a and 1b :)

Conrad.
Roman Cheplyaka | 14 Jul 15:12 2013

Re: ordNub

Something like that should definitely be included in Data.List.
Thanks for working on it.

Roman

* Niklas Hambüchen <mail <at> nh2.me> [2013-07-14 19:20:52+0800]
> tldr: nub is abnormally slow, we shouldn't use it, but we do.
> 
> 
> As you might know, Data.List.nub is O(n²). (*)
> 
> As you might not know, almost *all* practical Haskell projects use it,
> and that in places where an Ord instance is given, e.g. happy, Xmonad,
> ghc-mod, Agda, darcs, QuickCheck, yesod, shake, Cabal, haddock, and 600
> more (see https://github.com/nh2/haskell-ordnub).
> 
> I've taken the Ord-based O(n * log n) implementation from yi using a Set:
> 
>   ordNub :: (Ord a) => [a] -> [a]
>   ordNub l = go empty l
>     where
>       go _ []     = []
>       go s (x:xs) = if x `member` s then go s xs
>                                     else x : go (insert x s) xs
> 
> 
> and put benchmarks on
> http://htmlpreview.github.io/?https://github.com/nh2/haskell-ordnub/blob/1f0a2c94a/report.html
> (compare `nub` vs `ordNub`).
> 
> `ordNub` is not only in a different complexity class, but even seems to
> perform better than nub for very small numbers of actually different
> list elements (that's the numbers before the benchmark names).
> 
> (The benchmark also shows some other potential problem: Using a state
> monad to keep the set instead of a function argument can be up to 20
> times slower. Should that happen?)
> 
> What do you think about ordNub?
> 
> I've seen a proposal from 5 years ago about adding a *sort*Nub function
> started by Neil, but it just died.
> 
> 
> (*) The mentioned complexity is for the (very common) worst case, in
> which the number of different elements in the list grows with the list
> (alias you don't have an N element list with always only 5 different
> things inside).
> 
> _______________________________________________
> Haskell-Cafe mailing list
> Haskell-Cafe <at> haskell.org
> http://www.haskell.org/mailman/listinfo/haskell-cafe
Niklas Hambüchen | 12 Oct 20:47 2013

Re: ordNub

I would like to come back to the original question:

How can ordNub be added to base?

I guess we agree that Data.List is the right module for a function of
type Ord a => [a] -> [a], but this introduces

* a cyclic dependency between Data.List and Data.Set
* a base dependency on containers.

What is the right way to go with that?

Should ordNub be introduced as part of Data.Set, as Conrad suggested?

It does not really have anything to do with Set, apart from being
implemented with it.

On 14/07/13 14:12, Roman Cheplyaka wrote:
> Something like that should definitely be included in Data.List.
> Thanks for working on it.
> 
> Roman
Anthony Cowley | 12 Oct 21:43 2013

Re: ordNub


On Oct 12, 2013, at 2:47 PM, Niklas Hambüchen <mail <at> nh2.me> wrote:
> 
> I would like to come back to the original question:
> 
> How can ordNub be added to base?
> 
> I guess we agree that Data.List is the right module for a function of
> type Ord a => [a] -> [a], but this introduces
> 
> * a cyclic dependency between Data.List and Data.Set
> * a base dependency on containers.
> 
> What is the right way to go with that?
> 
> Should ordNub be introduced as part of Data.Set, as Conrad suggested?
> 
> It does not really have anything to do with Set, apart from being
> implemented with it.

I think nub's behavior is rather set-related, so I don't really understand the objection to putting it in Data.Set.

Anthony

> 
>> On 14/07/13 14:12, Roman Cheplyaka wrote:
>> Something like that should definitely be included in Data.List.
>> Thanks for working on it.
>> 
>> Roman
> _______________________________________________
> Haskell-Cafe mailing list
> Haskell-Cafe <at> haskell.org
> http://www.haskell.org/mailman/listinfo/haskell-cafe
_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe <at> haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe
Niklas Hambüchen | 12 Oct 23:12 2013

Re: ordNub

On 12/10/13 20:43, Anthony Cowley wrote:
> I think nub's behavior is rather set-related, so I don't really understand the objection to putting it in Data.Set.

In sets, the order does not matter, while for nub it does.

    nub    :: Eq a  => [a] -> [a]
    ordNub :: Ord a => [a] -> [a]

both do not mention Set, and the fact that Set is used inside my
proposed ordNub implementation is a detail not visible to the caller.

That's why it looks like a Data.List function to me.
AntC | 13 Oct 22:42 2013
Picon

Re: ordNub

> Niklas Hambüchen <mail <at> nh2.me> writes:
> 
> In sets, the order does not matter, while for nub it does.
> 

Let's be careful here!. Niklas, when you say "order", do you mean:
* the _ordering_ from the Ord instance? Or
* the relative sequence of elements in the list?

> ... the fact that Set is used inside my proposed 
> ordNub implementation is a detail not visible to the caller.

If you use the Set library, that fact may be very visible!
Because Set re-sequences the whole list, as per its Ord instance.

But List.nub preserves the list sequence (except for omitting duplicates).

Furthermore, the Ord instance might compare two elements as EQ, even 
though their Eq instance says they're not equal.

So a Set-based ordNub could end up returning:
* not the same elements as List.nub
* and/or not in the same list sequence

I'd call that very much *visible* to the caller.

> 
> That's why it looks like a Data.List function to me.
> 

[BTW I am still less than convinced that overall a Set-based ordNub is 
significantly more efficient. I suspect it depends on how big is your 
list.]

_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe <at> haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe
Niklas Hambüchen | 14 Oct 00:25 2013

Re: ordNub

On 13/10/13 21:42, AntC wrote:
>> Niklas Hambüchen <mail <at> nh2.me> writes:
>>
>> In sets, the order does not matter, while for nub it does.
>>
> 
> Let's be careful here!. Niklas, when you say "order", do you mean:
> * the _ordering_ from the Ord instance? Or
> * the relative sequence of elements in the list?
> 
>> ... the fact that Set is used inside my proposed 
>> ordNub implementation is a detail not visible to the caller.
> 
> If you use the Set library, that fact may be very visible!
> Because Set re-sequences the whole list, as per its Ord instance.
> 
> But List.nub preserves the list sequence (except for omitting duplicates).

I mean *exactly* what you say here.

ordNub behaves has the same behaviour as nub, while (Set.toList .
Set.fromList) doesn't.

> [BTW I am still less than convinced that overall a Set-based ordNub is 
> significantly more efficient. I suspect it depends on how big is your 
> list.]

What do you mean?

ordNub is clearly in a different complexity class, and the benchmarks
that I provided show not only this, but also that ordNub is *always*
faster than nub, even for singleton lists.
_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe <at> haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe
AntC | 14 Oct 04:20 2013
Picon

Re: ordNub

> Niklas Hambüchen <mail <at> nh2.me> writes:
> 
> > On 13/10/13 21:42, AntC wrote:
> > ...
> > If you use the Set library, that fact may be very visible!
> > Because Set re-sequences the whole list, as per its Ord instance.
> > 
> > But List.nub preserves the list sequence
> > (except for omitting duplicates).
> 
> I mean *exactly* what you say here.
> 
> ordNub behaves has the same behaviour as nub, while (Set.toList .
> Set.fromList) doesn't.
> 

That's great, thank you.

> > [BTW I am still less than convinced that overall a Set-based ordNub is 
> > significantly more efficient. I suspect it depends on how big is your 
> > list.]
> 
> What do you mean?
> 
> ordNub is clearly in a different complexity class, ...

Yes, I'm not disputing that.

> ... and the benchmarks that I provided show not only this,
> but also that ordNub is *always* faster than nub,
> even for singleton lists.

Thanks Niklas, I hadn't spotted those benchmarks back in July.

I'm surprised at that result for singletons 
(and for very small numbers of elements which are in fact each different).

Especially because List's `nub` uses `filter` ==> fold, which should be 
tail-recursive.

It seems to me that for small numbers, the Set-based approach still 
requires comparing each element to each other. Plus there's the overhead 
for building the Set and inserting each element into it -- where `insert` 
again walks the Set to find the insertion point.

Then here's a further possible optimisation, instead of making separate 
calls to `member` and `insert`:
* Make a single call to
    insert' :: (Ord a) => a -> Set a -> (Bool, Set a)
* The Bool returns True if already a member.
* Else returns an updated Set in the snd, with the element inserted.

_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe <at> haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe
Niklas Hambüchen | 14 Oct 04:35 2013

Re: ordNub

On 14/10/13 03:20, AntC wrote:
> Thanks Niklas, I hadn't spotted those benchmarks back in July.

No worries :)

> I'm surprised at that result for singletons 
> (and for very small numbers of elements which are in fact each different).

I think one of the main reasons for the performance difference is that a
list node and a Set binary tree node have pretty much the same
performance, with the difference that in

http://hackage.haskell.org/package/containers-0.5.2.1/docs/src/Data-Set-Base.html

   data Set a = Bin {-# UNPACK #-} !Size !a !(Set a) !(Set a) | Tip

there are strictness and unpack annotations, while for

   data [a] = [] | a : [a] -- pseudo syntax

there are not.

Good for us in this case, I guess.

> It seems to me that for small numbers, the Set-based approach still 
> requires comparing each element to each other.

This I don't understand.

> Then here's a further possible optimisation, instead of making separate 
> calls to `member` and `insert`:

This I understand again. Where do you get insert' from? containers
doesn't seem to have it. Do you suggest adding it?

Niklas
AntC | 14 Oct 07:03 2013
Picon

Re: ordNub

> Niklas Hambüchen <mail <at> nh2.me> writes:
> 
> > On 14/10/13 03:20, AntC wrote:
> > ... 
> > Then here's a further possible optimisation, instead of making 
> > separate calls to `member` and `insert`:
> 
> This I understand again. Where do you get insert' from? containers
> doesn't seem to have it. Do you suggest adding it?
> 

err, well I didn't have any specific library in mind.

More there's a kind of 'folk idiom' for managing data structures,
(this applies more for imperative code/update-in-situ than functional)
that if you know the next thing you're going to do after failing to find 
an element is insert it, you might as well get on with the insert there 
and then.

(It's a higher-level analogue of a machine instruction decrement-and-
branch-if-zero.)

I'm looking at all the remarks about managing libraries and dependencies.
Would it make sense to build a stand-alone version of Set purely to 
support ordNub? Then it needs only methods `empty` and `insertIfAbsent`.

_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe <at> haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe
Yitzchak Gale | 16 Oct 18:46 2013

Re: ordNub

On Sun, Oct 13, 2013 at 6:25 PM, Niklas Hambüchen <mail <at> nh2.me> wrote:
> ordNub behaves has the same behaviour as nub, while (Set.toList .
> Set.fromList) doesn't.

Perhaps you mean that they produce exactly the same
output when fully evaluated. But I doubt that their behavior
is *exactly* the same in every aspect. Given the differences
in strictness between sets and lists, can you prove that
nub and nubOrd have *exactly* the same strictness properties
in every possible case?

> ...ordNub is *always* faster than nub, even for singleton lists.

This is also a claim that I doubt. You say that you tested the
case of singletons. How about this:

2 : replicate 100000 1 ++ [3]

Well, you could optimize for that by having nubOrd squeeze
runs of of equal elements in the input. But then how about
something like this:

[3,2,1] ++ take 100000 (cycle [3,2,1]) ++ [4]

Or variations on that, cycling some other fairly small number of
elements. Set containers must do extra comparisons, whereas
the of cost running down the first few hundred elements of a linked
list (as long as the compiler fuses the iteration down to a tight
loop in the output code and you remain within the cache) is near zero.
How could nubOrd be as fast as ord in these kinds of cases?

I'm not saying that it wouldn't be worthwhile to add a standard
well-optimized implementation of nubOrd somewhere.
But nub is a very useful function. The only advantage of nubOrd
is for very large lists where the complexity advantage overtakes
the excellent constants of nub to provide better speed. In
practice, the cases where that's worthwhile are unusual.
I rarely bother with nubOrd over nub.

-Yitz
Kim-Ee Yeoh | 16 Oct 19:20 2013

Re: ordNub


On Wed, Oct 16, 2013 at 11:46 PM, Yitzchak Gale <gale <at> sefer.org> wrote:
I'm not saying that it wouldn't be worthwhile to add a standard
well-optimized implementation of nubOrd somewhere.
But nub is a very useful function.

+1

In many cases, an elementary operational semantics beats something more complicated, even if faster.

Please don't kill nub.

-- Kim-Ee
_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe <at> haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe
Niklas Hambüchen | 16 Oct 19:32 2013

Re: ordNub

On 16/10/13 17:46, Yitzchak Gale wrote:
> On Sun, Oct 13, 2013 at 6:25 PM, Niklas Hambüchen <mail <at> nh2.me> wrote:
>> ordNub behaves has the same behaviour as nub, while (Set.toList .
>> Set.fromList) doesn't.
> 
> Perhaps you mean that they produce exactly the same
> output when fully evaluated. But I doubt that their behavior
> is *exactly* the same in every aspect. Given the differences
> in strictness between sets and lists, can you prove that
> nub and nubOrd have *exactly* the same strictness properties
> in every possible case?

Yes. The elements are are evaluated in the same order as in `nub`.
We can see that by comparing the implementations.

If you have objects that need not be fully evaluated to use (==) on
them, and need not fully evaluated *in a different way* when using
`compare` on them, then of course the strictness properties are different.
That is why one function has "Eq a =>" and the other one "Ord a =>" at
the front.

>> ...ordNub is *always* faster than nub, even for singleton lists.
> 
> This is also a claim that I doubt. You say that you tested the
> case of singletons. How about this:
> 
> 2 : replicate 100000 1 ++ [3]
> 
> Well, you could optimize for that by having nubOrd squeeze
> runs of of equal elements in the input. But then how about
> something like this:
> 
> [3,2,1] ++ take 100000 (cycle [3,2,1]) ++ [4]
> 
> Or variations on that, cycling some other fairly small number of
> elements. Set containers must do extra comparisons, whereas
> the of cost running down the first few hundred elements of a linked
> list (as long as the compiler fuses the iteration down to a tight
> loop in the output code and you remain within the cache) is near zero.
> How could nubOrd be as fast as ord in these kinds of cases?

I make the following, really cool proposal:

* You add your test cases that you just mentioned to the benchmark list
in "ordnub.hs"
* run `cabal build`
* run `dist/build/ordnub/ordnub -o report.html`.

Then you will see that ordNub is, indeed, faster.

> (as long as the compiler fuses the iteration down to a tight
> loop in the output code and you remain within the cache)

Even though a list is conceptually simpler than a set, a list node and a
set node are not that different (as I explained in the last email).
Both have a content and a pointer to the next node. The only difference
is that the Set node is optimised with strictness and unboxing, and the
list one is not.

> The only advantage of nubOrd
> is for very large lists where the complexity advantage overtakes
> the excellent constants of nub to provide better speed.

Sorry, I think this is absolutely invented.

What "excellent constants" do you mean?

How large are "very large lists"? I find lists of length 1000 a very
common case. An `n log2 n` algorithm is already 100 times faster on
those than an n² one.

I do not dispute that there exist n² algorithms that are faster than
others with better complexity for small inputs.

nub is not one of them.
Yitzchak Gale | 17 Oct 18:56 2013

Re: ordNub

Niklas Hambüchen wrote:
> I make the following, really cool proposal:>
> * You add your test cases that you just mentioned to the benchmark list
> in "ordnub.hs"...

No thanks. Those were not meant to be test cases to add
to a benchmark suite - they were meant to get you thinking a
little more deeply about execution models of Haskell in
general, and the GHC execution model in particular, given
your claim that nubOrd is better than ord is *every* case,
in *every* sense of "better".

But that is a side issue. Everyone agrees with you
that it is useful to add a good implementation of nubOrd to
the standard libraries in some convenient and logical
place. Let's get back to focusing on that.

Thanks,
Yitz
Niklas Hambüchen | 17 Oct 20:52 2013

Re: ordNub

On 17/10/13 17:56, Yitzchak Gale wrote:
> Let's get back to focusing on that.

Agreed.

My main question is still:

Can the Data.List <-> Data.Set issue be resolved?

Is it OK to introduce a cyclic dependency between the two, making use of
a .boot.hs file?
Andreas Abel | 18 Oct 15:03 2013
Picon

Re: ordNub

On 17.10.2013 20:52, Niklas Hambüchen wrote:
> Can the Data.List <-> Data.Set issue be resolved?
>
> Is it OK to introduce a cyclic dependency between the two, making use of
> a .boot.hs file?

As long as the users of Data.List and Data.Set do not have to fiddle 
with {-# SOURCE #-} imports, I do not see an issue, so please go ahead.

Cheers,
Andreas

P.S.: The Agda source has tons of .hs-boot files, I hate them, but what 
can you do?

--

-- 
Andreas Abel  <><      Du bist der geliebte Mensch.

Theoretical Computer Science, University of Munich
Oettingenstr. 67, D-80538 Munich, GERMANY

andreas.abel <at> ifi.lmu.de
http://www2.tcs.ifi.lmu.de/~abel/
Stijn van Drongelen | 18 Oct 15:20 2013
Picon

Re: ordNub

On Fri, Oct 18, 2013 at 3:03 PM, Andreas Abel <andreas.abel <at> ifi.lmu.de> wrote:

P.S.: The Agda source has tons of .hs-boot files, I hate them, but what can you do?


Implement support for mutually recursive modules? ;)
_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe <at> haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe
Dan Doel | 18 Oct 15:49 2013
Picon

Re: ordNub

On Thu, Oct 17, 2013 at 2:52 PM, Niklas Hambüchen <mail <at> nh2.me> wrote:
> Can the Data.List <-> Data.Set issue be resolved?
>
> Is it OK to introduce a cyclic dependency between the two, making use of
> a .boot.hs file?

Data.List and Data.Set aren't even members of the same package. So
hs-boot files won't help.

-- Dan
Niklas Hambüchen | 18 Oct 16:19 2013

Re: ordNub

> Data.List and Data.Set aren't even members of the same package. So
> hs-boot files won't help.

Damn, you are right. I even mentioned this myself earlier.

It would need a set in base. I wonder if it was ok to use the Data.Set 
implementation internally in base, without exposing it.
Yitzchak Gale | 18 Oct 20:47 2013

Re: ordNub

Niklas Hambüchen wrote:
> It would need a set in base. I wonder if it was ok to use the Data.Set
> implementation internally in base, without exposing it.

Ah yes, this is what stopped previous attempts at getting
ordNub in.

The GHC team is very keen on getting more things *out* of base.
Unless something has changed, getting them to include new things
in base, even if hidden, would be extremely difficult.
Adding additional dependencies and maintenance burdens would
add extra work for the overburdened GHC team and slow the
GHC release cycle.

How about adding a module named something like Data.List.Nub
to containers and putting ordNub in there? Or just go back to the
idea of putting it in Data.Set.

-Yitz
David Thomas | 18 Oct 21:13 2013
Picon

Re: ordNub

With such a big speedup and not a lot of downside, I'd really rather it wind up central enough to be used in preference to nub wherever possible (including in base!).  Perhaps they can duplicate the minimum functionality necessary from Data.Set, internally behind the scenes?  Alternatively, maybe we can move nub out of base?


On Fri, Oct 18, 2013 at 11:47 AM, Yitzchak Gale <gale <at> sefer.org> wrote:
Niklas Hambüchen wrote:
> It would need a set in base. I wonder if it was ok to use the Data.Set
> implementation internally in base, without exposing it.

Ah yes, this is what stopped previous attempts at getting
ordNub in.

The GHC team is very keen on getting more things *out* of base.
Unless something has changed, getting them to include new things
in base, even if hidden, would be extremely difficult.
Adding additional dependencies and maintenance burdens would
add extra work for the overburdened GHC team and slow the
GHC release cycle.

How about adding a module named something like Data.List.Nub
to containers and putting ordNub in there? Or just go back to the
idea of putting it in Data.Set.

-Yitz
_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe <at> haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe

_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe <at> haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe
Yitzchak Gale | 18 Oct 22:11 2013

Re: ordNub

David Thomas wrote:
> With such a big speedup and not a lot of downside

That is far from proven, and frankly, almost certainly untrue.

But again, it isn't the point here. There *are* case where you
do need nubOrd, and those cases are common enough for
it to make sense to get it into the libraries.

> I'd really rather it wind up central enough to be used in
> preference to nub wherever possible

You are welcome to your own preference. I, and many
other experienced Haskellers I know, prefer nub in many cases.
Let's just make both easily available please.

> (including in base!).

I wouldn't mind having ordNub in base. But let's at least
get it in somewhere convenient and easily usable. If you
focus the energy on trying to get it into base, this proposal
will likely die the same death as all previous attempts.

> Perhaps they can duplicate the minimum functionality
> necessary from Data.Set, internally behind the scenes?

Why should the GHC team agree to the maintenance burden
of that? They have enough of a challenge keeping up with the
work they already have.

But you know what, give it a try. Just don't get stuck
on that point if you meet with resistance.

> Alternatively, maybe we can move nub out of base?

No. nub is part of the Haskell standard, so it's required to
be there. Even if you disagree with my view that it should
be there on its own merits.

-Yitz
Dan Doel | 18 Oct 22:16 2013
Picon

Re: ordNub

On Fri, Oct 18, 2013 at 3:13 PM, David Thomas <davidleothomas <at> gmail.com> wrote:
> With such a big speedup and not a lot of downside, I'd really rather it wind
> up central enough to be used in preference to nub wherever possible
> (including in base!).  Perhaps they can duplicate the minimum functionality
> necessary from Data.Set, internally behind the scenes?  Alternatively, maybe
> we can move nub out of base?

The string 'nub' does not appear anywhere (that I can find) in base
outside of Data.List. All occurrences of 'nub' in Data.List are in the
definition and export of nub and nubBy, except one use of nubBy in
unionBy, which cannot switch to ordNub, and would have to have its own
alternate function.

Of all the libraries packaged with GHC, the only one I see that uses
nub outside of tests is Cabal, which already depends on containers.
And in fact, one of the libraries I grepped through _was_ containers,
because it's already about as close as you can get to being in base
without actually being there, and is certainly as close as many other
things will be once we split up base into more packages.

I don't think it's really accurate to suggest that containers is less
core than base. It's not just, 'some library.'

-- Dan
Roman Cheplyaka | 12 Oct 23:18 2013

Re: ordNub

* Anthony Cowley <acowley <at> seas.upenn.edu> [2013-10-12 15:43:57-0400]
> 
> On Oct 12, 2013, at 2:47 PM, Niklas Hambüchen <mail <at> nh2.me> wrote:
> > 
> > I would like to come back to the original question:
> > 
> > How can ordNub be added to base?
> > 
> > I guess we agree that Data.List is the right module for a function of
> > type Ord a => [a] -> [a], but this introduces
> > 
> > * a cyclic dependency between Data.List and Data.Set
> > * a base dependency on containers.
> > 
> > What is the right way to go with that?
> > 
> > Should ordNub be introduced as part of Data.Set, as Conrad suggested?
> > 
> > It does not really have anything to do with Set, apart from being
> > implemented with it.
> 
> I think nub's behavior is rather set-related, so I don't really
> understand the objection to putting it in Data.Set.

It's not Set (in the data structure sense) related. It's list-related,
because it clearly acts on lists.

Therefore, it belongs to Data.List.

Besides, we already have the precedent of the slow nub being in
Data.List.

Roman
_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe <at> haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe
Thomas DuBuisson | 15 Jul 05:16 2013
Picon

Re: ordNub

Just so people are aware - five years ago the notion of nubOrd and
nubWith was discussed and a consensus reached on including nubOrd.  I
think Bart got too busy, didn't submit a final patch, and no one with
commit access actually commited any code.

http://haskell.1045720.n5.nabble.com/GHC-2717-Add-nubWith-nubOrd-td3159919.html

I fully support an efficient nub implementation making its way into
base - it's far past time.  Using Set seems sensible.

Cheers,
Thomas

On Sun, Jul 14, 2013 at 4:20 AM, Niklas Hambüchen <mail <at> nh2.me> wrote:
> tldr: nub is abnormally slow, we shouldn't use it, but we do.
>
>
> As you might know, Data.List.nub is O(n²). (*)
>
> As you might not know, almost *all* practical Haskell projects use it,
> and that in places where an Ord instance is given, e.g. happy, Xmonad,
> ghc-mod, Agda, darcs, QuickCheck, yesod, shake, Cabal, haddock, and 600
> more (see https://github.com/nh2/haskell-ordnub).
>
> I've taken the Ord-based O(n * log n) implementation from yi using a Set:
>
>   ordNub :: (Ord a) => [a] -> [a]
>   ordNub l = go empty l
>     where
>       go _ []     = []
>       go s (x:xs) = if x `member` s then go s xs
>                                     else x : go (insert x s) xs
>
>
> and put benchmarks on
> http://htmlpreview.github.io/?https://github.com/nh2/haskell-ordnub/blob/1f0a2c94a/report.html
> (compare `nub` vs `ordNub`).
>
> `ordNub` is not only in a different complexity class, but even seems to
> perform better than nub for very small numbers of actually different
> list elements (that's the numbers before the benchmark names).
>
> (The benchmark also shows some other potential problem: Using a state
> monad to keep the set instead of a function argument can be up to 20
> times slower. Should that happen?)
>
> What do you think about ordNub?
>
> I've seen a proposal from 5 years ago about adding a *sort*Nub function
> started by Neil, but it just died.
>
>
> (*) The mentioned complexity is for the (very common) worst case, in
> which the number of different elements in the list grows with the list
> (alias you don't have an N element list with always only 5 different
> things inside).
>
> _______________________________________________
> Haskell-Cafe mailing list
> Haskell-Cafe <at> haskell.org
> http://www.haskell.org/mailman/listinfo/haskell-cafe
Jason Dagit | 15 Jul 07:26 2013
Picon

Re: ordNub

On Sun, Jul 14, 2013 at 4:20 AM, Niklas Hambüchen <mail <at> nh2.me> wrote:
> tldr: nub is abnormally slow, we shouldn't use it, but we do.
>
>
> As you might know, Data.List.nub is O(n²). (*)
>
> As you might not know, almost *all* practical Haskell projects use it,
> and that in places where an Ord instance is given, e.g. happy, Xmonad,
> ghc-mod, Agda, darcs, QuickCheck, yesod, shake, Cabal, haddock, and 600
> more (see https://github.com/nh2/haskell-ordnub).
>
> I've taken the Ord-based O(n * log n) implementation from yi using a Set:
>
>   ordNub :: (Ord a) => [a] -> [a]
>   ordNub l = go empty l
>     where
>       go _ []     = []
>       go s (x:xs) = if x `member` s then go s xs
>                                     else x : go (insert x s) xs
>
>
> and put benchmarks on
> http://htmlpreview.github.io/?https://github.com/nh2/haskell-ordnub/blob/1f0a2c94a/report.html
> (compare `nub` vs `ordNub`).

Richard Bird has a book, "Pearls of Functional Algorithm Design" that
is meant to teach a form of deriving algorithms from the properties we
ask of them. In this book, he gives a possible derivation of ordNub,
simply called nub in the book, following the methodology he is
teaching. He notes in the text that this derivation feels more
complicated than it ought.

Here is his version: http://lpaste.net/87625

I just sent you a pull request to add that one and S.toList .
S.fromList that was suggested in this thread. I don't think those two
implementations are faster than the others but it's nice to have them
for completeness.

Jason
Niklas Hambüchen | 15 Jul 08:43 2013

Re: ordNub

Hey Jason,

would you mind giving a short idea of what the point of Bird's
implementation is / from what properties it is derived?

Also, running the QuickCheck tests you added, it doesn't give the same
output (order) as nub.

On 15/07/13 13:26, Jason Dagit wrote:
> Richard Bird has a book, "Pearls of Functional Algorithm Design" that
> is meant to teach a form of deriving algorithms from the properties we
> ask of them. In this book, he gives a possible derivation of ordNub,
> simply called nub in the book, following the methodology he is
> teaching. He notes in the text that this derivation feels more
> complicated than it ought.
> 
> Here is his version: http://lpaste.net/87625
> 
> I just sent you a pull request to add that one and S.toList .
> S.fromList that was suggested in this thread. I don't think those two
> implementations are faster than the others but it's nice to have them
> for completeness.
> 
> Jason
> 
Andreas Abel | 17 Jul 00:24 2013
Picon

Re: ordNub

On 14.07.2013 13:20, Niklas Hambüchen wrote:
> I've taken the Ord-based O(n * log n) implementation from yi using a Set:
>
>    ordNub :: (Ord a) => [a] -> [a]
>    ordNub l = go empty l
>      where
>        go _ []     = []
>        go s (x:xs) = if x `member` s then go s xs
>                                      else x : go (insert x s) xs
>
> (The benchmark also shows some other potential problem: Using a state
> monad to keep the set instead of a function argument can be up to 20
> times slower. Should that happen?)

I cannot say whether this should happen, but your code about can be 
straight-forwardly refactored using a *Reader* monad.

import Control.Monad.Reader

import Data.Functor ((<$>))
import qualified Data.Set as Set

-- ifM still not in Control.Monad
ifM mc md me = do { c <- mc; if c then md else me }

ordNub :: (Ord a) => [a] -> [a]
ordNub l = runReader (go l) Set.empty
   where
     go []     = return []
     go (x:xs) = ifM ((x `Set.member`) <$> ask)
                     (go xs)
                     ((x :) <$> local (Set.insert x) (go xs))

test = ordNub [1,2,4,1,3,5,2,3,4,5,6,1]

Of course, this does not lend itself to an application of filterM.

In fact, your implementation is already in the (Set a ->) reader monad, 
in normalized form.  It looks already optimal to me.

Cheers,
Andreas

--

-- 
Andreas Abel  <><      Du bist der geliebte Mensch.

Theoretical Computer Science, University of Munich
Oettingenstr. 67, D-80538 Munich, GERMANY

andreas.abel <at> ifi.lmu.de
http://www2.tcs.ifi.lmu.de/~abel/
Niklas Hambüchen | 26 Aug 13:59 2013

Re: ordNub

On 14/07/13 20:20, Niklas Hambüchen wrote:
> As you might not know, almost *all* practical Haskell projects use it,
> and that in places where an Ord instance is given, e.g. happy, Xmonad,
> ghc-mod, Agda, darcs, QuickCheck, yesod, shake, Cabal, haddock, and 600
> more (see https://github.com/nh2/haskell-ordnub).

GHC uses nub.

Also let me stress again that the n² case happens even if there are no
duplicates.

Gmane