18 Mar 2013 21:21

## Improving my implementation of "The Lady Tasting Tea"

Hello,

For my first project in Haskell, I thought I would re-implement a statistical problem that I have previously done in R. In his 1925 book, Fisher tells of an experiment in which a lady claims she can tell the difference in cups of tea that have the milk or the tea/water added first (I'm a coffee drinker, so I'm unclear on the actual physical interpretation). To test this claim, Fisher proposes an experiment in which 8 cups are created, with 4 having milk first and 4 having tea first. If the lady can properly label 3 of the 4 milk first cups correctly, should we be suprised? What is the probability that she would get 3 or 4 of the milk first cups correct just due to chance (that is, independently of the actual allocation of milk and tea first)?

Fisher shows that these questions can be answered by writing out all the possible ways the 4 milk cups could be allocated (there are 70 such possibilities) and then counting the subset that have 3 or 4 correct according to the lady's guess (there are 17). Therefore the probability of getting 3 or 4 correct just by guessing would occur about 24% of the time.

I've implemented this algorithm for problems of arbitrary size, and I'd like some feedback on improving the efficiency of my code:

https://gist.github.com/markmfredrickson/5190212

The problem is basically a map-reduce:

1. Create the C(N,K) possible ways the cups could be set up (where N is the total number of cups, and K is the number with milk first)
2. Map a function across them that scores the lady's guess
3. Reduce the set of scores to a p-value (a value between 0 and 1) that indicates what percentage of all possible allocations of milk first would have a score equal to or greater than the lady's guess.

My first attempt at profiling indicates that most time is spent in the scoring function, unsurprising as this is in the inner most loop of the algorithm. The generation of all possible combinations is also time consuming. From what I can tell, this implementation is not very efficient from a space perspective.

In the long term, I'm interested in using this same basic procedure to solve more interesting statistical problems, so I'm particulary interested in getting feed back on:

1. A good data type for storing the list of treatment allocations (i.e. milk first cups).

The current problem calls for binary treatment with no extra data. In the future, my N units may be partitioned into K groups (here K=2), and the score function will require additional data (e.g. each unit will have an outcome measure on a continuous or ordinal scale). In a sense, I'd like to use tuples to hold index values from 1 to N, but I wouldn't necessarily know the tuple size at compile time. The index values would point to an array or similar structure holding the additional data.

Some other properties of allocations:
- They are always the same size
- Order is not important
- All values are unique

Any standard types fit this description?

2. A better algorithm and perhaps data type for the combinations.

The current implementation uses a tree recursion to get the set of all possible combinations. In the future, this set will be too large to enumerate entirely, so I'll need to sample from it. It would be desirable to be able to write `take samples (permute \$ combinations total treated)` where the `permute` function (perhaps with a random seed) properly re-orders the combinations. If these goals are supported by a data structure better than a list, I'm very open to using other types to hold the combinations.

I would also appreciate any suggestions on making the code more idiomatic or readable.

Thanks,
-Mark

```_______________________________________________