18 Apr 2013 20:49

## Map Reduce spec in Haskell

I'm learning about the Map Reduce computation frequently used with big data.  For the fun of it, I decided to write a very high-level spec of Map Reduce.  Here is what I came up with.  Enjoy.

John

> module MapReduce where
> import Data.List (nub)

A high-level specification of Map Reduce as a Haskell program.  The
program uses lists to represent multisets.  As multisets have no
implied ordering, the ordering implied by lists in this specification
should be ignored.

The database is a multiset of key-value pairs.

> type Key = String
> type Value = String
> type Datum = (Key, Value)
> type Data = [Datum]

A mapper maps a datum to a finite multiset of key-value pairs.

> type Mapper = Datum -> Data

A reducer takes a key and a multiset of values and produces a finite
multiset of values.

> type Reducer = (Key, [Value]) -> [Value]

A step is a mapper followed by a reducer

> type Step = (Mapper, Reducer)

A program is a finite sequence of steps

> type Program = [Step]

The semantics of a program is provided by the run function.

> run :: Program -> Data -> Data
> run [] d = d
> run (s : p) d =
>   run p (step s d)

The three parts of a step are mapping, shuffling, and reducing.

> step :: Step -> Data -> Data
> step (m, r) d =
>   let mapped = transform m d
>       shuffled = shuffle mapped in
>   reduce r shuffled

The first part of a step is to transform the data by applying the
mapper to each datum and collecting the results.

> transform :: Mapper -> Data -> Data
> transform m d =
>   [p | u <- d, p <- m u]

Next, values with common keys are collected.  Keys are unique after
shuffling.

> shuffle :: Data -> [(Key, [Value])]
> shuffle d =
>   [(k, vs) | k <- nub (map fst d), -- nub eliminates duplicate keys
>              let vs = [v | (k', v) <- d, k' == k]]

A reducer is applied to the data associated with one key, and always
produces data with that key.

> reduce :: Reducer -> [(Key, [Value])] -> Data
> reduce r rs =
>   [(k, v) | (k, vs) <- rs, v <- r (k, vs)]

```_______________________________________________
```
18 Apr 2013 21:41

### Re: Map Reduce spec in Haskell

```This looks right, but there is definitely a lot more to mapreduce
implementations than algebraic signatures!

It might also be considered that there are lots of people using
MapReduce technology on things other than "bare metal"
Hadoop/MapReduce, etc..  Lots of data analysts, ML people, etc.., use
Apache Pig, Hive, HBase, etc...

One thing I've really been interested in seeing is a (presumably
with dataflow-like programming.  Having a (co?)monad where the values
represent "types" of terms in the Pig language, that produces and runs
Pig programs seems like a fun idea: I'd definitely try it out if
anyone came up with it.

But of course, don't forget about Cloud Haskell and the other related efforts!

Kris

On Thu, Apr 18, 2013 at 2:49 PM, John D. Ramsdell <ramsdell0 <at> gmail.com> wrote:
> I'm learning about the Map Reduce computation frequently used with big data.
> For the fun of it, I decided to write a very high-level spec of Map Reduce.
> Here is what I came up with.  Enjoy.
>
> John
>
>> module MapReduce where
>> import Data.List (nub)
>
> A high-level specification of Map Reduce as a Haskell program.  The
> program uses lists to represent multisets.  As multisets have no
> implied ordering, the ordering implied by lists in this specification
> should be ignored.
>
> The database is a multiset of key-value pairs.
>
>> type Key = String
>> type Value = String
>> type Datum = (Key, Value)
>> type Data = [Datum]
>
> A mapper maps a datum to a finite multiset of key-value pairs.
>
>> type Mapper = Datum -> Data
>
> A reducer takes a key and a multiset of values and produces a finite
> multiset of values.
>
>> type Reducer = (Key, [Value]) -> [Value]
>
> A step is a mapper followed by a reducer
>
>> type Step = (Mapper, Reducer)
>
> A program is a finite sequence of steps
>
>> type Program = [Step]
>
> The semantics of a program is provided by the run function.
>
>> run :: Program -> Data -> Data
>> run [] d = d
>> run (s : p) d =
>>   run p (step s d)
>
> The three parts of a step are mapping, shuffling, and reducing.
>
>> step :: Step -> Data -> Data
>> step (m, r) d =
>>   let mapped = transform m d
>>       shuffled = shuffle mapped in
>>   reduce r shuffled
>
> The first part of a step is to transform the data by applying the
> mapper to each datum and collecting the results.
>
>> transform :: Mapper -> Data -> Data
>> transform m d =
>>   [p | u <- d, p <- m u]
>
> Next, values with common keys are collected.  Keys are unique after
> shuffling.
>
>> shuffle :: Data -> [(Key, [Value])]
>> shuffle d =
>>   [(k, vs) | k <- nub (map fst d), -- nub eliminates duplicate keys
>>              let vs = [v | (k', v) <- d, k' == k]]
>
> A reducer is applied to the data associated with one key, and always
> produces data with that key.
>
>> reduce :: Reducer -> [(Key, [Value])] -> Data
>> reduce r rs =
>>   [(k, v) | (k, vs) <- rs, v <- r (k, vs)]
>
>
> _______________________________________________
```There was a paper about this. I think I found it.