Skip to content

h30s/GoMapReduce

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

2 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

GoMapReduce

I wanted to see how the original Google MapReduce paper actually works under the hood, so I built a fault-tolerant version from scratch in Go.

It’s a Coordinator/Worker setup talking over gRPC. You can plug in your own Map/Reduce functions (I included wordcount and invertedindex), and the whole thing runs as an 8-worker cluster in Docker.

Getting it Running

Grab the sample data and spin up the cluster:

# Get the sample text corpus
./scripts/download_corpus.sh

# Boot the Coordinator and 8 Workers
docker-compose up --build

Pulling the Plug (Testing Fault Tolerance)

The coolest part of MapReduce is watching it heal. While the cluster is crunching data, try this:

  1. Find a worker: docker ps | grep worker
  2. Nuke it: docker kill <container_id>
  3. Watch the docker-compose logs.

The Coordinator tracks heartbeats. After 10 seconds of silence, it marks the missing worker as dead. Crucially, it takes all the tasks assigned to that worker—including the completed Map tasks, since the intermediate data was sitting on that dead worker's local disk—and reassigns them to the survivors. The job just keeps going and finishes successfully.

Tests & Benchmarks

I wrote a few scripts to prove it actually works and scales:

# Compare the distributed output against a single-threaded reference implementation
./test/test_correctness.sh

# Automated chaos testing
./test/test_fault_tolerance.sh

# See how it scales from 1 to 8 workers
./bench/bench.sh 

# (You can also throw a huge file at it: ./bench/bench.sh input/large_10gb_corpus.txt)

Deviations from the Paper

  • No GFS: The real MapReduce relies heavily on the Google File System. I faked this by having Map workers write to local folders and serve the intermediate files over an HTTP server to the Reduce workers. The Reduce workers then just dump their final output to a shared local folder.
  • Simple Partitioning: Instead of writing a custom partition function, I just used Go's built-in hash/fnv to divide up the keys.

About

A fault-tolerant MapReduce implementation in Go, following the original Google paper.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors