Why?
Imagine you’re ordering office lunch, and you can only order from one restaurant.
You have a voting system, and you cast your vote.
But your boss casts more than one!
Well, there are more use cases:
- For IoT devices that send sensor data, you may want only distinct values per sensor.
- In a producer-consumer pattern, this layer can help when a producer does not receive acknowledgement and resends data, causing duplicates.
- In ETL systems or scrapers that run periodically, you only want to process the deltas instead of rerunning the pipeline on data you have already seen.
- A live polling or voting system, allowing a person to vote only once.
- Any system where you have a high-throughput data stream and you want to make sure there are no duplicates.
It seems like a trivial task initially: just use a hash table, right?
If you want scalability, performance and robustness, it gets interesting.
Simplest implementation
- Read data
- Create a map / hash table and save the data as the key
- Next time you read, check whether this data is already present in the map / hash table
- If yes, you’ve found a duplicate, so skip it
- If not, store it as a new key and write the data to the output
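The steps above boil down to a set-membership check. Here is a minimal sketch, assuming the whole line is the key and everything fits in memory. The `dedupe` helper and the sample votes are hypothetical names used only for illustration:

```go
package main

import "fmt"

// dedupe returns the input lines with repeats removed, keeping first-seen order.
// Hypothetical helper for illustration; the whole line is used as the key.
func dedupe(lines []string) []string {
	seen := make(map[string]struct{}) // set of keys observed so far
	var unique []string
	for _, line := range lines {
		if _, found := seen[line]; found {
			continue // duplicate: drop it
		}
		seen[line] = struct{}{} // first sighting: remember the key
		unique = append(unique, line)
	}
	return unique
}

func main() {
	votes := []string{"Dwight: Salad", "Michael: Pizza", "Michael: Pizza", "Pam: Salad"}
	for _, v := range dedupe(votes) {
		fmt.Println(v)
	}
}
```

Using `struct{}` as the map value makes the map behave like a set, since the empty struct carries no data.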

Let’s create a simple application that reads a stream of input, deduplicates it, and writes a unique stream of output.
Input:
sudopower@MacBookAir deduplication-service % cat /tmp/duplicate_stream.txt
Dwight voted for Salad 🥗
Stanely voted for Salad 🥗
Michael voted for 🍕 Pizza, but he may vote again, coz he's naughty and doesn't want Salad
Michael voted for 🍕 Pizza, but he may vote again, coz he's naughty and doesn't want Salad
Michael voted for 🍕 Pizza, but he may vote again, coz he's naughty and doesn't want Salad
Pam voted for Salad 🥗
Jim voted for Salad 🥗
Output:
sudopower@MacBookAir deduplication-service % cat /tmp/duplicate_stream.txt | go run main.go > /tmp/de_duplicated_stream.txt
2025/07/01 17:45:28 Starting deduplication service
2025/07/01 17:45:28 Deduplication service finished.
sudopower@MacBookAir deduplication-service % cat /tmp/de_duplicated_stream.txt
Dwight voted for Salad 🥗
Stanely voted for Salad 🥗
Michael voted for 🍕 Pizza, but he may vote again, coz he's naughty and doesn't want Salad
Pam voted for Salad 🥗
Jim voted for Salad 🥗
Code:
package main

import (
	"bufio"
	"fmt"
	"io"
	"log"
	"os"
)

// Deduplicator holds the state for tracking seen message keys.
type Deduplicator struct {
	// seen is the set of keys observed so far; the empty struct value carries no data.
	// The key is of type interface{} to handle various types (numbers, strings),
	// so callers must only pass comparable values.
	seen map[interface{}]struct{}
}

// NewDeduplicator creates and initializes a new Deduplicator instance.
func NewDeduplicator() *Deduplicator {
	d := &Deduplicator{
		seen: make(map[interface{}]struct{}),
	}
	return d
}

// ProcessMessages reads messages from the provided reader, deduplicates them,
// and writes the unique messages to the writer.
func (d *Deduplicator) ProcessMessages(writer io.Writer, reader io.Reader) {
	scanner := bufio.NewScanner(reader)
	for scanner.Scan() {
		line := scanner.Text()
		// Skip empty lines; they carry no message.
		if len(line) == 0 {
			continue
		}
		// The whole line is the key. Check if it is a duplicate.
		if !d.isDuplicate(line) {
			// If not a duplicate, write the original message to the output.
			fmt.Fprintln(writer, line)
		}
	}
	// By default, Scanner stops with bufio.ErrTooLong on lines over 64 KiB.
	if err := scanner.Err(); err != nil {
		log.Printf("Error reading input: %v", err)
	}
}

// isDuplicate checks if a key has been seen before.
// It returns true if the message is a duplicate, and false otherwise.
// If the message is not a duplicate, it records the key.
func (d *Deduplicator) isDuplicate(key interface{}) bool {
	if _, found := d.seen[key]; found {
		return true
	}
	// Otherwise, it's not a duplicate. Store it.
	d.seen[key] = struct{}{}
	return false
}

func main() {
	// Create and run the service
	log.Println("Starting deduplication service")
	deduplicator := NewDeduplicator()
	deduplicator.ProcessMessages(os.Stdout, os.Stdin)
	log.Println("Deduplication service finished.")
}
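The program above treats the entire line as the key, so a second vote from the same person would pass through if its text differs. For the one-person-one-vote use case, one option is to key on the voter instead. This is only a sketch, assuming every line has the form `<name> voted for <choice>`. `voterKey` and `firstVotes` are hypothetical helpers, not part of the service above, and `strings.Cut` needs Go 1.18 or later:

```go
package main

import (
	"fmt"
	"strings"
)

// voterKey extracts the voter name from a line shaped like
// "<name> voted for <choice>". The line format is an assumption of this sketch.
// ok is false when the line does not match.
func voterKey(line string) (name string, ok bool) {
	name, _, ok = strings.Cut(line, " voted for ")
	return name, ok
}

// firstVotes keeps only the first vote seen for each voter.
func firstVotes(lines []string) []string {
	seen := make(map[string]struct{})
	var kept []string
	for _, line := range lines {
		name, ok := voterKey(line)
		if !ok {
			continue // skip lines that are not votes
		}
		if _, dup := seen[name]; dup {
			continue // this voter has already voted
		}
		seen[name] = struct{}{}
		kept = append(kept, line)
	}
	return kept
}

func main() {
	votes := []string{
		"Michael voted for Pizza",
		"Michael voted for Salad", // different text, same voter
		"Pam voted for Salad",
	}
	for _, v := range firstVotes(votes) {
		fmt.Println(v)
	}
}
```

Choosing the key is a design decision: whole-line keys catch exact resends, while a narrower key such as the voter name enforces a rule about who may appear more than once.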
Problem
What if the seen table grows too large?
Is it strictly necessary to never let a duplicate through?
What if others want to order a 🍕 Pizza the next day?
These are very interesting problems and require a separate post to discuss them in detail.
Check out the NEXT POST to find out.
Comment if you’re interested in more.