sudopower

De-Duplication Service

Why?

Imagine you’re ordering office lunch, and you can only order from one restaurant.
You have a voting system, and you cast your vote.
But your boss casts more than one!

Well, there are more use-cases:

  • For IoT devices that send sensor data, you may want to keep only distinct readings per sensor.
  • In a producer-consumer pattern, this layer can help when a producer does not receive acknowledgement and resends data, causing duplicates.
  • In ETL systems or scrapers that run periodically, you want to process only the deltas instead of re-running the whole pipeline on data you have already seen.
  • A live polling or voting system, allowing a person to vote only once.
  • Any system where you have a high-throughput data stream and you want to make sure there are no duplicates.

It seems like a trivial task at first: just use a hash table, right?
If you want scalability, performance and robustness, it gets interesting.

Simplest implementation

  • Read data
  • Create a Map/Hash table and save the data as the key
  • Next time you read, check whether this data is already present in the Map/Hash table
    • If yes, you’ve found a duplicate, so skip it
    • If not, save it as a new key and write the data to the output


Let’s create a simple application that reads a stream of input, deduplicates it, and writes a unique stream of output.

Input:

sudopower@MacBookAir deduplication-service % cat /tmp/duplicate_stream.txt
Dwight voted for Salad 🥗
Stanely voted for Salad 🥗
Michael voted for 🍕 Pizza, but he may vote again, coz he's naughty and doesn't want Salad 
Michael voted for 🍕 Pizza, but he may vote again, coz he's naughty and doesn't want Salad 
Michael voted for 🍕 Pizza, but he may vote again, coz he's naughty and doesn't want Salad 
Pam voted for Salad 🥗
Jim voted for Salad 🥗

Output:

sudopower@MacBookAir deduplication-service % cat /tmp/duplicate_stream.txt | go run main.go > /tmp/de_duplicated_stream.txt
2025/07/01 17:45:28 Starting deduplication service
2025/07/01 17:45:28 Deduplication service finished.
sudopower@MacBookAir deduplication-service % cat /tmp/de_duplicated_stream.txt                                             
Dwight voted for Salad 🥗
Stanely voted for Salad 🥗
Michael voted for 🍕 Pizza, but he may vote again, coz he's naughty and doesn't want Salad 
Pam voted for Salad 🥗
Jim voted for Salad 🥗


Code:

package main

import (
	"bufio"
	"fmt"
	"io"
	"log"
	"os"
)

// Deduplicator holds the state for tracking seen message keys.
type Deduplicator struct {
	// seen is the set of keys observed so far; the empty struct value takes no space.
	// The key is of type interface{} to handle various comparable types (numbers, strings).
	seen map[interface{}]struct{}
}

// NewDeduplicator creates and initializes a new Deduplicator instance.
func NewDeduplicator() *Deduplicator {
	d := &Deduplicator{
		seen: make(map[interface{}]struct{}),
	}

	return d
}

// ProcessMessages reads messages from the provided reader, deduplicates them,
// and writes the unique messages to the writer.
func (d *Deduplicator) ProcessMessages(writer io.Writer, reader io.Reader) {
	scanner := bufio.NewScanner(reader)
	for scanner.Scan() {
		// Text returns the line as a newly allocated string, so it is safe
		// to keep as a map key after the scanner reuses its buffer.
		line := scanner.Text()

		// Skip empty lines; they carry no message.
		if len(line) == 0 {
			continue
		}

		// Check if the line is a duplicate.
		if !d.isDuplicate(line) {
			// If not a duplicate, write the original message to the output.
			fmt.Fprintln(writer, line)
		}
	}

	if err := scanner.Err(); err != nil {
		log.Printf("Error reading input: %v", err)
	}
}

// isDuplicate checks if a key has been seen before.
// It returns true if the message is a duplicate, and false otherwise.
// If the message is not a duplicate, it records the key.
func (d *Deduplicator) isDuplicate(key interface{}) bool {
	if _, found := d.seen[key]; found {
		return true
	}

	// Otherwise, it's not a duplicate. Store it.
	d.seen[key] = struct{}{}
	return false
}

func main() {

	// Create and run the service
	log.Println("Starting deduplication service")

	deduplicator := NewDeduplicator()
	deduplicator.ProcessMessages(os.Stdout, os.Stdin)

	log.Println("Deduplication service finished.")
}


Problem

What if the seen table grows too large?
Is it strictly necessary to never let a duplicate through?

What if others want to order a 🍕 Pizza the next day?

These are very interesting problems and require a separate post to discuss them in detail. 

Check out the NEXT POST to find out.

Comment if you’re interested in more.