Skip to content

Hash Tables and Sets

Hash tables and sets are fundamental data structures that provide fast lookups, insertions, and deletions. They are essential tools for solving a wide range of algorithmic problems efficiently.

What Are Hash Tables?

A hash table (also called hash map or dictionary) is a data structure that maps keys to values using a hash function. It provides average O(1) time complexity for basic operations.

Key Concepts

Hash Function: Converts keys into array indices - Should be deterministic (same key always produces same hash) - Should distribute keys uniformly across the table - Should be fast to compute

Collision Handling: When two keys hash to the same index - Chaining: Store multiple items at same index using linked lists - Open Addressing: Find another empty slot (linear probing, quadratic probing)

Hash Tables in Python

Python provides dictionary (dict) as a built-in hash table implementation.

# Creating hash tables
hash_map = {}
hash_map = dict()

# Basic operations
hash_map['key'] = 'value'  # Insert/Update: O(1) average
value = hash_map['key']    # Access: O(1) average
del hash_map['key']        # Delete: O(1) average

# Check existence
if 'key' in hash_map:      # O(1) average
    print("Key exists")

# Get with default
value = hash_map.get('key', default_value)

# Iterate
for key in hash_map:
    print(key, hash_map[key])

for key, value in hash_map.items():
    print(key, value)

Sets

A set is a collection of unique elements with fast membership testing. It's essentially a hash table without values.

# Creating sets
my_set = set()
my_set = {1, 2, 3}

# Basic operations
my_set.add(4)           # Insert: O(1) average
my_set.remove(2)        # Delete: O(1) average
my_set.discard(5)       # Delete if exists (no error)

# Check membership
if 3 in my_set:         # O(1) average
    print("Element exists")

# Set operations
set1 = {1, 2, 3}
set2 = {3, 4, 5}

union = set1 | set2           # {1, 2, 3, 4, 5}
intersection = set1 & set2    # {3}
difference = set1 - set2      # {1, 2}
symmetric_diff = set1 ^ set2  # {1, 2, 4, 5}

Common Patterns

Frequency Counting

Count occurrences of elements.

def count_frequencies(arr):
    """Count frequency of each element."""
    freq = {}
    for num in arr:
        freq[num] = freq.get(num, 0) + 1
    return freq

# Example
arr = [1, 2, 2, 3, 3, 3]
print(count_frequencies(arr))  # {1: 1, 2: 2, 3: 3}

Finding Duplicates

Detect duplicate elements efficiently.

def has_duplicates(arr):
    """Check if array contains duplicates."""
    seen = set()
    for num in arr:
        if num in seen:
            return True
        seen.add(num)
    return False

def find_all_duplicates(arr):
    """Find all duplicate elements."""
    seen = set()
    duplicates = set()

    for num in arr:
        if num in seen:
            duplicates.add(num)
        else:
            seen.add(num)

    return list(duplicates)

Two Sum Problem

Find two numbers that add up to a target.

def two_sum(nums, target):
    """
    Find indices of two numbers that add up to target.

    Time Complexity: O(n)
    Space Complexity: O(n)
    """
    seen = {}  # value -> index

    for i, num in enumerate(nums):
        complement = target - num
        if complement in seen:
            return [seen[complement], i]
        seen[num] = i

    return []

# Example
nums = [2, 7, 11, 15]
target = 9
print(two_sum(nums, target))  # [0, 1]

Grouping Elements

Group elements by a key.

def group_anagrams(words):
    """
    Group words that are anagrams of each other.

    Time Complexity: O(n * k log k) where k is max word length
    Space Complexity: O(n * k)
    """
    groups = {}

    for word in words:
        # Use sorted word as key
        key = ''.join(sorted(word))
        if key not in groups:
            groups[key] = []
        groups[key].append(word)

    return list(groups.values())

# Example
words = ["eat", "tea", "tan", "ate", "nat", "bat"]
print(group_anagrams(words))
# [['eat', 'tea', 'ate'], ['tan', 'nat'], ['bat']]

Caching and Memoization

Store computed results for reuse.

def fibonacci_memo(n, memo=None):
    """
    Compute fibonacci with memoization.

    Time Complexity: O(n)
    Space Complexity: O(n)
    """
    if memo is None:
        memo = {}

    if n in memo:
        return memo[n]

    if n <= 1:
        return n

    memo[n] = fibonacci_memo(n - 1, memo) + fibonacci_memo(n - 2, memo)
    return memo[n]

# Using Python's functools
from functools import lru_cache

@lru_cache(maxsize=None)
def fibonacci_cached(n):
    """Fibonacci with automatic caching."""
    if n <= 1:
        return n
    return fibonacci_cached(n - 1) + fibonacci_cached(n - 2)

Advanced Techniques

Default Dictionary

Automatically initialize missing keys.

from collections import defaultdict

# Auto-initialize with list
graph = defaultdict(list)
graph[1].append(2)  # No need to check if key exists

# Auto-initialize with int (useful for counting)
counter = defaultdict(int)
for item in items:
    counter[item] += 1  # No need to check if key exists

# Auto-initialize with set
groups = defaultdict(set)
groups['category'].add('item')

Counter

Specialized dict for counting.

from collections import Counter

# Count elements
arr = [1, 2, 2, 3, 3, 3]
counter = Counter(arr)
print(counter)  # Counter({3: 3, 2: 2, 1: 1})

# Most common elements
print(counter.most_common(2))  # [(3, 3), (2, 2)]

# Arithmetic operations
counter1 = Counter(['a', 'b', 'c'])
counter2 = Counter(['b', 'c', 'd'])
print(counter1 + counter2)  # Counter({'b': 2, 'c': 2, 'a': 1, 'd': 1})

Ordered Dictionary

Maintains insertion order (Python 3.7+ dicts are ordered by default).

from collections import OrderedDict

# Maintain insertion order
ordered = OrderedDict()
ordered['first'] = 1
ordered['second'] = 2
ordered['third'] = 3

# Move to end
ordered.move_to_end('first')

# Pop in FIFO order
key, value = ordered.popitem(last=False)

Problem Examples

Problem 1: Contains Duplicate

Check if array contains any duplicates.

def contains_duplicate(nums):
    """
    Check if array contains duplicates.

    Time Complexity: O(n)
    Space Complexity: O(n)
    """
    return len(nums) != len(set(nums))

# Alternative: Early termination
def contains_duplicate_early(nums):
    """Check with early termination."""
    seen = set()
    for num in nums:
        if num in seen:
            return True
        seen.add(num)
    return False

Problem 2: First Unique Character

Find first non-repeating character.

def first_unique_char(s):
    """
    Find index of first unique character.

    Time Complexity: O(n)
    Space Complexity: O(1) - at most 26 letters
    """
    # Count frequencies
    freq = {}
    for char in s:
        freq[char] = freq.get(char, 0) + 1

    # Find first unique
    for i, char in enumerate(s):
        if freq[char] == 1:
            return i

    return -1

Problem 3: Longest Consecutive Sequence

Find length of longest consecutive sequence.

def longest_consecutive(nums):
    """
    Find longest consecutive sequence length.

    Time Complexity: O(n)
    Space Complexity: O(n)
    """
    if not nums:
        return 0

    num_set = set(nums)
    max_length = 0

    for num in num_set:
        # Only start counting from sequence starts
        if num - 1 not in num_set:
            current = num
            length = 1

            while current + 1 in num_set:
                current += 1
                length += 1

            max_length = max(max_length, length)

    return max_length

# Example
nums = [100, 4, 200, 1, 3, 2]
print(longest_consecutive(nums))  # 4 (sequence: 1, 2, 3, 4)

Performance Characteristics

Operation Average Case Worst Case
Insert O(1) O(n)
Delete O(1) O(n)
Search O(1) O(n)
Space O(n) O(n)

Note: Worst case occurs with many collisions. Good hash functions make this rare.

When to Use Hash Tables vs Sets

Use Hash Table (Dict) when: - Need to map keys to values - Need to store associated data with keys - Implementing caching or memoization - Grouping or categorizing data

Use Set when: - Only need to track existence (no associated values) - Need set operations (union, intersection, difference) - Removing duplicates - Membership testing

Common Applications

Algorithm Optimization

  • Reducing time complexity from O(n²) to O(n)
  • Two sum, three sum variations
  • Substring/subarray problems

Data Processing

  • Frequency analysis
  • Grouping and categorization
  • Deduplication

Caching

  • Memoization in dynamic programming
  • Result caching
  • LRU cache implementation

Graph Algorithms

  • Visited node tracking
  • Adjacency list representation
  • Cycle detection

Practice Tips

  • Think about what you need to look up quickly
  • Consider using set when you don't need values
  • Remember average O(1) operations make many O(n²) problems O(n)
  • Use Counter for frequency counting problems
  • Use defaultdict to simplify initialization logic
  • Consider memory trade-offs: hash tables use extra space

Common Mistakes

  • Forgetting hash tables use extra space
  • Not considering collision handling in worst case
  • Using unhashable types as keys (lists, dicts)
  • Modifying keys after insertion (use immutable types)
  • Not handling key existence before access (use .get() or in)
  • Assuming ordering in regular dicts (Python < 3.7)
  • Not considering hash function quality for custom objects

Interview Problem Patterns with Hash Tables

Hash tables transform many O(n²) brute force solutions into O(n) optimal solutions. Here are the essential patterns.

Pattern 1: Frequency/Count Tracking

When to use: Need to count occurrences or track presence.

def find_anagrams(s, p):
    """
    Find all anagram starting indices of p in s.

    Example: s = "cbaebabacd", p = "abc"
    Output: [0, 6] - "cba" and "bac" are anagrams

    Time: O(n) - One pass with sliding window
    Space: O(1) - At most 26 letters
    """
    from collections import Counter

    if len(p) > len(s):
        return []

    # Target frequency map
    p_count = Counter(p)
    window_count = Counter()

    result = []

    # Sliding window
    for i in range(len(s)):
        # Add new character to window
        window_count[s[i]] += 1

        # Remove character that left window
        if i >= len(p):
            if window_count[s[i - len(p)]] == 1:
                del window_count[s[i - len(p)]]
            else:
                window_count[s[i - len(p)]] -= 1

        # Check if window is anagram
        if window_count == p_count:
            result.append(i - len(p) + 1)

    return result

Key insight: Hash table makes comparison O(1) instead of O(k) sorting.

Pattern 2: Complement Finding

When to use: Need to find pairs that satisfy a condition.

def three_sum(nums):
    """
    Find all unique triplets that sum to zero.

    Example: [-1, 0, 1, 2, -1, -4]
    Output: [[-1, -1, 2], [-1, 0, 1]]

    Time: O(n²) - O(n²) worst vs O(n³) brute force
    Space: O(n) - Hash table for seen values
    """
    nums.sort()
    result = []

    for i in range(len(nums) - 2):
        # Skip duplicates
        if i > 0 and nums[i] == nums[i-1]:
            continue

        # Two sum on remaining array
        seen = set()
        target = -nums[i]

        for j in range(i + 1, len(nums)):
            complement = target - nums[j]

            if complement in seen:
                triplet = [nums[i], complement, nums[j]]
                result.append(triplet)

                # Skip duplicates
                while j + 1 < len(nums) and nums[j] == nums[j+1]:
                    j += 1

            seen.add(nums[j])

    return result
  • Alternative approaches
    • Brute force: O(n³) - try all triplets
    • Hash table: O(n²) - fix one, two-sum on rest
    • Two pointers: O(n²) - same complexity but O(1) space

Pattern 3: Grouping/Categorization

When to use: Need to group elements by computed property.

def group_shifted_strings(strings):
    """
    Group strings that are shifts of each other.

    Example: ["abc", "bcd", "xyz", "az", "ba"]
    Output: [["abc","bcd","xyz"], ["az","ba"]]

    "abc" -> "bcd": shift each char by 1
    "abc" -> "xyz": shift each char by 23

    Time: O(n * k) where k is max string length
    Space: O(n * k)
    """
    from collections import defaultdict

    def get_pattern(s):
        """
        Compute shift pattern (differences between adjacent chars).

        "abc": (1, 1) - b-a=1, c-b=1
        "bcd": (1, 1) - same pattern!
        "az": (25,) - z-a=25 (wraps around)
        "ba": (25,) - same pattern!
        """
        if not s:
            return ()

        pattern = []
        for i in range(1, len(s)):
            diff = (ord(s[i]) - ord(s[i-1])) % 26
            pattern.append(diff)

        return tuple(pattern)

    groups = defaultdict(list)

    for string in strings:
        pattern = get_pattern(string)
        groups[pattern].append(string)

    return list(groups.values())

Key insight: Hash on computed property, not original value.

Pattern 4: Sliding Window with Constraints

When to use: Need to maintain a window that satisfies condition.

def longest_substring_k_distinct(s, k):
    """
    Longest substring with at most k distinct characters.

    Example: s = "eceba", k = 2
    Output: 3 ("ece" or "eba")

    Time: O(n) - Each char visited at most twice
    Space: O(k) - At most k+1 entries in map
    """
    if k == 0:
        return 0

    char_count = {}
    left = 0
    max_length = 0

    for right in range(len(s)):
        # Expand window: add right character
        char_count[s[right]] = char_count.get(s[right], 0) + 1

        # Shrink window: too many distinct characters
        while len(char_count) > k:
            char_count[s[left]] -= 1
            if char_count[s[left]] == 0:
                del char_count[s[left]]
            left += 1

        # Update maximum
        max_length = max(max_length, right - left + 1)

    return max_length

Hash table advantage: O(1) distinct count vs O(k) with array.

Pattern 5: LRU Cache Implementation

When to use: Need to track recently used items with fast access and eviction.

class LRUCache:
    """
    Least Recently Used cache with O(1) operations.

    Combines hash table + doubly linked list:
    - Hash table: O(1) key lookup
    - Linked list: O(1) move to front / eviction
    """

    class Node:
        def __init__(self, key=0, value=0):
            self.key = key
            self.value = value
            self.prev = None
            self.next = None

    def __init__(self, capacity):
        self.capacity = capacity
        self.cache = {}  # key -> node

        # Dummy head and tail for easier insertion/deletion
        self.head = self.Node()
        self.tail = self.Node()
        self.head.next = self.tail
        self.tail.prev = self.head

    def _add_to_front(self, node):
        """Add node right after head."""
        node.next = self.head.next
        node.prev = self.head
        self.head.next.prev = node
        self.head.next = node

    def _remove_node(self, node):
        """Remove node from linked list."""
        node.prev.next = node.next
        node.next.prev = node.prev

    def get(self, key):
        """
        Get value by key. Mark as recently used.

        Time: O(1)
        """
        if key not in self.cache:
            return -1

        node = self.cache[key]

        # Move to front (most recently used)
        self._remove_node(node)
        self._add_to_front(node)

        return node.value

    def put(self, key, value):
        """
        Add or update key-value pair.

        Time: O(1)
        """
        if key in self.cache:
            # Update existing
            node = self.cache[key]
            node.value = value

            # Move to front
            self._remove_node(node)
            self._add_to_front(node)
        else:
            # Add new
            node = self.Node(key, value)
            self.cache[key] = node
            self._add_to_front(node)

            # Evict if over capacity
            if len(self.cache) > self.capacity:
                # Remove least recently used (node before tail)
                lru = self.tail.prev
                self._remove_node(lru)
                del self.cache[lru.key]

# Example usage
cache = LRUCache(2)
cache.put(1, 1)
cache.put(2, 2)
print(cache.get(1))    # 1
cache.put(3, 3)        # Evicts key 2
print(cache.get(2))    # -1 (not found)
  • Why this works
    • Hash table: O(1) lookup of nodes
    • Doubly linked list: O(1) move to front and eviction
    • Best of both worlds: Fast access + efficient eviction

Pattern 6: Prefix Sum with Hash Table

When to use: Need to find subarrays with specific sum.

def subarray_sum_equals_k(nums, k):
    """
    Count subarrays with sum equal to k.

    Example: nums = [1, 2, 3], k = 3
    Output: 2 - subarrays [1,2] and [3]

    Time: O(n) - One pass
    Space: O(n) - Prefix sum counts
    """
    # Hash table: prefix_sum -> count
    prefix_counts = {0: 1}  # Empty subarray has sum 0

    current_sum = 0
    count = 0

    for num in nums:
        current_sum += num

        # Check if (current_sum - k) exists
        # If yes, found subarray(s) with sum k
        if current_sum - k in prefix_counts:
            count += prefix_counts[current_sum - k]

        # Record current prefix sum
        prefix_counts[current_sum] = prefix_counts.get(current_sum, 0) + 1

    return count

# How it works:
# If prefix_sum[i] - prefix_sum[j] = k
# Then sum(nums[j+1:i+1]) = k
# So we look for: prefix_sum[j] = prefix_sum[i] - k

Key insight: Convert subarray problem to prefix sum lookup.

Time-Space Trade-offs

Understanding when the extra space is worth it:

def compare_approaches(nums, target):
    """Compare space vs time trade-offs."""

    # Approach 1: No extra space
    def two_sum_O1_space(nums, target):
        """
        Time: O(n²)
        Space: O(1)

        Good when: Memory very limited, small array
        """
        for i in range(len(nums)):
            for j in range(i + 1, len(nums)):
                if nums[i] + nums[j] == target:
                    return [i, j]
        return []

    # Approach 2: Hash table
    def two_sum_On_space(nums, target):
        """
        Time: O(n)
        Space: O(n)

        Good when: Array large, many queries
        """
        seen = {}
        for i, num in enumerate(nums):
            if target - num in seen:
                return [seen[target - num], i]
            seen[num] = i
        return []

    # For 1 million elements:
    # O(n²): ~1 trillion operations (~hours)
    # O(n) with hash: ~1 million operations (< 1 second)
    # Extra memory: ~8MB for hash table
    # 
    # Trade-off is almost always worth it!

Further Learning

The patterns above cover the most important hash table techniques for coding interviews and real-world applications. For additional perspectives:

  • Advanced Hash Table Internals: Some resources discuss collision resolution strategies (chaining vs open addressing) and hash function design
  • Language-Specific Optimizations: Different languages have different hash table implementations with varying performance characteristics
  • Theoretical Analysis: Academic sources cover formal analysis of hash table operations and probabilistic guarantees

The key is recognizing when O(1) lookup can eliminate nested loops and understanding the space trade-off is usually worthwhile.