Leetcode 804 "Unique Morse Code Words": Optimized Solution


I over-engineered my solution to this Leetcode easy problem and it's over twice as fast as the accepted optimal solution. I can't find anywhere else that mentions this approach and I thought my solution was cool, so I made a writeup on how it works.

Problem

Link to problem

International Morse Code defines a standard encoding where each letter is mapped to a series of dots and dashes, as follows:

For convenience, the full table for the 26 letters of the English alphabet is given below:

[".-","-...","-.-.","-..",".","..-.","--.","....","..",".---","-.-",".-..","--","-.","---",".--.","--.-",".-.","...","-","..-","...-",".--","-..-","-.--","--.."]

Given an array of strings words where each word can be written as a concatenation of the Morse code of each letter, how many unique Morse code encodings are there?

For example, "cab" can be written as "-.-..--...", which is the concatenation of "-.-.", ".-", and "-...". We will call such a concatenation the transformation of a word. Return the number of different transformations among all words we have.

  1. Example 1

    Input: words = ["gin","zen","gig","msg"]

    Output: 2

    Explanation: The transformation of each word is:

    "gin" -> "--...-."
    "zen" -> "--...-."
    "gig" -> "--...--."
    "msg" -> "--...--."
    

    There are 2 different transformations: "--...-." and "--...--.".

  2. Example 2

    Input: words = ["a"]

    Output: 1

Constraints:

Accepted Optimal Solution

A top ranked solution in terms of speed runs in 0ms on Leetcode's test cases.

Approach:

  1. Convert each input string into morse code
  2. Count the number of unique morse codes by adding them all to a set.
morse = [
    ".-","-...","-.-.","-..",".","..-.","--.",
    "....","..",".---","-.-",".-..","--","-.",
    "---",".--.","--.-",".-.","...","-","..-",
    "...-",".--","-..-","-.--","--.."
]

def toMorse(word: str) -> str:
    charsToMorseCode = [morse[ord(n) - ord('a')] for n in a]
    return ''.join(charsToMorseCode)

def uniqueMorseRepresentations(words: List[str]) -> int:
    morseCodes = set(toMorse(w) for w in words)
    return len(morseCodes)

My Solution

In the accepted solution, the input words go through the following transformations hash(toMorse(word)):

  1. Each character in the word is mapped to its morse code equivalent
  2. The morse code equivalents are concatenated to create the full morse-code encoding.
  3. The full morse code encoding is hashed (when added to the set)

My solution proposes that we can compute the hash code of the morse-code encoding directly from the original string (ie hashMorseEncoding(word)). We don't have to first translate the word into morse code.

I'm framing this mathematically because algebra is necessary for my argument.

Definitions

Let s be a string of characters, s=len(s)f ⁣:CharString,F(s)=f(s[0])f(s[1])f(s[s1])hash(s)=i=0s1s[i]pimodm\begin{aligned} &\text{Let } s \text{ be a string of characters, } |s| = \operatorname{len}(s) \\ &f \colon \text{Char} \to \text{String}, \quad F(s) = f(s[0]) \mathbin{\|} f(s[1]) \mathbin{\|} \cdots \mathbin{\|} f(s[|s|-1]) \\ &\operatorname{hash}(s) = \sum_{i=0}^{|s|-1} s[i] \cdot p^i \bmod m \end{aligned}

In our case, f(c) is the mapping from letter to morse code. The operator denotes string concatenation.

Note that hash(s) is the polynomial rolling hash function.

These definitions are the same as:

f = dict() # char -> str

def F(s: str) -> str:
    return "".join(f[c] for c in s)

def hash(s: str) -> int:
    output = 0
    pow = 1
    for i, c in enumerate(s):
        v = 2 if c == '-' else 1
        output = (output + v * pow) % m
        pow *= p
    return output

Theorem

hash(F(s))=i=0s1hash(f(s[i]))pj=0i1f(s[j])modm\operatorname{hash}(F(s)) = \sum_{i=0}^{|s|-1} \operatorname{hash}(f(s[i])) \cdot p^{\sum_{j=0}^{i-1} |f(s[j])|} \bmod m
def superPolynomialHash(s: str) -> int:
    acc = 0
    for i in range(len(s)):
        pow = sum(len(f(c)) for c in s[:i])
        acc += hash(f(s[i])) * p ** pow
    return acc

We can compute hash(F(s)) without ever having to call F(s) by using the expression on the right-hand side.

Proof

Suppose we want hash(F("abc")). Encoded in Morse code,

s=".-""-...""-.-."=".--...-.-."s' = \texttt{".-"} \mathbin{\|} \texttt{"-..."} \mathbin{\|} \texttt{"-.-."} = \texttt{".--...-.-."}
s=F(s)=f("a")f("b")f("c")=sasbsc=(sa,0sa,1)(sb,0sb,1sb,2sb,3)(sc,0sc,4)=s0s1s2s3s4s5s6s10hash(s)=s0+s1p+s2p2+s3p3+s4p4+s5p5+s6p6+s7p7+s8p8+s9p9+s10p10\begin{aligned} s' &= F(s) = \textcolor{red}{f(\text{"a"})} \mathbin{\|} \textcolor{blue}{f(\text{"b"})} \mathbin{\|} \textcolor{orange}{f(\text{"c"})} = \textcolor{red}{s_a} \mathbin{\|} \textcolor{blue}{s_b} \mathbin{\|} \textcolor{orange}{s_c} \\ &= \textcolor{red}{(s_{a,0} \mathbin{\|} s_{a,1})} \mathbin{\|} \textcolor{blue}{(s_{b,0} \mathbin{\|} s_{b,1} \mathbin{\|} s_{b,2} \mathbin{\|} s_{b,3})} \mathbin{\|} \textcolor{orange}{(s_{c,0} \mathbin{\|} \cdots \mathbin{\|} s_{c,4})} \\ &= \textcolor{red}{s'_0 \mathbin{\|} s'_1} \mathbin{\|} \textcolor{blue}{s'_2 \mathbin{\|} s'_3 \mathbin{\|} s'_4 \mathbin{\|} s'_5} \mathbin{\|} \textcolor{orange}{s'_6 \mathbin{\|} \cdots \mathbin{\|} s'_{10}} \\ \operatorname{hash}(s') &= \textcolor{red}{s'_0 + s'_1 p} + \textcolor{blue}{s'_2 p^2 + s'_3 p^3 + s'_4 p^4 + s'_5 p^5} + \textcolor{orange}{s'_6 p^6 + s'_7 p^7 + s'_8 p^8 + s'_9 p^9 + s'_{10} p^{10}} \end{aligned}

Each term in hash(s') is one Morse symbol. Group them by which letter they belong to (the first two belong to "a") and factor:

hash(s)=(s0+s1p)+(s2+s3p+s4p2+s5p3)p2+(s6+s7p+s8p2+s9p3+s10p4)p6=(sa,0+sa,1p)+(sb,0+sb,1p+sb,2p2+sb,3p3)p2+(sc,0+sc,1p+sc,2p2+sc,3p3+sc,4p4)p6\begin{aligned} \operatorname{hash}(s') &= \textcolor{red}{(s'_0 + s'_1 p)} + \textcolor{blue}{(s'_2 + s'_3 p + s'_4 p^2 + s'_5 p^3)} p^2 + \textcolor{orange}{(s'_6 + s'_7 p + s'_8 p^2 + s'_9 p^3 + s'_{10} p^4)} p^6 \\ &= \textcolor{red}{(s_{a,0} + s_{a,1} p)} + \textcolor{blue}{(s_{b,0} + s_{b,1} p + s_{b,2} p^2 + s_{b,3} p^3)} p^2 + \textcolor{orange}{(s_{c,0} + s_{c,1} p + s_{c,2} p^2 + s_{c,3} p^3 + s_{c,4} p^4)} p^6 \end{aligned}

Those grouped expressions are:

hash(f("a"))=sa,0+sa,1phash(f("b"))=sb,0+sb,1p+sb,2p2+sb,3p3hash(f("c"))=sc,0+sc,1p+sc,2p2+sc,3p3+sc,4p4\begin{aligned} \textcolor{red}{\operatorname{hash}(f(\text{"a"}))} &= \textcolor{red}{s_{a,0} + s_{a,1} p} \\ \textcolor{blue}{\operatorname{hash}(f(\text{"b"}))} &= \textcolor{blue}{s_{b,0} + s_{b,1} p + s_{b,2} p^2 + s_{b,3} p^3} \\ \textcolor{orange}{\operatorname{hash}(f(\text{"c"}))} &= \textcolor{orange}{s_{c,0} + s_{c,1} p + s_{c,2} p^2 + s_{c,3} p^3 + s_{c,4} p^4} \end{aligned}

Substituting back:

hash(s)=hash(f("a"))+hash(f("b"))p2+hash(f("c"))p6\operatorname{hash}(s') = \textcolor{red}{\operatorname{hash}(f(\text{"a"}))} + \textcolor{blue}{\operatorname{hash}(f(\text{"b"}))} p^2 + \textcolor{orange}{\operatorname{hash}(f(\text{"c"}))} p^6

The exponent on each p^n term is the number of Morse symbols before that letter in s'.

Unoptimized Implementation

We can use the literal code translation to implement this, but this solution will be very slow (in fact, slower than the original!).

def hash(s: str) -> int:
    output = 0
    pow = 1
    for i, c in enumerate(s):
        v = 2 if c == '-' else 1
        output = (output + v * pow) % m
        pow *= p
    return output

def superPolynomialHash(s: str) -> int:
    acc = 0
    for i in range(len(s)):
        pow = sum(len(f(c)) for c in s[:i])
        acc += hash(f(s[i])) * p ** pow
    return acc

def uniqueMorseRepresentations(words: List[str]) -> int:
    hashes = { superPolynomialHash(w) for w in words }
    return len(hashes)

Optimized Solution Part 1 — Precomputing

Before running the actual solution, compute certain values which we will hard-code into the actual solution.

morse_codes = [
    ".-","-...","-.-.","-..",".","..-.","--.",
    "....","..",".---","-.-",".-..","--","-.",
    "---",".--.","--.-",".-.","...","-","..-",
    "...-",".--","-..-","-.--","--.."
]

m = 2 ** 31 - 1
p = 3

lengths = [len(mc) for mc in morse_codes]
print("lengths:", lengths)

max_code_length = max(lengths)
p_powers = [p ** i for i in range(max_code_length + 1)]
print("powers of p:", p_powers)


def polynomialHash(s: str) -> int:
    output = 0
    pow = 1
    for i, c in enumerate(s):
        v = 2 if c == '-' else 1
        output = (output + v * pow) % m
        pow *= p
    return output

hashes = [polynomialHash(mc) for mc in morse_codes]
print("hashes:", hashes)
lengths: [2, 4, 4, 3, 1, 4, 3, 4, 2, 4, 3, 4, 2, 2, 3, 4, 4, 3, 3, 1, 3, 4, 3, 4, 4, 4]
powers of p: [1, 3, 9, 27, 81]
hashes: [7, 41, 50, 14, 1, 49, 17, 40, 4, 79, 23, 43, 8, 5, 26, 52, 71, 16, 13, 2, 22, 67, 25, 68, 77, 44]

Optimized Solution Part 2 — Solution

lengths = [2, 4, 4, 3, 1, 4, 3, 4, 2, 4, 3, 4, 2, 2, 3, 4, 4, 3, 3, 1, 3, 4, 3, 4, 4, 4]
hashes = [7, 41, 50, 14, 1, 49, 17, 40, 4, 79, 23, 43, 8, 5, 26, 52, 71, 16, 13, 2, 22, 67, 25, 68, 77, 44]

P = [1, 3, 9, 27, 81]
m = 2 ** 31 - 1

def superPolynomialHash(s: str) -> int:
    output = 0
    for i, c in reversed(s):
        ii = ord(c) - ord('a')
        l = lengths[ii]
        output = (output * P[l] + hashes[ii]) % m
    return output


def uniqueMorseRepresentations(words: List[str]) -> int:
    hashes = { superPolynomialHash(w) for w in words }
    return len(hashes)

Drawback

There are a lot of hash collisions with the p and m I chose, but it didn't seem to matter with the test cases Leetcode had.

Time and Space Complexity

N = len(words)

M = max(word[i].length)

H = maximum length of a string f(c) can output (4 in this case)

L = # of different letters that can be in a word (26 in this case)

SolutionTimeSpace
AcceptedO(NMH)O(HL)
My unoptimizedO(NM(M+H+log(H))) = O(NM^2H)O(L)
My optimizedO(NM)O(L)

Comparison

The Leetcode test cases are not very large so it's very easy to cap out at 0ms. I ran my own test cases 1,000,000 times.

CaseMineTop Solution
Default test case.096s.236s
arr[i].len > 1000040.59s96.18s