Leetcode 804 "Unique Morse Code Words": Optimized Solution

I over-engineered my solution to this Leetcode easy problem and it's over twice as fast as the accepted optimal solution. I can't find anywhere else that mentions this approach and I thought my solution was cool, so I made a writeup on how it works.

Problem

Link to problem

International Morse Code defines a standard encoding where each letter is mapped to a series of dots and dashes, as follows:

For convenience, the full table for the 26 letters of the English alphabet is given below:

[".-","-...","-.-.","-..",".","..-.","--.","....","..",".---","-.-",".-..","--","-.","---",".--.","--.-",".-.","...","-","..-","...-",".--","-..-","-.--","--.."]

Given an array of strings words where each word can be written as a concatenation of the Morse code of each letter, how many unique Morse code encodings are there?

For example, "cab" can be written as "-.-..--...", which is the concatenation of "-.-.", ".-", and "-...". We will call such a concatenation the transformation of a word. Return the number of different transformations among all words we have.

Example 1:

Input: words = ["gin","zen","gig","msg"]
Output: 2
Explanation: The transformation of each word is:

"gin" -> "--...-."
"zen" -> "--...-."
"gig" -> "--...--."
"msg" -> "--...--."

There are 2 different transformations: "--...-." and "--...--.".

Example 2:

Input: words = ["a"]
Output: 1


Constraints:

Accepted Optimal Solution

A top ranked solution in terms of speed runs in 0ms on Leetcode's test cases.

Approach:

  1. Convert each input string into morse code
  2. Count the number of unique morse codes by adding them all to a set.
morse = [
    ".-","-...","-.-.","-..",".","..-.","--.",
    "....","..",".---","-.-",".-..","--","-.",
    "---",".--.","--.-",".-.","...","-","..-",
    "...-",".--","-..-","-.--","--.."
]

def toMorse(word: str) -> str:
    charsToMorseCode = [morse[ord(n) - ord('a')] for n in a]
    return ''.join(charsToMorseCode)

def uniqueMorseRepresentations(words: List[str]) -> int:
    morseCodes = set(toMorse(w) for w in words)
    return len(morseCodes)

My Solution

In the accepted solution, the input words go through the following transformations hash(toMorse(word)):

  1. Each character in the word is mapped to its morse code equivalent
  2. The morse code equivalents are concatenated to create the full morse-code encoding.
  3. The full morse code encoding is hashed (when added to the set)

My solution proposes that we can compute the hash code of the morse-code encoding directly from the original string (ie hashMorseEncoding(word)). We don't have to first translate the word into morse code.

Theory

I'm framing this mathematically because algebra is necessary for my argument.

Definitions

Let s be a string of character, s=len(s) \text{Let $s$ be a string of character, $|s| = len(s)$}f:CharactersStrings, F(s)=i=0s1f(s[i]) f: Characters \rightarrow Strings, \ F(s) = \prod_{i=0}^{|s|-1} f(s[i])hash(s)=s[0]+s[1]p+s[2]p2+...+s[n1]pn1    mod m hash(s) = s[0] + s[1]*p + s[2]*p^2 + ... + s[n - 1] * p^{n-1} \space\space\space\space mod \space m=i=0n1s[i]pi    mod m = \sum_{i=0}^{n-1}s[i]* p^i \space\space\space\space mod \space m

In our case, f(c) is the mapping from letter to morse code.

Note that hash(s) is just the polynomial rolling hash function.

These definitions are the same as:

f = dict() # char -> str

def F(s: str) -> str:
    return "".join(f[c] for c in s)

def hash(s: str) -> int:
    output = 0
    pow = 1
    for i, c in enumerate(s):
        v = 2 if c == '-' else 1
        output = (output + v * pow) % m
        pow *= p
    return output

Theorem

hash(F(s))=i=0s1hash(f(s[i]))pj=0i1f(s[j])    mod m hash(F(s)) = \sum_{i=0}^{|s|-1}hash(f(s[i])) * p^{ \sum_{j=0}^{i-1}|f(s[j]) } \ \ \ \ mod \ m

def superPolynomialHash(s: str) -> int:
    acc = 0
    for i in range(len(s)):
        pow = sum(len(f(c)) for c in s[:i])
        acc += hash(f(s[i])) * p ** pow
    return acc

We can compute hash(F(s)) without ever having to call F(s) by using the expression on the right-hand side.

Proof

Suppose we wanted to find hash(F("abc")). Encoded in morse code, it is s' = ".-" + "-..." + "-.-." = ".--...-.-."

s=F(s)=f("a")+f("b")+f("c")=sa+sb+sc s' = F(s) = \textcolor{red}{f(\text{"a"})} + \textcolor{blue}{f(\text{"b"})} + \textcolor{orange}{f(\text{"c"})} = \textcolor{red}{s_a} + \textcolor{blue}{s_b} + \textcolor{orange}{s_c}=(sa[0]+sa[1])+(sb[0]+sb[1]+sb[2]+sb[3])+(sc[0]+sc[1]+sc[2]+sc[3]+sc[4]) = \textcolor{red}{(s_a[0] + s_a[1])} + \textcolor{blue}{(s_b[0] + s_b[1] + s_b[2] + s_b[3])} + \textcolor{orange}{(s_c[0] + s_c[1] + s_c[2] + s_c[3] + s_c[4])}=s[0]+s[1]+s[2]+s[3]+s[4]+s[5]+s[6]+s[7]+s[8]+s[9]+s[10] = \textcolor{red}{s'[0] + s'[1]} + \textcolor{blue}{s'[2] + s'[3] + s'[4] + s'[5]} + \textcolor{orange}{s'[6] + s'[7] + s'[8] + s'[9] + s'[10]}hash(s)=s[0]+s[1]p+s[2]p2+s[3]p3+s[4]p4+s[5]p5+s[6]p6+s[7]p7+s[8]p8+s[9]p9+s[10]p10 hash(s') = \textcolor{red}{s'[0] + s'[1]p} + \textcolor{blue}{s'[2]p^2 + s'[3]p^3 + s'[4]p^4 + s'[5]p^5} + \textcolor{orange}{s'[6]p^6 + s'[7]p^7 + s'[8]p^8 + s'[9]p^9 + s'[10]p^{10}}

Each term in hash(s') represents a single char. Group them by which char they belong to (ex. first two belong to "a") and factor them.

hash(s)=(s[0]+s[1]p)+(s[2]+s[3]p+s[4]p2+s[5]p3)p2+(s[6]+s[7]p+s[8]p2+s[9]p3+s[10]p4)p6 hash(s') = (\textcolor{red}{s'[0] + s'[1]*p}) + (\textcolor{blue}{s'[2] + s'[3]*p + s'[4]*p^2 + s'[5]*p^3})*p^2 + (\textcolor{orange}{s'[6] + s'[7]*p + s'[8]*p^2 + s'[9]*p^3 + s'[10]*p^4})p^6=(sa[0]+sa[1]p)+(sb[0]+sb[1]p+sb[2]p2+sb[3]p3)p2+(sc[0]+sc[1]p+sc[2]p2+sc[3]p3+sc[3]p4)p6 = (\textcolor{red}{s_a[0] + s_a[1]*p}) + (\textcolor{blue}{s_b[0] + s_b[1]*p + s_b[2]*p^2 + s_b[3]*p^3})*p^2 + (\textcolor{orange}{s_c[0] + s_c[1]*p + s_c[2]*p^2 + s_c[3]*p^3 + s_c[3]*p^4})p^6

Note that these groups expressions are equivalent to:

hash(f("a"))=sa[0]+sa[1]p \textcolor{red}{hash(f(\text{"a"})) = s_a[0] + s_a[1]*p}hash(f("b"))=sb[0]+sb[1]p+sb[2]p2+sb[3]p3 \textcolor{blue}{hash(f(\text{"b"})) = s_b[0] + s_b[1]*p + s_b[2]*p^2 + s_b[3]*p^3}hash(f("c"))=sc[0]+sc[1]p+sc[2]p2+sc[3]p3+sc[3]p4 \textcolor{orange}{hash(f(\text{"c"})) = s_c[0] + s_c[1]*p + s_c[2]*p^2 + s_c[3]*p^3 + s_c[3]*p^4}

So now we can swap these into the original equation like so:

hash(s)=hash(f("a"))+hash(f("b"))p2+hash(f("c"))p6 hash(s') = \textcolor{red}{hash(f(\text{"a"}))} + \textcolor{blue}{hash(f(\text{"b"}))}p^2 + \textcolor{orange}{hash(f(\text{"c"}))}p^6

It should be apparent that the n in the p^n factor for each term is the number of characters before it in the string.

Code

Unoptimized Implementation

We can use the literal code translation to implement this, but this solution will be very slow (in fact, slower than the original!).

def hash(s: str) -> int:
    output = 0
    pow = 1
    for i, c in enumerate(s):
        v = 2 if c == '-' else 1
        output = (output + v * pow) % m
        pow *= p
    return output

def superPolynomialHash(s: str) -> int:
    acc = 0
    for i in range(len(s)):
        pow = sum(len(f(c)) for c in s[:i])
        acc += hash(f(s[i])) * p ** pow
    return acc

def uniqueMorseRepresentations(words: List[str]) -> int:
    hashes = { superPolynomialHash(w) for w in words }
    return len(hashes)

Optimized Solution Part 1 - Precomputing

Before running the actual solution, compute certain values which we will hard-code into the actual solution.

morse_codes = [
    ".-","-...","-.-.","-..",".","..-.","--.",
    "....","..",".---","-.-",".-..","--","-.",
    "---",".--.","--.-",".-.","...","-","..-",
    "...-",".--","-..-","-.--","--.."
]

m = 2 ** 31 - 1
p = 3

lengths = [len(mc) for mc in morse_codes]
print("lengths:", lengths)

max_code_length = max(lengths)
p_powers = [p ** i for i in range(max_code_length + 1)]
print("powers of p:", p_powers)


def polynomialHash(s: str) -> int:
    output = 0
    pow = 1
    for i, c in enumerate(s):
        v = 2 if c == '-' else 1
        output = (output + v * pow) % m
        pow *= p
    return output

hashes = [polynomialHash(mc) for mc in morse_codes]
print("hashes:", hashes)
lengths: [2, 4, 4, 3, 1, 4, 3, 4, 2, 4, 3, 4, 2, 2, 3, 4, 4, 3, 3, 1, 3, 4, 3, 4, 4, 4]
powers of p: [1, 3, 9, 27, 81]
hashes: [7, 41, 50, 14, 1, 49, 17, 40, 4, 79, 23, 43, 8, 5, 26, 52, 71, 16, 13, 2, 22, 67, 25, 68, 77, 44]

Optimized Solution Part 2 - Solution

lengths = [2, 4, 4, 3, 1, 4, 3, 4, 2, 4, 3, 4, 2, 2, 3, 4, 4, 3, 3, 1, 3, 4, 3, 4, 4, 4]
hashes = [7, 41, 50, 14, 1, 49, 17, 40, 4, 79, 23, 43, 8, 5, 26, 52, 71, 16, 13, 2, 22, 67, 25, 68, 77, 44]

P = [1, 3, 9, 27, 81]
m = 2 ** 31 - 1

def superPolynomialHash(s: str) -> int:
    output = 0
    for i, c in reversed(s):
        ii = ord(c) - ord('a')
        l = lengths[ii]
        output = (output * P[l] + hashes[ii]) % m
    return output


def uniqueMorseRepresentations(words: List[str]) -> int:
    hashes = { superPolynomialHash(w) for w in words }
    return len(hashes)

Drawback

There are a lot of hash collisions with the p and m I chose, but it didn't seem to matter with the test cases Leetcode had.

Time and Space Complexity

N = len(words)

M = max(word[i].length)

H = maximum length of a string f(c) can output (4 in this case)

L = # of different letters that can be in a word (26 in this case)

Accepted Solution: O(NMH) time O(HL) space

My unoptimized solution: O(NM(M+H+log(H))) = O(NM^2H) time O(L) space`

My optimized solution: O(NM) time O(L) space

Comparison

The Leetcode test cases are not very large so it's very easy to cap out at 0ms. I ran my own test cases 1,000,000 times.

| Case               | Mine   | Top Solution |
| ------------------ | ------ | ------------ |
| Default test case  | .096s  | .236s        |
| arr[i].len > 10000 | 40.59s | 96.18s       |