About usWhy usInstructorsReviewsCostFAQContactBlogRegister for Webinar

Word Break Problem

Given a dictionary of words and a string, find the number of ways the string can be broken down into the dictionary words. Return the answer modulo 10^9 + 7.

Example

Input: Dictionary: [“kick", "start", "kickstart", "is", "awe", "some", "awesome”]. String: “kickstartisawesome”.

Output: 4

Here are all four ways to break down the string into the dictionary words:

1) kick start is awe some

2) kick start is awesome

3) kickstart is awe some

4) kickstart is awesome

4 % 1000000007 = 4 so the correct output is 4.

Notes:

Input Parameters: Function has two parameters. 1) an array of strings, the dictionary, 2) a string to be broken down in dictionary words.

Output: Return an integer, the number of distinct ways to break down the string into the dictionary words modulo 10^9 + 7.

Constraints:

● 1

● 1

● 1

● Dictionary words and the string all consist of lowercase latin characters (no whitespace, in particular).

Solutions

We've provided three solutions.

1) brute_force_solution

The brute force approach is pretty simple. Let’s start at the index 0 of the given string and check all the prefixes (of lengths 1, 2 and so on) if they are found in the dictionary. Once we find one that is, we can take note of that and do the same thing on the remaining part of the string (recursion can be used to “do the same thing”). Doing so if we end up with an empty string then it is a success. We will then backtrack our way and keep storing the segments we made.

Time Complexity:

O(dictionaryCount * len * 2^len) where dictionaryCount is the number of words in the dictionary and len is the length of the given input string txt.

To visualize this upper bound on the time complexity let’s think this way. In the worst case, there are O(2^len)  arrangements possible. For each arrangement, the total cost is the cost of building the arrangement + checking if all substrings are present in the dictionary. Length of any arrangement will be O(len), hence cost for building any arrangement will be O(len). Now let's see the cost of checking if substrings are present in the dictionary or not. Suppose in ith arrangement = txt1 + " " + txt2 + " " + txt3 + ... + txtx. Now for each substring txt1, txt2,...txtx we are checking, if all these substrings exist in our dictionary. To do so, we are linearly searching for our substring in the dictionary. For txt1 it will take O(dictionaryCount * len(txt1)) time ... for txtx it will take O(dictionaryCount * len(txtx)) time. So in total, it will take O(dictionaryCount * len(txt1)) + ... + O(dictionaryCount * len(txtx)) time. Now len(txt1) + ... + len(txtx) = len which is total length of txt. Hence total time for checking if substrings are present in the dictionary or not will be O(dictionaryCount * len). So, the total time for processing part will be O(dictionaryCount * len * 2^len).

Auxiliary Space Used:

O(len) where len is the length of the input string txt.

Talking the auxiliary space used by this approach. We aren’t using any temporary container to store the dictionary words. But the recursive backtrack algorithm will acquire O(len) stack size in the worst case and hence, the auxiliary memory used is O(len).

Space Complexity:

O(len + dictionaryCount * max_len) where dictionaryCount is the number of words in the dictionary, len is the length of the given input string txt and max_len is the max length of a word in the dictionary.

The initial input space of the program is dominated by the dictionary array i.e. O(dictionaryCount * max_len),  here max_len is the max length of the word in our dictionary. So, our space complexity is auxiliary space plus the input space O(len + dictionaryCount * max_len).


const int MOD = 1e9 + 7;
void countAllArrangements(string &txt, vector &dict, string &trailingPrefix,
                          int &arrangements)
{

    int len = txt.length();
    // breaking condition, as now we have reached the end
    // of the string
    if (len == 0)
    {
        // adding the segmentation arrangement to the result
        arrangements++;
        if (arrangements >= MOD)
        {
            arrangements %= MOD;
        }
        return;
    }

    // looping on all substring of the string txt starting from index 0
    for (int i = 0; i < txt.length(); i++)
    {
        // current segment ( substring [0,i] )
        string segment = txt.substr(0, i + 1);
        // checking if it exists in the dictionary
        if (find(dict.begin(), dict.end(), segment) != dict.end())
        {
            // appending current segment to the trailing prefix so far
            string newPrefix = trailingPrefix + segment + " ";
            segment = txt.substr(i + 1, txt.length() - i - 1);
            // recursively computing for substring of txt [i+1,len)
            countAllArrangements(segment, dict, newPrefix, arrangements);
        }
    }
}

int wordBreakCount(vector &dictionary, string &txt)
{
    // container to store all segmentations
    int arrangements = 0;
    string prefix = "";
    // computing all the arrangements
    countAllArrangements(txt, dictionary, prefix, arrangements);
    // return computed arrangements
    return arrangements;
}


2) other_solution:

Just by thinking the brute force way, we could have seen that we were recalculating many results over again and hence tells us that the problem has overlapping subproblems property. Consider the following representation for better visualization of overlapping cases:

{ hellomynameisjack }

hello { mynameisjack }

hello my { nameisjack }

hello my name { isjack }

hello my name is { jack }

hello my name is jack { }

We can see that at every step we are solving the same problem over again for the string in the curly braces.

Also, one thing worth noticing is that the segmentation for each text in the curly braces can be appended with its preceding substring. So, this also tells that the problem has an optimal substructure.

Hence, we can use dynamic programming to solve this problem.

So, let’s first calculate the number of different segmentations possible for the given string txt.

We will deal with the DP solution in steps:

Step 1: Decide if it is DP Problem

We have already done this.

Step 2: Define the State of the DP

Let’s start thinking of the states. It is clear that we need to uniquely identify a substring through our state. So, we can take two integers denoting the start and end point of our substring. 

Hence, now our DP state will be a 2-dimensional array.

So, our DP is something like DP[start][end]. Here, DP[start][end] tells us the number of segmented arrangements possible for the substring starting at start and ending at the end. Hence, if we want to know the total arrangements possible for our txt string, we simply ask the DP[0][len-1] state (len is the length of txt string).

Wait, let’s take a break and re-examine our state. Do we really need the end index in our state? No, we can directly omit it to redefine our DP state as DP[start], where DP[start] tells us the number of total segmented arrangements possible for substring starting at start index and ending at length(txt)-1. So, if we now want to know the total number of arrangements for the string txt, we simply go and ask the DP state DP[0]. Take a breath and re-read this current paragraph until you understand the significance of it.

By doing this, We reduced our state space from 2 dimensions to a linear dimension. This process is called state-space reduction.


Step 3: Deciding State Transaction

So far now we have a state like DP[start] which tells us the number of all segmented arrangements possible for the substring starting at index 0 and ending at length(txt)-1. 

Now, let’s see how we can compute value of state(idx)

Here, state(idx) means the same as DP[idx]. 

You can consider state(idx) as the computation and DP[idx] as the result of the computation.

Like the brute approach, we will keep on generating all the substring that starts at idx and check if it exists in the dictionary. If it exists in the dictionary we solve the same problem again for the remaining string

Pseudo code:

state(idx):

    result = 0

    for end = idx to length(txt)-1:

        If substring[idx,end] exists in Dictionary:

            result += state(end+1)    

Above was the pseudo code for the state transition. The variable result stores the total number of arrangements for the state(idx).

Step 4: Adding Memorization:

Do you find anything fishy in our state transitions? Yes, we were over again recalculating for each state. Hence, we apply memoization and store the computed value of each state so that next time if we need that state we do not recalculate it but efficiently look it up.

We do that by simply storing the value of the result variable in our DP[idx] array. Next time the state info is required we first check our DP[] array for that state if it has the result we simply return it or else we compute it, memorize it and then return it.

Here we store all the dictionary words in an unordered_set (HashSet) for fast lookup whether a string is part of the given list of the dictionary or not.

Time Complexity:

O(len^3 + dictionaryCount * max_len) where dictionaryCount is the number of words in the dictionary, len is the length of the given input string txt and max_len is the maximum length of a word in the dictionary.

As there are only len states and we are visiting all len states only once because of memoization. Now at each state, we are first generating all its substring starting from the leftmost index and then checking its presence in the dictionary. This process is O(len^2). (There are O(len) substrings starting from the leftmost index, and the length of each substring is O(len).) So, the time complexity of this DP solution will be: the number of states * computation time of each state i.e. len * len^2. Hence, it summarizes to O(len^3) plus the extra time taken to insert the words of list dictionary into the HashSet i.e. O(dictionaryCount * max_len). 

Auxiliary Space Used:

O(len + dictionaryCount * max_len) where dictionaryCount is the number of words in the dictionary, len is the length of the given input string txt and max_len is the max length of a word in the dictionary.

The auxiliary space complexity is the space taken by the DP memoization table i.e. O(len) + space taken by the dictionary hash set i.e. O(dictionaryCount * max_len). 

Space Complexity:

O(len + dictionaryCount * max_len) where dictionaryCount is the number of words in the dictionary, len is the length of the given input string txt and max_len is the max length of a word in the dictionary.

The total space complexity is the auxiliary space plus the input space O(len + dictionaryCount * max_len) + O(dictionaryCount * max_len) ~ O(len + dictionaryCount * max_len)


long long MOD = 1e9 + 7;

int solve(int idx, unordered_set &dict, string &text, int *dp) {
    int len = text.length();

    // break condition if no substring possible
    if (idx == len) {
        return 1;
    }

    // return if already memoized current state
    if (dp[idx] != -1) {
        return dp[idx];
    }

    // stores computation result for current state
    long long result = 0;

    // initializing current segment starting from idx
    string segment = "";
    for (int i = idx; i < len; i++) {
        // adding ith character to the segment
        segment.push_back(text[i]);

        // checking if segment exists in the dictionary
        if (dict.find(segment) != dict.end()) {
            // segmenting at this index
            // and getting number of possible segmentation arrangements
            // for  substring [i+1,len)
            long long numOfArrangements = solve(i + 1, dict, text, dp);
            // update result for current state
            result += numOfArrangements;
            result %= MOD;
        }
    }

    // memoize current state , so can be reused next time
    dp[idx] = result;

    // return current computed state
    return dp[idx];
}

int wordBreakCount(vector &dictionary, string &txt) {
    // hashmap to store to dictionary words
    unordered_set dict;

    // adding all dictionary words to an hash map
    // for faster searching.
    for (int i = 0; i < dictionary.size(); i++) {
        dict.insert(dictionary[i]);
    }

    const int MAXN = txt.length() + 1;
    // initializing the dp states
    // here -1 means uncomputed state
    int dp[MAXN];
    std::fill_n(dp, MAXN, -1);

    // check if segmentation is possible
    int totalArrangements = solve(0, dict, txt, dp);

    return totalArrangements;
}


3) optimal_solution:

Here we will use the same approach as used in the other solution, but instead of using HashSet to store all the dictionary words, we will use a prefix tree or trie. The trie data structure will guide us to pick the next character while building the substring and will prune the search at an early stage if no matching substring is found. Looking up of substring s (Index 0 to length(s)-1) in maintained HashSet will take more time in comparison to trie because we already have information of node which is matching with s[length(s)-2] and only needs to check for single character to find out whether the complete substring exists in the trie (i.e. given list dictionary) or not. 

Time Complexity:

O(len^2 + dictionaryCount * max_len) where dictionaryCount is the number of words in the dictionary, len is the length of the given input string txt and max_len is the max length of a word in the dictionary.

Same explanation as other_solution except for one thing: for each state calculation will not take O(len^2) but instead of that, it takes O(len). Hence our total complexity for this solution will be the number of states * computation time of each state i.e. len * O(len). Hence, it summarizes to O(len^2) + the extra time taken to insert the words of list dictionary into the trie i.e. O(dictionaryCount * max_len).

Auxiliary Space Used:

O(len + dictionaryCount * max_len) where dictionaryCount is the number of words in the dictionary, len is the length of the given input string txt and max_len is the max length of a word in the dictionary.

The auxiliary space complexity is the space taken by the DP memoization table i.e. O(len) plus space is taken by the dictionary trie i.e. O(dictionaryCount * max_len). 

Space Complexity:

O(len + dictionaryCount * max_len) where dictionaryCount is the number of words in the dictionary, len is the length of the given input string txt and max_len is the max length of a word in the dictionary.

The total space complexity is the auxiliary space + input space + output space. O(len + dictionaryCount * max_len) + O(dictionaryCount * max_len) + O(1) = O(len + dictionaryCount * max_len).


// TrieNode structure
struct TrieNode {
    // marks true if current node is end of word in trie
    bool isEnd;
    // stores references to all its children
    TrieNode *child[26];
    // paramtrized constructor
    TrieNode() : isEnd(false)
    {
        for (int i = 0; i < 26; i++) {
            child[i] = NULL;
        }
    }
};

// inserts key "k" in the trie
void insert(TrieNode *root, string &k) {
    TrieNode *tmp = root;
    int l = k.length();
    for (int i = 0; i < l; i++) {
        int idx = k[i] - 'a';
        if (!tmp->child[idx]) {
            tmp->child[idx] = new TrieNode();
        }
        tmp = tmp->child[idx];
    }
    tmp->isEnd = true;
}

int MOD = 1e9 + 7;

int solve(int idx, string &text, int *dp, TrieNode *root) {
    int len = text.length();

    // break condition if no substring possible
    if (idx == len) {
        return 1;
    }

    // return if already memoized current state
    if (dp[idx] != -1) {
        return dp[idx];
    }

    // stores computation result for current state
    long long result = 0;

    // initializing current segment starting from idx

    TrieNode *segment = root;
    for (int i = idx; i < len; i++) {
        if (segment->child[text[i] - 'a'] != NULL) {
            segment = segment->child[text[i] - 'a'];
        }
        else {
            break;
        }

        // checking if segment exists in the dictionary
        if (segment->isEnd == true) {
            // segmenting at this index
            // and getting number of possible segmentation arrangements
            // for  substring [i+1,len)
            long long numOfArrangements = solve(i + 1, text, dp, root);
            // update result for current state
            result += numOfArrangements;
            result %= MOD;
        }
    }

    // memoize current state , so can be reused next time
    dp[idx] = result;

    // return current computed state
    return dp[idx];
}

int wordBreakCount(vector &dictionary, string &txt) {
    TrieNode *root = new TrieNode();

    // adding all dictionary words to an hash map
    // for faster searching.
    for (int i = 0; i < dictionary.size(); i++) {
        insert(root, dictionary[i]);
    }

    const int MAXN = txt.length() + 1;
    // initializing the dp states
    // here -1 means uncomputed state
    int dp[MAXN];
    std::fill_n(dp, MAXN, -1);

    // check if segmentation is possible
    int totalArrangements = solve(0, txt, dp, root);

    return totalArrangements;
}


Attend our Free Webinar on How to Nail Your Next Technical Interview

Recommended Posts

All Posts