Posts Tagged ‘algorithms’

Contest Fail

Posted in Programming on January 16th, 2012 by Jamie – Be the first to comment

Last weekend I had a few hours to spare, so I decided to work on one of the problems from CodeSprint2, InterviewStreet’s contest to find the best programming talent.

Since I didn’t have long to work, I decided to try to complete the algorithm problem with the highest value. I chose to work on Count Strings (closed now). Given a regular expression, and a string length N, count the number of distinct strings of length N which the regular expression can match.

I looked at the sample test case, spent some time thinking about the problem and decided to try it. Although bounds for test cases were provided, I gave them only a cursory glance. This would turn out to be a mistake.

In my daily work, I typically focus on getting things working rather than spending time on determining the optimal way to do things. It is with this mindset that I decided that I would actually enumerate every string that a regular expression could produce, and then count them. I failed even before I began.

The regular expression language was limited to the alphabet {a, b}, with operators star, union, and concatenation.

I decided to use Haskell, which has the terseness of Ruby or Python, but the performance and type checking of a compiled language. I created a data type to represent the regular expressions. Recursive data types made this representation very straightforward.

data RegularExpression = Symbol Char
  | Concat RegularExpression RegularExpression
  | Union RegularExpression RegularExpression
  | Star RegularExpression
  deriving (Show, Eq)

I had the enumeration part complete fairly quickly. Pattern matching makes everything really intuitive and readable. The star operation required the most code.

listPossibilities (Symbol c) limit
  | limit > 0 = [c]
  | otherwise = []

listPossibilities (Concat r1 r2) limit = combos where
  o1 = listPossibilities r1 limit
  o2 = listPossibilities r2 limit
  combos = [ a ++ b | a <- o1, b <- o2, length (a++b) <= limit ]

listPossibilities (Union r1 r2) limit = possibilities where
  o1 = listPossibilities r1 limit
  o2 = listPossibilities r2 limit
  possibilities = o1 ++ o2

listPossibilities (Star _) 0 = [""]
listPossibilities (Star r1) limit = possibilities where
  opt = listPossibilities r1 limit
  possibilities = "": whileLimit opt limit optSet
  optSet = Set.fromList opt 

  -- | Uses nub right now, really inefficient. probably should use some sort of memoization
  whileLimit :: [String] -> Int -> [String] -> [String]
  whileLimit base lim acc = pos where
    new = List.nub [a++b | a <- base, b <- acc, length (a++b) <= lim]
    pos = if null new || length new < length acc then acc
          else whileLimit base lim (List.nub $ acc ++ new)

The next component I needed was a parser to convert input into these regular expression objects. I used the Parsec parser module to construct the parser. This part actually took a bit longer, as I tried to refine the parser to accept deeply nested expressions.

The regular expression union parser below parses a parenthesized expression or symbol, followed by a pipe, and another parenthesized expression or symbol, returning a RegularExpression object.

reUnion :: GenParser Char st RegularExpression
reUnion = do
  re1 <- pexpr <|> reSymbol
  char '|'
  re2 <- pexpr <|> reSymbol
  return (Union re1 re2)

Once I got the parser working, I glued it together with the enumeration portion, and submitted it to the contest. It immediately failed to compiled due to a dependency on Test.HUnit I had in my code. I reorganized the code so that I could easily remove this dependency, and resubmitted.

This time, it took, but when I went back after a few minutes to check the submission status, it had failed all but 1 of the test cases. It had taken too much time! I went back to the submission guidelines and noted that there was a time limit of 5 minutes.

Originally, I had used an algorithm which enumerated by checking distinctness using a List type. I knew at the time that this was very inefficient, but I focused on finishing the implementation at the time. I went back to this part and used a Set instead. I was optimistic that this would fix my problems and I would be able to knockout at least another test case.

I resubmitted and I failed 10 tests again. Then, I reread the problem, and looks at the boundaries for N. N could be up to 100,000 (IIRC)! Of course enumeration (especially the way I was doing it which would enumerate many strings of less than length N) would fail–I had taken a totally incorrect approach to this problem. I did not need to enumerate–I only needed to give a number of potential strings. This could’ve been done without enumeration. For example, given the expression a*, and a number N, there is only 1 possibility. Given (a|b)^*, 2^N. For example, one of the test cases was (a|b)^*, N=5, which has 32 possibilities. Another of the expressions was a^*ba^*. Since b can only appear once, the possible strings include strings of  (N-1) a’s and a single b. There are N such combinations (the number of positions b can be in, in an N length string): baa…, aba…, aab…, … . This is the approach I should’ve taken.

I left the solution there, and acknowledged my defeat. Source here: https://github.com/jamiely/count-strings

Word Jumble Game: Part 5

Posted in Software on March 22nd, 2010 by Jamie – Be the first to comment

I used jQuery for the UI. I am a recent convert to jQuery, having mostly used Prototype + Scriptaculous.

The word list is embedded into the page script as a javascript array. On document ready, html is generated, which writes the first and last word to the page, and creates blank input boxes for the intermediate words.

There is a keyup event bound on each input box, which will determine if the word is correct. If it is, a css class will be added which shows a green underline underneath the box. Otherwise, a red underline will be shown.

Finally, there are buttons on the page which are created dynamically and provides hints or reveal all of the answers.

Word Jumble Filled

Word Jumble Filled

Word Jumble Game: Part 4

Posted in Software on March 20th, 2010 by Jamie – Be the first to comment

Search

The problem of generating the chain of clues is a simple search problem. In this case, depth-first search was used, because the algorithm would attempt path depth-wise and only explore another branch if the generated chain was not long enough.

Another tactic would have been to use a breadth first search. To use breadth-first search, we could have modified the regex pattern to find all words that differed from the base word by just one letter.

Using water as the base word, that regular expression looks something like: /([^w]ater|w[^a]ter|wa[^t]er|wat[^e]r|wate[^r])/. This would find all words in the dictionary that differed by one word (let’s call this word set B).

If we were using breadth-first search, we would then repeat the process with all of the words we just found (word set B).

If you were to visualize the difference between breadth-first and depth-first search, breadth-first would look like a tree with wide but shallow roots. Depth-first search would look like a tree with few but deep roots.

Query Params

The flexibility of the puzzle is enhanced by optional query parameters that may be applied. The word param allows specification of the starting or seed word. The length param specifies the maximum length of the puzzle.

Recursion

The program uses recursion to perform the search. This almost goes without saying, for it is difficult to do general search without recursion (although you could do so with macros and similar programming constructs). Search may be done using loop control structures but I can’t imagine an elegant solution using loops.

The pseudocode for the recursion is basically:

function build(baseWord, chainWords, maxLength)

    regex = generateRandomRegex(baseWord)
    wordSetB = getPossibleWords(regex, notIn=chainWords)
    for(word in wordSetB)

        chain = build(word, chainWords+word, maxLength)
        if Length(chain) >= maxLength

            break

    return chain

Word Jumble Game: Part 3

Posted in Software on March 18th, 2010 by Jamie – Be the first to comment

The first thing I did was made sure that the word list would be cached on application start. This was as simple as creating an Application.cfc cfcomponent and implementing the onApplicationStart function.  This function reads the dictionary in (described in the last entry) and caches the word list in a ColdFusion array. There are other options for storing this data, but this had the best mix of speed and function considering the method of search I wanted to use against it.

Although the dictionary was only 52K, this caching probably helped performance a great deal.

To generate the word list, I decided on the following algorithm:

  1. Choose an initial starting word (at random, or via user entry)
  2. Use the word to generate a regular expression.
    Replace a random single letter with the Regex pattern [^L] (where L is the letter you have replaced).

    Example:

    word: water
    regex: w[^a]ter

  3. Next, iterate through all of the words, testing each word against the regular expression. Store all matches.
  4. With each match, one-by-one, repeat Step 2 until we get a chain of N words. (Where N is the maximum length of the chain.)
  5. Obviously, if we have no more matches, we stop. If we have at least a 3-word chain, we can use it.

There are a few considerations not discussed above in generating the puzzle:

  • If we match a word that is already in the chain, we should ignore that word to avoid duplicates.
  • Not implemented: we should not replace a letter in the same position twice. For example, if we replace the “w” in water, don’t replace the “h” hater (if hater is the 2nd word).
  • Depth-first versus Breadth-first searching…to be discussed

Gesture Recognition

Posted in Libraries on March 24th, 2009 by Jamie – Be the first to comment

I came across an awesome article on simple gesture recognition.

This page implements a “$1 Gesture Recognizer” that is easy, cheap, and usable almost anywhere. It requires under 100 lines of easy code and achieves 97% recognition rates with only one template defined for each gesture below. With 3+ templates defined, accuracy exceeds 99%.

http://depts.washington.edu/aimgroup/proj/dollar/



Switch to our mobile site