Sunday, November 21, 2010

Sudoku, CSP and playing with sets

Sudoku is solvable as a constraint satisfaction problem. It is described as a problem where you must populate a grid of 9x9 tiles with numbers ranging from 1 to 9, in such a way that there are no duplicate numbers in any row, column or local tile block. Because an empty grid would allow you to derive any combination imaginable, the puzzle starts with a reasonable number of tiles already filled in, usually somewhere in the order of thirty.

A naive approach to solving the puzzle uses a brute-force-like mechanism: start from the first empty tile, determine which valuations can be made, continue to the next, and so on. A much cleverer way to solve the puzzle is to find places where only one value is possible, fill that in, and see whether this modifies the state space such that another single valuation can be made in a different location.

I've taken some time to try to solve this puzzle using the CSP solver that is running on the database. The first, naive approach took over an hour and still did not return a single solution. The second approach, however, worked out to a solution in under three seconds. It can solve very easy puzzles as shown here, but also the hardest of them. Harder ones just take slightly longer, because there is less information. Here's how my state space is organized and how the queries operate:
  1. Create a table to denote the grid. Each record denotes a single tile as a column, row combination and also records to which block that tile belongs.
  2. Create another table that records the valuation for each single tile: the tuple.
  3. Inspect the puzzle and create tuple records for each already given number. These constrain the valuations you can make from there.
  4. Then launch this query every time there's still some tile left to evaluate:
-- For every still-empty tile, count how many candidate digits remain legal,
-- listing the most constrained tiles first. my_values holds the nine
-- candidate digits in the column some_assignment.
select sc.col, sc.row, count(*) as count
from sudoku_cell sc
cross join my_values mv
left join tuple t on sc.col = t.col and sc.row = t.row
where mv.some_assignment not in
      ( select t2.assignment from tuple t2 where t2.col = sc.col )
  and mv.some_assignment not in
      ( select t2.assignment from tuple t2 where t2.row = sc.row )
  and mv.some_assignment not in
      ( select t2.assignment
          from tuple t2
          join sudoku_cell sc2 on t2.col = sc2.col and t2.row = sc2.row
         where sc2.block = sc.block )
  and t.assignment is null
group by sc.col, sc.row
order by count asc;
The result is an ordering of tiles by the number of valuations that can still be made for each tile, given the current state of the puzzle. Of course, a tile for which only one valuation is possible brings you one step closer to the solution.
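Outside the database, the same single-candidate loop can be sketched in a few lines. This is my own in-memory sketch, not the SQL solver from the post: `candidates` and `solve_singles` are invented names, and the grid is a plain 9x9 list of lists with 0 for an empty tile.

```python
# An in-memory sketch of the single-candidate loop (invented names; the
# grid is a 9x9 list of lists, 0 meaning "empty").

def candidates(grid, row, col):
    """Digits not yet used in the tile's row, column or 3x3 block."""
    used = set(grid[row]) | {grid[r][col] for r in range(9)}
    br, bc = 3 * (row // 3), 3 * (col // 3)
    used |= {grid[r][c] for r in range(br, br + 3) for c in range(bc, bc + 3)}
    return {d for d in range(1, 10) if d not in used}

def solve_singles(grid):
    """Repeatedly fill tiles that admit exactly one valuation."""
    progress = True
    while progress:
        progress = False
        for r in range(9):
            for c in range(9):
                if grid[r][c] == 0:
                    cand = candidates(grid, r, c)
                    if len(cand) == 1:
                        grid[r][c] = cand.pop()
                        progress = True
    return grid
```

Easy puzzles collapse entirely under this loop; harder ones leave tiles with two or more candidates, which is where an ordering of the most constrained tiles becomes useful.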
How does Sudoku help us understand human reasoning better, and how do computer algorithms compare to human reasoning?
All this work with SQL got me thinking about the method that computers use in reasoning, from the perspective of set theory. Reasoning in computers is highly dependent on set theory, especially when you use SQL like I do. Logic in computer programs is basically running sets through algorithms that determine whether a member is or is not part of a set (unions, intersections, etc.). Some sets are made to relate to one another through attributes or keys of the members of each set.
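For illustration, these operations map directly onto Python's built-in set type; the digit sets below are invented for the example.

```python
# The set operations in question, using Python's built-in sets.
row_used   = {5, 3, 7}        # digits already placed in some row
col_used   = {5, 1, 9, 8}     # digits already placed in some column
all_digits = set(range(1, 10))

used = row_used | col_used    # union
free = all_digits - used      # difference: the legal candidates
assert 5 in used              # membership test, like SQL's IN
assert 5 not in free          # ... and NOT IN
```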

When you want to do anything useful in SQL, you need discriminators that determine the scope of your set, and then operations from one set on another. Sometimes the relationship is parent-child, or a direct relationship from one member to another, so you can join the elements together to form a superset. Sometimes you wish to eliminate the values that are present in another set without attempting a full join. A lot of logic in SQL revolves around "WHERE x NOT IN" or "WHERE x IN" kinds of statements. So how does this all work in general? No matter what the query, you typically start out in a form like:
SELECT x, y, z, ... FROM abc;
and very likely, the next improvement to this query is an attached WHERE statement, which reduces the returned set to those values that do not violate the conditions. Another mechanism used for querying is to join tables together on some suggested relation and on the basis of that relationship, further limit the returned results (mostly on attributes in the rows of the related table). This allows you to make rather complicated queries like: "only return those cats that have more than 5 kittens" or "query all orders that have line items higher than threshold X, created between now and three months ago".
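As a runnable illustration of such a query, here is the "cats with more than 5 kittens" example in SQLite, via Python's sqlite3 module; the table and column names are invented for the example.

```python
# The "cats with more than 5 kittens" query, runnable in SQLite.
# Table and column names are invented for the example.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table cat (id integer primary key, name text);
    create table kitten (id integer primary key,
                         cat_id integer references cat(id));
    insert into cat values (1, 'Misty'), (2, 'Tom');
    insert into kitten (cat_id) values (1), (1), (1), (1), (1), (1), (2);
""")
rows = conn.execute("""
    select c.name, count(*) as kittens
      from cat c
      join kitten k on k.cat_id = c.id
     group by c.id
    having count(*) > 5
""").fetchall()
print(rows)  # only Misty qualifies
```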

The reason for mentioning this is that it may help in constructing queries in the future. If one set is empty, it cannot have any effect on another set: there is nothing in it to join against or filter on. So in order to do anything useful, your query must return rows, at least one.

A tougher example is what I had to do for the Zebra puzzle. There are cases where a rule discriminates members of a set, but there may also be no rule at all, so you can't depend on the rule query returning anything useful. The idea is then to return all rows unless there's a rule that dictates some should not be returned. The eventual query for solving the puzzle turned into a very complex beast with multiple IN and NOT IN statements. It's probably helpful to describe them a bit here:

SELECT all colors that can be assigned to the current house being considered, where
  1. the color is not already used in a tuple,
  2. the color is not in any rule that specifically binds the color to a different house,
  3. if there's a rule that dictates that the house currently being considered has a certain color (a set composed of one color), the color must be in that set; if no such rule exists, allow any color as far as this particular rule is concerned,
  4. the color is not in a set, composed by rules that are active on houses to which I'm not a neighbor. So if a rule dictates that a house must be blue and I am not a neighbor to that house, then exclude blue from the set,
  5. the color is not in a set, composed by all colors that are determined by rules for which the rule dictates the neighbor of the currently considered house should have that color. So, if a rule dictates that my neighbor's house should be blue, then the currently considered house cannot be blue.
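The five conditions can be sketched as plain set operations. This is a deliberately simplified representation of my own (rules as dicts keyed by house number), not the actual SQL from the solver; note that in this form condition 2 subsumes condition 4, since "any other house" includes all non-neighbors.

```python
# The five conditions as set operations (simplified representation:
# rules as dicts keyed by house number).
ALL_COLORS = {"red", "green", "blue", "ivory", "yellow"}

def legal_colors(house, assigned, color_rules, neighbor_color_rules):
    """Colors still assignable to `house`.
    assigned:             {house: color} tuples already placed
    color_rules:          {house: color} rules binding a house to a color
    neighbor_color_rules: {house: color} rules binding a house's
                          neighbor to a color
    """
    colors = ALL_COLORS - set(assigned.values())                  # cond. 1
    colors -= {c for h, c in color_rules.items() if h != house}   # cond. 2 (and 4)
    if house in color_rules:                                      # cond. 3
        colors &= {color_rules[house]}
    if house in neighbor_color_rules:                             # cond. 5
        colors -= {neighbor_color_rules[house]}
    return colors
```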
You can understand that such reasoning causes programs to explode in complexity. Logic interpreters like Prolog hide a lot of these details (backtracking, generation of possible values), but at some point you may want to start optimizing how sets are generated, which is where you get similar complexity. And you can only be effective when you know exactly how the interpreter is handling things for you.

In Sudoku there's a similar issue that is easier to understand, but not often modeled in solvers. For example, given the specific sudoku puzzle above, the computer finds that in column 4, row 1 and row 7 each have only one possible valuation. When I solved this puzzle by hand, however, I was not immediately drawn to those possibilities: there are lots of openings around those tiles, so instinctively they are not places that could lead to a quick inference. What I did notice very quickly is that tile (9,8) can only contain a 7, because columns 7 and 8 both already contain a 7 and the rest of the tile block is populated. The constraints the computer uses, however, still allow a valuation of 1, 2, 3, 6 or 7 there.

This is the kind of knowledge experts have: reasoning very quickly over many different sets of data, instantly working out intersections, unions and so on. Some visual combinations of elements are very quick for humans to analyze but pose challenges for computers. Those quick analyses are very interesting to look at, because they are mostly expert knowledge that is not easily converted into a problem statement or constraint:
For a specific tile block, determine the set of valuations that are still to be made in that block; call this set X. Then count, for each valuation in X, the number of tiles it can be legally assigned to within the block. If there is a valuation with only one legal tile, that is the placement to make.
In order to understand this, we need an idea of the relationship of one tile with another; here, that relationship is being placed in the same block. Limiting domains is a good way to increase performance if the cost of doing so is less than traversing the possibilities by brute force. Such complicated relationships are not often modeled in reasoners because of their complexity, but as you can now see, they are used readily and quickly by human reasoners.
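The strategy just described is what solver literature usually calls a "hidden single". A sketch of it, with invented helper names and a plain 9x9 list-of-lists grid using 0 for an empty tile:

```python
# A sketch of the "one legal tile per valuation" strategy within a block
# (a "hidden single" in solver terms; names invented).

def _candidates(grid, row, col):
    """Digits not yet used in the tile's row, column or 3x3 block."""
    used = set(grid[row]) | {grid[r][col] for r in range(9)}
    br, bc = 3 * (row // 3), 3 * (col // 3)
    used |= {grid[r][c] for r in range(br, br + 3) for c in range(bc, bc + 3)}
    return {d for d in range(1, 10) if d not in used}

def hidden_singles_in_block(grid, block_row, block_col):
    """Yield (row, col, digit) placements forced within one 3x3 block."""
    cells = [(r, c)
             for r in range(3 * block_row, 3 * block_row + 3)
             for c in range(3 * block_col, 3 * block_col + 3)]
    placed = {grid[r][c] for r, c in cells if grid[r][c] != 0}
    for digit in set(range(1, 10)) - placed:   # valuations still to be made
        spots = [(r, c) for r, c in cells
                 if grid[r][c] == 0 and digit in _candidates(grid, r, c)]
        if len(spots) == 1:                    # only one legal tile
            yield (*spots[0], digit)
```

Note that a digit can be forced into a tile this way even when that tile still has several candidates of its own, which is exactly the (9,8) situation described above.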

It's difficult to combine these results in one particular query or reasoning mechanism. Another way of looking at this is that they are different strategies for resolving the problem; the problem is only concerned with the tuples, and the more that are found, the better. Trying out different constraint combinations, and understanding at a higher level how they change the solution space, is a very important concern for such problem solvers. The downside is that the state space and complexity go up significantly, but with the benefit of finding solutions much faster or much better than a standard CSP can guarantee.
