Muggleton: Progol (1994)

Background

Recall the discussion of theta-subsumption and generalisation: given

C as nat(X) => nat(s(X))

D as nat(X) => nat(s(s(X)))

C implies D, but C does not theta-subsume D

There is an interesting relationship between C and D: D is just C resolved with itself twice

Muggleton (1992) showed that this is a general situation: if C implies D, and if D is not a tautology, then there is a clause E, which can be got by resolving C with itself zero or more times (ie it might be C itself), and E theta-subsumes D.

A number of attempts have been made to use this for an algorithm for generalisation, by searching for an appropriate C

However there are too many sources of non-determinism for it to be a practical approach

Self-Saturation

An alternative approach finds the self-saturation D' of D

This clause has the property that any clause C which implies D also theta-subsumes D'

Again, there is a practicality problem: D' is often large, and for some clauses may be infinite

Progol's Search Space

We can view concept learning as an attempt, given background knowledge B and example(s) E, to find a hypothesis H satisfying

(B & H) => E

or alternatively

(B & ~E) => ~H

Progol attempts to find , the "most specific clause" corresponding to B and E

It may be defined semi-constructively:

~ is the conjunction of all clauses that are true in all models of (B & ~E)

It has the important property that all candidates for H theta-subsume some sub-saturant of

Unfortunately, is often infinite

Progol instead attempts to construct i, where i specifies the maximum "depth" (in Golem's terms) of variables in the clauses

Limiting the Search

The set of sub-saturants of i is enormous, as is the set of possible theta-subsumptions

Progol limits the search by only looking at theta-subsumptions of i itself, rather than of sub-saturants

Progol also imposes mode declarations similar to those of Golem

However progol's mode declarations further limit the range of possible values for variables by :

* requiring a type discipline: the type of each variable is declared

* permitting the user to specify the number of alternatives which may be tried in attempting to instantiate the particular atom

Searching the Subsumption Lattice

Progol searches the subsumption lattice for solutions H to

(B & H) => E

in general, there may be more than one such solution, so progol seeks the H with the greatest Occam compression

However, like Golem it uses the reduced version of the clause, not the original clause itself

Progol uses an A*-like search algorithm; in particular, it is an admissible algorithm (ie if the best H is reachable, progol will find it)

Unfortunately, the refinement operator used is not complete: there are cases in which H exists, but is not reachable (but this is not surprising: it has recently been shown that any refinement operator satisfying reasonable requirements must be incomplete)

Quinlan: FOIL (1990, 1991)

Language: like CIGOL and Golem, FOIL learns function-free Horn clauses (ie the datalog subset of prolog)

Many of the systems described so far descend from a tradition of learning logic programs

Thus to reduce the search space, they focus on heuristics that lead to program efficiency, on the assumption that learning would not be attempted if it were not assumed that an efficient program was available.

FOIL, on the other hand, is far more influenced by the traditions of attribute based learning, and emphasises the finding of true descriptions of the World, of logic programs

FOIL differs from many of the other systems in being essentially non-interactive (it does not assume an oracle able to answer queries)

It normally requires negative as well as positive examples

(FOIL will optionally use the Closed World Assumption, ie if a possible example is not specified as a positive example of the concept, it is assumed to be a negative example)

FOIL also differs from most of the previously considered systems in using a strongly typed language to reduce search space

FOIL Algorithm

FOIL uses a covering algorithm:

* Find a rule that covers some of the data in the dataset

* Store the rule and remove the covered data from the dataset

* Re-run the algorithm on the reduced dataset

FOIL's inner loop, which generates individual rules, is a general-to-specific algorithm

In attempting to learn a rule of the form Hyp => Class, FOIL begins with Hyp empty (ie with the rule that says everything is in Class), and gradually specialises Hyp

It does so by conjoining literals one by one to Hyp, using a greedy (ie non-backtracking hillclimbing) search algorithm.

FOIL Heuristic

FOIL's search metric is essentially information theoretic (familiar from Quinlan's earlier work with ID3 and C4.5)

The next literal conjoined to the right of the hypothesis will be that which produces the highest information gain relative to the current training set

The current training set is that subset of the data which satisfies the current hypothesis

However, FOIL fudges this metric slightly, using two heuristics which can help it to learn in situations where information gains are low

New Variables

It is often desirable for literals to introduce new variables into the hypothesis

FOIL encourages this by giving a small positive bonus for literals which introduce new variables

Determinacy

Golem has demonstrated the value of determinacy as a guide to learning

Later versions of FOIL incorporate a bias toward determinate literals, by allowing the user to specify an information gain threshold

If at any stage in its search, FOIL is unable to find a literal that produces information gain greater than the threshold, all determinate literals are added to the clause

To ensure that FOIL does not enter an infinite loop of adding new determinate literals at each step, it incorporates Golem's idea of a maximum allowable determinacy depth (recall the i of ij-determinacy), which FOIL sets at 5 (by default)

FOIL Pruning

Finally, FOIL incorporates a pruning stage similar to those of decision tree learners: each literal of the learnt rule is examined to see if it can be omitted without significantly compromising the accuracy of the learnt rule

Note that this pruning justifies the over-optimistic use of determinacy in the learning phase (where all determinate literals are added) since any irrelevant determinate literals will be removed in the pruning phase

Kietz & Wrobel: Mobal/RDT (1991,1993)

Language: function-free Horn clauses

Mobal

RDT is one component of a major project, Mobal

Mobal is described as 'an environment for incremental modelling'

Perhaps it is easiest to think of it as an environment for acquiring knowledge for expert systems

As such, Mobal provides many integrated tools for acquiring and regularising knowledge

RDT is the tool provided for acquiring relational knowledge from data

Thus RDT has a more pragmatic approach than previously discussed systems, even FOIL

RDT

Background Knowledge

RDT uses background knowledge in a manner similar to Golem, by transforming it to a set of ground facts

To ensure this can be done, background knowledge is constrained to consist of generative rules, as with Golem

Typing

Like FOIL, RDT is strongly typed, and uses this typing to restrict the search spaces (ie only type-consistent hypotheses are examined)

Topologies

It also allows the user to specify rule concept 'topologies' (reminiscent of the `determination' declarations of Golem).

Specification of the Hypothesis Language

The major contribution of RDT is in its use of declaratively specified restrictions of the hypothesis language ('rule models')

Rule models are restricted second order formulae: they consist of rules in which the predicate symbols are permitted to be variables

The first order language searched is the set of subclauses of substition instances of the rule models

Search Mechanism

RDT uses a breadth first, general-to-specific search of the specified language

Hypotheses that have already been either accepted as valid or pruned as too special are remembered

New candidate hypotheses are first checked to see if they are specialisations of remembered hypotheses (in which case there is no point in testing further)

Heuristics
Instead of FOIL's information theoretic heuristics, RDT provides simple user-settable heuristics:

The user may specify simple arithmetic functions of such variables as the number of instances correctly predicted by the hypothesis, the number incorrectly predicted by the hypothesis to be members of the class, the number incorrectly predicted not to be members of the class, etc

The formula so specified will be used as the heuristic to guide search

Generalisation and Second Order Variables

The main complexity in RDT stems from the use of second order variables
Second-Order Substitution and Generalisation
In first order logic, the relationship between variable substition and generality is simple: a substition instance of a formula is always less general than the original formula

For example, if we apply a substition that takes Y to X, to a formula p(X,Y), we get p(X,X), which is more specific than the original

On the other hand, if we apply a substitution that takes P2 to P1 to the formula P1(X,Y) & P2(X,Y), we get the formula P1(X,Y), which is more general than the original

Sequence of Instantiations
A somewhat related problem concerns the order in which the predicate variables within a rule model should be instantiated

With a model such as

P(Y) & R(X,Y) => c(X)

It makes little sense to instantiate P first, since the variable Y in the hypothesis is totally unrelated to the variable X in the conclusion, so the rule will have no predictivity

Variable Connection
RDT approaches this problem via the idea of the connection of a hypothesis variable to the conclusion

A literal is directly connected to another literal if they share a variable

Predicate variables are substituted in the order determined by these connection chains:

A predicate variable may only be substituted if it is directly connected to a literal which already has an instantiated predicate symbol

W Cohen: Grendel (1993)

Hypothesis language: function free Horn clauses.

Grendel is built on top of FOIL, and is principally concerned with investigating the declarative specification of inductive bias, both language and search bias

Grendel and Declarative Specification of Language Bias

Grendel can be seen as an extension of the declarative specification ideas of RDT

Instead of a rule model, Grendel provides an Antecedent Description Grammar (ADG)

An ADG is essentially a context free grammar in which the symbols are logical literals (ADGs are closely related to prolog Definite Clause Grammars, and indeed may be expressed in higher-order logic programming languages as higher-order DCGs)

The basic idea is that the only hypotheses considered are those which may be generated by the ADG

In this form, Grendel provides only language, not search, bias - but Cohen was able to demonstrate that virtually any language bias found in the literature could be expressed in this form

Antecedent Description Grammar Example

goal_formula(illegal(A,B,C,D,E,F)).

body(illegal(A,B,C,D,E,F)) -> rels(A,B,C,D,E,F).

rels(A,B,C,D,E,F) -> [].

rels(A,B,C,D,E,F) -> rel(A,B,C,D,E,F), rels(A,B,C,D,E,F).

rel(A,B,C,D,E,F) -> pred(X,Y)

where member(X,[A,B,C,D,E,F]), member(Y,[A,B,C,D,E,F]).

pred(X,Y) -> [X = Y].

pred(X,Y) -> [not X = Y].

pred(X,Y) -> [adj(X,Y)].

pred(X,Y) -> [not adj(X,Y)].

pred(X,Y) -> [less_than(X,Y)].

pred(X,Y) -> [not less_than(X,Y)].

Grendel and Specification of Search Bias

Search bias is represented in Grendel by the use of deferred grammar rules:

A grammar rule is deferred if it is prefixed with an exclamation mark

For example, Grendel might be informed to prefer positive literals by replacing the productions containing negative literals in our previous example with:

goal_formula(illegal(A,B,C,D,E,F)).

body(illegal(A,B,C,D,E,F)) -> rels(A,B,C,D,E,F).

rels(A,B,C,D,E,F) -> [].

rels(A,B,C,D,E,F) -> rel(A,B,C,D,E,F), rels(A,B,C,D,E,F).

rel(A,B,C,D,E,F) -> pred(X,Y)

where member(X,[A,B,C,D,E,F]), member(Y,[A,B,C,D,E,F]).

pred(X,Y) -> [X = Y].

!pred(X,Y) -> [not X = Y].

pred(X,Y) -> [adj(X,Y)].

!pred(X,Y) -> [not adj(X,Y)].

pred(X,Y) -> [less_than(X,Y)].

!pred(X,Y) -> [not less_than(X,Y)].

Using Deferred Rules

Grendel handles deferred clauses thus: learning is initially conducted using only the non-deferred portion of the grammar

Only when Grendel fails to find a literal with positive information gain is the deferred portion of the grammar used

Once a literal with positive information gain has been found, the algorithm reverts to using only the non-deferred portion of the grammar

This mechanism is extremely simple, and more complex ones are readily devised

Nevertheless, Cohen has demonstrated that because the interaction between ADGs and deferral provides a high level of complexity, virtually any previously used search bias can be expressed in this form

Summary

Relational learning, especially inductive logic programming, has advance enormously in the last decade

Relational learning tools are now practical and useful for many real-World problems

However their computational cost is very high, particularly if the dataset is large

Only FOIL can really claim good handling of noisy data

Use an attribute-based learner if the problem can be transformed into an attribute problem