Relational Learning
Mitchell, "Machine Learning", Ch 9.5 & 10
Muggleton: "Inductive Logic Programming"
Web Pointers
What is Relational Learning?
The learning approaches we have so far examined in detail - decision tree learning systems, feed-forward neural networks, genetic algorithms - use representation languages which are effectively propositional. That is, their search spaces are not able to represent relationships between data items. But many, perhaps most, real-World problems do involve relationships.
Propositional learning systems cannot learn relational knowledge, such as a definition of a sort or append relation; they cannot learn about family relationships, about geometric or temporal relationships; in my own field of interest, ecological problems, they cannot learn about spatial relationships. In fact, even relational learning is insufficient: without functions, it is difficult to express - and therefore learn - important concepts such as distance.
How do I tell if my Learning Problem is Relational?
In many cases, this is quite easy. For example, in natural resource domains one common application of machine learning is to attempt to predict species distribution from satellite or other mapping data. In a non-relational learning system, eg C4.5, each location would be treated separately - ie the system ignores any spatial relationships between data locations.
For species which can move about, eg animals, this often results in quite poor predictions. Greater gliders require both old trees with hollows (for nesting sites) and young, fast growing trees for feeding; these may be quite distant from each other - greater gliders will move up to a kilometer per night. If we had a dataset with one-hectare sample squares, a propositional learner would be unable to distinguish between a large area of new forest - where we would not expect to find many greater gliders - and a patchwork of new and old forest, where greater gliders are likely to be abundant. Of course we could synthesise new attributes for a propositional learning system - eg 'patchwork' vs 'uniform' - which record the relevant spatial attributes and incorporate them into each location, but we can only do this if we know ahead of time which spatial relationships are relevant. In general, it's simpler just to use a relational learning system and supply it with the means to compute the spatial relationships (ie functions to compute distance, orientation etc).
Conversely, if we were trying to predict the distribution of trees, we might hypothesise that spatial relationships were less important to trees because they don't move about, and hence that propositional learners would be adequate for predicting tree distributions. In general, we would be correct - non-relational learning systems have been used quite successfully for tree distribution prediction. But it is easy to underestimate the importance of relations - for example, we would expect a poor site for a particular species to have more of that species if it was near to a good site, simply because it would be continually receiving seed from the good site. Or to take another example, if our dataset included spot heights but not soil moisure, a relational learner could in theory use those soil moisture, whereas a propositional learner could not.
A second use of relationships is in comparing attributes. For example, suppose we have a dataset about companies (perhaps for a share portfolio system), and the data incorporates income and expenditure attributes but not profit. A relational system could readily distinguish between profitable and loss making companies through the relation income > expenditure, whereas a propositional system could not. Of course, any sensible financial advisor would have included profit/loss as an attribute, but other relationships may be less obvious.
Relational Learning Approaches
There are two principal relational learning approaches in use today:
Inductive Logic Programming (ILP)
ILP systems learn relations in the form of logic theories, usually expressed in a subset of the prolog language. The learning algorithms are typically determinstic. They can learn highly complex theories, certainly up to the level of undergraduate computing students. Their tolerance of noise ranges from poor to fair. They provide powerful mechanisms for incorporating user knowledge to help guide the search. They are relatively easy to re-target to new learning problems.
Genetic Programming (GP)
GP systems learn relations in the form of programs; originally, these were mostly in functional languages such as lisp, though recently there has been a lot of work on machine code as well (for obvious efficiency reasons). The learning algorithms, being evolutionary, are highly non-deterministic. They can learn relatively complex programs, though not perhaps as complex as those learnt by ILP. On the other hand, their noise tolerance is usually excellent. Mechanisms for incorporating user knowledge are weaker. They are more difficult to re-target to a new learning problem - the user has to supply the semantics of the problem and a fitness function, usually in the form of a program.
Applications of Relational Learning
Relational learning systems have been applied to a wide range of learning problems - diagnosing circuit faults, designing electronic circuits, analysis of protein folding, interpreting DNA, ecological modelling etc. However they have also been used in two other, less obvious, ways
Language Learning
Languages are just a special form of relational theory, hence relational systems may be used to learn languages. Exerimental and theoretical work in this area is casting light on the process of learning language, and on theoretical issues such as Chomsky's theory of innate language learning abilities.
Social Modelling
Relational learning systems provide reasonable simulations for rational beings, and hence may be used to model social processes and confirm or disprove theories. They have been used to demonstrate that a number of assumptions in economic theory are necessary.
Why is Relational Learning Difficult?
In propositional and attribute learning:
In relational learning:
Why Use Background Knowledge?
Inductive Logic Programming
Banerji: Predicate Logic (1964)
Use of predicate logic as a description language;
Incremental learning (learned concepts added to the hypothesis language).
Plotkin: Relative Least General Generalisation (1970)
Language: First Order Predicate Calculus.
We previously noted the connection between implication and generalisation
One way for one clause to imply another is for the first to contain all the literals of (subsume) the second
Another way is for the second to be a substitution instance (via a substitution theta) of the first
Combining these, we get the definition of theta-subsumption: C q -subsumes D iff there is a substitution q such that q applied to C gives a subset of D.
The aim of Least General Generalisation (LGG) is to climb up the q -subsumption hierarchy the minimum distance necessary to generalise the specific clauses under consideration (the learning data)
LGG can't handle background knowledge: extension to Relative LGG
Requires a definition of generalisation relative to background knowledge: C generalises D relative to theory T iff C q -subsumes D together with the negations of the literals in T
Computational Problems
Theoretically nice, but dropped for a time because of computational problems:
Note that theta-subsumption is not an exact inverse of implication: let
C be the rule nat(X) => nat(s(X))
D be the rule nat(X) => nat(s(s(X)))
then C implies D, but C does not theta-subsume D.
Michalski: Induce (1973,1978,1980)
Used a very limited, special purpose subset of the predicate calculus.
Relied on the 'star' induction algorithm to find descriptions of a class: special-to-general learner
Computational complexity of the star induction algorithm is unsatisfactory for lage datasets
Vere: Thoth (1975, 1977)
Language: literals consist of strings of constant and variable symbols (so effectively a higher order logic); provides conjuntion, disjunction and restricted use of negation
Generalisation by matching: literals from different instances are matched as well as possible; non-matching constants are replaced by variables.
B Cohen: Confucius (1978, 1982)
Language: subset of predicate calculus
Sammut: Marvin (1980, 1986)
Language: initially attribute-value based, eventually Horn clause
Generalises using background knowledge, stored as Horn clauses
Mixed specialisation and generalisation process
Shapiro: MIS (1981)
Language: prolog
Aimed at debugging prolog programs:
If the theory is too general (ie inconsistent with examples), try to trace the part of the theory leading to the inconsistency, and delete it
If the theory is too specific (ie does not cover all examples), add to the hypothesis by refining an already refuted hypothesis
Refinement operators provided included:
Muggleton: DUCE (1987)
Language: Propositional, not relational
Critical to later relational work
In lecture 2, we mentioned that machine deduction mostly uses a method called resolution
DUCE is based on the idea of 'inverting resolution' to give generalisation operators
The Resolution Method
Resolution is most readily understood as a method of proving unsatisfiability: rather than prove P valid, the resolution method attempts to prove that ~P is unsatisfiable.
Conjunctive Normal Form
A literal is an atomic formula or the negation of an atomic formula
A formula is in conjunctive normal form (CNF) if it is the conjunction of disjunctions of literals
Each formula has (up to equivalence) only one CNF (the engineers in the class might prefer to think of minterms)
Clauses
Because of commutativity, the order of formulae in a disjunction or conjunction is irrelevant.
So it is often convenient to just consider the unordered set of formulae making up the conjunction or disjunction
define:
Clause
A clause is a finite set of literals, possibly empty (representing a finite disjunction of literals)
We represent the empty clause as
Clause Set
A clause set is a set of clauses, possibly empty (denoted ø) or even infinite, and possibly containing
Using Clause Sets
Every finite nonempty clause set not containing
corresponds to a formula in CNF in an obvious wayEvery formula in CNF has such a representation
From a propositional formula, we can first construct its CNF, and then the corresponding clause set
Resolution
Resolvents
Let C1 and C2 be two clauses, and A an atomic formula with A being in C1 and ~A being in C2
Then D = (C1 \ {A})
ª (C2 \ {~A}) is called a resolvent of C1 and C2That is, a resolvent is obtained by removing a complementary pair of literals from two clauses, and merging the resultant sets
The Resolution Rule
If S is a clause set, and D is a resolvent of two clauses in S, then S is logically equivalent to S
ª {D}That is, adding a resolvent of a clause set to the set does not change its satisfiability
Resolution Theorem
Clause set S is unsatisfiable if and only if by the empty clause
is obtainable by repeated application of resolution to SApplication
The Resolution Theorem thus gives us a method for deciding the validity of a proposition P:
Form the negation ~P
Find the CNF C of this
Convert to a clause set S
Perform all possible resolutions
Check whether the empty clause
occursHowever in performing all possible resolutions, we may have computed far more than is actually required:
we only need to find one way to construct the empty clause
Thus the aim is to find an algorithm which will efficiently find such a deduction if there is one.
There is good reason to believe that no efficient general method can be found, but polynomial time algorithms for some important restricted classes of formulae have been discovered
Inverting Resolution
V Operators
The V operators invert one resolution step:
Given two clauses
A = {A1,...,Am,L}
B = {B1,...,Bn,~L}
resolution creates the resolvent
D = {A1,...,Am,B1,...,Bn}.

The V operators run this backward
Absorption
creates
B = {B1,...,Bn,~L}
from
D = {A1,...,Am,B1,...,Bn}
A = {A1,...,Am,L}.
Identification
creates
A = {A1,...,Am,L}
from
D = {A1,...,Am,B1,...,Bn}
B = {B1,...,Bn,~L}.
W Operators
The W operators invert a pair of 'back-to-back' resolution steps:
Given three clauses
A = {A1,...,Am,L}
B = {B1,...,Bn,L}
C = {C1,...,Cr,~L}
resolution can create the resolvents
AC = {A1,...,Am, C1,...,Cr}
BC = { B1,...,Bn, C1,...,Cr}.

Intra-construction
from
AC = {A1,...,Am, C1,...,Cr}
BC = { B1,...,Bn, C1,...,Cr}
construct the clauses
A = {A1,...,Am,L}
B = {B1,...,Bn,L}
C = {C1,...,Cr,~L}
Inter-construction
Almost the same, but replaces L by ~L:
from
AC = {A1,...,Am, C1,...,Cr}
BC = { B1,...,Bn, C1,...,Cr}
construct the clauses
A = {A1,...,Am,~L}
B = {B1,...,Bn,~L}
C = {C1,...,Cr,L}
A critical point to note is that W operators can invent new literals:
The literal L occurs in the generalisation, but was not present in the original clauses AC and BC
Muggleton & Buntine: Cigol (1992)
Language: function-free prolog
Builds on DUCE and Buntine's work on relative subsumption and relative generalisation
Theoretically, the only difference from DUCE is the propositional/predicate change: extending resolution to the predicate calculus requires replacing identity by unification
The Resolution Method in the Predicate Calculus
Predicate Normal Forms
Prenex Normal Form
Just as any propositional formula can be reduced to an equivalent CNF, any predicate formula can be reduced to prenex normal form
In prenex normal form, all the quantifiers have been moved to the front: the formula consists of a string of quantifiers over variables, followed by the body (known as the matrix) of the formula, which contains no quantifiers, and is in CNF
Full Normal Form
Further magic (involving the introduction of lots of new function symbols) allows the removal of all the existential quantifiers, and once only universal quantifiers are left, they may be taken as implicit, so the formula can again be treated as a clause set
Resolution & Unification
As with the propositional calculus, resolution may be applied to derive the empty clause
But there's a complication that doesn't arise with propositional logic:
What is the resolvent of (say)
{king(X,louis)}
{~king(france,Y)}
Propositionally, there isn't one
In predicate deduction, we can unify
X with france
Y with louis
so that the two clauses then become
{king(france,louis)}
{~king(france,louis)}
which give the resolvent
A conclusion which will no doubt satisfy the French republicans
Inverse Resolution and Subsumption
Extending inverse resolution to the predicate calculus requires replacing identity by substitution
In fact, the full generality of the V and W operators would generate unacceptably large search spaces, so only a restricted subset is implemented (refer to the diagrams under DUCE)
V operators:
identification
not implemented
absorption
clauses A and B contain no common literals (other than literals that specialise to L)
W operators:
inter-construction
not implemented
intra-construction
clauses A and B contain only one literal ('unit clauses')
Truncation
The final inverse resolution operator, truncation, is effectively a degenerate case of the W operators
Given two negative singleton clauses
{ ~L1 }
{ ~L2 }
there may be a literal L which, with appropriate variable substitutions, can resolve separately with each of them, giving the resolution diagram :

Given { ~L1 } and { ~L2 }, we can try to generate an appropriate L
In fact, a good choice is the LGG of the two, if it happens to exist
So from
member(1,cons(1,2))
member(red,cons(red,green))
truncation would generate
member(A,cons(A,B))
In fact, Cigol extends truncation to produce the LGG of an arbitrarily large set of unit clauses
Search Restrictions
Despite the restricted form of the V and W operators, the search space still requires narrowing
Cigol imposes an additional requirement that at each stage of the search, the new theory is syntactically smaller than the original
Muggleton & Feng: Golem (1992)
Golem, or RLGG revisited: Recall the problems demonstrated by Plotkins 1960s work on RLGGs
Muggleton and Feng found that a set of reasonable restrictions on
combined with
could improve the tractability of the RLGG approach
Restricted Background Knowledge
Golem restricts the background knowledge to be syntactically generative
A clause
A1 &....& Am => B
is syntactically generative if the varables in B are a subset of the variables in A1,...., Am
For example,
member(X,Ys) =>
member(X,[Y|Ys])
is not syntactically generative (since Y occurs in the conclusion but not the antecedent of the rule), but
member(X,Ys) &
integer(Y) =>
member(X,[Y|Ys])
is syntactically generative
In fact, a non-generative clause may always be made generative by adding typing or other information
Finite Ground Model
Muggleton & Feng combine this restriction with the use of a finite subset of a ground model of the background theory
Specifically, they define an atom A of a theory K to be h-easy if A can be derived from K in at most h applications of resolution
They then define Mh(K) to be the Herbrand instantiations of h-easy atoms of K (the Herbrand instantiations can be thought of as the instantiations that have to exist just because of the symbols in the language)
Mh(K) is thus a finite subset of a ground model of K (in fact, its at most exponential in K)
Restricted Hypothesis Language
The use of h-easy models and generative background theories improved Plotkins results by at least ensuring that RLGGs were finite
But with a length exponential in the number of examples, the RLGG was still not computationally tractable
Determinacy in the body of a clause can improve learnability
Speaking carelessly, a variable V in a literal is determinate if for every combination of values of the other variables in the literal, there is exactly one value of V for which the literal is true
Thus Z is determinate in
append(X,Y,Z)
but X is not in
member(X,Y)
Muggleton & Feng further extended this definition to the idea of ij-determinacy
They were able to prove that, for fixed i and j, the size of the RLGG of n examples was not only not exponential in n, but was independent of n
ij-determinacy
Muggleton & Feng illustrate the idea of ij-determinacy with the clause:
decrement(B,D &
multiply(A,D,E) &
plus(A,E,C) =>
multiply(A,B,C)
In the literal multiply(A,D,E), the value of E is determined only if both the values A and D are fixed; E is said to be of degree 2
In ij-determinacy, the j specifies the maximum degree of any variable in any literal of a hypothesised clause which does not occur in the conclusion of the clause
The i specifies the maximum depth at which a determinate variable may occur
The depth of a variable which occurs in the conclusion of the clause is defined as 0
If a literal determines its variables in terms of other variables whose maximum depth is n, then the depth of that literal is n+1
So in the example
Thus the example clause is 22-determinate
Clause Reduction
While Golems restrictions on background knowledge and hypothesis language do allow the computation of relatively small RLGGs, they do not remove the need for clause reduction
Rather than following Plotkins approach of logical reduction, Muggleton & Feng used more semantically based approaches to clause reduction: functional and negative-based reduction
Functional Reduction
Golem supports the idea of mode declarations, which specify which variables of a predicate will be used for input values, and which for output
For example, append might be declared as
mode(append(+,+,-))
which says that (in this context) append will only be used to join two lists together to form a third
From such mode declarations, Golem can construct a functional graph showing the dependency of predicates in a clause - eg for quicksort:
mode(qsort(+,-))
mode(partition(+,+,-,-))
partition(A,B,E,F) &
qsort(F,G) &
qsort(E,H) &
append(H,[A|G],[C|D]) =>
qsort([A|B],[C|D])

Given a candidate RLGG, Golem attempts to construct a functional graph of the hypothesis using only literals within the RLGG
If it succeeds, any literals not occurring in the functional graph are deleted from the RLGG
Note that this functional reduction relies on the assumption of determinacy
Negative-based Reduction
So far, we have discussed learning only in the presence of positive examples of a concept
Golem also makes use of negative examples (ie possible tuples that are not part of the concept)
These are held in a negative example file
Negative-based reduction of the RLGG relies on the removal from the RLGG of literals, so long as removal of the literal does not allow the derivation of a negative example of the concept