Relational Learning

Mitchell, "Machine Learning", Ch 9.5 & 10

Muggleton: "Inductive Logic Programming"

Web Pointers

What is Relational Learning?

The learning approaches we have so far examined in detail - decision tree learning systems, feed-forward neural networks, genetic algorithms - use representation languages which are effectively propositional. That is, their search spaces are not able to represent relationships between data items. But many, perhaps most, real-World problems do involve relationships.

Propositional learning systems cannot learn relational knowledge, such as a definition of a sort or append relation; they cannot learn about family relationships, about geometric or temporal relationships; in my own field of interest, ecological problems, they cannot learn about spatial relationships. In fact, even relational learning is insufficient: without functions, it is difficult to express - and therefore learn - important concepts such as distance.

How do I tell if my Learning Problem is Relational?

In many cases, this is quite easy. For example, in natural resource domains one common application of machine learning is to attempt to predict species distribution from satellite or other mapping data. In a non-relational learning system, eg C4.5, each location would be treated separately - ie the system ignores any spatial relationships between data locations.

For species which can move about, eg animals, this often results in quite poor predictions. Greater gliders require both old trees with hollows (for nesting sites) and young, fast growing trees for feeding; these may be quite distant from each other - greater gliders will move up to a kilometer per night. If we had a dataset with one-hectare sample squares, a propositional learner would be unable to distinguish between a large area of new forest - where we would not expect to find many greater gliders - and a patchwork of new and old forest, where greater gliders are likely to be abundant. Of course we could synthesise new attributes for a propositional learning system - eg 'patchwork' vs 'uniform' - which record the relevant spatial attributes and incorporate them into each location, but we can only do this if we know ahead of time which spatial relationships are relevant. In general, it's simpler just to use a relational learning system and supply it with the means to compute the spatial relationships (ie functions to compute distance, orientation etc).

Conversely, if we were trying to predict the distribution of trees, we might hypothesise that spatial relationships were less important to trees because they don't move about, and hence that propositional learners would be adequate for predicting tree distributions. In general, we would be correct - non-relational learning systems have been used quite successfully for tree distribution prediction. But it is easy to underestimate the importance of relations - for example, we would expect a poor site for a particular species to have more of that species if it was near to a good site, simply because it would be continually receiving seed from the good site. Or to take another example, if our dataset included spot heights but not soil moisure, a relational learner could in theory use those soil moisture, whereas a propositional learner could not.

A second use of relationships is in comparing attributes. For example, suppose we have a dataset about companies (perhaps for a share portfolio system), and the data incorporates income and expenditure attributes but not profit. A relational system could readily distinguish between profitable and loss making companies through the relation income > expenditure, whereas a propositional system could not. Of course, any sensible financial advisor would have included profit/loss as an attribute, but other relationships may be less obvious.

Relational Learning Approaches

There are two principal relational learning approaches in use today:

Inductive Logic Programming (ILP)

ILP systems learn relations in the form of logic theories, usually expressed in a subset of the prolog language. The learning algorithms are typically determinstic. They can learn highly complex theories, certainly up to the level of undergraduate computing students. Their tolerance of noise ranges from poor to fair. They provide powerful mechanisms for incorporating user knowledge to help guide the search. They are relatively easy to re-target to new learning problems.

Genetic Programming (GP)

GP systems learn relations in the form of programs; originally, these were mostly in functional languages such as lisp, though recently there has been a lot of work on machine code as well (for obvious efficiency reasons). The learning algorithms, being evolutionary, are highly non-deterministic. They can learn relatively complex programs, though not perhaps as complex as those learnt by ILP. On the other hand, their noise tolerance is usually excellent. Mechanisms for incorporating user knowledge are weaker. They are more difficult to re-target to a new learning problem - the user has to supply the semantics of the problem and a fitness function, usually in the form of a program.

Applications of Relational Learning

Relational learning systems have been applied to a wide range of learning problems - diagnosing circuit faults, designing electronic circuits, analysis of protein folding, interpreting DNA, ecological modelling etc. However they have also been used in two other, less obvious, ways

Language Learning

Languages are just a special form of relational theory, hence relational systems may be used to learn languages. Exerimental and theoretical work in this area is casting light on the process of learning language, and on theoretical issues such as Chomsky's theory of innate language learning abilities.

Social Modelling

Relational learning systems provide reasonable simulations for rational beings, and hence may be used to model social processes and confirm or disprove theories. They have been used to demonstrate that a number of assumptions in economic theory are necessary.

 

Why is Relational Learning Difficult?

In propositional and attribute learning:

In relational learning:

Why Use Background Knowledge?

Inductive Logic Programming

Banerji: Predicate Logic (1964)

Use of predicate logic as a description language;

Incremental learning (learned concepts added to the hypothesis language).

Plotkin: Relative Least General Generalisation (1970)

Language: First Order Predicate Calculus.

We previously noted the connection between implication and generalisation

One way for one clause to imply another is for the first to contain all the literals of (subsume) the second

Another way is for the second to be a substitution instance (via a substitution theta) of the first

Combining these, we get the definition of theta-subsumption: C q -subsumes D iff there is a substitution q such that q applied to C gives a subset of D.

The aim of Least General Generalisation (LGG) is to climb up the q -subsumption hierarchy the minimum distance necessary to generalise the specific clauses under consideration (the learning data)

LGG can't handle background knowledge: extension to Relative LGG

Requires a definition of generalisation relative to background knowledge: C generalises D relative to theory T iff C q -subsumes D together with the negations of the literals in T

 

Computational Problems

Theoretically nice, but dropped for a time because of computational problems:

Note that theta-subsumption is not an exact inverse of implication: let

C be the rule nat(X) => nat(s(X))

D be the rule nat(X) => nat(s(s(X)))

then C implies D, but C does not theta-subsume D.

Michalski: Induce (1973,1978,1980)

Used a very limited, special purpose subset of the predicate calculus.

Relied on the 'star' induction algorithm to find descriptions of a class: special-to-general learner

Computational complexity of the star induction algorithm is unsatisfactory for lage datasets

 

Vere: Thoth (1975, 1977)

Language: literals consist of strings of constant and variable symbols (so effectively a higher order logic); provides conjuntion, disjunction and restricted use of negation

Generalisation by matching: literals from different instances are matched as well as possible; non-matching constants are replaced by variables.

B Cohen: Confucius (1978, 1982)�

Language: subset of predicate calculus

Sammut: Marvin (1980, 1986)

Language: initially attribute-value based, eventually Horn clause

Generalises using background knowledge, stored as Horn clauses

Mixed specialisation and generalisation process

 

Shapiro: MIS (1981)

Language: prolog

Aimed at debugging prolog programs:

If the theory is too general (ie inconsistent with examples), try to trace the part of the theory leading to the inconsistency, and delete it

If the theory is too specific (ie does not cover all examples), add to the hypothesis by refining an already refuted hypothesis

Refinement operators provided included:

Muggleton: DUCE (1987)

Language: Propositional, not relational

Critical to later relational work

In lecture 2, we mentioned that machine deduction mostly uses a method called ‘resolution’

DUCE is based on the idea of 'inverting resolution' to give generalisation operators

 

The Resolution Method

Resolution is most readily understood as a method of proving unsatisfiability: rather than prove P valid, the resolution method attempts to prove that ~P is unsatisfiable.

Conjunctive Normal Form

A literal is an atomic formula or the negation of an atomic formula

A formula is in conjunctive normal form (CNF) if it is the conjunction of disjunctions of literals

Each formula has (up to equivalence) only one CNF (the engineers in the class might prefer to think of minterms)

Clauses

Because of commutativity, the order of formulae in a disjunction or conjunction is irrelevant.

So it is often convenient to just consider the unordered set of formulae making up the conjunction or disjunction

define:

Clause

A clause is a finite set of literals, possibly empty (representing a finite disjunction of literals)

We represent the empty clause as

Clause Set

A clause set is a set of clauses, possibly empty (denoted ø) or even infinite, and possibly containing

Using Clause Sets

Every finite nonempty clause set not containing corresponds to a formula in CNF in an obvious way

Every formula in CNF has such a representation

From a propositional formula, we can first construct its CNF, and then the corresponding clause set

Resolution

Resolvents

Let C1 and C2 be two clauses, and A an atomic formula with A being in C1 and ~A being in C2

Then D = (C1 \ {A}) ª (C2 \ {~A}) is called a resolvent of C1 and C2

That is, a resolvent is obtained by removing a complementary pair of literals from two clauses, and merging the resultant sets

The Resolution Rule

If S is a clause set, and D is a resolvent of two clauses in S, then S is logically equivalent to S ª {D}

That is, adding a resolvent of a clause set to the set does not change its satisfiability

Resolution Theorem

Clause set S is unsatisfiable if and only if by the empty clause is obtainable by repeated application of resolution to S

Application

The Resolution Theorem thus gives us a method for deciding the validity of a proposition P:

Form the negation ~P

Find the CNF C of this

Convert to a clause set S

Perform all possible resolutions

Check whether the empty clause occurs

However in performing all possible resolutions, we may have computed far more than is actually required:

we only need to find one way to construct the empty clause

Thus the aim is to find an algorithm which will efficiently find such a deduction if there is one.

There is good reason to believe that no efficient general method can be found, but polynomial time algorithms for some important restricted classes of formulae have been discovered

Inverting Resolution

V Operators

The V operators invert one resolution step:

Given two clauses

A = {A1,...,Am,L}

B = {B1,...,Bn,~L}

resolution creates the resolvent

D = {A1,...,Am,B1,...,Bn}.

The V operators run this backward

 

Absorption

creates

B = {B1,...,Bn,~L}

from

D = {A1,...,Am,B1,...,Bn}

A = {A1,...,Am,L}.

Identification

creates

A = {A1,...,Am,L}

from

D = {A1,...,Am,B1,...,Bn}

B = {B1,...,Bn,~L}.

W Operators

The W operators invert a pair of 'back-to-back' resolution steps:

Given three clauses

A = {A1,...,Am,L}

B = {B1,...,Bn,L}

C = {C1,...,Cr,~L}

resolution can create the resolvents

AC = {A1,...,Am, C1,...,Cr}

BC = { B1,...,Bn, C1,...,Cr}.

Intra-construction

from

AC = {A1,...,Am, C1,...,Cr}

BC = { B1,...,Bn, C1,...,Cr}

construct the clauses

A = {A1,...,Am,L}

B = {B1,...,Bn,L}

C = {C1,...,Cr,~L}

 

Inter-construction

Almost the same, but replaces L by ~L:

from

AC = {A1,...,Am, C1,...,Cr}

BC = { B1,...,Bn, C1,...,Cr}

construct the clauses

A = {A1,...,Am,~L}

B = {B1,...,Bn,~L}

C = {C1,...,Cr,L}

A critical point to note is that W operators can invent new literals:

The literal L occurs in the generalisation, but was not present in the original clauses AC and BC

Muggleton & Buntine: Cigol (1992)

Language: function-free prolog

Builds on DUCE and Buntine's work on relative subsumption and relative generalisation

Theoretically, the only difference from DUCE is the propositional/predicate change: extending resolution to the predicate calculus requires replacing identity by unification

 

The Resolution Method in the Predicate Calculus

Predicate Normal Forms

Prenex Normal Form

Just as any propositional formula can be reduced to an equivalent CNF, any predicate formula can be reduced to prenex normal form

In prenex normal form, all the quantifiers have been moved to the front: the formula consists of a string of quantifiers over variables, followed by the body (known as the matrix) of the formula, which contains no quantifiers, and is in CNF

Full Normal Form

Further magic (involving the introduction of lots of new function symbols) allows the removal of all the existential quantifiers, and once only universal quantifiers are left, they may be taken as implicit, so the formula can again be treated as a clause set

Resolution & Unification

As with the propositional calculus, resolution may be applied to derive the empty clause

But there's a complication that doesn't arise with propositional logic:

What is the resolvent of (say)

{king(X,louis)}

{~king(france,Y)}

Propositionally, there isn't one

In predicate deduction, we can unify

X with france

Y with louis

so that the two clauses then become

{king(france,louis)}

{~king(france,louis)}

which give the resolvent

A conclusion which will no doubt satisfy the French republicans

Inverse Resolution and Subsumption

Extending inverse resolution to the predicate calculus requires replacing identity by substitution

In fact, the full generality of the V and W operators would generate unacceptably large search spaces, so only a restricted subset is implemented (refer to the diagrams under DUCE)

V operators:

identification

not implemented

absorption

clauses A and B contain no common literals (other than literals that specialise to L)

W operators:

inter-construction

not implemented

intra-construction

clauses A and B contain only one literal ('unit clauses')

 

Truncation

The final inverse resolution operator, truncation, is effectively a degenerate case of the W operators

Given two negative singleton clauses

{ ~L1 }

{ ~L2 }

there may be a literal L which, with appropriate variable substitutions, can resolve separately with each of them, giving the resolution diagram :

Given { ~L1 } and { ~L2 }, we can try to generate an appropriate L

In fact, a good choice is the LGG of the two, if it happens to exist

So from

member(1,cons(1,2))

member(red,cons(red,green))

truncation would generate

member(A,cons(A,B))

In fact, Cigol extends truncation to produce the LGG of an arbitrarily large set of unit clauses

Search Restrictions

Despite the restricted form of the V and W operators, the search space still requires narrowing

Cigol imposes an additional requirement that at each stage of the search, the new theory is syntactically smaller than the original

Muggleton & Feng: Golem (1992)

Golem, or RLGG revisited: Recall the problems demonstrated by Plotkin’s 1960s work on RLGGs

Muggleton and Feng found that a set of reasonable restrictions on

combined with

could improve the tractability of the RLGG approach

Restricted Background Knowledge

Golem restricts the background knowledge to be syntactically generative

A clause

A1 &....& Am => B

is syntactically generative if the varables in B are a subset of the variables in A1,...., Am

For example,

member(X,Ys) =>

member(X,[Y|Ys])

is not syntactically generative (since Y occurs in the conclusion but not the antecedent of the rule), but

member(X,Ys) &

integer(Y) =>

member(X,[Y|Ys])

is syntactically generative

In fact, a non-generative clause may always be made generative by adding typing or other information

 

Finite Ground Model

Muggleton & Feng combine this restriction with the use of a finite subset of a ground model of the background theory

Specifically, they define an atom A of a theory K to be h-easy if A can be derived from K in at most h applications of resolution

They then define Mh(K) to be the Herbrand instantiations of h-easy atoms of K (the Herbrand instantiations can be thought of as the instantiations that have to exist just because of the symbols in the language)

Mh(K) is thus a finite subset of a ground model of K (in fact, it’s at most exponential in K)

Restricted Hypothesis Language

The use of h-easy models and generative background theories improved Plotkin’s results by at least ensuring that RLGGs were finite

But with a length exponential in the number of examples, the RLGG was still not computationally tractable

Determinacy in the body of a clause can improve learnability

Speaking carelessly, a variable V in a literal is determinate if for every combination of values of the other variables in the literal, there is exactly one value of V for which the literal is true

Thus Z is determinate in

append(X,Y,Z)

but X is not in

member(X,Y)

Muggleton & Feng further extended this definition to the idea of ij-determinacy

They were able to prove that, for fixed i and j, the size of the RLGG of n examples was not only not exponential in n, but was independent of n

 

ij-determinacy

Muggleton & Feng illustrate the idea of ij-determinacy with the clause:

decrement(B,D &

multiply(A,D,E) &

plus(A,E,C) =>

multiply(A,B,C)

In the literal ‘multiply(A,D,E)’, the value of E is determined only if both the values A and D are fixed; E is said to be of degree 2

In ij-determinacy, the j specifies the maximum degree of any variable in any literal of a hypothesised clause which does not occur in the conclusion of the clause

The i specifies the maximum depth at which a determinate variable may occur

The depth of a variable which occurs in the conclusion of the clause is defined as 0

If a literal determines its variables in terms of other variables whose maximum depth is n, then the depth of that literal is n+1

So in the example

Thus the example clause is 22-determinate

Clause Reduction

While Golem’s restrictions on background knowledge and hypothesis language do allow the computation of relatively small RLGGs, they do not remove the need for clause reduction

Rather than following Plotkin’s approach of logical reduction, Muggleton & Feng used more semantically based approaches to clause reduction: functional and negative-based reduction

Functional Reduction

Golem supports the idea of mode declarations, which specify which variables of a predicate will be used for input values, and which for output

For example, append might be declared as

mode(append(+,+,-))

which says that (in this context) append will only be used to join two lists together to form a third

From such mode declarations, Golem can construct a functional graph showing the dependency of predicates in a clause - eg for quicksort:

mode(qsort(+,-))

mode(partition(+,+,-,-))

partition(A,B,E,F) &

qsort(F,G) &

qsort(E,H) &

append(H,[A|G],[C|D]) =>

qsort([A|B],[C|D])

Given a candidate RLGG, Golem attempts to construct a functional graph of the hypothesis using only literals within the RLGG

If it succeeds, any literals not occurring in the functional graph are deleted from the RLGG

Note that this functional reduction relies on the assumption of determinacy

 

Negative-based Reduction

So far, we have discussed learning only in the presence of positive examples of a concept

Golem also makes use of negative examples (ie possible tuples that are not part of the concept)

These are held in a negative example file

Negative-based reduction of the RLGG relies on the removal from the RLGG of literals, so long as removal of the literal does not allow the derivation of a negative example of the concept