Extracted Text

differentiable-genetic-programming.pdf

Dierentiable Genetic Programming
Dario Izzo
1
, Francesco Biscani
2
, and Alessio Mereta
1
Advanced Concepts Team, European Space Agency, Noordwijk 2201AZ, The
Netherlands
dario.izzo@esa.int
Abstract.We introduce the use of high order automatic dierentiation,
implemented via the algebra of truncated Taylor polynomials, in genetic
programming. Using the Cartesian Genetic Programming encoding we
obtain a high-order Taylor representation of the program output that is
then used to back-propagate errors during learning. The resulting ma-
chine learning framework is called dierentiable Cartesian Genetic Pro-
gramming (dCGP). In the context of symbolic regression, dCGP oers
a new approach to the long unsolved problem of constant representation
in GP expressions. On several problems of increasing complexity we nd
that dCGP is able to nd the exact form of the symbolic expression as
well as the constants values. We also demonstrate the use of dCGP to
solve a large class of dierential equations and to nd prime integrals
of dynamical systems, presenting, in both cases, results that conrm the
ecacy of our approach.
Keywords:genetic programming, truncated Taylor polynomials, ma-
chine learning, symbolic regression, back-propagation
1 Introduction
Many of the most celebrated techniques in Articial Intelligence would not be
as successful if, at some level, they were not exploiting dierentiable quantities.
The back-propagation algorithm, at the center of learning in articial neural
networks, leverages on the rst and (sometimes) second order dierentials of the
error to update the network weights. Gradient boosting techniques make use of
the negative gradients of a loss function to iteratively improve over some initial
model. More recently, dierentiable memory access operations were successfully
implemented [8] [9] and shown to give rise to new, exciting, neural architectures.
Even in the area of evolutionary computations, also sometimes referred to as
derivative-freeoptimization, having the derivatives of the tness function is im-
mensely useful, to the extent that many derivative-free algorithms, in one way
or another, seek to approximate such information (e.g. the covariance matrix
in CMA-ES as an approximation of the inverse Hessian [10]). In the context of
Genetic Programming too, previous works made use of the dierential proper-
ties of the encoded programs [19] [22] [21] [7], showing the potential use of such
information, though a systematic use of the program dierentials is still lacking
in this eld.

In this paper, we introduce the use of high-order automatic dierentiation
as a tool to obtain a complete representation of the dierential properties of a
program encoded by a genetic programming expression and thus of the model it
represents. We make use of the truncated Taylor polynomial algebra to compute
such information eciently in one forward pass of the encoded program. The non
trivial implementation of the necessary algebraic manipulations, as well as the
resulting framework, is oered to the community as an open source project called
AuDi [12] (a C++11 header only library and a python module) allowing the ma-
chine learning community to gain access to a tool we believe can lead to several
interesting advances. We use AuDi to evaluate Cartesian Genetic Programming
expressions introducing a number of applications where the dierential informa-
tion is used to back-propagate the error on a number of model parameters. The
resulting new machine learning tool is called dCGP and is also released as an
open source project [11]. Using dCGP we study in more detail three distinct
application domains where the Taylor expansion of the encoded program is used
during learning: we perform symbolic regression on expressions including real
constants, we search for analytical solutions to dierential equations, we search
prime integrals of motion in dynamical systems.
The paper is structured as follows: in Section 2 we describe the program
encoding used, essentially a weighted version of the standard CGP encoding. In
Section 3 we introduce, the dierential algebra of truncated Taylor polynomial
used to obtain a high order automatic dierentiation system. Some examples
(Section 5) are also given to help the reader go through the possibly unfamiliar
notation and algebra. In Section 6, using dCGP, we propose two dierent new
methods to perform symbolic regression on expressions that include real con-
stants. In Section 7 we use dCGP to nd the analytical solution to dierential
equations. In the following Section 8, we propose a method to systematically
search for prime integrals of motions in dynamical systems.
2 Program encoding
To represent our functional programs we use the Cartesian Genetic Programming
framework [16]. CGP is a form of Genetic Programming in which computations
are organized using a matrix of nodes as depicted in Figure 1. In the original
formulation, nodes are arranged inrrows andccolumns, and each is dened
by a functionFiwith arityaand by its connectionsCij,j= 1::aindicating
the nodes whose outputs to use to compute the node output viaFi. Connec-
tions can only point to nodes in previous columns and a maximum oflcolumns
back (levels back). Input and output nodes are also present. By evaluating all
nodes in cascade starting from theNininput nodes, theNoutoutputs nodes are
computed. They represent a complex combination of the inputs able to easily
represent computer programs, digital circuits, mathematical relations and, in
general, computational structures. Depending on the variousCij, not all nodes
aects the output so that only the active nodes need to be actually computed.
The connectionsCijare also here associated to multiplicative factors (weights)

F1n+ 1C1;1C1;aF2n+ 2C2;1C2;aFrn+rCr;1Cr;aFr+1n+r+ 1Cr+1;1Cr+1;aFr+2n+r+ 2Cr+2;1Cr+2;aF2rn+r+rC2r;1C2r;aFcr+1n+cr+ 1Ccr+1;1Ccr+1;aFcr+2n+cr+ 2Ccr+2;1Ccr+2;aFr(c+1)n+cr+rCr(c+1);1Cr(c+1)1;a12no1o2omF1; C1;1::C1;a; F2; C2;1::C2;a; :::; o1::om; w1;1::w1;a; :::; w2;1::w2;a
Fig. 1.The general form of a CGP as described in [16]. In our version weights are also
assigned to the connectionsCijso that the program is dened by some integer values
(the connections, the functions and the outputs) and by some oating point values (the
connection weights)
wi;j. In this way, a CGP is allowed to represent highly successful models such
as feed forward articial neural networks as suggested in [13], [23].
3 The algebra of truncated polynomials
Consider the setPnof all polynomials of ordernwith coecients inR.
Under the truncated polynomial multiplicationPnbecomes a eld (i.e., a ring
whose nonzero elements form an abelian group under such multiplication). The
meaning of this last statement is, essentially, that we may operate inPnusing
four arithmetic operations +;;; =as we normally do in more familiar elds
such asRorC. In order to formally dene the division operator, we indicate the
generic element ofPnasp=p0+ ^pso that the constant and the non-constant
part of the polynomial are clearly separated. The multiplicative inverse ofp2 Pn
is then dened as:
p
1
= 1=p=
1
p0

1 +
m
X
k=1
(1)
k
(^p=p0)
k
!
(1)
As an example, to compute, inP2, the inverse ofp= 1 +xy
2
we writep0= 1
and ^p=xy
2
. Applying the denition we getp
1
= 1^p+^p
2
= (1(xy
2
)+x
2
).
It is then trivial to verify thatpp
1
= 1.
3.1 The link to Taylor polynomials
We make use of the multi-index notation according to which= (1; ::; n)
andx= (x1; ::; xn) are n-tuples and the Taylor expansion around the pointato

ordermof a multivariate functionfofxis written as:
Tf(x) =
m
X
jj=0
(xa)

!
(@

f)(a) (2)
where:
@

f=
@
jj
f
@
1x1@
2x2: : : @
nxn
! =
n
Y
j=1
j!
and
jj=
n
X
j=1
j
The summation
P
n
jj=0
must then be taken over all possible combinations of
j2Nsuch that
P
n
j=1
j=jj. The expression in Eq.(2),i.e.the Taylor
expansion truncated at ordermof a generic functionf, is a polynomial2 Pn
in themvariablesdx=xa. It follows from Eq.(2) that a Taylor polynomial
contains the information on all the derivatives off.
The remarkable thing about the eldPnis that its arithmetic represents
also Taylor polynomials, in the sense that ifTf; Tg2 Pnare truncated Taylor
expansions of two functionsf; gthen the truncated Taylor expansions off
g; fg; f=gcan be found operating on the eldPnsimply computingTfTg; Tf
Tg; Tf=Tg. We may thus compute high order derivatives of multivariate rational
functions by computing their Taylor expansions inPnand then extracting the
desired coecients.
Example - A divisionConsider the following rational function of two vari-
ables:
h= (xy)=(x+ 2xy+y
2
) =f=g
Its Taylor expansionTh2 P2around the pointx= 0,y= 1 can be computed
as follows:
Tx= 0 +dx
Ty= 1 +dy
Tg=Tx+ 2TxTy+TyTy= 1 + 3dx+ 2dy+ 2dxdy+dy
2
applying now Eq.(1), we get:
T
1=g= (1^p+ ^p
2
)
where ^p= 3dx+ 2dy+ 2dxdy+dy
2
, hence:
T
1=g= 13dx2dy+ 10dxdy+ 9dx
2
+ 3dy
2

and,
Th= (1 +dxdy)T
1=g=1 + 4dx+dy9dxdy12dx
2
dy
2
which allows to compute the value of the function and all derivatives up to the
second order in the pointx= 0; y= 1:
h=1; @xh= 4; @yh= 1; @xyh=9; @xxh=24; @yyh=2;
3.2 Non rational functions
If the functionfis not rational,i.e.it is not the ratio between two polynomials,
it is still possible, in most cases, to compute its truncated Taylor expansion
operating inPn. This remarkable result is made possible leveraging on the nil-
potency property of ^pinPn,i.e.^p
k
= 0 ifk > n. We outline the derivation in the
case of the functionf= exp(g), other cases are treated in details in [2]. We write
Tf= exp(Tg) = exp(p0+ ^p) = expp0exp ^p. Using the series representation of the
function exp we may writef= expp0
P
1
i=0
^p
i
i!
. As per the nil-potency property
this innite series is, inPnnite and we can thus write exp(p) = expp0
P
n
i=0
^p
i
i!
,
or equivalently:
Tf= expp0(1 + ^p+ ^p
2
+::+ ^p
n
)
With similar derivations it is possible to dene most commonly used functions
including exp;log;sin;cos;tan;arctan;arcsin;arccos, abs as well as exponentia-
tion. Operating in this algebra, rather than in the common arithmetical one, we
compute the programs encoded by CGP obtaining not only the program output,
but also all of its dierential variations (up to ordern) with respect to any of the
parameters we are interested in. This idea is leveraged in this paper to propose
new learning methods in genetic programming.
4 Computer Implementation
The computer implementation of a CGP system computing in the truncated
Taylor polynomial algebra (a dCGP) follows naturally introducing of a new
data type representing a truncated polynomial and by overloading all arithmetic
operations and functions. Such type is here calledgeneralized dual numberto
explicitly indicate the link with thedual numbersoften used in computer science
to implement rst order automatic dierentiation. We implemented generalized
dual numbers in C++ and Python in the open source project AuDi [12] and a
version of CGP able to compute with them in the open source software dCGP
[11]. Besides the basic four arithmetic operations +;;; =the following function
are, at the moment of writing, available: exp, log, pow, sin, cos, tan, atan, acos,
asin, sinh, cosh, tanh, atanh, acosh, asinh, erf, sqrt, cbrt, abs. Two commercial
codes are also available that implement the algebra of truncated Taylor polyno-
mials. They were developed for use in beam physics [15] and in astrodynamics
[18] and can be compared to our own implementation in AuDi, developed for

Table 1.Speedup obtained by our implementation of the truncated Taylor polynomial
dierential algebra (AuDi) when computingon batches of dierent sizesb, with
respect to the library DACE [18]. For small batch sizes and low orders DACE is still
two times faster than AuDi.
.
order1 2 3 4 5 6 7 8 9
batch size
16 0:59 0:61 0:65 0:601.41 1.52 1.86 1.61 2.03
64 2.26 2.07 2.07 1.91 1.51 2.13 2.91 3.04 3.58
256 5.74 5.39 4.82 4.44 3.01 3.64 4.75 4.64 7.18
1024 14.21 10.82 8.94 7.45 6.70 6.86 7.80 7.52 8.27
4096 20.02 13.26 9.77 7.44 4.58 5.92 6.15 5.14 4.76
16384 19.03 8.50 5.34 4.00 3.74 4.71 4.39 4.11 5.01
machine learning applications. When implementing such an algebra, the most
sensitive choices are to be made in the implementation of the truncated multi-
plication between polynomials. Assuming such an operator available, the rest of
the algebra is implemented by computing the power series dening the various
operators (for example the division would be implemented directly from Eq.(1)).
With respect to the mentioned codes our implementation, diers on the imple-
mentation of the truncated multiplication, introducing an approach that allows
to parallelize and vectorize the operation. The use of vectorization is particularly
useful in the context of machine learning as algorithms often operate on batches
to construct a learning error which is typically obtained summing the result of
the same polynomial computation repeated over multiple input points.
4.1 The truncated multiplication
Consider the Taylor expansion of a function onmvariables inPn. The number
Dof terms in such a Taylor polynomial is the total number of derivatives of
maximum orderminnvariables. Such number is given by the formula:
D=
m
X
k=1

k+n1
k

In Table 2 we reportDfor varyingnandm. It is clear that the explosion of the
number of monomials will make the computations unpractical for high values
ofnandm. Note how all of the existing machine learning methods exploit the
upper part of the table, wheremis large and, necessarily,nis low (typically
maximum two). While the area on Table 2 that corresponds to low values ofm
and high orders is unexplored even if the resulting Taylor expansion carries the
same number of terms,i.e.results in a similar computational complexity.
The details on how to eciently implement the truncated polynomial mul-
tiplication go beyond the scope of this paper, the nal algorithm here used is
made available as part of the open source algebraic manipulator Piranha [4]. It
is, essentially, based on the work on algebraic manipulators by Biscani [3] and ex-
ploits the possible sparsity of the polynomials as well as ne-grained parallelism

to alleviate the intense CPU load needed at high orders/many variables. For the
purpose of using Piranha with AuDi, we added the possibility to compute the
truncated multiplication over polynomials whose coecients are vectorized, ob-
taining a substantial advantage in terms of computational speed as the overhead
introduced to order all monomials in the nal expression can be accounted only
once per vectorized operation.
PerformancesIn order to preliminarily assess the performance of our imple-
mentation of the dierential algebra of the truncated Taylor polynomials in
AuDi, we measure the CPU time taken to compute the following ctitious train-
ing error:
= (c+w1+w2+w3+w4+w5+::+wn)
N
which is here taken as representative of the algebraic manipulations necessary to
compute the training error of a program (or model) withndierent weights to
be learned. We repeat the same computation on a batch of sizebusing DACE
1
[18] and the AuDi forN= 20 andn= 7. The truncation ordermas well as
the batch sizebare instead varied. The results are shown in Table 1 where it is
highlighted how for low batch sizes and derivation order AuDi does not oer any
speedup and is, instead, slower. As soon as larger batch sizes are needed (already
atb= 64) or higher orders are desired, AuDi takes instead full advantage of the
underlying vectorized truncated polynomial multiplication algorithm and oers
substantial speedup opportunities. The table reported was obtained on a 40
processors machine equipped with two Intel(R) Xeon(R) CPU E5-2687W v3
@ 3.10GHz. Dierent architectures or error denitions may change the speedup
values also considerably, but not the conclusion that at high orders or batch sizes
AuDi is a very competitive implementation of the truncated Taylor polynomials
dierential algebra.
5 Example of a dCGP
Let us consider a CGP expression havingn= 3,m= 1,r= 1,c= 10,l= 3,
a= 2,wi;j= 1 and the following kernel functions: +,*,/,whereis the sigmoid
function. Using the same convention used in [16] for the node numbering, the
chromosome:
x= [2;1;1;0;2;0;1;2;2;1;2;1;3;1;2;3;6;3;3;4;2;2;8;0;2;7;2;2;3;10;10]
will produce, after simplications, the following expression at the terminal node:
o0=
(yz+ 1)
x
1
While a real comparison does not exist in the literature, informal communications
with the authors of DACE revealed that the other existing implementation of the
Taylor polynomials dierential algebra, COSY [15] results in comparable CPU times
to DACE

m= number of variables
1 2 3 4 5 6 7 8 9 10 11 12
n= order
11 2 3 4 5 6 7 8 9 10 11 12
22 5 9 14 20 27 35 44 54 65 77 90
33 9 19 34 55 83 119 164 219 285 363 454
44 14 34 69 125 209 329 494 714 1000 1364 1819
55 20 55 125 251 461 791 1286 2001 3002 4367 6187
66 27 83 209 461 923 1715 3002 5004 8007 12375 18563
77 35 119 329 791 1715 3431 6434 11439 19447 31823 50387
88 44 164 494 1286 3002 6434 12869 24309 43757 75581 125969
99 54 219 714 2001 5004 11439 24309 48619 92377 167959 293929
1010 65 285 1000 3002 8007 19447 43757 92377 184755 352715 646645
1111 77 363 1364 4367 12375 31823 75581 167959 352715 705431 1352077
1212 90 454 1819 6187 18563 50387 125969 293929 646645 1352077 2704155
Table 2.Number of monomials in a Taylor polynomial for varying ordernand number
of variablesm.
in Figure 2 the actual graph expressing the above expression is visualized. If we
now consider the pointx= 1,y= 1 andz= 1 and we evaluate classically the
CGP expression (thus operating on oat numbers), we get, trivially,O1= 0:881
where, for convenience, we only report the output rounded up to 3 signicant
digits. Let us, instead, use dCGP to perform such evaluation. One option could
be to denex; y; zas generalized dual numbers operating inP2. In this case,
the output of the dCGP program will then be a Taylor polynomial inx; y; z
truncated at second order:
o0= 0:8810:881dx+ 0:105dy+ 0:105dz+ 0:881dx
2
0:0400dy
2
0:0400dz
2
0:105dxdz+ 0:025dydz0:105dxdy
carrying information not only on the actual program output, but also on all of
its derivatives (in this case up to order 2) with respect to the chosen parameters
(i.e.all inputs, in this case). Another option could be to dene some of the
weights as generalized dual numbers, in which case it is convenient to report the
expression at the output node with the weights values explicitly appearing as
parameters:
o0=w10;0
(w3;0w8;1=w3;1+w6;0w6;1w8;0yz)
w10;1x
If we denew3;1andw10;1as generalized dual numbers operating inP3then,
evaluating the dCGP will result in:
o0= 0:8810:104dw3;10:881dw10;1+0:104dw10;1dw3;1+0:0650dw
2
3;1+0:881dw
2
10;1
0:0315dw
3
3;10:881dw
3
10;10:104dw
2
10;1dw3;10:0650dw10;1dw
2
3;1
6 Learning constants in symbolic regression
A rst application of dCGP we study is the use of the derivatives of the expressed
program to learn the values of constants by, essentially, back-propagating the

Fig. 2.A weighted CGP graph expressing, (assumingwi;j= 1)o0=
(yz+1)
x
at the
output node.
error to the constants' value. In 1997, during a tutorial on Genetic Programming
in Paolo Alto, John Koza [14] one of the fathers of Genetic Programming research
noted that \nding of numeric constants is a skeleton in the GP closet ... an area
of research that requires more investigation". Years later, the problem, while
investigated by multiple researchers, is still open [17]. The standard approach
to this issue is that of theephemeral constantwhere a few additional input
terminals containing constants are added and used by the encoded program to
build more complex representations of any constant. Under this approach one
would hope that to approximate, say,evolution would assemble, for example,
the block
22
7
= 3:1428 from the additional terminals. In some other versions of the
approach the values of the input constants are subject to genetic operators [7], or
are just random. Here we rst present a new approach that back-propagates the
error on the ephemeral constants values, later we introduce a dierent approach
using weighted dCGP expressions and back-propagating the error on the weights.
6.1 Ephemeral constants approach
The idea of learning ephemeral constants by back-propagating the error of a GP
expression was rst studied in [21] where gradient descent was used during evo-
lution of a GP tree. The technique was not proved, though, to be able to solve
the problems there considered (involving integers), but to reduce the RMSE at
a xed number of generations with respect to a standard genetic programming
technique. Here we will, instead, focus on using dCGP to solve exactly sym-
bolic regression problems involving real constants such asande. Consider
the quadratic errorq(c) =
P
i
(yi(c)^yi)
2
whereccontains the values of the
ephemeral constants,yiis the value of the dCGP evaluated on the input pointxi
and ^yithe target value. Dene the tness (error) of a candidate dCGP expression

as:
= min
c
q(c)
This "inner" minimization problem can be eciently solved by a second order
method. If the number of input constants is reasonably small, the classic formula
for the Newton's method can be conveniently applied iteratively:
ci+1=ciH
1
(ci)rq(ci)
starting from some initial valuec0. The HessianHand the gradientrare
extracted from the Taylor expansion of the errorcomputed via the dCGP.
Note that if the constantscappear only linearly in the candidate expression, one
single step will be sucient to get the exact solution to the inner minimization
problem, asq(c) is the quadratic error. This suggests the use of the following
approximation of the error dened above, where one single learning step is taken:
=q(c0H
1
(c0)rq(c0)) (3)
wherec0is the current value for the constants. In those cases where this Newton
step does not reduce the error (whenHis not positive denite we do not have
any guarantee that the search direction will lead to a smaller error) we, instead,
take a few steps of gradient descent (learning rate set to 0.05). The valuec0is
initialized to some random value and, during evolution, is set to be the best found
so far. Note that when one ospring happen to be the un-mutated parent, its
tness will still improve on the parents' thanks to the local learning, essentially
by applying one or more Newton step. This mechanism (also referred to as
Lamarckian evolution in memetic research [21]) ensures that the exact value,
and not an approximation, is eventually found for the constants even if only one
Newton step is taken at each iteration. Note also that the choice of the quadratic
error function in connection with a Newton method will greatly favour, during
evolution, expressions such as, for example,c1x+c2rather than the equivalent
c
2
1c2x+
1
c2
.
ExperimentsWe consider seven dierent symbolic regression problems (see
Table 3) of varying complexity and containing the real constantsande. Ten
pointsxiare chosen on a uniform grid within some lower and upper bound. For
all problems the error dened by Eq.(3) is used as tness to evolve an unweighted
dCGP withr= 1,c= 15,l= 16,a= 2,Nin= 2 or 3 according to the number
of constants present in the target expression andNout= 1. As Kernel functions
we use: +;;; =for problems P1, P2, P3, P5, P6 and +;;; =;sin;cos for
problems P4, P7. The evolution is driven by a (1+)-ES evolution strategy with
= 4 where theithmutant of theospring is created by mutatingiactive
genes in the dCGP. We run all our experiments
2
100 times and for a maximum of
gmaxgenerations. The 100 runs are then considered as multi starts of the same
2
The exact details on all these experiments can be found in two IPython notebooks
available here:https://goo.gl/iH5GAR,https://goo.gl/0TFsSv

Table 3.Learning the ephemeral constants: experiments denition and results. In all
cases the constants and the expressions are found exactly (nal error is <10
14
).
target expression foundconstants found ERTboundsgmax
P1:x
5
x
3
+x cx
3
+x
5
+xc= 3:1415926 19902[1,3]1000
P2:x
5
x
3
+
2
x
cx
3
2c=x+x
5
c=3:1415926 124776[0.1,5]5000
P3:
ex
5
+x
3
x+1
cx
5
+x
3
x+1
c=2:7182818 279400[-0.9,1]5000
P4:sin(x) +
1
x
sin(cx+x) +
1
x
c= 2:1415926 105233[-1,1]5000
P5:ex
5
x
3
+x c1x
3
c2x
5
+xc1=3:1415926 45143[1,3]2000
c2=2:7182818
P6:
ex
2
1
(x+2)
c1+c2x
x+2
c1=0:3183098 78193[2:1;1]10000
c2= 0:86525597
P7:cos(x) + sin(ex)sin(c
2
1c2x+c
2
1x)c1= 1:7724538 210123[-1,1]5000
+ cos(c
2
1x) c2=0:1347440
algorithm and are used to compute the Expected Run Time [1] (ERT), that
is the ratio between the overall number of dCGP evaluations made (fevals)
made across all trials and the number of successful trialsns:ERT=
f evals
ns
. As
shown in Table 3 our approach is able to solve all problems exactly (nal error
is <10
14
) and to represent the real constants with precision.
6.2 Weighted dCGP approach
While the approach we developed above works very well for the selected prob-
lems, it does have a major drawback: the number of ephemeral constants to be
used as additional terminals must be pre-determined. If one were to use too few
ephemeral constants the regression task would fail to nd a zero error, while if
one were to put too many ephemeral constants the complexity would scale up
considerably and the proposed solution strategy would quickly lose its ecacy.
This is particularly clear in the comparison between problems P1 and P5 where
the only dierence is the use of a multiplying factorein front of the quintic term
of the polynomial. Ideally this should not change the learning algorithm. As de-
tailed in Sections 2 and 5, a dCGP gives the possibility to associate weights to
each edge of the acyclic graph (see Figure 2) and then compute the dierential
properties of the error w.r.t. any selection of the weights. This introduces the
idea of performing symbolic regression tasks using a weighted dCGP expression
with no additional terminal inputs but with the additional problem of having to
learn values for all the weights appearing in the represented expression. In par-
ticular we now have to dene the tness (error) of a candidate dCGP expression
as:
= min
w
q(w)
whereware the weights that appear in the output terminal. Similarly to what we
did previously, we need a way to solve this inner minimization problem. Applying
a few Newton steps could be an option, but since the number of weights may
be large we will not follow this idea. Instead, we iteratively select, randomly,nw

Table 4.Learning constants using the weighted dCGP: experiments denition and
results. In all cases the constants and the expressions are found exactly (nal error is
<10
14
)
target expression found constants foundERTboundsgmax
P1:x
5
x
3
+x c1x
5
+c2x
3
+c3x c1= 1:0 106643[1,3]200
c2=3:1415926
c3= 0:9999999
P2:x
5
x
3
+
2
x
c1x
5
+c2x
3
+
c3
c4x
c1= 0:9999999271846[0.1,5]200
c2=3:1415926
c3= 6:2831853
c4= 1:0
P3:
ex
5
+x
3
x+1
c1x
5
+c2x
3
c3x+c4
c1= 4:37468921935500[-0.9,1]200
c2= 1:6093582
c3= 1:6093582
c4= 1:6093582
P4:sin(x) +
1
x
c1sin(c2x) +
c3
c4x
c1= 1:0 135107[-1,1]200
c2= 3:1415926
c3= 1:0
c4= 1:0
P5:ex
5
x
3
+x c1x
5
+c2x
3
+c3x c1= 2:7182818122071[1,3]200
c2=3:1415926
c3= 1:0
P6:
ex
2
1
(x+2)
c1x
2
+c2
c3x+c4
c1= 1:5963630628433[-2.1,1]200
c2=0:5872691
c3= 1:8449604
c4= 3:6899209
P7:cos(x) + sin(ex)c1sin(c2x) +c3cos(c4x)c1= 1:0 243629[-1,1]200
c2= 2:7182818
c3= 1:0
c4= 3:1415926
weights~wactive in the current dCGP expression and we update them applying
one Newton step:
~wi+1=~wi~H
1
(~wi)r~wq(~wi)
where the tilde indicates that not all, but only part of the weights are selected.
If no improvement is found we discard the step. At each iteration we select ran-
domly new weights (no Lamarckian learning). This idea (that we call weight
batch learning), is eective in our case as it avoids computing and inverting
large Hessians, while also having more chances to actually improve the error at
each step without the use of a learning rate for a line search. For each candi-
date expression we performNiterations of weight batch learning starting from
normally distributed initial values for the weights.
ExperimentsWe use the same experimental set-up employed to perform sym-
bolic regression using the ephemeral constants approach (see Table 4). For each

iteration of the weight batch learning we use a randomly selectednw2 f2;3g
and we performN= 100 iterations
3
. The initial weights are drawn from a zero
mean, unit standard deviation normal distribution. By assigning weights to all
the edges, we end up with expressions where every term has its own constant
(i.e., a combination of weights), hence no distinction is made by the learning al-
gorithm between integer and real constants. This gives the method a far greater
generality than the ephemeral constants approach as the number of real con-
stants appearing in the target expression is not used as an information to design
the encoding. It is thus not a surprise that we obtain, overall, higher ERT val-
ues. These higher values mainly come from the Newton steps applied on each
weighted candidate expression and not from the number of required generations
which was, instead, observed to be generally much lower across all experiments.
Also note that for problems P3 and P6 the nal expression found can have
an innite number of correct constants, obtained by multiplying the numerator
and denominator by the same number. Overall, the new method here proposed
to perform symbolic regression on expressions containing constants was able to
consistently solve the problems considered and is also suitable for applications
where the number of real constants to be learned is not known in advance.
7 Solution to Dierential Equations
We show how to use dCGP to search for expressionsS(x1; ::; xn) =S(x) that
solve a generic dierential equation of ordermin the form:
f(@

S;x) = 0;jj m (4)
with some boundary conditionsS(x) =Sx;8x2 B. Note that we made use of
the multi-index notation for high order derivatives described previously. Such
formal representation includes initial value problems for ordinary dierential
equations and boundary value problems for partial dierential equations. While
this representation covers only the case of Dirichelet boundary conditions (values
ofSare specied on a border), the system devised here can be used also for
Neumann boundary conditions (values of@Sare specied on a border). Similarly,
also systems of equations can be studied. AssumeSis encoded by a dCGP
expression: one may easily compute Eq.(4) over a number of control points placed
in some domain, and boundary valuesS(x) may also be computed on some
other control points placed overB. It is thus natural to compute the following
expression and use it as error:
=
X
i
f
2
(@

S;xi) +
X
j
(S(x)Sx)
2
(5)
which is, essentially, the sum of the quadratic error of the violation of the dier-
ential equations plus the quadratic error of the violation on the boundary values.
3
The exact details on all these experiments can be found in the IPython notebook
available here:https://goo.gl/8fOzYM

Symbolic regression was studied already in the past as a new tool to nd solu-
tions to dierential equations by Tsoulos and Lagaris [22] who used grammatical
evolution to successfully nd solutions to a large number of ordinary dieren-
tial equations (ODEs and NLODEs), partial dierential equations (PDE), and
systems of ordinary dierential equations (SODEs). To nd the derivatives of
the encoded expression Tsoulos and Lagaris add additional stacks where basic
rules for dierentiation are applied in a chain. Note that such system (equivalent
to a basic automatic dierentiation system) is not easily extended to the com-
putation of mixed derivatives, necessary for example when Neumann boundary
conditions are present. As a consequence, all of the dierential test problems
introduced in [22] do not involve mixed derivatives. To test the use of dCGP
on this domain, we consider a few of those problems and compare our results
to the ones obtained by Tsoulos and Lagaris. From the paper it is possible to
derive the average number of evaluations that were necessary to nd a solution
to the given problem, by multiplying the population size used and the average
number of generations reported. This number can be then compared to the ERT
[1] obtained in our multi-start experiments.
ExperimentsFor all studied problems we use the error dened by Eq.(5) to
train an unweighted dCGP withr= 1,c= 15,l= 16,a= 2,Ninequal
to the number of input variables andNout= 1. As Kernel functions we use
the same as that used by Tsoulos and Lagaris: +;;; =;sin;cos;log;exp. The
evolution is made using a (1+)-ES evolution strategy with= 10 where the
ithmutant of theospring is created mutatingiactive genes in the dCGP.
We run all our experiment
4
100 times and for a maximum of 2000 generation.
We then record the successful runs (i.e.the runs in which the error is reduced
to10
16
) and compute the expected run-time as the ratio between the
overall number of evaluation of the encoded program and its derivatives (fevals)
made across successful and unsuccessful trials and the number of successful trials
ns:ERT=
f evals
ns
[1]. The results are shown in Table 5. It is clear how, in
all test cases, the dCGP based search is able to nd solutions very eciently,
outperforming the baseline results.
8 Discovery of prime integrals
We now show how to use dCGP to search for expressionsPthat are prime inte-
grals of some set of dierential equations. Consider a set of ordinary dierential
equations (ODEs) in the form:
8
>
<
>
:
dx1
dt
=f1(x1; ; xn)
.
.
.
dxn
dt
=fn(x1; ; xn)
4
The full details on these experiments can be found in an IPython notebook available
here:https://goo.gl/wnCkO9

Table 5.Expected run-time for dierent test cases taken from [22]. The ERT can be
seen as the average number of evaluation of the program output and its derivatives
needed (on average) to reduce the error to zero (i.e.to nd an exact solution)
Problemd-CGPTsoulos [22]
ODE1 8123 130600
ODE2 35482148400
ODE5 22600 88200
NLODE3 896 38200
PDE2 24192 40600
PDE6 327020797000
Solutions to the above equations are denoted withxi(t) to highlight their time
dependence. Under relatively loose conditions on the functionsfi, the solution
to the above equations always exists unique if initial conditionsxi(t0) are speci-
ed. A prime integral for a set of ODEs is a function of its solutions in the form
P(x1(t); ; xn(t)) = 0;8t. Prime integrals, or integral of motions, often express
a physical law such as energy or momentum conservation, but also the conser-
vation of some more \hidden"quantity as its the case in the Kovalevskaya top
[5] or of the Atwood's machine [6]. Each of them allows to decrease the number
of degrees of freedom in a system by one. In general, nding prime integrals is a
very dicult task and it is done by skillful mathematicians using their intuition.
A prime integral is an implicit relation between the variablesxiand, as discussed
by Schmidt and Lipson [20], when symbolic regression is asked to nd such im-
plicit relations, the problem of driving evolution towards non trivial, informative
solutions arises. We thus have to devise an experimental set-up that is able to
avoid such a behaviour. AssumingPto be a prime integral, we dierentiate it
obtaining:
dP
dt
=
X
i
@P
@xi
dxi
dt
=
X
i
@P
@xi
fi= 0
The above relation, and equivalent forms, is essentially what we use to compute
the error of some candidate expression representing a prime integralP. In more
details, we deneNpoints in the phase space and denote them withx
j
i
; j= 1::N.
Thus, we introduce the error function:
=
X
j
X
i

@P
@xi
(x
j
1
; :::; x
j
n)fi(x
j
1
; :::; x
j
n)

2
(6)
Since the above expression is identically zero ifPis, trivially, a constant we
introduce a \mutation suppression" method during evolution. Every time a new
mutant is created, we compute
@P
@xi
;8iand we ignore it if
@P
@xi
= 0;8i, that is
if the symbolic expression is representing a numerical constant. Our method
bears some resemblance to the approach described in [19] where physical laws
are searched for to t some observed system, but it departs in several signi-
cant points: we do not use experimental data, rather the dierential description

of a system, we compute our derivatives using the Taylor polynomial algebra
(i.e.automatic dierentiation) rather than estimating them numerically from
observed data, we use the mutation suppression method to avoid evolving trivial
solutions, we need not to introduce avariable pairing[19] choice and we do not
assume variables as not dependent on all others. All the experiments
5
are made
using a non-weighted dCGP withNout= 1,r= 1,c= 15,lb= 16,a= 2.
We use a (1+)-ES evolution strategy with= 10 where theithmutant
of theospring is created mutatingiactive genes in the dCGP. A set ofN
control points are then sampled uniformly in some bounds. Note how these are
not belonging to any actual trajectory of the system, thus we do not need to
numerically integrate the ODEs. We consider three dierent dynamical systems:
a mass-spring system, a simple pendulum and the two body problem.
Mass-spring systemConsider the following equations:
MSS:

_v=kx
_x=v
describing the motion of a simple, frictionless mass-spring system (MSS). We use
N= 50 and create the points at random as follows:x2[2;4],v2[2;4] andk2
[2;4]. For the error, rather than using directly the form in Eq.(6) our experiments
indicate that a more informative (and yet mathematically equivalent) variant is,
in this case,=
P
j
h
@ P
@ r
@ P
@ v
+
fv
fr
i
2
.
Simple pendulumConsider the following equations:
SP:

_!=
g
L
sin
_
=!
describing the motion of a simple pendulum (SP). We useN= 50 and create
the points at random as follows:2[5;5],!2[5;5] andc=
g
L
2(0;10].
Also in this case, we use a variant for the error expression:=
P
j
h
@ P
@
@ P
@ !
+
f!
f
i
2
.
Two body problem Consider the following equations:
T BP:
8
>
>
<
>
>
:
_v=

r
2+r!
2
_!=2
v!
r
_r=v
_
=!
describing the Newtonian motion of two bodies subject only to their own mu-
tual gravitational interaction (TBP). We useN= 50 random control of points
5
The full details on our experiments can be found in an IPython notebook available
here:https://goo.gl/ATrQR5

Table 6.Results of the search for prime integrals in the mass-spring system (MSS),
the simple pendulum (SP), and two body problem (TBP). Some prime integrals found
are also reported in both the original and simplied form. The Expected Run Time
(ERT),i.e.the number of evaluations of the dCGP expression that is, on average,
required to nd a prime integral, is also reported.
MSS: Energy Conservation (ERT=50000)
Expression found Simplied
k((xx)k) +k+ (vv) k
2
+kx
2
+k+v
2
(xx) + (v=k)v x
2
+v
2
=k
(k((xx)=v)=(k(xx)=v) +v)=(xx) k=(kx
2
+v
2
)
(v+x)((v+x)x)x(((v+x)x)(xk))kx
2
+v
2
SP: Energy Conservation (ERT=114000)
Expression found Simplied
((((!!)=c)cos())cos())=c 2 cos()=c+!
2
=c
2
(!=(c+c))(!(cos()=(!=(c+c)))) cos() +!
2
=c
((!!)((c+c)cos())) 2ccos() +!
2
(((c+c)cos())(!!))sin((=)) 2ccos()!
2
sin(1)
TBP: Angular momentum Conservation (ERT=5270)
Expression found Simplied
(((=r)=(r=))=!)
2
=(!r
2
)
((((rr)!) +)=) 1 +!r
2
=
((!)((rr)))
2
!r
2
((=(+))((r!)r)) !r
2
+ 1=2
TBP: Energy Conservation (ERT=6144450)
Expression found Simplied
((vr)(v+r))(((r) + (r))=r((r(r!))
(r(r!))))
2=r+!
2
r
2
2!r
2
+v
2
+2
((=r)((vv)(=r)))((r!)(r!)) 2=r!
2
r
2
v
2
(((!r)(r(!r))) + ((=r)((vv)(=r))))2=r!
2
r
2
+!r
2
v
2
(((r!)!)r)(((=r) + (=r))(vv)) 2=r+!
2
r
2
+v
2
sampled uniformly as follows:r2[0:1;1:1],v2[2;4],!2[1;2] and2[2;4]
and2[1;2] (note that these conditions allow for both elliptical and hyperbolic
motion). The unmodied form for the error (Eq.(6)) is used at rst and leads
to the identication of a rst prime integral (the angular momentum conserva-
tion). Since the evolution quickly and consistently converges to that expression,
the problem arises on how to nd possibly dierent ones. Changing the error
expression to=
P
j
h
@ P
@ r
@ P
@ v
+
fv
fr
+
@ P
@
@ P
@ v
f
fr
+
@ P
@ !
@ P
@ v
f!
fr
i
2
forces evolution away from
the angular conservation prime integral and thus allows for other expressions to
be found.
ExperimentsFor each of the above problems we run 100 independent experi-
ments letting the dCGP expression evolve up to a maximum of 2000 generations
(brought up to 100000 for the most dicult case of the second prime integral in

the two-body problem). We record the successful runs and the generation num-
ber to then evaluate the expected run time (ERT) [1]. The results are shown in
Table 8 where we also report some of the expressions found and their simplied
form. In all cases the algorithm is able to nd all prime integrals with the no-
table case of the two-body problem energy conservation being very demanding
in terms of the ERT. In this case we note how the angular momentum!r
2
is
often present in the nal expression found as it there acts as a constant. The
systematic search for prime integrals is successful in these cases and makes use
of no mathematical insight into the system studied, introducing a new computer
assisted procedure that may help in future studies of dynamical systems.
9 Conclusions
We have introduced a novel machine learning framework called dierentiable
genetic programming, which makes use of a high order automatic dierentiation
system to compute any-order derivatives of the program outputs and errors.
We test the use of our framework on three distinct open problems in symbolic
regression: learning constants, solving dierential equations and searching for
physical laws. In all cases, we nd our model able to successfully solve selected
problems and, when applicable, to outperform previous approaches. Of particular
interest is the novel solution proposed to the long debated problem of constant
nding in genetic programming, here proved to allow to nd the exact solution
in a number of interesting cases. Our work is a rst step towards the systematic
use of dierential information in learning algorithms for genetic programming.
References
1.
lutionary algorithm. In: 2005 IEEE congress on evolutionary computation. vol. 2,
pp. 1777{1784. IEEE (2005)
2.
Press (1999)
3.
tectures. In: Proceedings of the 37th International Symposium on Symbolic and
Algebraic Computation. pp. 83{90. ISSAC '12, ACM, New York, NY, USA (2012),
http://doi.acm.org/10.1145/2442829.2442845
4.

Certk, O.: piranha: 0.2 (Jun 2016),https://doi.org/
10.5281/zenodo.56369
5.
tions of integrable systems. Regular and Chaotic Dynamics 6(1), 1{16 (2001)
6.
and dynamics. Journal de Physique 51(16), 1693{1702 (1990)
7.
gression and numerical constant creation. In: Proceedings of the 10th annual con-
ference on Genetic and evolutionary computation. pp. 1195{1202. ACM (2008)
8.
arXiv:1410.5401 (2014)

9.
Barwinska, A., Colmenarejo, S.G., Grefenstette, E., Ramalho, T., Agapiou, J.,
et al.: Hybrid computing using a neural network with dynamic external memory.
Nature 538(7626), 471{476 (2016)
10.
evolutionary computation, pp. 75{102. Springer (2006)
11. https://doi.org/10.5281/zenodo.
164627
12. https://doi.org/10.5281/
zenodo.164628
13.
works using cartesian genetic programming. Neurocomputing 121, 274{289 (2013)
14.
Palo Alto, CA (1997)
15.
Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated
Equipment 558(1), 346{350 (2006)
16.
pp. 17{34. Springer (2011)
17.
gramming. Genetic Programming and Evolvable Machines 11(3-4), 339{363 (2010)
18.
Valles, C., Ortega, G.: Dierential algebra space toolbox for nonlinear uncertainty
propagation in space dynamics. In: Proceedings of the 6th International Conference
on Astrodynamics Tools and Techniques (ICATT), Darmstadt (2016)
19.
science 324(5923), 81{85 (2009)
20.
Programming Theory and Practice VII, pp. 73{85. Springer (2010)
21.
search of numeric leaf values. In: Proceedings of the Genetic and Evolutionary
Computation Conference (GECCO-2001). vol. 155162 (2001)
22.
ming. Genetic Programming and Evolvable Machines 7(1), 33{54 (2006)
23.
networks: a comparison using three benchmarks. In: Proceedings of the 15th annual
conference on Genetic and evolutionary computation. pp. 1005{1012. ACM (2013)