Help, my paths aren't smooth

For a little over three years at this point, I’ve been working on a project called DataSig. This is a combined effort between several groups at different universities and institutions in the UK, aimed at making the tools and techniques of rough path theory available to data scientists. So what on earth is a rough path? It has taken me a very long time to understand what exactly a rough path is, and how rough paths might be useful to data scientists. I’d go further: I’ve well and truly drunk the rough path Kool-Aid, and I’m now a firm believer that the methodology this theory provides is generally the “correct” way to deal with certain kinds of data. (We shall explain exactly what I mean by this later.)

To explain where rough paths come from, we must first dive in and talk about differential equations. Differential equations are a means of mathematically describing the way that a certain value (or values) evolves (typically over time or space) given some kind of environment. There are various different types of differential equations, which I shan’t describe in detail in order to keep this post reasonably short. The specific kind of differential equation I want to talk about is the so-called “controlled differential equation”. For these differential equations, we model the response of a system to a control path - hence the name. To put this a different way, a controlled differential equation models the evolution of a signal with respect to some incoming sequence of events (in the form of a path). Of course, the control path might be rather simple - it could just be time $X_t = t$, in which case the equation is usually called an ordinary differential equation - or it could be something very complicated. But what does complicated mean here, and at what point does a path become too complicated for us to be able to effectively solve the equation (numerically)? (Warning, some maths content ahead!)

This is where the aforementioned rough path enters the picture. Usually, one would insist on some degree of “smoothness” - the existence of a given number of continuous derivatives - in order to make available the maths that is necessary for solving the equation. To understand why this is the case, let’s first imagine that our path, denoted $X$, is continuous and has a continuous first derivative. Let’s look at the controlled differential equation written

$$ dY_t = f(Y_t) dX_t $$

where $f$ is a suitably smooth function defined on the image of $Y$ and $Y$ is the response trajectory. This notation might be slightly unusual for readers who are familiar with ordinary differential equations, but it simply means that there is a constant $Y_0$ such that

$$ Y_s = Y_0 + \int_0^s f(Y_t) dX_t $$

for all $s$ in the desired range. A standard Picard iteration argument provides a unique solution for such equations, provided that the integral can be evaluated. In our case, since $X$ is differentiable, we can rewrite the integral as

$$ \int_0^s f(Y_t) dX_t = \int_0^s f(Y_t)X_t^\prime dt $$

which is just an ordinary integral in $t$, and so can be evaluated (abstractly, at least).
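
To make this concrete, here is a minimal numerical sketch (in Python with NumPy) of how one might step such an equation forward using only the increments of the control path. The scheme, the scalar toy choice $f(y) = y$, and the function name are my own illustration, not anything from a particular library.

```python
import numpy as np

def controlled_euler(f, x_samples, y0):
    """Toy Euler scheme for dY = f(Y) dX, driven by samples of the control X.

    Scalar version for illustration: each step uses the increment of the
    control path rather than the time step, so the response only "sees" X
    through its changes.
    """
    y = float(y0)
    trajectory = [y]
    for k in range(len(x_samples) - 1):
        dx = x_samples[k + 1] - x_samples[k]  # increment of the control path
        y = y + f(y) * dx
        trajectory.append(y)
    return np.array(trajectory)

# With the control X_t = t this is just the ordinary ODE dY = Y dt,
# whose exact solution at t = 1 is e.
ts = np.linspace(0.0, 1.0, 1001)
ys = controlled_euler(lambda y: y, ts, y0=1.0)
print(ys[-1], np.exp(1.0))  # roughly 2.7169 vs 2.7183
```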

This is all fine, and everything I’ve described so far is nothing beyond undergraduate mathematics. However, this technique completely relies on the fact that $X$ was a “smooth” path, so we could replace integration against $X$ by an integral against $t$. What if, instead, $X$ is not a smooth function, or indeed is not regular enough for the integral on the right hand side of the equation to even make sense using the theory we know? Well, we might look at this problem, decide that it is intractable, and try to simplify it. Alternatively, we can relax the condition on $X$ by instead requiring more of the integrand $f$. (We can borrow some regularity from $f$ to supplement the regularity of $X$.) This generalisation is called the Young integral (in its greatest generality). Provided that the control path $X$ is of finite $p$-variation for some $1 \leq p < 2$, and that $f$ is sufficiently regular, we can solve the controlled differential equation. (I shall spare the details here.)

The motivation for generalising in this manner is to accommodate random processes such as Brownian motion, whose paths do not typically have bounded variation. Thus the classical theory breaks down for these paths, although - as we know from Itô - it is possible to make sense of differential equations driven by Brownian motion. Indeed, such a differential equation is usually called a stochastic differential equation.

Sounds great, but what does this have to do with data science?

So far, I’ve talked about an abstract problem that has very little - at least on the surface - to do with data science. However, as we shall see shortly, this is not actually the case. It is all about framing the problem correctly. A substantial amount of the data that we collect regularly is time-dependent, streaming data. We record a sequence of events and try to anticipate what effect these events will have on a larger system. For instance, we might record the temperature in different places in a building and use these data to understand the environment within the whole building. From this, we can make decisions about whether the building needs heating or cooling. Here it matters whether the general movement of the temperature in the space is increasing or decreasing over time. We don’t want to heat a place that is cold if the temperature is already increasing.

Of course, this is a very simple example, but it illustrates the scenario quite neatly, especially if there are complex interdependencies between different temperatures. If we take for a moment that the stream of data recorded by the sensors consists of samples taken from a (rough) path, then we can start to model the temperature throughout the space as a controlled differential equation. Now we can use the tools of controlled differential equations (and their rough path friends) to reason about the evolution. The principal tool in our arsenal is the signature.

The signature of a path is an abstract description of the evolution of the path over a given interval of time. It is actually the solution to a very simple controlled differential equation, and is somewhat akin to the exponential function from classical differential equation theory. The signature of a path $X$, denoted $S_{0, t}$, is the solution of the equation

$$ dS_{0, t} = S_{0, t}\mathop{} dX_t $$

with the initial condition $S_{0, 0} = \mathbf{1}$. It turns out that the signature plays the same role as the exponential in that all solutions to controlled differential equations are built out of the signature.

This symbolic equation tells us what the signature is mathematically, but it does nothing to help us actually work with it. For that, we need to figure out how to construct it. The signature lives in the so-called tensor algebra, which is the space containing all of the tensor powers of the space in which the path $X$ takes its values. We can flatten this sequence of tensors into a single sequence of numbers - much like we do when representing a matrix in computer memory - to obtain something much more readily used in computation. This sequence of values is divided into blocks, coming from each of the tensor powers, that are each indexed by a sequence of integers. (I promise this is going somewhere, please bear with me.)

Let’s suppose that our path $X$ lives in $d$-dimensional space. Then the tensor algebra consists of sequences of numbers as follows:

$$ (a^{()}, a^{(1)}, a^{(2)}, \dots, a^{(d)}, a^{(1,1)}, a^{(1,2)}, \dots, a^{(d, d)}, a^{(1,1,1)}, a^{(1,1,2)}, \dots) $$
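
As a small illustration, here is a sketch of how one might enumerate those index blocks for a path in $d$ dimensions; the function name and the choice $d = 2$ are just for demonstration.

```python
from itertools import product

def tensor_word_keys(d, depth):
    """List the index tuples labelling a flattened truncated tensor algebra.

    Degree 0 contributes the empty tuple (), and degree k contributes d**k tuples.
    """
    keys = []
    for k in range(depth + 1):
        keys.extend(product(range(1, d + 1), repeat=k))
    return keys

print(tensor_word_keys(d=2, depth=2))
# [(), (1,), (2,), (1, 1), (1, 2), (2, 1), (2, 2)]
```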

For the signature of a path, each of these constituent numbers is given by an iterated integral. If we write $X^{(j)}$ for the $j$th component path (taking scalar values), then the signature terms are given as follows:

$$ S_{0, t}^{(i_1,i_2,\dots,i_k)} = \int_{0 < s_1 < s_2 < \dots < s_k < t} dX_{s_1}^{(i_1)}dX_{s_2}^{(i_2)}\dots dX_{s_k}^{(i_k)} $$

where the $i_j$ are taken from the integers $1, 2,\dots, d$. When $k$ is zero, $S_{0, t}^{()}$ is taken to be $1$. (The first term of the signature is always $1$.)
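
To see these iterated integrals in action, here is a rough numerical check on the toy smooth path $X_t = (t, t^2)$ over $[0, 1]$, where the exact values are $S^{(1,2)} = 2/3$ and $S^{(2,1)} = 1/3$. The Riemann-sum approximation below is only a sketch of my own, chosen for readability rather than accuracy.

```python
import numpy as np

# Toy smooth path X_t = (t, t^2) sampled on a fine grid of [0, 1].
ts = np.linspace(0.0, 1.0, 2001)
x1, x2 = ts, ts**2
dx1, dx2 = np.diff(x1), np.diff(x2)

# S^{(1,2)} is approximated by summing dX^1_{s_1} dX^2_{s_2} over s_1 < s_2.
s_12 = np.sum(np.cumsum(dx1)[:-1] * dx2[1:])
s_21 = np.sum(np.cumsum(dx2)[:-1] * dx1[1:])

print(s_12, s_21)  # roughly 2/3 and 1/3
# The symmetrised sum recovers the product of the level-1 increments.
print(s_12 + s_21, dx1.sum() * dx2.sum())  # both approximately 1
```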

The signature of a path uniquely determines the path, up to something called tree-like equivalence. Roughly speaking, this means that a translated version of the path has the same signature as the original path, but there are other cases like this. The signature itself captures changes in the path, and the orders in which these occur, rather than the values of the path; of course, these are somewhat related. We can use this as a kind of feature map to capture the history of a path when we’re training models for inference - making predictions of the future or for times where no observations are recorded.

The signature itself can actually be computed directly from the increments of the data. (Here increment means the change in the value of the path over the interval in question.) If we take the changes in value $\Delta^{(1)},\dots,\Delta^{(d)}$, suitably embedded into the tensor algebra as the sequence

$$ \mathbf{\Delta} = (0, \Delta^{(1)},\Delta^{(2)}, \dots, \Delta^{(d)}, 0, \dots) $$

and take the tensor exponential, defined using the familiar power series for $\exp$, then we obtain a sequence whose values are approximately the same as the integrals shown above. (They are exactly equal if the path is linear over the interval.) Usually a path is formed of lots of small increments (i.e. the change in value when a reading of a sensor occurs), in which case we can get a better estimate by taking the tensor product of these exponentials, one for each increment.
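
Here is a small self-contained sketch of that recipe using dense NumPy arrays: a truncated tensor exponential, a truncated tensor product, and then the signature of a piecewise-linear path as the product of the exponentials of its increments (Chen’s identity). The representation and the function names are my own; this is not the RoughPy API.

```python
import numpy as np

def tensor_exp(increment, depth):
    """Truncated tensor exponential of a level-1 increment.

    Returns a list [level-0 scalar, level-1 vector, level-2 matrix, ...];
    level k holds the k-th tensor power of the increment divided by k!.
    """
    levels = [np.ones(()), np.asarray(increment, dtype=float)]
    for k in range(2, depth + 1):
        levels.append(np.tensordot(levels[-1], increment, axes=0) / k)
    return levels[: depth + 1]

def tensor_mul(a, b, depth):
    """Truncated tensor product of two elements given as lists of levels."""
    d = len(a[1])
    out = []
    for n in range(depth + 1):
        term = np.zeros((d,) * n)
        for k in range(n + 1):
            term = term + np.tensordot(a[k], b[n - k], axes=0)
        out.append(term)
    return out

def signature(increments, depth):
    """Signature of a piecewise-linear path via Chen's identity: the signature
    of a concatenation is the tensor product of the pieces' signatures, and
    each linear piece has signature exp(increment)."""
    d = len(increments[0])
    sig = tensor_exp(np.zeros(d), depth)  # the identity element (1, 0, 0, ...)
    for inc in increments:
        sig = tensor_mul(sig, tensor_exp(inc, depth), depth)
    return sig

# Two unit steps in 2D: first in direction 1, then in direction 2.
incs = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
sig = signature(incs, depth=2)
print(sig[1])  # [1. 1.] -- the total displacement
print(sig[2])  # [[0.5, 1.0], [0.0, 0.5]] -- the (1,2) slot is 1, the (2,1) slot is 0
```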

Of course, the signature itself is an infinite sequence, which is not very practical. So the usual procedure is to truncate the signature to contain only those terms whose indices have at most a given length (the $k$ value above). The truncated signature is not a full description of the behaviour of the path, but over short enough intervals, and for paths that are not too rough, the approximation is good enough to be useful.

At this point, you might be wondering why such a representation is better than considering the values sampled from the path. The first reason is the size. The signature, truncated to a particular level $N$, always has the same number of terms:

$$ \sum_{i=0}^N d^i = \frac{d^{N+1} - 1}{d - 1} \qquad (\text{when } d > 1) $$

Regardless of how many increments are used to calculate the signature over a given interval, the number of terms is always the same. This is advantageous for many machine learning models that require the number of features to be known in advance.
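
As a quick sanity check on that formula, here is a tiny sketch (the function name is mine):

```python
def truncated_sig_size(d, depth):
    """Number of terms in the signature of a d-dimensional path truncated at the given depth."""
    return sum(d ** k for k in range(depth + 1))

print(truncated_sig_size(2, 5))   # 63
print(truncated_sig_size(26, 2))  # 703 -- the truncated word signatures later in the post
```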

The second reason is that the signature captures the order of events, but is not necessarily sensitive to the speed at which the events are recorded. This means that it can handle irregularly sampled data much better than other techniques (like, for instance, the fast Fourier transform). In practice, it might be possible for a sophisticated model to infer the order of events from a timestamp, but the truth of the matter is that such techniques will yield models that are less easy to understand, and are not guaranteed to have solid mathematical foundations.

This is all abstract nonsense, can you give me an example?

Yes! Let me borrow an idea from the RoughPy documentation and construct a path from the letters in a word. Let me explain. If we take the letters of the English alphabet and consider each letter to be a different “direction” of a space - i.e. each letter is a basis vector - then we obtain a 26-dimensional space, and words consisting of these letters form a path in this space. To make things a little more clear, let’s write out a short word as an example. Consider the word “the”. As a parameterised sequence of letters, this word is a mapping taking $0$ to ‘t’, $1$ to ‘h’, and $2$ to ‘e’. (A path is a mapping from parameters taking its values in some space.)

To write this out in vector form, we can map ‘a’ to the vector whose first component is 1 and whose remaining components are zero. Following the obvious pattern, we can write out the word “the” as the following mapping from parameter to vector:

t: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
h: [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
e: [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

Some people might call this representation of letters as vectors the one-hot encoding of the alphabet. In this form, each vector represents the change that occurs at each “time” (0, 1, or 2), which is exactly what we need to compute the signature.
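
A sketch of that encoding in Python (the helper name is mine, and I assume lowercase ASCII words):

```python
import numpy as np

def word_to_increments(word):
    """One-hot increments for the letters of a lowercase ASCII word.

    Each letter becomes a 26-dimensional basis vector, which is the change
    in the path at that step.
    """
    increments = np.zeros((len(word), 26))
    for k, letter in enumerate(word):
        increments[k, ord(letter) - ord("a")] = 1.0
    return increments

incs = word_to_increments("the")
print(np.flatnonzero(incs[0]))  # [19] -- 't'
print(np.flatnonzero(incs[1]))  # [7]  -- 'h'
print(np.flatnonzero(incs[2]))  # [4]  -- 'e'
```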

Now let’s think about the space in which our signature is going to live. The path takes its values in the 26-dimensional real space $V = \Bbb R^{26}$, so the signature of this path will live in the tensor algebra over $V$, usually denoted $T(V)$. Actually, we’re going to truncate at level 2, so our tensor algebra will actually be

$$ T^{(2)}(V) = \Bbb R \oplus \Bbb R^{26} \oplus (\Bbb R^{26} \otimes \Bbb R^{26}). $$

This is a 703-dimensional space, so I’m only going to write out the non-zero components of any signatures that we compute. The first thing to do is to embed each of our letter increments into the tensor algebra. The first letter is ‘t’, whose vector has a 1 in the 19th position (counting from 0), corresponding to the tensor letter “20”, and zeros elsewhere, so we represent its embedding in the tensor algebra by $\mathbf t = \{ 1(20) \}$. This notation does not mean a “set”, but instead a vector whose only non-zero element is at the position corresponding to the length 1 tensor index (20), where it has coefficient 1. Similarly, we can embed ‘h’ as $\mathbf h = \{ 1(8) \}$ and ‘e’ as $\mathbf e = \{ 1(5) \}$. The signature of our path is now given by the product of the tensor exponentials of these three letter increments:

$$ S_{0, 3} = \exp(\mathbf t) \otimes \exp(\mathbf h) \otimes \exp(\mathbf e) $$

Since the first element (the coefficient of $()$) is zero in all three of these letters, we can actually compute the exponentials exactly up to any truncation level using the truncated power series:

$$ \exp(\mathbf a) = \mathbf 1 + \sum_{i=1}^N \frac{1}{i!}\mathbf{a}^i $$

In this very simple scenario, these are not too hard to work out to get $\exp(\mathbf{t}) = \{ 1()\ 1(20)\ 1/2(20,20) \}$, $\exp(\mathbf{h}) = \{ 1()\ 1(8)\ 1/2(8,8) \}$, and $\exp(\mathbf{e}) = \{ 1()\ 1(5)\ 1/2(5,5) \}$. Now we take their tensor product to get the signature:

$$ S_{0, 3} = \{ 1()\ 1(5)\ 1(8)\ 1(20)\ 1/2(5,5)\ 1(8, 5)\ 1/2(8, 8)\ 1(20,5)\ 1(20,8)\ 1/2(20,20)\ \} $$
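
As a sanity check, here is a short self-contained sketch that reproduces these non-zero terms with dense NumPy arrays, using the closed forms for the level-2 truncated exponential and product. The helper names are mine, and this is not how RoughPy itself represents things.

```python
import numpy as np

def letter_exp(index):
    """Level-2 truncated tensor exponential of a one-hot letter increment."""
    v = np.zeros(26)
    v[index] = 1.0
    return 1.0, v, 0.5 * np.outer(v, v)

def mul2(x, y):
    """Tensor product of two level-2 truncated tensors (scalar, level 1, level 2)."""
    xs, x1, x2 = x
    ys, y1, y2 = y
    return xs * ys, xs * y1 + ys * x1, xs * y2 + np.outer(x1, y1) + ys * x2

sig = (1.0, np.zeros(26), np.zeros((26, 26)))
for letter in "the":
    sig = mul2(sig, letter_exp(ord(letter) - ord("a")))

scalar, level1, level2 = sig
# Non-zero level-1 terms: letter counts (1-based indices to match the post).
print({i + 1: float(c) for i, c in enumerate(level1) if c != 0})
# {5: 1.0, 8: 1.0, 20: 1.0}
# Non-zero level-2 terms, e.g. (20, 8) -> 1.0 means one 't' before an 'h'.
print({(i + 1, j + 1): float(level2[i, j]) for i, j in zip(*np.nonzero(level2))})
# {(5, 5): 0.5, (8, 5): 1.0, (8, 8): 0.5, (20, 5): 1.0, (20, 8): 1.0, (20, 20): 0.5}
```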

We can notice a few things about the signature. The first is that the terms of degree 1 (where there is only a single number in the brackets) are precisely counts of the letters that appear in the path. This means that words which have the same first level signature terms (ignoring higher order terms) are anagrams of one another. The second is that the level two terms, those with two numbers in brackets such as $1(8,5)$, indicate the order in which certain pairs of letters occur; in this instance, there is one instance of an ‘h’ appearing before an ‘e’. In fact, from the standard Linux dictionary (at least as it appears on my system), there are precisely two pairs of words that have the same signature up to level 2: “anna” and “naan”, and “otto” and “toot”. These are words that are not only anagrams but also have all pairs of letters occurring in the same order. If we compute the signature to level 3, we find that these words are distinguished by their third level terms. The full example is given in the RoughPy documentation.
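
For anyone who wants to repeat the dictionary experiment, here is a sketch of the idea: group words by their letter counts (the level 1 terms) together with the counts of ordered letter pairs (the off-diagonal level 2 terms; the diagonal ones are determined by the letter counts). The word-list path is an assumption about my system, and the exact matches you find will depend on your dictionary.

```python
from collections import Counter, defaultdict

def level2_key(word):
    """Hashable summary of the level-1 and level-2 signature terms of a word.

    Level 1 is the letter counts; the off-diagonal level-2 term (i, j) counts
    occurrences of letter i strictly before letter j.
    """
    counts = Counter(word)
    pairs = Counter()
    seen = Counter()
    for letter in word:
        for earlier, n in seen.items():
            if earlier != letter:
                pairs[(earlier, letter)] += n
        seen[letter] += 1
    return frozenset(counts.items()), frozenset(pairs.items())

# The path to the word list is an assumption; adjust it for your system.
groups = defaultdict(set)
with open("/usr/share/dict/words") as fh:
    for line in fh:
        word = line.strip().lower()
        if word.isalpha():
            groups[level2_key(word)].add(word)

for words in groups.values():
    if len(words) > 1:
        print(sorted(words))  # e.g. ['anna', 'naan'] and ['otto', 'toot']
```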

Are there any practical applications of this?

Playing with toy examples like the words above is all well and good, but they are far from practical. The truth, and partly the point of the previous section, is that streaming data appears everywhere. It’s in our temperature data, medical records, radio frequency data, sound and speech, and even in the fabric of our language itself. The DataSig project was started to explore the applications of signatures after they were successfully used in Chinese character recognition. The point is that most data occurs over time, and is not static; most data is streaming. Now the problem with many techniques in data science is that they are not designed to work with data that appears over time, with a few exceptions. The temporal dimension can be incorporated into the data itself, but it is rather tricky to train models that properly account for this fact. I’m not saying that using signatures and rough path theory will completely solve these problems, but at least it is thinking about the problem in the correct way.

Recently there has been a lot of hype about large language models (LLMs) like GPT. These are somewhat revolutionary, but not because they can construct long streams of plausible-sounding text after training on vast quantities of human-written text. The reason that LLMs are interesting is the combination of techniques that are hidden behind these models, primarily transformers. A transformer is a stream-to-stream mapping that usually takes a hugely complicated and very high dimensional stream to a much smaller dimensional stream. This makes the computations involved in training models much more tractable. However, this comes at a cost. The transformer is not a well-understood mapping. It is the product of a brute-force approach to a complex problem, obtained by throwing most of the internet’s text into the mixing pot along with many thousands of GPU hours (the true quantity is unknown).

LLMs are impressive, and they are a nice “proof of concept”, but this iteration is neither sustainable nor practical. Training these models takes huge quantities of data and compute power, and even using them is not a walk in the park. Imagine if we could implement a large language model using an approach that makes full use of the structure of the stream of words, properly backed by mathematics rather than brute force and data. I’m not saying this will be better, or even work, but it would be a step in the right direction.

Stepping back to reality for a moment, we can look at one place where rough path techniques have been very successful: detecting when a stream is anomalous compared to a corpus of “normal” streams. This might seem like a simple problem, but consider that “normal” streams might look quite different superficially. Events might occur at different rates, and on different time scales, in normal operating behaviour, but an anomalous stream might have events that occur out of order compared to normal (for example).

I already mentioned that signatures were used for real-time Chinese character handwriting recognition. This is also a fruitful area for the DataSig group. We have several examples of “action recognition” from landmark data, including a fully worked example notebook on our website. We’ve also dabbled in full-body human action recognition using landmark data, which has been moderately successful.

Summary

Rough path methods are a mathematical approach to looking at streaming data. We’ve got a large set of examples from across the spectrum showing just how useful they are, and this set only grows. That being said, they are not a panacea; they won’t solve all the problems with streamed data. They only work with path-like data - a parameterised sequence of events - and not easily with data like images. At the moment, high dimensional data also causes computational problems. Computing the signature of a 1920x1080-dimensional stream composed of the frames of a video is not going to go well. At least not yet.

One of the reasons the DataSig project exists is to explore and document the uses of rough paths in data science, and to communicate our work to others. There is a significant barrier to entry for people outside our group who may wish to experiment with these techniques. At the moment, our package RoughPy (see also the project on GitHub) still isn’t quite finished, and doesn’t yet have good interoperability with other libraries. The truth is that, while rough path based methods have great potential, they still need a lot of work before they’re ready to compete with the ease and availability of more “traditional” methods, especially since these players have access to all the compute resources and data that they could ever need.