Neural networks have seen much success in computer vision over the last decade. Only more recently has there been an increase in interest in using them for tasks that have traditionally been in the purview of computational scientists and engineers. The one task I’d like to discuss today is approximating solutions of PDEs. To my knowledge, the first use of neural nets for this task was by Dissanayake and Phan-Thien in [1]. They proposed learning solutions by minimizing the interior and boundary residuals associated with the PDE in a least-squares sense.
As computational power and optimization techniques have improved, neural networks have only grown in popularity. For solving the forward PDE problem, most works in the literature today are similar to the approach in [1]. It is also worth noting that neural networks offer clear advantages for high-dimensional PDEs and for problems in which one would like to incorporate measured data alongside a physics-based model; see, for instance, physics-informed neural networks.
However, in practice there are a number of issues that arise when using neural networks to approximate PDEs. The most glaring is that while neural networks are universal approximators, their associated objective functions are highly nonconvex and one often observes that the approximation error stagnates after a certain point irrespective of how many layers or neurons are added to the network. This is in sharp contrast with traditional methods such as finite elements which have theoretical guarantees on the error that are realized in practice.
Our goal when developing Galerkin Neural Networks was to have an approach that allows for rigorous control of the approximation error. Leveraging techniques from functional analysis, we train a sequence of shallow networks which learn Riesz representers of weak residuals of the PDE. These Riesz representers form a finite-dimensional subspace from which to approximate the PDE. The result is a method that I believe stands apart quite distinctly from other approaches in the literature. The full analysis and results may be found in my paper [2] (co-authored with my advisor Mark Ainsworth), but this post serves as an accessible introduction with accompanying code.
2. Feedforward Neural Networks
Mathematically, a feedforward neural network is best described as a linear combination of parameterized nonlinear functions. To be precise:
Definition 1. Given an input $x \in \mathbb{R}^d$, the feedforward neural network with $L$ hidden layers and widths $n_1, \dots, n_L$ is described by the sequence of linear transformations
$$T_k(y) = W_k\, y + b_k, \qquad k = 1, \dots, L,$$
and the function $u_{NN} : \mathbb{R}^d \to \mathbb{R}$ given by
$$u_{NN}(x; \theta) = c \cdot (\sigma \circ T_L \circ \sigma \circ T_{L-1} \circ \cdots \circ \sigma \circ T_1)(x),$$
where $W_1 \in \mathbb{R}^{n_1 \times d}$, $W_k \in \mathbb{R}^{n_k \times n_{k-1}}$ for $k = 2, \dots, L$, $b_k \in \mathbb{R}^{n_k}$, and $c \in \mathbb{R}^{n_L}$.
In the standard nomenclature, $W_k$ and $b_k$ are the weights and biases, respectively, of the $k$th hidden layer; $c$ are the linear or activation coefficients; and $\sigma$ is the activation function. Often $\theta$ is used to denote the collection of all the network parameters $\{W_k, b_k\}_{k=1}^{L}$ and $c$.
Figure 1: Visualization of a feedforward neural network with 4 hidden layers and 8 neurons per layer. The lines between nodes (neurons) represent the weights of the linear transformation connecting one layer to the next.
As an example, for $L = 1$ we can write $u_{NN}$ as
$$u_{NN}(x; \theta) = \sum_{i=1}^{n_1} c_i\, \sigma(W_{1,i} \cdot x + b_{1,i}),$$
which allows us to more easily see that this is really just a linear combination of nonlinear functions. Some common choices of $\sigma$ are the ReLU function $\sigma(x) = \max(0, x)$, the hyperbolic tangent function, and the sigmoid function $\sigma(x) = 1/(1 + e^{-x})$. For our work, we only use the hyperbolic tangent function.
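To make this concrete, here is a minimal sketch of a single-hidden-layer tanh network in JAX (the framework used by the accompanying code); the function and parameter names are illustrative rather than taken from the GalerkinNN repository.

```python
import jax.numpy as jnp
from jax import random

def init_shallow_params(key, width, in_dim=1):
    # Hidden weights W, biases b, and linear coefficients c (names are illustrative).
    kW, kb, kc = random.split(key, 3)
    return {
        "W": random.normal(kW, (in_dim, width)),
        "b": random.normal(kb, (width,)),
        "c": random.normal(kc, (width,)),
    }

def shallow_net(params, x):
    # u_NN(x) = sum_i c_i * tanh(x . W_i + b_i) for inputs x of shape (N, in_dim).
    hidden = jnp.tanh(x @ params["W"] + params["b"])  # shape (N, width)
    return hidden @ params["c"]                        # shape (N,)
```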
The goal of a neural network is to determine, or “learn”, the parameters $\theta$ such that $u_{NN}(\cdot\,;\theta)$ is a good approximation to some given target function $f$. For instance, suppose we simply want to fit a feedforward neural network to $f$. A common approach is to compute
$$\theta^{*} = \arg\min_{\theta} \sum_{j} \big| u_{NN}(x_j; \theta) - f(x_j) \big|^2$$
over a set of sample points $\{x_j\}$ in order to learn the optimal parameters $\theta^{*}$. A problem such as this is typically approximated using a first-order method, such as stochastic gradient descent, or a second-order method, such as BFGS. We’ll discuss some particular optimization methods for training neural networks later in this post.
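As a rough sketch of what this looks like in practice (continuing the JAX example above, with plain gradient descent standing in for the fancier optimizers just mentioned):

```python
from jax import grad, jit

def mse_loss(params, x, f_vals):
    # Discrete least-squares objective: mean of (u_NN(x_j) - f(x_j))^2.
    return jnp.mean((shallow_net(params, x) - f_vals) ** 2)

@jit
def gradient_step(params, x, f_vals, lr=1e-3):
    g = grad(mse_loss)(params, x, f_vals)
    return {k: p - lr * g[k] for k, p in params.items()}

# Example usage: fit the network to f(x) = sin(2*pi*x) on [0, 1].
key = random.PRNGKey(0)
params = init_shallow_params(key, width=20)
x = jnp.linspace(0.0, 1.0, 200).reshape(-1, 1)
f_vals = jnp.sin(2.0 * jnp.pi * x).squeeze()
for _ in range(10_000):
    params = gradient_step(params, x, f_vals)
```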
Let’s return to the setting of [1]. Consider a boundary value problem of the form $\mathcal{D}u = f$ in a domain $\Omega$ and $\mathcal{B}u = g$ on its boundary $\partial\Omega$, where $\mathcal{D}$ and $\mathcal{B}$ are (interior and boundary) differential operators. They propose the minimization problem
$$\min_{\theta} \; \sum_{j} \big| \mathcal{D}u_{NN}(x_j; \theta) - f(x_j) \big|^2 + \sum_{k} \big| \mathcal{B}u_{NN}(x_k; \theta) - g(x_k) \big|^2,$$
in which one seeks to minimize both the interior and boundary residuals evaluated at a discrete set of points in the domain and on its boundary. A similar approach is followed in physics-informed neural networks.
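For a concrete (and deliberately simple) instance, here is what such a collocation loss looks like in JAX for the model problem $-u'' = f$ on $(0,1)$ with $u(0) = u(1) = 0$; the problem and helper names are my own illustration rather than the general operators treated in [1].

```python
from jax import grad, vmap

def u_scalar(params, x):
    # Evaluate the shallow network from the earlier sketch at a single scalar point x.
    return shallow_net(params, jnp.reshape(x, (1, 1)))[0]

# Second derivative u''(x) at a batch of points, via two applications of jax.grad.
u_xx = vmap(grad(grad(u_scalar, argnums=1), argnums=1), in_axes=(None, 0))

def collocation_loss(params, x_interior, f):
    # Interior residual -u'' - f at collocation points, plus the boundary residuals.
    interior_residual = -u_xx(params, x_interior) - f(x_interior)
    boundary_residual = jnp.stack([u_scalar(params, 0.0), u_scalar(params, 1.0)])
    return jnp.mean(interior_residual ** 2) + jnp.sum(boundary_residual ** 2)
```

Minimizing `collocation_loss` with a gradient-based optimizer (as in the fitting sketch above) is the pointwise least-squares approach described here.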
There are a few things worth noting before we continue.
The optimization problem only makes sense if the residuals can be evaluated pointwise, i.e. if the data $f$ and $g$ have well-defined values at the collocation points. There are many instances where this is not the case, for example when $f$ is the Dirac delta distribution. In such cases, a pointwise interpretation of the residuals does not make sense.
Despite neural networks having excellent approximation power in theory, their approximation error often stagnates after a certain point, and increasing network width and/or depth does not typically solve this issue. This stagnation can be due to stalling in local minima, which has been written about in the context of function approximation here, but in general one can’t specify a target approximation error in advance (say, requiring the error to fall below some prescribed tolerance) and expect the network to be able to achieve it. This is in contrast to methods such as finite elements where, provided a stable discretization is chosen, machine precision error can be reached even if getting there is computationally infeasible.
Example 1. To demonstrate the second point above, we attempt to learn the solution of a differential equation with a homogeneous Dirichlet boundary condition (i.e. $u = 0$ on the boundary). The true solution is prescribed,
and the data $f$ is chosen so that the equation is satisfied by this true solution. The loss history over each epoch is provided in Figure 2 alongside a table of parameters for the implementation.
Figure 2: Loss history for example 1 (top). Parameters for the neural network and training (bottom).
We can repeat this example with deeper or wider networks and find that, while such architectures might initially yield an increase in accuracy, they eventually won’t decrease the approximation error further. In fact, haphazardly using very deep networks can yield worse results due to difficulty in training.
Before I talk about Galerkin neural networks and how they attempt to skirt around this issue, we’ll need to (briefly) cover variational formulations and weak solutions of PDEs.
3. Overview of Variational Formulations for PDEs
We’ll start out with a motivating example here. Let $H$ be a general Hilbert space (we’ll be more concrete in a second, but $H$ is just a space of functions with some restrictions on their behavior). Writing the equation from Example 1 in the form $-u'' = f$ on $(0, 1)$ (the concrete form consistent with the derivation that follows), let’s multiply it by a function $v \in H$ and integrate both sides over the interval:
$$\int_0^1 -u''(x)\, v(x)\, dx = \int_0^1 f(x)\, v(x)\, dx.$$
The integral on the left-hand side can be integrated by parts to obtain
$$\int_0^1 u'(x)\, v'(x)\, dx - \big[ u'(x)\, v(x) \big]_{x=0}^{x=1} = \int_0^1 f(x)\, v(x)\, dx.$$
The Hilbert space $H$ is typically chosen according to the highest order of derivatives appearing in this identity as well as the boundary conditions for the equation. In this case, we’ll work with the Sobolev space $H_0^1(0,1)$:
the space consisting of functions whose derivatives up to first order are square integrable and whose value on the boundary of the domain is zero. We note that if $v \in H_0^1(0,1)$, then the boundary terms above vanish and we’re left with
$$\int_0^1 u'(x)\, v'(x)\, dx = \int_0^1 f(x)\, v(x)\, dx.$$
The variational or weak formulation of Example 1 is to seek a solution $u \in H_0^1(0,1)$ such that this identity holds for every test function $v \in H_0^1(0,1)$.
If you haven’t taken a course on PDEs before, this formulation will probably be quite new to you, so I want to point out a few key observations.
Any $u$ which satisfies the original equation together with the boundary conditions will automatically satisfy the weak formulation. This is clear by following the derivation steps above. So what does this mean? If the data $f$ is smooth enough, it means that the solution of the original equation and the solution of the weak formulation coincide.
The weak formulation only involves first derivatives of $u$ and $v$, while our original equation involved second derivatives. Since requiring the existence of second derivatives is a stronger condition, we can surmise that solutions of the weak formulation are “weaker” than solutions of the original equation, which is where the term weak solution comes from. In most cases, when the data is smooth enough (and, in two dimensions and higher, when the domain is regular enough), this weakening of the required smoothness of the solution is irrelevant. However, there are genuine cases that occur in practice, especially in solid and fluid mechanics, where solutions contain discontinuities, boundary layers, and sharp gradients that reduce regularity, and for which weak formulations are necessary. We’ll touch more on this later.
Rather than requiring an equation to be satisfied at every point in the domain, the weak formulation only considers averaged quantities. That is, as long as the integrals in the weak formulation are finite, its solution is well-defined. This is in contrast to the original equation (whose solutions we’ll refer to as classical solutions), which requires $f$ to be defined pointwise everywhere in the domain in order for a solution to make sense.
Even when it’s not strictly necessary to use a weak/variational formulation like , numerical methods such as finite elements work with the variational formulation since its approximation is usually equivalent to solving a sparse linear system of equations. As well, there are a number of mathematical tools that allow for proving nice error estimates when working with the weak formulation.
I. Approximating Weak Solutions
With all that out of the way, how do we actually solve a variational equation like the one above? To generalize things a bit more nicely, let’s consider an abstract variational problem on a Hilbert space $H$: find $u \in H$ such that
$$a(u, v) = L(v) \quad \text{for all } v \in H.$$
Here, $a(\cdot,\cdot)$ is a bilinear operator on $H \times H$ and $L$ is a linear operator on $H$. In the previous example, we would have $a(u, v) = \int_0^1 u' v' \, dx$ and $L(v) = \int_0^1 f v \, dx$. Typically, this problem can’t be solved exactly and we’ll have to seek an approximate solution.
Definition (Galerkin method). Given a finite-dimensional subspace $V_N \subset H$, the Galerkin method seeks an approximate solution $u_N \in V_N$ which is the orthogonal projection (with respect to the bilinear operator $a$) of the true solution $u$ onto $V_N$. This is equivalent to seeking a solution $u_N \in V_N$ such that
$$a(u_N, v) = L(v) \quad \text{for all } v \in V_N.$$
What does the Galerkin method actually entail in practice? If $V_N$ is a finite-dimensional subspace of $H$, then it is spanned by a (finite) set of basis functions which we’ll denote by $\varphi_1, \dots, \varphi_N$. In other words, we can write $V_N = \operatorname{span}\{\varphi_1, \dots, \varphi_N\}$. This means we can write $u_N$ as $u_N = \sum_{j=1}^{N} c_j \varphi_j$ for some coefficients $c_j$, i.e. $u_N$ is simply a linear combination of the basis functions of $V_N$, and solving the Galerkin problem is equivalent to finding the linear coefficients $c_j$ so that $u_N$ satisfies it.
In order for $u_N$ to solve the Galerkin problem, we must have $a(u_N, \varphi_i) = L(\varphi_i)$ for each basis function $\varphi_i$. Since there are finitely many basis functions in $V_N$, this is then equivalent to a system of $N$ equations in the $N$ unknown coefficients.
To solve for the coefficients, we note that $a$ is linear in its first argument, so the system can be written as the linear system
$$\sum_{j=1}^{N} a(\varphi_j, \varphi_i)\, c_j = L(\varphi_i), \quad i = 1, \dots, N, \qquad \text{i.e.} \qquad A\mathbf{c} = \mathbf{F} \;\;\text{with}\;\; A_{ij} = a(\varphi_j, \varphi_i), \; F_i = L(\varphi_i).$$
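As a small numerical sketch (mine, not code from [2]), here is how this system can be assembled and solved for the model problem $a(u, v) = \int_0^1 u' v'\,dx$, $L(v) = \int_0^1 f v\,dx$, given callables for the basis functions and their derivatives and a quadrature rule on $(0, 1)$; the boundary conditions are assumed to be built into the basis functions themselves.

```python
import jax.numpy as jnp

def galerkin_solve(basis, basis_dx, f, quad_x, quad_w):
    # basis, basis_dx: lists of callables phi_j(x), phi_j'(x) acting on arrays of points.
    # quad_x, quad_w: quadrature nodes and weights on (0, 1).
    N = len(basis)
    A = jnp.array([[jnp.sum(quad_w * basis_dx[j](quad_x) * basis_dx[i](quad_x))
                    for j in range(N)] for i in range(N)])
    F = jnp.array([jnp.sum(quad_w * f(quad_x) * basis[i](quad_x)) for i in range(N)])
    c = jnp.linalg.solve(A, F)
    return c  # u_N(x) = sum_j c_j * basis[j](x)
```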
Notably, the Galerkin method yields a solution which is the “best approximation” to from the space , a fact which was proved by Jean Céa in his 1964 dissertation.
Theorem (Céa’s Lemma). Let $H$ be a Hilbert space with norm $\|\cdot\|_H$, and let $a(\cdot,\cdot)$ be a bilinear operator which is both continuous and coercive on $H$, i.e.
There is a constant $C > 0$ such that $|a(u, v)| \le C \|u\|_H \|v\|_H$ for all $u, v \in H$ (continuity);
There is a constant $\alpha > 0$ such that $a(v, v) \ge \alpha \|v\|_H^2$ for all $v \in H$ (coercivity).
Then the Galerkin approximation $u_N \in V_N$ satisfies the estimate
$$\|u - u_N\|_H \;\le\; \frac{C}{\alpha} \inf_{v \in V_N} \|u - v\|_H.$$
The continuity and coercivity conditions relate to the well-posedness of the variational problem; they are sufficient conditions for it to have a unique solution. Céa’s Lemma is useful because it allows us to leverage approximation properties of the subspace $V_N$ to derive error estimates for the PDE approximation $u_N$. What exactly do we mean by this? Let’s demonstrate with an example.
In finite element methods, one takes $V_N$ to be a space of continuous piecewise polynomial functions. In one dimension, the basis for continuous piecewise linears is the set of hat functions. The approximation error of any continuous function by such functions may be quantified by the size of the support of each hat function (i.e. the so-called mesh size), which in turn tells us exactly how fast the error decreases with respect to the number of degrees of freedom in the problem. For example, let’s consider the domain $(0, 1)$ subdivided into equal subintervals of width $h$. Given a sufficiently smooth function $u$ and its interpolant $\mathcal{I}_h u$ onto the space of continuous piecewise linear functions over our subintervals ($\mathcal{I}_h$ is an interpolation operator here which takes such functions and interpolates them to continuous piecewise linear functions on subintervals of width $h$; the precise definition is not important for this discussion), we have the well-known error estimate
$$\|u - \mathcal{I}_h u\|_{H^1} \;\le\; C\, h\, \|u''\|_{L^2}.$$
The right-hand side of this estimate tells us that if we reduce the width of our subintervals by a factor of, say, 2 (i.e. we use more piecewise linears to approximate $u$), we should expect the error to also reduce by a factor of 2. Estimates like this also tell us that, if we use the same space of continuous piecewise linears to approximate the solution of the variational problem, we should expect the approximation error to converge in the same manner thanks to Céa’s Lemma:
$$\|u - u_N\|_{H^1} \;\le\; \frac{C}{\alpha} \inf_{v \in V_N} \|u - v\|_{H^1} \;\le\; \frac{C}{\alpha} \|u - \mathcal{I}_h u\|_{H^1} \;\le\; \tilde{C}\, h\, \|u''\|_{L^2}.$$
Note that we get from the first inequality to the second by picking $v = \mathcal{I}_h u$, the interpolant of our true solution onto the space of continuous piecewise linears. The infimum of $\|u - v\|_{H^1}$ over all choices of $v \in V_N$ is always bounded from above by its value at any specific choice of $v$!
With Galerkin neural networks, which we’re now finally in a position to introduce, we will utilize the Galerkin method and Céa’s Lemma to develop a framework for deep learning of PDEs which retains many of the nice approximation properties of the Galerkin method and is naturally adaptive to the PDE under consideration.
4. Galerkin Neural Network Framework
I’m sorry for how long the introductory sections of this post were, but I figured that if you’re a mathematician interested in learning more about Galerkin neural networks, you either already read through the published paper [2] or are able to read through it without much difficulty. This blog post is really geared towards CS folks and others who might not be familiar with a lot of the mathematical machinery presented thus far and are looking for a more informal introduction than is presented in [2].
Main Idea. Galerkin neural networks seeks to approximate the solution of the variational problem by projecting it onto a finite-dimensional subspace whose basis functions are outputs of a sequence of shallow neural networks. That is, we approximate $u$ with a linear combination $\sum_i c_i \varphi_i^{NN}$ of learned basis functions.
With this idea in mind, it all boils down to how to generate the basis functions $\varphi_i^{NN}$, since once we have the basis functions in hand, we simply solve the Galerkin linear system from the previous section to obtain the coefficients $c_i$.
Figure 3 provides a visualization of the overall idea. You’ll notice I’ve labeled the basis function coming from the smallest network as the “coarse scale” basis function while the one coming from the widest network is labeled a “fine scale” basis function. This has to do with how we can interpret the basis functions and I’ll discuss this more once we’ve actually covered how to generate the basis functions.
Figure 3: Visualization of the Galerkin neural network framework.
And the way that we’re going to generate these basis functions has to do with learning Riesz representations of something called the weak residual. Let’s suppose we start off with some initial approximation $u_0$ to the true solution $u$. I want to emphasize here that $u_0$ can be any element of $H$. In our paper, we always made the choice $u_0 = 0$ in the examples, which is just to say that we don’t have to have any prior knowledge of the solution at all to pick $u_0$.
Definition. Given $u_0 \in H$, the weak residual of the variational problem with respect to $u_0$ is the linear functional $r(u_0) : H \to \mathbb{R}$ given by
$$\langle r(u_0), v \rangle \;=\; L(v) - a(u_0, v), \qquad v \in H.$$
The notation $\langle r(u_0), v \rangle$ denotes the evaluation of the functional $r(u_0)$ at $v$. This quantity will be zero for every $v$ by definition if $u_0$ is the true solution. Otherwise, it is just a nonzero number.
I. Riesz Representation of Weak Residuals
We note that $L(v) = a(u, v)$ by the variational problem, so we can rewrite the weak residual as
$$\langle r(u_0), v \rangle \;=\; a(u, v) - a(u_0, v) \;=\; a(u - u_0, v).$$
Rewriting it in this way allows us to make some observations on $r(u_0)$.
Theorem (Riesz representation). Let $H$ be a Hilbert space with norm $\|\cdot\|_H$ and inner product $(\cdot,\cdot)_H$, and let $L$ be a bounded linear functional on $H$. There exists a unique element $w \in H$ such that
$$L(v) = (w, v)_H$$
for all $v \in H$. Moreover, we have $\|L\|_{H'} = \|w\|_H$, where $\|\cdot\|_{H'}$ denotes the operator norm.
The Riesz representation theorem says that in a Hilbert space, linear functionals can be “identified” with elements of the Hilbert space itself. Namely, if we want to evaluate a linear functional $L$ at some argument $v$, this is equivalent to taking the inner product of $v$ with some fixed element $w \in H$. This kernel $w$ is known as the Riesz representation of $L$.
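A quick worked example of my own (not from the original post): in $L^2(0,1)$ with the usual inner product, the functional $L(v) = \int_0^1 f v \, dx$ is represented by $f$ itself, whereas in $H_0^1(0,1)$ with inner product $(w, v) \mapsto \int_0^1 w' v' \, dx$, the same functional is represented by the solution of a Poisson problem:
$$L(v) = \int_0^1 f\, v \, dx = \int_0^1 w'\, v' \, dx \quad \text{for all } v \in H_0^1(0,1), \qquad \text{where } -w'' = f, \;\; w(0) = w(1) = 0.$$
The representer depends on the inner product, which is exactly the observation exploited next.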
Returning to the weak residual: the rewrite $\langle r(u_0), v \rangle = a(u - u_0, v)$ suggests that its Riesz representation is the error $u - u_0$, at least when we take the inner product on $H$ to be defined by $a(\cdot,\cdot)$ itself. We can define the norm and inner product on $H$ in this way if $a$ is symmetric and positive-definite, so for now we’re going to assume that $a$ is symmetric positive-definite. The norm that’s induced by the operator $a$ has a special name: the energy norm. It’s standard to use triple bars to denote this norm: $|\!|\!| v |\!|\!| = \sqrt{a(v, v)}$.
Thus, we can explicitly say what the Riesz representation of the weak residual is, but who cares? Why is this important? Let’s introduce the variable $\varphi$ to denote the normalized Riesz representation of $r(u_0)$, i.e. $\varphi = (u - u_0)/|\!|\!| u - u_0 |\!|\!|$. Now suppose we could compute this $\varphi$. Then we could easily take a linear combination of $u_0$ and $\varphi$ and recover the true solution: just take $u = u_0 + |\!|\!| u - u_0 |\!|\!|\, \varphi$.
But typically, we won’t be able to compute $\varphi$ exactly. This is akin to solving the variational problem exactly, which can only be done in very simple cases. Otherwise, we’ll have to approximate $\varphi$, and the key to doing this is actually a very simple property of the Riesz representation of the weak residual.
Proposition 1. Let $\varphi$ denote the normalized Riesz representation of $r(u_0)$. Then $\varphi$ satisfies
$$\varphi \;=\; \arg\max_{v \in H, \; |\!|\!| v |\!|\!| = 1} \langle r(u_0), v \rangle.$$
Proposition 1 says that the normalized Riesz representation is actually the maximizer of the weak residual over all elements of $H$ with unit energy norm. In theory, all we need to do to compute $\varphi$ is solve the maximization problem above. We already know we can’t solve this maximization problem over all of $H$ (which is infinite-dimensional). Instead, what if we replace $H$ with the set of all functions that can be represented by a particular neural network architecture?
II. Learning a Basis Function
Let’s define the set $V_n^{\sigma}$ to be the set of all functions which can be realized by a feedforward neural network with a single hidden layer of $n$ neurons and activation function $\sigma$:
$$V_n^{\sigma} \;=\; \Big\{ v(x) = \sum_{i=1}^{n} c_i\, \sigma(W_i \cdot x + b_i) \; : \; W_i \in \mathbb{R}^d, \; b_i, c_i \in \mathbb{R} \Big\}.$$
And let’s replace $H$ in the maximization problem with $V_n^{\sigma}$. We’ll call the maximizer of this problem $\varphi_1^{NN}$:
$$\varphi_1^{NN} \;=\; \arg\max_{v \in V_n^{\sigma}, \; |\!|\!| v |\!|\!| = 1} \langle r(u_0), v \rangle.$$
The maximization problem is now tractable. In fact, we can approximate it using the typical optimizers we would use for other deep learning problems. Before we go and do this though, we would do well to quantify how far off $\varphi_1^{NN}$ is from the true Riesz representation $\varphi$.
Theorem 1.Given , there exists and defined by such that
The proof is somewhat technical and provided in [2]. It piggybacks off of standard universal approximation results such as those of Cybenko, Hornik, and Leshno. As such, Theorem 1 doesn’t quantify how the error converges with respect to the degrees of freedom in the network (such as in ).
As an aside, it’s worth noting that there have been recent efforts to quantify both the minimum width and depth of a neural network necessary to reach a target accuracy, and how the approximation error in neural networks varies with respect to the width and depth of the network a la . Some examples of this are the excellent work of De Ryck, Lanthaler, and Mishra; Yarotsky; and Opschoor, Petersen, and Schwab. Such results can be integrated with Theorem 1 when appropriate to obtain explicit error bounds for our method.
Once we have computed $\varphi_1^{NN}$ (see §4.IV for implementation details), we can use it to “correct” the initial approximation $u_0$. Let’s form the two-dimensional subspace $S_1 = \operatorname{span}\{u_0, \varphi_1^{NN}\}$ and use Galerkin’s method to project $u$ onto $S_1$: seek $u_1 \in S_1$ such that $a(u_1, v) = L(v)$ for all $v \in S_1$.
This is equivalent to solving a small (here $2 \times 2$) linear system to obtain $u_1$. We can derive an error estimate for $u_1$ based on Theorem 1.
Theorem 2. Given and as defined in , there holds
This result says that if we supplement the initial approximation $u_0$ with the approximate Riesz representation $\varphi_1^{NN}$, we should expect to reduce the error in the initial approximation by a fixed factor.
Remark. Theorems 1 and 2 only quantify the approximation error accrued when using the set $V_n^{\sigma}$ as a stand-in for the infinite-dimensional space $H$. Error in the optimization method used to compute $\varphi_1^{NN}$ is another significant source of error. Eventual stagnation of these methods is ubiquitous in deep learning. Galerkin neural networks addresses this issue by conceding that we won’t be able to use a single large DNN to capture all features of the solution. Rather, we sequentially compute basis functions as corrections to the error accrued in the previous steps in order to improve accuracy incrementally.
III. Adaptive Subspace Generation
We now have an improved approximation $u_1$, but what if this still isn’t good enough? We can naturally iterate the procedure thus far to generate more basis functions. This is done by replacing $r(u_0)$ with $r(u_1)$ and computing the maximizer of the corresponding problem. To generalize, the $i$th basis function is generated by computing
$$\varphi_i^{NN} \;=\; \arg\max_{v \in V_{n_i}^{\sigma}, \; |\!|\!| v |\!|\!| = 1} \langle r(u_{i-1}), v \rangle.$$
We’re using $n_i$ here to denote the width of the neural network for $\varphi_i^{NN}$. The $i$th approximation is computed by forming the subspace $S_i = \operatorname{span}\{u_0, \varphi_1^{NN}, \dots, \varphi_i^{NN}\}$ and solving
$$u_i \in S_i : \quad a(u_i, v) = L(v) \quad \text{for all } v \in S_i.$$
Corollary 1. Given , , defined by , and defined by , there holds
and
Each basis function $\varphi_i^{NN}$ approximates the Riesz representation of the $i$th weak residual, which shouldn’t be too surprising. The approximation $u_i$ is exponentially convergent with respect to the number of basis functions, and in fact each basis function added to the subspace reduces the previous error by a fixed factor.
We could go on generating these basis functions indefinitely. For problems where we have access to the true solution, we could compute exact errors of our approximation to determine when to stop adding basis functions to the subspace. For most realistic problems though, we won’t have access to such metrics and will need to estimate the error on the fly. Conveniently, Galerkin neural networks has a very nice way to do this.
We already know that the Riesz representation is the maximizer of the weak residual over functions with unit energy norm. More importantly, the maximum value of the weak residual is actually the energy error:
$$\max_{v \in H, \; |\!|\!| v |\!|\!| = 1} \langle r(u_{i-1}), v \rangle \;=\; |\!|\!| u - u_{i-1} |\!|\!|.$$
Thus, when replacing $H$ with $V_{n_i}^{\sigma}$, it is reasonable to expect $\langle r(u_{i-1}), \varphi_i^{NN} \rangle \approx |\!|\!| u - u_{i-1} |\!|\!|$. We can quantify just how well the weak residual approximates the energy norm of the error with the following result.
Theorem 3. Given , there holds
This says that if we evaluate the residual at the computed basis function , it should be a reasonable a posteriori estimate of the energy error, which is typically a physical quantity of interest!
In effect, this means we can choose a desired tolerance $\mathrm{tol}$ and use $\langle r(u_{i-1}), \varphi_i^{NN} \rangle < \mathrm{tol}$ as a stopping criterion for determining when to stop generating new basis functions.
i. An Illustrative Example
Before discussing the finer details of implementation and showcasing some broader examples, I want to provide a straightforward one-dimensional example which demonstrates all of the key features of Galerkin neural networks. Let’s consider the Poisson equation:
The parameter $\varepsilon$ is designed to approximate a homogeneous Dirichlet boundary condition: as $\varepsilon \to 0$, the solution of the above BVP converges to the solution with $u = 0$ on the boundary. The reason for having this parameter is that weak formulations of this problem typically encode the boundary condition in the Hilbert space $H$ (e.g. earlier we worked in $H_0^1$), but in general the Galerkin neural network basis functions won’t be in $H_0^1$, only in $H^1$.
The variational formulation for this problem is then posed over $H^1$ rather than $H_0^1$.
We’ll take a right-hand side $f$ for which the true solution is known in closed form.
I’ll show first what the Riesz representations and basis functions look like. The different parameters used in this example are provided in Table 2.
Table 2: Parameters for approximating 1D Poisson equation using Galerkin neural networks. GL = Gauss-Legendre quadrature.
Figure 4: True Riesz representation \varphi_{i} (blue lines) and basis function approximation \varphi_{i}^{NN} (orange lines).
One striking observation is that the first couple of Riesz representations consist of mainly low frequency oscillations while the later Riesz representations consist of very high frequency oscillations, and generally speaking each Riesz representation is mono-frequency. Thus, Galerkin neural networks essentially decomposes the problem into a series of subproblems, each of which encodes different frequencies of the solution. We also observe that each neural network is able to very closely approximate the true Riesz representation (granted, this is a very simple problem).
Utilizing all of these basis functions to approximate the solution results in a highly accurate approximation, as seen in Figure 5 (the residual notation in the figure is the one we used in [2] for the weak residual). As predicted by Theorem 3, the weak residual almost perfectly approximates the energy error. Similarly, we can actually take the residual as an approximation to the error, which is pretty accurate as well. It’s also clear that with each basis function added to the subspace, we reduce the approximation error by a fixed amount (Corollary 1).
Figure 5: Approximation errors for the 1D Poisson equation.
IV. Implementation Details
Before moving on to some more examples, I’ll touch of some implementation details in this section. The first order of business is how to actually compute . Rather than seeking a normalized maximizer, we divide through by the energy norm:
Next, we note that the weak residual consists of several inner products (integrals) which must be evaluated. We will approximate these integrals with standard Gaussian quadrature rules. To be concrete, let’s consider the 1D Poisson example from the previous section. A quadrature rule consists of a set of nodes $\{x_q\}$ and weights $\{w_q\}$ which are used to approximate an integral as a weighted sum of function evaluations at the nodes:
$$\int_0^1 g(x)\, dx \;\approx\; \sum_{q=1}^{Q} w_q\, g(x_q).$$
For the 1D Poisson example, the weak residual is approximated by replacing each integral appearing in $\langle r(u_{i-1}), v \rangle$ and in the energy norm with such a quadrature sum; a sketch of this is given in the code below.
For all examples here, we use a Gauss-Legendre quadrature rule in which the nodes and weights are obtained from Legendre polynomials.
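Here is a minimal JAX sketch of these quadrature-based evaluations. One caveat: the concrete bilinear form used below, $a(u, v) = \int_0^1 u' v' \, dx + \varepsilon^{-1}\big(u(0)v(0) + u(1)v(1)\big)$ with $L(v) = \int_0^1 f v \, dx$, is my assumed boundary-penalty version of the $\varepsilon$-formulation described above; consult [2] for the exact formulation used in the paper. The wrappers reuse `shallow_net` from the sketch in §2.

```python
import numpy as np
import jax.numpy as jnp
from jax import grad, vmap

# Gauss-Legendre rule with 64 nodes, mapped from (-1, 1) to (0, 1).
q, w = np.polynomial.legendre.leggauss(64)
quad_x = jnp.asarray(0.5 * (q + 1.0))
quad_w = jnp.asarray(0.5 * w)
bdry_x = jnp.array([0.0, 1.0])

def as_callables(params):
    # Wrap the shallow network and its spatial derivative as maps over 1D point arrays.
    v_scalar = lambda x: shallow_net(params, jnp.reshape(x, (1, 1)))[0]
    return vmap(v_scalar), vmap(grad(v_scalar))

def weak_residual(v, v_dx, u0, u0_dx, f, eps=1e-3):
    # <r(u0), v> = L(v) - a(u0, v), with all integrals replaced by quadrature sums.
    a_u0_v = jnp.sum(quad_w * u0_dx(quad_x) * v_dx(quad_x)) \
             + (1.0 / eps) * jnp.sum(u0(bdry_x) * v(bdry_x))
    L_v = jnp.sum(quad_w * f(quad_x) * v(quad_x))
    return L_v - a_u0_v

def energy_norm(v, v_dx, eps=1e-3):
    # |||v||| = sqrt(a(v, v)) under the same (assumed) bilinear form.
    return jnp.sqrt(jnp.sum(quad_w * v_dx(quad_x) ** 2)
                    + (1.0 / eps) * jnp.sum(v(bdry_x) ** 2))
```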
To perform the numerical optimization, we use a hybrid first-order/least squares optimization approach. This approach considers the weights of the activation layer (i.e. the linear coefficients $c$) to be expansion coefficients of the nonlinear basis $\{\sigma(W_i \cdot x + b_i)\}_{i=1}^{n}$. It’s similar in concept to nonlinear least squares methods such as Levenberg-Marquardt, which obtain the linear coefficients by solving a linear least squares system.
In [2] we proposed the following training procedure. We update the hidden parameters $W$ and $b$ using a gradient-based optimizer such as gradient descent.
The linear parameters $c$ are then updated according to the least squares solve
$$\sum_{j=1}^{n} a(\sigma_j, \sigma_k)\, c_j \;=\; \langle r(u_{i-1}), \sigma_k \rangle, \quad k = 1, \dots, n, \qquad \text{where } \sigma_j(x) := \sigma(W_j \cdot x + b_j).$$
This linear system is equivalent to taking the orthogonal projection, with respect to $a(\cdot,\cdot)$, of the Riesz representation onto the space $\operatorname{span}\{\sigma_1, \dots, \sigma_n\}$.
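Continuing the 1D sketch above, here is one way such a hybrid update can look in JAX. This is a hedged illustration rather than the training loop from [2] or the GalerkinNN repository: the least-squares step projects onto the span of the current neuron outputs exactly as just described, while the gradient step simply differentiates the normalized residual objective through that solve (whether to hold the coefficients fixed during the gradient step is a design choice I am not taking from the paper).

```python
from jax import grad

def neuron_outputs(hidden, x):
    # sigma_j(x) = tanh(x * W_j + b_j) for each neuron j; x has shape (M,), output (M, n).
    # hidden = {"W": (1, n) array, "b": (n,) array}, as in the earlier 1D sketch.
    return jnp.tanh(x[:, None] * hidden["W"][0] + hidden["b"])

def neuron_outputs_dx(hidden, x):
    # d/dx of each neuron output.
    W = hidden["W"][0]
    return (1.0 - jnp.tanh(x[:, None] * W + hidden["b"]) ** 2) * W

def least_squares_coefficients(hidden, u0, u0_dx, f, eps=1e-3):
    # Solve sum_j a(sigma_j, sigma_k) c_j = <r(u0), sigma_k> for the linear coefficients c.
    S, Sx = neuron_outputs(hidden, quad_x), neuron_outputs_dx(hidden, quad_x)
    Sb = neuron_outputs(hidden, bdry_x)
    M = Sx.T @ (quad_w[:, None] * Sx) + (1.0 / eps) * Sb.T @ Sb
    rhs = S.T @ (quad_w * f(quad_x)) - Sx.T @ (quad_w * u0_dx(quad_x)) \
          - (1.0 / eps) * Sb.T @ u0(bdry_x)
    return jnp.linalg.solve(M + 1e-10 * jnp.eye(M.shape[0]), rhs)  # tiny ridge for safety

def objective(hidden, u0, u0_dx, f, eps=1e-3):
    # Negative normalized weak residual of v = sum_j c_j sigma_j (we minimize this).
    c = least_squares_coefficients(hidden, u0, u0_dx, f, eps)
    v = lambda x: neuron_outputs(hidden, x) @ c
    v_dx = lambda x: neuron_outputs_dx(hidden, x) @ c
    return -weak_residual(v, v_dx, u0, u0_dx, f, eps) / energy_norm(v, v_dx, eps)

def hybrid_step(hidden, u0, u0_dx, f, lr=1e-3):
    # Gradient step on the hidden parameters W, b; c is recomputed by least squares.
    g = grad(objective)(hidden, u0, u0_dx, f)
    return {k: p - lr * g[k] for k, p in hidden.items()}
```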
Lastly, there is the matter of how to initialize the parameters $W$, $b$, and $c$. In computer vision applications, it’s common to draw the values of $W$ and $b$ from a probability distribution (often uniform or Gaussian, with mean and variance depending in some way on the width of the layer). These initializers are sometimes not optimal for function and PDE approximation. In [3], the authors propose a new initializer (the “box initialization”) which distributes the hyperplanes of the network more evenly throughout the computational domain. While this initialization was developed for very deep ResNets, it is sufficient for our very shallow networks.
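I won’t reproduce the exact box initialization of [3] here, but to convey the underlying idea (hyperplanes spread across the domain rather than clustered near the origin), here is a simplified initializer of my own that anchors each hyperplane $x \cdot W_i + b_i = 0$ at a random point of the unit cube; it is an illustration of the idea, not the initializer from [3].

```python
def init_hyperplane_params(key, width, in_dim=1):
    # Simplified domain-aware initialization (NOT the box initialization of [3]).
    kW, kp = random.split(key)
    W = random.normal(kW, (in_dim, width))
    W = W / jnp.linalg.norm(W, axis=0, keepdims=True)   # unit normal direction per neuron
    p = random.uniform(kp, (in_dim, width))              # anchor points in [0, 1]^d
    b = -jnp.sum(W * p, axis=0)                          # so that p_i . W_i + b_i = 0
    # Linear coefficients start at zero; they are set by the least squares solve anyway.
    return {"W": W, "b": b, "c": jnp.zeros(width)}
```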
Figure 6 (from [3]): Hyperplanes x\cdot W_{i} + b_{i} for the box initialization vs. several standard initializers in the unit square.
A full implementation (using Google’s Jax framework) for some simple 1D and 2D problems is provided on my GitHub. The rest of this post will be dedicated to showcasing more numerical examples.
5. Numerical Examples
I. Approximation of a Nonsmooth Function
The next example we’ll look at is a two-dimensional function approximation problem. Suppose we have target data $f$ and wish to fit this function with a neural network. The variational formulation for this problem is
$$\int_{\Omega} u\, v \, dx \;=\; \int_{\Omega} f\, v \, dx \quad \text{for all } v \in L^2(\Omega),$$
which is essentially just an $L^2$ projection. The data for this example will be space-time data describing the volume fraction of spherical particles in a suspension flow. Figure 6 below shows the target function that we’re going to approximate with Galerkin neural networks.
Figure 6: Target function f.
This data was generated using something called the force-coupling method with data provided by Amanda Howard. The application of Galerkin neural networks to this problem was carried out during my time as an intern at Pacific Northwest National Laboratory. The precise details of the force-coupling method aren’t important for demonstrating the main ideas of Galerkin neural networks.
The different parameters used in this simulation are provided in Table 3.
Table 3: Parameters for approximating f using Galerkin neural networks. GL = Gauss-Legendre quadrature.
I’ll first show the exact Riesz representations for the first two residuals ($\varphi_1$ and $\varphi_2$) as well as their neural network approximations ($\varphi_1^{NN}$ and $\varphi_2^{NN}$).
Figure 7:\varphi_{1} and \varphi_{2} (top row) and \varphi_{1}^{NN} and \varphi_{2}^{NN} (bottom row).
What we observe is that the first basis function is a very coarse approximation to the true Riesz representation. It captures the low frequency behavior and “sees” that the Riesz representation is roughly symmetric about the centerline of the domain, but otherwise misses all of the fine-scale behavior. The same is true for $\varphi_2^{NN}$, which essentially looks like a smoothed-out version of $\varphi_2$, albeit containing higher frequencies than $\varphi_1^{NN}$.
Let’s take a look at the next couple of Riesz representations and their approximations.
Figure 8:\varphi_{3} and \varphi_{4} (top row) and \varphi_{3}^{NN} and \varphi_{4}^{NN} (bottom row).
We observe that $\varphi_3^{NN}$ and $\varphi_4^{NN}$ encode higher frequency behavior of the error. In all of these basis functions, however, it’s clear that we never approximate the true Riesz representations perfectly. Indeed, the first couple of basis functions don’t look very good at all. Each basis function is only meant to reduce particular frequencies in the error, much like in the first example.
When we take a linear combination of these basis functions, we can see more clearly the effect each one has on the overall approximation.
Figure 9:u_{1}, u_{3}, and u_{7} (top row) and corresponding one-dimensional slices along \gamma=20 (bottom row).
A common question I get asked during conference talks (and more recently, job talks) is what happens if we only use the widest neural network in our sequence to approximate the solution. The width of the network for $\varphi_i^{NN}$ in this example is $n_i = 20 \cdot 2^{i-1}$, so $u_7$ utilizes networks of widths 20, 40, 80, 160, 320, 640, and 1280 to approximate the component basis functions. If we generate only a single basis function coming from a network with width 1280, we don’t come close to approximating the target function to the same degree of accuracy obtained from the full ensemble of basis functions.
The analogous result using a single wide neural network is shown in Figure 10. While it is certainly more accurate than the coarse first approximation $u_1$ from the sequence (Figure 9), it’s clear that it doesn’t come close to the accuracy obtained by using seven basis functions, the last of which comes from a network of width 1280. Sure, we could add a second basis function, say one from an even wider network if we continue with the same rule for the network widths, but this is computationally inefficient.
I’ll discuss computational costs more in §4.IV, but roughly speaking, the computational cost for Galerkin neural networks is on the same order as the cost of computing the basis function coming from the widest network. This means that computing the single wide-network approximation takes about the same amount of time as computing $u_7$ from the full sequence, and yet it is several orders of magnitude less accurate. All this to say that the coarse basis functions we learn initially are important! It might not seem like it, since the initial basis functions aren’t great approximations to their respective Riesz representations (Figure 7), but they encode the lower frequencies of the solution and allow the later basis functions to focus on higher frequencies.
Figure 10:u_{1} (left) and corresponding one-dimensional slice along \gamma=20. u_{1} is obtained using a single neural network with width n_{1}=1280.
II. Poisson Equation with Line Source Data
The last example I want to showcase is one which contains singular data: we take data which
corresponds to a Dirac delta distribution supported on a circle. The weak solution to this problem
in the given domain is known in closed form.
The parameters used for this example are given in Table 4 and the results in Figure 11.
Table 4: Parameters for approximating the solution to the line source data problem.
Figure 11: Errors with respect to number of basis functions (left). u_{7} approximation (right).
We observe that we’re able to accurately capture the corner in the solution that appears along the circle where the delta distribution is supported. Moreover, even for this reduced-regularity problem, we observe convergence at a fixed rate, as the theory suggests. This rate of convergence is slower due to the reduced regularity, as one might also expect. In the presence of singular behavior, there are several approaches to improve convergence. The first is to use a smarter formulation of the problem designed to better capture these features. An example of this can be found in our follow-up work, Galerkin neural network approximation of singularly-perturbed elliptic systems (preprint can be found here). The second is to utilize activation functions and/or a choice of initial approximation $u_0$ which capture the singular behavior. In the case of boundary layers, corner singularities, etc., much is known about the structure of solutions in the vicinity of these features and is readily available in the literature. Our current work focuses on such approaches.
6. Key Takeaways
Galerkin neural networks approximate the solution of a PDE by expressing it as the sum of several shallow neural networks, each of which encodes features of different frequencies. The result is an approximation that is more than the sum of its parts would suggest, as using only the most expressive network in the sequence is unable to come close to replicating the same result.
There are many papers out there purporting to combine Galerkin methods with neural networks which neither use a Galerkin method nor benefit from any of its properties. We leveraged the nice approximation properties of Galerkin methods together with the expressivity of neural networks in a way that feels true to the source material and performs well for a number of difficult test problems.
I hope you found this post informative. If you’re interested in trying out this method for yourself, feel free to check out the GalerkinNN repo on GitHub.