Physics-informed machine learning is gaining attention, but suffers from training issues.
Chris Edwards
Physical scientists and engineering research and development (R&D) teams are embracing neural networks in attempts to accelerate their simulations. From quantum mechanics to the prediction of blood flow in the body, numerous teams have reported on speedups in simulation by swapping conventional finite-element solvers for models trained on various combinations of experimental and synthetic data.
At the company’s technology conference in November, Animashree Anandkumar, Nvidia’s director of machine learning research and Bren Professor of Computing at the California Institute of Technology, pointed to one project the company worked on for weather forecasting. She claimed the neural network that team created could achieve results 100,000 times faster than a simulation that used traditional numerical methods to solve the partial differential equations (PDEs) on which the model relies.
Nvidia has packaged the machine learning techniques that underpin the weather-forecasting project into the SimNet software package it provides to customers. Its engineers have used the same approach to model the heatsinks that cool the graphics processing units (GPUs) that power many other machine learning systems.
Other engineering companies are following suit. Both Ansys and Siemens Digital Industries Software are working on their own implementations to support their mechanical simulation product lines, adding to a growing body of open source initiatives such as the DeepModeling community.
A key reason for using machine learning for scientific simulations is that a collection of fully connected artificial neurons can act as a universal function approximator. Though training those neurons is computationally intensive, during the inference phase the neural network often will provide results faster than simulators based on finite-element methods or other numerical approximations to PDEs.
One approach to training a neural network for scientific simulation is to record experimental data and augment that with simulated data using numerical methods. For example, a simulation of the motion of a shock wave in a fluid-filled pipe might use a combination of sensor recordings and the solutions of the Bateman-Burgers equation.
The simulated data can supply values for points where it is impossible to place a sensor to record pressure, or simply provide a higher density of data points. In principle, the machine learning model will then interpolate reasonable values for points where no data has been supplied. But the learned approximation can easily diverge from reality when checked against traditional models: the neural network likely will not learn the underlying patterns, just whatever lets it approximate the data points used for training.
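That failure mode is easiest to see in a purely data-driven surrogate. The sketch below, in PyTorch, fits a fully connected network to scattered samples of a field u(x, t) using a data-only loss; the data here is a synthetic placeholder standing in for sensor recordings and simulated solutions, and the network size and training schedule are illustrative choices.

```python
# A data-only surrogate: a fully connected network fit to scattered
# samples of a field u(x, t). The values below are synthetic placeholders
# standing in for sensor recordings and simulated solutions.
import torch
import torch.nn as nn

torch.manual_seed(0)

xt = torch.rand(2000, 2)  # columns: x in [0, 1], t in [0, 1]
u = torch.sin(torch.pi * xt[:, :1]) * torch.exp(-xt[:, 1:])  # placeholder field

model = nn.Sequential(
    nn.Linear(2, 64), nn.Tanh(),
    nn.Linear(64, 64), nn.Tanh(),
    nn.Linear(64, 1),
)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(2000):
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(xt), u)  # pure data-fitting loss
    loss.backward()
    opt.step()

# The network now interpolates u at unsampled points, but nothing in the
# loss constrains it to obey the underlying PDE between those points.
```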
In the 1990s, researchers found that one way to train neural networks on scientific data is to incorporate the PDEs that describe a process into the machine learning model. The PDEs in effect become inductive biases for the neural network. As with other work on neural networks at the time, the technique proved difficult to employ on large-scale problems.
Paris Perdikaris, associate professor of mechanical engineering and applied mechanics at the University of Pennsylvania, explains, “When people in the 1990s tried to use neural networks to solve PDEs, they had to manually derive forward and backpropagation formulas. If you apply those manual methods to more complex PDEs, the calculations become too complicated.”
Published in 2017, the physics-informed neural network (PINN) approach developed by Maziar Raissi and George Em Karniadakis at Brown University, together with Perdikaris, takes advantage of the automatic differentiation tools that now exist. In this method, the PDE forms part of the loss function used to recalculate the neuron weights at each training step.
Because the relevant PDE can simply be incorporated into the loss function, scientists and engineers have found PINNs easy to use. Perdikaris notes, “One of the main reasons for the current popularity of PINNs is the ease of implementation. It takes approximately 100 lines of Python code to implement a new PINN, or about an afternoon’s work. Another reason is that a PINN is often more tolerant to assumptions than conventional solvers.”
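As a rough illustration of that recipe, here is a minimal PINN-style sketch in PyTorch for the viscous Burgers equation u_t + u·u_x = ν·u_xx, with the PDE residual entering the loss alongside an initial-condition term. The network size, viscosity, sampling, and initial condition are illustrative choices, and boundary-condition terms are omitted for brevity.

```python
# A minimal PINN-style sketch for u_t + u * u_x = nu * u_xx.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(2, 64), nn.Tanh(),
    nn.Linear(64, 64), nn.Tanh(),
    nn.Linear(64, 1),
)
nu = 0.01  # viscosity; illustrative value

def pde_residual(x, t):
    # Automatic differentiation supplies u_t, u_x, and u_xx, so no
    # forward or backpropagation formulas need to be derived by hand.
    x = x.requires_grad_(True)
    t = t.requires_grad_(True)
    u = model(torch.cat([x, t], dim=1))
    u_t = torch.autograd.grad(u, t, torch.ones_like(u), create_graph=True)[0]
    u_x = torch.autograd.grad(u, x, torch.ones_like(u), create_graph=True)[0]
    u_xx = torch.autograd.grad(u_x, x, torch.ones_like(u_x), create_graph=True)[0]
    return u_t + u * u_x - nu * u_xx

x_c = torch.rand(1024, 1) * 2 - 1   # interior collocation points
t_c = torch.rand(1024, 1)
x0 = torch.rand(256, 1) * 2 - 1     # initial-condition points
u0 = -torch.sin(torch.pi * x0)      # u(x, 0) = -sin(pi x)

opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for step in range(5000):
    opt.zero_grad()
    loss_pde = pde_residual(x_c, t_c).pow(2).mean()  # PDE term of the loss
    loss_ic = (model(torch.cat([x0, torch.zeros_like(x0)], dim=1)) - u0).pow(2).mean()
    (loss_pde + loss_ic).backward()  # boundary terms omitted for brevity
    opt.step()
```

The only PDE-specific part is the residual function, which is why swapping in a new equation is often an afternoon’s work.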
Not only can the PDE-enhanced approach work well for high-dimensional problems with more relaxed assumptions, it also works for complex equations containing integral operators that are difficult to handle using traditional finite-difference methods.
One downside of using PINNs is that training them can be far from straightforward. Their behavior is quite different from that of neural networks trained with conventional loss functions, which lack multiple higher-order differential terms. The stochastic gradient descent approach used across many deep-learning applications often fails on PINNs. “We found that to make these optimizers work, we have to do significant hand-tuning and resort to non-standard tricks and techniques,” says Amir Gholami, a post-doctoral research fellow at the Berkeley AI Research Lab in California.
In many cases, the solution space is too complex for training to converge automatically. In a situation such as the modeling of beta-advection in fluids, which is used in the simulation of hurricanes, the beta value itself proves to be a roadblock. “We tested with different betas. As soon as you go to higher levels of beta, things begin to break,” Gholami notes.
Neuron-weight initialization at the beginning of training is similarly troublesome. The techniques developed for deep neural networks by Xavier Glorot and Yoshua Bengio at the University of Montréal in 2010, and now used widely, do not work for PINNs because PINNs do not operate in a conventional supervised-learning environment. “All the assumptions that are used for initialization in classical networks are violated,” Perdikaris says, adding that data distribution has a significant effect on the convergence of training. “The assumptions we use for deep learning need to be revised or tailored for the PINN framework.”
Researchers have developed some workarounds for the problems they have encountered with PINNs. One is to adapt the data so that early training is more stable, which can be achieved by limiting the range of the data used in the early stages of training before progressively extending the range covered, as in the sketch below. Colby Wight and Jia Zhao of Utah State University described this method in a paper published in the summer of 2020.
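A minimal sketch of that range-limiting idea, assuming a PINN training step like the one above: collocation points are drawn from a time window that widens in stages. The four-stage schedule and window fractions are illustrative assumptions, not the exact scheme from the paper.

```python
# Range-limited sampling: early training stages see only a narrow
# time window, which then widens. The schedule is an illustrative choice.
import torch

def collocation_points(n, t_max):
    """Sample collocation points with time restricted to [0, t_max]."""
    x = torch.rand(n, 1) * 2 - 1
    t = torch.rand(n, 1) * t_max
    return x, t

for t_max in (0.25, 0.5, 0.75, 1.0):  # gradually extend the covered range
    x_c, t_c = collocation_points(1024, t_max)
    # ... run a block of PINN training steps on (x_c, t_c), as above ...
```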
Presenting at NeurIPS last December, Gholami and colleagues proposed adapting the PDE during training in a form of curriculum learning. “We found that starting from a simple PDE and then progressively making the PDE regularization more complex makes the loss landscape easier to train,” Gholami says.
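The sketch below illustrates that kind of curriculum in PyTorch for a simple advection equation u_t + β·u_x = 0: training starts with a small β and each stage warm-starts from the previous weights at a larger value. The β schedule, network, and the omission of initial- and boundary-condition terms are simplifications for illustration.

```python
# Curriculum over the PDE itself: start with an easy coefficient beta
# and make the PDE regularization progressively harder.
import torch

def residual_advection(model, x, t, beta):
    """Residual of u_t + beta * u_x = 0 at collocation points."""
    x = x.requires_grad_(True)
    t = t.requires_grad_(True)
    u = model(torch.cat([x, t], dim=1))
    u_t = torch.autograd.grad(u, t, torch.ones_like(u), create_graph=True)[0]
    u_x = torch.autograd.grad(u, x, torch.ones_like(u), create_graph=True)[0]
    return u_t + beta * u_x

model = torch.nn.Sequential(
    torch.nn.Linear(2, 64), torch.nn.Tanh(), torch.nn.Linear(64, 1))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x_c, t_c = torch.rand(1024, 1), torch.rand(1024, 1)

for beta in (1.0, 5.0, 10.0, 20.0):  # easy PDE first, harder PDE later
    for step in range(1000):         # weights carry over between stages
        opt.zero_grad()
        loss = residual_advection(model, x_c, t_c, beta).pow(2).mean()
        loss.backward()              # initial/boundary terms omitted for brevity
        opt.step()
```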
One contributor to the difficulty of training PINNs comes down to what researchers call spectral bias: the tendency of neural networks, PINNs included, to fit low-frequency patterns in data more easily than higher-frequency ones. The progression from simple to complex PDEs makes it easier to overlay the higher-frequency contributors once the low-frequency contributors have been identified in early training.
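Spectral bias is easy to observe directly. In the sketch below, a small network is fit to a signal with one low-frequency and one high-frequency component (an illustrative target, not one of the article’s examples), and the residual energy in each frequency band is tracked: the low-frequency error typically collapses long before the high-frequency error does.

```python
# A small demonstration of spectral bias on an illustrative target.
import torch
import torch.nn as nn

x = torch.linspace(0, 1, 512).unsqueeze(1)
target = torch.sin(2 * torch.pi * x) + 0.5 * torch.sin(20 * torch.pi * x)

model = nn.Sequential(nn.Linear(1, 64), nn.Tanh(),
                      nn.Linear(64, 64), nn.Tanh(),
                      nn.Linear(64, 1))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(5001):
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(x), target)
    loss.backward()
    opt.step()
    if step % 1000 == 0:
        # Residual energy at the two frequencies: bin 1 (one cycle over
        # [0,1]) and bin 10 (ten cycles). The low bin shrinks first.
        spectrum = torch.fft.rfft((model(x) - target).squeeze().detach())
        print(step, spectrum[1].abs().item(), spectrum[10].abs().item())
```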
Perdikaris and colleagues explored these issues using a technique developed several years ago by Arthur Jacot and colleagues at the École Polytechnique Fédérale de Lausanne (EPFL) in Switzerland. That work employed a conceptual neural network with an infinitely wide hidden layer, driven by a loss function, to calculate the elements of an algebraic kernel that describes the neural network’s training properties. Common matrix-analysis techniques, such as examining the eigenvalues and eigenvectors of these neural tangent kernels, were used to see how the kernels differed across neural networks that encountered problems. Typically, components of the loss function that correspond to large eigenvalues of the neural tangent kernel will be learned quickly; the others, far more slowly.
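For finite networks, an empirical version of this kernel can be computed directly. The sketch below builds K[i, j] as the inner product of parameter gradients at pairs of inputs for a toy network and inspects the eigenvalue spread; the network, inputs, and interpretation comment are illustrative, not the authors’ exact procedure.

```python
# Empirical neural tangent kernel of a toy network at initialization:
# K[i, j] = <dF(x_i)/dtheta, dF(x_j)/dtheta>.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(1, 32), nn.Tanh(), nn.Linear(32, 1))
x = torch.linspace(-1, 1, 16).unsqueeze(1)

rows = []
for i in range(x.shape[0]):
    out = model(x[i:i + 1])
    grads = torch.autograd.grad(out.sum(), model.parameters())
    rows.append(torch.cat([g.flatten() for g in grads]))

jac = torch.stack(rows)            # one parameter-gradient vector per input
ntk = jac @ jac.T                  # empirical NTK
eigvals = torch.linalg.eigvalsh(ntk)
print(eigvals)  # a wide eigenvalue spread signals uneven convergence rates
```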
The difference in responsiveness of the different terms in the PDEs used for loss functions has a dramatic effect on trainability. What stood out to the University of Pennsylvania team was the wide discrepancy in convergence rates among different PINN loss functions. It underlined the common observation that wave equations tend to be particularly problematic. The terms that encode boundary conditions also turn out to be more problematic for training, which helps explain why other researchers, searching for heuristics to make their PINNs easier to train, found that relaxing those conditions helped neural networks converge more quickly.
The simplest way around the problem of training failure is to slow the gradient-update rate dramatically, giving the stiffer terms a better chance of updating appropriately between consecutive batches. But this risks slowing the process down so far that it fails to yield a useful model.
Perdikaris sees the information from the neural tangent kernel being used to tune training rates for each of the terms to improve convergence. “This is not confined to PINNs: it applies to any multitask training situation,” he says. “In general, we should think about developing more specialized architectures and methods for these problems.”
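One way such per-term tuning might look in code is a gradient-norm balancing scheme like the sketch below, which rescales each loss term so its gradient magnitude tracks that of a reference term, rather than slowing the whole optimizer down. The update rule, smoothing factor, and helper names here are assumptions for illustration, similar in spirit to, but not identical with, published schemes.

```python
# Illustrative per-term loss balancing via gradient norms.
import torch

def grad_norm(loss, params):
    """Norm of the gradient of one loss term with respect to params."""
    grads = torch.autograd.grad(loss, params, retain_graph=True,
                                allow_unused=True)
    return torch.cat([g.flatten() for g in grads if g is not None]).norm()

def balanced_weights(loss_terms, params, old_w, alpha=0.9):
    """Smoothly rescale weights so each term's gradient matches term 0's."""
    norms = [grad_norm(t, params) for t in loss_terms]
    ref = norms[0]  # balance against the first (e.g., PDE residual) term
    return [alpha * w + (1 - alpha) * (ref / (n + 1e-12)).item()
            for w, n in zip(old_w, norms)]

# Inside a PINN training loop, one might use it as:
#   w = balanced_weights([loss_pde, loss_bc, loss_ic],
#                        list(model.parameters()), w)
#   total = sum(wi * li for wi, li in zip(w, [loss_pde, loss_bc, loss_ic]))
```

The appeal of this style of scheme is that stiff boundary or initial-condition terms get their effective learning rate raised or lowered individually, which is the multitask-training view Perdikaris describes.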
The distribution of training data also plays a major role in shaping the spectrum of the neural tangent kernel, which hints at other techniques that might be used to improve trainability across a variety of neural-network applications.
The structure of the neural network itself may be well suited to only a subset of the problems on which PINNs can be deployed. “From intuition, it would make sense that we would need different types of architectures depending on the nature of the underlying PDE,” Gholami says.
Perdikaris says the emergence of PINNs has revealed many issues, but addressing them could go a long way toward informing the theory of neural networks. “It’s a very exciting field. It’s how deep learning was before 2010. We have a gut feeling that it should work. But as we push to realistic applications, we run into limitations. It’s not the PINN framework itself that’s the problem; it’s how we set it up.”
“What we need is to develop a rigorous understanding of what goes wrong as we increase the complexity of the problem, and come up with ways to deal with them,” Perdikaris concludes, noting that the field needs more than heuristics to deal with the challenges of this branch of machine learning.
Source: ACM Magazine