2. Iterative Methods
Many real-world optimization problems entail objective functions lacking readily available
analytical solutions. In such instances, iterative methods become indispensable for
approximating solutions. These iterative optimization techniques, such as conjugate
gradient methods and their variants, play a vital role in numerically finding optimal
or near-optimal solutions by iteratively refining the current estimate. They prove
especially valuable in scenarios where analytical optimization is impractical or infeasible,
facilitating the effective exploration of complex, high-dimensional solution spaces
[8-9]. The iterative nature of these methods allows for systematic improvement of the solution
approximation until a satisfactory level of optimality is attained.
2.1 Steepest Descent Methods
The method of Steepest Descent (SD) is a gradient algorithm where the step size is
chosen to achieve the maximum decrease in the objective function at each individual
step. Let $\left\{{ x}^{(k)}\right\}_{k=0}^{\infty}$ be a sequence in the domain of
a function $f({ x}):{ R}^{n}\to { R}$ that is differentiable at each point where
n and k are integers [10]. The direction of gradient descent can be obtained as the vector $-\nabla f({ x}^{(k)})$
where the gradient $\nabla f({ x}^{(k)})$ is the first order derivative of $f({ x}^{(k)})$,
Then starting at ${ x}^{(k)}$, the next point ${ x}^{(k+1)}$ is searched by moving
with an amount of in the orthogonal steps,
where $\alpha_{k}$ is a positive scalar called the step size, which is chosen to achieve
the maximum amount of decrease of the objective function at each individual step.
The $\alpha_{k}$ is selected to minimize
with respect to $\alpha_{k}\ge 0$ where the variable
is defined as
For a quadratic objective function of the form:
where ${ Q}\in{ R}^{n\times n}$ is a symmetric positive definite matrix, ${ b}\in{
R}^{n}$ and ${ x}\in{ R}^{n}$, the unique minimizer $\nabla f({ x})$ and the Hessian
matrix ${ H}({ x})$ of $f({ x})$ can be found by:
The gradient descent algorithm for an objective function $f({}{x})$ follows the method
of steepest descent, moving in orthogonal steps, as indicated by $\left <\nabla f({}{x}^{({k})},\:
\nabla{f}({}{x}^{({k}+1)}\right > =0$ for each k in the sequence $\left\{{ x}^{(k)}\right\}_{k=0}^{\infty}$,
where the operator, $ < >$, denotes the inner product. In steepest descent, the algorithm
iteratively updates the current solution by moving in the direction opposite to the
gradient of the objective function at the current point. This direction is considered
the steepest descent as it corresponds to the direction of the fastest decrease in
the function. Nevertheless, this method may display zig-zagging behavior and slow
convergence in certain cases [12,17].
2.2 Conjugate Gradient Methods
To address the limitations of the steepest descent method, the conjugate gradient
(CG) method is introduced. The key idea behind the conjugate gradient method is to
search for conjugate directions, which are directions that are orthogonal with respect
to the problem's quadratic form. By choosing conjugate directions, the method aims
to avoid the zig-zagging behavior and improve the convergence rate compared to the
steepest descent method.
The CG algorithm is widely used for optimization problems, especially in the context
of solving linear systems of equations or unconstrained optimization problems. It
is known for its efficiency, particularly for quadratic functions, where it can converge
to the solution in a number of iterations equal to the dimension of the problem. The
CG algorithm is an iterative optimization technique that belongs to the family of
iterative solvers for linear systems and can be extended to unconstrained optimization
problems as well. The Conjugate Gradient algorithm is described below. Given a quadratic
function, Equation (5), the conjugate gradient algorithm is summarized as follows [2,10,17]:
==============================================
Initialize the process by choosing an initial guess ${ x}^{(0)}$ and the residual
${ r}^{(0)}={ b}-{ Qx}^{(0)}$. Then the search direction becomes ${ d}^{(0)}={ r}^{(0)}=
-\nabla f({ x}^{(0)})={ b}-{ Qx}^{(0)}$
such that
$\nabla{f}({}{x}^{(0)})={}{Qx}^{(0)}-{}{b}$.
If $\nabla{f}({}{x}^{(0)})=0$, stop, else set ${}{d}^{(0)}={}{r}^{(0)}$.
For each iteration $k=0,\: 1,\: 2,\: ...$
- Compute the step size by minimizing
- Update the step:
- Update the gradient:
- Compute the conjugate gradient parameter to determine the next search direction:
- Update the search direction:
==============================================
The iterations terminate when a predefined convergence criterion is met, such as when
the norm of the gradient is below a certain threshold or after a fixed number of iterations.
The CG methods involve more sophisticated update rules. The search directions are
adjusted based on the history of previous iterations to ensure conjugacy, while the
update rule in steep descent is simple and involves multiplying the negative gradient
by a step size (learning rate) to determine the next iterate. The CG methods are designed
to converge more quickly than the SD method when they minimize the objective function
exactly for quadratic problems in a certain number of iterations (up to machine precision).
The CG methods may require additional memory to store vectors representing the history
of previous iterations while steep descent requires less memory typically as it only
needs to store the current gradient vector.
The fact that the CG methods solve quadratics of n variables in n steps is well known,
where the set of search directions is known as the best. The CG methods may be regarded
as an intermediate between the SD methods and Newton’s methods, which implies that
the CG methods typically perform better than the SD methods but not as well as Newton’s
methods [11-12].
Newton’s methods have the property that the minimum may be found without much effort
under the following conditions. First, the objective function is approximately quadratic
near the minimum. Second, the objective function is twice differentiable at the minimum.
Third, the second derivative is non-singular at the minimum point [10,15,16].
Newton’s methods practically utilize that the non-quadratic function is approximately
quadratic near the minimum points if the objective function is twice differentiable
at the minimum and the second derivative is non-singular at the point. However, in
other words, if the initial point is not near to the solution, Newton’s methods are
not practical.
CG methods can solve quadratics of n variables in n steps. Matrix inversion and the
storage of a matrix are not required in the CG procedure compared to Newton’s methods.
Thus, the usual implementation of the quadratic CG algorithm does not require Hessian
matrix evaluations in iterations, while the nonquadratic CG algorithm needs Hessian
evaluations in every iteration [1,10,15].
2.3 Nonquadratic Conjugate Gradient Methods
Researchers have tackled this issue by devising a range of line search techniques.
In general, for non-quadratic problems, evaluating the Hessian matrix at every iteration
of the CG algorithm is computationally demanding. In contrast to the constant Hessian
matrix found in quadratic objective function, the CG methods extend the CG approach
to non-quadratic numerical optimization, eliminating the necessity for Hessian evaluations
at every step. In non-quadratic CG methods, the closed-form expression for the conjugate
gradient parameter $\beta_{k}$ in Equation (11) can be substituted with a line-search procedure, obviating the need for the Hessian
matrix Q in the Equation (5). This alteration enables the algorithm to rely solely on the values of the objective
function and its gradient at each iteration, without requiring matrix involvement.
The conjugate gradient parameter $\beta_{k}$ in Equation (11) can be reformulated as depicted in Equation (13) by replacing $[\nabla f({ x}^{(k+1)}-\nabla f({ x}^{(k)})]/\alpha_{k}$ with the term
${ Qd}^{(k)}$. This term ${ Qd}^{(k)}$ is obtained by multiplying ${ x}^{(k+1)}={
x}^{(k)}+\alpha_{k}{ d}^{(k)}$ (from Equation (9)) with Q, which yields ${ Qx}^{(k+1)}={ Qx}^{(k)}+\alpha_{k}{ Qd}^{(k)}$. From the
relation ${}{Qx}^{({k}+1)}-{}{Qx}^{({k})}=\alpha_{{k}}{}{Qd}^{({k})}$ and $\nabla
f({ x}^{(k+1)})+{ b}={ Qx}^{(k+1)}$ in Equation (10), the term $[\nabla f({ x}^{(k+1)}-\nabla f({ x}^{(k)})]/\alpha_{k}$ is derived. Hence,
Equation (11), $\beta_{k}=\dfrac{\nabla f({ x}^{(k+1)}){ Qd}^{k)}}{{ d}^{(k)^{T}}{ Qd}^{(k)}}$,
is reformulated by incorporating the relations we established, resulting in a new
conjugate gradient parameter $\beta_{k}^{HS}$. This leads to
Equation (13) is known as the Hestenes-Stiefel formula [1], which is derived by eliminating the parameter $\alpha_{k}$ and removing the parameter
Q through mathematical manipulations. In this context, $\nabla f({ x}^{(k)})$ represents
the gradient of the objective function at the k-th iteration, and ${ d}^{(k)}$ denotes
the previous conjugate direction. The Hestenes-Stiefel formula for $\beta_{k}^{HS}$
ensures that the new conjugate direction is maintained conjugate to the previous ones
with respect to the quadratic objective function. This prevents the algorithm from
retracing its steps, thereby contributing to efficient convergence. Therefore, determining
the conjugate gradient parameter without explicit knowledge of the Hessian matrix
Q becomes essential, necessitating only the objective function and gradient values
at the selected location for each iteration [1,10,16].
However, for nonquadratic problems involving n variables, the algorithm typically
fails to converge in n steps [10,16]. As the algorithm advances, the condition of conjugate orthogonality among the direction
vectors tends to weaken. This underscores the importance of line search in identifying
the minimum of a nonquadratic objective function. Popular choices for the parameter
include the Fletcher-Reeves (FR), Polak-Ribiere (PR), Polak-Ribiere-Polyak (PRP),
and Hestenes-Stiefel (HS) formulas, among others, to obviate the need for Q [11-12].
In general, conjugate gradient algorithms for nonlinear problems require only the
objective function and gradient values at each iteration, bypassing the need for explicit
knowledge of the Hessian matrix, with various options available for selecting the
parameter $\beta_{k}$. For quadratic problems, the formulas for $\beta_{k}$—specifically
those of HS, PR, and FR—are identical. However, this is not the case with nonquadratic
problems, where different expressions for $\beta_{k}$ can lead to distinct outcomes.
Nonquadratic objective functions can increase computational demands and lead to stagnation
in practical applications. Consequently, the choice of $\beta_{k}$ significantly influences
the effectiveness of the line search process, making it a critical factor in solving
minimization problems involving nonquadratic functions [10,15].
3. Adaptive Restarting Conjugate Gradient Methods
3.1 Conjugate Gradient methods with Restarting
While the process of finding the objective function's minimum involves conducting
searches along directions that are mutually conjugate for a quadratic function of
n variables in n steps, the CG algorithm often does not converge to the optimum within
n steps for non-quadratic problems.
To counteract the issue of slow convergence, the direction search process is typically
reset to follow the steepest descent direction at least every n or n+1 iterations
in practice for an objective function with n variables, as noted in references [7,10]. However, this approach might not be suitable for large n values. The reason is that
restarting the process every n iteration may not effectively counteract the deterioration
of convergence due to the extended intervals between restarts, which become too lengthy
with a large number of variables.
Your explanation provides a detailed view on the dynamics of the conjugate gradient
(CG) method and its restart mechanism, but it can be made clearer and more concise.
Indeed, the conjugate gradient (CG) method tends to exhibit linear convergence unless
the iterative process is periodically restarted. At the outset of CG iterations, starting
from an initial point ${ x}^{(0)}$, the subsequent point ${}{x}^{(1)}$ is determined
using the Steepest Descent (SD) method. Without a restart, the vector leading from
the initial to the minimum point can be expressed as a linear combination of conjugate
directional vectors. Thus, a restart at ${}{x}^{({k}+1)}$ is affected from ${ x}^{(k)}$
by reforming the search for directional vectors, specifically by setting $\beta_{k}=0$
in Equation (12). This ensures that the new directional vector ${ d}^{(k+1)}$ is no longer a linear
combination of the previous conjugate directional vectors.
Recent studies have introduced various restarting algorithms for nonquadratic conjugate
gradient (CG) methods, developing a method for restarting by monitoring the $l_{2}$
norm of the gradients of $\nabla f({ x}^{(k)})$ and $||{ d}^{(k+1)}||$ at the point
${ x}^{(k)}$. These studies leverage the characteristics of typical conditions related
to gradient directions to establish global convergence, as documented in references
[16-17].
In addition, at each iteration, the backtracking line search method calculates a step
size that ensures a suitable reduction in the non-convex objective function. This
process begins with an initially large estimate of the step size for the line search
direction, then iteratively reduces—or 'backtracks'—the step size until a decrease
in the objective function is achieved. This backtracking process ensures an appropriate
balance between the step size reduction and the local gradient of the objective function.
Restarting methods utilize not only the inner product of the gradient and directional
vector but also consider the norm of the directional vector at each point [14,18,19].
The proposed adaptive method, however, capitalizes on vector properties, specifically
the vector that extends from the initial point to the directional vector, to monitor
changes in convergence without directly searching for the step size. This contrasts
with many restarting methods, which impose a significant computational load by relying
on the degree of backtracking for restart criteria, focusing on finding the next directional
vector and its size without addressing the overall trajectory of conjugate directional
vectors. Instead, the proposed method detects convergence stagnation and initiates
a CG restart, effectively mitigating the risk of stagnation.
3.2 Stagnation of Conjugate Gradient and Restart
Stagnation in optimization occurs when an algorithm's progress toward the optimal
solution significantly slows down or stops altogether. This phenomenon, particularly
within the conjugate gradient (CG) method, manifests as negligible progress toward
the optimal solution, evidenced by minimal changes in the objective function or the
values of decision variables across successive iterations. Despite the CG method's
effectiveness, stagnation can occur, especially in complex optimization challenges.
Identifying the causes of stagnation and implementing suitable strategies or adjusting
parameters are crucial steps to enhance the algorithm's performance.
Several strategies can be employed to address stagnation in the CG method. First,
restarting the CG method involves resetting the algorithm's state after a certain
number of iterations or when specific conditions are met, helping to overcome stagnation
by reinitializing the search process. Second, adjusting parameters such as the step
size or the preconditioner can sometimes alleviate stagnation issues. Third, employing
adaptive methods can enhance the algorithm's robustness. Incorporating adaptive strategies,
like adjusting the search direction based on the objective function's behavior, can
effectively overcome stagnation.
The relationship between stagnation and the Nonlinear Conjugate Gradient (NCG) method
merits exploration. In nonquadratic optimization, the NCG method adapts the conventional
approach to accommodate more complex, nonquadratic objective functions, whereas the
CG method is optimized for quadratic objectives. In the realm of non-quadratic problems
with n variables, the optimization process may fail to converge within n steps. Additionally,
the quality of conjugate direction vectors can decline, exacerbating stagnation. This
decline is often due to the method of expanding direction vectors, where each new
directional vector must be conjugately orthogonal to those of previous iterations,
potentially leading to stagnation.
Stagnation in the NCG method may arise from various challenges, such as ill-conditioned
problems, poor scaling, or unsuitable step size choices. The inherent characteristics
of nonquadratic objective functions, including nonlinearity and variable curvature,
further contribute to stagnation when the algorithm cannot adapt to the evolving landscape.
Causes of stagnation within the NCG framework include loss of conjugacy, inadequate
step size selection, and the complexity of navigating nonquadratic surfaces. Given
the wide variability in the curvature of objective functions in nonquadratic optimization,
selecting appropriate step sizes becomes a significant challenge across the optimization
landscape.
Addressing stagnation in the NCG method can involve several strategies, such as adopting
adaptive step size selection, implementing preconditioning techniques, and employing
restarting strategies. These approaches aim to enhance the method's adaptability and
efficiency in overcoming the unique challenges presented by nonquadratic optimization
problems.
Adaptive methods play a pivotal role in addressing stagnation within optimization,
particularly in the Nonlinear Conjugate Gradient (NCG) method. These methods dynamically
adjust parameters in response to the behavior of the optimization process, especially
changes in the objective function's curvature. Restarting the optimization algorithm,
by resetting its state, offers another effective strategy for overcoming stagnation.
This approach enables the algorithm to explore new search directions, potentially
navigating out of local minima or areas of slow progress. The challenges of nonquadratic
objective functions inherently contribute to stagnation, making adaptive methods and
restarting strategies crucial for mitigating these effects and enhancing optimization
efficiency in the face of nonlinearities.
Stagnation in optimization is characterized by a significant slowdown or complete
halt in progress towards the optimal solution. Within the CG method, it manifests
as minor changes in the objective function or decision variables across iterations.
Despite the method's proven effectiveness, stagnation can occur, especially in the
face of complex optimization challenges. Understanding the underlying causes of stagnation
and implementing effective mitigation strategies, such as parameter adjustments, is
essential for improving the performance of the CG method.
3.3 The Adaptive Restarting Conjugate Gradient methods
Adaptive restarting strategies, which dynamically adjust the criteria for restarting
based on the optimization algorithm's progress, have garnered interest from numerous
researchers. This paper proposes adaptive methods using the CG approach, leveraging
information about the objective function's curvature. The proposed algorithm has demonstrated
an improvement in convergence by mitigating or even eliminating the phenomena of stagnation
through an adaptive CG restarting procedure.
The criterion for restarting is set in an adaptive manner, responding to the dynamic
behavior of the directional vectors. Although the convergence characteristics are
significantly influenced by the initial starting point of the CG method, the adaptive
restarting approach can reduce the computational load and time required for convergence.
The dynamics of CG directional vectors are affected by the gradient, which reflects
the surface curvature of the objective function. Therefore, generally, the deterioration
of convergence in nonquadratic CG problems may be avoided not simply by choosing an
arbitrary initial point but through the implementation of adaptive control strategies.
The restarting process in the CG method involves eliminating the conjugacy of directional
vectors at the point of restart. This means that the requirement for all directional
vectors to be conjugate orthogonal to each other is lifted at this point, allowing
for a new set of CG directional vectors to commence. This constraint is identified
as a potential cause of stagnation in nonconvex CG methods. Stagnation indicates that
convergence is not as effective as expected, partly due to the conjugacy constraint.
To select a new directional vector that is not bound by conjugacy, we employ the Steepest
Gradient (SG) method whenever stagnation is detected.
Consider an initial point ${ x}^{(0)}$ as the centroid of a sphere $||{ x}-{ x}_{c}||$for
every ${}{x}\in{}{R}^{{n}}$, where distances from the position ${ x}_{c}= x^{(0)}$
to ${ x}^{(k)}$ and ${ x}^{(k+1)}$ are measured for $k>1$.
When the distance from the centroid ${ x}_{c}$ to ${ x}^{(k+1)}$, $||{ x}^{(k+1)}-{
x}_{c}||$, is less than the distance $||{ x}^{(k)}-{ x}_{c}||$, the trajectory of
the positions starts spiraling inwards with respect to the centroid of the sphere,
signaling a potential onset of stagnation. Upon detecting the beginning of this spiral,
${ x}^{(k+1)}$ is designated as the new centroid for the subsequent sphere. The task
then shifts to detecting a spiral pattern anew relative to this updated centroid.
Detection of a spiral-in pattern triggers the initiation of a new CG cycle, where
the steepest descent method is applied to remove the property of conjugacy. This approach
effectively adapts to the curvature characteristics of the objective function's surface.
The beginning of the spiral in can be measured by computing the ratio $\rho$ between
two successive reference vectors ${}{x}^{({k})}$ and ${ x}^{(k+1)}$ with respect to
a chosen centroid ${ x}_{c}$, as given by Equation (15) for (k > 1). The ratio of the distances is defined as Equation (14): $\rho =\dfrac{||{ x}^{(k+1)}-{ x}_{c}||}{||{ x}^{(k)}-{ x}_{c}||}$.
Here, || · || denotes a norm. The ratio $\rho$ is utilized to detect the possible
onset of stagnation at the (k+1)th directional vector. When the ratio $\rho$ is less
than the predetermined threshold $\eta\in(0,\: 1]$, we restart the CG process by resetting
the centroid ${ x}_{c}$ to the new position ${ x}^{(k+1)}$ and setting ${ d}^{(k+1)}=
-\nabla f({ x}^{(k+1)})$. This restart breaks the 'conjugacy' connection of the directional
vectors with the new directional vector ${ d}^{(k+1)}$. The properties of the vector
measure not only the step size of a directional vector but also the characteristics
of the trajectory of the directional vectors. However, if the ratio is greater than
the threshold $\eta$, the directional vectors are free from stagnation, and we proceed
without restarting CG.
In the proposed ARCG method, we find that it is rather simple and effective, as it
accelerates the convergence process while avoiding stagnation. For instance, some
restarting methods in previous works could potentially slow down the convergence speed,
although they may improve stability. Additionally, possible round-off errors may cause
misbehavior of the CG method, a concern that does not arise in ARCG. Moreover, ARCG
enhances both the convergence speed and stability properties of non-quadratic optimization
by eliminating stagnation processes during convergence. Below is the algorithm for
ARCG.
==============================================
Algorithm: Adaptive Restarting Conjugate Gradient (ARCG)
==============================================
Input:${ x}^{(0)}$, set $\nabla f({ x}^{(0)})$, ${}{d}^{(0)}=\nabla{f}({}{x}^{(0)})$,
${ x}_{c}= x^{(0)}$, $\eta\in(0,\: 1]$.
for k=0, 1, 2, ...,
do
Compute $\alpha_{k}$ such that
$f({}{x}^{({k})}+\alpha_{{k}}{ d}^{({k})})= f({}{x}^{({k})})+\alpha_{{k}}\nabla{f}({}{x}^{({k})})^{{T}}{
d}^{({k})}$.
Set ${}{x}^{({k}+1)}={}{x}^{({k})}+\alpha_{{k}}{}{d}^{({k})}$ and $\nabla f({ x}^{(k+1)})$.
if( $\rho <\eta$ ) and (k >1),
Compute the ratio
Reset the centroid
Restart the CG algorithm by setting
Choose a conjugate directional parameter.
Set
end for
==============================================
4. Numerical Simulations
The experiments were conducted by randomly selecting initial points to examine convergence
towards the local optimum point and the global optimum point in relation to the objective
function. The choice of initial points was influenced by considerations of convergence
speed and the potential for stagnation. This paper does not explore the selection
of initial points, a critical factor in achieving convergence. Instead, it focuses
on addressing the issue of stagnation that can occur with arbitrary initial points.
In this context, the objective was to demonstrate the effectiveness of the proposed
method, ARCG, in overcoming stagnation by assessing the speed of convergence in situations
prone to stagnation. A nonlinear function is used as the objective function for variables
and ${x}_{1}$, ${x}_{2}$ represented by the equation $f({x}_{1,\: }{x}_{2})= 0.5{x}_{1}^{4}-{x}_{1}^{3}-8{x}_{1}^{2}+
2.5{x}_{1}+ 0.5{x}_{2}^{4}-{x}_{2}^{3}- 8{x}_{2}^{2}+ 2.5{x}_{2}- 10{x}_{1}{x}_{2}$.
The shape of the objective function is depicted in Figure 1.
Fig. 1. An objective function of a nonquadratic nonlinear form: \begin{align*}
$f({x}_{1,\:}{x}_{2})=0.5{x}_{1}^{4}-{x}_{1}^{3}-8{x}_{1}^{2}+2.5{x}_{1}+0.5{x}_{2}^{4}\\-{x}_{2}^{3}-8{x}_{2}^{2}+2.5{x}_{2}-10{x}_{1}{x}_{2}$
Fig 2 depicts the track of directional vectors on the contour map of the objective function,
which is of a nonquadratic nonlinear form, as shown in Fig 1. Observing the behavior of convergence using the CG method for the initial point
(-0.5, -0.9) where f (-0.5, -0.9) = -15, it becomes evident that the convergent directions
deteriorate as the direction vectors approach the optimal point. This phenomenon,
known as convergence stagnation, is characterized by the directional vector sequence
converging very slowly to the optimal point, forming a spiraling shape as illustrated
in Fig 2 through our experiments. All numerical experiments are established with allowed maximum
error 1.0e-9.
Fig. 2. The track of directional vectors of the contour map of the CG Method with
the initial point (-0.5, -0.9) is presented. The minimum value (–277) of the objective
function is found at the optimum point (4.39390103568654, 4.39390103568654) at the
78th directional vector for the allowed error specification 1.0e-9.
For the initial point (-0.5, -0.9), where f (-0.5, -0.9) is 15, the directional vector
sequence of CG method converges to the global minimum point (4.39390103568654, 4.39390103568654)
with an error of 8.881267727026257e-10 at 78 iterations, reaching a minimum value
of –277 with an allowable error less than 1.0e-9. However, the iterations reveal a
deterioration in the convergence process, leading to stagnation, as indicated in Fig 2. As it is well known, this stagnation is a challenge that the conventional nonconvex
CG method needs to overcome. Fig 3 illustrates the convergence of the directional vectors of the ARCG method starting
from the initial point (-0.5, -0.9) with an allowable error which is less than 1.0e-9.
The objective function reaches its minimum value (-277) at the optimal point (4.39390103568654,
4.39390103568654) after only 9 iterations. Table 1 presents the sequence of converging directional vectors. The proposed ARCG method
demonstrates faster convergence with fewer iterations compared to the CG method shown
in Fig 2. Restarting occurs at the second, fourth, and sixth iterations, effectively reducing
the overall number of iterations, as depicted in Figure 3 and Table 1.
Fig. 3. Convergence of the ARCG Method to the minimum point (4.39390103568654, 4.39390103568654)
with the initial point (-0.5, -0.9) takes 10 iterations. Restarting occurs at the
second, fourth, and sixth iterations.
Table 1 Convergence of the ARCG Method with Initial Point (-0.5, -0.9). NI: Number
of Iterations.
NI
|
Optimal Point $({x}_{1},\: {x}_{2})$
|
Restart
0: Off
1: On
|
1
|
(-0.5, -0.9)
|
0
|
2
|
(4.63042015891167, 4.09508799471984)
|
1
|
3
|
(4.37292930860144, 4.07464340741611)
|
0
|
4
|
(4.34834980311952, 4.38421178718921)
|
1
|
5
|
(4.39379845522371, 4.3949020707615)
|
0
|
6
|
(4.39402354706745, 4.39394511570709)
|
1
|
7
|
(4.39390103254279, 4.39390104905889)
|
0
|
8
|
(4.39390103711054, 4.39390103635959)
|
0
|
9
|
(4.39390103568654, 4.39390103568654)
|
0
|
Fig 4 depicts the convergence of the CG method, starting from the initial point (5, -5)
with f (5, −5) = 474. The objective function reaches its minimum value -277 at the
optimal point (4.39390103439962, 4.39390103477718) at the 109th iteration, with an
error 8.194639732233808e-10.
Fig. 4. The error that occurred at the 122th iteration is 8.194639732233808e-10, where
an initial objective function value of 475 is given at the starting point (5, -5).
The global convergent point reached is (4.39390103439962, 4.39390103477718) with a
minimum value of –277, which is denoted as (4.39391, 4.393901) in Fig.4 as approximation
Figure 5 illustrates a scenario in which the number of iterations decreases from 109,
as obtained by the Conjugate Gradient (CG) method, to 10 with the proposed Adaptive
Residual Conjugate Gradient (ARCG) method, starting from the initial point (5, -5)
and achieving an error of . The value of objective function at the initial point is
475, and the global minimum point (4.39390103568654, 4.39390103568654) is reached
at the 10th iteration, achieving a minimum value of -277. The convergence directional
vector sequence obtained from the 10 iterations with three restarts. While the conventional
CG method without restarting requires 109 iterations, the proposed ARCG achieves the
same optimal point with 10 iterations under the same conditions.
Fig. 5. The experiments with ARCG yield an error of 2.141417408469820e-10 for the
initial point (5, -5) with an objective function value of 475. The minimum value reached
at -277 after 11 iterations at the location (4.39390103568654, 4.39390103568654),
which is denoted as (4.393901, 4.393901) in Figure.
For a given initial point (-4, -6) with an objective function value of 375, the convergence
to the optimal minimum point (-2.98907601341104, -2.98907601325043) is achieved after
11 iterations in Fig 6. Both CG and ARCG yield the same results without any restarting process described
in Table 2. In this case, no restart is needed to reach the minimum, indicating that the initial
point converges to a local minimum instead of the global minimum. This suggests that
finding global minima depends on choosing a suitable initial point, a topic worthy
of further discussion in future studies.
Fig. 6. For a given initial point (-4, -6) where f(x) is 375, the local optimal minimum
point (-2.98907601341104, -2.98907601325043) with a value of -114 is reached after
11 iterations. Both the nonconvex CG and ARCG show no restarting for the initial point
as shown in Table 2.
Fig 7 illustrates that the minimum value of –277 at (4.39390104015781, 4.39390104218252)
is reached with an error of 9.901277842301936e-10 after the 1687th iteration for the
given initial point (-4, 6) where f (−4, 6) = 453. As observed in the track of directional
vectors of the nonconvex CG in Fig 7, the phenomenon of stagnation incurs significant computational overload, posing a
serious problem in optimization. Experimental results indicate that the convergence
process is stable but not sufficiently fast, taking time to spiral down to the minimum
point slowly with 1687 iterations with the allowed error 1.0e-09.
Fig. 7. The obtained error is 9.901277842301936e-10 for the initial point (-4,6) with
an objective function value of 453. After 1687 iterations, the nonconvex CG method
converges to (4.39390104015781, 4.39390104218252) with the minimum value of –277.
In Fig 8, we depict the track of directional vectors, which is obtained using the ARCG method,
for the given initial point (-4, 6) where f (−4, 6) = 453. The sequence of directional
vectors of this case converges to the optimal point (4.39390103568864, 4.39390103572041)
with the minimum value of –277 with the 29 directional vectors. The track of directional
vectors demonstrates that the proposed ARCG method effectively eliminates the stagnation
of the nonconvex CG process. The convergence process of ARCG is stable and fast enough
to find the optimal point without convergence deterioration compared to the nonconvex
CG method. The restarting occurs only once at the 20th directional vector in ARCG,
reducing the number of iterations from 1687 to 29. The ARCG method employs a restarting
only at the 20th iteration.
Fig. 8. The ARCG method reaches the minima –277 at (4.39390103568864, 4.39390103572041)
after 28 iterations with the initial point (-4, 6), where the function value f (-4,
6) is 453.
In the previously conducted experiments, we established the criterion for detecting
stagnation as the moment when the variable ρ, as defined in equation (15), drops below a threshold of 1.0. We then investigated the effects that arise when
the established threshold is either below or above 1.0. For the nonquadratic nonlinear
objective function, $f({x}_{1,\:}{x}_{2})=0.5{x}_{1}^{4}-{x}_{1}^{3}-8{x}_{1}^{2}+2.5{x}_{1}+0.5{x}_{2}^{4}-{x}_{2}^{3}-8{x}_{2}^{2}+2.5{x}_{2}-10{x}_{1}{x}_{2}$,
we set the threshold η to be 1.0 for the initial points, such as (-0.5, -0.9), (5,
-5), (-4, -6), (-4, 6), (4, -6), and (0, 0) cases. The number of directional vectors
used for the CG, ARCG, and SD methods with the initial points is presented in Table 2 where the ratio in Equation (15) is set to be less than or equal to 1.0. As shown in Table 2, the objective function has two minimum values, -277 and -3, at the points (4.3939010,
4.3939010) and (-2.9890760, -2.9890760) respectively, which is approximated with seven
digits after decimal point. The initial point (-4, -6) converges to the local minimum
(-2.9890760, -2.9890760), while the other initial points such as (-0.5, -0.9), (5,
-5), (-4, 6), (4, -6), and (0, 0) cases converge to the global minimum point (4.3939010,
4.3939010). Essentially, the initial points tend to converge to an optimal point located
near them respectively.
However, the convergence to a local minimum point (-2.989076, -2.9890760) or to the
global minimum point (4.3939010, 4.3939010) depends on the choice of the initial point
in general, which is known as the initial value problem in optimization. This implies
that the convergence to the local or global minimum entirely depends on the surface
dynamics of the objective functions in optimization, necessitating further studies.
Table 2 The number of direction vectors for CG and ARCG for the initial points when
the threshold is set to η=1.0. The initial point (-4, -6) converges to the local minimum
(-2.298907, -2.9890760), while the other five initial points converge to the global
minimum point (4.3939010, 4.3939010).
Objective Function
|
\begin{align*}
f({x}_{1,\:}{x}_{2})=0.5{x}_{1}^{4}-{x}_{1}^{3}-8{x}_{1}^{2}+2.5{x}_{1}+0.5{x}_{2}^{4}\\
-{x}_{2}^{3}-8{x}_{2}^{2}+2.5{x}_{2}-10{x}_{1}{x}_{2}
\end{align*}$\rho\le\eta ,\: \eta =1.0$
|
Methods
|
CG
|
ARCG
|
SD
|
Case 1
|
Iterations
|
78
|
10
|
8
|
Converged ${x}_{1}^{*}$
|
4.39390103449538
|
4.39390103568654
|
4.39390103568655
|
Converged ${x}_{2}^{*}$
|
4.3939010352255
|
4.39390103568654
|
4.39390103568652
|
Error
|
8.881267727026257e-10
|
8.881784197001252e-16
|
7.017681150351688e-12
|
Initial Point
|
${x}_{1}=-0.5$, ${x}_{2}=-0.9$
$f({x}_{1},\: {x}_{2})= -15$
|
Converged Point value
|
$f({x}_{1}^{*},\: {x}_{2}^{*})= -277$
|
Case 2
|
Iterations
|
122
|
11
|
12
|
Converged
${x}_{1}^{*}$
|
4.39390103439962
|
4.39390103568654
|
4.39390103567406
|
Converged
${x}_{2}^{*}$
|
4.39390103477718
|
4.39390103568654
|
4.39390103567906
|
Error
|
8.194639732233808e-10
|
2.141417408469820e-10
|
1.586372380921446e-10
|
Initial
Point
|
${x}_{1}= 5$, ${x}_{2}=-5$
$f({x}_{1},\: {x}_{2})= 475$
|
Converged Point value
|
$f({x}_{1}^{*},\: {x}_{2}^{*})= -277$
|
|
Case 3
|
Iterations
|
11
|
11
|
15
|
Converged
${x}_{1}^{*}$
|
-2.98907601312605
|
-2.98907601312605
|
-2.98907601315108
|
Converged
${x}_{2}^{*}$
|
-2.98907601314108
|
-2.98907601314108
|
-2.989076013204
|
Error
|
4.109032155931346e-10
|
3.052403315124787e-10
|
4.109032155931346e-10
|
Initial
Point
|
${x}_{1}= -4$, ${x}_{2}=-6$
$f({x}_{1},\: {x}_{2})= 375$
|
Converged Point value
|
$f({x}_{1}^{*},\: {x}_{2}^{*})= -114$
|
Case 4
|
Iterations
|
1495
|
29
|
13
|
Converged
${x}_{1}^{*}$
|
4.39390104015781
|
4.39390103569077
|
4.3939010356884
|
Converged
${x}_{2}^{*}$
|
4.39390104218252
|
4.39390103568738
|
4.39390103568305
|
Error
|
9.901277842301936e-10
|
3.309872148302396e-11
|
7.428821742253838e-11
|
Initial
Point
|
${x}_{1}= -4$, ${x}_{2}= 6$
$f({x}_{1},\: {x}_{2})= 475$
|
Converged Point value
|
$f({x}_{1}^{*},\: {x}_{2}^{*})=-277$
|
Case 5
|
Iterations
|
15
|
10
|
14
|
Converged
${x}_{1}^{*}$
|
4.39390103563522
|
4.39390103568654
|
4.39390103565784
|
Converged
${x}_{2}^{*}$
|
4.39390103563426
|
4.39390103568654
|
4.39390103567814
|
Error
|
2.163227819026375e-10
|
8.881784197001252e-16
|
2.163227819026375e-10
|
Initial
Point
|
${x}_{1}= 4$, ${x}_{2}=-6$
$f({x}_{1},\: {x}_{2})= 747$
|
|
$f({x}_{1}^{*},\: {x}_{2}^{*})= -277$
|
Case 6
|
Iterations
|
1
|
1
|
3
|
Converged
${x}_{1}^{*}$
|
4.39390103568654
|
4.39390103568654
|
4.39390103568654
|
Converged
${x}_{2}^{*}$
|
4.39390103568654
|
4.39390103568654
|
4.39390103568654
|
Error
|
0.0
|
0.0
|
0.0
|
Initial
Point
|
${x}_{1}= 0$, ${x}_{2}=0$
$f({x}_{1},\: {x}_{2})= 0$
|
Converged Point value
|
$f({x}_{1}^{*},\: {x}_{2}^{*})= -277$
|
However, the convergence to a local minimum point (-2.989076, -2.9890760) or to the
global minimum point (4.3939010, 4.3939010) depends on the choice of the initial point
in general, which is known as the initial value problem in optimization. This implies
that the convergence to the local or global minimum entirely depends on the surface
dynamics of the objective functions in optimization, necessitating further studies.
In experiments, the case of the initial point (-4, 6) using CG exhibits much worse
convergence stagnation compared to the other five initial positions, as shown in Table 2. It requires 1495 iterations to satisfy the error requirement, where the error needs
to be less than $1.0\times 10^{-9}$. As depicted in Table 2, the initial points located near the minimum point in a nonconvex region with respect
to an optimal point tend to converge fast to the optimal points, as observed with
the initial points (0, 0) and (4, -6).
However, for the initial points (-0.5, -0.9), (5, 5), and (-4, 6), the CG method generates
stagnation, resulting in slow convergence. This phenomenon appears to reflect the
shape of the surface of the objective function, indicating that the surface structure
is somewhat far from being nonconvex with respect to the optimal points. The initial
point (-4, -6) converges to a local minimum, while the other five initial points converge
to the global optimal point.
In this work, we proposed a new restarting scheme, the Adaptive Restarting Conjugate
Gradient (ARCG) method, to effectively remove stagnation in CG, significantly reducing
the convergence time when the initial value in CG deteriorates due to convergence
stagnation. However, for a quadratic nonlinear objective function , both the CG and
ARCG methods generally outperform the SD method in terms of convergence speed. For
example, as demonstrated in Table 3, convergence with the CG method is significantly faster than with the SD method.
Furthermore, Table 3 shows that the ARCG method achieves effective convergence at a rate comparable to
that of the CG method.
Table 3 The number of iterations of a convex objective function for CG, SD and ARCG
with the threshold and the permitted convergence error $\epsilon_{s}=1.0\times 10^{-9}$,
which converges to (2, -1) with f (2, -1) = -3.
Objective Function
|
$f({x}_{1,\:}{x}_{2})={x}_{1}^{2}+{x}_{2}^{2}+{x}_{1}{x}_{2}-3{x}_{1}$ ,
$\rho\le\eta ,\: \eta =1.0$, $\epsilon_{s}=1.0\times 10^{-9}$
|
Initial Point
|
CG
|
SD
|
ARCG
|
Converging Point
|
(0,0)
|
2
|
30
|
2
|
(2, -1)
|
(4,0)
|
2
|
30
|
2
|
(2, -1)
|
(4,4)
|
2
|
13
|
2
|
(2, -1)
|
(-4, -4)
|
2
|
11
|
2
|
(2, -1)
|
The threshold $\eta$ serves as a reference for comparing the ratios between two successive
reference vectors, as described in Equations (14) and (15). The CG method restart is initiated only if the ratio falls below this threshold.
The experiment demonstrates that the proposed method, the ARCG method, outperforms
the conventional CG method in the convergence process. By eliminating or weakening
the dependence on previously found conjugate directional vectors when applying the
steepest gradient method as a restart, the proposed method addresses the issue of
deterioration. This improvement in the convergence process mitigates the tendency
to meander around the optimal point caused by the effect of conjugates, which is a
serious drawback of CG.
5. Conclusion
The issue of stagnation in the CG methods, especially when tackling nonquadratic objective
functions, is widely recognized and poses a significant challenge to the convergence
efficiency of these methods. This problem, fundamentally linked to the methods’s reliance
on conjugation between directional vectors, can severely impede progress towards an
optimal solution. The ARCG method introduces a novel solution to enhance convergence
by systematically addressing the root causes of stagnation observed in traditional
CG iterations.
One of the hallmark features of the ARCG method is its innovative use of a restarting
mechanism which plays a pivotical role in steering the convergence process more swiftly
towards the optimal minimum. Unlike backtracking methods, which primarly focus on
step size adjustments, the ARCG employs advanced vector properties to sensitively
detect and adapt to convergence variations. This is achived through a dynamic restarting
algorithm that adjusts to the changing curvatures encountered on the surface of nonlinear
objective functions, showcasing a nuanced understanding of the problem space.
A particularly distinctive aspect of the ARCG method is its strategic focus on the
cumulative trajectory of conjugate directional vectors. This approach, diverging from
traditional methods that may only consider the magnitude of directional vectors, offers
a more holistic and potentially more effective strategy for overcoming stagnation.
Such a comprehensive view on navigating through the optimization process underscores
the method’s innovative approach to stagnation detection and avoidance, contributing
to its superiority.
The importance of selecting appropriate initial values is also highlighted, suggesting
a promising area for future research in nonlinear optimization. The ARCG method stands
out not only for its technical innovations but also for its practical benefits, including
the potential for significant time and resourxe savings. This advantage positions
the ARCG method as a valuable too for a wide range of applications across various
fields.