Computational aspects of column generation for nonlinear and conic optimization: classical and linearized schemes

Solving large scale nonlinear optimization problems requires either significant computing resources or the development of specialized algorithms. For Linear Programming (LP) problems, decomposition methods can take advantage of problem structure, gradually constructing the full problem by generating variables or constraints. We first present a direct adaptation of the Column Generation (CG) methodology for nonlinear optimization problems, such that when optimizing over a structured set X\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${\mathcal {X}}$$\end{document} plus a moderate number of complicating constraints, we solve a succession of (1) restricted master problems on a smaller set S⊂X\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${\mathcal {S}}\subset {\mathcal {X}}$$\end{document} and (2) pricing problems that are Lagrangean relaxations wrt the complicating constraints. The former provides feasible solutions and feeds dual information to the latter. In turn, the pricing problem identifies a variable of interest that is then taken into account into an updated subset S′⊂X\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${\mathcal {S}}'\subset {\mathcal {X}}$$\end{document}. Our approach is valid whenever the master problem has zero Lagrangean duality gap wrt to the complicating constraints, and not only when S\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${\mathcal {S}}$$\end{document} is the convex hull of the generated variables as in CG for LPs, but also with a variety of subsets such as the conic hull, the linear span, and a special variable aggregation set. We discuss how the structure of S\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${\mathcal {S}}$$\end{document} and its update mechanism influence the number of iterations required to reach near-optimality and the difficulty of solving the restricted master problems, and present linearized schemes that alleviate the computational burden of solving the pricing problem. We test our methods on synthetic portfolio optimization instances with up to 5 million variables including nonlinear objective functions and second order cone constraints. We show that some CGs with linearized pricing are 2–3 times faster than solving the complete problem directly and are able to provide solutions within 1% of optimality in 6 h for the larger instances, whereas solving the complete problem runs out of memory.


Introduction
Decomposition methods are fundamental tools to solve difficult large scale problems. In this work, we focus on Column Generation (CG) algorithms, where the number of variables is too large to allow a direct solution with an off-the-shelf optimization software. More formally, we focus on solving the following problem: where C is a cone in some Euclidean space, X is a high-dimensional structured set and f and g are generic mappings 1 . The feasibility set of the auxiliary variables y is fully defined by constraints −g(x, y) ∈ C.
Defining n, n 0 and m as the respective dimensions of x, y and C , we assume that the main issue with P(X) is the joint presence of the difficult (or coupling/ side) constraints −g(x, y) ∈ C and the magnitude of n ≫ 1 . In other words, we consider a setting in which: (1) for some low-dimensional subset S ⊆ X , P(S) can be solved efficiently, and (2) P(X) without the constraints −g(x, y) ∈ C gives birth to an efficiently solvable problem. We propose to use these simpler optimization problems to build a computationally efficient solution method for P(X) . For the sake of simplicity, we do not consider equalities in the complicating constraints (the generalization is straightforward).

Examples
We now show some examples of problems that have the structure we just described. For given integers p, q > 0 , we denote [p] ∶= {1, ..., p} , || ⋅ || q is the q-norm in ℝ p , the Lorentz cone of dimension p + 1 is the set L  [L] and C ∶= {(u l , u l 0 ) l∈ [L] ∶ ||u l || 2 ⩽ u l 0 , ∀l ∈ [L]} with L reasonably small:

3
Computational aspects of column generation for nonlinear… Notice that partially block-angular LPs are a particular case of Example 1 (i.e. when A k , a k , X l , Y l and b l are all zero).

Example 2 Partially separable Semidefinite Programming (SDP) where
x k j A kj ⪰ B k , ∀k ∈ [K]} and C is a cone of SDP matrices of reasonable dimension: Several existing works already use CG to tackle this kind of problem with e.g. polyhedral approximations of the cone of SDP matrices as their S sets [2]. CG could also be used to generate the block factor-width-2 cone of matrices that is used in [68] to inner approximate huge SDP cones. Example 1 and some inner approximations of high-dimensional SDP cones are a particular case of Example 2. Finally, a vast majority of problems that are tractable computationally when their data is known become way harder when considering their stochastic counterpart, i.e. when their data-such as their objective function or constraints-are uncertain and considered random [5]. They can become highly nonlinear or too big when-to name a few-using risk measures or chance constraints. As we see later in our experimental setup, the risk-averse portfolio optimization problem that we consider allows our CG scheme to take advantage of the original (deterministic) features of the problem. g i x 1 , ..., x K , y ⩽ 0, ∀i ∈ [m].

Preliminaries
We now introduce several definitions that are used through this paper. We call the dual cone of the cone C , the set C * defined as C * ∶= {u ∶ ⟨u, v⟩ ⩾ 0, ∀v ∈ C} . For any penalization vector ∈ C * , let us define the following Lagrangean relaxation: For any ∈ C * we have (X, ) ⩽ (X) . The Lagrangean dual of P(X) is: We say that P(X) has no Lagrangean duality gap if P(X) and D(X) share the same optimal value (X) . Notice that the concept of Lagrangean dual is associated to the constraints that are relaxed, which in this work are the constraints −g(x, y) ∈ C . On another hand, given a set S ⊆ X , we define the restricted problem as follows: Given that the original problem P(X) is a relaxation of P(S) for any S ⊆ X , we have (X) ⩽ (S) . If S contains the projection onto the x-components of an optimal solution of P(X) , notice that we have (X) = (S) and P(S) returns optimal solutions for P(X).

Main concept
In this paper, we extend the concept of CG beyond the scope of LPs and simplicial approximations of X , while keeping a similar philosophy: in the LP case, at each iteration k we solve an approximation P(S k ) of P(X) that uses S k ∶= conv(x l ) l∈ [k] , the convex hull of a family of columns x l , each belonging to X . Doing so, we obtain an upper bound (S k ) on (X) and retrieve the corresponding optimal dual multipliers k . These multipliers are fed to the pricing problem L(X, k ) that returns an optimal solution (x k ,ȳ k ) and provides a lower bound (X, k ) for (X).
As pictured in Fig. 1, we iteratively refine both problems until the optimality gap (X, k ) − (S k ) is under some tolerance. Our approach generalizes CG in several ways: under reasonable conditions (1) the approximating set S k does not have to be a convex hull of previous columns, and (2) P(X) is not necessarily an LP but P(S k ) must have zero Lagrangean duality gap wrt the complicating constraints (L(X, )) (X, ) ∶= min x∈X,y {f (x, y) + ⟨ , g(x, y)⟩}.
Computational aspects of column generation for nonlinear… −g(x, y) ∈ C . Further, (1) depending on the structure of the approximations S k , the master problem can be greatly simplified and (2) under some convexity assumptions, it is possible to replace L(X, k ) by a computationally easier pricing. We now present the assumptions used in this article.

Working hypothesis
We make two kinds of assumptions: the first ensures the validity of our framework and the remaining are necessary to make it computationally efficient: Assumption 1 S is such that P(S) can be solved by a Lagrangean algorithm that provides a multiplier ∈ C * such that (S) = (S, ).

Assumption 2
For any ∈ C * we can solve efficiently L(X, ) in practice.

Assumption 3
The choice of S makes P(S) efficiently solvable in practice.
Assumption 1 implies that P(S) has no Lagrangean duality gap and we have an algorithm to find an optimal primal-dual pair ((x, y), ) for P(S) ; notice that (x, y) is also optimal for L(S, ) . Assumption 1 is satisfied by many optimization problems such as e.g. LPs or Linear Conic problems (LC) and Nonlinear optimization Problems (NLP) that satisfy a Constraint Qualification [43], which can be solved using an interior point method [41]. Slater's condition is a popular constraint qualification that is satisfied if the problem at hand is convex and strictly feasible 2 : Although Assumption 1 may sound overly restrictive, we show in Proposition 1 that for a regular enough S , P(S) satisfies Slater's condition. Notice that we only need P(S) to satisfy Assumption 1, while P(X) may not, as we show in Example 4: Example 4 Consider the following sets and the following SDP P(X): Because P(S) is an LP, it satisfies Assumption 1, while P(X) does not by having a nonzero duality gap (see Appendix A for the details). This example shows that our framework could possibly solve some problems having a nonzero duality gap. Assumptions 2 and 3 are not needed from a theoretical point a view, however, they are essential for our methodology to be competitive computationally. Assumption 2 is the basic assumption for classical CG for LPs and means that the pricing problem L(X, ) is either (1) block-decomposable thanks to the structure of X and can be solved in parallel, or (2) there is an efficient dedicated algorithm to solve it. Finally, Assumption 3 says that S is e.g. low dimensional and defined with a few constraints 3 . Both Assumptions 2 and 3 depend on the ability of the modeller to identify a set of constraints that are relaxed and choose a set S such that both P(S) and L(X, ) are easy enough to solve in practice.
Our objective is to design an iterative search in terms of both and S that successively improves the lower and upper bounds (X, ) and (S) , returning increasingly good feasible solutions as a byproduct. Our framework achieves this goal by feeding information from one problem to the other by updating respectively from P(S) and S from L(X, ) , while choosing computationally efficient approximations S and pricing problems.

Dantzig-Wolfe for LPs
To illustrate our point, consider the following LP as a special (well known) case of P(X) -i.e. when X is a polyhedron, f and g are linear mappings and C ∶= ℝ m + : Because of the high dimensionality of the polyhedron X and the presence of (1c) breaking any eventual structure, even state-of-the art solvers cannot tackle directly this kind of problem. Suppose in our case that optimizing a linear objective over the (high-dimensional) polyhedron X is easy in practice and, as pointed out before, if we were to replace X by some wisely chosen subset S ⊆ X in (1), we would obtain a computationally cheap upper bound for the optimal value of (1).

3
Computational aspects of column generation for nonlinear… Master problem P(S) In this LP case, there is a natural choice for S readily available [7]: Letting V be the set of vertices of X and R a complete set of extreme rays of X , we have X = conv V + cone R , where cone R is the conic hull of R: Problem (1) can thus be rewritten as the following extensive formulation: Because X ∶= {x ⩾ 0 ∶ Ax = a} is convex and each x l belongs to X , notice that the side constraints Xx + Yy ⩾ b are the only remnants of the original problem. The LP dual of the extended formulation is the following problem: A direct solution of problem (2) is in general impractical as its number of variables can be exponential in (n, n 0 , m) . However, the Dantzig-Wolfe (DW) algorithm [16] offers a solution method successively generating vertices and extreme rays of the polyhedron X . It starts with finite subsets V ⊂ V and R ⊂ R and solves a restricted master problem (2) with V and R instead of the full sets V and R . With our notation, this restricted master problem is none other than P(S) with S ∶= conv V + cone R . Making different choices for S and consider a broader class of optimization problems is one of the central ideas of this paper.
Pricing problem L(X, ) Obtaining the optimal dual variables associated with constraints (2b) in P(S) , the Lagrangean relaxation L(X, ) is solved: By dual feasibility (3d) of , implying that Y ⊤ = d , it can be rewritten

3
thus eliminating the variables y from the pricing problem. Discarding this dependency in y is, however, not always possible in a nonlinear setting. Letting x be an optimal solution of the pricing problem (4), with a slight abuse of notation we refer to an optimal solution to either (1) a vertex of X if the pricing problem in x has a bounded optimal objective value, or (2) an extreme ray of X otherwise. In the latter case we increment R ←R ∪ {x} and in the former V ←V ∪ {x} , which defines the particular update mechanism used by DW. We iterate until an optimality tolerance criterion is satisfied or until we generated the complete sets V and R , thus solving the full, original problem P(X).

Decomposition methods and previous work
CG algorithms were studied in depth for LPs [16,36] or Mixed Integer Linear Programming (MILP) problems [4,55], where notoriously large MILPs could be solved by embedding DW in a branch-and-price framework [14,18]. The main idea of DW is to exploit the structure of X and solve smaller problems: (1) the master problem, that works over a reduced subset S ⊂ X while keeping the side constraints; and (2) a pricing problem that is still large but is computationally easy to solve thanks to the absence of side constraints.
In a nonlinear setting, several algorithms such as the Alternating Direction Method of Multipliers (ADMM) [64], the Douglas-Rachford splitting operator [21] or augmented Lagrangean algorithms [53] all make use of a special structure in X . However, they all solve inexactly the Lagrangean dual D(X) and do not always provide feasible solutions for P(X) . Further, proving optimality or near optimality can be tricky and a concrete stopping criterion is also not always available. Closer to a generalization of CG for NLPs, the convex simplex method [65] minimizes a nonlinear objective over a polyhedron. It can be seen as solving a master problem over a basis of variables and-similar to the simplex algorithm-selecting the entering variable by linearizing some penalization function. Akin to DW, the simplicial decomposition [61] solves a linearized master problem over a subset S that is the convex hull of a handful of points, and the pricing problem generating such columns is the original problem P(X) with an objective linearized at an incumbent point. [39] has the same master problem but the pricing problem uses a penalty function instead of a lagrangean relaxation and does not consider generic conic constraints.
Problem-dependent CG schemes for NLPs were presented in e.g. [42] for nonlinear facility location problems that are reformulated as set partitioning problems and solved with DW, whose pricing problem is an NLP with integer variables; [15] that uses a branch-and-price scheme for sibling groups reconstruction, which is reformulated as a set covering problem for which columns are generated with quadratic optimization pricing problems. A direct extension of DW for NLPs with C = ℝ m + is introduced in [9]. DW is used in [40,44] for mixed integer nonlinear-nonconvex optimization problems with polyhedral complicating constraints.

3
Computational aspects of column generation for nonlinear… In an LC setting-i.e. f and g are linear mappings but C is a more general cone than ℝ m + -similar extensions of CG have been developed: [2] present a decomposition procedure for SDPs where S is an inner approximation of a matrix set X , that are updated with the (matricial) "columns" generated by a separation problem tailored for SDPs. The approach has two drawbacks: (1) depending on the inner approximation chosen, the master problem can be slow to attain near optimality, and (2) the pricing problem is a handmade separation problem that uses problem-specific considerations and does not provide dual bounds. For SDPs, chordal sparsity patterns [59] are able to detect underlying substructures that can be exploited by ADMM [58,66,67], but no CG approach has been attempted so far. The presence of a special substructure being crucial for decomposition, automatic structure detection in LPs were developed in [6,31,63] so that a decomposition method can make use of it. Previous CG methods for LC have focused on gradually building the set of variables considered with problem specific algorithms that are difficult to generalize.
Other works use a different kind of set S for LPs. A subset S consisting on forcing clusters of variables to share the same value-thus aggregating the variables together-is used in [38] and [8]. This variable aggregation principle has been successfully applied to Freight routing [54], general extended formulations [55], open pit mining scheduling [8], pricing problems [3], quadratic binary knapsack problems [50], support vector machine problems [45] or in a column-and-row generation context where DW is used in combination with a constraint aggregation scheme to solve resource constrained covering problems [51]. Finally, [23,24] introduce a CG scheme for almost generic sets S and pricing problems. However, they do not take into account generic conic side constraints and link their generic scheme with only a few special cases, whereas we take advantage of the structure of both the pricing and master problems.

Article outline
Section 2 presents a CG algorithm to solve the generic NLP P(X) with a large number of variables and nonlinear-conic side constraints. We show that it admits several existing schemes as special cases, all defined by different sets S . We present sufficient conditions to (1) drop the optimization in the y variables for the pricing problem and (2) make sure that P(S) has no Lagrangean duality gap. As the Lagrangean relaxation of a nonlinear optimization problem can be as hard as the problem itself, under some convexity assumptions we present in Sect. 3 a linearized version of the methodology making the pricing problem easier to solve. Additionally, we also prove that L(X, ) can always be independent of the secondary variables y in the linearized algorithm. In Sect. 4, we point out the relationships of our generic schemes to existing frameworks. In Sect. 5 we describe the risk-averse portfolio optimization problem on which we test our algorithms, and present several computational enhancements. In Sect. 6, we present numerical results on large scale synthetic instances and empirically prove the usefulness of our methodology. We conclude with some remarks and the description of several ongoing works in Sect. 7.

Background notations
Given a set U , we call respectively conv U , cone U , lin U , aff U , relint U and dim U , the convex hull, the conic hull, the linear span, the affine span, the relative interior and the dimension of U . The adjoint U * of a linear mapping U ∶ U → V is the linear operator such that ⟨Uu, v⟩ = ⟨u, U * v⟩ for any (u, v) ∈ U × V . For any integer If is real-valued and differentiable, we call its linear approximation at some ū ∈ U the function: is the Jacobian of at ū . Given a linear mapping and a mapping , we say that Unless otherwise specified, through this document ((x k , y k ), k ) is an optimal primal-dual pair for P(S k ) and (x k ,ȳ k ) is an optimal solution for L(X, k ).

A generic column generation algorithm
Instead of using specific forms of feeding the pricing information to the restricted problem, we use a generic mechanism to update S at each iteration as described in Algorithm 1. Fig. 2 summarizes the relationships between the bounds of the problems involved in Algorithm 1. In all the "bound relationship" figures of this paper, an edge a → b means that a ⩽ b , the gray edges are the nontrivial relationships that apply at a stopping criterion and the gray nodes are the optimal values of the problems solved by the algorithm.

Theorem 1 At termination, Algorithm 1 returns an optimal solution for P(X).
Proof If Algorithm 1 terminates at line 8, we have x k ∈ S k . (x k ,ȳ k ) is then feasible and optimal for the Lagrangean relaxation of P(S k ) with k : In consequence, we have that (X, k ) = (S k , k ) . Recall that (x k , y k ) is optimal for P(S k ) and k is an optimal dual vector associated to −g(x, y) ∈ C in P(S k ) . From Assumption 1, (x k , y k ) is then also optimal for L(S k , k ) , and (S k ) = (S k , k ) , thus proving the gray edges in Fig. 2. To summarize, we have: making (x k , y k ) optimal for P(X) . Now, if Algorithm 1 terminates at line 5, we can choose (x k ,ȳ k ) = (x k−1 ,ȳ k−1 ) and have x k ∈ S k , which is the first stopping condition at line 8. ◻

General remarks
Algorithm 1 can be useful only if Assumptions 2 and 3 are satisfied, i.e. either dim S k is significantly smaller than n, or P(S k ) possesses a special structure or is sparser than in P(X) . In Sects. 4 and 5, we show that the master problem can be considerably shrunk depending on the set S k in use. Even though Assumption 1 must be satisfied in order to get the dual variables and make Theorem 1 hold, in presence of a nonzero duality gap Algorithm 1 can still be used as a heuristic that provides optimality bounds. Notice again that we do not need P(X) to have zero duality gap, but we do see in the next Subsection that it does help to make Assumption 1 hold. We did not prove that Algorithm 1 always finishes but we did prove that if it were to stop, it would return an optimal solution. Its finite termination depends on the way the sets S k are generated in combination with the structure of the original problem. More precisely, the sequence S k should shift towards at least one optimal solution of P(X) by e.g. strictly growing in dimension until reaching X in the worst case (which can be achieved in some special cases that we describe later). Ideally, the sets S k should contain increasingly good solutions for P(X) such that we can stop prematurely the algorithm and still obtain an approximately optimal solution. For example, by forcing S k+1 to contain S k or x k , we ensure that the primal bound (S k ) is nonincreasing. Because we maintain lower and upper bounds over the optimal value of P(X) at each iteration, early termination can be reasonably used and Algorithm 1 can provide a solution (x k , y k ) feasible for P(X) with an optimality gap (S k ) − (X, k ) . Because the algorithm is not guaranteed to converge, we set a maximum number of iterations and execution time that we can spend before forcing the algorithm to stop.

How can the restricted problem P(S) maintain a zero duality gap?
As we mentioned earlier, it is not clear when Assumption 1 can hold. We now show that whenever (1) the approximated sets S k are described by convex constraints and are strictly feasible, and (2) P(S k ) admits as a feasible solution some known, feasible solution for P(X) satisfying strictly the side constraints −g(x, y) ∈ C , then Slater's condition holds for P(S k ) and thus satisfies Assumption 1: such that x ∈ X and −g(x,ȳ) ∈ relint C , and S is described as for some proper cone K , a linear mapping and a K-convex mapping . If S ⊆ X is strictly feasible and contains x , then P(S) has no Lagrangean duality gap.
In this work, we consider approximated sets S k that are nonempty polyhedra 4 that always contain the projection onto the variables x of a known, strictly feasible solution for P(X) . In fact, it is enough to find some feasible solution (x 0 ,ȳ 0 ) for P(X) such that −g(x 0 ,ȳ 0 ) ∈ relint C , and set S 1 ∋x 0 : In other words, we consider x 0 as the first column generated.

y-independent pricing problems L(X, k )
Notice that L(X, k ) optimizes in both x and y and is still an NLP that can be as difficult to solve as P(X) . We partially address the former issue in the next Proposition

3
Computational aspects of column generation for nonlinear… and both in the next Section. We now introduce an adaptation of the P-property 5 [25]: Given a function ∶ U × V → ℝ , consider the following NLP: can be solved independently of u ∈ U . Even if this P-property appears to be overly restrictive, it holds if We are now ready to state a y-independency result for the Lagrangean relaxation L(X, k ): Proposition 2 If L(X, k ) satisfies the P-property wrt y, then y = y k is always an optimal choice in L(X, k ) , which becomes an optimization problem in x only: Proposition 2 allows to drop the optimization in y in the pricing problem if the P-property holds. We now show that using a linearized version of L(X, k ) , we can always drop the optimization in y in the pricing, regardless of the P-property.

A linearized column generation algorithm
In practice, it is common to face e.g. LPs with a nice structure that admit tailored algorithms to solve them, getting hardened by replacing their linear objective with a nonlinear objective function f and/or adding possibly nonlinear side constraints −g(x, y) ∈ C to their polyhedral feasible set X . This can happen for e.g. robust optimization problems [5] that are no easier than their deterministic counterparts.
In this Section, we show that solving a pricing problem whose objective function is linearized at the current incumbent (x k , y k ) holds the same guarantees as Algorithm 1, while alleviating the difficulty of solving the pricing problem whenever optimizing a linear objective over X is an easy task. As we show in our experiments, this can be extremely useful whenever a linear objective pricing problem can be solved with a dedicated algorithm.

Additional assumptions and results
We now present several results that allow the use of a linearized version of L(X, k ) as a pricing problem. In this Section, the following extra assumption is met: Further, we assume that for any cost vector c, min x∈X ⟨c, x⟩ can be solved efficiently (which is in fact equivalent to Assumption 2 if the objective function of L(X, k ) is linear). We now present several technical Lemmas to prove our main result: Proof Appendix F. ◻ Lemma 3 tells us that an optimal solution of a convex optimization problem is also optimal for the same problem with an objective function linearized at said solution. For any S ⊆ X , ∈ C * and (x,ȳ) , let us define the following problem: which is L(S, ) with its objective function linearized at (x,ȳ).

A linearized algorithm
Now consider the following algorithm: (1) instead of solving L(X, k ) , the pricing we solve is its linear approximation L [x k , y k ](X, k ) at the current incumbent (x k , y k ) of the restricted problem P(S k ) and (2) the stopping criterion at line 5 of Algorithm 1 is replaced by a slightly more restrictive condition. We describe the changes applied to Algorithm 1 in Algorithm 2 and illustrate in Fig. 3 the relationships between the bounds of the problems involved in it.

3
Computational aspects of column generation for nonlinear… Theorem 2 Algorithm 2 returns an optimal solution for P(X) at termination.
Proof If Algorithm 2 terminates because x k ∈ S k , then (x k ,ȳ k ) is feasible and optimal for L [x k , y k ](S k , k ) and we have: Because f is convex and g is C-convex, Lemmas 1 and 2 imply that f [x k , y k ] and ⟨ k ,ḡ[x k , y k ](⋅)⟩ are global under estimators of f and ⟨ k , g(⋅)⟩ respectively. In consequence we have that On another hand, (x k , y k ) is also an optimal solution for L(S k , k ) from Assumption 1. Finally, we can interpret L[x k , y k ](S k , k ) as the linearization of L(S k , k ) at one of its optimal solutions (x k , y k ) . Recalling that: Lemma 3 then tells us that (x k , y k ) is also an optimal solution for L [x k , . To summarize, we obtain thus proving the optimality of (x k , y k ) for P(X) . If Algorithm 2 stops from its criterion at line 5, we can choose (x k ,ȳ k ) = ( x k , y k X, k ⩽ X, k ⩽ (X). Fig. 3 Relationships of the optimal values involved in Algorithm 2 As in the nonlinearized case, there is no guarantee that Algorithm 2 terminates but if it does, an optimal solution for P(X) is returned.

y-independent pricing problems L [x k , y k ](X, k )
As opposed to Algorithm 1, we now show that regardless of the P-property, we can always get rid of the variables y in the linearized pricing L [x k , y k ](X, k ): Proposition 3 Taking y = y k is always optimal for L [x k , y k ](X, k ) , which becomes a linear objective minimization problem in x only:

"Reduced costs"
Recall that DW for LPs can stop whenever the reduced costs 6 are zero: we now show that this is also true for the linearized scheme in our nonlinear setting.
Proof The linearized pricing in x is min x∈X ⟨c k , x⟩ . If c k = 0 , then any x k ∈ X is optimal: choosing any x k ∈ S k ⊆ X satisfies the stopping criterion. ◻ This last result provides a computationally cheap stopping criterion for the nonlinearized scheme as well, as the pricing problem-linearized or not-always provides a lower bound for (X) : an all-zeroes reduced cost vector ensures that the master problem cannot be improved.

Dantzig-Wolfe
Assume that X ⊆ ℝ n is a polyhedron (that we consider bounded for simplicity), f (x, y) ∶= c ⊤ x + d ⊤ y and the conic inequality is defined by C ∶= ℝ m + and g(x, y) ∶= b − Xx − Yy . Using S k ∶= conv(x l ) l∈{0,...,k−1} we retrieve DW for LPs. Its finite convergence is ensured by the fact that X has a finite-although exponential in general-number of extreme points.
Computational aspects of column generation for nonlinear… Extensions Notice that if C is a more general cone, our algorithm generalizes DW for LCs, where at each iteration P(S k ) is an LC and L(X, k ) is an LP. Applications of DW to LCs can be found in [2] and references therein. Further, there is a direct extension of DW to a special class of nonlinear problems: going back to the general case for C , f and g but keeping X polyhedral, L [x k , y k ](X, k ) turns out to be an LP. Because X is finitely generated, the finite convergence of the linearized algorithm is also ensured with the same arguments.
Constraint redundancy If X is not convex, to enforce S k ⊆ X , we must use thus potentially losing the advantage of dropping any X-defining constraint in the master problem as in the LP/convex cases. Similarly, if the conic hull is used instead and X is a cone, it is sufficient to use

Bienstock-Zuckerberg (BZ)
We now link our framework with a decomposition scheme for LPs [8] where the choice of S differs substantially from DW. Considering a partition , we force all the variables x j belonging to a same cluster J k l to yield the same value, ultimately aggregating all the variables into a single one. In other words, we use Update mechanism The update mechanism in this case refines the partition J k into a new partition J k+1 , by splitting some of its clusters such that the new column x k belongs to S k+1 ∶= P(J k+1 ) . We call a partition induced by some x ∈ ℝ n , a partition J( Given a partition J k and J k (x k ) a partition induced by x k , we first compute the refined partition J k+1 ∶= J k ΔJ k (x k ) and the new restricted set is given by S k+1 ∶= P(J k+1 ).
Extensions and convergence Such a scheme makes Algorithms 1 and 2 generalizations to nonlinear problems of the BZ algorithm [8]. This time the convergence is not ensured by some property of P(X) , but rather thanks to the structure of the sets S k . In fact, either (1) the partition is refined until turning into {{1}, ..., {n}} , meaning we reached the original problem P(X) , or (2) the partition is not refined, in which case J k+1 = J k ΔJ k (x k ) = J k . It is not difficult to see that the latter implies that x k ∈ S k = P(J k ) , which is a stopping criterion for both Algorithms 1 and 2. The former implies that we are guaranteed to converge

3
to an optimal solution in at most n iterations because the size of the partition increases by at least one (i.e. |J k+1 | > |J k | ). Motivating the scheme in the next Subsection, [38] show that given a sequence of columns (x l ) l∈{0,...,k−1} , we have S k ⊇ X ∩ lin(x l ) l∈{0,...,k−1} . Induced partition cardinality Notice that the pricing problem may provide a column x k with a large number of different values, hence generating a highcardinality induced partition J(x k ) and increasing rapidly the size of the partition J k+1 used in the next restricted problem P(S k+1 ) . This issue is partially addressed by the linearized Algorithm 2 and completely circumvented in the next scheme. For example, if X is polyhedral and possesses the integrality property [26], the pricing problems L [x k , y k ](X, k ) can return integer optimal solutions x k , thus increasing the probability of having a reduced number of different values.

Constraint redundancy
Given that {x ∶ x j = l , ∀j ∈ J k l , ∀l ∈ [L k ], for some ∈ ℝ L k } is not necessarily contained in X , we need to keep the X-defining constraints in general. This issue can sometimes be avoided for e.g. bound constraints ⩽ x ⩽ u that are present in the definition of X : they become equivalent to the following L k ≪ n bound constraints in terms of :

Non-partitioned BZ
[38] link the last scheme for MILPs to another that uses at each iteration the subset of X spanned by lin(x l ) l∈{0,...,k−1} , i.e.
Akin to the classical BZ, it converges in at most n iterations. What is not mentioned in [38] is that this is also true independently of the structure of P(X) . Instead of using partitions of variables, it uses the raw directions x l , thus avoiding an explosive increase in the number of variables. This comes at the cost of a less structured P(S) : in fact, variable aggregation is akin to a contraction operation in combinatorial optimization, which can eliminate a substantial amount of rows and columns when dealing with structured LPs.
Notice that even if we can maintain a reasonable number of variables in the master problem, the loss in structure in comparison to BZ prohibits in general the use of the trick we present in (6), and all the X-defining constraints must be kept, including variables bounds.

How do we check if x k ∈ S k ?
In our experimental design, we consider the four aforementioned sets S k . We now show how to determine if x k ∈ S k efficiently in those special cases: For BZ, it is enough to check if the size of the partition after refinement increased or not; For the linear span case, it is enough to check if x k is a linear combination of the previous columns, which is done by projection. In the convex and conic hull cases, we must check whether a small LP is feasible. Let the polyhedron Θ k be as follows: In these cases, we have that To avoid this large number of constraints, we choose to solve the following problem instead: We can see that x k ∈ S k iff distance 2 2 x k , S k = 0 . There are two advantages to use (7): (1) we are able to monitor the distance of the current column x k to S k , and (2) when solving (7) with an interior point method, the O(k) constraints that define Θ k are not a problem because k ≪ n and the gradient and Hessian of the penalized objective have dimensions k and k × k respectively.

x-free master problems P(S) and constraint redundancy
First, notice that all of the aforementioned schemes are of the form meaning that P(S k ) is equivalent to the following problem in and y only: Further, we are sometimes able to shrink or make redundant some X-defining constraints in P(S k ) . This is of crucial importance from a computational point of view, as we must make P(S k ) as simple to solve as possible: Given a generic set G , a convex set V , a cone K , a subspace E ∶= {x ∶ Ex = 0} and some variable bounds (if x is a vector) B ∶= {x ∶ ⩽ x ⩽ u} , consider that We summarize in Table 1 how each of these X-defining constraints can be simplified depending of the scheme used: If using the conic hull.

3 5 Risk-averse portfolio optimization problem
We now describe the NLP we test our algorithms on. We consider the portfolio optimization problem of determining which assets to buy-with uncertain returnssuch that (1) some risk of being rewarded a poor outcome is minimized, and (2) the variance of the return is kept under some threshold. As opposed to a classical expected value maximization model, using the variance and nonlinear risk measures makes the resulting optimization problem a large scale, nonlinear objective, SOC constrained optimization problem (See e.g. [1,17,33,34,60] for non expected value portfolio optimization).

Problem description
An administrator must allocate the resources of T + 1 different clients interested in disjoint subsets of stocks. The administrator must minimize some client-dependant risk measures while keeping the variance of some returns under some threshold 2 . The clients have budgets b t and are interested in n t assets each. We purposefully separate the 0th client from the T others as its associated decision variables are only a handful that we model with the y variables. Client t has to pay a unitary cost a t j per asset j. Each asset j has an uncertain future value c t j and at most u t j units can be purchased . The variables x t j ( y j ) represent the amount of each asset j purchased by client t (0). We enforce that x t ∈ X t and y ∈ X 0 where y ∈ X 0 is considered to be part of the side constraints −g(x, y) ∈ C . We define d t as the vector of returns c t , where many components are zero except for some vital assets. Upper bounding the variance of the returns of these assets is equivalent to: Computational aspects of column generation for nonlinear… Each client minimizes a risk measure f t that depends on its uncertain returns. They each minimize an entropic risk [52] of . The general problem can be cast as follows:

Sample average approximation (SAA)
The last problem is approximated by using S samples c ts of respective probabilities p s with SAA [32]. The approximations of the variance and expectations are summarized in Table 2. For simplicity, we use the same names for the functions and their respective SAAs.
Defining V, V t and V 0 such that V t sj ∶= √ p s (d ts j − d t j ) and Vx ∶= ∑ T t=1 V t x t , the variance constraint can be expressed as a classic nonlinear quadratic (CLA) convex constraint ||V 0 y + Vx|| 2 2 ⩽ 2 , or the SOC constraint (V 0 y + Vx, ) ∈ L S+1 2 . The full approximated problem becomes: Notice that Assumption 1 is satisfied from Proposition 1, as P(X) is a convex NLP that for some small > 0 , admits the all-'s vector of ℝ n+n 0 , n+n 0 , as a strictly feasible point and we can use n as a starting column for S 1 ⊂ S 2 ⊂ S 3 ...
If (8) is seen as a SOC.

Pricing problem
The structure of the pricing problem depends on the coupling constraint considered: Given any dual vector ( 1,0 , 2,0 , 3,0 ) ∈ ℝ n 0 + × ℝ n 0 + × ℝ + corresponding to the y-specific constraints y ⩾ 0 , y ⩽ u 0 and (a 0 ) ⊤ y ⩽ b 0 , the pricing problem is: Notice that considering (8) as a SOC makes the pricing problem separable in each x t and y and also allows the use of the y-independency result in Proposition 2. More importantly, using the linearized pricing each problem in x t is solvable in O(n t ln n t ) time: the objective function becomes linear and every problem in x t can be solved with a dedicated algorithm (see Proposition 5).

Proposition 5 The following LP with p variables can be reduced to a continuous knapsack problem, for which an optimal solution can be found in O(p ln p) time:
Proof Appendix H. ◻ We summarize In Table 3 the types of pricing problem we encounter with our framework, and how to solve them.

Master problem
From Table 1, we summarize in Table 4 the types of sets S k we use in our experiments and their implications for the master problems P(S k ) . Notice that the convex hull and the partitioning schemes hold a clear advantage wrt the others, as their number of constraints is way lower than the rest.
Chasing the conic dual variables If we consider the variance constraint as a SOC we must be able to retrieve conic dual variables for the master problem P(S) . Even though nonlinear or linear conic solvers do exist, we could not find any general (8) is seen as a CLA, If (8) is seen as SOC, purpose package for problems having both features and returning conic multipliers.
To circumvent this issue, we solve the master with an off the shelf nonlinear solver by considering the quadratic constraint as a CLA, then use the following result to obtain conic multipliers:

Proof Appendix I. ◻
Proposition 6 indicates that we can always derivate SOC multipliers from the multipliers of the constraint in nonlinear quadratic convex form.  Table 4 Master constraints ( y ∈ X 0 and the variance constraint are always present)

Numerical enhancement
In preliminary experiments, solving the convex optimization problems at hand as they are with an interior point method rapidly exceeds the capabilities of an average workstation. This is due to the fact that the Hessian matrix of the penalized objective can have many nonzero coefficients because of the entropic risk measures and the variance constraint. This observation implies that the linear system that is solved during each Newton step of the interior point method can be overly demanding both in terms of memory and running time. Introducing new variables and constraints, notice that we have the following identity: for any t ∈ {0, ..., T} and any z ( x t or y) we have In the same fashion, we can replace the variance constraint with By using this transformation, the objective function eliminates the variables z from the objective function and involves only the S variables v t s . This way, the Hessian of the Lagrangean function is a slightly larger matrix with (T + 1)S extra columns and rows having way less nonzero coefficients: at most (T + 1)S 2 of them come from the entropies, and S from the variance constraint. For this reason and the fact that CG is typically useful when there are only a few side constraints, we purposefully kept S moderately small for our experiments.
In Appendix J, we briefly recall the mechanics of an iteration of a primaldual interior point scheme and showcase the sparsity patterns of the Newton step for the pricing problem and the master problem in their original and enhanced forms. We also show that on our testbed, we can divide at least by a factor 100 the number of nonzero coefficients in the Newton system at the cost of having less than a percent of additional unknowns.
Chasing the dual variables for the enhanced model We must now be able to catch the dual variables associated to the original constraints (depending only of x and y) from the dual variables of the constraints in the enhanced formulation (now also including v and w). We address this in a general setting in the next Proposition, by showing that we can ignore the extra constraints and use the multipliers as is:

Proposition 7 Given a proper cone K , consider:
Suppose there is a transformation with extra variables v and w such that for any v = Vu , (u) =̃(v) and for any w = Wu , (u) =̃(w) , so that (11) can be rewritten as: ||w|| 2 2 ⩽ 2 and w = V 0 y + Vx.

Computational experience
The algorithms presented in this paper were coded in C programming language and run over Dell PowerEdge C6420 cluster nodes with Intel Xeon Gold 6152 CPUs at 2.10GHz with 32Gb RAM each. All the convex NLPs are solved using the callable library of IPOPT [49,62], using as a subroutine the linear solver Pardiso [47,48].

Methods tested and nomenclature
We test our methods (1) on different sets S that use the convex hull (V), the conic hull (C), the linear span (LR) or the partition-based linear span (LP), (2) using a linearized pricing problem (L) or without (NL), (3) considering the variance constraint as a conic (SOC) or as a classical nonlinear quadratic (CLA). The y-independency result or the tailored algorithm for knapsack problems are always used whenever it applies. For example, the scheme using the convex hull with linearized pricing and considering the variance constraint as a SOC will be named L-SOC-V. We summarize the different options tested in Table 5. We do not test any non-linearized scheme if the variance constraint is considered CLA, as the pricing problem is still not separable and can be as hard as P(X).

Instances generated
In order to push our frameworks to their limits, we generate synthetic instances of variable sizes. We consider T ∈ {1, 50} blocks of equal sizes n t = N ∈ {10 4 , 10 5 } . There are n 0 = 50 auxiliary variables and we generate S = 20 scenarios. The supplies u t j are uniformly drawn from {1, ..., 5} , and the costs and weights c ts j and a t are uniformly drawn from [1,2]. The budgets are b t = 0.05 ⋅ (u t ) ⊤ a t , i.e. such that each client can buy 5% of the assets. For some scenario s ∈ [S] , letting x t ∈ arg max x t {(c ts ) ⊤ x t ∶ x t ∈ X t } and ŷ ∈ arg max y {(c 0s ) ⊤ y ∶ y ∈ X 0 } , we set 2 ∶= 0.1 ⋅ ||V 0ŷ + Vx|| 2 2 so that the variance constraint is binding. Remark that for any value z and any utility random variable Z we have E (−Z) = E (z − Z) − z , meaning that minimizing the entropic risk measure means to avoid outcomes of Z such that z − Z is greater than . Given a reference random variable Ẑ , by setting z = (Ẑ) and = ⋅ ( (Ẑ)) 1∕2 , the clients wish to avoid asset selections whose outcomes can make you lose more than standard deviations, compared to the expected return of the reference solution. With this observation in mind, we set t ∶= 0.7 ⋅ ||V txt || 2 and 0 ∶= 0.7 ⋅ ||V 0ŷ || 2 .
The vectors d t are the returns c t where all the components are zero, except for 60% of the assets bought in the referent (ŷ,x) (i.e. only these are considered in the variance constraint), and the initial column x 0 ∈ S 1 -that is feasible for P(X) -is: Absolute and relative tolerances are respectively set to 10 −6 and 0.1% and the runs are stopped after 6 h.

Computational results
In Tables 6, 7, 8 and 9, we report the number of iterations (it), the total execution time (t), the master time (tmas), the pricing time (tpri), the best lower bound given by a pricing problem at any time (LB), the best upper bound given by the objective value of the last master (UB), the optimality gap (gap), and the number of variables defined by the last set S k ( |S| ). The execution times are in seconds, the gaps in % and LB and UB are scaled wrt to the upper bound of L-CLA-V. The entries in bold font are the best of each column, except for the "t" column where it means that the associated scheme went faster than solving the monolithic problem. If the time limit is hit during the last iteration, we report the execution time at the end of the previous one. If an algorithm stalls before giving any partial result, we mark the entry with "*".
Overall, every scheme is shown to always converge to solutions within the optimality tolerance for the small and mid-sized instances (10K-500K main variables: Tables 6, 7 and 8). Further, the execution time is mostly used to solve the master problems for the linearized schemes but more evenly split with the pricing time for the non-linearized schemes. We can see that using the convex hull with a linearized pricing (L-SOC-V and L-CLA-V) performs 2-3 times faster than the monolithic model. Along with using the partitioned linear span with a linearized pricing (L-SOC-LP and L-CLA-LP), they are the only algorithms that were able to terminate successfully or return a good quality solution in the allotted time for the largest instance (Table 9, T = 50 , N = 100K ) with 5 million variables. This is due to the fact that these schemes are the only ones that reduce substantially the size of our master problem, be it by making the X constraints redundant for the convex hull, or by shrinking the variable bounds into a small number of other variable bounds for the partitioned linear span (See Table 4 in Subsection 5.4). We can see that-in our case-a linearized pricing is a crucial ingredient for a successful scheme as we can use a tailored algorithm to solve it. Also, considering the main side constraint as a SOC or a CLA does not make any difference when linearizing the pricing. Even though the non-linearized schemes are not competitive for large instances, we can see that their pricing problems provide excellent quality columns and bounds and make the schemes converge in a few iterations. This phenomenon is probably due to the reliability of the pricing problem on giving a "good"  column: By linearizing the objective function of the pricing problem, we somehow lose some of the information that the nonlinearity of the pricing's objective would have carried along. One could make a parallel with bundle/proximal methods [11] and stabilization techniques for CG [20,46] where the objective function of the pricing problem is penalized with a nonlinear term enforcing that the new column does not make the dual variables stray too much away from the current dual incumbent.
Notice that using the variable aggregation scheme, the master problems are significantly bigger than the others, thus making their solution slower, although the bounds they return are also significantly better.  show examples of progression of respectively the bounds and gaps over time of the schemes associated to V and LP on the mid-sized instance ( T = 50 , N = 10K ). We can see that the bound/gap improvement is quite progressive for the linearized schemes, whereas it takes only a few large steps in the nonlinearized schemes.
In Figs. 6 and 7, we show examples of progression of respectively the distance between x k and S k , and the largest "reduced costs" (see Subsection 3.4) in absolute value over time of the schemes associated to V and LP on the mid-sized instance ( T = 50 , N = 10K ). This empirically confirms the theoretical results about the stopping criterion x k ∈ S k and the zero-reduced cost one presented in Proposition 4. Interestingly, to some extent these values can be used to estimate the proximity of the current solution from being optimal. We can make the same observation as for the bounds/gap, where the evolution is more progressive for the linearized schemes than the nonlinearized ones.

Conclusions and future work
We propose a generic primal decomposition method that unifies a broad range of existing schemes and opens the door for new exotic algorithmic frameworks. The convergence rate of the algorithms we present is not studied but can be heavily problem dependent. Several special cases of our methods have been proved to converge under mild assumptions but more work is required to prove the convergence of broader classes of algorithms. Extensive computational experiments should be conducted on benchmark instances to gauge the advantages and inconvenients of each of those schemes.
Delayed column generation We assumed here that some structure X is exploitable: in an ongoing work, we explore the same kind of algorithm but relaxing all the constraints. It leads to algorithms sharing similarities with delayed column generation [7] and a simplex for nonlinear problems [65]. The pricing boils down to check if the reduced costs are zero and pick as an entering column a variable whose reduced cost is nonzero, while hardening significantly the master problem.
Non Lagrangean relaxations as pricing problems Instead of relying on Lagrangean duality, be it in the information we have access to when solving the master problem or the kind of pricing problem we solve, different relaxations can be used to provide stronger bounds and attenuate unstable behaviors [36,46]. We can use e.g. surrogate relaxations [27,30] where instead of relaxing the side constraints in the objective, they are bundled into a single constraint ⟨ , g(x, y)⟩ ⩽ 0 . The pricing problem becomes harder, but provides stronger dual bounds with weaker working hypothesis. The master problems must be solved with a surrogate algorithm [37] that returns optimal surrogate multipliers .
Non-convex problems P(X) X can be a tractable relaxation of a combinatorial problem P(X) : a choice of S defines the relaxation X of X we work on. The strength of the bounds and the difficulty to solve the subproblems in our algorithms can vary greatly from one choice of S to another [12,19,28,29,35,57,68].
Dual decomposition In an ongoing work, we provide Dual decomposition schemes where-using the tight relationship between DW and the Benders decomposition method-we present a constraint generation methodology where the dual variables are decomposed and generated on the fly. Again, a variety of sets S can be used, yielding different master problems e.g. the generalized Benders decomposition [25] or constraint aggregation schemes [13,22,56].

Appendix J: Sparsity patterns for the enhanced models
Primal-dual interior point and Newton step Given a barrier parameter > 0 and an optimization problem min u { (u) ∶ (u) ⩽ 0, Ru = s} , the barrier problem is defined as min u { (u) − ∑ p i=1 ln(− i (u)) ∶ Ru = s} . The first order optimality conditions for the barrier problem are the following perturbed conditions: The heavy works at each iteration of a primal-dual interior point algorithm occur during the Newton step, which consists in approximately solve the latter system of equations by solving in (Δu, Δ , Δ ) its following first-order approximation: Monolithic formulation After the sparsity patterns in Figs. 8, 9, Figure 10 summarizes the number of nonzero coefficients present in the Newton step's system.
Pricing problems We only cover here the pricing problem when the variance constraint is considered as conic because the pricing when considering it as a single nonlinear constraint becomes very similar to the monolithic formulation. Given that the problem is splittable in each x t (and x 0 = y ), we successively solve the pricing for each t ∈ {0, ..., T} , which is the problem for which we study the sparsity pattern. As per Fig. 11, Fig. 12 summarizes the number of nonzero coefficients present in the Newton step's system.
The non-enhanced version has n 2 t + 8n t + 1 nonzeroes, with 3n t + 1 unknowns, while the enhanced version sums n t (8 + 2S) + S(S + 2) + 1 nonzeroes, with 3n t + 2S + 1 unknowns. We summarize these results in Table 11 to illustrate the benefits of using the enhanced version instead of the original version, where we can see again that we can divide at least by a factor 100 the number of nonzero coefficients in our instances at the cost of having less than a percent of additional unknowns.