Sphere Refinement in Gap Safe Screening

The Gap safe screening technique is a powerful tool to accelerate the convergence of sparse optimization solvers. Its performance is largely based on the ability to determine the smallest “sphere”, centered at a given feasible dual point, that contains the dual solution. This can be achieved through an inner sphere refinement loop, applied at each screening step. In this work, we show that this refinement loop actually converges to the solution of a fixed-point equation for which we derive a closed-form expression for two common loss functions. This allows us to develop an analytic (i.e., non iterative), more concise and theoretically-grounded variant of the sphere refinement step.


I. INTRODUCTION
S PARSE optimization problems are encountered in fields such as signal processing, inverse problems, statistics, and machine learning. A very common formulation is given by where A ∈ R m×n , C ∈ {R n , R n ≥0 }, λ > 0, and each scalar function f i : R → R ∪ {+∞} is proper convex, and differentiable. As such, numerous algorithms have been developed to tackle problems of the form (1). These include, but are not limited to, proximal gradient [1], [2], [3], coordinate descent [4], [5], and majorization-minimization [6] methods.
Within this context, the promise of safe screening is to identify zero coordinates in x so as to reduce the size of the problem and, consequently, accelerate the convergence of the solver. This identification can be performed before or within the course of iterations, leading respectively to the so-called static [7] and dynamic [8] screening approaches. Although originally proposed for the Lasso problem [7] (i.e., f i (z) = (y i − z) 2 where y i is the ith entry of a data vector y ∈ R m ), safe screening techniques have then been extended to a large variety of sparse-regularized problems [8], [9], [10], [11]. The case where the 1 -norm in (1) is replaced by a generic group separable norm has also been treated in [9], [12]. Manuscript  Notations: We let [n] = {1, . . . , n}. For a vector x ∈ R n , we denote x i its ith entry. Given a subset of indices g ⊆ [n] with cardinality |g| = n g , x g ∈ R n g is the restriction of x to its elements indexed by g. For a matrix A, we similarly define the restrictions a j and A g with respect to the columns of A.
Safe Screening in a Nutshell: Safe screening techniques rely on the first-order primal-dual optimality conditions of (1). More precisely, we get the key property that [7], [12] where In the dual formulation (3), corresponds to the dual feasible set, with dom(D λ ) denoting the domain of the dual function.
One sees from (2) that the knowledge of θ allows us to identify zero coordinates in x . Yet, this is not practical as θ is unknown. The main task in safe screening is thus to define a safe region S θ from which we can derive the following safe screening rule for the jth component Clearly, in order to maximize screening performance, the safe region S should be as small as possible (to increase the number of screened variables) while allowing an efficient computation of the screening test given by max θ∈S |φ(a T j θ)| < 1 (to minimize the computational overhead).
Among existing safe regions, the Gap safe sphere [9] leads to state-of-the-art screening performance. Given any primal-dual pair (x, θ), it reads as to the strong concavity constant of D λ over R m . Not only is its geometry simple (allowing fast screening tests), but its radius vanishes upon convergence of the primal-dual iterates when strong duality holds (i.e., Gap λ (x , θ ) = 0). Yet, it requires the dual function D λ to be globally strongly concave which precludes its use for an important class of functions f i such as the β-divergences with β ∈ [1, 2) [13].
In a previous work [12], we overcame this limitation by computing local strong concavity bounds on wellchosen subsets of the domain. Moreover, by re-evaluating the strong-concavity bound on the current safe sphere, we proposed a sphere refinement loop that improves screening performance.
Contributions: In this letter, we scrutinize the sphere refinement loop proposed in [12] and recalled in Section II. We Algorithm 1: GSS with Iterative Sphere Refinement [12].
Primal and Dual updates 5: Safe region with iterative refinement Init. safe region 13: if α S k > αS then 14: repeat (loop over j) Sphere refinement 16:

II. GSS WITH ITERATIVE SPHERE REFINEMENT
The Gap safe screening (GSS) with iterative sphere refinement proposed in [12] is recalled in Algorithm 1. There, PrimalUpdate (resp., DualUpdate) refers to the update step of any iterative primal (resp., dual) solver for (1) (resp., (3)). Given a subset S ⊂ R m , α S stands for the strong concavity constant of D λ over S, i.e., a lower bound on f * i over S and ∀i ∈ [m] (c.f. [14], [12,Proposition 11]). Then, the construction of the safe region, starting at line 7, is made of three steps.
r First, if the new dual point does not belong to the previous safe region, the latter is inflated (Line 10) to make the next step possible with θ k ∈S.
r Second, as θ k ∈S, Theorem 5 in [12] can be invoked to build a new safe region centered at θ k using αS (Line 12).
r Third, if the strong concavity constant over this new safe region improves (Line 13), this safe region is iteratively refined (lines [15][16][17][18]. From [12,Proposition 7], this refinement loop generates a sequence of nested Gap safe spheres (i.e., with decreasing radius), all centered in θ k . Finally, the refined safe region is used at Line 22 to safely screen out zero-coordinates of the solution vector x . 1: Inputs: Primal and Dual updates 5: Projection onto S b (c.f. [14]) 8: Safe region with analytic refinement 9: Init. safe region Track region with best α 14: end if 15: end if 16:

III. GSS WITH ANALYTIC SPHERE REFINEMENT
The proposed Gap safe screening with analytic sphere refinement is presented in Algorithm 2. Its main novelties with respect to Algorithm 1 are outlined in the next three sections.

A. Tracking the Best Strong Concavity Constant
As opposed to Algorithm 1, in Algorithm 2 we keep track of the safe region S b over which the best (i.e., largest) constant α S b has been computed so far. As such, we ensure the construction of a non-decreasing sequence of strong concavity constants. Then, to ensure that the new dual point θ k belongs to S b , we replaced the inflation step at Line 10 of Algorithm 1 by the projection step at Line 7 of Algorithm 2. The benefit of this modification is twofold.
r It discards the need of recomputing the strong concavity constant (on the inflated region) before refinement.
r It leads to improved dual points. Indeed, given that S b is convex and θ ∈ S b , we have for allθ

B. Avoiding Unnecessary Refinement Attempts
In Algorithm 2, we added the test at Line 10 to avoid unnecessary refinement attempts. Indeed, one can see that the refinement step will improve the initial kth Gap safe sphere only if α S k > α S b (i.e., the strong concavity constant on the new S k is better than the best one α S b computed so far). From the  definition of strong concavity, 1 we can thus derive the following three situations (illustrated in Fig. 1(a)-(c)), A typical situation of improvement arises when the duality Gap (and thus the radius) decreases more than the displacement of the dual point from one iteration to the next. Indeed, let θ S b ∈ Δ A and r S b > 0 denote respectively the center and the radius of S b , then S k ⊆ S b is equivalent to (see Fig. 1(a)) Similarly, we get that S k ⊇ S b (i.e., no improvement case) is equivalent to Hence, the complement of (7) includes all improvement and indecisive cases (see Fig. 2). It can be used as a test that does not require to compute any strong concavity constant to decide whether to perform the refinement step (Line 10 of Algorithm 2). Yet, a second test involving strong concavity constants (Line 11 of Algorithm 2) is required to deal with indecisive cases.

C. Dropping the Refinement Loop
In Proposition 1, we prove that the sequence of strong concavity constants generated by the refinement loop at lines 15-18 of Algorithm 1 converges to the solutionᾱ k of a fixed-point equation. This is illustrated in Fig. 1(d)-(e). As such, provided that one has access to a closed-form expression forᾱ k (see Section IV), the refinement is no longer iterative, as implemented at Line 12 of Algorithm 2.
Proposition 1 (Fixed point equation): Assume that D λ is twice differentiable and let (x k , θ k ) ∈ C × Δ A be the kth primal-dual iterate pair and G k := Gap λ (x k , θ k ). Define with unconstrained inf when α = 0. Then, the refinement loop (lines 15-18 in Algorithm 1) converges to the safe region S k = B(θ k , 2G k /ᾱ k ) withᾱ k a non-repelling 2 fixed point of h k .

IV. CLOSED-FORM EXPRESSIONS OF FIXED POINTS
The practical relevance of Algorithm 2 depends on our ability to derive closed-form expressions of fixed points of h k in (8). We show that this is possible for two very common loss functions. More general properties of h k are given in [14,Sec. 1.3].

A. Kullback-Leibler Divergence
Here, the scalar data-fidelity functions f i and their convex conjugates are given by: where y i is the ith entry of the data vector y ∈ R m ≥0 , > 0 is a smoothing factor that avoids singularities around zero and dom(f * i ) = {u ∈ R | u ≤ 1}. Finally, in this case we have C = R n ≥0 , and A ∈ R m×n ≥0 . (10) has a unique attracting fixed-point

B. Logistic Function
For an input signal y ∈ R m , the data-fidelity functions f i and their convex conjugates f * i are given by: with In this case, we have C = R n .

C. Complexity Analysis
The computation of the fixed point from the closed-form expressions derived in propositions 2 and 3 is of the order of O(m) (due to the min operations). This is about the same complexity as for the evaluation of the strong concavity constant on any ball [12, Table 1]. As such, denoting by P the number of sphere refinement iterations in Algorithm 1, the analytic version of Algorithm 2 allows for reducing the overall sphere refinement complexity from O(P m) to O(m).

V. NUMERICAL ILLUSTRATION
In this section, we illustrate the behavior of the sphere refinement procedure with the following two examples: r KL regression with a proximal gradient (PG) solver [3] for archetypal analysis on the NIPS papers dataset [17]. The size of the problem is (m × n) = (2483 × 14035) and λ/λ max = 10 −1 (λ max being the regularization above which the zero vector is a solution of (1) [12,Sec. 4.1.2]).
r Logistic regression with a coordinate descent (CoD) solver [18] for binary classification of the Leukemia dataset [19]. The size of the problem is (m × n) = (71 × 7129) and λ/λ max = 10 −2 . We report in Table I, for each of the three situations described in Fig. 1, the distribution of times whereᾱ k > α S b andᾱ k ≤ α S b . The reported distributions have been experimentally observed to depend mostly on the underlying solver rather than other parameters like regularization and problem instance. Therefore, the reported scenarios in Table I are quite Alg. 1 (bottom), as a function of the outer iteration number k. Green, grey (very rare) and red backgrounds depict respectively the improvement, indecisive, and no-improvement situations. The initial white region corresponds to a "burn-in" phase whereᾱ k < α S b = α Δ A . Indeed, here α S b has been initialized with α Δ A which is known.
representative of CoD and PG solver typical behaviors [14]. An interesting observation is that, for both experiments, the proportion of indecisive situations is very low (1.9% for Logistic regression with CoD and 1.2% for KL-regression with proximal gradient). In particular, the proportion of times where the fixed pointᾱ k has been computed without being used (bold values in Table I) is even smaller. These observations show the efficiency of the simple test at Line 10 of Algorithm 2 in discriminating improvement from no-improvement situations. If, for a given problem, indecisive cases are more abundant, often leading to useless computations ofᾱ k , one can modify the test in Line 10 so as to exclude such indecisive cases.
To further illustrate the sphere refinement behavior, we report in Fig. 3 the evolution of the best strong concavity constant α S b , the kth fixed pointᾱ k , and the duality gap G k = Gap λ (x k , θ k ) as a function of the iteration number k. We observe that, in general, no-improvement situations (red areas) occur at iterations where the duality gap increased. This behavior can be more frequent for some solvers (e.g., PG) than others (e.g., CoD). Moreover, we see that the fixed pointᾱ k (blue curves) can be significantly degraded in such no-improvement situations. This shows the importance of updating the strong concavity constant only when it improves over the current one (α S b ).
The number of refinement iterations performed by Alg. 1 is also reported in Fig. 3. We see that, in the coordinate descent (resp. proximal gradient) case, a total of 37 (resp. 69) inner iterations are avoided by the proposed approach (more extensive experiments are available in [14]).

VI. CONCLUSION
In this work, we made a detailed theoretical analysis of the sphere refinement loop proposed in [12], by reformulating it as fixed point iterations and characterizing its convergence point. Not only does it shed new light on this refinement step, but it allows us to derive a non-iterative version that is more elegant, more concise, and enjoys a better computational complexity.