Summary of the "Generalization Bounds via Convex Analysis" Paper [Draft]

The "Generalization Bounds via Convex Analysis" paper by Gergely Neu, and Gábor Lugosi, discusses the generalization error of supervised learning algorithms and how it can be bounded regarding the mutual information between their input and output.

The authors generalize this result beyond the standard choice of Shannon's mutual information as the measure of dependence between the input and the output.

Shannon's mutual information is a measure of the amount of information that is shared by two random variables.

The mutual information between two random variables, X and Y, is the difference between X's entropy and X's conditional entropy given Y. 
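
In symbols, writing \(H\) for entropy, this is the standard identity (stated here for reference):

\[ I(X; Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X). \]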

In previous work, the generalization error of supervised learning algorithms was bounded in terms of Shannon's mutual information between the training sample and the algorithm's output, under a subgaussianity assumption on the losses.
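
As context, a representative bound of this type (this particular form comes from the earlier information-theoretic literature rather than from the summarized paper) states that if the loss of every fixed hypothesis is \(\sigma\)-subgaussian, then, writing \(W_n\) for the algorithm's output and \(S_n\) for the training sample of size \(n\) (notation used in the Problem section below),

\[ \big| \mathbf{E}[gen(W_n, S_n)] \big| \le \sqrt{\frac{2\sigma^2 \, I(W_n; S_n)}{n}}, \]

where \(I(W_n; S_n)\) is the mutual information between the output and the training sample.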

However, this paper shows that the mutual information can be replaced by any strongly convex function of the joint input-output distribution, provided the subgaussianity condition on the losses is replaced by a bound on an appropriately chosen norm that captures the geometry of the dependence measure.

This technique allows the authors to derive a range of generalization bounds that are either entirely new or strengthen previously known ones.

Examples include bounds in terms of p-norm divergences and the Wasserstein-2 distance, which are respectively applicable to heavy-tailed loss distributions and highly smooth loss functions.

Definitions

A (twice-differentiable) function is strongly convex if it is convex and its second derivative, or its Hessian in higher dimensions, is bounded below by some positive constant. In other words, a strongly convex function is always "curving up" at a rate that is at least as fast as some fixed quadratic. This property makes strongly convex functions useful in optimization problems because they have a unique minimizer that can be found efficiently.
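
More generally (a standard definition that does not require differentiability, stated with respect to a norm \(\|\cdot\|\)), a function \(f\) is \(\mu\)-strongly convex for some \(\mu > 0\) if for all \(x, y\) and all \(\lambda \in [0, 1]\),

\[ f(\lambda x + (1 - \lambda) y) \le \lambda f(x) + (1 - \lambda) f(y) - \frac{\mu}{2} \lambda (1 - \lambda) \|x - y\|^2. \]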

A subgaussian random variable is one whose tail probabilities decay at least as fast as those of a Gaussian. The subgaussianity condition used in earlier bounds requires the loss of every fixed hypothesis to have such a subgaussian tail.
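
Concretely (a standard definition), a random variable \(X\) is \(\sigma\)-subgaussian if

\[ \mathbf{E}\left[ e^{\lambda (X - \mathbf{E}[X])} \right] \le e^{\lambda^2 \sigma^2 / 2} \quad \text{for all } \lambda \in \mathbb{R}. \]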

A norm is a function that assigns a non-negative real number to each vector in a vector space and behaves like the distance from the origin.
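
Formally (the standard axioms), a norm \(\|\cdot\|\) satisfies, for all vectors \(x, y\) and scalars \(\alpha\),

\[ \|x\| \ge 0, \qquad \|x\| = 0 \iff x = 0, \qquad \|\alpha x\| = |\alpha| \, \|x\|, \qquad \|x + y\| \le \|x\| + \|y\|. \]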

A divergence is a function that takes two probability distributions as input and returns a number that measures how much they differ. The number returned must be non-negative and equal to zero if and only if the two distributions are identical. 
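
A familiar example (one of many; the paper works with more general divergences) is the Kullback-Leibler divergence between distributions \(P\) and \(Q\) with densities \(p\) and \(q\),

\[ D_{\mathrm{KL}}(P \,\|\, Q) = \int p(z) \log \frac{p(z)}{q(z)} \, dz, \]

which recovers the mutual information when \(P\) is the joint input-output distribution and \(Q\) is the product of its marginals.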

Problem

\[gen(W_n, S_n) = L(W_n, S_n) - \mathbf{E} [\ell(W_n, \tilde{Z}) | W_n] \]

\(gen(W_n, S_n) \) : The generalization error of the learning algorithm. 

\( L(W_n, S_n)\) : The training loss, i.e., the average loss of the output \(W_n\) on the training sample \(S_n\).

\( \mathbf{E} [\ell(W_n, \tilde{Z}) | W_n] \) : The expected test loss of the output \(W_n\), where \(\tilde{Z}\) is a fresh example drawn from the data distribution, independently of the training sample.
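
Under the usual convention for this setup (assumed here; the notation \(S_n = (Z_1, \dots, Z_n)\) for the training sample is not spelled out above), the training loss is the empirical average

\[ L(W_n, S_n) = \frac{1}{n} \sum_{i=1}^{n} \ell(W_n, Z_i). \]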

Proof

 

References 

1. Generalization Bounds via Convex Analysis (Gábor Lugosi, Gergely Neu) - https://arxiv.org/abs/2202.04985