Fair Algorithms for Infinite and Contextual Bandits
Abstract
We study fairness in linear bandit problems. Starting from the notion of meritocratic fairness introduced in Joseph et al. (2016), we carry out a more refined analysis of a more general problem, achieving better performance guarantees with fewer modelling assumptions on the number and structure of available choices as well as the number selected. We also analyze the previously unstudied question of fairness in infinite linear bandit problems, obtaining instance-dependent regret upper bounds as well as lower bounds demonstrating that this instance-dependence is necessary. The result is a framework for meritocratic fairness in an online linear setting that is substantially more powerful, general, and realistic than the current state of the art.
1 Introduction
The problem of repeatedly making choices and learning from choice feedback arises in a variety of settings, including granting loans, serving ads, and hiring. Encoding these problems in a bandit setting enables one to take advantage of a rich body of existing bandit algorithms. UCB-style algorithms, for example, are guaranteed to yield no-regret policies for these problems.
Joseph et al. (2016), however, raise the concern that these no-regret policies may be unfair: in some rounds, they will choose options with lower expected rewards over options with higher expected rewards, for example choosing less qualified job applicants over more qualified ones. Consider a UCB-like algorithm aiming to hire all qualified applicants in every round. As time goes on, any no-regret algorithm must behave unfairly for a vanishing fraction of rounds, but the total number of mistreated people (in hiring, people who saw a less qualified job applicant hired in a round in which they themselves were not hired) can be large (see Figure 1).
Joseph et al. (2016) then design no-regret algorithms which minimize mistreatment and are fair in the following sense: their algorithms (with high probability) never at any round place higher selection probability on a less qualified applicant than on a more qualified applicant. However, their analysis assumes that there are well-defined groups, each with its own mapping from features to expected rewards; at each round exactly one individual from each group arrives; and exactly one individual is chosen in each round. In the hiring setting, this equates to assuming that a company receives one job applicant from each group and must hire exactly one (rather than none or all qualified applicants), introducing an unrealistic element of competition and unfairness both between applicants and between groups.
The aforementioned assumptions are unrealistic in many practical settings; our work shows they are also unnecessary. Meritocratic fairness can be defined without reference to groups, and algorithms can satisfy the strictest form of meritocratic fairness without any knowledge of group membership. Even without this knowledge, we design algorithms which will be fair with respect to any possible group structure over individuals. In Section 2, we present this general definition of fairness. The definition further allows for the number of individuals arriving in any round to vary, and is sufficiently flexible to apply to settings where algorithms can select multiple individuals in each round. By virtue of the definition making no reference to groups, the model makes no assumptions about how many individuals arriving in any round belong to any group. A company can then consider a large pool of applicants, not necessarily stratified by race or gender, with an arbitrary number of candidates from any one of these populations, and hire one, many, or even every qualified applicant.
We then present a framework for designing meritocratically fair online linear contextual bandit algorithms. In Section 3, we show how to design fair algorithms when at most some finite number of individuals arrives in any round (the linear contextual bandits problem (Abe et al., 2003; Auer, 2002)), as well as when multiple individuals may be chosen in each round (the “multiple play” setting introduced and studied absent fairness in Anantharam et al. (1987)). We therefore study a much more general model than Joseph et al. (2016) and, in Section 3, substantially improve upon their black-box regret guarantees for linear bandit problems using a technical analysis specific to the linear setting.
However, these regret bounds still scale (polynomially) with k, the maximum number of individuals seen in any given round. This may be undesirable for large k, thus motivating the investigation of fair algorithms for the infinite bandit setting (the online linear optimization with bandit feedback problem, Flaxman et al. (2005)). (We note that both the finite and infinite settings have infinite numbers of potential candidates: the difference arises in how many choices an algorithm has in a given round.) In Section 4 we provide such an algorithm via an adaptation of our general confidence-interval-based framework that takes advantage of the fact that optimal solutions to linear programs must be extreme points of the feasible region. We then prove, subject to certain assumptions, a regret upper bound that depends on Δ, an instance-dependent parameter based on the distance between the best and second-best extreme points in a given choice set.
In Section 5 we show that this instance dependence is almost tight by exhibiting an infinite choice set satisfying our assumptions for which any fair algorithm must incur regret with a polynomial dependence on the gap parameter Δ, separating this setting from the online linear optimization setting absent a fairness constraint. Finally, we justify our assumptions on the choice set by exhibiting, in Section 6, a choice set that both violates our assumptions and admits no fair algorithm with nontrivial regret guarantees. A condensed presentation of our methods and results appears in Figure 2.
Finally, we note that our algorithms share an overarching logic for reasoning about fairness. These algorithms all satisfy fairness by certifying optimality, never giving preferential treatment to one individual over another unless the algorithm is certain that the first has higher reward than the second. The algorithms accomplish this by computing confidence intervals around the estimated rewards for individuals. If two individuals have overlapping confidence intervals, we say they are linked; if one individual can be reached from another using a sequence of linked individuals, we say they are chained.
[Figure 2: condensed summary of our methods and results. Most of the table was lost in extraction; surviving fragments include the column headers “Technique” and “Notes” and one entry reading “Select all in every chain with highest UCB (Deterministic)”.]
1.1 Related Work and Discussion of Our Fairness Definition
Fairness in machine learning has seen substantial recent growth as a subject of study, and many different definitions of fairness exist. We provide a brief overview here; see e.g. Berk et al. (2017) and Corbett-Davies et al. (2017) for detailed descriptions and comparisons of these definitions.
Many extant fairness notions are predicated on the existence of groups, and aim to guarantee that certain groups are not unequally favored or mistreated. In this vein, Hardt et al. (2016) introduced the notion of equality of opportunity, which requires that a classifier’s predicted outcome should be independent of a protected attribute (such as race) conditioned on the true outcome, and they and Woodworth et al. (2017) have studied the feasibility and possible relaxations thereof. Similarly, Zafar et al. (2017) analyzed an equivalent concurrent notion of (un)fairness they call disparate mistreatment. Separately, Kleinberg et al. (2017) and Chouldechova (2017) showed that different notions of group fairness may (and sometimes must) conflict with one another.
This paper, like Joseph et al. (2016), departs from the work above in a number of ways. We attempt to capture a particular notion of individual and weakly meritocratic fairness that holds throughout the learning process. This was inspired by Dwork et al. (2012), who suggest fair treatment equates to treating “similar” people similarly, where similarity is defined with respect to an assumed pre-specified task-specific metric. Taking the fairness formulation of Joseph et al. (2016) as our starting point, our definition of fairness does not promise to correct for past inequities or inaccurate or biased data. Instead, it assumes the existence of an accurate mapping from features to true quality for the task at hand (Friedler et al. (2016) provide evidence that providing fairness from bias-corrupted data is quite difficult) and promises fairness while learning and using this mapping in the following sense: any individual who is currently more qualified (for a job, loan, or college acceptance) than another individual will always have at least as good a chance of selection as the less qualified individual.
The one-sided nature of this guarantee, as well as its formulation in terms of quality, leads to the name weakly meritocratic fairness. Weakly meritocratic fairness may then be interpreted as a minimal guarantee of fairness: an algorithm satisfying our fairness definition cannot favor a worse option but is not required to favor a better option. In this sense our fairness requirement encodes a necessary variant of fairness rather than a completely sufficient one. This makes our upper bounds (Sections 3 and 4) relatively weaker and our lower bounds (Sections 5 and 6) relatively stronger.
We additionally note that our fairness guarantees require fairness at every step of the learning process. We view this as an important point, especially for algorithms whose learning processes may be long (or even continuous). Furthermore, while it may seem reasonable to relax this requirement to allow a small fraction of unfair steps, it is unclear how to do so without enabling discrimination against a correspondingly small population.
Finally, while our fairness definition draws from Joseph et al. (2016), we work in what we believe to be a significantly more general and realistic setting. In the finite case we allow for a variable number of individuals in each round from a variable number of groups and also allow selection of a variable number of individuals in each round, thus dropping several assumptions from Joseph et al. (2016). We also analyze the previously unstudied topic of fairness with infinitely many choices.
2 Model
Fix some θ ∈ ℝ^d, the underlying linear coefficients of our learning problem, and T, the number of rounds. For each t ∈ [T], let C_t denote the set of available choices in round t. We will consider both the “finite” action case, where |C_t| ≤ k, and the infinite action case. An algorithm A, facing choices C_t, picks a subset S_t ⊆ C_t, and for each x ∈ S_t observes reward y_x such that E[y_x] = ⟨x, θ⟩, and the distribution of the noise y_x − ⟨x, θ⟩ is sub-Gaussian (see Section 7.1 for a definition of sub-Gaussian).
Refer to all observations in round t as Y_t, where Y_t collects the observed reward y_x for each x ∈ S_t. Finally, let X^t and Y^t refer to the design and observation matrices accumulated through round t.
We are interested in settings where an algorithm may face size constraints on S_t. We consider three cases: the standard linear bandits problem (|S_t| = 1), the multiple choice linear bandits problem (|S_t| ≤ m), and the heretofore unstudied (to the best of the authors’ knowledge) case in which the size of S_t is unconstrained. For short, we refer to these as 1-bandit, m-bandit, and k-bandit.
Regret
The notion of regret we will consider is that of pseudoregret. Facing a sequence of choice sets C_1, …, C_T, suppose A chooses sets S_1, …, S_T. (If these are randomized choices, the randomness of A is incorporated into the expected value calculations.) Then, the expected reward of A on this sequence is E[Σ_{t=1}^{T} Σ_{x ∈ S_t} ⟨x, θ⟩].
Refer to the sequence of feasible choices which maximizes expected reward as S*_1, …, S*_T, defined with full knowledge of θ. (We assume these have the appropriate size for each problem we consider: singletons in the 1-bandit problem, size at most m in the m-bandit problem, and arbitrarily large in the k-bandit problem.)
Then, the pseudoregret of A on any sequence of choice sets is defined as

R(T) = Σ_{t=1}^{T} ( Σ_{x ∈ S*_t} ⟨x, θ⟩ − E[ Σ_{x ∈ S_t} ⟨x, θ⟩ ] ).
The pseudoregret of A refers to the maximum pseudoregret A incurs on any sequence of choice sets C_1, …, C_T and any θ. If R(T) = o(T), then A is said to be no-regret. If, for any input parameter δ, a bound upper-bounds the difference between the optimal expected reward and the expected reward of the sequence chosen by A with probability 1 − δ, then we call this a high-probability regret bound for A.
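To make the definition concrete, the following minimal sketch (the function and variable names are ours, not from the paper) computes the pseudoregret of a 1-bandit run when θ is known:

```python
import numpy as np

def pseudoregret(theta, choice_sets, chosen):
    """Pseudoregret of a 1-bandit run: for each round, compare the
    expected reward of the best feasible choice with the chosen one."""
    total = 0.0
    for C, x in zip(choice_sets, chosen):
        best = max(c @ theta for c in C)   # optimal expected reward this round
        total += best - x @ theta          # per-round regret, summed over rounds
    return total

theta = np.array([1.0, 0.0])
C1 = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
# Choosing the suboptimal arm in round 1 and the optimal arm in round 2:
r = pseudoregret(theta, [C1, C1], [np.array([0.0, 1.0]), np.array([1.0, 0.0])])
# r = (1 - 0) + (1 - 1) = 1.0
```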
Fairness
Consider an algorithm A, which chooses a sequence of probability distributions π_1, …, π_T over feasible sets to pick. Note that distribution π_t depends upon the history through round t − 1, the choices C_t, and δ.
We now give a formal definition of fairness of an algorithm for the 1-bandit, m-bandit, and k-bandit problems. We adapt our fairness definition from Joseph et al. (2016), generalizing from discrete distributions over finite action sets to mixture distributions over possibly infinite action sets. We slightly abuse notation and refer to the probability density and mass functions of an element x ∈ C_t: this refers to the marginal distribution of x being chosen (namely, the probability that x belongs to the set picked according to the distribution π_t).
Definition 1 (Weakly Meritocratic Fairness).
We say that an algorithm A is weakly meritocratic if, for any input δ and any θ, with probability at least 1 − δ, at every round t, for every pair x, x′ ∈ C_t such that ⟨x, θ⟩ > ⟨x′, θ⟩:

If π_t is a discrete distribution: f(x) ≥ f(x′), for f the marginal probability mass function.

If π_t is a continuous distribution: f(x) ≥ f(x′), for f the marginal probability density function.

If π_t can be written as a mixture distribution: π_t = Σ_i α_i π_t^i, such that each constituent distribution π_t^i is either discrete or continuous and satisfies one of the above two conditions.
For brevity, as we consider only this fairness notion in this paper, we will refer to weakly meritocratic fairness as “fairness”. We say A is round-fair at time t if π_t satisfies the above conditions.
This definition can be easily generalized over any partition G of C_t, by requiring this weak monotonicity hold only for pairs x, x′ belonging to different elements of the partition G. The special case above of the singleton partition is the most stringent choice of partition. We focus our analysis on the singleton partition as a minimal worst-case framework, but this model easily relaxes to apply only across groups, as well as to only requiring “one-sided” monotonicity, where monotonicity is required only for pairs where the more qualified member belongs to one distinguished group of the partition rather than the other.
Remark 1.
In the k-bandit setting, Definition 1 can be simplified to require that, with probability 1 − δ over its observations, an algorithm never select a less-qualified individual over a more-qualified one in any round, and can be satisfied by deterministic algorithms.
3 Finite Action Spaces: Fair Ridge Regression
In this section, we introduce a family of fair algorithms for the linear 1-bandit, m-bandit, and (unconstrained) k-bandit problems. Here, an algorithm sees a slate of at most k distinct individuals each round and selects some subset of them for reward and observation. This allows us to encode settings where an algorithm repeatedly observes a new pool of individuals, each represented by a vector of features, then decides to give some of those individuals loans based upon those vectors, observes the quality of the individuals to whom it gave loans, and updates the selection rule for loan allocation. The regret of these algorithms will scale polynomially in k and will shrink as the algorithm gets tighter estimates of θ.
All of the algorithms are based upon the following template. They maintain an estimate θ̂ of θ from observations, along with confidence intervals around the estimate. They use θ̂ to estimate the rewards for the individuals present on day t, and the confidence interval around θ̂ to create a confidence interval around each of these estimated rewards. Any two individuals whose intervals overlap on day t will be picked with the same probability by the algorithm. Call any two individuals whose intervals overlap on day t linked, and any two individuals belonging to the transitive closure of the linked relation chained. Since any two linked individuals will be chosen with the same probability, any two chained individuals will also be chosen with the same probability.
An algorithm constrained to pick exactly m individuals each round will pick them in the following way. Order the chains by their highest upper confidence bound. In that order, select all individuals from each chain (each with probability 1) while doing so results in taking fewer than m individuals. When the algorithm arrives at the first chain for which it does not have capacity to accept every individual in the chain, it selects individuals to fill its capacity uniformly at random from that chain’s individuals. If the algorithm can pick any number of individuals, it will pick all individuals chained to any individual with positive upper confidence bound.
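The chain construction and the m-capacity selection rule can be sketched as follows. This is a simplified illustration with hypothetical helper names; in the actual algorithm the intervals come from ridge-regression confidence bounds. It uses the fact that, for intervals on a line, the transitive closure of “overlaps” is computed by sorting on lower endpoints and merging:

```python
import random

def chains(intervals):
    """Group indices whose confidence intervals overlap, transitively.
    Assumes a non-empty list of (lower, upper) pairs."""
    order = sorted(range(len(intervals)), key=lambda i: intervals[i][0])
    groups, cur, hi = [], [order[0]], intervals[order[0]][1]
    for i in order[1:]:
        lo, up = intervals[i]
        if lo <= hi:                 # linked to the current chain
            cur.append(i)
            hi = max(hi, up)
        else:                        # gap found: start a new chain
            groups.append(cur)
            cur, hi = [i], up
    groups.append(cur)
    # order chains by their highest upper confidence bound, descending
    groups.sort(key=lambda g: max(intervals[i][1] for i in g), reverse=True)
    return groups

def select_m(intervals, m):
    """Take whole chains (each member with probability 1) while capacity
    remains; fill the rest uniformly at random from the boundary chain."""
    picked = []
    for g in chains(intervals):
        if len(picked) + len(g) <= m:
            picked.extend(g)
        else:
            picked.extend(random.sample(g, m - len(picked)))
            break
    return picked
```

For example, with intervals `[(0.9, 1.1), (1.0, 1.3), (0.0, 0.2)]`, the first two indices form one chain (their intervals overlap) and the third is its own chain; with capacity m = 2 the top chain is taken whole.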
We now present the regret guarantees for fair 1-bandit, m-bandit, and k-bandit algorithms using this framework.
Theorem 1.
Suppose, for all t, the reward noise is sub-Gaussian, and that ‖θ‖ and ‖x‖ for all x ∈ C_t are bounded. Then the 1-bandit, m-bandit, and k-bandit instantiations of RidgeFair are fair algorithms for the 1-bandit, m-bandit, and k-bandit problems, respectively. With probability 1 − δ, the regret of RidgeFair is sublinear in T and polynomial in k and the dimension d.
We pause to compare our bound for 1-bandit to that found in Joseph et al. (2016). Their work supposes that each of k groups has an independent d-dimensional linear function governing its reward and provides a fair algorithm with a regret upper bound that our Theorem 1 guarantee improves upon.
Each of these algorithms will use a regularized least-squares regressor to estimate θ. Given a design matrix X, response vector Y, and regularization parameter λ, this estimator is of the form θ̂ = (XᵀX + λI)⁻¹XᵀY. Valid confidence intervals (that contain ⟨x, θ⟩ with high probability) are nontrivial to derive for this estimator (which might be biased); to construct them, we rely on martingale matrix concentration results (Abbasi-Yadkori et al., 2011).
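For concreteness, the ridge estimator can be computed as follows (a standard implementation sketch; the data here are hypothetical):

```python
import numpy as np

def ridge_estimate(X, Y, lam=1.0):
    """Regularized least-squares estimate of theta:
    theta_hat = (X^T X + lam * I)^{-1} X^T Y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)

# Noiseless sanity check: as lam -> 0, the estimate recovers theta exactly.
theta = np.array([2.0, -1.0])
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
Y = X @ theta
theta_hat = ridge_estimate(X, Y, lam=1e-8)
```

With noisy observations and moderate λ, the estimate is biased toward zero, which is why the validity of the confidence intervals requires the martingale concentration machinery cited above rather than classical least-squares theory.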
We now sketch how the proof of Theorem 1 proceeds, deferring a full proof (of this and all other results in this paper) and pseudocode to the supplementary materials. We first establish that, with probability 1 − δ, for all rounds t and all x ∈ C_t, the true reward ⟨x, θ⟩ lies in the confidence interval constructed for x (i.e. that the confidence intervals being used are valid). Using this fact, we establish that the algorithm is fair. The algorithm plays any two actions which are linked with equal probability in each round, and any action with a confidence interval above another action’s confidence interval with weakly higher probability. Thus, if the payoffs for the actions lie anywhere within their confidence intervals, RidgeFair is fair, which holds as the confidence intervals are valid.
Proving a bound on the regret of RidgeFair requires some nonstandard analysis, primarily because the widths of the confidence intervals used by the algorithm do not shrink uniformly. The sum of the widths of the intervals of our selected (and therefore observed) actions grows sublinearly in T. UCB variants, by virtue of playing an action with highest upper confidence bound, have regret in round t bounded by that action’s confidence interval width. RidgeFair, conversely, suffers regret equal to the sum of the confidence widths of the chained set, while only receiving feedback for the action it actually takes. We overcome this obstacle by relating the sum of the confidence interval widths of the linked set to the sum of the widths of the selected actions.
4 Fair algorithms for convex action sets
In this section we analyze linear bandits with infinite choice sets in the familiar 1-bandit setting. (Note that no-regret guarantees are in general impossible for infinite choice sets in the m-bandit and k-bandit settings, since the continuity of the infinite choice sets we consider makes selecting multiple choices while satisfying fairness impossible without choosing uniformly at random from the entire set.) We provide a fair algorithm with an instance-dependent sublinear regret bound for infinite choice sets – specifically convex bodies – below. In Section 5 we match this with lower bounds showing that instance dependence is an unavoidable cost for fair algorithms in an infinite setting.
A naive adaptation of RidgeFair to an infinite setting requires maintenance of infinitely many confidence intervals and is therefore impractical. We instead assume that our choice sets are convex bodies and exploit the resulting geometry: since our underlying function is linear, it is maximized at an extremal point. This simplifies the problem, since we need only reason about the relative quality of extremal points. The relevant quantity is the gap Δ, a notion adapted from Dani et al. (2008) that denotes the difference in reward between the best and second-best extremal points in the choice set. When Δ is large it is easier to confidently identify the optimal choice and select it deterministically without violating fairness. When Δ is small, it is more difficult to determine which of the top two points is best – and since deterministically selecting the wrong one violates fairness for any points infinitesimally close to the true best point, we are forced to play randomly from the entire choice set.
Our resulting fair algorithm, FairGap, proceeds as follows: in each round it uses its current estimate of θ to construct confidence intervals around the two choices with highest estimated reward and selects the higher one if these intervals do not overlap; otherwise, it selects uniformly at random from the entire convex body. We prove fairness and bound regret by analyzing the rate at which random exploration shrinks our confidence intervals and relating it to the frequency of exploitation, a function of Δ. We begin by formally defining Δ below.
Definition 2 (Gap, adapted from Dani et al. (2008)).
Given a sequence of action sets C_1, …, C_T, define E(C_t) to be the set of extremal points of C_t, i.e. the points in C_t that cannot be expressed as a proper convex combination of other points in C_t, and let x*_t = argmax_{x ∈ E(C_t)} ⟨x, θ⟩. The gap of the sequence is

Δ = min_t min_{x ∈ E(C_t), x ≠ x*_t} ( ⟨x*_t, θ⟩ − ⟨x, θ⟩ ).
Δ is a lower bound on the difference in payoff between the optimal action and any other extremal action in any C_t. When Δ > 0, this implies the existence of a unique optimal action in each C_t. Our algorithm (implicitly) and our analysis (explicitly) exploit this quantity: a larger gap enables us to confidently identify the optimal action more quickly.
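For a choice set with finitely many extremal points, the gap can be computed directly (a minimal sketch; the helper names are ours):

```python
import numpy as np

def round_gap(extremal_points, theta):
    """Reward difference between the best and second-best extremal points
    of a single choice set (assumes at least two extremal points)."""
    rewards = sorted((x @ theta for x in extremal_points), reverse=True)
    return rewards[0] - rewards[1]

def sequence_gap(extremal_sets, theta):
    """The gap of a sequence of choice sets is the smallest per-round gap."""
    return min(round_gap(E, theta) for E in extremal_sets)

# Example: the square [-1, 1]^2 has four extremal points (its corners).
theta = np.array([1.0, 0.5])
square = [np.array(v) for v in [(1, 1), (1, -1), (-1, 1), (-1, -1)]]
g = round_gap(square, theta)   # rewards 1.5, 0.5, -0.5, -1.5, so g = 1.0
```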
We now present the regret and fairness guarantees for FairGap.
Theorem 2.
Given a sequence of action sets C_1, …, C_T where each C_t has nonzero Lebesgue measure and is contained in a ball of radius R, and feedback with sub-Gaussian noise, FairGap is fair and achieves regret sublinear in T with a polynomial dependence on 1/Δ.
A full proof of FairGap’s fairness and regret bound, as well as pseudocode, appears in the supplement. We sketch the proof here: our proof of fairness proceeds by bounding the influence of noise on the confidence intervals we construct (via matrix Chernoff bounds) and proving that, with high probability, FairGap constructs correct confidence intervals. This requires reasoning about the spectrum of the covariance matrix of each choice set, which is governed by a spectral parameter λ, a quantity which, informally, measures how quickly we learn from uniformly random actions. (λ can be computed directly for finite C_t, or approximated by any positive lower bound for infinite C_t, and substituted directly into our results.) With correct confidence intervals in hand, fairness follows almost immediately, and to bound regret we analyze the rate at which these confidence intervals shrink.
The analysis above implies identical regret and fairness guarantees when each C_t is finite. For comparison, the results of Section 3 guarantee regret scaling polynomially with k. This result, in comparison, enjoys a regret independent of k, which may prove especially useful for cases involving large k.
Finally, our analysis so far has elided any computational efficiency issues arising from sampling randomly from C_t. We note that it is possible to circumvent this issue by relaxing our definition of fairness to approximate fairness and obtain similar regret bounds for an efficient implementation. We achieve this using results from the broad literature on sampling and estimating volume in convex bodies, as well as recent work on finding “2nd best” extremal solutions to linear programs. Full details appear in Section 7.4 of the Supplement.
5 Instancedependent Lower Bound for Fair Algorithms
We now present a lower bound instance for which any fair algorithm must suffer gap-dependent regret. More formally, we show that when each choice set is a fixed square for all t, any fair algorithm incurs regret with a polynomial dependence on 1/Δ with high probability. This also implies the weaker result that no fair algorithm enjoys an instance-independent sublinear regret bound holding uniformly over all Δ. We therefore obtain a clear separation between fair learning and the unconstrained case (Dani et al., 2008), and show that an instance-dependent upper bound like the one in Section 4 is unavoidable. Our arguments establish fundamental constraints on fair learning with large choice sets and quantify, through the parameter Δ, how choice set geometry can affect the performance of fair algorithms. The lower bound employs a Bayesian argument resembling that in Joseph et al. (2016) but with a novel “chaining” argument suited to infinite action sets. We present the result for dimension d = 2 for simplicity; the proof technique holds in any dimension d ≥ 2.
Theorem 3.
For all t, let the choice set C_t be a fixed square, and let A be any fair algorithm. Then for every gap value Δ > 0, there is a distribution over instances with gap Δ such that any fair algorithm incurs regret polynomial in 1/Δ with high probability.
We give a sketch of the central ideas in the proof, relegating a full proof to the Supplement. We start with the fact that any fair algorithm is required to be fair for any value of the linear parameter. Thus if we draw θ from a prior distribution, A must be round-fair for all t with probability at least 1 − δ, where now the probability includes the random draw of θ. Then Bayes’ rule implies that the procedure that draws θ and then plays according to A is identical to the procedure which at each step redraws θ from its posterior distribution given the past history.
Next, given the prior, A’s round-fairness at step t requires that (with high probability) if A plays action x with higher probability than action x′, we must have
⟨x, θ⟩ ≥ ⟨x′, θ⟩ with probability 1 over the posterior on θ given the history.   (1)
This enables us to reason about the fairness and regret of the algorithm via a specific analysis of the posterior distribution of θ. We formalize this argument in Lemmas 7 and 8. This Bayesian trick, first applied in Joseph et al. (2016), is a general technique useful for proving fairness lower bounds.
We then show that for a choice of prior specific to our choice set, two things hold: (i) while the posterior remains insufficiently concentrated, Equation 1 forces A to play uniformly from the choice set, and (ii) with high probability this persists for an initial stretch of rounds whose length is a parameter of the prior that acts as a proxy for the gap. Playing an action uniformly from the choice set incurs constant regret per round, so these two facts combine to show that with high probability A incurs regret proportional to the length of this initial stretch.
Finally we consider θ conditional on the event that the instance has gap Δ, which by our construction of the prior happens with constant probability. Let the conditional distribution of θ given that event be denoted accordingly. Then
which implies
Note that the failure probability vanishes as T grows, and so this is a high-probability bound. Since every θ in the support of the conditional distribution yields gap Δ, we’ve exhibited a distribution over instances with gap Δ on which regret is large with high probability, as desired.
The proof uses the fact that while the posterior is insufficiently concentrated, Equation 1 forces A to play uniformly at random. This happens by transitivity: if Equation 1 forces x to be played equiprobably with y, and y equiprobably with z, then x must be played equiprobably with z. The fact that any two actions in the choice set can be connected via such a (finite) transitive chain is illustrated in Figure 5 and formalized in Lemma 10.
Remark 2.
We note that this impossibility result only holds for d ≥ 2. When d = 1, the choice set reduces to the interval [−1, 1], and similarly θ is a scalar. Thus, the optimal action is sgn(θ). It takes on the order of 1/θ² observations to determine the sign of θ, so a simple fair algorithm may play randomly from [−1, 1] until it has determined sgn(θ), and then play sgn(θ) for every following round. Because the maximum per-round regret of any action is 2|θ|, and because the algorithm determines sgn(θ) with high probability within on the order of 1/θ² rounds, the regret of this simple algorithm over T rounds is on the order of min(|θ|T, 1/|θ|). Taking the worst case over θ, we see that this quantity is bounded uniformly by O(√T), a sublinear parameter-independent regret bound.
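The simple d = 1 strategy in the remark can be simulated as follows. This is an illustrative implementation under our own assumptions (uniform exploration, an unbiased slope estimate, and a heuristic Hoeffding-style stopping rule); it is not pseudocode from the paper:

```python
import math
import random

def simple_fair_1d(theta, T, delta=0.05):
    """Sketch of the d = 1 strategy from Remark 2: play uniformly from
    [-1, 1] until a (heuristic) confidence interval around the estimated
    slope excludes 0, then commit to the action sign(theta)."""
    regret = 0.0
    est_sum, n = 0.0, 0
    committed = 0                               # 0 until the sign is determined
    for _ in range(T):
        if committed:
            x = committed                       # exploit the identified sign
        else:
            x = random.uniform(-1.0, 1.0)       # fair uniform exploration
        y = x * theta + random.gauss(0.0, 1.0)  # linear reward + Gaussian noise
        regret += abs(theta) - x * theta        # per-round pseudoregret
        if not committed:
            est_sum += 3.0 * x * y              # E[3xy] = theta for x ~ U[-1,1]
            n += 1
            width = 3.0 * math.sqrt(2.0 * math.log(2.0 * T / delta) / n)
            if abs(est_sum / n) > width:        # sign of theta is now clear
                committed = 1 if est_sum > 0 else -1
    return regret
```

Because exploration is uniform over the whole interval, every round before commitment is trivially fair; after commitment the played action is optimal whenever the sign was identified correctly.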
6 Zero Gap: Impossibility Result
Section 4 presents an algorithm for which the sublinear regret bound has dependence on the instance gap. Section 5 exhibits a choice set on which any fair algorithm’s regret must depend on the gap parameter. We now exhibit a choice set for which the gap Δ = 0 for every θ, and for which no fair algorithm can obtain nontrivial regret for any value of θ. This precludes even instance-dependent fair regret bounds on this action space, in sharp contrast with the unconstrained bandit setting.
Theorem 4.
For all t, let the choice set be the unit circle. Then any fair algorithm incurs regret linear in T.
The unit circle makes fair learning difficult for the following reasons: there is no finite set of points which for any θ contains the uniquely optimal action, and for any point in the circle and any finite set of observations, there is another point in the circle for which the algorithm cannot confidently determine relative reward. Since this property holds for every point, the fairness constraint transitively requires that the algorithm play every point uniformly at random, at every round.
References
 Abbasi-Yadkori et al. [2011] Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pages 2312–2320, 2011.
 Abe et al. [2003] Naoki Abe, Alan W Biermann, and Philip M Long. Reinforcement learning with immediate rewards and linear hypotheses. Algorithmica, 37(4):263–293, 2003.
 Anantharam et al. [1987] Venkatachalam Anantharam, Pravin Varaiya, and Jean Walrand. Asymptotically efficient allocation rules for the multi-armed bandit problem with multiple plays – part i: I.i.d. rewards. IEEE Transactions on Automatic Control, AC-32(Nov):968–976, 1987.
 Auer [2002] Peter Auer. Using confidence bounds for exploitation-exploration tradeoffs. Journal of Machine Learning Research, 3(Nov):397–422, 2002.
 Berk et al. [2017] Richard Berk, Hoda Heidari, Shahin Jabbari, Michael Kearns, and Aaron Roth. Fairness in criminal justice risk assessments: The state of the art. arXiv preprint arXiv:1703.09207, 2017.
 Chouldechova [2017] Alexandra Chouldechova. Fair prediction with disparate impact: A study of bias in recidivism prediction instruments. arXiv preprint arXiv:1703.00056, 2017.
 Corbett-Davies et al. [2017] Sam Corbett-Davies, Emma Pierson, Avi Feller, Sharad Goel, and Aziz Huq. Algorithmic decision making and the cost of fairness. arXiv preprint arXiv:1701.08230, 2017.
 Dani et al. [2008] Varsha Dani, Thomas P Hayes, and Sham M Kakade. Stochastic linear optimization under bandit feedback. In COLT, pages 355–366, 2008.
 Dwork et al. [2012] Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard Zemel. Fairness through awareness. In Proceedings of ITCS 2012, pages 214–226. ACM, 2012.
 Flaxman et al. [2005] Abraham D Flaxman, Adam Tauman Kalai, and H Brendan McMahan. Online convex optimization in the bandit setting: gradient descent without a gradient. In Proceedings of the sixteenth annual ACM-SIAM symposium on Discrete algorithms, pages 385–394. Society for Industrial and Applied Mathematics, 2005.
 Friedler et al. [2016] Sorelle A. Friedler, Carlos Scheidegger, and Suresh Venkatasubramanian. On the (im)possibility of fairness. In arXiv, volume abs/1609.07236, 2016. URL http://arxiv.org/abs/1609.07236.
 Hardt et al. [2016] Moritz Hardt, Eric Price, and Nathan Srebro. Equality of opportunity in supervised learning. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, volume abs/1610.02413, 2016. URL http://arxiv.org/abs/1610.02413.
 Joseph et al. [2016] Matthew Joseph, Michael Kearns, Jamie H Morgenstern, and Aaron Roth. Fairness in learning: Classic and contextual bandits. In Advances in Neural Information Processing Systems, pages 325–333, 2016.
 Kleinberg et al. [2017] J. Kleinberg, S. Mullainathan, and M. Raghavan. Inherent tradeoffs in the fair determination of risk scores. In ITCS, Jan 2017.
 Lindgren et al. [2016] Erik M Lindgren, Alexandros G Dimakis, and Adam Klivans. Facet guessing for finding the mbest integral solutions of a linear program. In NIPS Workshop on Optimization for Machine Learning, 2016.
 Lovász and Vempala [2006] László Lovász and Santosh Vempala. Hit-and-run from a corner. SIAM Journal on Computing, 35(4):985–1005, 2006.
 Tropp et al. [2015] Joel A Tropp et al. An introduction to matrix concentration inequalities. Foundations and Trends® in Machine Learning, 8(1-2):1–230, 2015.
 Vempala [2005] Santosh Vempala. Geometric random walks: a survey. Combinatorial and Computational Geometry, 52:573–612, 2005.
 Woodworth et al. [2017] Blake Woodworth, Suriya Gunasekar, Mesrob I Ohannessian, and Nathan Srebro. Learning non-discriminatory predictors. arXiv preprint arXiv:1702.06081, 2017.
 Zafar et al. [2017] Muhammad Bilal Zafar, Isabel Valera, Manuel Gomez Rodriguez, and Krishna P. Gummadi. Fairness beyond disparate treatment and disparate impact: Learning classification without disparate mistreatment. In Proceedings of World Wide Web Conference, 2017.
7 Appendix
7.1 Sub-Gaussian Definition
Sub-Gaussian random variables have moment generating functions bounded by that of a Gaussian, and hence their tails can be controlled via Chernoff bounds.
Definition 3.
A random variable $X$ with $\mathbb{E}[X] = 0$ is $R$-sub-Gaussian if, for all $\lambda \in \mathbb{R}$, $\mathbb{E}[e^{\lambda X}] \le e^{\lambda^2 R^2 / 2}$.
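As a concrete instance of Definition 3, a Rademacher variable ($\pm 1$ with probability $1/2$ each) is $1$-sub-Gaussian, since its moment generating function is $\cosh(\lambda) \le e^{\lambda^2/2}$. The following minimal sketch (function names are ours, not from the paper) checks this bound numerically:

```python
import math

def rademacher_mgf(lam):
    """MGF of a Rademacher variable (+/-1 w.p. 1/2 each): E[exp(lam*X)] = cosh(lam)."""
    return math.cosh(lam)

def gaussian_mgf_bound(lam, R=1.0):
    """The sub-Gaussian MGF bound exp(lam^2 * R^2 / 2) from Definition 3."""
    return math.exp(lam**2 * R**2 / 2)

# A Rademacher variable is 1-sub-Gaussian: cosh(lam) <= exp(lam^2/2) for all lam.
for lam in [x / 10 for x in range(-50, 51)]:
    assert rademacher_mgf(lam) <= gaussian_mgf_bound(lam) + 1e-12
```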
7.2 Proofs from Section 3
We start with full pseudocode for the algorithm.
Proof of Theorem 1.
We first claim that the confidence intervals are valid: that with probability $1-\delta$, for all rounds $t$ and all individuals $i$, the true expected reward of $i$ lies in $i$'s confidence interval. Assuming this claim, we prove that the algorithm is fair. Condition on the event that all intervals are valid, which occurs with probability $1-\delta$. Then, for any pair of individuals $i, j$, if the true expected reward of $i$ is at least that of $j$, the upper confidence bound of $i$ is at least the lower confidence bound of $j$, so $i$'s interval is not strictly below $j$'s. So, if $j$ belongs to some chain from which arms are selected, $i$ belongs either to a higher chain or to the same chain as $j$. Every individual belonging to a higher chain is played with weakly higher probability than any individual belonging to a lower chain, and every two individuals belonging to the same chain are played with equal probability, so $i$ is played with weakly higher probability than $j$. Thus, at all rounds and for all pairs of individuals, the fairness constraint is satisfied by this distribution over arms, and so the algorithm is fair.
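The chain structure the argument above relies on can be sketched in a few lines. This is an illustrative reimplementation, not the paper's pseudocode: we assume (consistent with the argument) that a chain is a maximal group of transitively overlapping confidence intervals, and the interval values below are hypothetical.

```python
def chains(intervals):
    """Group confidence intervals [lo, hi] into chains: sort by upper bound
    (descending) and merge intervals that transitively overlap."""
    order = sorted(intervals, key=lambda iv: -iv[1])
    result, current, current_lo = [], [], None
    for lo, hi in order:
        if current and hi < current_lo:      # gap below the chain so far: start a new chain
            result.append(current)
            current, current_lo = [], None
        current.append((lo, hi))
        current_lo = lo if current_lo is None else min(current_lo, lo)
    if current:
        result.append(current)
    return result

# Three intervals: the first two overlap (one chain); the third sits strictly below.
assert chains([(0.6, 1.0), (0.5, 0.8), (0.1, 0.4)]) == [[(0.6, 1.0), (0.5, 0.8)], [(0.1, 0.4)]]
```

Individuals within one returned chain would be played with equal probability, and every member of an earlier (higher) chain with weakly higher probability than every member of a later one.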
We now prove that the confidence intervals are valid: with probability $1-\delta$, for all rounds $t$ and all feature vectors $x$, $|\langle x, \hat\theta_t \rangle - \langle x, \theta^* \rangle|$ is at most the width of the confidence interval for $x$. We adopt the notation in Abbasi-Yadkori et al. (2011): let $\bar V_t = \lambda I + X_t^\top X_t$, where $X_t$ is the design matrix at time $t$. Let $\hat\theta_t = \bar V_t^{-1} X_t^\top Y_t$ be the regularized least squares estimator at time $t$.
Consider a feature vector $x$ at time $t$. For a $d$-dimensional vector $z$ and a positive definite matrix $A$, let $\|z\|_A$ denote $\sqrt{z^\top A z}$. Let $\eta$ be the noise sequence prior to round $t$. Then, we have $Y_t = X_t \theta^* + \eta$. Then some matrix algebra in the proof of Theorem 2 of Abbasi-Yadkori et al. (2011) shows
$$\hat\theta_t - \theta^* = \bar V_t^{-1} X_t^\top \eta - \lambda \bar V_t^{-1} \theta^*,$$
which using the above notation gives
$$\langle x, \hat\theta_t \rangle - \langle x, \theta^* \rangle = \langle x, \bar V_t^{-1} X_t^\top \eta \rangle - \lambda \langle x, \bar V_t^{-1} \theta^* \rangle.$$
Applying Cauchy-Schwarz,
$$|\langle x, \hat\theta_t \rangle - \langle x, \theta^* \rangle| \le \|x\|_{\bar V_t^{-1}} \left( \|X_t^\top \eta\|_{\bar V_t^{-1}} + \lambda \|\theta^*\|_{\bar V_t^{-1}} \right) \le \|x\|_{\bar V_t^{-1}} \left( \|X_t^\top \eta\|_{\bar V_t^{-1}} + \sqrt{\lambda} \right),$$
where the last step follows from the fact that $\|\theta^*\|_{\bar V_t^{-1}} \le \|\theta^*\|_2 / \sqrt{\lambda_{\min}(\bar V_t)} \le \|\theta^*\|_2 / \sqrt{\lambda}$ (a basic corollary of the Rayleigh quotient) and the fact that $\|\theta^*\|_2 \le 1$ by assumption.
We now present a result derived from Abbasi-Yadkori et al. (2011) that will help us upper bound this quantity. The bound on $\det(\bar V_t)$ at the bottom of page 13 of Abbasi-Yadkori et al. (2011) and the bound on $\|X_t^\top \eta\|_{\bar V_t^{-1}}$ at the top of page 15, combined with our assumption that the noise is $R$-sub-Gaussian, imply that with probability at least $1-\delta$, for all rounds $t$,
$$\|X_t^\top \eta\|_{\bar V_t^{-1}} \le R \sqrt{d \log\left(\frac{1 + t L^2 / \lambda}{\delta}\right)}.$$
Using this result and combining the inequalities, we get that with probability $1-\delta$, over all rounds $t$ and all feature vectors $x$,
$$|\langle x, \hat\theta_t \rangle - \langle x, \theta^* \rangle| \le \|x\|_{\bar V_t^{-1}} \left( R \sqrt{d \log\left(\frac{1 + t L^2 / \lambda}{\delta}\right)} + \sqrt{\lambda} \right), \qquad (2)$$
and therefore the claim that the confidence intervals are valid holds.
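A minimal numerical sketch of a confidence interval of the form in Equation (2) follows. This is not the paper's pseudocode: the function name and parameter defaults are ours, and the check uses noiseless data, where the estimation error is just the regularization bias $\lambda |\langle x, \bar V_t^{-1} \theta^* \rangle| \le \sqrt{\lambda}\, \|x\|_{\bar V_t^{-1}}$, so the true value provably lies inside the interval.

```python
import numpy as np

def ridge_confidence(X, Y, x, lam=1.0, R=1.0, delta=0.05):
    """Ridge estimate of <x, theta> with a width of the form in Equation (2):
    ||x||_{V^-1} * (R * sqrt(d * log((1 + t L^2 / lam) / delta)) + sqrt(lam))."""
    t, d = X.shape
    L = max(np.linalg.norm(X, axis=1).max(), np.linalg.norm(x))
    V = lam * np.eye(d) + X.T @ X
    theta_hat = np.linalg.solve(V, X.T @ Y)
    width_dir = np.sqrt(x @ np.linalg.solve(V, x))          # ||x||_{V^{-1}}
    width = width_dir * (R * np.sqrt(d * np.log((1 + t * L**2 / lam) / delta))
                         + np.sqrt(lam))
    return x @ theta_hat, width

# Noiseless sanity check: with zero noise the error is only the regularization
# bias, which is at most sqrt(lam) * ||x||_{V^{-1}} <= width.
rng = np.random.default_rng(0)
theta = np.array([0.6, -0.3, 0.4]); theta /= np.linalg.norm(theta)  # ||theta|| = 1
X = rng.normal(size=(50, 3))
Y = X @ theta
x = np.array([1.0, 0.5, -0.2])
est, w = ridge_confidence(X, Y, x, lam=1.0)
assert abs(est - x @ theta) <= w
```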
Regret bound for selecting a single individual
We now proceed with upper bounding the regret of the algorithm. With probability $1-\delta$, the confidence intervals are valid. We condition on that event for the analysis of the regret of the algorithm, so this regret bound holds with high probability (namely, with probability $1-\delta$).
We start with a bound that will be useful for analyzing the widths of our confidence intervals. The top of page 15 of Abbasi-Yadkori et al. (2011) notes that $\sum_{t=1}^T \|x_t\|^2_{\bar V_{t-1}^{-1}} \le 2 \log \frac{\det \bar V_T}{\det(\lambda I)}$, and we combine this with the fact that $\det(\bar V_T) \le (\lambda + T L^2 / d)^d$ (proven as part of Lemma 11 in Abbasi-Yadkori et al. (2011)) to get that
$$\sum_{t=1}^T \|x_t\|^2_{\bar V_{t-1}^{-1}} \le 2 d \log\left(1 + \frac{T L^2}{d \lambda}\right). \qquad (3)$$
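The bound in Equation (3) can be checked numerically. The sketch below (our own illustration, with $\lambda = L^2$ so that each summand is at most $1$) accumulates the left-hand side over random feature vectors and compares it to the right-hand side:

```python
import numpy as np

# Numeric check of the elliptical potential bound behind Equation (3):
# with lam >= L^2, sum_t ||x_t||^2_{V_{t-1}^{-1}} <= 2 d log(1 + T L^2 / (d lam)).
rng = np.random.default_rng(1)
T, d, L = 200, 4, 1.0
lam = L**2
V = lam * np.eye(d)
lhs = 0.0
for _ in range(T):
    x = rng.normal(size=d)
    x = L * x / np.linalg.norm(x)          # ||x_t|| = L
    lhs += x @ np.linalg.solve(V, x)       # ||x_t||^2 in the V_{t-1}^{-1} norm
    V += np.outer(x, x)                    # V_t = V_{t-1} + x_t x_t^T
rhs = 2 * d * np.log(1 + T * L**2 / (d * lam))
assert lhs <= rhs
```

The inequality holds deterministically here, since $\lambda \ge L^2$ makes every term at most $1$, where $u \le 2\log(1+u)$ applies.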
We now have all the tools needed to analyze the algorithm’s regret. First note that the choice of the algorithm is a singleton: $S_t = \{a_t\}$ for some $a_t \in C_t$, the active chained set. Further, since the confidence intervals are valid, $a^*_t \in C_t$ for the best action $a^*_t$. By the definition of a chain, the instantaneous regret of any $a \in C_t$ is at most the sum of the widths of the confidence intervals of the individuals in $C_t$ (as any $a \in C_t$ is chained to some other arm in $C_t$). Summing over rounds and applying Cauchy-Schwarz together with Equation (3) then yields the claimed regret bound.
Regret bound for selecting multiple individuals
This regret bound relies on an analysis similar to the single-selection case, with the following changes. The algorithm now selects $k$ individuals, not all from the top chain, but instead from several chains. For each of the top $k$ choices, we relate the reward of that choice to the reward of one of our choices in the following way. For each $j$, consider the $j$th top chain. We claim that if the $j$th top chain contains $m_j$ of the top $k$ choices, our algorithm selects $m_j$ individuals from the $j$th top chain. We prove this claim by induction. First, however, we notice that every individual in the $j$th top chain has strictly higher reward than every individual in any lower chain. For the first top chain, the algorithm's selection contains either the entire chain or $k$ individuals from the top chain. As every individual in the top chain has strictly higher reward than any individual in a lower chain, in the former case every individual in the first top chain belongs to the top $k$; in the latter case, the top $k$ is entirely contained in the top chain. Thus, the algorithm's selection and the top $k$ choices contain either all individuals in the top chain or exactly the same number of them. Then, assuming the claim for the first $j-1$ top chains, both selections have the same “capacity” remaining for individuals in the $j$th top chain (and therefore either take all of the $j$th top chain or fewer but the same number from it). This proves the claim.
Then, we relate the reward of each of the top $k$ choices to some action selected by the algorithm belonging to the same chain. Following the previous claim, we can form a matching between the algorithm's selections and the top $k$ choices in which all matched pairs belong to the same chain. The single-selection analysis bounds the difference between the reward of any individual in the $j$th top chain and any other individual in the $j$th top chain. Summing up over all $k$ choices, the total regret is at most $k$ times the loss suffered in the 1-bandit case.
Regret bound for selecting all individuals with positive upper confidence bound
The regret bound for this case reduces to lower bounding the reward incurred by playing arms with negative expected reward. Any individual selected by the algorithm has expected reward within the sum of the widths of the confidence intervals in its chain of some individual whose UCB is positive. So, the expected reward of any action chosen from the $j$th top chain is at least the negative of the sum of the confidence interval widths in that chain; equivalently, the loss from that action is at most the sum of those widths. Thus, summing up over all individuals selected, one gets regret which is at most the number of individuals selected times worse than in the single-selection case. ∎
7.3 Proofs from Section 4
We begin with the full pseudocode for FairGap.
We start our proof of Theorem 4 with a lemma bounding the contribution of noise to our confidence intervals.
Lemma 1.
Let $\eta_1, \dots, \eta_n$ be i.i.d. draws of $R$-sub-Gaussian noise. Then, with probability at least $1-\delta$,
$$\left| \frac{1}{n} \sum_{i=1}^n \eta_i \right| \le R \sqrt{\frac{2 \ln(2/\delta)}{n}}.$$
Proof of Lemma 1.
A Hoeffding bound, in the general form for unbounded sub-Gaussian variables, implies that
$$\Pr\left[ \left| \sum_{i=1}^n \eta_i \right| > \varepsilon \right] \le 2 \exp\left( -\frac{\varepsilon^2}{2 n R^2} \right),$$
so taking $\varepsilon = R \sqrt{2 n \ln(2/\delta)}$ yields the desired result. ∎
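A quick Monte Carlo check of Lemma 1, assuming the bound takes the standard sub-Gaussian form $R\sqrt{2\ln(2/\delta)/n}$ (the trial counts and parameters below are ours): for standard Gaussian noise ($R = 1$), the empirical mean should exceed the threshold with frequency at most $\delta$.

```python
import math
import random

# Monte Carlo check of the concentration bound in Lemma 1 for Gaussian noise
# (R = 1): the empirical mean of n draws exceeds R*sqrt(2*ln(2/delta)/n)
# in absolute value with frequency at most delta.
random.seed(0)
n, delta, trials = 100, 0.05, 2000
threshold = math.sqrt(2 * math.log(2 / delta) / n)
violations = sum(
    abs(sum(random.gauss(0, 1) for _ in range(n)) / n) > threshold
    for _ in range(trials)
)
assert violations / trials <= delta
```

For Gaussian noise the true violation probability is well below $\delta$, so the empirical rate passes with a comfortable margin.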
Next, since the regret bound we will prove depends on $\lambda_0$, the minimum over rounds of the smallest eigenvalue of the expected outer product of a vector drawn uniformly at random from each action set, we will need $\lambda_0 > 0$ in order for this bound to make sense. We prove this in another lemma.
Lemma 2.
Given a sequence of action sets where each has nonzero Lebesgue measure and is contained in a ball of radius $L$,