Multi-variate web optimisation using linear contextual bandits



EXPEDIA GROUP TECHNOLOGY — DATA

Or how you can run full webpage optimisations with a context-aware outcome.


Contextual multi-armed bandits offer promising opportunities to improve web content. We describe how to use them to optimise several aspects of a page at the same time. We will cover a common example, a promising model, and an infrastructure sketch to deploy it live. This blog is intended for data scientists and machine learning practitioners who already know the basics of multi-armed bandit algorithms.


Optimising the website

E-commerce websites have a constant need to improve the end-user experience. These changes range from tweaking the colour of a button to a total rebuild of a page layout. Typically, a page is optimised with a series of concurrent A/B tests run by several product teams. Different ideas for separate aspects of the page are independently tested and rolled out as soon as the respective test reading is positive. This has been shown to improve websites in the long term and has become a staple approach in many companies. However, it raises three main concerns:


  • Ideas are usually tested independently, which can miss positive or negative interactions between them.

  • Standard A/B testing does not scale well as the number of variants increases. This becomes especially problematic when attempting to combine several tests into one, using an exponentially-increasing number of combinations of variants.

  • There is no way to add context in a scalable way. When implemented, it is usually done as a segmentation (a separate test per segment), which is not the best use of the traffic as it splits it between a series of independent and thus less precise tests.

Given these bottlenecks, the recent development in contextual (multi-armed) bandits applied to web optimisation has shown a lot of promise. In a nutshell, contextual bandits are a family of algorithms which learn from online feedback to find the best possible options for customers. They naturally balance exploring existing options and focusing on the most promising ones.


Contextual bandits automatically experiment with different options and learn from customers' responses.

Some groundbreaking papers [2–4] have shown that these techniques can alleviate the aforementioned issues by:


  • Repurposing traffic to more promising options on the fly, which reduces the risk of poor variants running for too long and makes better use of the traffic in general.

  • Running optimisations as a single statistical/machine-learning model, which enables combining tests into a bundle and taking interactions into account.

  • Optimising for specific user sub-segments, which becomes not only possible but also maintainable.

For the remainder of this entry, we present a generic use case, then go into detail on how to apply a linear model. We describe the algorithms used to update and explore the feature space. Finally, we sketch how to put it in production.



The use case

Let us illustrate the above with a simple example (this is not a real use case): say we have a mobile landing page like the Hotels.com (part of Expedia Group™) landing page for Rome. There are several aspects that the team would like to explore: the welcome message, the size of the image, and the complexity of the search module. Each of them has three variants:


A UI with a greeting, image, and search fields, with two or three alternative appearances for each
This example is not real but is close to the typical type of testing on a webpage.

Combining the changes of three aspects together yields 3³ = 27 unique layouts — clearly, this does not scale well in a classic testing approach as it splits the traffic among many variants. Yet, these small changes combined can produce quite distinctive user experiences:


Three of the possible UI renderings from the alternatives described above, with different features visible on the screen
Three combinations of smaller changes can have a big impact. For instance, shorter modules make ‘guarantee messaging’ more prominent.

Additionally, we would like to be able to find the best possible layout for a specific context, for instance returning versus new users, or the customer's country. This type of use case is very common across the website and can be specified as follows:


  1. We are trying to find the optimal layout of a page for a certain context.

  2. This layout is composed of several target aspects (e.g. the welcome message or the size of the image).

  3. Each aspect has several variants (e.g. the welcome message can be “Good Morning”, “Welcome Back”, or removed).

  4. Similarly, the context is composed of several categories with levels — this could be, for instance, the customer's country of origin: France, Japan, Germany, etc. (We do not consider continuous variables at this stage.)

  5. We want to optimise for a specific reward (i.e. a metric, or KPI), which is usually the progression rate of the page. There can be different definitions of progression, and choosing one is a crucial decision. In this case, we target the event of a user reaching another page directly from this one.
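To make the specification concrete, here is a minimal sketch of how the example above could be encoded. The aspect and context names are illustrative (taken from the running example), not from a real system:

```python
from itertools import product

# Hypothetical aspects of the layout, each with three variants.
aspects = {
    "welcome_message": ["Good Morning", "Welcome Back", "removed"],
    "image_size": ["small", "medium", "large"],
    "search_module": ["minimal", "standard", "full"],
}

# Hypothetical context categories, each with a few discrete levels.
context = {
    "user_type": ["new", "returning"],
    "country": ["France", "Japan", "Germany"],
}

# Every unique layout is one combination of variants: 3 * 3 * 3 = 27.
layouts = list(product(*aspects.values()))
print(len(layouts))  # 27
```

Enumerating the combinations makes the scaling problem visible: a classic test would have to split traffic 27 ways.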

The model

To tackle this use case, we use linear contextual multi-armed bandits. This is a quite common approach, presented by Agrawal et al. [1], and has been used in very influential projects such as the works from Hill et al. [2] and Chapelle et al. [3], which are the main sources of inspiration for this implementation.

Let us define the model:


  • The layouts A are represented by K categorical variables. The context C is encoded in the same way with D categorical variables. Each of the categorical variables has a specific number of levels:


$$a = (a_1, \dots, a_K),\; a_k \in \{1, \dots, M_k\} \qquad c = (c_1, \dots, c_D),\; c_d \in \{1, \dots, N_d\}$$
  • We define m as the function mapping A and C into the feature space X. A feature vector is of length L and is composed of the K one-hot-encoded aspects, the D one-hot-encoded contexts and first-order interactions between categorical variables:


$$x = m(a, c) = \big[\,\mathrm{onehot}(a_1), \dots, \mathrm{onehot}(a_K),\; \mathrm{onehot}(c_1), \dots, \mathrm{onehot}(c_D),\; \text{interaction terms}\,\big] \in \{0, 1\}^L$$
  • Which variables to interact is a modelling decision. It is however recommended to have at least the first-order interactions between A and C, to enable the model to learn which layout works better in which context. We also recommend using first-order interactions between aspects of A to capture influences of one aspect over another.

  • We link the expected value of the Bernoulli random variable R and an observation x (i.e. a and c) with a sigmoid function and a vector of parameters w:

$$\mathbb{E}[R \mid x] = \sigma(w^\top x) = \frac{1}{1 + e^{-w^\top x}}$$
  • For each context c, there is an optimal layout a* that maximises the expected reward:


$$a^*(c) = \operatorname*{arg\,max}_{a \in A} \; \mathbb{E}[R \mid m(a, c)]$$
  • As in most multi-armed bandit cases, the objective is to minimise regret (Δ) which is defined as the sum of differences between the expected reward of the optimal layout and the one chosen at time t:


$$\Delta(T) = \sum_{t=1}^{T} \Big( \mathbb{E}\big[R \mid m(a^*(c_t), c_t)\big] - \mathbb{E}\big[R \mid m(a_t, c_t)\big] \Big)$$
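The definitions above can be sketched in code. This is a hedged illustration assuming index-encoded variables; for brevity it builds only the aspect × context interaction terms (the aspect × aspect interactions recommended above would be appended the same way):

```python
import numpy as np

def one_hot(level: int, n_levels: int) -> np.ndarray:
    v = np.zeros(n_levels)
    v[level] = 1.0
    return v

def feature_map(a, c, aspect_levels, context_levels):
    """Map a layout `a` and a context `c` (lists of level indices) to the
    feature vector x: one-hot blocks for every variable, followed by
    first-order interactions between each (aspect, context) pair."""
    blocks = [one_hot(ai, m) for ai, m in zip(a, aspect_levels)]
    blocks += [one_hot(ci, n) for ci, n in zip(c, context_levels)]
    # Aspect x context interactions as flattened outer products.
    for ai, m in zip(a, aspect_levels):
        for ci, n in zip(c, context_levels):
            blocks.append(np.outer(one_hot(ai, m), one_hot(ci, n)).ravel())
    return np.concatenate(blocks)

def expected_reward(w, x):
    # Sigmoid link between the parameters and the Bernoulli probability.
    return 1.0 / (1.0 + np.exp(-w @ x))
```

With three aspects of three variants each and two context categories (two and three levels), the feature vector has 9 + 5 + 45 = 59 entries, of which K + D + K·D = 11 are ones.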

To accomplish this, we will need to update this model online with the feedback generated by our customers. We propose the following updating scheme:

  • A logistic regression will be trained on batches of feedback made of N tuples (rᵢ, xᵢ), with rᵢ ∈ {0,1} and xᵢ ∈ X.

  • We use a Bayesian updating strategy in which the current distribution of w serves as prior. With the likelihood of the batch we can update the distribution (posterior):


$$p\big(w \mid (r_1, x_1), \dots, (r_N, x_N)\big) \;\propto\; p(w) \prod_{i=1}^{N} p(r_i \mid x_i, w)$$
  • We assume that each weight follows an independent Gaussian distribution:

$$w_j \sim \mathcal{N}(\mu_j, \sigma_j^2), \qquad j = 1, \dots, L$$
  • The following cost function can be used to find the new vector of means of w:


$$\operatorname*{arg\,min}_{w} \; \sum_{j=1}^{L} \frac{(w_j - \mu_j)^2}{2\sigma_j^2} \;-\; \sum_{i=1}^{N} \Big[ r_i \log p_i + (1 - r_i) \log(1 - p_i) \Big], \qquad p_i = \sigma(w^\top x_i)$$
  • However, with this likelihood function and a Gaussian prior, the posterior distribution has no analytical form. It is possible, though, to use the Laplace approximation (which can be grossly summarised as approximating a reasonably well-behaved density function with a Gaussian one; please see references [3,5] for more details), yielding a simple analytical solution with the posterior following a Gaussian distribution (see Algorithm 3).

The implementation

A contextual multi-armed bandit essentially needs to be able to perform two operations: choosing a layout given a context, and updating from the feedback generated by customers. Our implementation has three key aspects:

  • To choose a layout to display, we use the Thompson sampling heuristic, which consists of drawing a vector w̃ from the current distribution of w, then choosing the arm a, given a context c, that maximises the expected reward.

  • To make the above scalable at a web traffic level, the sampling will be modified to do a greedy hill-climbing search on the layout space A to reduce complexity and latency.


  • To learn from the feedback, the distribution of the parameters w is updated using a Laplace approximation with batches of new observations from the layouts sampled from the step above and the observed reward.


The following sections explain each in detail.

Thompson sampling

Thompson sampling is one of the most common heuristics for multi-armed bandits. It can be easily applied to a linear model, as proposed in Agrawal et al. [1]. We assume each weight of w follows an independent Gaussian distribution. Thompson sampling (Algorithm 1) consists of:

  • First, sampling a set of random parameters w̃.

  • Second, choosing the arm that maximises the expected reward given those sampled parameters and a given context c.


Algorithm 1 (Thompson sampling): draw $\tilde{w}_j \sim \mathcal{N}(\mu_j, \sigma_j^2)$ for $j = 1, \dots, L$, then select the layout $a = \operatorname{arg\,max}_{a \in A} \tilde{w}^\top m(a, c)$.
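A minimal sketch of the two steps, assuming a `feature_map` function that implements the mapping m and a list of candidate layouts (both hypothetical names, not from the original code):

```python
import numpy as np

rng = np.random.default_rng(0)

def thompson_sample_layout(mu, sigma2, layouts, context, feature_map):
    """Algorithm 1 sketch: draw w~ from the current independent-Gaussian
    posterior, then pick the layout with the highest expected reward.
    The sigmoid link is monotonic, so ranking by w~ . x is equivalent."""
    w_tilde = rng.normal(mu, np.sqrt(sigma2))
    scores = [w_tilde @ feature_map(a, context) for a in layouts]
    return layouts[int(np.argmax(scores))]
```

Note that this version scores every layout, which is exactly the scalability problem the next section addresses.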

Greedy hill climbing to speed up the sampling

One of the problems with Thompson sampling is the need to score all the possible layouts before selecting the maximum:

$$a = \operatorname*{arg\,max}_{a \in A} \; \tilde{w}^\top m(a, c)$$

This does not scale well when the number of aspects and variants increases (i.e. O(Mᴷ) per request with M levels fixed across K aspects). Hill et al. [2] have proposed an alternative that uses a greedy hill climbing strategy. This method does not guarantee finding the absolute maximum but significantly reduces the complexity of sampling. Described in Algorithm 2, the intuition of this approach is as follows:

  • We are given a context c, a sampled weight vector w̃, and a number of climbing steps S.

  • Select a random layout a⁰.


  • For each climbing step, select a random aspect k̂ to optimise. For all the variants of this aspect (keeping the other aspects fixed), score the layouts and pick the one with the highest score.

  • Continue picking random aspects, scoring the variants, and selecting the one with the highest score for the remainder of the climbing steps. Optionally, you can add a stopping rule which fires when all aspects have been explored without improvement, which means a local maximum has been reached (not presented in Algorithm 2 for simplicity).

  • To mitigate the risk of ending in a local maximum, you run the above with R parallel starts.


The main advantage of this algorithm is that it drastically reduces the sampling complexity from O(Mᴷ) to O(M·R·S). There is an obvious trade-off between the risk of not selecting the currently optimal value and the computational cost, which increases with R and S. At a high level, this can be summarised as balancing between generating less regret and keeping latency low.

Algorithm 2 (greedy hill climbing): starting from R random layouts, repeat for S steps — pick a random aspect k̂, score all its variants with the other aspects fixed, and keep the best — then return the highest-scoring layout found across the R starts.
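A hedged sketch of the procedure, again assuming a hypothetical `feature_map` implementing m and layouts encoded as lists of variant indices:

```python
import numpy as np

rng = np.random.default_rng(0)

def hill_climb(w_tilde, context, aspect_levels, feature_map,
               n_steps=10, n_starts=3):
    """Algorithm 2 sketch: greedy coordinate ascent over the layout space.
    Instead of scoring all M^K layouts, repeatedly pick a random aspect
    and set it to its best variant with the others fixed; to reduce the
    risk of a poor local maximum, restart from n_starts random layouts."""
    best_layout, best_score = None, -np.inf
    for _ in range(n_starts):
        layout = [int(rng.integers(m)) for m in aspect_levels]  # random a0
        score = w_tilde @ feature_map(layout, context)
        for _ in range(n_steps):
            k = int(rng.integers(len(aspect_levels)))  # random aspect k-hat
            for v in range(aspect_levels[k]):          # score all its variants
                trial = layout[:k] + [v] + layout[k + 1:]
                s = w_tilde @ feature_map(trial, context)
                if s > score:
                    score, layout = s, trial
        if score > best_score:
            best_score, best_layout = score, layout
    return best_layout
```

Each start costs at most S·M scores, giving the O(M·R·S) complexity mentioned above.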

Updating the parameters

Finally, we use the feedback from our customers to update the algorithm. To do so, we use the approach proposed by Chapelle et al. [3]. Algorithm 3 has two main steps:

  1. Update the means by minimising the cost function with any gradient-descent-based optimisation.
  2. Update the variances using the Laplace approximation.

Algorithm 3 (Bayesian batch update): find the new means $\mu'$ by minimising the cost function, then update the variances with the Laplace approximation: $\frac{1}{\sigma_j'^2} = \frac{1}{\sigma_j^2} + \sum_{i=1}^{N} x_{ij}^2 \, p_i (1 - p_i)$, with $p_i = \sigma(\mu'^\top x_i)$.
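A compact sketch of the batch update, with plain fixed-step gradient descent standing in for "any gradient-descent-based optimisation" and rewards encoded as 0/1 (both assumptions for illustration, not the exact choices of the referenced papers):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def laplace_update(mu, sigma2, X, r, n_iter=200, lr=0.1):
    """Algorithm 3 sketch (after Chapelle et al. [3]): refit the Gaussian
    posterior on a batch (X, r). Means: gradient descent on the
    prior-regularised cross-entropy cost; variances: Laplace approximation
    around the new mode."""
    w = mu.copy()
    for _ in range(n_iter):
        p = sigmoid(X @ w)
        # Gradient of the cost: Gaussian-prior term plus logistic likelihood.
        grad = (w - mu) / sigma2 + X.T @ (p - r)
        w -= lr * grad
    p = sigmoid(X @ w)
    # Posterior precision adds the observed curvature of the log-likelihood.
    new_sigma2 = 1.0 / (1.0 / sigma2 + (X**2).T @ (p * (1 - p)))
    return w, new_sigma2
```

The returned means and variances overwrite the parameter store and become the prior for the next batch.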

Production sketch

To give a better understanding, here is a simple diagram of how this approach is usually put into production across the industry. The main idea is decoupling the updating (training) and sampling facets to make the infrastructure more resilient.

The trainer can be scheduled with a cron job to run hourly. The parameter store is a simple database holding the latest state of the distribution of w. Finally, the sampler can live in several instances scaling horizontally. To reduce latency, it is a good idea to update the parameters periodically and keep them in the sampler's memory.

Flow diagram showing cooperation between a trainer, parameter store, sampler, website application and a user
This is a sketch of a production system that can be used to implement the contextual bandit and where the 3 algorithms would live.

Opportunities ahead

This model is simple and scalable, yet it can capture new patterns in users' preferences that were not accessible before. This is especially important to support personalisation, which is a key strategic target for e-commerce companies like Expedia. Additionally, this model can accelerate the page optimisation programme, as it is equivalent to running several tests at the same time while capturing positive or negative interactions. Finally, bandits naturally repurpose traffic to more promising options and reduce the opportunity cost of exploring poor ideas, effectively accelerating decision making.

In our opinion, this offers non-negligible improvements over classic A/B testing programmes, as optimising several aspects of a page at the same time for different contexts is a ubiquitous problem for many web companies. These algorithms are easy to implement, and using a limited set of parameters makes it possible to keep everything in memory, reducing the need for a complex production infrastructure.

Stay tuned for more posts on the infrastructure, successful use cases, simulations and extensions! Meanwhile, check our previous post on how we optimised the main images of properties using a simple bandit approach:


How We Optimised Hero Images on Hotels.com using Multi-Armed Bandit Algorithms



A big thank you to Christian Sommeregger, Kirubhakaran Krishnathan, Dionysios Varelas and Travis Brady for reviewing this entry.


Translated from: https://medium.com/expedia-group-tech/multi-variate-web-optimisation-using-linear-contextual-bandits-567f563cb59
