
Thursday, April 26, 2018

What is Learning? — An Example of Linear Classification

Problem. Suppose that we are a bank trying to learn the creditworthiness of our current customers. Our resources include a database of the history of previous customers who have been already classified as defaulters or not. Based on this history, we are to learn the function $f$ which takes a customer as input and spits out the (binary) value of their creditworthiness ("yes" or "no").

[Assume that the customers are linearly separable.]

Solution. According to our understanding of learning, the following are the quantities we need to analyze.

  1. The input space $\mathcal X$ is the space of all customers. Each customer $\vec x\in \mathcal X$ is represented by a list of numbers, e.g. $\vec x = [$age, annual salary, years in residence, years in job, current debt,$\ \dots]$, which add like vectors (of course, customers do not add, but numbers do). Therefore, $\mathcal X$ can be identified with a Euclidean vector space, say $\mathbb R^n \ni \vec x = (x_1, \dots, x_n)$. In other words, each customer is mathematically an $n$-dimensional vector in the space $\mathcal X \cong \mathbb R^n$.
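     To make this concrete, here is a minimal sketch in NumPy of a customer as a point in $\mathbb R^n$. The feature names and values are purely illustrative, not drawn from any real dataset.

     ```python
     import numpy as np

     # A hypothetical customer as a feature vector in R^5
     # (illustrative features: age, annual salary, years in
     #  residence, years in job, current debt):
     customer = np.array([35.0, 60000.0, 4.0, 2.0, 1500.0])

     # Two customers do not "add", but their feature vectors do:
     another = np.array([52.0, 48000.0, 10.0, 7.0, 300.0])
     print(customer + another)   # componentwise vector addition
     print(customer.shape)       # (5,) -- one customer = one point in R^5
     ```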

  2. The target space is $Y=\mathbb R$.

  3. The target function $f:\mathbb R^n \to \mathbb R$ is such that its image $f(\mathbb R^n) \subseteq \{\pm 1\} \subset \mathbb R$ is a binary set representing "yes" or "no" answers. In other words, a customer $\vec x$ is creditworthy if $f(\vec x) = +1$ and is not if $f(\vec x) = -1$.

  4. The training examples $(\vec x_1, y_1), (\vec x_2,y_2), \dots, (\vec x_N,y_N)$ are given by the historical records of previous credit customers. Here $N$ denotes the total number of training examples.

  5. The hypothesis set or model $\mathcal H$ we will employ here is known as the Perceptron. Since the data points live in $\mathbb R^n$, why not separate them by constructing a hyperplane, on one side of which live the creditworthy customers and on the other side live the defaulters! That we will be able to do so is an assumption known as linear separability of the data points. We have to make this assumption as part of the model and make sure that the problem statement guarantees it. Thus the hypothesis set consists of all possible hyperplanes in $\mathbb R^n$. The hope is that there is at least one hypothesis (hyperplane) that segregates the given inputs $\vec x_i$ according to the desired outputs $y_i$ in the training data.

    Recall that a hyperplane passing through the origin is described by a linear equation, such as $ \vec w \cdot \vec x = 0 $ where $\vec w$ is a vector perpendicular to the hyperplane. Since the dot or inner product is a measure of how much one vector projects onto another, this equation tells us that the hyperplane is described by all such points $\vec x$ which are perpendicular to the given direction $\vec w $. In this sense, $\vec w$ chooses an orientation for the hyperplane. For a general hyperplane that does not pass through the origin, $\vec w $ will make some angle with the points $\vec x \not\perp \vec w$ in the hyperplane and the inner product will be non-zero. The hyperplane is defined by the fact that this non-zero inner product is the same constant for all the points in the plane; calling that constant $-b$, a general equation for the hyperplane is $$\vec w \cdot \vec x + b = 0\,.$$ In fact, if $\vec w$ is a unit vector, then $|b|$ measures the shortest distance of the hyperplane from the origin. In two dimensions ($n=2$), the following diagram illustrates the above discussion.


    In two dimensions, $\vec w \cdot \vec x + b = 0$ represents a straight line perpendicular to the direction $\vec w$ and offset from the origin by a distance $|b|/\lVert\vec w\rVert$. Moreover, notice that any point in the green area (along the direction of $\vec w$) has $\vec w \cdot \vec{\color{green}{x}} + b > 0$ whereas any point in the red area (opposite to the direction $\vec w$) has $\vec w \cdot \vec{\color{red}{x}} + b < 0$. To see this, observe that for any point $\vec x$ that lies on the line $\vec w\cdot \vec x + b = 0$, the dot product is $\vec w \cdot \vec x = -b$. Moving off the line against the direction of $\vec w$ decreases the projection onto $\vec w$, so for points $\vec{\color{red}{x}}$ in the red area, $\vec w \cdot \vec{\color{red}{x}} < -b$, i.e. $\vec w \cdot \vec{\color{red}{x}} + b < 0$. Similarly, moving off the line along the direction of $\vec w$ increases the projection, hence for points $\vec{\color{green}{x}}$ in the green area, $\vec w \cdot \vec{\color{green}{x}} > -b$, i.e. $\vec w \cdot \vec{\color{green}{x}} + b > 0$.
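    A quick numerical sanity check of this sign argument, with illustrative values I have chosen (the line $x_1 + x_2 = 1$, i.e. $\vec w = (1,1)$ and $b = -1$):

    ```python
    import numpy as np

    # Line w . x + b = 0 with w = (1, 1), b = -1, i.e. x1 + x2 = 1.
    w = np.array([1.0, 1.0])
    b = -1.0

    point_above = np.array([2.0, 2.0])  # on the side w points toward
    point_below = np.array([0.0, 0.0])  # on the opposite side
    point_on = np.array([0.5, 0.5])     # exactly on the line

    print(np.dot(w, point_above) + b)   # 3.0  (> 0: the "green" side)
    print(np.dot(w, point_below) + b)   # -1.0 (< 0: the "red" side)
    print(np.dot(w, point_on) + b)      # 0.0  (on the line itself)
    ```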

    This means that if we take the sign of $\vec w \cdot \vec x + b$, we will have split the space into two halves. Our hypothesis set $\mathcal H$ consists of functions of the form $$ h(\vec x) = \operatorname{sign} (\vec w \cdot \vec x + b) = \begin{cases} \color{green}{+1} &\text{if } \vec w \cdot \vec x + b > 0\,, \\ \color{red}{-1} &\text{if } \vec w \cdot \vec x +b < 0\,, \end{cases}$$ (points lying exactly on the hyperplane, where $\vec w \cdot \vec x + b = 0$, can be assigned to either class by convention) which are parametrized by our choice of the orientation-fixing normal (perpendicular) vector $\vec w$ and the offset from the origin $b$. These parameters are called the weights and the bias, respectively. Lastly, for simplicity, we write $\vec w \cdot \vec x +b \equiv w \cdot x$ where $w := (b, \vec w)^T$ and $x := (1, \vec x)^T$ are the augmented vectors.
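    A minimal sketch of such a hypothesis, using the augmented-vector trick just described (variable names are mine; ties at zero are assigned $-1$ here as an arbitrary convention):

    ```python
    import numpy as np

    def augment(x_vec):
        """Prepend the constant coordinate x0 = 1: x = (1, x1, ..., xn)."""
        return np.concatenate(([1.0], x_vec))

    def hypothesis(w, x):
        """Perceptron hypothesis h(x) = sign(w . x) on augmented vectors.

        w = (b, w1, ..., wn) bundles the bias with the weights, so the
        single dot product w . x computes w_vec . x_vec + b.
        """
        return 1 if np.dot(w, x) > 0 else -1

    w = np.array([-1.0, 1.0, 1.0])      # b = -1, w_vec = (1, 1)
    print(hypothesis(w, augment(np.array([2.0, 2.0]))))  # +1 (green side)
    print(hypothesis(w, augment(np.array([0.0, 0.0]))))  # -1 (red side)
    ```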

  6. A learning algorithm $\mathcal A$ navigates the hypothesis set $\mathcal H$ until it finds a suitable candidate $g \in \mathcal H$ for the unknown target function $f$ mentioned in step 3. In our case, the goal of the Perceptron Learning Algorithm is to find a hyperplane that segregates the training inputs correctly, in the sense that $g(\vec x_i) = y_i$ for all $i \in \{1,...,N\}$. One way of doing this is to randomly pick a hyperplane from $\mathcal H$ (randomly initialize the weights and bias) and look for points that have been misclassified, namely those with $h(\vec x_i) = \operatorname{sign}(w \cdot x_i) \ne y_i$. In that case, we jolt the hyperplane just the right amount so that the above sign flips.

    Recall that the sign of a dot product depends on whether the angle between the two vectors is acute or obtuse. Suppose that for the misclassified point $\vec x_i$, $h(\vec x_i) = \operatorname{sign}(w \cdot x_i) = +1$ (the second diagram below), meaning that the angle between $w = (b, \vec w)^T$ and $x_i = (1, \vec x_i)^T$ is acute whereas it should have been obtuse (since $y_i = -1$, the point being misclassified). We can update the weights and bias $w$ by taking $w + (-1)x_i = w+y_i x_i$ as the new parameter, which rotates $w$ away from $x_i$ and nudges the angle toward obtuse. Alternatively, if the misclassified point had $\operatorname{sign}(w \cdot x_i) = -1$ (the first diagram below), meaning the angle was obtuse instead of acute (this time $y_i = +1$), we update to $ w + x_i = w+ y_i x_i$, which rotates $w$ toward $x_i$ and nudges the angle toward acute.


    Thus the UPDATE rule for every misclassified data point $(x_i, y_i)$ reads: jolt the hyperplane according to the rule $w \mapsto w + y_i x_i$. Repeating this until no training point is misclassified corrects all mistakes; for linearly separable data, the Perceptron convergence theorem guarantees that this procedure terminates after finitely many updates.
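    The whole algorithm can be sketched in a few lines of NumPy. This is an illustrative implementation with a toy dataset of my own choosing, not production code; the `max_iters` cap is a safeguard in case the data are not actually separable.

    ```python
    import numpy as np

    def perceptron_learn(X, y, max_iters=1000):
        """Perceptron Learning Algorithm.

        X: (N, n) array of training inputs; y: (N,) array of +/-1 labels.
        Returns the augmented weight vector w = (b, w1, ..., wn).
        """
        X_aug = np.hstack([np.ones((X.shape[0], 1)), X])  # prepend x0 = 1
        w = np.zeros(X_aug.shape[1])                      # initialize w = 0
        for _ in range(max_iters):
            misclassified = False
            for x_i, y_i in zip(X_aug, y):
                if np.sign(np.dot(w, x_i)) != y_i:        # misclassified point
                    w = w + y_i * x_i                     # jolt the hyperplane
                    misclassified = True
            if not misclassified:                         # all points correct
                break
        return w

    # Toy linearly separable data (label = side of the line x1 + x2 = 1):
    X = np.array([[2.0, 2.0], [1.5, 1.0], [0.0, 0.0], [-1.0, 0.5]])
    y = np.array([1, 1, -1, -1])
    w = perceptron_learn(X, y)
    print(all(np.sign(X @ w[1:] + w[0]) == y))  # True
    ```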

  7. The final hyperplane defines our guess $g$ for the target function $f$. [P.S. To play with these algorithms and see better visualizations, visit https://playground.tensorflow.org.]

Alright, so we have concluded the above problem by finding a final hypothesis $g \in \mathcal H$ which performs very well on our in-sample data, namely that it classifies the training data with magnificent perfection. Now, the million dollar question is the following. Have we actually learnt? In other words, if we bring new data from outside our sample, such as new customers, will this $g$ classify these new data correctly? How sure are we that $g \approx f$? What guarantees that $g$ generalizes to out-of-sample data and that this approximation is close?

These are valid questions and we will address them in our next post.
