{ "nbformat": 4, "nbformat_minor": 0, "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.9" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": false }, "colab": { "name": "calc_intro_corrections.ipynb", "provenance": [], "collapsed_sections": [] } }, "cells": [ { "cell_type": "markdown", "metadata": { "id": "8dm9ilyg3ohu", "colab_type": "text" }, "source": [ "## Multivariable Calculus Review\n", "We will be covering some of the most relevant concepts from multivariable calculus for this course, and show how they can be extended to deal with matrices of training data." ] }, { "cell_type": "markdown", "metadata": { "id": "zeoXkrTn3ohw", "colab_type": "text" }, "source": [ "### Partial Derivatives\n", "The derivative of a function of 2 variables $f(x, y)$ w.r.t. either one of its variables is defined as \n", "\n", "\n", "\\begin{aligned}\n", "\\frac{\\partial}{\\partial x} f(x, y) &:= \\lim_{h \\to 0} \\frac{f(x + h, y) - f(x, y)}{h}\\\\\n", "\\frac{\\partial}{\\partial y} f(x, y) &:= \\lim_{h \\to 0} \\frac{f(x, y + h) - f(x, y)}{h}\n", "\\end{aligned}\n", "" ] }, { "cell_type": "markdown", "metadata": { "id": "1yQeWAU13ohy", "colab_type": "text" }, "source": [ "This is a very obvious extension going from the derivative definition for functions of single variables.\n", "As an example, what is $\\frac{\\partial}{\\partial x} f(x,y)$ where $f(x, y) = x^2 cos(xy)$?" ] }, { "cell_type": "markdown", "metadata": { "id": "rXrFvDlJ3oh4", "colab_type": "text" }, "source": [ "$$\\frac{\\partial}{\\partial x} f(x,y) = 2x cos(xy) - x^2 sin(xy)y$$" ] }, { "cell_type": "markdown", "metadata": { "id": "6ZP-JaVx3oh5", "colab_type": "text" }, "source": [ "---\n", "### Directional Derivatives\n", "Before we define the gradient, we discuss the closely related concept of a directional derivative.\n", "\n", "The directional derivative of a function $f(x, y)$ at $(x_0, y_0)$ moving in the direction of a unit vector $u = [a, b]$ is \n", "\n", "\n", "\\begin{aligned}\n", "D_{u} f(x_0, y_0) := \\lim_{h \\to 0} \\frac{f(x_0 + ha, y_0 + hb) - f(x_0, y_0)}{h}\n", "\\end{aligned}\n", "\n", "\n", "This simply tells us how the scalar output of $f$ will change when we move in an arbitrary direction that is not necesarily axis-aligned. It can be written as \n", "\n", "\n", "\\begin{aligned}\n", "D_{u} f(x, y) = \\left[\\frac{\\partial}{\\partial x} f(x, y), \\frac{\\partial}{\\partial y} f(x, y)\\right] \\cdot u\n", "\\end{aligned}\n", "" ] }, { "cell_type": "markdown", "metadata": { "id": "JAMywhg23oh6", "colab_type": "text" }, "source": [ "---\n", "### Gradient\n", "This is perhaps the term you will hear most often in the context of neural networks. The gradient is informally defined as the vector pointing in the direction of steepest increase for a given function. Thus, if we would like to minimize a loss function such as mean squared error (MSE), we move in the opposite direction of the gradient, i.e. in the direction of the negative gradient.\n", "\n", "Now the gradient is simply defined as \n", "\n", "\n", "\\begin{aligned}\n", "\\nabla f(x, y) := \\left[\\frac{\\partial}{\\partial x} f(x, y), \\frac{\\partial}{\\partial y} f(x, y)\\right]\n", "\\end{aligned}\n", "\n", "\n", "Note that this is a vector-field (vector-valued function), and not a scalar-valued function!" ] }, { "cell_type": "markdown", "metadata": { "id": "vZwC9Jmx3oh6", "colab_type": "text" }, "source": [ "What does $\\nabla f(x, y)$ have to do with neural networks and minimizing loss functions? Let's see in which direction the directional derivative achieves the highest possible value. Using the gradient notation, we can rewrite the directional derivative as\n", "\n", "\n", "\\begin{aligned}\n", "D_{u} f(x, y) &= \\nabla f(x, y) \\cdot u\\\\\n", "&=|| \\nabla f(x, y) ||_{2} \\cdot || u ||_{2} \\cdot cos (\\theta)\\\\\n", "&= ||\\nabla f(x, y)||_{2} \\cdot cos(\\theta)\n", "\\end{aligned}\n", "\n", "\n", "This quantity is maximized when $cos(\\theta) = 1$, i.e. $\\nabla f(x, y)$ and $u$ are parallel. Therefore, the gradient points in the direction of steepest increase for $f$.\n", "\n", "So if $f$ is a neural network parameterized by some weights $w$, and $x \\in R^D$ is some input to it, then taking steps in $\\nabla_{w} f_{w}(x)$ will move us in the direction that maximizes the output of $f$. Thus, if $f$ is a binary classification network, say the probability of an image being of a car, then moving in this direction would maximimally increase the probability of $x$ being classified as a car." ] }, { "cell_type": "markdown", "metadata": { "id": "_IkgwH5w3oh7", "colab_type": "text" }, "source": [ " " ] }, { "cell_type": "markdown", "metadata": { "toc-hr-collapsed": false, "id": "BfUWxllI3oh8", "colab_type": "text" }, "source": [ "---\n", "## Basic ML Setup\n", "\n", "If you recall from CSC411, the basic supervised learning scenarios are regression and classification. Most buzz in computer vision has driven by success on classifcation problems, so we start with that first." ] }, { "cell_type": "markdown", "metadata": { "id": "oSshknWC3oh9", "colab_type": "text" }, "source": [ "### Classification\n", "In $K$-class classification, we are given a dataset $\\{\\mathbf{x}^{(i)}, t^{(i)}\\}_{i=1}^{N}$ of $N$ samples where $\\mathbf{x}^{(i)} \\in R^D$ and $t^{(i)} \\in [1, ..., K]$.\n", "\n", "We would like to fit a probabilistic model, in our case a neural network, that maximizes the probability \n", "\n", "\n", "\\begin{aligned}\n", "\\prod_{i=1}^{N} p(t^{(i)} | \\mathbf{x}^{(i)}, \\mathbf{w})\n", "\\end{aligned}\n", "" ] }, { "cell_type": "markdown", "metadata": { "id": "Pd3peeRb3oh-", "colab_type": "text" }, "source": [ "we can vectorize these objects/terms by defining \n", "\n", "$$\n", "\\mathbf{X} = \\begin{bmatrix}\n", "(\\mathbf{x}^{(1)})^{\\top} \\\\\n", "\\vdots \\\\\n", "(\\mathbf{x}^{(N)})^{\\top}\\end{bmatrix}\n", "$$\n", "\n", "$$\n", "\\mathbf{T} = \\begin{bmatrix}\n", "(t^{(1)})^{\\top} \\\\\n", "\\vdots \\\\\n", "(t^{(N)})^{\\top}\n", "\\end{bmatrix}\n", "$$\n", "\n", "where $\\mathbf{X}$ is called the design matrix, and has dimensions $(N, D)$, where entry $\\mathbf{X}_{ij}$ corresponds to the j-th feature for the i-th training sample. $\\mathbf{T}$ is our matrix of labels. The i-th row of $\\mathbf{T}$ is the one-hot encoded vector of the label for the i-th sample which is simply a $K$-dimensional vector of all $0$s except for the a single $1$ in the position of the label. If $K=4$, and $t^{(i)} = 2$, the one-hot encoded version is\n", "\n", "$$\n", "\\begin{bmatrix}\n", "0 \\\\\n", "1 \\\\\n", "0 \\\\ \n", "0\n", "\\end{bmatrix}\n", "$$\n", "\n", "We assume that all vectors are column vectors." ] }, { "cell_type": "markdown", "metadata": { "id": "HoNGHsZC3oh_", "colab_type": "text" }, "source": [ "Our goal becomes to maximize the probability\n", "\n", "\n", "\\begin{aligned}\n", "p(\\mathbf{t} | \\mathbf{X}, \\mathbf{w}) &= \\prod_{i=1}^{N} p(t^{(i)} | \\mathbf{x}^{(i)}, \\mathbf{w})\\\\\n", "&= \\prod_{i=1}^{N} \\prod_{k=1}^{K} p(k | \\mathbf{x}^{(i)}, \\mathbf{w})^{\\mathbf{T}_{ik}}\n", "\\end{aligned}\n", "\n", "\n", "Note that just the term in the innermost product corresponding to the true label of the i-th can be less than 1. Since this expression would lead to numerical underflow given a large enough dataset, we can equivalently maximize a monotonically increasing function of this expression, i.e. take the log\n", "\n", "\n", "\\begin{aligned}\n", "\\mathrm{log}\\left(\\prod_{i=1}^{N} \\prod_{k=1}^{K} p(k | \\mathbf{x}^{(i)}, \\mathbf{w})^{\\mathbf{T}_{ik}}\\right) &= \\sum_{i=1}^{N} \\sum_{k=1}^{K} \\mathbf{T}_{ik} \\mathrm{log} ( p(k | \\mathbf{x}^{(i)}, \\mathbf{w}) )\n", "\\end{aligned}\n", "\n", "\n", "If we negate this term, we end up with the most popular loss function for classification known as cross-entropy\n", "\n", "\n", "\\begin{aligned}\n", "L_{CE} = - \\sum_{i=1}^{N} \\sum_{k=1}^{K} \\mathbf{T}_{ik} \\mathrm{log} ( p(k | \\mathbf{x}^{(i)}, \\mathbf{w}) )\n", "\\end{aligned}\n", "\n", "\n", "Our goal is to **minimize** $L_{CE}$ by moving in the direction $- \\nabla_{\\mathbf{w}} L_{CE}$." ] }, { "cell_type": "markdown", "metadata": { "id": "6o--1Jhb3oiA", "colab_type": "text" }, "source": [ "---\n", "### Linear Regression" ] }, { "cell_type": "markdown", "metadata": { "id": "fxScO4vp3oiB", "colab_type": "text" }, "source": [ "#### Multivariate input. Univariate output. $R^M \\rightarrow R$\n", "\n", "##### Forward pass\n", "Suppose we have some features $x_1, x_2, \\dots, x_M$, and we predict a single scalar $y = w_1x_1 + w_2x_2 + \\dotsm + w_Mx_M$. We can write this in matrix notation as a dot product\n", "$$\n", "y = \\sum_{i=1}^M w_ix_i = \\mathbf{w}^T\\mathbf{x} \\quad\\text{ where }\\quad\\mathbf{w} = \\begin{bmatrix}\n", "w_1 \\\\ w_2 \\\\ \\vdots \\\\ w_M\n", "\\end{bmatrix}, \\mathbf{x} = \\begin{bmatrix}\n", "x_1 \\\\ x_2 \\\\ \\vdots \\\\ x_M\n", "\\end{bmatrix}\n", "$$\n", "We say $\\mathbf{x} \\in R^M$ to indicate that $\\mathbf{x}$ is a $M$-dimensional vector." ] }, { "cell_type": "markdown", "metadata": { "id": "QBCTh3Qs3oiC", "colab_type": "text" }, "source": [ "##### Backward pass\n", "Suppose we have a target $t$ and a loss function $L = (y-t)^2$, what is the gradient $\\nabla_\\mathbf{w} L$?\n", "\n", "Remember the definition of the gradient,\n", "$$\n", "\\nabla_w L = \\begin{bmatrix}\n", "\\frac{\\partial L}{\\partial w_1} \\\\\n", "\\frac{\\partial L}{\\partial w_2} \\\\\n", "\\vdots \\\\\n", "\\frac{\\partial L}{\\partial w_M} \\\\\n", "\\end{bmatrix}\n", "$$\n", "\n", "Each element is a partial derivative that can be written out using *chain rule*. \n", "\n", "$$\n", "\\frac{\\partial L}{\\partial w_i} = \\frac{\\partial L}{\\partial y} \\; \\frac{\\partial y}{\\partial w_i}\n", "= 2(y-t) \\; x_i \n", "$$\n", "\n", "So the gradient is \n", "$$\n", "\\nabla_w L = \\begin{bmatrix}\n", "2(y-t) \\; x_1 \\\\\n", "\\vdots \\\\\n", "2(y-t) \\; x_M \\\\\n", "\\end{bmatrix} = 2(y-t) \\mathbf{x}\n", "$$" ] }, { "cell_type": "markdown", "metadata": { "id": "hRfLCnbL3oiD", "colab_type": "text" }, "source": [ "**Remark.** There are some matrix calculus formulas that are worth remembering. One that we could have used here is \n", "\n", "$$\n", "\\mathbf{w}^T\\mathbf{x} \\implies \\nabla_\\mathbf{w} \\mathbf{w}^T\\mathbf{x} = \\mathbf{x}\n", "$$\n", "\n", "So the math becomes much easier.\n", "$$\\nabla_\\mathbf{w} L = \\underbrace{(\\nabla_y L )(\\nabla_\\mathbf{w} y)}_{\\text{Chain rule holds for matrix calculus.}} = 2(y - t)\\mathbf{x}$$" ] }, { "cell_type": "markdown", "metadata": { "id": "bIAbg-wL3oiE", "colab_type": "text" }, "source": [ "---\n", "#### Multivariate input. __Multivariate output__. $R^D \\rightarrow R^M$\n", "\n", "##### Forward pass\n", "Suppose we have some features $x_1, x_2, \\dots, x_D$, and we predict a **vector** $\\mathbf{y \\in R^M}$ where $y_j = w_{j1}x_1 + w_{j2}x_2 + \\dotsm + w_{jD}x_D$. We can write this in matrix notation as a **matrix-vector product**.\n", "$$\n", "\\mathbf{y} = \\begin{bmatrix}\n", "w_{11}x_1, w_{12}x_2, \\dotsm w_{1D}x_D \\\\\n", "w_{21}x_1, w_{22}x_2, \\dotsm w_{2D}x_D \\\\\n", "\\vdots \\\\\n", "w_{M1}x_1, w_{M2}x_2, \\dotsm w_{MD}x_D \\\\\n", "\\end{bmatrix} =\n", "\\begin{bmatrix}\n", "w_1^Tx \\\\\n", "w_2^Tx \\\\\n", "\\vdots \\\\\n", "w_M^Tx \\\\\n", "\\end{bmatrix} = \\mathbf{W}\\mathbf{x} \\quad\\text{ where }\\quad W = \n", "\\begin{bmatrix}\n", "w_1^T \\\\\n", "w_2^T \\\\\n", "\\vdots \\\\\n", "w_M^T\n", "\\end{bmatrix} = \\begin{bmatrix}\n", "w_{11} & w_{12} & \\dots & w_{1D} \\\\\n", "w_{21} & w_{22} & \\dots & w_{2D} \\\\\n", "\\vdots \\\\\n", "w_{M1} & w_{M2} & \\dots & w_{MD}\n", "\\end{bmatrix}\n", "$$\n", "\n", "We say $\\mathbf{W} \\in R^{M \\times D}$ to indicate that $\\mathbf{W}$ is a matrix of dimension $M$ by $D$." ] }, { "cell_type": "markdown", "metadata": { "id": "GDkhO3Ih3oiE", "colab_type": "text" }, "source": [ "##### Backward pass\n", "Suppose we have a target $t$ and a loss function $L = ||y-t||_2^2$, what is the gradient $\\nabla_\\mathbf{W} L$?\n", "\n", "Remember the definition of the gradient,\n", "$$\n", "\\nabla_w L = \\begin{bmatrix}\n", "\\frac{\\partial L}{\\partial w_{11}} & \\frac{\\partial L}{\\partial w_{12}} & \\dots & \\frac{\\partial L}{\\partial w_{1D}} \\\\\n", "\\frac{\\partial L}{\\partial w_{21}} & \\frac{\\partial L}{\\partial w_{22}} & \\dots & \\frac{\\partial L}{\\partial w_{2D}} \\\\\n", "\\vdots \\\\\n", "\\frac{\\partial L}{\\partial w_{M1}} & \\frac{\\partial L}{\\partial w_{M2}} & \\dots & \\frac{\\partial L}{\\partial w_{MD}} \\\\\n", "\\end{bmatrix} = \\begin{bmatrix}\n", "\\frac{\\partial L}{\\partial \\mathbf{w_1}}^T \\\\\n", "\\frac{\\partial L}{\\partial \\mathbf{w_2}}^T \\\\\n", "\\vdots \\\\\n", "\\frac{\\partial L}{\\partial \\mathbf{w_M}}^T \\\\\n", "\\end{bmatrix}\n", "$$\n", "\n", "In this particular case, *because $L$ depends on $\\mathbf{w_1}$ only through $y_1$*, we have \n", "\n", "$$\n", "\\frac{\\partial L}{\\partial \\mathbf{w_k}} = \\sum_{j=1}^M \\frac{\\partial L}{\\partial y_j} \\; \\frac{\\partial y_j}{\\partial \\mathbf{w_k}} = \\frac{\\partial L}{\\partial y_k} \\; \\frac{\\partial y_k}{\\partial \\mathbf{w_k}} \n", "$$\n", "\n", "(ie. Can show that $\\frac{\\partial y_j}{\\partial \\mathbf{w_k}} = 0$ if $j\\ne k$.)" ] }, { "cell_type": "markdown", "metadata": { "id": "MSXL9TeU3oiF", "colab_type": "text" }, "source": [ "The two parts:\n", "$$\\frac{\\partial L}{\\partial y_k} = \\frac{\\partial}{\\partial y_k} \\sum_{j=1}^D (y_j-t_j)^2 = 2(y_k - t_k)$$\n", "$$\\frac{\\partial y_k}{\\partial \\mathbf{w_k}} = \\frac{\\partial}{\\partial \\mathbf{w_k}} \\mathbf{w_k}^T\\mathbf{x} = \\mathbf{x}$$\n", "\n", "And we finally have\n", "\n", "$$\\nabla_w L = \\begin{bmatrix}\n", "\\frac{\\partial L}{\\partial \\mathbf{w_1}}^T \\\\\n", "\\frac{\\partial L}{\\partial \\mathbf{w_2}}^T \\\\\n", "\\vdots \\\\\n", "\\frac{\\partial L}{\\partial \\mathbf{w_M}}^T \\\\\n", "\\end{bmatrix} = \\begin{bmatrix}\n", "2(y_1 - t_1)\\mathbf{x}^T \\\\\n", "2(y_2 - t_2)\\mathbf{x}^T \\\\\n", "\\vdots \\\\\n", "2(y_M - t_M)\\mathbf{x}^T \\\\\n", "\\end{bmatrix} = 2(\\mathbf{y} - \\mathbf{t})\\mathbf{x}^T\n", "$$" ] }, { "cell_type": "markdown", "metadata": { "id": "6BL3JfV33oiG", "colab_type": "text" }, "source": [ "**Remark.** There are some matrix calculus formulas that are worth remembering. One that we could have used here is \n", "\n", "For any vectors $\\mathbf{v} \\in R^M, \\mathbf{A} \\in R^{M \\times D}, \\mathbf{u} \\in R^D$,\n", "\n", "For any vector $\\mathbf{v}$,\n", "$$\n", "\\nabla_\\mathbf{A} \\mathbf{v}^T\\mathbf{A}\\mathbf{u} = \\mathbf{v}\\mathbf{u}^T\n", "$$\n", "\n", "So the math becomes much easier.\n", "$$\\nabla_\\mathbf{W} L = \\underbrace{(\\nabla_\\mathbf{y} L)^T(\\nabla_\\mathbf{W} \\mathbf{y})}_{\\text{Chain rule holds for matrix calculus.}} = \\nabla_\\mathbf{W} (\\underbrace{2(\\mathbf{y} - \\mathbf{t})}_{\\text{Evaluate } \\nabla_\\mathbf{y} L\\text{ first.}})^TW\\mathbf{x} = 2(\\mathbf{y} - \\mathbf{t})\\mathbf{x}^T$$" ] }, { "cell_type": "markdown", "metadata": { "id": "R9SYFsvl3oiH", "colab_type": "text" }, "source": [ "---\n", "### 1-Hidden Layer Neural Network" ] }, { "cell_type": "markdown", "metadata": { "id": "4bu8Negr3oiI", "colab_type": "text" }, "source": [ "Putting it all together, suppose we now have a neural network with a single hidden layer.\n", "\n", " \n", "\n", "No more summation notation. Let's put what we've learned about matrix notation in use. \n", "\n", "\n", "\\begin{aligned}\n", "&(1) \\;\\; \\nabla_\\mathbf{w} \\mathbf{w}^T\\mathbf{x} = \\mathbf{x} \\\\\n", "&(2) \\;\\; \\nabla_\\mathbf{A} \\mathbf{v}^T\\mathbf{A}\\mathbf{u} = \\mathbf{v}\\mathbf{u}^T\n", "\\end{aligned}\n", "" ] }, { "cell_type": "markdown", "metadata": { "id": "dZM6Ckyp3oiI", "colab_type": "text" }, "source": [ "**Forward pass.**\n", "\n", "\n", "\\begin{aligned}\n", "& \\mathbf{z} = \\mathbf{W^{(1)}}\\mathbf{x} \\quad\\quad\\quad &\\text{(Linear transformation)}\\\\\n", "& \\mathbf{h} = \\sigma(\\mathbf{z}) &\\text{(Element-wise nonlinear function)}\\\\\n", "& \\mathbf{y} = \\mathbf{W^{(2)}}\\mathbf{h} &\\text{(Linear transformation)}\\\\\n", "& L = ||\\mathbf{y}-\\mathbf{t}||^2 &\\text{(Loss function)}\\\\\n", "\\end{aligned}\n", "\n", "\n", "**Backward pass.**\n", "\n", "\n", "\\begin{aligned}\n", "& \\nabla_\\mathbf{y} L = 2(\\mathbf{y} - \\mathbf{t}) &\\text{(Pass gradient through loss function)}\\\\\n", "& \\nabla_\\mathbf{h} L = ((\\nabla_\\mathbf{y} L)^T\\mathbf{W^{(2)}})^T &\\text{(Pass gradient through linear function; uses (1))}\\\\\n", "& \\nabla_\\mathbf{W^{(2)}} L = (\\nabla_\\mathbf{y} L)\\mathbf{h}^T &\\text{(Pass gradient to } \\mathbf{W^{(2)}} \\text{ ; uses (3))}\\\\\n", "& \\nabla_\\mathbf{z} L = (\\nabla_\\mathbf{h} L) \\circ \\sigma'(\\mathbf{z}) &\\text{(Pass gradient through nonlinearity)} \\\\\n", "& \\nabla_\\mathbf{W^{(1)}} L = (\\nabla_\\mathbf{z} L)\\mathbf{x}^T &\\text{(Pass gradient to }\\mathbf{W^{(1)}} \\text{ ; uses (3))}\\\\\n", "\\end{aligned}\n", "" ] } ] }