Tests of divisibility for 13 and 17

Inspired by the test of divisibility for the number 7, I came up with similar tests of divisibility for 13 and 17:

Test of divisibility for 13: For a given number, remove the last digit from it and then add 4 times the digit that was just removed. The original number is divisible by 13 if and only if the resulting number is divisible by 13.

Test of divisibility for 17: For a given number, remove the last digit from it and then subtract 5 times the digit that was just removed. The original number is divisible by 17 if and only if the resulting number is divisible by 17.

This rule can be used repeatedly until the resulting number is small enough for us to check if the result is divisible by 13 (or 17) directly. The proof for these rules follow the same structure as that for 7. Let the original number be written as 10x + y. Then

\begin{aligned} 10x + y \equiv 0 \mod 13 &\Leftrightarrow 4(10x + y) \equiv 0 \mod 13 \\  &\Leftrightarrow 39x + (x + 4y) \equiv 0 \mod 13 \\  &\Leftrightarrow x + 4y \equiv 0 \mod 13, \end{aligned}


\begin{aligned} 10x + y \equiv 0 \mod 17 &\Leftrightarrow 5(10x + y) \equiv 0 \mod 17 \\  &\Leftrightarrow 51x - (x - 5y) \equiv 0 \mod 17 \\  &\Leftrightarrow x - 5y \equiv 0 \mod 17. \end{aligned}

Here are two examples:

\begin{aligned} 4836 \text{ is divisible by } 13 &\Leftrightarrow 483 + 4 \cdot 6 = 507 \text{ is divisible by } 13 \\  &\Leftrightarrow 50 + 4 \cdot 7 = 78 \text{ is divisible by } 13. \end{aligned}

\begin{aligned} 16252 \text{ is divisible by } 17 &\Leftrightarrow 1625 - 5 \cdot 2 = 1615 \text{ is divisible by } 17 \\  &\Leftrightarrow 161 - 5 \cdot 5 = 136 \text{ is divisible by } 17. \end{aligned}


Test of divisibility for 7

I recently learned of a simple test of divisibility for the number 7 from this Numberphile video. The rule of divisibility is as follows:

For a given number, remove the last digit from it and then add 5 times the digit that was just removed. The original number is divisible by 7 if and only if the resulting number is divisible by 7.

This rule can be used repeatedly until the resulting number is small enough for us to check if the result is divisible by 7 directly.

Here is an example: say we want to test if 123557 is divisible by 7. Using this rule:

\begin{aligned} 123557 \text{ is divisible by } 7 &\Leftrightarrow 12355 + 5 \cdot 7 = 12390 \text{ is divisible by } 7 \\  &\Leftrightarrow 1239 + 5 \cdot 0 = 1239 \text{ is divisible by } 7 \\  &\Leftrightarrow 123 + 5 \cdot 9 = 168 \text{ is divisible by } 7 \\  &\Leftrightarrow 16 + 5 \cdot 8 = 56 \text{ is divisible by } 7. \end{aligned}

The last statement is true, and so 123557 is divisible by 7.

This rule is also easy to prove. Let the original number be written as 10x + y. Then

\begin{aligned} 10x + y \equiv 0 \mod 7 &\Leftrightarrow 5(10x + y) \equiv 0 \mod 7 \\  &\Leftrightarrow 49x + (x + 5y) \equiv 0 \mod 7 \\  &\Leftrightarrow x + 5y \equiv 0 \mod 7. \end{aligned}

Finally, there is a variation of this rule involving subtraction instead of addition (whose proof is essentially the same as the above):

For a given number, remove the last digit from it and then subtract 2 times the digit that was just removed. The original number is divisible by 7 if and only if the resulting number is divisible by 7.

What is a block-angular matrix?

A matrix is said to be in block-angular form if it is of the form

\begin{pmatrix} A_{01} & A_{02} & \dots & A_{0n} \\  A_{11} & 0 & \dots & 0 \\  0 & A_{22} & \dots & 0 \\  \vdots & \vdots & \ddots & \vdots \\  0 & 0 & \dots & A_{mn} \end{pmatrix}

where the A_{ij}‘s are arbitrary rectangular matrices (they need not be square).

These matrices appear in optimization in the context of the Dantzig-Wolfe decomposition in linear programming (LP). The problem with the Dantzig-Wolfe decomposition solves is of the form

\begin{aligned} \text{minimize} \quad& c_1^\top x_1 + \dots + c_n^\top x_n \\  \text{subject to} \quad & \begin{pmatrix} A_{01} & A_{02} & \dots & A_{0n} \\  A_{11} & 0 & \dots & 0 \\  0 & A_{22} & \dots & 0 \\  \vdots & \vdots & \ddots & \vdots \\  0 & 0 & \dots & A_{mn} \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix} = \begin{pmatrix} b_1 \\ b_2 \\ \vdots \\ b_n \end{pmatrix} , \\  &x_1, \dots, x_n \geq 0. \end{aligned}

The constraint matrix is in block-angular form. The constraints associated with the top part of the matrix \begin{pmatrix} A_{01} & \dots & A_{0n} \end{pmatrix} are known as the “connecting” or “coupling” constraints. Without these constraints, the LP decouples into n independent LPs.


  1. Dantzig, G. B., and Wolfe, P. (1960). Decomposition Principle for Linear Programs.
  2. Weil, R. L., and Kettler, P. C. (1971). Rearrange Matrices to Block-Angular form for Decomposition (And Other) Algorithms.

What is the Birkhoff polytope?

A square matrix A \in \mathbb{R}^{n \times n} is said to be doubly stochastic if its entries are all non-negative and each row and column sums to one, i.e.

\begin{aligned} \sum_i a_{ij} = 1 \text{ for all } i, \quad \sum_j a_{ij} = 1 \text{ for all } j. \end{aligned}

The Birkhoff polytope, denoted by B_n, is the set of all n \times n doubly stochastic matrices.

One interesting fact about the Birkhoff polytope is that it is a convex polytope, with the vertices being the n! permutation matrices of size n (n \times n matrices with exactly one 1 in each row and column, all other entries being 0). This is known as the Birkhoff-von Neumann theorem. An immediate consequence of this is that for any doubly stochastic matrix A, there exist permutation matrices P_1, \dots, P_k and non-negative numbers \theta_1, \dots, \theta_k such that \theta_1 + \dots + \theta_k = 1 and A = \theta_1 P_1 + \dots + \theta_k P_k.

Proof that the dual norm of the dual norm is the norm itself

Assume that \| \cdot \| is some norm on \mathbb{R}^n. The dual norm, denoted by \| \cdot \|_*, is defined by

\begin{aligned} \| x\|_* = \max_{\| z \| \leq 1} z^\top x. \end{aligned}

The definition gives us the inequality |z^\top x| \leq \|z\| \|x\|_*, which will come in handy later.

Since the dual norm is also a norm, it too has a dual norm. It turns out that the dual norm of the dual norm is the norm itself, i.e. \| x \|_{**} = \|x\|. I found an interesting proof of this statement recently in Reference 1, which I share in the remainder of this post.

Fix x \in \mathbb{R}^n and consider the trivial looking optimization problem

\begin{aligned} \min_y \| y \| \quad \text{subject to} \quad y = x. \end{aligned}

This (primal) optimization problem has optimal value \| x\|. The Lagrangian associated with this problem is

\begin{aligned} L(y, u) = \| y \| + u^T (x - y) = \| y \| - y^\top  u + x^\top u, \end{aligned}

and the Lagrange dual function is

\begin{aligned} g(u) = \min_y L(y, u) = x^\top u + \min_y \{ \| y \| - y^\top u \} . \end{aligned}

  • If \|u\|_* > 1, by definition of the dual norm there is some y with \| y \| = 1 such that y^\top u > 1, i.e. 0 > \| y \| - y^\top u. Multiplying y by a constant and letting that constant go to infinity, we see that \min_y \{ \| y \| - y^\top u \} = -\infty.
  • If \|u\|_* \leq 1, then the inequality above gives \| y \| - y^\top u \geq \|y\| \|u\|_* - |y^\top u| \geq 0. Equality can hold (e.g. y = 0), so we have \min_y \{ \| y \| - y^\top u \} = 0 in this case.

Putting the pieces together, the Lagrange dual problem is

\begin{aligned} \max_u u^\top x \quad \text{subject to} \quad \| u \|_* \leq 1. \end{aligned}

By definition, the optimal value of this dual problem is \| x \|_{**}. Since strong duality holds, the primal and dual problems have the same optimal value, i.e. \| x \| = \| x \|_{**}.


  1. Tibshirani, R. (2016). Duality Uses and Correspondences. (CMU Convex Optimization: Fall 2016, Lecture on Oct 12.)

Clearing up directional semi-derivatives for one dimension

This post clears up a confusion I ran into recently with directional semi-derivatives in one dimension.

Directional semi-derivatives

Consider a function f: \mathbb{R}^d \mapsto \mathbb{R}, where d \geq 1, and let \mathbf{u} \in \mathbb{R}^d be a direction in \mathbb{R}^d (i.e. a vector with length 1). For any point \mathbf{x} \in \mathbb{R}^d, the directional semi-derivative of f at \mathbf{x} in the direction \mathbf{u} is given by

\begin{aligned} \nabla_{\mathbf{u}} f(\mathbf{x}) = \lim_{h \rightarrow 0^+} \dfrac{f(\mathbf{x} + h \mathbf{u}) - f(\mathbf{x})}{h}  \end{aligned}

if the limit exists. By h \rightarrow 0^+, we mean that h approaches zero from the positive direction. Intuitively, the directional semi-derivative says that if I were to move a small positive distance h from \mathbf{x} in the direction \mathbf{u}, f would increase by about h \nabla_{\mathbf{u}}f(\mathbf{x}).

In one dimension (d = 1), there are only two directions: \mathbf{u} = +1 and \mathbf{u} = -1. The formula above simplifies to

\begin{aligned} \nabla_{+1} f(x) = \lim_{h \rightarrow 0^+} \dfrac{f(x + h) - f(x)}{h}, \\  \nabla_{-1} f(x) = \lim_{h \rightarrow 0^+} \dfrac{f(x - h) - f(x)}{h}. \end{aligned}

One-sided derivatives

Some introductory courses on single-variable calculus introduce the concept of a one-sided derivative, also called left and right derivatives. For a function f: \mathbb{R} \mapsto \mathbb{R}, the right and left derivatives of f at x are defined respectively as

\begin{aligned} \partial_+ f(x) &= \lim_{x' \rightarrow x^+} \dfrac{f(x') - f(x)}{x' - x}, \\  \partial_- f(x) &= \lim_{x' \rightarrow x^-} \dfrac{f(x') - f(x)}{x' - x}. \end{aligned}

Writing x' = x+h in the first equation and x' = x - h in the second equation, we can express the one-sided derivatives equivalently as

\begin{aligned} \partial_+ f(x) &= \lim_{h \rightarrow 0^+} \dfrac{f(x+h) - f(x)}{h}, \\  \partial_- f(x) &= \lim_{h \rightarrow 0^+} \dfrac{f(x - h) - f(x)}{-h}. \end{aligned}

Notice that the directional semi-derivative matches the one-sided derivative for the positive direction but not for the negative direction! We have \nabla_{+1} f(x) = \partial_+ f(x) but \nabla_{-1} f(x) = -\partial_- f(x). Thus, one needs to be clear when defining these quantities in one dimension to avoid confusion.


  1. Wikipedia. Directional derivative.
  2. Wikipedia. Semi-differentiability.

Generalizing projections to Mahalanobis-type metrics

In this previous post, we defined projections onto a subspace and obtained an expression for the matrix representing this projection. Specifically, if A is a full rank matrix and \text{proj}_A v is the projection of v onto the column space of A, then \text{proj}_A v = Pv, where P = A (A^T A)^{-1} A^T. In this post, we generalize the idea of projections to Mahalanobis-type metrics.

Let M \in \mathbb{R}^{d \times d} be a symmetric positive definite matrix, and define the metric \| \cdot \|_M by \| x \|_M^2 = x^\top Mx for x \in \mathbb{R}^d. Let A \in \mathbb{R}^{d \times k} be some full-rank matrix with k \leq d. We can define the projection onto the column space of A using the metric \| \cdot \|_M, denoted by \Pi, as the matrix such that for all x, \Pi x = Ay, where y \in \mathbb{R}_k minimizes the quantity \| x - Ay \|_M^2. That is, the projection of x is the vector in the column space of A that is closest to x in the metric \| \cdot \|_M.

We can determine the projection matrix \Pi explicitly. Differentiating the objective function by y and setting it to zero,

\begin{aligned} \dfrac{\partial}{\partial y} \| x - Ay \|_M^2 &= \dfrac{\partial}{\partial y} (x - Ay)^\top M (x - Ay) \\  &= -2 A^\top M (x - Ay) = 0, \\  A^\top M A y &= A^\top Mx. \end{aligned}

Since A has full rank and M is non-singular, A^\top M A is non-singular and so y = \left( A^\top M A \right)^{-1} A^\top Mx. This implies that

\begin{aligned} \Pi =  A \left( A^\top M A \right)^{-1} A^\top M. \end{aligned}

How does this relate to our previous post? Setting M = I reduces \Pi to our initial expression for the projection matrix, A (A^\top A)^{-1} A^T. That is, our original expression for the projection matrix was for the (orthogonal) projection according to the standard Euclidean metric.


  1. Ferguson, T. S. (1996). A Course in Large Sample Theory. Chapter 23.

Rank and trace are equal for a real symmetric idempotent matrix

Proposition. Let \mathbf{X} \in \mathbb{R}^{n \times n} be a matrix that is symmetric (\mathbf{X}^\top = \mathbf{X}) and idempotent (\mathbf{X}^2 = \mathbf{X}). Then the rank of \mathbf{X} is equal to the trace of \mathbf{X}. In fact, they are both equal to the sum of the eigenvalues of \mathbf{X}.

The proof is relatively straightforward. Since \mathbf{X} is real and symmetric, it is orthogonally diagonalizable, i.e. there is an orthogonal matrix \mathbf{U} (\mathbf{U}^\top \mathbf{U} = \mathbf{I}) and a diagonal matrix \mathbf{D} such that \mathbf{D} = \mathbf{UXU}^\top (see here for proof).

Since \mathbf{X} is idempotent,

\begin{aligned} \mathbf{X}^2 &= \mathbf{X}, \\  \mathbf{U}^\top \mathbf{D}^2 \mathbf{U} &= \mathbf{U}^T \mathbf{DU}, \\  \mathbf{D}^2 &= \mathbf{D}. \end{aligned}

Since \mathbf{D} is a diagonal matrix, it implies that the entries on the diagonal must be zeros or ones. Thus, the number of ones on the diagonal (which is \text{rank}(\mathbf{D}) = \text{rank}(\mathbf{X})) is equal to the sum of the diagonal (which is \text{tr}(\mathbf{D}) = \text{tr}(\mathbf{X})).

Conway’s fruitful fractions (or how to generate all the prime numbers by multiplying one of 14 fractions)

(Credit: I recently learned of the fruitful fractions from this blog post on The Aperiodical.)

Conway‘s fruitful fractions first appeared as a problem in the Mathematical Intelligencer in 1980 (Reference 1), and are described in more detail in Guy (1983) (Reference 2). The discussion here closely follows that of Reference 2, and the figures in this post are taken from that paper.

The fruitful fractions are a set of 14 innocent-looking fractions:

Consider the following algorithm:

  1. Start with the input x = 2.
  2. While TRUE (i.e. repeat the steps in the following loop forever):
    1. Multiply x by the leftmost fraction in the row above such that the result is still a whole number.
    2. If x is a pure power of 2, i.e. x = 2^t for some non-negative integer t, output the exponent t.

It turns out that the output sequence is simply the list of all primes! Here are the steps taken to go from the initial input 2 to the first output 2 (when x = 2^2 = 4):

Amazed?! The rest of this post outlines why this is the case. TLDR: We can reformulate the algorithm above as a flow diagram, which can be interpreted as a computer program that finds all the primes using just 3 types of operations (adding 1 to the contents of a register, subtracting 1 from the contents of a register, and checking if a register is 0).

Step 1: Write out the computer program to generate primes with simple operations

The figure below depicts the computer program that generates all the primes as a flow diagram:

The box in blue (“Is n prime?”) is a complex operation which we need to break down even further; the figure below demonstrates how we can do so. For a given n, for d = n, n-1 , n-2, \dots , 1, we check if d divides n (i.e. d is a divisor of n). If d \mid n for some d > 1, then n is composite, otherwise it is prime.

The box in red (“Does d \mid n?”) is still too complex: we can break it down further as shown in the figure below. For given d and n, repeatedly subtract d from n until we reach a number r < d. If r = 0, then d \mid n, if not d does not divide n.

If we combine these 3 flow diagrams into one, we get a flow diagram that generates all the primes using 3 simple operations: adding 1 to a variable, subtracting 1 from a variable, and checking if a variable is equal to 0. The rest of the steps will show that the fruitful fractions replicate this flow diagram.

Step 2: Think of x‘s prime factorization as encoding a state

Rewrite the 14 fruitful fractions with their prime factorizations:

It’s not immediately obvious, but one can show that at any point, the number x can be written in the form

x = 2^t 3^s 5^r 7^q p,

where t, s, r, q are non-negative integers and p \in \{1, 11, 13, 17, 19, 23, 29\}. We can think of x as keeping track of 5 pieces of information, which we write as x = (t, s, r, q)_p. Thinking of this as a computer program / machine, p represents the state of the machine while t, s, r, q are the contents of 4 registers of the machine.

Step 3: Draw the possible transitions between states

Assume that the machine is in a particular state. What other states can the machine reach from there? Each of the 14 fractions corresponds to one state transition, depicted in the state transition diagram below. The numbers in the circle represent the state (the value of p), the letters correspond to the fraction being multiplied, and the numbers correspond to changes to the registers (t, s, r, q). For example, step B takes the machine from state 17 to state 13, increases t and s (the 2- and 3-registers) by 1, decreases r (the 5-register) by 1, and leaves q (the 7-register) unchanged.

Step 4: Amalgamate steps into subroutines

Running through the state diagram several times, you will find that some sequences of steps always happen together: we can amalgamate these into subroutines. For example, in state 19, the machine will make a pair of steps HI t times (where t is the 2-register), transferring t to r (the 5-register), and then do step R to arrive at state 11. State 19 can only be reached by step D from state 17 or by step I from state 23, so we can replace stats 19 and 23 by the subroutine

D_n = D(HI)^n R

which goes directly from state 17 to state 11. With some work we can come up with the following subroutines:

We can then simplify our original state diagram with these subroutines:

Step 5: Assign meaning to registers and subroutines

This is the trickiest step to work out in my opinion. Notice the following 3 things:

  • The sum of the 2- and 5-registers, t + r is always equal to n (except when we have finished checking if n is prime and are moving on to n+1).
  • The sum of the 3- and 7- registers, s + q is always equal to d (except when we have finished checking if d \mid n and are moving on to checking if (d-1) \mid n).
  • The “Next Number” subroutine N_n is always followed by T_n, the “Decrease Divisor” subroutine D_n is always followed by T_{r-1}, and each of these is followed by the “division routine” (S_d T_d)^l A_r.

With some tedious work, we can now match our original prime producing flow diagram to the states and subroutines we have developed! The blue and red boxes match with those earlier in this post. (Because of the font face used in this diagram, the variable t looks like a slightly bigger version of the variable r.)


  1. Conway, J. H. (1980). Problem 2.4.
  2. Guy, R. K. (1983). Conway’s Prime Producing Machine.

Two ways to approximate the root of an equation involving a combinatorial sum

I recently came across two approximations of an equation involving a particular combinatorial sum which I thought were pretty neat. Let r be a positive integer and let x \in [0, 1]. The combinatorial sum in question is

\begin{aligned} f(x, r) &= \sum_{i=1}^r \dfrac{1}{i} \cdot \binom{r}{i} x^{r-i}(1-x)^i  \\  &= \binom{r}{1}x^{r-1}(1-x) + \dfrac{1}{2}\binom{r}{2}x^{r-2}(1-x)^2 + \dots + \dfrac{1}{r}\binom{r}{r}(1-x)^r. \end{aligned}

Think of r as fixed and large. The equation to be solved is

\begin{aligned} x^r = f(x, r). \end{aligned}

This equation comes from the solution of Problem 48 in Mosteller (1987) (Reference 1), and in the solution Mosteller provides two ways to approximate the solution to the equation.

The first approximation notes that for large r, (1-x)^r gets small. Hence, we can approximate the combinatorial sum by its leading term, i.e.

\begin{aligned} f(x, r) &\approx \binom{r}{1}x^{r-1}(1-x) = r x^{r-1} (1-x). \end{aligned}

With this approximation, the equation becomes

\begin{aligned} x^r &= r x^{r-1} (1-x), \\  x &= r(1-x), \\  x &= \dfrac{r}{r+1}. \end{aligned}

The plot below compares the actual value of the root (black line) with this approximation (red line) for values of r from 1 to 20. It’s not a bad approximation, and we can see the gap narrowing as r increases.

The second approximation is a bit more sophisticated. If we let z = \dfrac{1-x}{x}, we can transform the equation:

\begin{aligned} x^r &= \sum_{i=1}^r \dfrac{1}{i} \cdot \binom{r}{i} x^{r-i}(1-x)^i, \\  1 &= \sum_{i=1}^r \dfrac{1}{i} \cdot \binom{r}{i} x^{-i}(1-x)^i, \\  1 &= \sum_{i=1}^r \dfrac{1}{i} \cdot \binom{r}{i} z^i. \end{aligned}

From the first approximation, we know that

z = \dfrac{1-x}{x} \approx \dfrac{1 - r/(r+1)}{r/(r+1)} = \dfrac{1}{r}.

Let’s refine this approximation a bit by letting z = \alpha / r, where \alpha is a constant for us to determine. Since we want a solution for large r, let’s take r to infinity and see what value of \alpha might make sense there. Plugging this value of z into the equation above,

\begin{aligned} 1 &= \sum_{i=1}^\infty \dfrac{1}{i} \cdot \binom{r}{i} \dfrac{\alpha^i}{r^i} \\  &= \sum_{i=1}^\infty \dfrac{1}{i} \cdot \dfrac{r(r-1) \dots (r-i+1)}{i!} \dfrac{\alpha^i}{r^i}  \\  &\approx \sum_{i=1}^\infty \dfrac{1}{i} \cdot \dfrac{r^i}{i!} \dfrac{\alpha^i}{r^i} \\  &= \sum_{i=1}^\infty \dfrac{\alpha^i}{i \cdot i!}. \end{aligned}

Since the RHS is increasing in \alpha, we can use root finding techniques like bisection search to approximate the value of \alpha. We find that \alpha \approx 0.8043, i.e. x \approx \dfrac{r}{r+0.0843}.

The plot below shows this second approximation in blue (the true value is in black and the first approximation is in red). You can see that this is a much better approximation!

This last plot compares the relative error of the two approximations (i.e. |approx - true| / true \times 100\%).


  1. Mosteller, F. (1987). Fifty challenging problems in probability with solutions.