MUST BE DONE IN LATEX

1 Convexity (Chudi)

Convexity is very important for optimization. In this question, we will explore the convexity of some functions.

1.1 Let $h(x) = f(x) + g(x)$, where $f(x)$ and $g(x)$ are both convex functions. Show that $h(x)$ is also convex.

1.2 In mathematics, a concave function is the negative of a convex function. Let $g_1(x), g_2(x), \ldots, g_k(x)$ be concave functions with $g_i(x) > 0$. Show that $\log\left(\prod_{i=1}^{k} g_i(x)\right)$ is concave. (Note that the base of the log function is $e$.)
Hint 1: the sum of two concave functions is concave.
Hint 2: $\log(a \times b) = \log(a) + \log(b)$.
Hint 3: the log function is concave and monotonically increasing on $\mathbb{R}^{+}$.

1.3 A line, represented by $y = ax + b$, is both convex and concave. Draw four lines with different values of $a$ and $b$ and highlight their maximum and minimum in two different colors. What does this reveal about the convexity of the maximum and minimum of these lines, respectively?

1.4 Now we consider the general case. Let $f(x)$ and $g(x)$ be two convex functions. Show that $h(x) = \max(f(x), g(x))$ is also convex or give a counterexample.

1.5 Let $f(x)$ and $g(x)$ be two convex functions. Show that $h(x) = \min(f(x), g(x))$ is also convex or give a counterexample.

1.6 Since convex optimization is much more desirable than non-convex optimization, many commonly used risk functions in machine learning are convex. In the following questions, we will explore some of these empirical risk functions. Suppose we have a dataset $\{X, y\} = \{x_i, y_i\}_{i=1}^{n}$ on which we will perform classification, where $x_i \in \mathbb{R}^p$ is a $p$-dimensional feature vector and $y_i \in \{-1, 1\}$ is the binary label. Please identify and show by proof whether the following empirical risk functions are convex with respect to the weight parameter vector $\omega$. In the model below, $\omega = [\omega_1, \omega_2, \ldots, \omega_p]^T \in \mathbb{R}^p$ and $b \in \mathbb{R}$ are the weight and bias. We would like to show convexity with respect to $\omega$ and $b$.
(a) $L(\omega, b) = \sum_{i=1}^{n} \log\left(1 + e^{-y_i(\omega^T x_i + b)}\right)$

(b) $L(\omega, b, C) = \frac{1}{2}\|\omega\|_2^2 + C \sum_{i=1}^{n} \max\left(0, 1 - y_i(\omega^T x_i + b)\right)$, $C \ge 0$

1.7 Consider the lasso regression problem. Given a dataset $\{X, y\} = \{x_i, y_i\}_{i=1}^{n}$ where $y_i \in \mathbb{R}$, please identify and show by proof whether the following empirical risk function is convex with respect to the weight parameter vector $\omega$.

$L(\omega, b, C) = \|y - (X\omega + b\mathbf{1}_n)\|_2^2 + C\|\omega\|_1$, $C \ge 0$

2 Gradient Descent and Newton’s Method (Chudi)

Optimization algorithms are important in machine learning. In this problem, we will explore ideas behind gradient descent and Newton’s method. Suppose we want to minimize a convex function given by

$f(x) = 2x_1^2 + 3x_2^2 - 4x_1 - 6x_2 + 2x_1x_2 + 13$.

2.1
(a) Find the expression for $\nabla f(x)$, the first derivative of $f(x)$.
(b) Find the expression for $Hf(x)$, the Hessian of $f(x)$.
(c) Compute analytically the value of $x^*$, at which the optimum is achieved for $f(x)$. Hint: use the first two parts.

2.2 Gradient descent is a first-order iterative optimization algorithm for finding a local minimum of a differentiable function. Since $f(x)$ is strictly convex and differentiable, gradient descent can be used to find the unique optimal solution. Recall that gradient descent takes repeated steps in the opposite direction of the gradient of the function at the current point until convergence. The algorithm is shown below.

Given an initial point $x_0$ and a step size $\alpha$,
    $t = 0$
    while $x_t$ is not a minimum
        $x_{t+1} = x_t - \alpha \nabla f(x_t)$
        $t = t + 1$
    end

For each iteration, $x_{t+1} = x_t - \alpha \nabla f(x_t)$. Suppose $\alpha = 0.1$. Show that $x_{t+1}$ converges to the $x^*$ obtained in 2.1 as $t \to \infty$.
Hint 1: Let $A$ be a square matrix. $I + A + A^2 + \ldots + A^k = (I - A)^{-1}(I - A^{k+1})$.
Hint 2: Let $A$ be a square matrix. $A^k$ converges to the zero matrix if all eigenvalues of $A$ have absolute value smaller than 1.

2.3 Will $x_{t+1}$ converge to $x^*$ when $\alpha$ grows larger? Hint: the previous question’s calculations should provide you the answer.

2.4 Gradient descent only uses the first derivative to make updates.
If a function is twice differentiable, can the second derivative help find optimal solutions? Newton’s method takes the second derivative into consideration, and the key update step is modified to $x_{t+1} = x_t - (Hf(x_t))^{-1} \nabla f(x_t)$. The algorithm is shown below.

    $t = 0$
    while $x_t$ is not a minimum
        $x_{t+1} = x_t - (Hf(x_t))^{-1} \nabla f(x_t)$
        $t = t + 1$
    end

(a) Using the initial point $x_0 = [1, 2]^T$, give the value of $x_1$ that Newton’s method would find in one iteration. Does this method converge in one iteration? If not, give the value of $x_2$ that Newton’s method would find and report whether $x_2$ is equal to $x^*$.
(b) Using the initial point $x_0 = [1, 2]^T$ and a fixed step size $\alpha = 0.1$, give the value of $x_1$ that gradient descent would find in one iteration. Does this method converge in one iteration? If not, give the value of $x_2$ that gradient descent would find and report whether $x_2$ is equal to $x^*$.
(c) According to the results above, which method converges faster? Please briefly explain why this happens.

3 Bias and Variance of Random Forest

In this question, we will analyze the bias and variance of the random forest algorithm. Bias is the amount by which a model’s prediction differs from the target value in the training data. An underfitted model can lead to high bias. Variance indicates how much the estimate of the target function would change if different training data were used. An overfitted model can lead to high variance.

3.1 Let $\{X, y\} = \{(x_i, y_i)\}_{i=1}^{n}$ be the training dataset, where $x_i$ is a feature vector and $y_i \in \{-1, 1\}$ is the binary label. Suppose all overlapping samples are consistently labeled, i.e. $\forall i \neq j$, $y_i = y_j$ if $x_i = x_j$. Show that 100% training accuracy can be achieved if there is no depth limit on the trees in the forest. Hint: the proof is short. One way to prove it is by induction.

3.2 Let $D_1, D_2, \ldots, D_T$ be random variables that are identically distributed with positive pairwise correlation $\rho$ and $\mathrm{Var}(D_i) = \sigma^2$, $\forall i \in \{1, \ldots, T\}$. Let $\bar{D}$ be the average of $D_1, \ldots, D_T$, i.e.
$\bar{D} = \frac{1}{T}\left(\sum_{i=1}^{T} D_i\right)$. Show that

$\mathrm{Var}(\bar{D}) = \frac{1 - \rho}{T}\sigma^2 + \rho\sigma^2$. (1)

3.3 In the random forest algorithm, we can view the decision tree grown in each iteration as approximately $D_i$ in Problem 3.2. Given this approximation, briefly explain how we expect the variance of the average of the decision trees to change as $T$ increases.

3.4 As $T$ becomes large, the first term in Eq. (1) goes away, but the second term does not. What parameter(s) in random forest control $\rho$ in that second term?
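
The dependence on $T$ discussed in 3.3 and 3.4 can be checked numerically. The following is a minimal sketch, not part of the assignment: it assumes the covariance structure implied by 3.2, namely $\mathrm{Cov}(D_i, D_j) = \rho\sigma^2$ for $i \neq j$ and $\mathrm{Var}(D_i) = \sigma^2$. It compares the closed form $\frac{1-\rho}{T}\sigma^2 + \rho\sigma^2$ against a direct summation of all pairwise covariances. The function names and the values $\rho = 0.3$, $\sigma^2 = 2.0$ are hypothetical choices made only for illustration.

```python
def variance_of_average(T, rho, sigma2):
    """Closed form for Var(D_bar): (1 - rho) / T * sigma2 + rho * sigma2."""
    return (1.0 - rho) / T * sigma2 + rho * sigma2


def variance_of_average_by_summation(T, rho, sigma2):
    """Var(D_bar) = (1 / T^2) * sum over all pairs (i, j) of Cov(D_i, D_j).

    Assumes Cov(D_i, D_i) = sigma2 and Cov(D_i, D_j) = rho * sigma2 for i != j.
    """
    total = 0.0
    for i in range(T):
        for j in range(T):
            total += sigma2 if i == j else rho * sigma2
    return total / (T * T)


if __name__ == "__main__":
    rho, sigma2 = 0.3, 2.0  # hypothetical values, chosen only for illustration
    for T in (1, 10, 100, 1000):
        closed = variance_of_average(T, rho, sigma2)
        summed = variance_of_average_by_summation(T, rho, sigma2)
        print(T, closed, summed)
```

Under these assumptions the two computations should agree for every $T$, and the printed values should shrink toward $\rho\sigma^2$ rather than toward zero as $T$ grows. Direct summation was chosen over random sampling so the check is deterministic.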