1
CAP 5625: Programming Assignment 4
Due on Canvas by Friday, December 1, 2023 at 11:59pm
Preliminary instructions
You may consult with other students currently taking CAP 5625 in your section at FAU on this
programming assignment. If you do consult with others, then you must indicate this by
providing their names with your submitted assignment. However, all analyses must be
performed independently, all source code must be written independently, and all students
must turn in their own independent assignment. Note that for this assignment, you may choose
to pair up with one other student in your section of CAP 5625 and submit a joint assignment. If
you choose to do this, then both your names must be associated with the assignment and you
will each receive the same grade.
Though it should be unnecessary to state in a graduate class, I am reminding you
that you may not turn in code (partial or complete) that is written or inspired by
others, including code from other students, websites, past code that I release
from prior assignments in this class or from past semesters in other classes I
teach, or any other source that would constitute an academic integrity violation.
All instances of academic integrity violations will receive a zero on the
assignment and will be referred to the Department Chair and College Dean for
further administrative action. A second offense could lead to dismissal from the
University and any offense could result in ineligibility for Departmental Teaching
Assistant and Research Assistant positions.
You may choose to use whatever programming language you want. However, you must provide
clear instructions on how to compile and/or run your source code. I recommend using a
modern language, such as Python, R, or Matlab as learning these languages can help you if you
were to enter the machine learning or artificial intelligence field in the future.
All analyses performed and algorithms run must be written from scratch. That is, you may not
use a library that can perform gradient descent, cross validation, ridge regression, logistic
(multinomial) regression, optimization, etc. to successfully complete this assignment (though
you may reuse your relevant code from Programming Assignments 1, 2, and 3). The goal of this
assignment is not to learn how to use particular libraries of a language, but it is to instead
understand how key methods in statistical machine learning are implemented. With that
stated, I will provide 5% extra credit if you additionally implement the assignment using built-in
statistical or machine learning libraries (see Deliverable 7 at end of the document).
Note, credit for deliverables that request graphs, discussion of results, or specific values will not
be given if the instructor must run your code to obtain these graphs, results, or specific values.
Brief overview of assignment
2
In this assignment you will still be analyzing human genetic data from ???? = 183 training
observations (individuals) sampled across the world. The goal is to fit a model that can predict
(classify) an individual’s ancestry from their genetic data that has been projected along ???? = 10
top principal components (proportion of variance explained is 0.2416) that we use as features
rather than the raw genetic data, as the training would take too long to complete for this
assignment with the raw data. I recognize that we have not covered precisely what principal
components are, but we will get to this in Module 10, and for now you should just treat them as
highly informative features as input to your classifier. Specifically, you will perform a penalized
(regularized) logistic (multinomial) regression fit using ridge regression, with the model
parameters obtained by batch gradient descent. Your predictions will be based on ???? = 5
continental ancestries (African, European, East Asian, Oceanian, or Native American). Ridge
regression will permit you to provide parameter shrinkage (tuning parameter ???? ≥ 0) to mitigate
overfitting. The tuning parameter ???? will be chosen using five-fold cross validation, and the best-
fit model parameters will be inferred on the training dataset conditional on an optimal tuning
parameter. This trained model will be used to make predictions on new test data points.
Training data
Training data for these observations are given in the attached
TrainingData_N183_p10.csv comma-separated file, with individuals labeled on each
row (rows 2 through 184), and input features (PC1, PC2, …, PC10) and ancestry label given on
the columns (with the first row representing a header for each column).
Test data
Test data are given in the attached TestData_N111_p10.csv comma-separated file, with
individuals labeled on each row (rows 2 through 112), and input features (PC1, PC2, …, PC10),
and ancestry label given on the columns (with the first row representing a header for each
column). There are five individuals with Unknown ancestry, 54 individuals with Mexican
ancestry, and 52 individuals with African American ancestry. Each of the five Unknown
individuals belong to one of the five ancestries represented in the training set, and each
ancestry is represented once in the five Unknown individuals. The Mexican and African
American individuals have a range of ancestry proportions based on historical mixing of
ancestors of diverse ancestry. You will use the class probabilities from logistic (multinomial)
regression to predict the ancestry proportion of each of these putatively mixed samples.
Detailed description of the task
Recall that the task of performing a ridge-penalized logistic regression fit to training data
{(????1, ????1), (????2, ????2), … , (???????? , ????????)} is to minimize the cost function
????(????, ????) = − log ℒ(????) + ????∑∑????????????
2
????
????=1
????
????=1
where
3
???? = [
????01 ????02 ⋯ ????0????
????11 ????12 ⋯ ????1????
⋮ ⋮ ⋱ ⋮
????????1 ????????2 ⋯ ????????????
]
is the (???? + 1) × ???? matrix of model parameters, where
log ℒ(????) = ∑[∑???????????? (????0???? +∑????????????????????????
????
????=1
)
????
????=1
− log(∑exp(????0ℓ +∑????????????????????ℓ
????
????=1
)
????
ℓ=1
)]
????
????=1
is the likelihood function, where ????????????, ???? = 1,2, … ,???? and ???? = 1,2, … , ????, is an indicator variable
defined as
???????????? = {
1 if ???????? = ????
0 if ???????? ≠ ????
and where the input ???? features are standardized (i.e., centered and divided by their standard
deviation). Moreover, recall that batch gradient descent updates each parameter ????????????, ???? =
0,1, … , ???? and ???? = 1,2, … , ????, as follows:
???????????? ≔ ???????????? − ????
????
????????????????
????(????, ????)
where ???? is the learning rate and where the partial derivative of the cost function with respect
to parameter ???????????? is
????
????????????????
????(????, ????) =
{
−∑[???????????? − ????????(????????; ????)]
????
????=1
if ???? = 0
−∑????????????[???????????? − ????????(????????; ????)]
????
????=1
+ 2???????????????? if ???? > 0
where
????????(????????; ????) =
exp(????0???? + ∑ ????????????????????????
????
????=1 )
∑ exp (????0ℓ + ∑ ????????????????????ℓ
????
????=1 )????
ℓ=1
To implement this algorithm, depending on whether your chosen language can quickly compute
vectorized operations, you may implement batch gradient descent using either Algorithm 1 or
Algorithm 2 below (choose whichever you are more comfortable implementing). Note that in
languages like R, Python, or Matlab, Algorithm 2 (which would be implemented by several
nested loops) may be much slower than Algorithm 1. Also note that if you are implementing
Algorithm 1 using Python, use numpy arrays instead of Pandas data frames for computational
speed. For this assignment, assume that we will reach the minimum of the cost function within
a fixed number of steps, with the number of iterations being 10,000.
4
You may need to explore different learning rate values to identify one that is not too large and
not too small, such that it is likely for the algorithm to converge in a reasonable period of time. I
would consider a learning rate of ???? = 10−5, though I encourage you to explore how your model
trains for smaller and larger learning rates as well. For this assignment, assume that we will
reach the minimum of the cost function within a fixed number of steps, with the number of
iterations being 10,000 (a large number as we have many parameters). Keep in mind that due
to this large number of iterations, it could take a long time to train your classifier.
Algorithm 1 (vectorized):
Step 1. Choose learning rate ???? and fix tuning parameter ????
Step 2. Generate ???? × (???? + 1) augmented design matrix
???? =
[
1 ????11 ????12 ⋯ ????1????
1 ????21 ????22 ⋯ ????2????
⋮ ⋮ ⋮ ⋱ ⋮
1 ????????1 ????????2 ⋯ ????????????]
where column ???? + 1 has been centered and standardized such that feature ????, ???? =
1,2, … , ????, has mean zero and standard deviation one, and generate ???? × ???? indicator
response matrix
???? = [
????11 ????12 ⋯ ????1????
????21 ????22 ⋯ ????2????
⋮ ⋮ ⋱ ⋮
????????1 ????????2 ⋯ ????????????
]
where ???????????? = 1 if observation ???? is from class ????, and 0 otherwise.
Step 3. Initialize the (???? + 1) × ????-dimensional parameter matrix
???? = [
????01 ????02 ⋯ ????0????
????11 ????12 ⋯ ????1????
⋮ ⋮ ⋱ ⋮
????????1 ????????2 ⋯ ????????????
]
to all zeros, so that initially each class has the same probability.
Step 4. Compute ???? × ???? unnormalized class probability matrix as
???? = [
????11 ????12 ⋯ ????1????
????21 ????22 ⋯ ????2????
⋮ ⋮ ⋱ ⋮
????????1 ????????2 ⋯ ????????????
] = exp(????????)
where exp(????????) indicates exponentiation of each element of the ???????? matrix, and not
the matrix exponential of ????????.
Step 5. Compute ???? × ???? normalized class probability matrix as
???? = [
????11 ????12 ⋯ ????1????
????21 ????22 ⋯ ????2????
⋮ ⋮ ⋱ ⋮
????????1 ????????2 ⋯ ????????????
]
5
where
????????(????????; ????) = ???????????? =
????????????
∑ ????????ℓ
????
ℓ=1
Step 6. For ease of vectorization, generate (???? + 1) × ???? intercept matrix
???? = [
????01 ????02 ⋯ ????0????
0 0 ⋯ 0
⋮ ⋮ ⋱ ⋮
0 0 ⋯ 0
]
Step 7. Update the parameter matrix as
???? ∶= ???? + ????[????????(???? − ????) − 2????(???? − ????)]
Step 8. Repeat Steps 4 to 7 for 10,000 iterations
Step 9. Set the last updated parameter matrix as �̂�
Algorithm 2 (non-vectorized):
Step 1. Choose learning rate ???? and fix tuning parameter ????
Step 2. Generate ???? × ???? design matrix
???? = [
????11 ????12 ⋯ ????1????
????21 ????22 ⋯ ????2????
⋮ ⋮ ⋱ ⋮
????????1 ????????2 ⋯ ????????????
]
where column ???? has been centered and standardized such that feature ????, ???? =
1,2, … , ????, has mean zero and standard deviation one, and generate ???? × ???? indicator
response matrix
???? = [
????11 ????12 ⋯ ????1????
????21 ????22 ⋯ ????2????
⋮ ⋮ ⋱ ⋮
????????1 ????????2 ⋯ ????????????
]
where ???????????? = 1 if observation ???? is from class ????, and 0 otherwise.
Step 3. Initialize the (???? + 1) × ????-dimensional parameter matrix
???? = [
????01 ????02 ⋯ ????0????
????11 ????12 ⋯ ????1????
⋮ ⋮ ⋱ ⋮
????????1 ????????2 ⋯ ????????????
]
to all zeros, so that initially each class has the same probability.
Step 4. Create temporary (???? + 1) × ????-dimensional parameter matrix
6
????temp =
[
????01
temp
????02
temp
⋯ ????0????
temp
????11
temp
????12
temp
⋯ ????1????
temp
⋮ ⋮ ⋱ ⋮
????????1
temp
????????2
temp
⋯ ????????????
temp
]
Step 5. Compute ???? × ???? unnormalized class probability matrix as
???? = [
????11 ????12 ⋯ ????1????
????21 ????22 ⋯ ????2????
⋮ ⋮ ⋱ ⋮
????????1 ????????2 ⋯ ????????????
]
where
???????????? = exp(????0???? +∑????????????????????????
????
????=1
)
Step 6. Compute ???? × ???? normalized class probability matrix as
???? = [
????11 ????12 ⋯ ????1????
????21 ????22 ⋯ ????2????
⋮ ⋮ ⋱ ⋮
????????1 ????????2 ⋯ ????????????
]
where
????????(????????; ????) = ???????????? =
????????????
∑ ????????ℓ
????
ℓ=1
Step 7. For each ????, ???? = 0,1, … , ????, and ????, ???? = 1,2, … , ????, find next value for parameter ???? for
class ???? as
????????????
temp
≔
{
???????????? + ???? (∑[???????????? − ????????????]
????
????=1
) if ???? = 0
???????????? + ????(∑????????????[???????????? − ????????????]
????
????=1
− 2????????????????) if ???? > 0
Step 8. Update the parameter matrix as ???? = ????temp
Step 9. Repeat Steps 5 to 8 for 10,000 iterations
Step 10. Set the last updated parameter matrix as �̂�
Effect of tuning parameter on inferred regression coefficients
You will consider a discrete grid of nine tuning parameter values ???? ∈
{10−4, 10−3, 10−2, 10−1, 100, 101, 102, 103, 104} where the tuning parameter is evaluated
across a wide range of values on a log scale. For each tuning parameter value, you will use
batch gradient descent to infer the best-fit model.
7
Deliverable 1: Illustrate the effect of the tuning parameter on the inferred ridge regression
coefficients by generating five plots (one for each of the ???? = 5 ancestry classes) of 10 lines
(one for each of the ???? = 10 features), with the ????-axis as �̂�????????, ???? = 1,2, … ,10 for the graph of
class ????, and ????-axis the corresponding log-scaled tuning parameter value log10(????) that
generated the particular �̂�????????. Label both axes in all five plots. Without the log scaling of the
tuning parameter, the plot will look distorted.
Choosing the best tuning parameter
You will consider a discrete grid of nine tuning parameter values ???? ∈
{10−4, 10−3, 10−2, 10−1, 100, 101, 102, 103, 104} where the tuning parameter is evaluated
across a wide range of values on a log scale. For each tuning parameter value, perform five-fold
cross validation and choose the ???? value that gives the smallest
CV(5) =
1
5
∑ CategoricalCrossEntropy????
5
????=1
where
CategoricalCrossEntropy???? = −
1
????????
∑ (∑???????????? log10 ????????(????????; �̂�)
????
????=1
)
????∈ Validation Set ????
is a measure of the cost for a classifier with ???? classes in fold ???? and where ???????? is the number of
observations in Validation set ????.
Note that during the five-fold cross validation, you will hold out one of the five sets (here either
36 or 37 observations) as the Validation Set and the remaining four sets (the other 147 or 146
observations) will be used as the Training Set. On this Training Set, you will need to standardize
(center and divided by the standard deviation across samples) each feature. These identical
values used for standardizing the input will need to be applied to the corresponding Validation
Set, so that the Validation set is on the same scale. Because the Training Set changes based on
which set is held out for validation, each of the five pairs of Training and Validation Sets will
have different standardization parameters.
Deliverable 2: Illustrate the effect of the tuning parameter on the cross validation error by
generating a plot with the ????-axis as CV(5) error, and the ????-axis the corresponding log-scaled
tuning parameter value log10(????) that generated the particular CV(5) error. Label both axes
in the plot. Without the log scaling of the tuning parameter ????, the plots will look distorted.
Deliverable 3: Indicate the value of ???? value that generated the smallest CV(5) error.
Deliverable 4: Given the optimal ????, retrain your model on the entire dataset of ???? = 183
observations to obtain an estimate of the (???? + 1) × ???? model parameter matrix as �̂� and
8
make predictions of the probability for each of the ???? = 5 classes for the 111 test
individuals located in TestData_N111_p10.csv. That is, for class ????, compute
????????(????; �̂�) =
exp(�̂�0???? + ∑ ????????�̂�????????
????
????=1 )
∑ exp (�̂�0ℓ + ∑ ????????�̂�????ℓ
????
????=1 )????
ℓ=1
for each of the 111 test samples ????, and also predict the most probable ancestry label as
�̂�(????) = arg max
????∈{1,2,…,????}
????????(????; �̂�)
Report all six values (probability for each of the ???? = 5 classes and the most probable
ancestry label) for all 111 test individuals.
Deliverable 5: How do the class label probabilities differ for the Mexican and African American
samples when compared to the class label probabilities for the unknown samples? Are these
class probabilities telling us something about recent history? Explain why these class
probabilities are reasonable with respect to knowledge of recent history.
Deliverable 6: Provide all your source code that you wrote from scratch to perform all analyses
(aside from plotting scripts, which you do not need to turn in) in this assignment, along with
instructions on how to compile and run your code.
Deliverable 7 (extra credit): Implement the assignment using statistical or machine learning
libraries in a language of your choice. Compare the results with those obtained above, and
provide a discussion as to why you believe your results are different if you found them to be
different. This is worth up to 5% additional credit, which would allow you to get up to 105% out
of 100 for this assignment.