Project 3 – Ensemble Methods and Unsupervised Learning
In this project you will explore some techniques in unsupervised learning
as well as ensemble methods. It is important to realize that understanding
an algorithm or technique requires understanding how it behaves under a
variety of circumstances. You will go through the process of choosing and
exploring two classification datasets, tuning the algorithms you have
learned about, writing a thorough analysis of your findings, and presenting
your findings. The most crucial part of this assignment is the analysis and
your ability to explain and justify your results.
I. Choosing Datasets
The first task in this assignment is choosing two interesting classification
datasets; these can be binary or multiclass. The features can be of any
type, and it is recommended that you choose datasets with diverse feature
sets. I don’t care where you get the data from. You can download some,
take some from your own research, or make some up on your own. What I
do care about is that the datasets are interesting. They should
contain a decent number of features and a sufficiently large number of
examples. Do not choose an “easy” dataset, but don’t go crazy trying
to find the perfect one either. Your two datasets should also differ in some
way such that you can compare and contrast your results between the
two. You should also follow standard machine learning practice by
splitting each dataset into training and testing sets, and only touching the
testing set at the very end when you are ready to report results.
(Cross validation is highly recommended.)
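The split-then-cross-validate workflow described above can be sketched as follows. This is only an illustration under stated assumptions: the synthetic data from make_classification stands in for one of your own datasets, and the random forest, the 80/20 split, and the 5 folds are arbitrary choices, not requirements of the assignment.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for one of your datasets (assumption for illustration).
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=8, random_state=0
)

# Hold out a test set once, stratified so class proportions are preserved.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

model = RandomForestClassifier(n_estimators=50, random_state=0)

# All model selection happens on the training portion via cross validation.
cv_scores = cross_val_score(model, X_train, y_train, cv=5)
print(f"CV accuracy: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")

# Only at the very end: fit on the full training set and touch the test set.
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
print(f"Test accuracy: {test_accuracy:.3f}")
```

Keeping the test set out of every tuning decision is what makes the final number an honest estimate of generalization.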
II. Coding (10%)
After choosing your datasets you will now be tasked with writing code to apply
the machine learning algorithms you have learned about. Your code must be
written in Python, but you may use any libraries that have already implemented
the machine learning algorithms (e.g., scikit-learn). You are not expected to code
the algorithms from scratch, and in fact I would highly discourage it. What you
may not do is copy code from the internet. Below are the analyses you are
required to run.
1) Run K-means and Hierarchical Clustering on your datasets and analyze
what you observe.
2) Run two dimensionality reduction algorithms (PCA and UMAP) on your
datasets. Observe and analyze the results.
3) Re-run the K-means and Hierarchical Clustering on your dimensionality
reduced datasets and compare the results to part (1).
4) Tune and train two ensemble models (AdaBoost and Random Forests) on
both your original and dimensionality reduced datasets. Compare and
analyze the results.
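As a rough sketch of analyses (1) through (3), the snippet below clusters a dataset before and after PCA and scores each clustering. The synthetic data, the choice of 3 clusters, 2 PCA components, and the two metrics (silhouette and adjusted Rand index) are all assumptions made for illustration; cluster_report is a hypothetical helper, not part of any library.

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.metrics import adjusted_rand_score, silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic 3-class stand-in for one of your datasets (assumption).
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=6,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)

# Distance-based methods are sensitive to feature scale, so standardize first.
X_scaled = StandardScaler().fit_transform(X)

# PCA down to 2 components. UMAP (from the separate umap-learn package)
# can be substituted here through its own fit_transform method.
X_pca = PCA(n_components=2, random_state=0).fit_transform(X_scaled)


def cluster_report(features, labels_true, n_clusters=3):
    """Hypothetical helper: run both clusterers and score each result."""
    algorithms = {
        "kmeans": KMeans(n_clusters=n_clusters, n_init=10, random_state=0),
        "hierarchical": AgglomerativeClustering(n_clusters=n_clusters),
    }
    report = {}
    for name, algorithm in algorithms.items():
        labels = algorithm.fit_predict(features)
        report[name] = {
            # Internal quality: how compact and separated the clusters are.
            "silhouette": silhouette_score(features, labels),
            # External quality: agreement with the known class labels.
            "ari": adjusted_rand_score(labels_true, labels),
        }
    return report


original = cluster_report(X_scaled, y)
reduced = cluster_report(X_pca, y)
for name in original:
    print(name, "original:", original[name], "pca:", reduced[name])
```

The class labels are used here only to score the clusters after the fact; the clustering algorithms themselves never see them.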
Your code does not have to be pretty or well written. However, it must be written
in Python and I must be able to run one script (main.py) that will produce all the
results and figures in your report.
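For analysis (4), one possible way to tune both ensemble models on the original and the dimensionality reduced features is a small grid search inside a pipeline, sketched below. The synthetic data, the hyperparameter grids, the 5 PCA components, and the 3 folds are illustrative assumptions; your own grids should be wider and motivated by your datasets.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for one of your datasets (assumption).
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=8, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Deliberately tiny grids so the sketch runs quickly.
models = {
    "adaboost": (
        AdaBoostClassifier(random_state=0),
        {"clf__n_estimators": [25, 50], "clf__learning_rate": [0.5, 1.0]},
    ),
    "random_forest": (
        RandomForestClassifier(random_state=0),
        {"clf__n_estimators": [25, 50], "clf__max_depth": [3, None]},
    ),
}
# "passthrough" skips the reduction step, i.e. uses the original features.
reducers = {
    "original": "passthrough",
    "pca": PCA(n_components=5, random_state=0),
}

results = {}
for model_name, (clf, grid) in models.items():
    for rep_name, reducer in reducers.items():
        # Scaling and PCA live inside the pipeline so they are re-fit on
        # each training fold and never see the validation fold.
        pipe = Pipeline(
            [("scale", StandardScaler()), ("reduce", reducer), ("clf", clf)]
        )
        search = GridSearchCV(pipe, grid, cv=3)
        search.fit(X_train, y_train)
        results[(model_name, rep_name)] = {
            "best_params": search.best_params_,
            "cv_accuracy": search.best_score_,
            # Test set is scored once, after tuning is finished.
            "test_accuracy": search.score(X_test, y_test),
        }

for key, value in results.items():
    print(key, value)
```

Putting the reducer inside the pipeline avoids leaking information from validation folds into the PCA fit, which would otherwise make the reduced-feature results look better than they are.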
III. Report (80%)
You will then produce a report describing and analyzing your methods and
results. Here you will describe the datasets you have chosen and why they are
interesting. You will then provide an analysis on how the different machine
learning algorithms performed on each dataset. The report must be limited to 10
pages maximum. Plots and figures are highly recommended. It is up to you
how you wish to demonstrate your understanding of the machine learning
algorithms you have explored, but below I have listed some potential ideas for
analysis and items you may wish to include in the report.
• A description of your two datasets and why you feel that they are interesting.
• Hypotheses on how you believe the learning algorithms will perform on each
dataset and why.
• How you dealt with different features in your datasets (e.g., missing data,
different scalings)
• Training and testing error rates you obtained for your various learning
algorithms (some sort of cross validation is highly recommended)
• The effect of hyperparameters on performance
• Comparing and contrasting results between datasets
• Comparing and contrasting results between learning algorithms
• Training and testing error rates as a function of training dataset size
• Timing analysis of how long it takes to train/test each algorithm
• Conclusions
• Ideas for future analyses
• What you may have done differently
• References
You are NOT being graded on how well the algorithms perform on your datasets.
What is most important is WHY. You should be explaining and justifying all of
your figures and results, and demonstrating that you understand the intricate
details of the machine learning process and the machine learning algorithms you
are using.
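Two of the suggested analyses above, error rates as a function of training dataset size and timing, can be gathered in one pass with scikit-learn's learning_curve. This is a minimal sketch: the synthetic data, the random forest, and the five training sizes are assumptions for illustration, and in your project X and y would be the training split only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

# Synthetic stand-in for the training portion of a dataset (assumption).
X, y = make_classification(
    n_samples=400, n_features=20, n_informative=8, random_state=0
)

# return_times=True also reports how long each fit and scoring call took.
train_sizes, train_scores, val_scores, fit_times, score_times = learning_curve(
    RandomForestClassifier(n_estimators=30, random_state=0),
    X,
    y,
    train_sizes=np.linspace(0.2, 1.0, 5),
    cv=5,
    return_times=True,
)

# Average over the cross validation folds; error = 1 - accuracy.
train_error = 1.0 - train_scores.mean(axis=1)
val_error = 1.0 - val_scores.mean(axis=1)
mean_fit_time = fit_times.mean(axis=1)

for n, tr, va, t in zip(train_sizes, train_error, val_error, mean_fit_time):
    print(f"n={n:4d}  train_err={tr:.3f}  val_err={va:.3f}  fit_time={t:.3f}s")
```

The numbers alone are not the analysis: the gap between the two error curves, and how it changes with more data, is what needs explaining in the report.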
IV. Presentation (10%)
Finally, you will give a presentation of at most 7 minutes on your results (you will be
cut off exactly at the 7 minute mark). In this presentation you will describe your
datasets, your methods, and any interesting results you found!
What to turn in?
Below is a list of items you will be required to turn in via Canvas. Please make
sure all documents are named as described below.
• report.pdf – Your maximum 10 page report in PDF format. Do not use super
tiny or large font. No specific formatting is required but use common sense.
• presentation.pptx or presentation.key – Your presentation slides either in a
PowerPoint or Keynote document.
• code.zip – A zip file with all of the code you have written. Within the folder
there should be a file called README.txt that contains instructions on how to run
your code, and a Python file called main.py that will produce all figures and
plots in your report/presentation. I should be able to reproduce your results
easily.
• data.zip – A zip file that contains the two datasets you have chosen.
Grading
You are being scored on your analysis more than anything else. Roughly
speaking, implementing everything and getting it to run is worth very little for
this assignment. Of course though, analysis without proof of working code
makes the analysis suspect. The key thing is that your explanations should be
both thorough and concise, and your analysis should prove to me you have a
deep understanding of the machine learning process and the machine learning
algorithms you are using.