CS 4641 B: Machine Learning (Spring 2021)
Course Information
- Lecture: Tuesdays and Thursdays, 12:30pm-1:45pm on BlueJeans with credentials
- Piazza: piazza.com/gatech/spring2021/cs4641b
Course Overview
This course introduces techniques in machine learning with an emphasis on algorithms and their applications to real-world data. We will investigate the following question: how can we computationally extract useful knowledge from data for decision making and task support? We will focus on machine learning methods, which are organized into the following parts:
Basic math for data science and machine learning
- Linear algebra
- Probability and statistics
- Information theory
- Optimization
Supervised learning for predictive data analysis
- Tree-based models
- Support vector machines
- Linear classification and regression
- Neural networks
Unsupervised machine learning for data exploration
- Clustering analysis
- Dimensionality reduction
- Kernel density estimation
Advanced Topics
- Hidden Markov Models
- Reinforcement Learning
Prerequisites for this course include (1) basic knowledge of probability, statistics, and linear algebra; (2) basic programming experience in Python.
In addition to the technical content, this class includes the following learning objectives:
- Understanding the ethical ramifications of these algorithms
- Structuring a task into a machine learning workflow
- Collaborating effectively on team projects in a remote environment
- Conducting peer evaluation in a constructive format
- Communicating technical content in a concise and effective manner
Schedule
For all dates used in this course, the deadline time is 23:59 Anywhere on Earth (11:59 pm AoE). For example, a due date of "January 8" is the same as "January 8, 23:59 AoE". Convert the times to your local time using a Time Zone Converter.
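The AoE convention above can also be checked in code. The following is a minimal sketch, assuming only that AoE is the fixed UTC-12:00 offset; the helper name `aoe_deadline` is hypothetical and not part of any course tooling.

```python
from datetime import datetime, timedelta, timezone

# Anywhere on Earth (AoE) is a fixed offset of UTC-12:00.
AOE = timezone(timedelta(hours=-12), name="AoE")

def aoe_deadline(year, month, day):
    """Return the 23:59 AoE deadline for the given calendar date."""
    return datetime(year, month, day, 23, 59, tzinfo=AOE)

deadline = aoe_deadline(2021, 1, 8)

# The same instant expressed in UTC.
print(deadline.astimezone(timezone.utc))  # 2021-01-09 11:59:00+00:00

# The same instant in the time zone configured on this machine.
print(deadline.astimezone())
```

Calling `astimezone()` with no argument uses the system's local time zone, so the second line printed depends on where the code runs.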
Course policies
- Attendance: Our class will be offered in a hybrid mode without attendance requirements. Every lecture will be live-streamed and recorded, and the recordings will be made available to all students after class time. Although attendance is not required, I highly recommend that students join the live-streamed lectures. Live attendance helps me teach by pointing out what is clear and what needs more explanation. There are three touchpoints, which can be held in person at the explicit request of individual student teams. All touchpoints can be remote and do not require any on-campus activities.
- Class deliverables and late days: All class deliverables will be handled via Gradescope and Canvas. The time span offered to complete the course objectives is plentiful. Students have a total of 3 late days to use across all assignments over the semester. To ensure the class is fair for all students, you will receive zero credit for work submitted after the 3 late days have been used up. Regrade requests should be submitted directly on Gradescope within one week of grade publication. Should you find yourself at an impasse with the TA responsible for your grading, feel free to contact the head TA or course instructor.
- Piazza:
- Piazza will be the only place for course discussions and announcements. If you have questions, please ask them on Piazza first because 1) other students may have the same question; 2) you will get help much faster.
- If there is something you would rather not discuss publicly on Piazza, you can use private messaging on Piazza.
- Anytime you want to send a private message to just me on Piazza, please make sure to add our head TAs too in case I miss your message.
- Piazza GOOD questions:
- I don't understand this part of the lecture, can you explain it to me?
- This certain part of the hw is not clear to me, would it be possible to explain that more?
- I have a question about the project ...
- I found an issue on the website, hw or the lectures, can you clarify ...
- Any feedback, suggestions, ... would be greatly appreciated.
- Usually, most of the questions are good.
- Piazza BAD questions:
- Can you debug my code? [Our team will not do that. You need to be specific about your question.]
- Can you find where the problem is in my code?
- Exceptional circumstances: Any request for exceptions to these policies should be made in advance when at all possible. Requests should be due to incapacitating illness, personal emergencies, or similarly serious events. Your request should be accompanied by a supporting letter issued by the Dean of Students.
Diversity and inclusion
Just as machine learning algorithms cannot accomplish complex tasks if trained on datasets of limited variability, our course cannot be successful without appreciating the diversity of our students. In this class we aim to create an environment where all voices are valued, respecting the diversity of gender, sexuality, age, socioeconomic status, ability, ethnicity, race, and culture. We always welcome suggestions that can help us achieve this goal. Additionally, if any of our scheduled class activities conflict with religious events, please inform the instruction team so that we can make appropriate arrangements for you.
Students with disabilities: your access to this course is extremely important to us. The institute has policies regarding disability accommodation, which are administered through the Office of Disability Services: http://disabilityservices.gatech.edu. Please request your accommodation letter as early in the semester as possible, so that we have adequate time to arrange your approved academic accommodations.
Office hours and questions
The BlueJeans links and Microsoft Forms link for each office hour are provided on the calendar. Please register in the form at the start of the office hour, and we will invite you into a private breakout session for consultation. You will need to log in with your BlueJeans ID. Each student can be advised for only 10 minutes. If you need a longer consultation, please let the TAs know, and they can attempt to help you once they have completed their appointments with other students.
Grading
Assignments (50%)
- There will be four assignments. Each one is designed to improve and test your understanding of the materials. Assignments will have both programming and written analysis components.
- You will need to submit all your assignments using Gradescope. Instructions on how to submit your code and written portions will follow with every assignment. Handwritten solutions WILL NOT BE ACCEPTED and you will not receive credit for a handwritten submission.
- You are required to use Markdown, LaTeX, or word processing software to generate your solutions to the written questions, because handwritten solutions WILL NOT BE ACCEPTED OR GRADED. (For additional resources, watch the LaTeX tutorial created for the previous version of the course and the Overleaf LaTeX example in the video.)
- All students are expected to follow the Georgia Tech Academic Honor Code.
- You can easily export your Jupyter Notebook to a Python file and import that into your preferred Python IDE to debug your code for assignments.
- You are NOT allowed to share any assignment code or answers with other students. Piazza is the best place to have discussions regarding assignments and course topics. Discussions are only for better understanding of the questions and should not directly answer them.
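For the notebook export mentioned in the list above, the usual route is the command `jupyter nbconvert --to script` followed by the notebook file name. To show what such an export amounts to, here is a minimal standard-library sketch, assuming the notebook is stored in the common JSON layout with a top-level `cells` list. The function name `notebook_to_script` and the demo notebook are hypothetical, and the sketch does not handle IPython magics such as `%matplotlib`.

```python
import json

def notebook_to_script(notebook_json):
    """Concatenate the code cells of a notebook (given as a JSON string) into one script."""
    notebook = json.loads(notebook_json)
    chunks = []
    for cell in notebook.get("cells", []):
        # Skip markdown and raw cells; only code cells belong in the script.
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", "")
        # The source may be stored as one string or as a list of lines.
        if isinstance(source, list):
            source = "".join(source)
        chunks.append(source)
    return "\n\n".join(chunks) + "\n"

# Hypothetical two-cell notebook used only to demonstrate the function.
demo = json.dumps({
    "cells": [
        {"cell_type": "markdown", "source": ["# Title"]},
        {"cell_type": "code", "source": ["x = 1\n", "print(x)"]},
    ],
    "nbformat": 4,
})

print(notebook_to_script(demo))
```

For a real notebook, you would read the `.ipynb` file's text and pass it to the function, then save the returned string as a `.py` file.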
Project (35%)
Touchpoint 1 and the Project Proposal (12%)
- A project proposal should be written on your GitHub page. It is also a good starting point for the first draft of your project.
- You need to provide us the link to your GitHub page. Make sure your GitHub repository is private.
- It should be fewer than 500 words, single-spaced. References do not count toward the word limit.
- A project proposal should include:
- Abstract
- Introduction/Background
- Problem definition
- Methods
- Metrics
- Potential results
- Discussion
- At least three peer-reviewed references. You need to properly cite the references in your proposal.
- This touchpoint is a checkpoint to make sure you are working on a proper machine learning project.
- Your group needs to submit a presentation of your proposal. Please provide us with a public link to a 3-minute recorded video. I found that OBS Studio and GT-subscribed Kaltura are good tools to record your screen. Please make sure your visuals are clearly visible in your video presentation. This includes readable labels for figures and axes.
Touchpoint 2 and the Midterm deliverables (8%)
- A checkpoint to make sure that you have made major progress in your project. You will add information to your project proposal and turn it into your midterm report.
- You need to provide us the link to your GitHub page. Make sure your GitHub repository is private.
- The midterm report does not have a word count limitation.
- A project midterm report is quite similar to your proposal with the exception of having actual results instead of potential ones:
- Abstract
- Introduction/Background
- Problem definition
- Data collection
- Methods
- Metrics
- Results
- Discussion
- We need to see where you obtain your data and how you have done your data cleaning. Make sure to talk about the features and different feature selection approaches.
- You need to submit a 3-minute video for the midterm report.
Touchpoint 3 and the Final report (15%)
- You need to provide us the link to your GitHub page. Make sure your GitHub repository is private.
- The final report should use the NeurIPS style files and template. A final report should be between 4 and 8 pages long and include:
- Abstract
- Introduction/Background
- Problem definition
- Data collection
- Methods
- Metrics
- Results
- Discussion
- Ethics Statement
- References
- Your group needs to submit a presentation of your final report. Please provide us with a public link to a 7-minute recorded video. I found that OBS Studio and GT-subscribed Kaltura are good tools to record your screen. Please make sure your visuals are clearly visible in your video presentation.
General project guidance
- Your project will be graded based on the following criteria in the reports, code, and touchpoints. Try to answer these questions clearly in your reports and presentations to receive full points:
- Was the project motivation clear in the abstract and introduction?
- What is the problem?
- Are the inputs and outputs clearly defined in the problem statement?
- Why is it important and why should we care?
- What have others tried? Why was it not sufficient?
- How did you get your dataset?
- What are its characteristics (e.g., number of features, number of records, temporal or not, etc.)?
- Did you perform any cleaning or processing on your data?
- What methods are you using to solve this problem?
- Is this a novel approach? Did you borrow it from another paper?
- Do you have a baseline approach? A naive method to solve the problem which can serve as a comparison to your approach.
- How did you evaluate your approach? What are your metrics? Are they useful?
- What are the results? Are they visual or numeric?
- How do you compare your method to other methods? Are the comparisons to the baseline or competing methods fair?
- What did we learn from your experiment?
- Why should we care?
- What are the ethical implications of this work? Are there ways to mitigate its undesirable effects?
- Finished on time?
- Effective visualizations? (Are they relevant? Do they help you better understand the project's approaches and ideas?)
- Use of text (Succinct or verbose?)
- Can a TA run your code? Are there instructions on your GitHub, and are they accurate for running the code, even if only for a single epoch to produce a result?
- Is the code readable? Can a TA understand your code from its comments and structure? Is the code doing what you claim it is doing?
- Can someone else use this code?
- In order for you to obtain hands-on experience applying the topics covered in this course, you are expected to complete a term project utilizing real-world data.
- Each project needs to be completed in a team of five people. You will form your team on your own; if you cannot find a team, we will randomly assign you to one. We also accept a team of four if you really cannot find a fifth team member. Team members need to clearly state their contributions in the project report. Once your team has been formed and you have selected a topic, you will be assigned a mentor, who will provide general guidance on your project. It is important to note that your team will lead the project effort: obtaining the data, researching data-driven approaches to accomplish your project goal, and coordinating your own activities. The role of the mentor is solely to advise you should you find yourself stuck and unable to make progress.
- You will create a GitHub page for your project, which you will use to publish your main deliverables. There will be three deliverables published to your GitHub: a proposal, a midterm checkpoint, and a final report.
- Seminars: To help you conduct your project successfully, we have project seminars in which a TA will present their research or industrial projects. Through these, you will gain a good sense of what is being done in both academia and industry. We will have a Piazza post for each seminar, where students can ask general questions about the TA's project. Seminars will be recorded and published on Piazza. Please ensure that you watch at least two seminars and familiarize yourself with practical, real-world applications of ML. Information for each seminar will be updated on the course website with a BlueJeans event link.
- Google Colaboratory allows free access to run your Jupyter Notebook. I strongly suggest you use it for your project, especially for teams that are going to employ deep learning. Don't forget to take advantage of Google Cloud Platform and AWS Educate as well.
Quizzes (10%)
- There will be 13 quizzes throughout the semester.
- We will consider your top 9 quiz scores. With quizzes worth 10% of the final grade, each counted quiz is worth ~1.1% of your final score.
- You must take all quizzes, even those that do not end up counting toward your final grade.
- The topic of each quiz will coincide roughly with the content covered in class that week.
- Quizzes will have a duration of seven minutes. Each quiz will have five multiple-choice questions. They will be available from 12:00 am AoE Thursday until 11:59 pm AoE Thursday.
- Quizzes measure your understanding of the topics and will consist mostly of conceptual questions.
- Quiz answers will be released on Saturdays. Please do not ask questions on Piazza about a quiz you have just taken before we release the answers.
- Quiz questions are selected randomly from our question bank, which means that students will not receive the same questions for their quiz.
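The top-9-of-13 rule above can be illustrated with a short calculation. This is a minimal sketch, assuming each counted quiz is weighted equally and that a missed quiz simply scores zero; the syllabus does not spell out either detail, and the function name `quiz_component` is hypothetical.

```python
def quiz_component(scores, counted=9, weight=10.0):
    """Return the quiz contribution to the final grade, in percentage points.

    scores:  one score per quiz, each a fraction between 0.0 and 1.0
    counted: how many of the best scores are kept (9 in this course)
    weight:  total weight of quizzes in the final grade (10% in this course)
    """
    best = sorted(scores, reverse=True)[:counted]
    # Assumption: the kept quizzes share the total weight equally.
    return weight * sum(best) / counted

# Hypothetical student: nine perfect quizzes and four scored as zero.
scores = [1.0] * 9 + [0.0] * 4
print(quiz_component(scores))  # 10.0
```

Under these assumptions, the four lowest scores have no effect on the quiz component, which is why a few weak quizzes do not lower the grade.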
Class participation (5%)
- Piazza provides statistics that measure how much a student has been involved in Piazza activities, such as viewing posts, answering questions well, and asking good questions. We use these statistics to determine your class participation score. Beyond that, students will be invited to participate in the online class session as a co-host, and this participation will also count toward the class participation score. At the end of the semester, we will define a minimum and maximum level of involvement across all students, and your grade will be determined relative to that range.
Bonus points (up to 6%)
- About bonus points: Bonus points are always counted in a way that benefits your final grade. More information on bonus points for assignments will be provided as the semester progresses. If it becomes necessary to curve grades, bonus points will be applied after curving, not before.
- You can obtain up to 6% in bonus points by answering the challenging questions that some of the hws may include.
- How does it work? For example, hw 1 may have 30 bonus points, hw 2 may have 20 bonus points and so on. If you receive all the bonus points for all your hws, we will add 6% to your final grade.
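One way to read the bonus rule above is as a proportional scale. The sketch below assumes that partial bonus credit scales linearly with the share of bonus points earned; the syllabus only states the full-credit case (all bonus points give 6%), so the linear rule, the function name `bonus_percent`, and the sample scores are assumptions for illustration.

```python
def bonus_percent(earned, available, max_bonus=6.0):
    """Return the bonus added to the final grade, in percentage points.

    earned:    bonus points earned on each hw
    available: bonus points offered on each hw
    """
    total_available = sum(available)
    if total_available == 0:
        return 0.0
    # Assumption: bonus scales linearly with the fraction of bonus points earned.
    return max_bonus * sum(earned) / total_available

# hw 1 offers 30 bonus points and hw 2 offers 20, as in the example above.
print(bonus_percent([30, 20], [30, 20]))  # 6.0 (all bonus points earned)
print(bonus_percent([15, 10], [30, 20]))  # 3.0 (half earned, under the linear assumption)
```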
COVID-19 Policy
The semester can be especially challenging due to the Covid-19 pandemic and a growing awareness of racial inequities. The following information relates to specific services and guidelines for courses during this semester. The most up-to-date information on Covid-19 is on the TECH Moving Forward website and in the Academic Restart Frequently Asked Questions.
Expectations and Guidelines
Each of us has a responsibility to ourselves and our fellow Yellow Jackets to be mindful of our shared commitment.
- We are all required to wear a face covering while inside any campus facilities/buildings, including during in-person classes, and to adhere to social distancing of at least 6 feet. If an individual forgets to bring a face covering to class or into any indoor space, there will be a clearly marked supply of these in each building. If a student fails to follow Georgia Tech’s policies on social distancing and face coverings, they will initially be reminded of the policy and, if necessary, asked to leave the class, meeting, or space. If they still fail to follow the policy, they may be referred to the Office of the Dean of Students. See the information on the Institute’s policy on face coverings.
- Students are expected to sit in assigned seats and to come to class only on days that are assigned to them.
- Papers, projects, tests, homework, and other assignments will only be accepted in electronic form unless the assignment is a physical artifact.
Additional information is available in the Student Guidebook.
Instructor Illness or Exposure to Covid-19
During the Spring 2021 semester, some faculty members may be required to quarantine due to exposure or isolate due to a Covid-19 diagnosis. Some disruption to classes or services is inevitable, but Georgia Tech is making every effort to ensure continuity of operations. As is the case in any semester, faculty may cancel a class if they have an illness or emergency situation and cover any missed material at their own discretion. If an instructor needs to cancel a class, they should notify students as early as possible.
Faculty who are staying home due to symptoms should monitor their health closely and consult with their school chair to determine if remote instruction or substitute instruction is most appropriate for the course. If they need to cancel a class repeatedly, a backup will be supplied in the form of a temporary substitute instructor or asynchronous work. No course will be canceled after the first class has occurred.
Student Illness or Exposure to Covid-19
During the semester, you may be required to quarantine or self-isolate to avoid the risk of infection to others. Quarantine is the separation of those who have been exposed to someone with Covid-19 but who are not ill; isolation is the separation of those who have tested positive for Covid-19 or been diagnosed with Covid-19 by symptoms.
If you have not tested positive but are ill or have been exposed to someone who is ill, please follow the Covid-19 Exposure Decision Tree for reporting your illness.
During the quarantine or isolation period you may feel completely well, ill but able to work as usual, or too ill to work until you recover.
Remote courses and remote class sessions during hybrid courses: unless you are too ill to work, you should be able to complete your remote work while in quarantine or isolation.
If you are ill and unable to do course work this will be treated similarly to any student illness. The Dean of Students will have been contacted when you report your positive test or are told that it is necessary to quarantine and will notify your instructor that you may be unable to attend class events or finish your work as the result of a health issue. Your instructor will not be told the reason. We have asked all faculty to be lenient and understanding when setting work deadlines or expecting students to finish work, and so you should be able to catch up with any work that you miss while in quarantine or isolation. Your instructor may make available any video recordings of classes or slides that have been used while you are absent, and may prepare some complementary asynchronous assignments that compensate for your inability to participate in class sessions. Ask your instructor for the details.
CARE Center, Counseling Center, Stamps Health Services, and the Student Center
These uncertain times can be difficult, and many students may need help in dealing with stress and mental health. The CARE Center, the Counseling Center, and Stamps Health Services will offer both in-person and virtual appointments. Face-to-face appointments will require wearing a face covering and social distancing, with exceptions for medical examinations. Student Center services and operations are available on the Student Center website. For more information on these and other student services, contact the Vice President and Dean of Students or the Division of Student Life.
Accommodations for Students at Higher Risk for Severe Illness with Covid-19
Students may request an accommodation through the Office of Disability Services (ODS) due to 1) presence of a condition as defined by the Americans with Disabilities Act (ADA), or 2) identification as an individual of higher risk for Covid-19, as defined by the Centers for Disease Control (CDC). Registering with ODS is a 3-step process that includes completing an application, uploading documentation related to the accommodation request, and scheduling an appointment for an “intake meeting” (either in person or via phone or video conference) with a disability coordinator.
If you have been approved by ODS for an accommodation, I will work closely with you to understand your needs and make a good faith effort to investigate whether or not requested accommodations are possible for this course. If the accommodation request results in a fundamental alteration of the stated learning outcome of this course, ODS, academic advisors, and the school offering the course will work with you to find a suitable alternative that as far as possible preserves your progress toward graduation.
Resources
No textbook will be required for this course; however, you are strongly encouraged to complete the readings indicated for each class. You may also find the following books very helpful:
- Learning From Data, by Yaser S. Abu-Mostafa
- Pattern Recognition and Machine Learning, by Christopher Bishop
- Machine Learning, by Tom Mitchell
- Data Mining: Concepts and Techniques, by Jiawei Han, Micheline Kamber, and Jian Pei
- The Elements of Statistical Learning, by Trevor Hastie, Robert Tibshirani, and Jerome Friedman
- Deep Learning, by Ian Goodfellow, Yoshua Bengio, and Aaron Courville
Other resources, such as machine learning toolboxes and datasets, will be provided throughout the course.
Dataset Ideas (may need API, or scraping) - Thanks to Polo and everyone who contributed with suggestions to these datasets
- Google Dataset Search
- Google public datasets.
- Kaggle public datasets
- Awesome Public Datasets.
- NYC Taxi data for 2013: Trip Data (11.0GB), Fare Data (7.7GB), and a visualization of a day's trips.
- Large datasets publicly available.
- Georgia Tech's campus data (has APIs): bus info, directory, building, T-square, room reservation, building facilities usage (e.g., electricity, lights, A/C, etc.), Oscar/course info/registration, etc.
- Yahoo WebScope
- Data.gov: U.S. Government's open data
- IPEDS data: Postsecondary education data from National Center for Education Statistics
- Bureau of Labor Statistics data
- Uber data: Anonymized data from over 2 billion trips
- Freebase
- Yelp
- Microsoft Academic Graph
- Numerous APIs from Google (e.g., Maps, Freebase, YouTube, etc.)
- Zillow: real estate listing site
- Numerous graph datasets (large and small): SNAP, Konect
- Movies data: IMDB
- List of lists of datasets for recommendations.
- Million song dataset by Echo Nest.
It contains not only basic information about songs (artist, genre, year, length, etc.) but also some musical features (like tempo, pitch, key, brightness).
- Dataset about soccer games, players, clubs.
No API, but easy to scrape.
For a soccer player: transfer history, performance, nationality, birth date, etc.
For a soccer club: performance, squad, etc.
- The Free 'Big Data' Sources Everyone Should Know
- Quandl - a dataset search engine for time-series data.
- UCI also has a collection of links to various datasets sorted for various tasks (Classification, Regression, etc.)
- Amazon AWS Public Data Sets
- KDD Cup: annual competition in data mining, like Kaggle
- Academic domain: Microsoft Academic Search, DBLP
- Retrosheet: MLB statistics (Game/Play logs)
- Classification datasets
- Various geophysical datasets for the oceans (magnetism, gravity, seismology, etc.).
- Social trends
- Beer data (website offline; an older version is available at web.archive.org)
- Academic torrents (terabytes)
- Article Search API from the New York Times (all the way back to 1851!)
- Civil Engineering Dataset
- Kayak: flight, hotel, car, etc.
- Data Science Initiative - Microsoft Research has various datasets and access to tools that can aid in data science research