CPSC 436C Cloud Computing for Data Science
Fall 2023
COURSE DESCRIPTION
This course is an introduction to cloud computing for students who wish to use the cloud for data science applications. It covers how cloud computing supports data science workflows, including data storage, processing, analysis, and visualization, as well as security considerations for the entire pipeline. Overall, the course gives students the skills and knowledge to use cloud computing effectively to design, implement, test, and deploy data science applications.
LECTURES & CLASSROOMS
Monday - 11:00 am to 12:00 pm, ICCS Room X150 - Demco Table 1, Project Room - ICCS X239
Friday - 3:00 pm to 4:00 pm, ICCS Room X150 - Demco Table 1, Project Room - ICCS X239
TEXTBOOKS
The course will rely mainly on the following textbook.
- Learning Spark: Lightning-Fast Data Analytics, by Jules Damji, Brooke Wenig, and Tathagata Das
Topics
- Cloud service delivery models
- Cloud storage systems
- Batch processing
- Stream processing
- Cloud security
Team
SYLLABUS
Download the syllabus (v1.0)
HANDOUT
Lecture 1
Introduction to Datacentres and Cloud [SLIDES]
Lecture 3
Virtualization [SLIDES]
Lecture 4
Big Data [SLIDES]
Lecture 6
Data Management Systems [SLIDES]
Lecture 8
Structured Data Processing - Machine Learning [SLIDES]
Lecture 9
Distributed Machine Learning [SLIDES]
Lecture 13
Guest Speaker - Advanced Topics [SLIDES]
Assignments
- Assignment 0: Go Serverless (5%); [AWS] [Azure] [Rubric]
- Assignment 1: Containerization vs. Serverless (5%); [AWS] [Azure] [Rubric]
- Assignment 2: Running Image Recognition on a Virtual Machine (5%); [AWS] [Azure] [Rubric]
- Assignment 3: Running Image Recognition in a VM Using an Object Store (5%); [AWS] [Azure] [Rubric]
- Assignment 4: Building a Machine Learning Pipeline through Jupyter Notebook (5%); [AWS] [Azure] [Rubric]
- Assignment 5: Comparing Single Node vs Cluster Performance and Cost in Image Classification Using Amazon EMR (15%); [AWS] [Azure] [Rubric]
- Assignment 6: Streaming Text Analysis using Spark; [AWS] [Azure] [Rubric]
Tutorials
Resources
Online resources:
- A Survey on the Evolution of Stream Processing Systems
- Bigtable: A Distributed Storage System for Structured Data
- Clipper: A Low-Latency Online Prediction Serving System
- Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics
- The Snowflake Elastic Data Warehouse
- The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines
- Dominant Resource Fairness: Fair Allocation of Multiple Resource Types
- Dynamo: Amazon's Highly Available Key-value Store
- Apache Flink™: Stream and Batch Processing in a Single Engine
- GraphX: Graph Processing in a Distributed Dataflow Framework
- The Hadoop Distributed File System
- Twitter Heron: Stream Processing at Scale
- Hive - A Warehousing Solution Over a Map-Reduce Framework
- Cassandra - A Decentralized Structured Storage System
- MapReduce: Simplified Data Processing on Large Clusters
- Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center
- Delta Lake: High-Performance ACID Table Storage over Cloud Object Stores
- Scaling Distributed Machine Learning with the Parameter Server
- PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs
- Pregel: A System for Large-Scale Graph Processing
- PyTorch Distributed: Experiences on Accelerating Data Parallel Training
- Ray: A Distributed Framework for Emerging AI Applications
- Locking the sky: a survey on IaaS cloud security
- Discretized Streams: Fault-Tolerant Streaming Computation at Scale
- Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing
- Spark SQL: Relational Data Processing in Spark
- The Deep Learning Compiler: A Comprehensive Survey
- TVM: An Automated End-to-End Optimizing Compiler for Deep Learning