CMPT 732 Lecture Notes

  1. Course Introduction [“Course Introduction” slides]
    1. Us [Us slides]
    2. Welcome [Welcome slides]
    3. This Course [This Course slides]
    4. What is Big Data? [What is Big Data? slides]
    5. How big is “Big Data”? [How big is “Big Data”? slides]
    6. “Big data” isn't always big. [“Big data” isn't always big. slides]
    7. None [None slides]
    8. Clusters [Clusters slides]
    9. Hadoop [Hadoop slides]
    10. Our Environment [Our Environment slides]
    11. Things you will do [Things you will do slides]
    12. Lecture and Labs [Lecture and Labs slides]
    13. Course Topics [Course Topics slides]
    14. Expectations [Expectations slides]
  2. Hadoop Concepts [“Hadoop Concepts” slides]
    1. Our Cluster [Our Cluster slides]
    2. Hadoop Pieces [Hadoop Pieces slides]
    3. HDFS [HDFS slides]
    4. YARN [YARN slides]
    5. (Simplified) Cluster Overview [(Simplified) Cluster Overview slides]
    6. Work on Hadoop [Work on Hadoop slides]
    7. MapReduce [MapReduce slides]
    8. MapReduce Stages [MapReduce Stages slides]
    9. Example: word count [Example: word count slides]
    10. Whiteboard: Fall 2018 [Whiteboard: Fall 2018 slides]
    11. MapReduce Anatomy [MapReduce Anatomy slides]
    12. Hadoop MapReduce Details [Hadoop MapReduce Details slides]
    13. Summary Output [Summary Output slides]
    14. MapReduce Parallelism [MapReduce Parallelism slides]
    15. Writables [Writables slides]
    16. Example: word count [Example: word count slides]
    17. About MapReduce [About MapReduce slides]
    18. MapReduce: One more way [MapReduce: One more way slides]
    19. MapReduce Data Flow [MapReduce Data Flow slides]
  3. Python Preliminaries [“Python Preliminaries” slides]
    1. About Python [About Python slides]
    2. Data Types [Data Types slides]
    3. Unpacking Tuples [Unpacking Tuples slides]
    4. First-Class Functions [First-Class Functions slides]
    5. Lambda Functions [Lambda Functions slides]
    6. Iterators and Generators [Iterators and Generators slides]
    7. Imperative vs declarative [Imperative vs declarative slides]
  4. Spark Concepts [“Spark Concepts” slides]
    1. Spark [Spark slides]
    2. An Example [An Example slides]
    3. RDDs [RDDs slides]
    4. RDD Operations [RDD Operations slides]
    5. Operations and Partitions [Operations and Partitions slides]
    6. Partitions [Partitions slides]
    7. Lazy Evaluation [Lazy Evaluation slides]
    8. Chaining Calculations [Chaining Calculations slides]
    9. Combining Calculations [Combining Calculations slides]
    10. Shuffle Operations [Shuffle Operations slides]
    11. Drivers & Executors [Drivers & Executors slides]
    12. Controlling Executors [Controlling Executors slides]
    13. Spark Web Frontend [Spark Web Frontend slides]
    14. Spark vs MapReduce [Spark vs MapReduce slides]
    15. Spark DAG [Spark DAG slides]
    16. Stages [Stages slides]
    17. Job, Stages, Tasks [Job, Stages, Tasks slides]
    18. RDD Methods [RDD Methods slides]
  5. Spark DataFrames Concepts [“Spark DataFrames Concepts” slides]
    1. Working With Data [Working With Data slides]
    2. Spark DataFrames [Spark DataFrames slides]
    3. DataFrames [DataFrames slides]
    4. Column Expressions [Column Expressions slides]
    5. Limitations [Limitations slides]
    6. UDFs [UDFs slides]
    7. Python ↔ JVM [Python ↔ JVM slides]
    8. SQL Syntax [SQL Syntax slides]
    9. The Optimizer [The Optimizer slides]
    10. Input/Output [Input/Output slides]
    11. Parquet [Parquet slides]
    12. Partitioning [Partitioning slides]
  6. NoSQL & Cassandra Concepts [“NoSQL & Cassandra Concepts” slides]
    1. The Problem [The Problem slides]
    2. Some DB Operations [Some DB Operations slides]
    3. Non-Relational Databases [Non-Relational Databases slides]
    4. NoSQL Limitations [NoSQL Limitations slides]
    5. CAP “Theorem” [CAP “Theorem” slides]
    6. NoSQL + CAP [NoSQL + CAP slides]
    7. NoSQL Categories [NoSQL Categories slides]
    8. NewSQL [NewSQL slides]
    9. Cassandra [Cassandra slides]
    10. Cassandra Data Model [Cassandra Data Model slides]
    11. CQL [CQL slides]
    12. Fault Tolerance [Fault Tolerance slides]
    13. Consistency [Consistency slides]
    14. Relational Data [Relational Data slides]
    15. Denormalizing Data [Denormalizing Data slides]
    16. Idempotence [Idempotence slides]
    17. Pure Functions [Pure Functions slides]
  7. Data Management [“Data Management” slides]
    1. The V's [The V's slides]
    2. OLAP vs OLTP [OLAP vs OLTP slides]
    3. Extract-Transform-Load [Extract-Transform-Load slides]
    4. Data Warehousing [Data Warehousing slides]
  8. Small Data [“Small Data” slides]
    1. Spark for ETL [Spark for ETL slides]
    2. Python Data Tools [Python Data Tools slides]
    3. NumPy [NumPy slides]
    4. Pandas [Pandas slides]
    5. Pandas & Spark [Pandas & Spark slides]
    6. SciPy [SciPy slides]
    7. SciKit-Learn [SciKit-Learn slides]
    8. Python Libraries [Python Libraries slides]
  9. NumPy/Pandas Speed [“NumPy/Pandas Speed” slides]
    1. Why So Slow? [Why So Slow? slides]
    2. NumPy Expression [NumPy Expression slides]
    3. Applying to a Series [Applying to a Series slides]
    4. Vectorizing [Vectorizing slides]
    5. Applying By Row [Applying By Row slides]
    6. Using Python [Using Python slides]
    7. With numexpr [With numexpr slides]
    8. Summary [Summary slides]
  10. Spark Streaming [“Spark Streaming” slides]
    1. The Purpose [The Purpose slides]
    2. Options [Options slides]
    3. RDD-Based: The Idea [RDD-Based: The Idea slides]
    4. DStreams [DStreams slides]
    5. Structured Streaming [Structured Streaming slides]
  11. Spark Machine Learning [“Spark Machine Learning” slides]
    1. Recap: Machine Learning [Recap: Machine Learning slides]
    2. Spark ML [Spark ML slides]
    3. Pipelines [Pipelines slides]
    4. Models [Models slides]
    5. Evaluation [Evaluation slides]
    6. ML Algorithms [ML Algorithms slides]
    7. More Topics [More Topics slides]
  12. Why MapReduce? [“Why MapReduce?” slides]
    1. MapReduce History [MapReduce History slides]
    2. Fault Tolerance [Fault Tolerance slides]
    3. Where Your Data Might Be [Where Your Data Might Be slides]
  13. Hadoop/Spark Config [“Hadoop/Spark Config” slides]
    1. Config Objects [Config Objects slides]
    2. The Command Line [The Command Line slides]
    3. In Code [In Code slides]
    4. Config Options [Config Options slides]
    5. Spark Context/Session [Spark Context/Session slides]
    6. Filesystems [Filesystems slides]
    7. Spark 2.4 [Spark 2.4 slides]
  14. Other Big Data Tools [“Other Big Data Tools” slides]
    1. A Look Back [A Look Back slides]
    2. What Else Is There? [What Else Is There? slides]
    3. The Plan… [The Plan… slides]
    4. Part 1: We Have Seen An Example [Part 1: We Have Seen An Example slides]
    5. Doing Computation [Doing Computation slides]
    6. Expressing Computation [Expressing Computation slides]
    7. Data Warehousing [Data Warehousing slides]
    8. Storing Files [Storing Files slides]
    9. Databases [Databases slides]
    10. Serialization/Storage [Serialization/Storage slides]
    11. Streaming [Streaming slides]
    12. ML Libraries [ML Libraries slides]
    13. Part 2: New (to us) Categories [Part 2: New (to us) Categories slides]
    14. Visualization [Visualization slides]
    15. Extract-Transform-Load [Extract-Transform-Load slides]
    16. Message Queues [Message Queues slides]
    17. Task Queues [Task Queues slides]
    18. Text Search [Text Search slides]
    19. Hadoop Distributions [Hadoop Distributions slides]
    20. Learning More [Learning More slides]
  15. Deploying Hadoop [“Deploying Hadoop” slides]
    1. Moving Parts [Moving Parts slides]
    2. Our Cluster [Our Cluster slides]
    3. Example Configurations [Example Configurations slides]
    4. Hardware [Hardware slides]
    5. Capacity Planning [Capacity Planning slides]
    6. Decisions to Make [Decisions to Make slides]
    7. Hadoop Distributions [Hadoop Distributions slides]
    8. Cluster Demo [Cluster Demo slides]

Course home page.

Schedule

Week  Date     Starting Point
1     Sept 4   Intro
2     Sept 11  MapReduce Anatomy
3     Sept 18  Spark RDD Operations
4     Sept 25  Spark: Controlling Executors
5     Oct 2    DataFrames: Column Expressions
6     Oct 9    no lecture
7     Oct 16   DataFrames: Parquet, NoSQL & Cassandra
8     Oct 23   Data Management
9     Oct 30   Spark Streaming
10    Nov 6    Why MapReduce?
11    Nov 13   no lecture
12    Nov 20   Other Big Data Tools, Spark 2.4
13    Nov 27   Other Big Data: Task Queues