CS 246: Mining Massive Data Sets

The availability of massive datasets is revolutionizing science and industry. This course discusses data mining and machine learning algorithms for analyzing very large amounts of data. Topics include: big data systems (Hadoop, Spark); link analysis (PageRank, spam detection); similarity search (locality-sensitive hashing, shingling, minhashing, random hyperplanes); stream data processing; analysis of social-network graphs; association rules; dimensionality reduction (UV, SVD, and CUR decompositions); algorithms for very-large-scale mining (clustering, nearest-neighbor search); large-scale machine learning (gradient descent, decision tree ensembles); multi-armed bandits; computational advertising. We also offer a sister class, CS 246H (Hadoop Labs), and a follow-up project-based class, CS 341 (Project in Mining Massive Datasets). Prerequisites: At least one of CS 107 or CS 145.
Terms: Win | Units: 3-4 | Grading: Letter or Credit/No Credit
Instructors: Leskovec, J. (PI)

CS 246H: Mining Massive Data Sets Hadoop Lab

Supplement to CS 246 providing additional material on the Apache Hadoop family of technologies. Students will learn how to implement data mining algorithms using Hadoop and Apache Spark, how to implement and debug complex data mining and data transformation tasks, and how to use two of the most popular big data SQL tools. Topics: data mining, machine learning, data ingest, and data transformations using Hadoop, Spark, Apache Impala, Apache Hive, Apache Kafka, Apache Sqoop, Apache Flume, Apache Avro, and Apache Parquet. Prerequisite: CS 107 or equivalent.
Terms: Win | Units: 1 | Grading: Satisfactory/No Credit
© Stanford University