CS 246: Mining Massive Data Sets
Distributed file systems: Hadoop, map-reduce; PageRank, topic-sensitive PageRank, spam detection, hubs-and-authorities; similarity search; shingling, minhashing, random hyperplanes, locality-sensitive hashing; analysis of social-network graphs; association rules; dimensionality reduction: UV, SVD, and CUR decompositions; algorithms for very-large-scale mining: clustering, nearest-neighbor search, gradient descent, support-vector machines, classification, and regression; submodular function optimization. Prerequisites: At least one of CS107 or CS145; at least one of CS109 or STAT116, or equivalent.
Terms: Win | Units: 3-4
Instructors: Leskovec, J. (PI)
CS 246H: Mining Massive Data Sets Hadoop Lab
Supplement to CS 246 providing additional material on Hadoop. Students will learn how to implement data mining algorithms using Hadoop, how to implement and debug complex MapReduce jobs in Hadoop, and how to use some of the tools in the Hadoop ecosystem for data mining and machine learning. Topics: Hadoop, MapReduce, HDFS, combiners, secondary sort, distributed cache, SQL on Hadoop, Hive, Cloudera ML/Oryx, Mahout, Hadoop streaming, implementing Hadoop jobs, debugging Hadoop jobs, TF-IDF, Pig, Sqoop, Oozie, HBase, Impala. Prerequisite: CS 107 or equivalent.
Terms: Win | Units: 1
Instructors: Leskovec, J. (PI); Templeton, D. (PI)