For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.
Lecture 1
Introduction and Document Distance
6.006 Spring 2008
Lecture 1: Introduction and the Document Distance Problem
Course Overview
• Efficient procedures for solving problems on large inputs (Ex: entire works of Shakespeare, human genome, U.S. Highway map)
• Scalability
• Classic data structures and elementary algorithms (CLRS text)
• Real implementations in Python ⇔ Fun problem sets!
• β version of the class - feedback is welcome!
Pre-requisites
• Familiarity with Python and Discrete Mathematics
Contents
The course is divided into 7 modules, each of which has a motivating problem and problem set (except for the last module). Modules and motivating problems are as described below:
1. Linked Data Structures: Document Distance (DD)
2. Hashing: DD, Genome Comparison
3. Sorting: Gas Simulation
4. Search: Rubik’s Cube 2 × 2 × 2
5. Shortest Paths: Caltech → MIT
6. Dynamic Programming: Stock Market
7. Numerics: √2
Document Distance Problem
Motivation
Given two documents, how similar are they?
• Identical - easy?
• Modified or related (Ex: DNA, Plagiarism, Authorship)
• Did Francis Bacon write Shakespeare’s plays?

To answer the above, we need to define practical metrics. Metrics are defined in terms of word frequencies.

Definitions
1. Word: Sequence of alphanumeric characters. For example, the phrase “6.006 is fun” has 4 words (6, 006, is, fun), because the period is not alphanumeric and so splits “6.006” in two.
2. Word Frequencies: Word frequency D(w) of a given word w is the number of times it occurs in a document D. For example, the words and word frequencies for the above phrase are as below:

Count : 1  0    1   1    0     1
Word  : 6  the  is  006  easy  fun

In practice, while counting, it is easy to choose some canonical ordering of words.
3. Distance...
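The word and word-frequency definitions above can be sketched in Python roughly as follows. This is a minimal illustration, not the course's reference code: the function names `get_words`, `word_frequencies`, and `frequency_vector` are hypothetical, and "alphanumeric" is assumed here to mean ASCII letters and digits, with no case folding.

```python
import re
from collections import Counter


def get_words(text):
    """Split text into words: maximal runs of alphanumeric characters.

    Assumption: "alphanumeric" means ASCII letters and digits only.
    """
    return re.findall(r"[A-Za-z0-9]+", text)


def word_frequencies(text):
    """Return a mapping D(w): word -> number of occurrences in text."""
    return Counter(get_words(text))


def frequency_vector(text, vocabulary):
    """List D(w) for each word w in a fixed, canonical ordering of words."""
    freq = word_frequencies(text)
    # A Counter returns 0 for words that never occur in the document.
    return [freq[w] for w in vocabulary]


phrase = "6.006 is fun"
words = get_words(phrase)
vector = frequency_vector(phrase, ["6", "the", "is", "006", "easy", "fun"])
print(words)
print(vector)
```

With the vocabulary ordered as in the table above, `vector` lines up with the Count row. Fixing one ordering of words is what allows two documents to be compared position by position.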