# India

• Date Submitted: 09/25/2011 03:48 AM
MIT OpenCourseWare http://ocw.mit.edu

6.006 Introduction to Algorithms
Spring 2008

Lecture 1

Introduction and Document Distance

6.006 Spring 2008

Lecture 1: Introduction and the Document Distance Problem

Course Overview
• Efficient procedures for solving problems on large inputs (Ex: entire works of Shakespeare, human genome, U.S. Highway map)
• Scalability
• Classic data structures and elementary algorithms (CLRS text)
• Real implementations in Python ⇔ Fun problem sets!
• β version of the class - feedback is welcome!

Pre-requisites
• Familiarity with Python and Discrete Mathematics

Contents
The course is divided into 7 modules - each of which has a motivating problem and problem set (except for the last module). Modules and motivating problems are as described below:
1. Linked Data Structures: Document Distance (DD)
2. Hashing: DD, Genome Comparison
3. Sorting: Gas Simulation
4. Search: Rubik’s Cube 2 × 2 × 2
5. Shortest Paths: Caltech → MIT
6. Dynamic Programming: Stock Market
7. Numerics: √2

Document Distance Problem
Motivation
Given two documents, how similar are they?
• Identical - easy?
• Modified or related (Ex: DNA, Plagiarism, Authorship)

• Did Francis Bacon write Shakespeare’s plays?

To answer the above, we need to define practical metrics. Metrics are defined in terms of word frequencies.

Definitions
1. Word: Sequence of alphanumeric characters. For example, the phrase “6.006 is fun” has 4 words (“6”, “006”, “is”, “fun”), because the period is not alphanumeric and so splits “6.006” in two.
2. Word Frequencies: Word frequency D(w) of a given word w is the number of times it occurs in a document D. For example, the words and word frequencies for the above phrase are as below:

Count:  1   0    1   1    0     1
Word:   6   the  is  006  easy  fun

In practice, while counting, it is easy to choose some canonical ordering of words.
3. Distance...
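
Definitions 1 and 2 above can be sketched in Python, the course's implementation language. This is a minimal illustration and not the course's own code: the function names `get_words` and `word_frequencies` are hypothetical, "alphanumeric" is assumed to mean ASCII letters and digits, and no case folding is applied because the definition does not mention it.

```python
import re
from collections import Counter


def get_words(text):
    """Return the words of text: maximal runs of alphanumeric characters.

    Assumption: "alphanumeric" means ASCII letters and digits only.
    """
    return re.findall(r"[A-Za-z0-9]+", text)


def word_frequencies(text):
    """Map each word w to D(w), the number of times w occurs in text."""
    return Counter(get_words(text))


phrase = "6.006 is fun"
words = get_words(phrase)
freq = word_frequencies(phrase)

# Fix a canonical ordering of words, then read off the counts in that order.
# A Counter returns 0 for words that never occur in the document.
vocabulary = ["6", "the", "is", "006", "easy", "fun"]
counts = [freq[w] for w in vocabulary]

print(words)   # ['6', '006', 'is', 'fun']
print(counts)  # [1, 0, 1, 1, 0, 1]
```

Fixing the order of `vocabulary` is what turns a document into a list of counts that can be compared position by position with the counts of another document.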