cc5212-1 procesamiento masivo de datos...
TRANSCRIPT
![Page 1: CC5212-1 Procesamiento Masivo de Datos 2017aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017... · 2017-05-03 · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7:](https://reader034.vdocumento.com/reader034/viewer/2022042415/5f2fafe9ab9d887b1d76954c/html5/thumbnails/1.jpg)
CC5212-1
PROCESAMIENTO MASIVO DE DATOS
OTOÑO 2017
Lecture 7: Information Retrieval II
Aidan Hogan
![Page 2: CC5212-1 Procesamiento Masivo de Datos 2017aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017... · 2017-05-03 · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7:](https://reader034.vdocumento.com/reader034/viewer/2022042415/5f2fafe9ab9d887b1d76954c/html5/thumbnails/2.jpg)
How does Google know about the Web?
![Page 3: CC5212-1 Procesamiento Masivo de Datos 2017aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017... · 2017-05-03 · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7:](https://reader034.vdocumento.com/reader034/viewer/2022042415/5f2fafe9ab9d887b1d76954c/html5/thumbnails/3.jpg)
Inverted Index: Example
Term List Posting List
a (1,2,…)
american (1,5,…)
and (1,2,…)
by (1,2,…)
directed (1,2,…)
drama (1,16,…)
… …
Inverted index:
1
Fruitvale Station is a 2013 American drama film written and directed by Ryan Coogler.
![Page 4: CC5212-1 Procesamiento Masivo de Datos 2017aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017... · 2017-05-03 · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7:](https://reader034.vdocumento.com/reader034/viewer/2022042415/5f2fafe9ab9d887b1d76954c/html5/thumbnails/4.jpg)
• Inverted Index
– They built one so you don’t have to!
– Open Source in Java
Apache Lucene
![Page 5: CC5212-1 Procesamiento Masivo de Datos 2017aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017... · 2017-05-03 · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7:](https://reader034.vdocumento.com/reader034/viewer/2022042415/5f2fafe9ab9d887b1d76954c/html5/thumbnails/5.jpg)
• Inverted Index
– They built one so you don’t have to!
– Open Source in Java
Apache Lucene
![Page 6: CC5212-1 Procesamiento Masivo de Datos 2017aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017... · 2017-05-03 · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7:](https://reader034.vdocumento.com/reader034/viewer/2022042415/5f2fafe9ab9d887b1d76954c/html5/thumbnails/6.jpg)
INFORMATION RETRIEVAL:
RANKING
![Page 7: CC5212-1 Procesamiento Masivo de Datos 2017aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017... · 2017-05-03 · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7:](https://reader034.vdocumento.com/reader034/viewer/2022042415/5f2fafe9ab9d887b1d76954c/html5/thumbnails/7.jpg)
How Does Google Get Such Good Results?
obama
![Page 8: CC5212-1 Procesamiento Masivo de Datos 2017aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017... · 2017-05-03 · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7:](https://reader034.vdocumento.com/reader034/viewer/2022042415/5f2fafe9ab9d887b1d76954c/html5/thumbnails/8.jpg)
How Does Google Get Such Good Results?
aidan hogan
![Page 9: CC5212-1 Procesamiento Masivo de Datos 2017aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017... · 2017-05-03 · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7:](https://reader034.vdocumento.com/reader034/viewer/2022042415/5f2fafe9ab9d887b1d76954c/html5/thumbnails/9.jpg)
How does Google Get Such Good Results?
![Page 10: CC5212-1 Procesamiento Masivo de Datos 2017aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017... · 2017-05-03 · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7:](https://reader034.vdocumento.com/reader034/viewer/2022042415/5f2fafe9ab9d887b1d76954c/html5/thumbnails/10.jpg)
Two Sides to Ranking: Relevance
≠
![Page 11: CC5212-1 Procesamiento Masivo de Datos 2017aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017... · 2017-05-03 · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7:](https://reader034.vdocumento.com/reader034/viewer/2022042415/5f2fafe9ab9d887b1d76954c/html5/thumbnails/11.jpg)
Two Sides to Ranking: Importance
>
![Page 12: CC5212-1 Procesamiento Masivo de Datos 2017aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017... · 2017-05-03 · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7:](https://reader034.vdocumento.com/reader034/viewer/2022042415/5f2fafe9ab9d887b1d76954c/html5/thumbnails/12.jpg)
RANKING:
RELEVANCE
![Page 13: CC5212-1 Procesamiento Masivo de Datos 2017aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017... · 2017-05-03 · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7:](https://reader034.vdocumento.com/reader034/viewer/2022042415/5f2fafe9ab9d887b1d76954c/html5/thumbnails/13.jpg)
Example Query
Which of these three keyword terms is most “important”?
![Page 14: CC5212-1 Procesamiento Masivo de Datos 2017aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017... · 2017-05-03 · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7:](https://reader034.vdocumento.com/reader034/viewer/2022042415/5f2fafe9ab9d887b1d76954c/html5/thumbnails/14.jpg)
Matches in a Document
freedom• 7 occurrences
![Page 15: CC5212-1 Procesamiento Masivo de Datos 2017aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017... · 2017-05-03 · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7:](https://reader034.vdocumento.com/reader034/viewer/2022042415/5f2fafe9ab9d887b1d76954c/html5/thumbnails/15.jpg)
Matches in a Document
freedom• 7 occurrences
movie• 16 occurrences
![Page 16: CC5212-1 Procesamiento Masivo de Datos 2017aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017... · 2017-05-03 · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7:](https://reader034.vdocumento.com/reader034/viewer/2022042415/5f2fafe9ab9d887b1d76954c/html5/thumbnails/16.jpg)
Matches in a Document
freedom• 7 occurrences
movie• 16 occurrences
wallace• 88 occurrences
![Page 17: CC5212-1 Procesamiento Masivo de Datos 2017aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017... · 2017-05-03 · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7:](https://reader034.vdocumento.com/reader034/viewer/2022042415/5f2fafe9ab9d887b1d76954c/html5/thumbnails/17.jpg)
Usefulness of Words
movie• occurs very frequently
freedom• occurs frequently
wallace• occurs occassionally
![Page 18: CC5212-1 Procesamiento Masivo de Datos 2017aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017... · 2017-05-03 · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7:](https://reader034.vdocumento.com/reader034/viewer/2022042415/5f2fafe9ab9d887b1d76954c/html5/thumbnails/18.jpg)
Estimating Relevance
• Rare words more important than common words
– wallace (49M) more important than freedom (198M)
more important than movie (835M)
• Words occurring more frequently in a document
indicate higher relevance
– wallace (88) more matches than movie (16) more
matches than freedom (7)
![Page 19: CC5212-1 Procesamiento Masivo de Datos 2017aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017... · 2017-05-03 · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7:](https://reader034.vdocumento.com/reader034/viewer/2022042415/5f2fafe9ab9d887b1d76954c/html5/thumbnails/19.jpg)
Relevance Measure: TF–IDF
• TF: Term Frequency
– Measures occurrences of a term in a document
– … various options
• Raw count of occurrences
• Logarithmically scaled
• Normalised by document length
• A combination / something else
![Page 20: CC5212-1 Procesamiento Masivo de Datos 2017aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017... · 2017-05-03 · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7:](https://reader034.vdocumento.com/reader034/viewer/2022042415/5f2fafe9ab9d887b1d76954c/html5/thumbnails/20.jpg)
Relevance Measure: TF–IDF
• IDF: Inverse Document Frequency
– How common a term is across all documents
– …
• Logarithmically scaled document occurrences
• Note: The more rare, the larger the value
![Page 21: CC5212-1 Procesamiento Masivo de Datos 2017aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017... · 2017-05-03 · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7:](https://reader034.vdocumento.com/reader034/viewer/2022042415/5f2fafe9ab9d887b1d76954c/html5/thumbnails/21.jpg)
Relevance Measure: TF–IDF
• TF–IDF: Combine Term Frequency and Inverse
Document Frequency:
• Score for a query
– Let query
– Score for a query:
(There are other possibilities)
![Page 22: CC5212-1 Procesamiento Masivo de Datos 2017aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017... · 2017-05-03 · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7:](https://reader034.vdocumento.com/reader034/viewer/2022042415/5f2fafe9ab9d887b1d76954c/html5/thumbnails/22.jpg)
Relevance Measure: TF–IDF
Term Frequency Inverse Document Frequency
![Page 23: CC5212-1 Procesamiento Masivo de Datos 2017aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017... · 2017-05-03 · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7:](https://reader034.vdocumento.com/reader034/viewer/2022042415/5f2fafe9ab9d887b1d76954c/html5/thumbnails/23.jpg)
Relevance Measure: TF–IDF
Term Frequency Inverse Document Frequency
![Page 24: CC5212-1 Procesamiento Masivo de Datos 2017aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017... · 2017-05-03 · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7:](https://reader034.vdocumento.com/reader034/viewer/2022042415/5f2fafe9ab9d887b1d76954c/html5/thumbnails/24.jpg)
Relevance Measure: TF–IDF
Term Frequency Inverse Document Frequency
![Page 25: CC5212-1 Procesamiento Masivo de Datos 2017aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017... · 2017-05-03 · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7:](https://reader034.vdocumento.com/reader034/viewer/2022042415/5f2fafe9ab9d887b1d76954c/html5/thumbnails/25.jpg)
Relevance Measure: TF–IDF
Term Frequency Inverse Document Frequency
![Page 26: CC5212-1 Procesamiento Masivo de Datos 2017aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017... · 2017-05-03 · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7:](https://reader034.vdocumento.com/reader034/viewer/2022042415/5f2fafe9ab9d887b1d76954c/html5/thumbnails/26.jpg)
Relevance Measure: TF–IDF
Term Frequency Inverse Document Frequency
![Page 27: CC5212-1 Procesamiento Masivo de Datos 2017aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017... · 2017-05-03 · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7:](https://reader034.vdocumento.com/reader034/viewer/2022042415/5f2fafe9ab9d887b1d76954c/html5/thumbnails/27.jpg)
Relevance Measure: TF–IDF
Term Frequency Inverse Document Frequency
![Page 28: CC5212-1 Procesamiento Masivo de Datos 2017aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017... · 2017-05-03 · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7:](https://reader034.vdocumento.com/reader034/viewer/2022042415/5f2fafe9ab9d887b1d76954c/html5/thumbnails/28.jpg)
Relevance Measure: TF–IDF
Term Frequency Inverse Document Frequency
![Page 29: CC5212-1 Procesamiento Masivo de Datos 2017aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017... · 2017-05-03 · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7:](https://reader034.vdocumento.com/reader034/viewer/2022042415/5f2fafe9ab9d887b1d76954c/html5/thumbnails/29.jpg)
Vector Space Model (a mention)
![Page 30: CC5212-1 Procesamiento Masivo de Datos 2017aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017... · 2017-05-03 · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7:](https://reader034.vdocumento.com/reader034/viewer/2022042415/5f2fafe9ab9d887b1d76954c/html5/thumbnails/30.jpg)
Vector Space Model (a mention)
![Page 31: CC5212-1 Procesamiento Masivo de Datos 2017aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017... · 2017-05-03 · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7:](https://reader034.vdocumento.com/reader034/viewer/2022042415/5f2fafe9ab9d887b1d76954c/html5/thumbnails/31.jpg)
Vector Space Model (a mention)
![Page 32: CC5212-1 Procesamiento Masivo de Datos 2017aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017... · 2017-05-03 · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7:](https://reader034.vdocumento.com/reader034/viewer/2022042415/5f2fafe9ab9d887b1d76954c/html5/thumbnails/32.jpg)
Vector Space Model (a mention)
• Cosine Similarity
• Note:
![Page 33: CC5212-1 Procesamiento Masivo de Datos 2017aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017... · 2017-05-03 · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7:](https://reader034.vdocumento.com/reader034/viewer/2022042415/5f2fafe9ab9d887b1d76954c/html5/thumbnails/33.jpg)
Relevance Measure: TF–IDF
• TF–IDF: Combine Term Frequency and Inverse
Document Frequency:
• Score for a query
– Let query
– Score for a query:
(There are other possibilities)
… we could also use cosine similarity between
query and document using TF–IDF weights
![Page 34: CC5212-1 Procesamiento Masivo de Datos 2017aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017... · 2017-05-03 · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7:](https://reader034.vdocumento.com/reader034/viewer/2022042415/5f2fafe9ab9d887b1d76954c/html5/thumbnails/34.jpg)
Two Sides to Ranking: Relevance
≠
![Page 35: CC5212-1 Procesamiento Masivo de Datos 2017aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017... · 2017-05-03 · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7:](https://reader034.vdocumento.com/reader034/viewer/2022042415/5f2fafe9ab9d887b1d76954c/html5/thumbnails/35.jpg)
Field-Based Boosting
• Not all text is equal: titles, headers, etc.
![Page 36: CC5212-1 Procesamiento Masivo de Datos 2017aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017... · 2017-05-03 · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7:](https://reader034.vdocumento.com/reader034/viewer/2022042415/5f2fafe9ab9d887b1d76954c/html5/thumbnails/36.jpg)
Anchor Text
• See how the Web views/tags a page
![Page 37: CC5212-1 Procesamiento Masivo de Datos 2017aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017... · 2017-05-03 · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7:](https://reader034.vdocumento.com/reader034/viewer/2022042415/5f2fafe9ab9d887b1d76954c/html5/thumbnails/37.jpg)
Anchor Text
• See how the Web views/tags a page
![Page 38: CC5212-1 Procesamiento Masivo de Datos 2017aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017... · 2017-05-03 · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7:](https://reader034.vdocumento.com/reader034/viewer/2022042415/5f2fafe9ab9d887b1d76954c/html5/thumbnails/38.jpg)
• Inverted Index
– They built one so you don’t have to!
– Open Source in Java
Lucene uses relevance scoring
![Page 39: CC5212-1 Procesamiento Masivo de Datos 2017aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017... · 2017-05-03 · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7:](https://reader034.vdocumento.com/reader034/viewer/2022042415/5f2fafe9ab9d887b1d76954c/html5/thumbnails/39.jpg)
RANKING:
IMPORTANCE
![Page 40: CC5212-1 Procesamiento Masivo de Datos 2017aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017... · 2017-05-03 · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7:](https://reader034.vdocumento.com/reader034/viewer/2022042415/5f2fafe9ab9d887b1d76954c/html5/thumbnails/40.jpg)
Two Sides to Ranking: Importance
>
How could we determine that Barack Obama is more important
than Mount Obama as a search result on the Web?
![Page 41: CC5212-1 Procesamiento Masivo de Datos 2017aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017... · 2017-05-03 · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7:](https://reader034.vdocumento.com/reader034/viewer/2022042415/5f2fafe9ab9d887b1d76954c/html5/thumbnails/41.jpg)
Link Analysis
Which will have more links from other pages?
The Wikipedia article for Mount Obama?
The Wikipedia article for Barack Obama?
![Page 42: CC5212-1 Procesamiento Masivo de Datos 2017aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017... · 2017-05-03 · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7:](https://reader034.vdocumento.com/reader034/viewer/2022042415/5f2fafe9ab9d887b1d76954c/html5/thumbnails/42.jpg)
Link Analysis
• Consider links as votes of confidence in a page
• A hyperlink is the open Web’s version of …
(… even if the page is linked in a negative way.)
![Page 43: CC5212-1 Procesamiento Masivo de Datos 2017aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017... · 2017-05-03 · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7:](https://reader034.vdocumento.com/reader034/viewer/2022042415/5f2fafe9ab9d887b1d76954c/html5/thumbnails/43.jpg)
Link Analysis
So if we just count links to a page we can
determine its importance and we are done?
![Page 44: CC5212-1 Procesamiento Masivo de Datos 2017aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017... · 2017-05-03 · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7:](https://reader034.vdocumento.com/reader034/viewer/2022042415/5f2fafe9ab9d887b1d76954c/html5/thumbnails/44.jpg)
Link Spamming
![Page 45: CC5212-1 Procesamiento Masivo de Datos 2017aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017... · 2017-05-03 · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7:](https://reader034.vdocumento.com/reader034/viewer/2022042415/5f2fafe9ab9d887b1d76954c/html5/thumbnails/45.jpg)
Link Importance
So which should count for more?
A link from http://en.wikipedia.org/wiki/Barack_Obama?
Or a link from http://freev1agra.com/shop.html?
![Page 46: CC5212-1 Procesamiento Masivo de Datos 2017aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017... · 2017-05-03 · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7:](https://reader034.vdocumento.com/reader034/viewer/2022042415/5f2fafe9ab9d887b1d76954c/html5/thumbnails/46.jpg)
Link Importance
Maybe we could consider links from some
domains as having more “vote”?
![Page 47: CC5212-1 Procesamiento Masivo de Datos 2017aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017... · 2017-05-03 · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7:](https://reader034.vdocumento.com/reader034/viewer/2022042415/5f2fafe9ab9d887b1d76954c/html5/thumbnails/47.jpg)
PageRank
![Page 48: CC5212-1 Procesamiento Masivo de Datos 2017aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017... · 2017-05-03 · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7:](https://reader034.vdocumento.com/reader034/viewer/2022042415/5f2fafe9ab9d887b1d76954c/html5/thumbnails/48.jpg)
PageRank
• Not just a count of inlinks
– A link from a more important page is more important
– A link from a page with fewer links is more important
∴ A page with lots of inlinks from important pages (which
have few outlinks) is more important
![Page 49: CC5212-1 Procesamiento Masivo de Datos 2017aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017... · 2017-05-03 · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7:](https://reader034.vdocumento.com/reader034/viewer/2022042415/5f2fafe9ab9d887b1d76954c/html5/thumbnails/49.jpg)
PageRank is Recursive
• Not just a count of inlinks
– A link from a more important page is more important
– A link from a page with fewer links is more important
∴ A page with lots of inlinks from important pages (which
have few outlinks) is more important
![Page 50: CC5212-1 Procesamiento Masivo de Datos 2017aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017... · 2017-05-03 · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7:](https://reader034.vdocumento.com/reader034/viewer/2022042415/5f2fafe9ab9d887b1d76954c/html5/thumbnails/50.jpg)
PageRank Model
• The Web: a directed graph
Vertices(pages)
Edges(links)
f a
e b
d c
0.2650.225
0.138 0.127
0.0740.172
Which vertex is most important?
![Page 51: CC5212-1 Procesamiento Masivo de Datos 2017aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017... · 2017-05-03 · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7:](https://reader034.vdocumento.com/reader034/viewer/2022042415/5f2fafe9ab9d887b1d76954c/html5/thumbnails/51.jpg)
PageRank Model
• The Web: a directed graph
Vertices(pages)
Edges(links)
f a
e b
d c
![Page 52: CC5212-1 Procesamiento Masivo de Datos 2017aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017... · 2017-05-03 · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7:](https://reader034.vdocumento.com/reader034/viewer/2022042415/5f2fafe9ab9d887b1d76954c/html5/thumbnails/52.jpg)
PageRank Model
Vertices(pages)
Edges(links)
f
e b
d
![Page 53: CC5212-1 Procesamiento Masivo de Datos 2017aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017... · 2017-05-03 · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7:](https://reader034.vdocumento.com/reader034/viewer/2022042415/5f2fafe9ab9d887b1d76954c/html5/thumbnails/53.jpg)
PageRank Model
Vertices(pages)
Edges(links)
f
e b
d c
a
![Page 54: CC5212-1 Procesamiento Masivo de Datos 2017aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017... · 2017-05-03 · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7:](https://reader034.vdocumento.com/reader034/viewer/2022042415/5f2fafe9ab9d887b1d76954c/html5/thumbnails/54.jpg)
PageRank: Random Surfer Model
f a
e b
d c
= someone surfing the web, clicking links randomly
• What is the probability of being at page x after n hops?
![Page 55: CC5212-1 Procesamiento Masivo de Datos 2017aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017... · 2017-05-03 · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7:](https://reader034.vdocumento.com/reader034/viewer/2022042415/5f2fafe9ab9d887b1d76954c/html5/thumbnails/55.jpg)
PageRank: Random Surfer Model
f a
e b
d c
= someone surfing the web, clicking links randomly
• What is the probability of being at page x after n hops?
• Initial state: surfer equally likely to start at any node
![Page 56: CC5212-1 Procesamiento Masivo de Datos 2017aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017... · 2017-05-03 · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7:](https://reader034.vdocumento.com/reader034/viewer/2022042415/5f2fafe9ab9d887b1d76954c/html5/thumbnails/56.jpg)
PageRank: Random Surfer Model
f a
e b
d c
g
What would happen with gover time?
= someone surfing the web, clicking links randomly
• What is the probability of being at page x after n hops?
• Initial state: surfer equally likely to start at any node
• PageRank applied iteratively for each hop: score indicates probability of being at that page after that many hops
![Page 57: CC5212-1 Procesamiento Masivo de Datos 2017aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017... · 2017-05-03 · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7:](https://reader034.vdocumento.com/reader034/viewer/2022042415/5f2fafe9ab9d887b1d76954c/html5/thumbnails/57.jpg)
PageRank: Random Surfer Model
f a
e b
d c
g
= someone surfing the web, clicking links randomly
• What is the probability of being at page x after n hops?
• Initial state: surfer equally likely to start at any node
• PageRank applied iteratively for each hop: score indicates probability of being at that page after than many hops
• If the surfer reaches a page without links, the surfer randomly jumps to another page
![Page 58: CC5212-1 Procesamiento Masivo de Datos 2017aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017... · 2017-05-03 · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7:](https://reader034.vdocumento.com/reader034/viewer/2022042415/5f2fafe9ab9d887b1d76954c/html5/thumbnails/58.jpg)
PageRank: Random Surfer Model
f a
e b
d c
g i
= someone surfing the web, clicking links randomly
• What is the probability of being at page x after n hops?
• Initial state: surfer equally likely to start at any node
• PageRank applied iteratively for each hop: score indicates probability of being at that page after than many hops
• If the surfer reaches a page without links, the surfer randomly jumps to another page
What would happen with g and i over time?
![Page 59: CC5212-1 Procesamiento Masivo de Datos 2017aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017... · 2017-05-03 · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7:](https://reader034.vdocumento.com/reader034/viewer/2022042415/5f2fafe9ab9d887b1d76954c/html5/thumbnails/59.jpg)
PageRank: Random Surfer Model
f a
e b
d c
g i
What would happen with g and i over time?
= someone surfing the web, clicking links randomly
• What is the probability of being at page x after n hops?
• Initial state: surfer equally likely to start at any node
• PageRank applied iteratively for each hop: score indicates probability of being at that page after than many hops
• If the surfer reaches a page without links, the surfer randomly jumps to another page
![Page 60: CC5212-1 Procesamiento Masivo de Datos 2017aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017... · 2017-05-03 · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7:](https://reader034.vdocumento.com/reader034/viewer/2022042415/5f2fafe9ab9d887b1d76954c/html5/thumbnails/60.jpg)
PageRank: Random Surfer Model
= someone surfing the web, clicking links randomly
f a
e b
d c
• What is the probability of being at page x after n hops?
• Initial state: surfer equally likely to start at any node
• PageRank applied iteratively for each hop: score indicates probability of being at that page after than many hops
• If the surfer reaches a page without links, the surfer randomly jumps to another page
• The surfer will jump to a random page at any time with a probability 1 – d … this avoids traps and ensures convergence!
g i
![Page 61: CC5212-1 Procesamiento Masivo de Datos 2017aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017... · 2017-05-03 · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7:](https://reader034.vdocumento.com/reader034/viewer/2022042415/5f2fafe9ab9d887b1d76954c/html5/thumbnails/61.jpg)
PageRank Model: Final Version
• The Web: a directed graph
Vertices(pages)
Edges(links)
f a
e b
d c
![Page 62: CC5212-1 Procesamiento Masivo de Datos 2017aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017... · 2017-05-03 · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7:](https://reader034.vdocumento.com/reader034/viewer/2022042415/5f2fafe9ab9d887b1d76954c/html5/thumbnails/62.jpg)
PageRank: Benefits
More robust than a simple link count
Fewer ties than link counting
Scalable to approximate (for sparse graphs)
Convergence guaranteed
![Page 63: CC5212-1 Procesamiento Masivo de Datos 2017aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017... · 2017-05-03 · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7:](https://reader034.vdocumento.com/reader034/viewer/2022042415/5f2fafe9ab9d887b1d76954c/html5/thumbnails/63.jpg)
Two Sides to Ranking: Importance
>
![Page 64: CC5212-1 Procesamiento Masivo de Datos 2017aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017... · 2017-05-03 · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7:](https://reader034.vdocumento.com/reader034/viewer/2022042415/5f2fafe9ab9d887b1d76954c/html5/thumbnails/64.jpg)
GOOGLE: A GUESS
![Page 65: CC5212-1 Procesamiento Masivo de Datos 2017aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017... · 2017-05-03 · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7:](https://reader034.vdocumento.com/reader034/viewer/2022042415/5f2fafe9ab9d887b1d76954c/html5/thumbnails/65.jpg)
How Modern Google ranks results (maybe)
According to survey of SEO experts, not people in Google
![Page 66: CC5212-1 Procesamiento Masivo de Datos 2017aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017... · 2017-05-03 · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7:](https://reader034.vdocumento.com/reader034/viewer/2022042415/5f2fafe9ab9d887b1d76954c/html5/thumbnails/66.jpg)
How Modern Google ranks results (maybe)
According to survey of SEO experts, not people in Google
Why so secretive?
![Page 67: CC5212-1 Procesamiento Masivo de Datos 2017aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017... · 2017-05-03 · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7:](https://reader034.vdocumento.com/reader034/viewer/2022042415/5f2fafe9ab9d887b1d76954c/html5/thumbnails/67.jpg)
INFORMATION RETRIEVAL:
RECAP
![Page 68: CC5212-1 Procesamiento Masivo de Datos 2017aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017... · 2017-05-03 · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7:](https://reader034.vdocumento.com/reader034/viewer/2022042415/5f2fafe9ab9d887b1d76954c/html5/thumbnails/68.jpg)
How Does Google Get Such Good Results?
aidan hogan
![Page 69: CC5212-1 Procesamiento Masivo de Datos 2017aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017... · 2017-05-03 · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7:](https://reader034.vdocumento.com/reader034/viewer/2022042415/5f2fafe9ab9d887b1d76954c/html5/thumbnails/69.jpg)
Ranking in Information Retrieval
• Relevance: Is the document relevant for the query?
– Term Frequency * Inverse Document Frequency
– Cosine similarity
• Importance: Is the document a popular one?
– Links analysis
– PageRank
![Page 70: CC5212-1 Procesamiento Masivo de Datos 2017aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017... · 2017-05-03 · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7:](https://reader034.vdocumento.com/reader034/viewer/2022042415/5f2fafe9ab9d887b1d76954c/html5/thumbnails/70.jpg)
Ranking: Science or Art?
![Page 71: CC5212-1 Procesamiento Masivo de Datos 2017aidanhogan.com/teaching/cc5212-1-2017/lectures/MDP2017... · 2017-05-03 · CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTOÑO 2017 Lecture 7:](https://reader034.vdocumento.com/reader034/viewer/2022042415/5f2fafe9ab9d887b1d76954c/html5/thumbnails/71.jpg)
Questions?