Helping travellers with 500 million hotel reviews a month
Post on 12-Feb-2017
TRANSCRIPT
Helping Travellers Make Better Hotel Choices
500 Million Times a Month
Miguel Cabrera
@mfcabrera
https://www.flickr.com/photos/18694857@N00/5614701858/
• Neuberliner
• Systems and Informatics Engineering, Universidad Nacional - Med
• M.Sc. in Informatics, TUM; Hons. Technology Management
• Work for TrustYou as Data (Scientist|Engineer|Juggler)™
• Founder and former organizer of Munich DataGeeks
ABOUT ME
• What we do
• Architecture
• Technology
• Crawling
• Textual Processing
• Workflow Management and Scale
• Sample Application
AGENDA
• Crawling
• Natural Language Processing / Semantic Analysis
• Record Linkage / Deduplication
• Ranking
• Recommendation
• Classification
• Clustering
Tasks
Batch Layer
• Hadoop
• Python
• Pig*
• Java*
Service Layer
• PostgreSQL
• MongoDB
• Redis
• Cassandra
[Diagram: data flows into the Hadoop cluster (batch layer) and out to the application machines (service layer)]
Stack
• Build your own web crawlers
• Extract data via CSS selectors, XPath, regexes, etc.
• Handles queuing, request parallelism, cookies, throttling …
• Comprehensive and well-designed
• Commercial support by http://scrapinghub.com/
Scrapy
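The extraction step those bullets describe can be sketched in plain Python — a toy stand-in for Scrapy's selectors, with invented HTML and field names:

```python
import re

# Toy HTML as a crawler might fetch it (invented example data)
html = """
<div class="review"><span class="title">Great location</span>
<span class="score">4.5</span></div>
<div class="review"><span class="title">Noisy rooms</span>
<span class="score">2.0</span></div>
"""

# Regex-based field extraction, a crude stand-in for CSS/XPath selectors
titles = re.findall(r'<span class="title">(.*?)</span>', html)
scores = [float(s) for s in re.findall(r'<span class="score">(.*?)</span>', html)]

reviews = list(zip(titles, scores))
print(reviews)  # [('Great location', 4.5), ('Noisy rooms', 2.0)]
```

In a real Scrapy spider the same logic lives in a parse callback using response selectors; the framework then handles the queuing, parallelism, and throttling listed above.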
• 2 - 3 million new reviews/week
• Customers want alerts 8 - 24h after review publication!
• Smart crawl frequency & depth, but still high overhead
• Pools of constantly refreshed EC2 proxy IPs
• Direct API connections with many sites
Crawling at TrustYou
• Custom framework very similar to Scrapy
• Runs on Hadoop cluster (100 nodes)
• Not 100% suitable for MapReduce
• Nodes mostly waiting
• Coordination/messaging between nodes required:
  – Distributed queue
  – Rate limiting
Crawling at TrustYou
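The rate-limiting half of that coordination can be sketched as a token bucket — a minimal single-process version; a distributed crawler would back the bucket with a shared store such as Redis:

```python
import time

class TokenBucket:
    """Allow at most `rate` requests per second, with bursts up to `capacity`."""
    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill tokens for the elapsed time, capped at capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate=2, capacity=2)
results = [bucket.allow() for _ in range(4)]  # burst of 4 immediate requests
print(results)  # first 2 allowed, rest denied: [True, True, False, False]
```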
Text Processing
Raw text → Sentence splitting → Tokenizing → Stopwords → Stemming
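That pipeline can be sketched in plain Python. Everything here is a toy: the stopword list and the suffix-stripping "stemmer" are simplified stand-ins for real linguistic resources:

```python
import re

STOPWORDS = {"the", "is", "are", "a", "an", "and"}  # toy list

def sentences(text):
    # Naive sentence splitting on terminal punctuation
    return [s for s in re.split(r"[.!?]+\s*", text) if s]

def tokens(sentence):
    return re.findall(r"[a-z']+", sentence.lower())

def stem(word):
    # Crude suffix stripping, a stand-in for a real stemmer
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

text = "The rooms are great. The breakfast is disappointing!"
pipeline = [
    [stem(t) for t in tokens(s) if t not in STOPWORDS]
    for s in sentences(text)
]
print(pipeline)  # [['room', 'great'], ['breakfast', 'disappoint']]
```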
Topic Models
Word Vectors
Classification
• “great rooms” → JJ NN
• “great hotel” → JJ NN
• “rooms are terrible” → NN VB JJ
• “hotel is terrible” → NN VB JJ
Text Processing
>>> nltk.pos_tag(nltk.word_tokenize("hotel is terrible"))
[('hotel', 'NN'), ('is', 'VBZ'), ('terrible', 'JJ')]
• 25+ languages
• Linguistic system (morphology, taggers, grammars, parsers …)
• Hadoop: scale out CPU
• ~1B opinions in the database
• Python for ML & NLP libraries
Semantic Analysis
Word2Vec
Source: http://technology.stitchfix.com/blog/2015/03/11/word-is-worth-a-thousand-vectors/
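The idea behind the Word2Vec slides — words as dense vectors whose geometry encodes similarity — can be illustrated with hand-made 3-d vectors. The numbers here are invented; real Word2Vec embeddings have hundreds of dimensions learned from data:

```python
import math

# Invented toy embeddings; a trained model would learn these
vectors = {
    "hotel": [0.9, 0.1, 0.0],
    "hostel": [0.8, 0.2, 0.1],
    "terrible": [0.0, 0.9, 0.4],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

sim_related = cosine(vectors["hotel"], vectors["hostel"])
sim_unrelated = cosine(vectors["hotel"], vectors["terrible"])
print(sim_related > sim_unrelated)  # True: "hotel" is closer to "hostel"
```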
Luigi
• Build complex pipelines of batch jobs
• Dependency resolution
• Parallelism
• Resume failed jobs
• Some support for Hadoop
Luigi
• Dependency definition
• Hadoop / HDFS integration
• Object-oriented abstraction
• Parallelism
• Resume failed jobs
• Visualization of pipelines
• Command line integration
Minimal Boilerplate Code

class WordCount(luigi.Task):
    date = luigi.DateParameter()

    def requires(self):
        return [InputText(self.date)]

    def output(self):
        return luigi.LocalTarget('/tmp/%s' % self.date)

    def run(self):
        count = {}
        for f in self.input():
            for line in f.open('r'):
                for word in line.strip().split():
                    count[word] = count.get(word, 0) + 1
        out = self.output().open('w')
        for word, n in count.items():
            out.write("%s\t%d\n" % (word, n))
        out.close()

The same snippet is shown four times, highlighting in turn:
• Task Parameters
• Programmatically Defined Dependencies
• Each Task produces an output
• Write Logic in Python
Hadoop Streaming
hadoop jar contrib/streaming/hadoop-*streaming*.jar \
  -file /home/hduser/mapper.py -mapper /home/hduser/mapper.py \
  -file /home/hduser/reducer.py -reducer /home/hduser/reducer.py \
  -input /user/hduser/text.txt -output /user/hduser/gutenberg-output
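The mapper.py/reducer.py pair in that command can be sketched as follows. The stdin/stdout plumbing is shown as plain functions so the logic is testable; real streaming scripts would read sys.stdin line by line:

```python
import itertools

def mapper(lines):
    # Emit (word, 1) for every word, as a Hadoop Streaming mapper would
    for line in lines:
        for word in line.strip().split():
            yield word, 1

def reducer(pairs):
    # Hadoop sorts mapper output by key before the reducer sees it
    for word, group in itertools.groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield word, sum(n for _, n in group)

counts = dict(reducer(mapper(["great hotel", "great rooms"])))
print(counts)  # {'great': 2, 'hotel': 1, 'rooms': 1}
```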
class WordCount(luigi.hadoop.JobTask):
    date = luigi.DateParameter()

    def requires(self):
        return InputText(self.date)

    def output(self):
        return luigi.hdfs.HdfsTarget('%s' % self.date)

    def mapper(self, line):
        for word in line.strip().split():
            yield word, 1

    def reducer(self, key, values):
        yield key, sum(values)
Luigi + Hadoop/HDFS
We use it for…
• Standalone executables
• Dump data from databases
• General Hadoop Streaming
• Bash Scripts / MRJob
• Pig* Scripts
Source: http://www.telegraph.co.uk/travel/hotels/11240430/TripAdvisor-the-funniest-reviews-biggest-controversies-and-best-spoofs.html
Snippets from Reviews
“Hips don’t lie”
“Maid was banging”
“Beautiful bowl flowers”
“Irish dance, I love that”
“No ghost sighting”
“One ghost touching”
“Too much cardio, not enough squats in the gym”
“It is like hugging a bony super model”
from gensim.models.doc2vec import Doc2Vec

class LearnModelTask(luigi.Task):
    # Parameters.... blah blah blah

    def output(self):
        return luigi.LocalTarget(os.path.join(self.output_directory, self.model_out))

    def requires(self):
        return LearnBigramsTask()

    def run(self):
        sentences = LabeledClusterIDSentence(self.input().path)
        model = Doc2Vec(sentences=sentences,
                        size=int(self.size),
                        dm=int(self.distmem),
                        negative=int(self.negative),
                        workers=int(self.workers),
                        window=int(self.window),
                        min_count=int(self.min_count),
                        train_words=True)
        model.save(self.output().path)
Non-personalized recommenders recommend items based on what other consumers have said about the items.
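A minimal version of that idea: rank items by what all reviewers said, with no per-user data. The ratings here are invented for illustration:

```python
from collections import defaultdict

# (hotel, rating) pairs from many anonymous reviewers -- invented data
reviews = [("Hotel A", 5), ("Hotel A", 4), ("Hotel B", 2),
           ("Hotel B", 3), ("Hotel C", 4)]

totals = defaultdict(list)
for hotel, rating in reviews:
    totals[hotel].append(rating)

# Non-personalized: every user gets the same ranking by average rating
ranking = sorted(totals, key=lambda h: sum(totals[h]) / len(totals[h]), reverse=True)
print(ranking)  # ['Hotel A', 'Hotel C', 'Hotel B']
```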
Takeaways
• It is possible to use Python as the primary language for large-scale data processing on Hadoop.
• It is not a perfect setup but works well most of the time.
• Keep your ecosystem open to other technologies.