Helping travellers with 500 million hotel reviews a month
Post on 12-Feb-2017
TRANSCRIPT
Helping Travellers Make Better Hotel Choices
500 Million Times a Month
Miguel Cabrera
@mfcabrera
https://www.flickr.com/photos/18694857@N00/5614701858/
• Neuberliner
• Systems and Informatics Engineering, Universidad Nacional - Med
• M.Sc. in Informatics, TUM; Hons. Technology Management
• Work for TrustYou as Data (Scientist|Engineer|Juggler)™
• Founder and former organizer of Munich DataGeeks
ABOUT ME
• What we do
• Architecture
• Technology
• Crawling
• Textual Processing
• Workflow Management and Scale
• Sample Application
AGENDA
• Crawling
• Natural Language Processing / Semantic Analysis
• Record Linkage / Deduplication
• Ranking
• Recommendation
• Classification
• Clustering
Tasks
Batch Layer
• Hadoop
• Python
• Pig*
• Java*
Service Layer
• PostgreSQL
• MongoDB
• Redis
• Cassandra
[Diagram: data flows into the Hadoop cluster (batch layer) and out to the application machines (service layer)]
Stack
• Build your own web crawlers
• Extract data via CSS selectors, XPath, regexes, etc.
• Handles queuing, request parallelism, cookies, throttling …
• Comprehensive and well-designed
• Commercial support by http://scrapinghub.com/
Scrapy
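The extraction step those bullets describe can be sketched in plain Python — a toy stand-in for Scrapy's selectors, with invented HTML and field names:

```python
import re

# Toy HTML as a crawler might fetch it (invented example data)
html = """
<div class="review"><span class="title">Great location</span>
<span class="score">4.5</span></div>
<div class="review"><span class="title">Noisy rooms</span>
<span class="score">2.0</span></div>
"""

# Regex-based field extraction, a crude stand-in for CSS/XPath selectors
titles = re.findall(r'<span class="title">(.*?)</span>', html)
scores = [float(s) for s in re.findall(r'<span class="score">(.*?)</span>', html)]

reviews = list(zip(titles, scores))
print(reviews)  # [('Great location', 4.5), ('Noisy rooms', 2.0)]
```

In a real Scrapy spider the same logic lives in a parse callback using response selectors; the framework then handles the queuing, parallelism, and throttling listed above.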
• 2 - 3 million new reviews/week
• Customers want alerts 8 - 24h after review publication!
• Smart crawl frequency & depth, but still high overhead
• Pools of constantly refreshed EC2 proxy IPs
• Direct API connections with many sites
Crawling at TrustYou
• Custom framework very similar to Scrapy
• Runs on Hadoop cluster (100 nodes)
• Not 100% suitable for MapReduce
• Nodes mostly waiting
• Coordination/messaging between nodes required:
  – Distributed queue
  – Rate limiting
Crawling at TrustYou
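The rate-limiting half of that coordination can be sketched as a token bucket — a minimal single-process version; a distributed crawler would back the bucket with a shared store such as Redis:

```python
import time

class TokenBucket:
    """Allow at most `rate` requests per second, with bursts up to `capacity`."""
    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill tokens for the elapsed time, capped at capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate=2, capacity=2)
results = [bucket.allow() for _ in range(4)]  # burst of 4 immediate requests
print(results)  # first 2 allowed, rest denied: [True, True, False, False]
```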
Text Processing
Raw text → Sentence splitting → Tokenizing → Stopwords → Stemming
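That pipeline can be sketched in plain Python. Everything here is a toy: the stopword list and the suffix-stripping "stemmer" are simplified stand-ins for real linguistic resources:

```python
import re

STOPWORDS = {"the", "is", "are", "a", "an", "and"}  # toy list

def sentences(text):
    # Naive sentence splitting on terminal punctuation
    return [s for s in re.split(r"[.!?]+\s*", text) if s]

def tokens(sentence):
    return re.findall(r"[a-z']+", sentence.lower())

def stem(word):
    # Crude suffix stripping, a stand-in for a real stemmer
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

text = "The rooms are great. The breakfast is disappointing!"
pipeline = [
    [stem(t) for t in tokens(s) if t not in STOPWORDS]
    for s in sentences(text)
]
print(pipeline)  # [['room', 'great'], ['breakfast', 'disappoint']]
```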
Topic Models
Word Vectors
Classification
• “great rooms” → JJ NN
• “great hotel” → JJ NN
• “rooms are terrible” → NN VB JJ
• “hotel is terrible” → NN VB JJ
Text Processing
>>> nltk.pos_tag(nltk.word_tokenize("hotel is terrible"))
[('hotel', 'NN'), ('is', 'VBZ'), ('terrible', 'JJ')]
• 25+ languages
• Linguistic system (morphology, taggers, grammars, parsers …)
• Hadoop: scale out CPU
• ~1B opinions in the database
• Python for ML & NLP libraries
Semantic Analysis
Word2Vec
Source: http://technology.stitchfix.com/blog/2015/03/11/word-is-worth-a-thousand-vectors/
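The idea behind the Word2Vec slides — words as dense vectors whose geometry encodes similarity — can be illustrated with hand-made 3-d vectors. The numbers here are invented; real Word2Vec embeddings have hundreds of dimensions learned from data:

```python
import math

# Invented toy embeddings; a trained model would learn these
vectors = {
    "hotel": [0.9, 0.1, 0.0],
    "hostel": [0.8, 0.2, 0.1],
    "terrible": [0.0, 0.9, 0.4],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

sim_related = cosine(vectors["hotel"], vectors["hostel"])
sim_unrelated = cosine(vectors["hotel"], vectors["terrible"])
print(sim_related > sim_unrelated)  # True: "hotel" is closer to "hostel"
```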
Luigi
• Build complex pipelines of batch jobs
• Dependency resolution
• Parallelism
• Resume failed jobs
• Some support for Hadoop
Luigi
• Dependency definition
• Hadoop / HDFS integration
• Object-oriented abstraction
• Parallelism
• Resume failed jobs
• Visualization of pipelines
• Command line integration
Minimal Boilerplate Code

class WordCount(luigi.Task):
    date = luigi.DateParameter()

    def requires(self):
        return [InputText(self.date)]

    def output(self):
        return luigi.LocalTarget('/tmp/%s' % self.date)

    def run(self):
        count = {}
        for f in self.input():
            for line in f.open('r'):
                for word in line.strip().split():
                    count[word] = count.get(word, 0) + 1
        out = self.output().open('w')
        for word, n in count.items():
            out.write("%s\t%d\n" % (word, n))
        out.close()

The same snippet is shown four times, highlighting in turn:
• Task Parameters
• Programmatically Defined Dependencies
• Each Task produces an output
• Write Logic in Python
Hadoop Streaming
hadoop jar contrib/streaming/hadoop-*streaming*.jar \
  -file /home/hduser/mapper.py -mapper /home/hduser/mapper.py \
  -file /home/hduser/reducer.py -reducer /home/hduser/reducer.py \
  -input /user/hduser/text.txt -output /user/hduser/gutenberg-output
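The mapper.py/reducer.py pair in that command can be sketched as follows. The stdin/stdout plumbing is shown as plain functions so the logic is testable; real streaming scripts would read sys.stdin line by line:

```python
import itertools

def mapper(lines):
    # Emit (word, 1) for every word, as a Hadoop Streaming mapper would
    for line in lines:
        for word in line.strip().split():
            yield word, 1

def reducer(pairs):
    # Hadoop sorts mapper output by key before the reducer sees it
    for word, group in itertools.groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield word, sum(n for _, n in group)

counts = dict(reducer(mapper(["great hotel", "great rooms"])))
print(counts)  # {'great': 2, 'hotel': 1, 'rooms': 1}
```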
class WordCount(luigi.hadoop.JobTask):
    date = luigi.DateParameter()

    def requires(self):
        return InputText(self.date)

    def output(self):
        return luigi.hdfs.HdfsTarget('%s' % self.date)

    def mapper(self, line):
        for word in line.strip().split():
            yield word, 1

    def reducer(self, key, values):
        yield key, sum(values)
Luigi + Hadoop/HDFS
We use it for…
• Standalone executables
• Dump data from databases
• General Hadoop Streaming
• Bash Scripts / MRJob
• Pig* Scripts
Source: http://www.telegraph.co.uk/travel/hotels/11240430/TripAdvisor-the-funniest-reviews-biggest-controversies-and-best-spoofs.html
Snippets from Reviews
“Hips don’t lie”
“Maid was banging”
“Beautiful bowl flowers”
“Irish dance, I love that”
“No ghost sighting”
“One ghost touching”
“Too much cardio, not enough squats in the gym”
“It is like hugging a bony super model”
from gensim.models.doc2vec import Doc2Vec

class LearnModelTask(luigi.Task):
    # Parameters.... blah blah blah

    def output(self):
        return luigi.LocalTarget(os.path.join(self.output_directory, self.model_out))

    def requires(self):
        return LearnBigramsTask()

    def run(self):
        sentences = LabeledClusterIDSentence(self.input().path)
        model = Doc2Vec(sentences=sentences,
                        size=int(self.size),
                        dm=int(self.distmem),
                        negative=int(self.negative),
                        workers=int(self.workers),
                        window=int(self.window),
                        min_count=int(self.min_count),
                        train_words=True)
        model.save(self.output().path)
Non-personalized recommenders recommend items based on what other consumers have said about the items.
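A minimal version of that idea: rank items by what all reviewers said, with no per-user data. The ratings here are invented for illustration:

```python
from collections import defaultdict

# (hotel, rating) pairs from many anonymous reviewers -- invented data
reviews = [("Hotel A", 5), ("Hotel A", 4), ("Hotel B", 2),
           ("Hotel B", 3), ("Hotel C", 4)]

totals = defaultdict(list)
for hotel, rating in reviews:
    totals[hotel].append(rating)

# Non-personalized: every user gets the same ranking by average rating
ranking = sorted(totals, key=lambda h: sum(totals[h]) / len(totals[h]), reverse=True)
print(ranking)  # ['Hotel A', 'Hotel C', 'Hotel B']
```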
Takeaways
• It is possible to use Python as the primary language for large-scale data processing on Hadoop.
• It is not a perfect setup but works well most of the time.
• Keep your ecosystem open to other technologies.