Introduction to Natural Language Processing

We offer three carefully selected full-day trainings, each a deep dive into a vertical or horizontal that is a key theme of the Data By the Bay conference matrix.


Natural Language Processing with Gabor Melli

The automated processing of text data has become a mission-critical capability in industries as varied as medicine, finance, law, advertising, and engineering. This tutorial will review best practices in many of these application areas from the perspective of proven applications, methods, tools, and resources. On completion, you will understand and be able to prototype the components of an end-to-end NLP system that achieves best-baseline results. Topics include:

  • Text Preprocessing such as tokenization, lemmatisation, and end-of-sentence detection.
  • Shallow Syntactic and Semantic Analysis such as semantic role labeling and named entity recognition.
  • Text Classification & Clustering such as spam detection and topic modeling.
  • Information Extraction such as relation extraction in open and closed domains.
  • Word Sense Disambiguation such as linking to an ontology.
  • Word Relatedness Functions such as those derived from continuous word embeddings learned by deep neural networks.
  • Text Summarization such as data-driven question answering.


Gabor Melli

Gabor Melli is the Director of Data Science at OpenGov, where he leads initiatives to automate knowledge-intensive, text-rich processes. This work largely involves training predictive models for classification, sequence labeling, and estimation, for tasks such as named entity recognition and disambiguation in user-generated text, using techniques and tools such as CRFs, SVMs, HMMs, logistic regression, LDA, NLTK, Spark, Python, R, and AWS EC2/S3/EMR. He has led and delivered large-scale data-driven initiatives at organizations ranging from Microsoft, AT&T, T-Mobile, ICBC, Washington Mutual, and Wal*Mart to start-ups such as Datasage, PredictionWorks, VigLink, and now OpenGov.

Gabor holds a PhD in Computing Science from Simon Fraser University, on the topic of document-to-ontology interlinking. He has been active in the data science community for over twenty years and received ACM SIGKDD's Service Award in 2013. His current research interests include iterative semantic semi-supervised text analysis and automated business process optimization. Additional information at


Gabor Melli and his team present Data-Driven Commerce Pipeline at SF Text