Python Machine Learning, Second Edition

by Sebastian Raschka, Jared Huffman, Vahid Mirjalili, Ryan Sun
September 2017
Content level: Intermediate to advanced
622 pages
15h 13m
English
Packt Publishing

Introducing the bag-of-words model

You may remember from Chapter 4, Building Good Training Sets – Data Preprocessing, that we have to convert categorical data, such as text or words, into a numerical form before we can pass it on to a machine learning algorithm. In this section, we introduce the bag-of-words model, which allows us to represent text as numerical feature vectors. The idea behind the bag-of-words model is quite simple and can be summarized as follows:

  1. We create a vocabulary of unique tokens (for example, words) from the entire set of documents.
  2. We construct a feature vector for each document that contains the counts of how often each word occurs in that document.
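
The two steps above can be sketched in a few lines of plain Python. This is only a minimal illustration, not necessarily the approach the chapter goes on to use: the helper names `build_vocabulary` and `to_count_vector`, the three sample sentences, and the simple lowercase-and-split-on-whitespace tokenization are all assumptions made for this example.

```python
from collections import Counter


def build_vocabulary(docs):
    """Step 1: map every unique token in the document set to a column index."""
    # Assumption: tokens are lowercase, whitespace-separated words.
    tokens = sorted({tok for doc in docs for tok in doc.lower().split()})
    return {tok: idx for idx, tok in enumerate(tokens)}


def to_count_vector(doc, vocabulary):
    """Step 2: count how often each vocabulary word occurs in one document."""
    counts = Counter(doc.lower().split())
    vector = [0] * len(vocabulary)
    for tok, n in counts.items():
        if tok in vocabulary:  # ignore words outside the vocabulary
            vector[vocabulary[tok]] = n
    return vector


# Hypothetical sample documents, chosen only for illustration.
docs = [
    "the sun is shining",
    "the weather is sweet",
    "the sun is shining and the weather is sweet",
]

vocabulary = build_vocabulary(docs)
bag = [to_count_vector(doc, vocabulary) for doc in docs]

print(vocabulary)
for vector in bag:
    print(vector)
```

Each position in a vector corresponds to one vocabulary entry, so every document is represented by a vector of the same length regardless of how many words it contains.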

Since the unique words in each document represent only a small ...

ISBN: 9781787125933