This tutorial helps you get started with SpaCy, a Python library for NLP, by covering installation, a hello world program and common use cases.
Natural Language Processing
Natural Language Processing (NLP) is a discipline of Artificial Intelligence that bridges the communication gap between humans and computers (machines). It deals with the tools, algorithms and libraries that enable computers to extract information from human languages.
NLP employs various machine learning and deep learning algorithms to, for example, tag the parts of speech (nouns, verbs, conjunctions etc.) of the words in a sentence.
While NLP has many use cases, here are some of the popular ones -
- Building virtual assistants such as Apple Siri, Google Home, Amazon Alexa etc.
- Topic Modelling - finding trending topics in text and visualizing them with a word cloud
- Web Search
SpaCy Library and Features
SpaCy is an NLP library for Python that is designed for high accuracy and fast execution. It comes with the following features -
- Support for multiple languages such as English, German, Spanish, Portuguese, French, Italian, Dutch etc.
- Tagger for annotating part of speech tags on documents
- Dependency parser for annotating syntactic dependencies on documents
- Entity Recognizer for annotating named entities on documents
- Tokenizer for segmenting text into words, punctuation marks etc.
- Lemmatizer for assigning base forms to words, e.g. the lemma of the verb "doing" is "do"
- Matcher and Phrase Matcher for rule-based pattern matching
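Two of these features, the tokenizer and the rule-based Matcher, can be tried without downloading a trained model. The following is a minimal sketch under a few assumptions: it uses a blank English pipeline (tokenizer only), the sample sentence and the pattern name TRADE_TARIFFS are made up for illustration, and the `matcher.add(name, [pattern])` call follows the SpaCy v3 signature (older 2.x releases expect different arguments).

```python
import spacy
from spacy.matcher import Matcher

# A blank English pipeline contains only the tokenizer, so no trained
# model has to be downloaded for this sketch.
nlp = spacy.blank("en")
doc = nlp("Trade tariffs hurt growth, and trade tariffs raise prices.")

# Tokenizer: punctuation marks become tokens of their own.
tokens = [token.text for token in doc]
print(tokens)

# Matcher: find the two-token sequence "trade tariffs", ignoring case.
matcher = Matcher(nlp.vocab)
pattern = [{"LOWER": "trade"}, {"LOWER": "tariffs"}]
matcher.add("TRADE_TARIFFS", [pattern])  # SpaCy v3 call signature (assumed)

# Each match is a (match_id, start, end) triple of token indices.
matched = [doc[start:end].text for _, start, end in matcher(doc)]
print(matched)
```

The tagger, dependency parser, entity recognizer and lemmatizer need a trained model, so they are left to the hello world program later in this tutorial.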
SpaCy Data Structures
SpaCy comes with the following primitive data structures (data containers) -
- Doc: A container for all the annotations produced for our text by NLP analysis.
- Token: A single token such as a word, punctuation mark or number.
- Span: A slice of a Doc, i.e. a contiguous subset of its tokens along with their annotations.
- Vocab: A storage class providing access to the vocabulary and other data shared across a language, such as StringStore (a container that maps strings to their hash ids) and Lexeme objects (context-independent vocabulary entries for word types).
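To make the containers concrete, here is a small sketch that touches each of them. It assumes a blank English pipeline (tokenizer only, so no model download is needed), and the sample sentence is made up for illustration.

```python
import spacy

nlp = spacy.blank("en")  # tokenizer-only pipeline, no trained model needed
doc = nlp("World leaders worked together.")  # Doc: the analysed text

first = doc[0]   # Token: indexing a Doc gives a single token
span = doc[0:2]  # Span: slicing a Doc gives a contiguous run of tokens
kinds = (type(doc).__name__, type(first).__name__, type(span).__name__)
print(kinds)
print(first.text, "|", span.text)

# Vocab and StringStore: every string is stored once and referred to by a hash.
word_hash = nlp.vocab.strings["leaders"]   # string -> hash id
round_trip = nlp.vocab.strings[word_hash]  # hash id -> string
print(word_hash, round_trip)

# Lexeme: a vocabulary entry that carries no sentence context.
lexeme = nlp.vocab["leaders"]
print(lexeme.text, lexeme.is_alpha)
```

Storing strings as hashes in a shared Vocab is what lets many Doc objects refer to the same word without keeping their own copy of it.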
SpaCy Installation
Installing SpaCy in a virtual environment is recommended, because the installation also adds trained models to the Python library path, which may otherwise require root access.
Here are the commands to install SpaCy in a virtual environment for a project named spacy-demo:
Ubuntu:
# Go to the directory where you want to create the project, then create and enter it
mkdir spacy-demo
cd spacy-demo
# Install virtualenv package
sudo apt-get install virtualenv
# Create virtual env in directory named venv
virtualenv venv
# Or, to create it explicitly for Python 3, run this instead
virtualenv --python=python3 venv
# Activate venv
source ./venv/bin/activate
# Install SpaCy
pip install spacy
# Download the model for your language. The example below does it for English (en)
python -m spacy download en
Windows 10:
# Go to the directory where you want to create the project, then create and enter it
mkdir spacy-demo
cd spacy-demo
# Install virtualenv package
pip install -U virtualenv
# Or, for Python 3, run this instead
pip3 install -U virtualenv
# Create virtual env in directory named venv
virtualenv venv
# Or, to create it explicitly for Python 3, run this instead
virtualenv --python=python3 venv
# Activate venv
venv\Scripts\activate
# Install SpaCy. The following command may fail if you don't have "Microsoft Visual C++ Compiler for Python". In that case, install it using the link provided in the error message.
pip install -U spacy
# Download the model for your language. The example below does it for English (en)
python -m spacy download en
On Windows, copy the contents of the directory venv\Lib\site-packages\en_core_web_sm to venv\Lib\site-packages\spacy\data\en, as creating the soft link may not be possible.
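Once the commands have finished, a quick way to confirm the setup is to check, from inside the activated virtual environment, whether Python can locate the packages. The sketch below uses only the standard library. The helper name is_installed is made up for this example, and en_core_web_sm is assumed to be the package installed by the English model download (it is the directory name mentioned above).

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be found on the Python path."""
    return importlib.util.find_spec(module_name) is not None


# Report on the library itself and on the English model package.
for name in ("spacy", "en_core_web_sm"):
    status = "found" if is_installed(name) else "missing"
    print(name + ": " + status)
```

If either line reports "missing", the virtual environment is probably not activated or the corresponding install step did not complete.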
SpaCy Hello World Program
Now it's time to write a hello world type program using SpaCy. In this program, we will run NLP analysis on the following text -
The US and China's escalation of trade tariffs is expected to hit growth in both countries in 2019, when the boost from President Trump's sweeping tax cuts will also start to wane.
Mr Obstfeld said the world would become a "poorer and more dangerous place" unless world leaders worked together to raise living standards, improve education and reduce inequality.
Assuming you are in the directory spacy-demo, create a file spacy-demo.py with the following contents -
# import SpaCy module
import spacy
# load the English language model by its package name
# (the 'en' shortcut is only recognised by SpaCy 2.x)
nlp = spacy.load('en_core_web_sm')
# Text needs to be a unicode string (the u prefix is only required on Python 2)
doc = nlp(u'The US and China\'s escalation of trade tariffs is expected to hit growth in both countries in 2019'
          u', when the boost from President Trump\'s sweeping tax cuts will also start to wane. '
          u'Mr Obstfeld said the world would become a "poorer and more dangerous place" unless world leaders worked together '
          u'to raise living standards, improve education and reduce inequality.')
print("\n\nSentences in the analysed text...")
# doc.sents yields one Span per detected sentence
for sentence_span in doc.sents:
    print(sentence_span)
print("\n\nNoun chunks in the analysed text...")
# Noun chunks are helpful where many tokens together make a composite noun
for noun_chunk in doc.noun_chunks:
    print(noun_chunk)
print("\n\nName entities in the analysed text...")
print("%-15s %-15s" % ("Entity Name", "Entity Label"))
print("-----------------------------")
# doc.ents holds the named entities; label_ is the readable entity type
for entity in doc.ents:
    print("%-15s %-15s" % (entity, entity.label_))
print("\n\nTokens and their POS tags in the analysed text...")
print("%-15s %-15s %-15s %-15s" % ("Token", "Token POS Tag", "Token Lemma", "Token Dependency"))
print("-----------------------------------------------------------------")
# Each token carries its part-of-speech tag, lemma and dependency label
for token in doc:
    print("%-15s %-15s %-15s %-15s" % (token, token.pos_, token.lemma_, token.dep_))
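The %-15s placeholders in the program left-justify each value in a 15-character column, which is what lines the printed tables up. A standalone sketch of that formatting follows. The helper name format_row is made up for this example, and the two sample rows are copied from the output shown later in this tutorial, whereas the real program reads them from doc.ents.

```python
# Sample (entity, label) pairs; the real program gets these from doc.ents.
rows = [("US", "GPE"), ("Obstfeld", "PERSON")]


def format_row(name, label):
    """Left-justify both values in 15-character columns, as '%-15s' does."""
    return "%-15s %-15s" % (name, label)


lines = [format_row(name, label) for name, label in rows]
for line in lines:
    # repr() makes the padding spaces visible
    print(repr(line))
```

Values longer than 15 characters are not truncated, they simply push the following column to the right.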
Run the file with the following command -
# Ensure that the virtual env is activated using "venv\Scripts\activate" or "source venv/bin/activate" for Windows and Linux respectively
python spacy-demo.py
Here is the output of the above program on my machine -
Sentences in the analysed text...
The US and China's escalation of trade tariffs is expected to hit growth in both countries in 2019, when the boost from President Trump's sweeping tax cuts will also start to wane.
Mr Obstfeld said the world would become a "poorer and more dangerous place" unless world leaders worked together to raise living standards, improve education and reduce inequality.
Noun chunks in the analysed text...
The US
China's escalation
trade tariffs
growth
both countries
the boost
President Trump's sweeping tax cuts
Mr Obstfeld
the world
a "poorer and more dangerous place
world leaders
living standards
education
inequality
Name entities in the analysed text...
Entity Name     Entity Label
-----------------------------
US              GPE
China           GPE
2019            DATE
Trump           PERSON
Obstfeld        PERSON
Tokens and their POS tags in the analysed text...
Token           Token POS Tag   Token Lemma     Token Dependency
-----------------------------------------------------------------
The             DET             the             det
US              PROPN           us              nsubjpass
and             CCONJ           and             cc
China           PROPN           china           poss
's              PART            's              case
escalation      NOUN            escalation      conj
of              ADP             of              prep
trade           NOUN            trade           compound
tariffs         NOUN            tariff          pobj
is              VERB            be              auxpass
expected        VERB            expect          ROOT
to              PART            to              aux
hit             VERB            hit             xcomp
growth          NOUN            growth          dobj
in              ADP             in              prep
both            DET             both            det
countries       NOUN            country         pobj
in              ADP             in              prep
2019            NUM             2019            pobj
,               PUNCT           ,               punct
when            ADV             when            advmod
the             DET             the             det
boost           NOUN            boost           nsubj
from            ADP             from            prep
President       PROPN           president       compound
Trump           PROPN           trump           poss
's              PART            's              case
sweeping        ADJ             sweeping        amod
tax             NOUN            tax             compound
cuts            NOUN            cut             pobj
will            VERB            will            aux
also            ADV             also            advmod
start           VERB            start           relcl
to              ADP             to              aux
wane            NOUN            wane            xcomp
.               PUNCT           .               punct
Mr              PROPN           mr              compound
Obstfeld        PROPN           obstfeld        nsubj
said            VERB            say             ROOT
the             DET             the             det
world           NOUN            world           nsubj
would           VERB            would           aux
become          VERB            become          ccomp
a               DET             a               det
"               PUNCT           "               punct
poorer          ADJ             poor            amod
and             CCONJ           and             cc
more            ADV             more            advmod
dangerous       ADJ             dangerous       conj
place           NOUN            place           attr
"               PUNCT           "               punct
unless          ADP             unless          mark
world           NOUN            world           compound
leaders         NOUN            leader          nsubj
worked          VERB            work            advcl
together        ADV             together        advmod
to              PART            to              aux
raise           VERB            raise           advcl
living          NOUN            living          compound
standards       NOUN            standard        dobj
,               PUNCT           ,               punct
improve         VERB            improve         conj
education       NOUN            education       dobj
and             CCONJ           and             cc
reduce          VERB            reduce          conj
inequality      NOUN            inequality      dobj
.               PUNCT           .               punct
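The entity labels, POS tags and dependency labels in this output are abbreviations. SpaCy ships a small glossary that can be queried with spacy.explain, which needs no trained model. A minimal sketch, with the labels to look up picked by hand from the output above:

```python
import spacy

# Labels picked by hand from the program output for illustration.
labels = ("GPE", "PERSON", "PROPN", "nsubjpass", "pobj")

# spacy.explain returns a short description for a known label.
descriptions = {label: spacy.explain(label) for label in labels}
for label, text in descriptions.items():
    print("%-15s %s" % (label, text))
```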
Thank you for reading through the tutorial. If you have any feedback, questions or concerns, please share them in the comments and we will get back to you as soon as possible.