This tutorial will help you get started with SpaCy, a Python library for NLP, by covering its installation, a hello world program, and its use cases.
Natural Language Processing
Natural Language Processing (NLP) is a discipline of Artificial Intelligence that bridges the communication gap between humans and computers (machines). It deals with the tools, algorithms and libraries that enable computers to extract information from human languages.
NLP employs various machine learning and deep learning algorithms for tasks such as tagging the parts of speech (nouns, verbs, conjunctions etc.) in sentences.
NLP has many use cases. Here are some of the popular ones:
- Building virtual assistants such as Apple Siri, Google Home, Amazon Alexa etc.
- Topic modelling: finding trending topics in text and visualizing them with a word cloud
- Web search
SpaCy Library and Features
SpaCy is one of the NLP libraries for Python, with a focus on accuracy and fast execution. It comes with the following features:
- Support for multiple languages such as English, German, Spanish, Portuguese, French, Italian, Dutch etc.
- Tagger for annotating part-of-speech tags on documents
- Dependency parser for annotating syntactic dependencies on documents
- Entity Recognizer for annotating named entities on documents
- Tokenizer for segmenting text into words, punctuation marks etc.
- Lemmatizer for assigning base forms to words, e.g. the lemma of the verb "doing" is "do"
- Matcher and Phrase Matcher for rule-based pattern matching
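As a rough illustration of two of these features, the sketch below runs the tokenizer and the rule-based Matcher on a short sentence. It assumes spaCy 3, where `Matcher.add` takes a list of patterns, and it uses a blank English pipeline (`spacy.blank("en")`) so that no trained model has to be downloaded. The sentence and the pattern name `LIVING_STANDARDS` are made up for this example.

```python
import spacy
from spacy.matcher import Matcher

# Blank English pipeline: tokenizer only, so no model download is needed.
nlp = spacy.blank("en")

doc = nlp("World leaders worked together to raise living standards.")

# Tokenizer: the final full stop becomes a token of its own.
tokens = [token.text for token in doc]
print(tokens)

# Matcher: find "living" followed by "standards", ignoring case.
# Matcher.add(name, [pattern]) is the spaCy 3 signature (assumed here).
matcher = Matcher(nlp.vocab)
pattern = [{"LOWER": "living"}, {"LOWER": "standards"}]
matcher.add("LIVING_STANDARDS", [pattern])

# Each match is a (match_id, start, end) triple of token offsets.
matched_texts = [doc[start:end].text for _match_id, start, end in matcher(doc)]
print(matched_texts)
```

The tagger, dependency parser, entity recognizer and lemmatizer need a trained model, so they are not exercised in this sketch.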
SpaCy Data Structures
SpaCy comes with the following primitive data structures, or data containers:
- Doc: A container for all the annotations that we get on our text after NLP analysis.
- Token: A single token, such as a word or a punctuation mark.
- Span: A slice of a Doc, i.e. a subset of its tokens along with their annotations.
- Vocab: A storage class providing access to the vocabulary and other data shared across a language, such as the StringStore (a container that maps strings to their hash ids) and Lexeme objects (vocabulary entries that describe a word type independently of its context).
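The containers described above can be explored with a few lines of code. This is a minimal sketch rather than a full tour: it assumes a blank English pipeline (`spacy.blank("en")`), which only tokenizes, and the sentence is shortened from the text analysed later in this tutorial.

```python
import spacy

# Tokenizer-only pipeline, no trained model required.
nlp = spacy.blank("en")
doc = nlp("Mr Obstfeld said the world would become poorer.")

# Doc: the container for the whole analysed text.
print(type(doc).__name__, len(doc))

# Token: one element of the Doc, addressed by position.
token = doc[1]
print(token.text, token.i, token.is_alpha)

# Span: a slice of the Doc that keeps its start and end token offsets.
span = doc[0:2]
print(span.text, span.start, span.end)

# Vocab and StringStore: strings map to hash ids and back again.
word_hash = nlp.vocab.strings["world"]
print(word_hash, nlp.vocab.strings[word_hash])

# Lexeme: the context-independent vocabulary entry behind a word.
lexeme = nlp.vocab["world"]
print(lexeme.text, lexeme.is_alpha)
```

Attributes that depend on a trained model, such as `token.pos_` or `doc.ents`, stay empty with a blank pipeline.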
SpaCy Installation
It is recommended to install SpaCy inside a virtual environment, because the installation also adds trained models to the Python library path, which may require root access on a system-wide Python.
Here are the commands you need to run to install SpaCy in a virtual environment for a project named spacy-demo:
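# Go to the directory where you want to create the project, then create and enter it
mkdir spacy-demo
cd spacy-demo
# Create a virtual environment in a directory named venv
virtualenv venv
# Activate the virtual environment
source ./venv/bin/activate
# Install SpaCy
pip install spacy
# Download the model for your language; this example fetches the small English model.
# Recent SpaCy releases need the full package name (the "en" shortcut was removed in spaCy 3).
python -m spacy download en_core_web_sm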
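After running the commands, it can help to confirm that the packages are visible from the activated environment. The sketch below uses only the standard library. The helper `is_installed` is a hypothetical name introduced for this example, and `en_core_web_sm` is an assumed model package name, so substitute the model you actually downloaded.

```python
import importlib.util


def is_installed(package_name):
    """Return True if the package can be found on the current Python path."""
    return importlib.util.find_spec(package_name) is not None


# "en_core_web_sm" is an assumed model package name; substitute the
# model you actually downloaded.
for name in ("spacy", "en_core_web_sm"):
    status = "found" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```

If either line reports "missing", check that the virtual environment is active in the shell where you run the script.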
SpaCy Hello World Program
Now it's time to write a hello world program using SpaCy. In this program, we will run NLP analysis on the following text:
The US and China's escalation of trade tariffs is expected to hit growth in both countries in 2019, when the boost from President Trump's sweeping tax cuts will also start to wane. Mr Obstfeld said the world would become a "poorer and more dangerous place" unless world leaders worked together to raise living standards, improve education and reduce inequality.
And here is the program that analyzes the above sentences:
# import SpaCy module
import spacy

# load the English language model
# (recent SpaCy releases need the full model name; older ones also accepted 'en')
nlp = spacy.load('en_core_web_sm')

# In Python 3 every str is already Unicode, so no u'' prefix is needed
doc = nlp('The US and China\'s escalation of trade tariffs is expected to hit growth in both countries in 2019'
          ', when the boost from President Trump\'s sweeping tax cuts will also start to wane. '
          'Mr Obstfeld said the world would become a "poorer and more dangerous place" unless world leaders worked together '
          'to raise living standards, improve education and reduce inequality.')

print("\n\nSentences in the analysed text...")
for sentence_span in doc.sents:
    print(sentence_span)

print("\n\nNoun chunks in the analysed text...")
# Noun chunks are helpful where many tokens together make a composite noun
for noun_chunk in doc.noun_chunks:
    print(noun_chunk)

print("\n\nName entities in the analysed text...")
print("%-15s %-15s" % ("Entity Name", "Entity Label"))
print("-----------------------------")
for entity in doc.ents:
    print("%-15s %-15s" % (entity, entity.label_))

print("\n\nTokens and their POS tags in the analysed text...")
print("%-15s %-15s %-15s %-15s" % ("Token", "Token POS Tag", "Token Lemma", "Token Dependency"))
print("-----------------------------------------------------------------")
for token in doc:
    print("%-15s %-15s %-15s %-15s" % (token, token.pos_, token.lemma_, token.dep_))
Here is the output of the above program on my machine. The exact tags, lemmas and entities may differ with other SpaCy and model versions.
Sentences in the analysed text...
The US and China's escalation of trade tariffs is expected to hit growth in both countries in 2019, when the boost from President Trump's sweeping tax cuts will also start to wane.
Mr Obstfeld said the world would become a "poorer and more dangerous place" unless world leaders worked together to raise living standards, improve education and reduce inequality.

Noun chunks in the analysed text...
The US
China's escalation
trade tariffs
growth
both countries
the boost
President Trump's sweeping tax cuts
Mr Obstfeld
the world
a "poorer and more dangerous place
world leaders
living standards
education
inequality

Name entities in the analysed text...
Entity Name     Entity Label
-----------------------------
US              GPE
China           GPE
2019            DATE
Trump           PERSON
Obstfeld        PERSON

Tokens and their POS tags in the analysed text...
Token           Token POS Tag   Token Lemma     Token Dependency
-----------------------------------------------------------------
The             DET             the             det
US              PROPN           us              nsubjpass
and             CCONJ           and             cc
China           PROPN           china           poss
's              PART            's              case
escalation      NOUN            escalation      conj
of              ADP             of              prep
trade           NOUN            trade           compound
tariffs         NOUN            tariff          pobj
is              VERB            be              auxpass
expected        VERB            expect          ROOT
to              PART            to              aux
hit             VERB            hit             xcomp
growth          NOUN            growth          dobj
in              ADP             in              prep
both            DET             both            det
countries       NOUN            country         pobj
in              ADP             in              prep
2019            NUM             2019            pobj
,               PUNCT           ,               punct
when            ADV             when            advmod
the             DET             the             det
boost           NOUN            boost           nsubj
from            ADP             from            prep
President       PROPN           president       compound
Trump           PROPN           trump           poss
's              PART            's              case
sweeping        ADJ             sweeping        amod
tax             NOUN            tax             compound
cuts            NOUN            cut             pobj
will            VERB            will            aux
also            ADV             also            advmod
start           VERB            start           relcl
to              ADP             to              aux
wane            NOUN            wane            xcomp
.               PUNCT           .               punct
Mr              PROPN           mr              compound
Obstfeld        PROPN           obstfeld        nsubj
said            VERB            say             ROOT
the             DET             the             det
world           NOUN            world           nsubj
would           VERB            would           aux
become          VERB            become          ccomp
a               DET             a               det
"               PUNCT           "               punct
poorer          ADJ             poor            amod
and             CCONJ           and             cc
more            ADV             more            advmod
dangerous       ADJ             dangerous       conj
place           NOUN            place           attr
"               PUNCT           "               punct
unless          ADP             unless          mark
world           NOUN            world           compound
leaders         NOUN            leader          nsubj
worked          VERB            work            advcl
together        ADV             together        advmod
to              PART            to              aux
raise           VERB            raise           advcl
living          NOUN            living          compound
standards       NOUN            standard        dobj
,               PUNCT           ,               punct
improve         VERB            improve         conj
education       NOUN            education       dobj
and             CCONJ           and             cc
reduce          VERB            reduce          conj
inequality      NOUN            inequality      dobj
.               PUNCT           .               punct
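Output like this is easier to digest once it is summarised. As a small sketch, the snippet below counts part-of-speech tags with the standard library. The `rows` list is a hand-copied sample of the first six token rows from the output above, used here so that the example runs without a trained model.

```python
from collections import Counter

# (token, POS tag) pairs copied from the first rows of the output above.
rows = [
    ("The", "DET"),
    ("US", "PROPN"),
    ("and", "CCONJ"),
    ("China", "PROPN"),
    ("'s", "PART"),
    ("escalation", "NOUN"),
]

# Count how often each part-of-speech tag occurs.
pos_counts = Counter(pos for _token, pos in rows)
print(pos_counts.most_common(1))

# With a real Doc the same idea would be:
#     Counter(token.pos_ for token in doc)
```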
Thank you for reading through the tutorial. If you have any feedback, questions or concerns, let us know in the comments and we will get back to you as soon as possible.