    Getting Started with NLP using SpaCy Python Library
    Published on: 9th October 2018
    Posted By: Amit Kumar

    This tutorial helps you get started with the SpaCy Python library for NLP. It covers installation, a hello world program and common use cases.

    Natural Language Processing


    Natural Language Processing (NLP) is a discipline of Artificial Intelligence that bridges the communication gap between humans and computers (machines). It deals with the tools, algorithms and libraries that enable computers to extract information from human languages.

    NLP employs various machine learning and deep learning algorithms, for example to tag the parts of speech (nouns, verbs, conjunctions etc.) in sentences.

    While NLP has many use cases, here are some popular ones -

    1. Building virtual assistants such as Apple Siri, Google Home, Amazon Alexa etc.
    2. Topic Modelling - Finding trending topics in text and visualizing them using a word cloud
    3. Web Search

     

    SpaCy Library and Features


    SpaCy is an NLP library for Python that is designed for high accuracy and fast execution. It comes with the following features -

    1. Support for multiple languages such as English, German, Spanish, Portuguese, French, Italian, Dutch etc.
    2. Tagger for annotating part of speech tags on documents
    3. Dependency parser for annotating syntactic dependencies on documents
    4. Entity Recognizer for annotating named entities on documents
    5. Tokenizer for segmenting text into words, punctuation marks etc.
    6. Lemmatizer for assigning base forms to words. E.g. the lemma of the verb "doing" is "do"
    7. Matcher and Phrase Matcher for rule based pattern matching
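
    Two of the features above, the Tokenizer and the Matcher, can be tried without downloading a trained model. The following is a minimal sketch, assuming SpaCy 3.x (where Matcher.add takes a list of patterns); the sample sentence and the rule name TRADE_TARIFFS are made up for illustration.

```python
import spacy
from spacy.matcher import Matcher

# A blank English pipeline contains only the tokenizer,
# so it works without any trained model being installed.
nlp = spacy.blank("en")
doc = nlp("World leaders discussed trade tariffs and tax cuts.")

# Tokenizer: the text is segmented into words and punctuation marks.
tokens = [token.text for token in doc]
print(tokens)

# Matcher: a rule that finds the token "trade" followed by "tariffs".
# LOWER compares the lowercased token text, so the match is case-insensitive.
matcher = Matcher(nlp.vocab)
matcher.add("TRADE_TARIFFS", [[{"LOWER": "trade"}, {"LOWER": "tariffs"}]])

# Each match is a (match_id, start, end) triple of token indices into the Doc.
matched = [doc[start:end].text for _match_id, start, end in matcher(doc)]
print(matched)
```

    The tagger, dependency parser, entity recognizer and lemmatizer need a trained model, so they are not part of this sketch.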

     

    SpaCy Data Structures


    SpaCy comes with the following primitive data structures, or data containers -

    1. Doc: The container for all the annotations that NLP analysis produces for a text.
    2. Token: A single token such as a word or a punctuation mark.
    3. Span: A slice of a Doc, i.e. a contiguous subset of its tokens along with their annotations.
    4. Vocab: A storage class providing access to the vocabulary and other data shared across a language, such as StringStore (a container that maps strings to their hash ids) and Lexeme objects (context-independent entries for word types).
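
    To make the relationship between these containers concrete, here is a small sketch that touches each of them. It assumes a blank English pipeline (tokenizer only, no trained model), and the sample sentence is made up for illustration.

```python
import spacy

nlp = spacy.blank("en")

# Doc: the container returned by calling the pipeline on a text.
doc = nlp("Trade tariffs hit growth.")

# Token: indexing a Doc with an integer gives a single token.
token = doc[1]
print(token.text, token.i, token.is_punct)

# Span: slicing a Doc gives a view over a contiguous run of its tokens.
span = doc[0:2]
print(span.text, span.start, span.end)

# Vocab and StringStore: strings are stored once and referred to by hash id.
tariffs_hash = nlp.vocab.strings["tariffs"]    # string -> hash id
tariffs_text = nlp.vocab.strings[tariffs_hash]  # hash id -> string

# Lexeme: the context-independent vocabulary entry for a word type.
lexeme = nlp.vocab["tariffs"]
print(lexeme.text, lexeme.is_alpha, lexeme.orth == token.orth)
```

    The token and the lexeme share the same hash id because both refer to the same entry in the StringStore.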

     

    SpaCy Installation


    It is recommended to install SpaCy in a virtual environment, because the installation also adds trained models to the Python library path, which may otherwise require root access.

    Here are the commands to install SpaCy in a virtual environment for a project named spacy-demo:

    Ubuntu:

    # Go to the directory where you want to create the project, then create and enter it
    mkdir spacy-demo
    cd spacy-demo
    
    # Install virtualenv package
    sudo apt-get install virtualenv
    
    
    # Create virtual env in directory named venv
    virtualenv venv
    # For Python 3, use this command instead
    virtualenv --python=python3 venv
    
    
    #Activate venv
    source ./venv/bin/activate
    
    # Install SpaCy
    pip install spacy
    
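    # Download the model for your language. The example below does it for English (en).
    # On SpaCy 3.x the "en" shortcut is deprecated; use the full name en_core_web_sm there.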
    python -m spacy download en

    Windows 10:

    # Go to the directory where you want to create the project, then create and enter it
    mkdir spacy-demo
    cd spacy-demo
    
    # Install virtualenv package
    pip install -U virtualenv
    #For Python 3
    pip3 install -U virtualenv
    
    # Create virtual env in directory named venv
    virtualenv venv
    # For Python 3, use this command instead
    virtualenv --python=python3 venv
    
    #Activate venv
    venv\Scripts\activate
    
    # Install SpaCy. This command may fail if "Microsoft Visual C++ Compiler for Python" is not installed. In that case, install it using the link provided in the error message.
    pip install -U spacy
    
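    # Download the model for your language. The example below does it for English (en).
    # On SpaCy 3.x the "en" shortcut is deprecated; use the full name en_core_web_sm there.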
    python -m spacy download en

    On Windows, copy the contents of the directory venv\Lib\site-packages\en_core_web_sm to venv\Lib\site-packages\spacy\data\en, as soft linking is not possible there.
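
    A quick way to confirm that the installation steps worked is to ask Python whether it can find the packages. The sketch below uses only the standard library. The helper name spacy_install_status is made up for illustration, and it assumes the English model is installed as the package en_core_web_sm (the directory name mentioned above).

```python
import importlib.util


def spacy_install_status(model_name="en_core_web_sm"):
    """Return (spacy_found, model_found) for the active Python environment.

    Nothing is imported, so the check does not fail when a package is missing.
    """
    spacy_found = importlib.util.find_spec("spacy") is not None
    model_found = importlib.util.find_spec(model_name) is not None
    return spacy_found, model_found


spacy_found, model_found = spacy_install_status()
print("SpaCy installed:        ", spacy_found)
print("English model installed:", model_found)
```

    Run it with the virtual environment activated. If either line prints False, repeat the corresponding install or download step.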

     

    SpaCy Hello World Program


    Now it's time to write a hello world style program using SpaCy. In this program, we will run NLP analysis on the following text -

    The US and China's escalation of trade tariffs is expected to hit growth in both countries in 2019, when the boost from President Trump's sweeping tax cuts will also start to wane.
    
    Mr Obstfeld said the world would become a "poorer and more dangerous place" unless world leaders worked together to raise living standards, improve education and reduce inequality.

    Assuming you are in the directory spacy-demo, create a file spacy-demo.py with the following contents -

    # import SpaCy module
    import spacy
    
    # load English language model
    # (on SpaCy 3.x the 'en' shortcut no longer works; load 'en_core_web_sm' there)
    nlp = spacy.load('en')
    
    # Text needs to be a unicode string (the u prefix only matters on Python 2)
    doc = nlp(u'The US and China\'s escalation of trade tariffs is expected to hit growth in both countries in 2019'
        u', when the boost from President Trump\'s sweeping tax cuts will also start to wane. '
        u'Mr Obstfeld said the world would become a "poorer and more dangerous place" unless world leaders worked together '
        u'to raise living standards, improve education and reduce inequality.')
    
    print("\n\nSentences in the analysed text...")
    for sentence_span in doc.sents:
        print(sentence_span)
    
    print("\n\nNoun chunks in the analysed text...")
    # Noun chunks are helpful where many tokens together make a composite noun
    for noun_chunk in doc.noun_chunks:
        print(noun_chunk)
    
    print("\n\nName entities in the analysed text...")
    print("%-15s %-15s" % ("Entity Name", "Entity Label"))
    print("-----------------------------")
    for entity in doc.ents:
        print("%-15s %-15s" % (entity, entity.label_))
    
    print("\n\nTokens and their POS tags in the analysed text...")
    print("%-15s %-15s %-15s %-15s" % ("Token", "Token POS Tag", "Token Lemma", "Token Dependency"))
    print("-----------------------------------------------------------------")
    for token in doc:
        print("%-15s %-15s %-15s %-15s" % (token, token.pos_, token.lemma_, token.dep_))
    

    Run the file with the following command -

    # Ensure that the virtual env is activated using "venv\Scripts\activate" on Windows or "source venv/bin/activate" on Linux
    python spacy-demo.py

    Here is the output of the above program on my machine -

    
    Sentences in the analysed text...
    The US and China's escalation of trade tariffs is expected to hit growth in both countries in 2019, when the boost from President Trump's sweeping tax cuts will also start to wane.
    Mr Obstfeld said the world would become a "poorer and more dangerous place" unless world leaders worked together to raise living standards, improve education and reduce inequality.
    
    
    Noun chunks in the analysed text...
    The US
    China's escalation
    trade tariffs
    growth
    both countries
    the boost
    President Trump's sweeping tax cuts
    Mr Obstfeld
    the world
    a "poorer and more dangerous place
    world leaders
    living standards
    education
    inequality
    
    
    Name entities in the analysed text...
    Entity Name     Entity Label   
    -----------------------------
    US              GPE            
    China           GPE            
    2019            DATE           
    Trump           PERSON         
    Obstfeld        PERSON         
    
    
    Tokens and their POS tags in the analysed text...
    Token           Token POS Tag   Token Lemma     Token Dependency
    -----------------------------------------------------------------
    The             DET             the             det            
    US              PROPN           us              nsubjpass      
    and             CCONJ           and             cc             
    China           PROPN           china           poss           
    's              PART            's              case           
    escalation      NOUN            escalation      conj           
    of              ADP             of              prep           
    trade           NOUN            trade           compound       
    tariffs         NOUN            tariff          pobj           
    is              VERB            be              auxpass        
    expected        VERB            expect          ROOT           
    to              PART            to              aux            
    hit             VERB            hit             xcomp          
    growth          NOUN            growth          dobj           
    in              ADP             in              prep           
    both            DET             both            det            
    countries       NOUN            country         pobj           
    in              ADP             in              prep           
    2019            NUM             2019            pobj           
    ,               PUNCT           ,               punct          
    when            ADV             when            advmod         
    the             DET             the             det            
    boost           NOUN            boost           nsubj          
    from            ADP             from            prep           
    President       PROPN           president       compound       
    Trump           PROPN           trump           poss           
    's              PART            's              case           
    sweeping        ADJ             sweeping        amod           
    tax             NOUN            tax             compound       
    cuts            NOUN            cut             pobj           
    will            VERB            will            aux            
    also            ADV             also            advmod         
    start           VERB            start           relcl          
    to              ADP             to              aux            
    wane            NOUN            wane            xcomp          
    .               PUNCT           .               punct          
    Mr              PROPN           mr              compound       
    Obstfeld        PROPN           obstfeld        nsubj          
    said            VERB            say             ROOT           
    the             DET             the             det            
    world           NOUN            world           nsubj          
    would           VERB            would           aux            
    become          VERB            become          ccomp          
    a               DET             a               det            
    "               PUNCT           "               punct          
    poorer          ADJ             poor            amod           
    and             CCONJ           and             cc             
    more            ADV             more            advmod         
    dangerous       ADJ             dangerous       conj           
    place           NOUN            place           attr           
    "               PUNCT           "               punct          
    unless          ADP             unless          mark           
    world           NOUN            world           compound       
    leaders         NOUN            leader          nsubj          
    worked          VERB            work            advcl          
    together        ADV             together        advmod         
    to              PART            to              aux            
    raise           VERB            raise           advcl          
    living          NOUN            living          compound       
    standards       NOUN            standard        dobj           
    ,               PUNCT           ,               punct          
    improve         VERB            improve         conj           
    education       NOUN            education       dobj           
    and             CCONJ           and             cc             
    reduce          VERB            reduce          conj           
    inequality      NOUN            inequality      dobj           
    .               PUNCT           .               punct          
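
    Instead of printing them, the annotations can also be collected into ordinary Python data structures for further processing. The sketch below groups named entities by their label. The helper name entities_by_label is made up for illustration, and the demo uses a blank pipeline with hand-assigned entities so that it runs without a trained model. With the doc from the program above, the entity recognizer fills doc.ents and the helper can be called on it directly.

```python
from collections import defaultdict

import spacy
from spacy.tokens import Span


def entities_by_label(doc):
    """Group the text of each named entity in `doc` under its label."""
    grouped = defaultdict(list)
    for entity in doc.ents:
        grouped[entity.label_].append(entity.text)
    return dict(grouped)


# Hand-labelled stand-in for a document analysed by a trained model.
nlp = spacy.blank("en")
doc = nlp("The US and China expect slower growth in 2019")
doc.ents = [
    Span(doc, 1, 2, label="GPE"),   # token 1: "US"
    Span(doc, 3, 4, label="GPE"),   # token 3: "China"
    Span(doc, 8, 9, label="DATE"),  # token 8: "2019"
]

grouped = entities_by_label(doc)
print(grouped)
```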
    

     

    Thank you for reading through the tutorial. If you have any feedback, questions or concerns, please share them in the comments and we will get back to you as soon as possible.
