General Overview of Google NLP

Google Natural Language API comprises five different services:

  • Syntax Analysis
  • Sentiment Analysis
  • Entity Analysis
  • Entity Sentiment Analysis
  • Text Classification

Syntax Analysis

For a given text, Google’s syntax analysis will return a breakdown of all words with a rich set of linguistic information for each token. The information can be divided into two parts:

  1. Part of speech. This part contains information about the morphology of each token. For each word, a fine-grained analysis is returned containing its type (noun, verb, etc.), gender, grammatical case, tense, grammatical mood, grammatical voice, and much more.
    Example sentence: “A computer once beat me at chess, but it was no match for me at kickboxing.”

    'A' tag: DET
    'computer' tag: NOUN number: SINGULAR
    'once' tag: ADV
    'beat' tag: VERB mood: INDICATIVE tense: PAST
    'me' tag: PRON case: ACCUSATIVE number: SINGULAR person: FIRST
    'at' tag: ADP
    'chess' tag: NOUN number: SINGULAR
    ',' tag: PUNCT
    'but' tag: CONJ
    'it' tag: PRON case: NOMINATIVE gender: NEUTER number: SINGULAR person: THIRD
    'was' tag: VERB mood: INDICATIVE number: SINGULAR person: THIRD tense: PAST
    'no' tag: DET
    'match' tag: NOUN number: SINGULAR
    'for' tag: ADP
    'kick' tag: NOUN number: SINGULAR
    'boxing' tag: NOUN number: SINGULAR
    '.' tag: PUNCT
  2. Dependency trees. The second part of the return is called a dependency tree, which describes the syntactic structure of each sentence. 


Sentiment Analysis

Google’s sentiment analysis will provide the prevailing emotional opinion within a provided text. The API returns two values: The “score” describes the emotional leaning of the text from -1 (negative) to +1 (positive), with 0 being neutral.

The “magnitude” measures the strength of the emotion.

Input Sentence | Sentiment Results | Interpretation
The train to London leaves at four o'clock | Score: 0.0 Magnitude: 0.0 | A completely neutral statement, which doesn't contain any emotion at all.
This blog post is good. | Score: 0.7 Magnitude: 0.7 | A positive sentiment, but not expressed very strongly.
This blog post is good. It was very helpful. The author is amazing. | Score: 0.7 Magnitude: 2.3 | The same sentiment, but expressed much more strongly.
This blog post is very good. This author is a horrible writer usually, but here he got lucky. | Score: 0.0 Magnitude: 1.6 | The magnitude shows us that there are emotions expressed in this text, but the score shows that they are mixed and not clearly positive or negative.
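
The way the two values are read together in the table above can be sketched as a small helper. This is only an illustration: the function name and both thresholds (0.25 for the score, 1.0 for the magnitude) are assumptions chosen for this example and are not defined by the API.

```python
def interpret_sentiment(score: float, magnitude: float,
                        score_threshold: float = 0.25,
                        magnitude_threshold: float = 1.0) -> str:
    """Map a (score, magnitude) pair to a coarse label.

    Both thresholds are hypothetical and should be tuned per use case.
    """
    if score >= score_threshold:
        return "positive"
    if score <= -score_threshold:
        return "negative"
    # Score is close to zero: the magnitude separates "no emotion"
    # from "emotions present, but mixed".
    if magnitude >= magnitude_threshold:
        return "mixed"
    return "neutral"


# The four (score, magnitude) pairs from the table above.
for score, magnitude in [(0.0, 0.0), (0.7, 0.7), (0.7, 2.3), (0.0, 1.6)]:
    print(score, magnitude, interpret_sentiment(score, magnitude))
```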


Entity Analysis

Entity Analysis is the process of detecting known entities like public figures or landmarks from a given text. Entity detection is very helpful for all kinds of classification and topic modeling tasks.

A salience score is calculated. This score for an entity provides information about the importance or centrality of that entity to the entire document text. 

Example: “Robert DeNiro spoke to Martin Scorsese in Hollywood on Christmas Eve in December 2011.”

Detected Entity | Additional Information
Robert De Niro | type: PERSON salience: 0.5869118 wikipedia_url: https://en.wikipedia.org/wiki/Robert_De_Niro
Hollywood | type: LOCATION salience: 0.17918482 wikipedia_url: https://en.wikipedia.org/wiki/Hollywood
Martin Scorsese | type: LOCATION salience: 0.17712952 wikipedia_url: https://en.wikipedia.org/wiki/Martin_Scorsese
Christmas Eve | type: PERSON salience: 0.056773853 wikipedia_url: https://en.wikipedia.org/wiki/Christmas
December 2011 | type: DATE Year: 2011 Month: 12 salience: 0.0 wikipedia_url: -
2011 | type: NUMBER salience: 0.0 wikipedia_url: -


Entity Sentiment Analysis

If there are models for entity detection and sentiment analysis, it’s only natural to go a step further and combine them to detect the prevailing emotions towards the different entities in a text.

Example: “The author is a horrible writer. The reader is very intelligent on the other hand.”

Entity | Sentiment
author | Salience: 0.8773350715637207 Sentiment: magnitude: 1.899999976158142 score: -0.8999999761581421
reader | Salience: 0.08653714507818222 Sentiment: magnitude: 0.8999999761581421 score: 0.8999999761581421


Text Classification

Text classification assigns the input document to one or more of a large set of categories. The categories are structured hierarchically, e.g. the category “Hobbies & Leisure” has several sub-categories, one of which is “Hobbies & Leisure/Outdoors”, which itself has sub-categories like “Hobbies & Leisure/Outdoors/Fishing”.

Example: “The D3500’s large 24.2 MP DX-format sensor captures richly detailed photos and Full HD movies—even when you shoot in low light. Combined with the rendering power of your NIKKON lens, you can start creating artistic portraits with smooth background blur. With ease.”

Category | Confidence
Arts & Entertainment/Visual Art & Design/Photographic & Digital Arts | 0.95
Hobbies & Leisure | 0.94
Computers & Electronics/Consumer Electronics/Camera & Photo Equipment | 0.85
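
Because the category names encode the hierarchy as a slash-separated path, the levels can be recovered with plain string handling. The sketch below is illustrative only; both helper names are hypothetical and not part of the API or of wrangler.

```python
from typing import List


def category_levels(name: str) -> List[str]:
    """Split a category path into its levels, ignoring a leading slash."""
    return [part for part in name.split("/") if part]


def category_ancestors(name: str) -> List[str]:
    """Return the category itself preceded by all of its parent categories."""
    levels = category_levels(name)
    return ["/".join(levels[:i]) for i in range(1, len(levels) + 1)]


print(category_levels("Hobbies & Leisure/Outdoors/Fishing"))
print(category_ancestors("/Internet & Telecom/Mobile & Wireless"))
```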

Annotate text

A convenience method that provides all the features that analyzeSentiment, analyzeEntities, and analyzeSyntax provide in one call.

https://cloud.google.com/natural-language/docs/reference/rest/v1/documents/annotateText

Directives Syntax

Directives are named the same way as the commands in the gcloud CLI: https://cloud.google.com/sdk/gcloud/reference/ml/language/ but with an nlp prefix (please let me know if the prefix is needed):

...

language: can be set as part of the API call (for example 'en', 'ja', etc.). If not set, the NLP server will try to detect the language automatically.
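
As a rough illustration of the optional language setting, the sketch below builds a request body in the shape used by the Natural Language REST API (a document with type, content, and an optional language, plus an encodingType). How the directive actually builds its request is not specified in this document, so the helper name and its use here are assumptions.

```python
import json
from typing import Any, Dict, Optional


def build_request_body(text: str, language: Optional[str] = None) -> Dict[str, Any]:
    """Build a request body; the language is omitted so the server can detect it."""
    document: Dict[str, Any] = {"type": "PLAIN_TEXT", "content": text}
    if language is not None:
        # Only sent when the user set it explicitly.
        document["language"] = language
    return {"document": document, "encodingType": "UTF8"}


print(json.dumps(build_request_body("Enjoy your vacation!"), indent=2))
print(json.dumps(build_request_body("Enjoy your vacation!", "en"), indent=2))
```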

Authentication file

Example of an authentication file (a.k.a. service account key JSON):

...

When the file parameter is not provided, the path is taken from environment variables. This is how it will work in the GCP case, which is the most common one.
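
The fallback described above could look roughly like the following. The document does not name the environment variable; GOOGLE_APPLICATION_CREDENTIALS, the variable commonly used by Google Cloud client libraries, is assumed here, and the helper name is hypothetical.

```python
import os
from typing import Mapping, Optional


def resolve_key_path(explicit_path: Optional[str] = None,
                     env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Prefer the explicit file argument, else fall back to the environment."""
    if explicit_path:
        return explicit_path
    # Assumed variable name; returns None when it is not set.
    return env.get("GOOGLE_APPLICATION_CREDENTIALS")


print(resolve_key_path("service_account_key.json"))
print(resolve_key_path(None, {"GOOGLE_APPLICATION_CREDENTIALS": "/secrets/key.json"}))
```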

Return value of directives

Directives will return a JSON string. The returned data can be very complex, and it would be troublesome for the user to express it as a field made of multiple layers of nested CDAP schema maps, arrays, and even records, which would be specific and different for every nlp command.

...

https://cloud.google.com/natural-language/docs/analyzing-entities#language-entities-string-gcloud

Implementation

  • Since the results of NLP responses can change, we should check only very general properties rather than details.

Examples

Example 1. Syntax analysis

No Format
nlp-analyze-syntax body result service_account_key.json

Body is "Google, headquartered in Mountain View, unveiled the new Android phone at the Consumer Electronic Show. Sundar Pichai said in his keynote that users love their new Android phones."

Result is a string column, populated with:

No Format
{
  "sentences": [
    {
      "text": {
        "content": "Google, headquartered in Mountain View, unveiled the new Android phone at the Consumer Electronic Show.",
        "beginOffset": 0
      }
    },
    {
      "text": {
        "content": "Sundar Pichai said in his keynote that users love their new Android phones.",
        "beginOffset": 105
      }
    }
  ],
  "tokens": [
    {
      "text": {
        "content": "Google",
        "beginOffset": 0
      },
      "partOfSpeech": {
        "tag": "NOUN",
        "aspect": "ASPECT_UNKNOWN",
        "case": "CASE_UNKNOWN",
        "form": "FORM_UNKNOWN",
        "gender": "GENDER_UNKNOWN",
        "mood": "MOOD_UNKNOWN",
        "number": "SINGULAR",
        "person": "PERSON_UNKNOWN",
        "proper": "PROPER",
        "reciprocity": "RECIPROCITY_UNKNOWN",
        "tense": "TENSE_UNKNOWN",
        "voice": "VOICE_UNKNOWN"
      },
      "dependencyEdge": {
        "headTokenIndex": 7,
        "label": "NSUBJ"
      },
      "lemma": "Google"
    },
    ...
    {
      "text": {
        "content": ".",
        "beginOffset": 179
      },
      "partOfSpeech": {
        "tag": "PUNCT",
        "aspect": "ASPECT_UNKNOWN",
        "case": "CASE_UNKNOWN",
        "form": "FORM_UNKNOWN",
        "gender": "GENDER_UNKNOWN",
        "mood": "MOOD_UNKNOWN",
        "number": "NUMBER_UNKNOWN",
        "person": "PERSON_UNKNOWN",
        "proper": "PROPER_UNKNOWN",
        "reciprocity": "RECIPROCITY_UNKNOWN",
        "tense": "TENSE_UNKNOWN",
        "voice": "VOICE_UNKNOWN"
      },
      "dependencyEdge": {
        "headTokenIndex": 20,
        "label": "P"
      },
      "lemma": "."
    }
  ],
  "language": "en"
}

Example 2. Sentiment analysis

No Format
nlp-analyze-sentiment body result service_account_key.json


Body is "Enjoy your vacation!"

Result is a string column, populated with:

No Format
{
  "documentSentiment": {
    "magnitude": 0.8,
    "score": 0.8
  },
  "language": "en",
  "sentences": [
    {
      "text": {
        "content": "Enjoy your vacation!",
        "beginOffset": 0
      },
      "sentiment": {
        "magnitude": 0.8,
        "score": 0.8
      }
    }
  ]
}
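
Since the result column holds a JSON string, any downstream code has to parse it before reading values. A minimal sketch using only the standard library, with the result shown above pasted in as a literal:

```python
import json

# The string stored in the "result" column for "Enjoy your vacation!".
result = (
    '{"documentSentiment": {"magnitude": 0.8, "score": 0.8}, "language": "en", '
    '"sentences": [{"text": {"content": "Enjoy your vacation!", "beginOffset": 0}, '
    '"sentiment": {"magnitude": 0.8, "score": 0.8}}]}'
)

parsed = json.loads(result)

# Document-level sentiment.
document_score = parsed["documentSentiment"]["score"]

# Per-sentence sentiment as (sentence text, score) pairs.
sentence_scores = [
    (sentence["text"]["content"], sentence["sentiment"]["score"])
    for sentence in parsed["sentences"]
]

print(document_score)
print(sentence_scores)
```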

Example 3. Entity analysis

No Format
nlp-analyze-entities body result service_account_key.json


Body is "President Trump will speak from the White House, located at 1600 Pennsylvania Ave NW, Washington, DC, on October 7."

Result is a string column, populated with:

No Format
{
  "entities": [
    {
      "name": "Trump",
      "type": "PERSON",
      "metadata": {
        "mid": "/m/0cqt90",
        "wikipedia_url": "https://en.wikipedia.org/wiki/Donald_Trump"
      },
      "salience": 0.7936003,
      "mentions": [
        {
          "text": {
            "content": "Trump",
            "beginOffset": 10
          },
          "type": "PROPER"
        },
        {
          "text": {
            "content": "President",
            "beginOffset": 0
          },
          "type": "COMMON"
        }
      ]
    },
    {
      "name": "White House",
      "type": "LOCATION",
      "metadata": {
        "mid": "/m/081sq",
        "wikipedia_url": "https://en.wikipedia.org/wiki/White_House"
      },
      "salience": 0.09172433,
      "mentions": [
        {
          "text": {
            "content": "White House",
            "beginOffset": 36
          },
          "type": "PROPER"
        }
      ]
    },
    {
      "name": "Pennsylvania Ave NW",
      "type": "LOCATION",
      "metadata": {
        "mid": "/g/1tgb87cq"
      },
      "salience": 0.085507184,
      "mentions": [
        {
          "text": {
            "content": "Pennsylvania Ave NW",
            "beginOffset": 65
          },
          "type": "PROPER"
        }
      ]
    },
    {
      "name": "Washington, DC",
      "type": "LOCATION",
      "metadata": {
        "mid": "/m/0rh6k",
        "wikipedia_url": "https://en.wikipedia.org/wiki/Washington,_D.C."
      },
      "salience": 0.029168168,
      "mentions": [
        {
          "text": {
            "content": "Washington, DC",
            "beginOffset": 86
          },
          "type": "PROPER"
        }
      ]
    },
    {
      "name": "1600 Pennsylvania Ave NW, Washington, DC",
      "type": "ADDRESS",
      "metadata": {
        "country": "US",
        "sublocality": "Fort Lesley J. McNair",
        "locality": "Washington",
        "street_name": "Pennsylvania Avenue Northwest",
        "broad_region": "District of Columbia",
        "narrow_region": "District of Columbia",
        "street_number": "1600"
      },
      "salience": 0,
      "mentions": [
        {
          "text": {
            "content": "1600 Pennsylvania Ave NW, Washington, DC",
            "beginOffset": 60
          },
          "type": "TYPE_UNKNOWN"
        }
      ]
    }
    ...
  ],
  "language": "en"
}

Example 4. Entity sentiment analysis

No Format
nlp-analyze-entity-sentiment body result service_account_key.json


Body is "I love R&B music. Marvin Gaye is the best. 'What's Going On' is one of my favorite songs. It was so sad when Marvin Gaye died."

Result is a string column, populated with:

No Format
{
   "entities":[
      {
         "mentions":[
            {
               "sentiment":{
                  "magnitude":0.9,
                  "score":0.9
               },
               "text":{
                  "beginOffset":7,
                  "content":"R&B music"
               },
               "type":"COMMON"
            }
         ],
         "metadata":{

         },
         "name":"R&B music",
         "salience":0.5597628,
         "sentiment":{
            "magnitude":0.9,
            "score":0.9
         },
         "type":"WORK_OF_ART"
      },
      {
         "mentions":[
            {
               "sentiment":{
                  "magnitude":0.8,
                  "score":0.8
               },
               "text":{
                  "beginOffset":18,
                  "content":"Marvin Gaye"
               },
               "type":"PROPER"
            },
            {
               "sentiment":{
                  "magnitude":0.1,
                  "score":-0.1
               },
               "text":{
                  "beginOffset":109,
                  "content":"Marvin Gaye"
               },
               "type":"PROPER"
            }
         ],
         "metadata":{
            "mid":"/m/012z8_",
            "wikipedia_url":"https://en.wikipedia.org/wiki/Marvin_Gaye"
         },
         "name":"Marvin Gaye",
         "salience":0.18719898,
         "sentiment":{
            "magnitude":1.0,
            "score":0.3
         },
         "type":"PERSON"
      }
      ...
   ],
   "language":"en"
}


Example 5. Classify content:

No Format
nlp-classify-text body result service_account_key.json

Body is "Google, headquartered in Mountain View, unveiled the new Android phone at the Consumer Electronic Show. Sundar Pichai said in his keynote that users love their new Android phones."

Result is a string column, populated with:

No Format
{
   "categories":[
      {
         "confidence":0.61,
         "name":"/Computers & Electronics"
      },
      {
         "confidence":0.53,
         "name":"/Internet & Telecom/Mobile & Wireless"
      },
      {
         "confidence":0.53,
         "name":"/News"
      }
   ]
}


Example 6. Annotate text:

No Format
nlp-annotate-text body result service_account_key.json

Result is a string column, populated with a JSON that combines the data from all of the JSONs above.

Example of user scenarios (getting information from JSON)

We can put these scenarios into 3 categories.

1) The result of the scenario is a single value.

Much of this can be done via json-path. Note that json-path supports conditional gets and nested queries (json-path expressions inside one another), which makes it pretty flexible.

2) The result of the scenario is a list.

Json-path can return lists, so many of these can be implemented as well.

3) The result of the scenario is a map.

I don't think this is viable via wrangler, since it would probably require loops. The user will have to pass the result to something like a Python transform and write custom handling code to do whatever is needed.
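
For the map case, the custom handling code could be as small as the sketch below, which builds an entity name to sentiment score map from an entity sentiment result. The function name is hypothetical, and the sample string is a trimmed version of the entity sentiment JSON shown in the examples in this document.

```python
import json
from typing import Dict


def entity_sentiment_map(result_json: str) -> Dict[str, float]:
    """Map each entity name to its overall sentiment score."""
    parsed = json.loads(result_json)
    return {
        entity["name"]: entity["sentiment"]["score"]
        for entity in parsed.get("entities", [])
    }


# Trimmed sample: only the fields used by the helper are kept.
sample = json.dumps({
    "entities": [
        {"name": "R&B music", "sentiment": {"magnitude": 0.9, "score": 0.9}},
        {"name": "Marvin Gaye", "sentiment": {"magnitude": 1.0, "score": 0.3}},
    ],
    "language": "en",
})

print(entity_sentiment_map(sample))
```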


Example 1. Take all the nouns from the sentence.

Example sentence: "Google, headquartered in Mountain View, unveiled the new Android phone at the Consumer Electronic Show. Sundar Pichai said in his keynote that users love their new Android phones."

Wrangler directives:

No Format
nlp-analyze-syntax body result service_account_key.json
json-path result nounsList "$.tokens[?(@.partOfSpeech.tag == 'NOUN')]"

Json:

No Format
{
  "sentences": [
    {
      "text": {
        "content": "Google, headquartered in Mountain View, unveiled the new Android phone at the Consumer Electronic Show.",
        "beginOffset": 0
      }
    },
    {
      "text": {
        "content": "Sundar Pichai said in his keynote that users love their new Android phones.",
        "beginOffset": 105
      }
    }
  ],
  "tokens": [
    {
      "text": {
        "content": "Google",
        "beginOffset": 0
      },
      "partOfSpeech": {
        "tag": "NOUN",
        "aspect": "ASPECT_UNKNOWN",
        "case": "CASE_UNKNOWN",
        "form": "FORM_UNKNOWN",
        "gender": "GENDER_UNKNOWN",
        "mood": "MOOD_UNKNOWN",
        "number": "SINGULAR",
        "person": "PERSON_UNKNOWN",
        "proper": "PROPER",
        "reciprocity": "RECIPROCITY_UNKNOWN",
        "tense": "TENSE_UNKNOWN",
        "voice": "VOICE_UNKNOWN"
      },
      "dependencyEdge": {
        "headTokenIndex": 7,
        "label": "NSUBJ"
      },
      "lemma": "Google"
    },
    ...
    {
      "text": {
        "content": ".",
        "beginOffset": 179
      },
      "partOfSpeech": {
        "tag": "PUNCT",
        "aspect": "ASPECT_UNKNOWN",
        "case": "CASE_UNKNOWN",
        "form": "FORM_UNKNOWN",
        "gender": "GENDER_UNKNOWN",
        "mood": "MOOD_UNKNOWN",
        "number": "NUMBER_UNKNOWN",
        "person": "PERSON_UNKNOWN",
        "proper": "PROPER_UNKNOWN",
        "reciprocity": "RECIPROCITY_UNKNOWN",
        "tense": "TENSE_UNKNOWN",
        "voice": "VOICE_UNKNOWN"
      },
      "dependencyEdge": {
        "headTokenIndex": 20,
        "label": "P"
      },
      "lemma": "."
    }
  ],
  "language": "en"
}

Result is ['Google', 'Mountain View', 'Android', ...]
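
For comparison, the same noun filter can be written in plain Python, for instance inside a Python transform. This is a sketch and not the wrangler implementation; the function name is hypothetical and the sample keeps only two tokens, with the fields the filter needs.

```python
import json
from typing import List


def noun_tokens(result_json: str) -> List[str]:
    """Return the text of every token whose part-of-speech tag is NOUN."""
    parsed = json.loads(result_json)
    return [
        token["text"]["content"]
        for token in parsed.get("tokens", [])
        if token["partOfSpeech"]["tag"] == "NOUN"
    ]


# Trimmed sample with the first and last token of the syntax result.
sample = json.dumps({
    "tokens": [
        {"text": {"content": "Google", "beginOffset": 0},
         "partOfSpeech": {"tag": "NOUN"}},
        {"text": {"content": ".", "beginOffset": 179},
         "partOfSpeech": {"tag": "PUNCT"}},
    ],
    "language": "en",
})

print(noun_tokens(sample))
```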

Example 2. Get the most important entity (has maximum salience)

Example sentence: "President Trump will speak from the White House, located at 1600 Pennsylvania Ave NW, Washington, DC, on October 7."

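No Format
nlp-analyze-entity-sentiment body result service_account_key.json
json-path result positive_entities "$.entities[?(@.salience == $['.entities.salience.max()'])].name"

Json:

No Format
{
  "entities": [
    {
      "name": "Trump",
      "type": "PERSON",
      "metadata": {
        "mid": "/m/0cqt90",
        "wikipedia_url": "https://en.wikipedia.org/wiki/Donald_Trump"
      },
      "salience": 0.7936003,
      "mentions": [
        {
          "text": {
            "content": "Trump",
            "beginOffset": 10
          },
          "type": "PROPER"
        },
        {
          "text": {
            "content": "President",
            "beginOffset": 0
          },
          "type": "COMMON"
        }
      ]
    },
    {
      "name": "White House",
      "type": "LOCATION",
      "metadata": {
        "mid": "/m/081sq",
        "wikipedia_url": "https://en.wikipedia.org/wiki/White_House"
      },
      "salience": 0.09172433,
      "mentions": [
        {
          "text": {
            "content": "White House",
            "beginOffset": 36
          },
          "type": "PROPER"
        }
      ]
    },
    {
      "name": "Pennsylvania Ave NW",
      "type": "LOCATION",
      "metadata": {
        "mid": "/g/1tgb87cq"
      },
      "salience": 0.085507184,
      "mentions": [
        {
          "text": {
            "content": "Pennsylvania Ave NW",
            "beginOffset": 65
          },
          "type": "PROPER"
        }
      ]
    },
    {
      "name": "Washington, DC",
      "type": "LOCATION",
      "metadata": {
        "mid": "/m/0rh6k",
        "wikipedia_url": "https://en.wikipedia.org/wiki/Washington,_D.C."
      },
      "salience": 0.029168168,
      "mentions": [
        {
          "text": {
            "content": "Washington, DC",
            "beginOffset": 86
          },
          "type": "PROPER"
        }
      ]
    },
    {
      "name": "1600 Pennsylvania Ave NW, Washington, DC",
      "type": "ADDRESS",
      "metadata": {
        "country": "US",
        "sublocality": "Fort Lesley J. McNair",
        "locality": "Washington",
        "street_name": "Pennsylvania Avenue Northwest",
        "broad_region": "District of Columbia",
        "narrow_region": "District of Columbia",
        "street_number": "1600"
      },
      "salience": 0,
      "mentions": [
        {
          "text": {
            "content": "1600 Pennsylvania Ave NW, Washington, DC",
            "beginOffset": 60
          },
          "type": "TYPE_UNKNOWN"
        }
      ]
    }
    ...
  ],
  "language": "en"
}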

Will return 'Trump'.
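
The maximum-salience lookup can also be sketched in plain Python, which may help when checking what the json-path expression is expected to return. The function name is hypothetical and the sample is trimmed to the two most salient entities of the result.

```python
import json
from typing import Optional


def most_salient_entity(result_json: str) -> Optional[str]:
    """Return the name of the entity with the highest salience, if any."""
    entities = json.loads(result_json).get("entities", [])
    if not entities:
        return None
    return max(entities, key=lambda entity: entity.get("salience", 0.0))["name"]


# Trimmed sample: only name and salience are kept.
sample = json.dumps({
    "entities": [
        {"name": "Trump", "salience": 0.7936003},
        {"name": "White House", "salience": 0.09172433},
    ],
    "language": "en",
})

print(most_salient_entity(sample))
```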


Flattened version of JSON for transform

We need to reduce the nesting in the JSON results in order to transform them into CDAP records correctly. Here is how it's done:

1. Syntax analysis

No Format
{
  "sentences": [
    {
      "text": {
        "content": "Google, headquartered in Mountain View, unveiled the new Android phone at the Consumer Electronic Show.",
        "beginOffset": 0
      }
    },
    {
      "text": {
        "content": "Sundar Pichai said in his keynote that users love their new Android phones.",
        "beginOffset": 105
      }
    }
  ],
  "tokens": [
    {
      "text": {
        "content": "Google",
        "beginOffset": 0
      },
      "partOfSpeech": {
        "tag": "NOUN",
        "aspect": "ASPECT_UNKNOWN",
        "case": "CASE_UNKNOWN",
        "form": "FORM_UNKNOWN",
        "gender": "GENDER_UNKNOWN",
        "mood": "MOOD_UNKNOWN",
        "number": "SINGULAR",
        "person": "PERSON_UNKNOWN",
        "proper": "PROPER",
        "reciprocity": "RECIPROCITY_UNKNOWN",
        "tense": "TENSE_UNKNOWN",
        "voice": "VOICE_UNKNOWN"
      },
      "dependencyEdge": {
        "headTokenIndex": 7,
        "label": "NSUBJ"
      },
      "lemma": "Google"
    }
  ],
  "language": "en"
}

Flattened:

No Format
{
  "sentences": [
    { # record.
      "content": "Google, headquartered in Mountain View, unveiled the new Android phone at the Consumer Electronic Show.",
      "beginOffset": 0
    },
    {
      "content": "Sundar Pichai said in his keynote that users love their new Android phones.",
      "beginOffset": 105
    }
  ],
  "tokens": [
    { # record.
      "content": "Google",
      "beginOffset": 0,
      "tag": "NOUN",
      "aspect": "ASPECT_UNKNOWN",
      "case": "CASE_UNKNOWN",
      "speechForm": "FORM_UNKNOWN",
      "gender": "GENDER_UNKNOWN",
      "mood": "MOOD_UNKNOWN",
      "number": "SINGULAR",
      "person": "PERSON_UNKNOWN",
      "proper": "PROPER",
      "reciprocity": "RECIPROCITY_UNKNOWN",
      "tense": "TENSE_UNKNOWN",
      "voice": "VOICE_UNKNOWN",
      "dependencyEdgeHeadTokenIndex": 7,
      "dependencyEdgeLabel": "NSUBJ",
      "lemma": "Google"
    }
  ],
  "language": "en"
}
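
The token flattening above can be expressed as a short function. This is a sketch of the transformation as it appears in the example (text fields lifted to the top level, form renamed to speechForm, dependency edge fields prefixed); the function name is hypothetical and the real transform may differ.

```python
from typing import Any, Dict


def flatten_token(token: Dict[str, Any]) -> Dict[str, Any]:
    """Flatten one token of the syntax result into a single-level record."""
    flat: Dict[str, Any] = dict(token["text"])  # content, beginOffset
    part_of_speech = dict(token["partOfSpeech"])
    if "form" in part_of_speech:
        # Renamed in the flattened example above.
        part_of_speech["speechForm"] = part_of_speech.pop("form")
    flat.update(part_of_speech)
    edge = token["dependencyEdge"]
    flat["dependencyEdgeHeadTokenIndex"] = edge["headTokenIndex"]
    flat["dependencyEdgeLabel"] = edge["label"]
    flat["lemma"] = token["lemma"]
    return flat


# Sample token, trimmed to a few part-of-speech fields.
token = {
    "text": {"content": "Google", "beginOffset": 0},
    "partOfSpeech": {"tag": "NOUN", "form": "FORM_UNKNOWN", "number": "SINGULAR"},
    "dependencyEdge": {"headTokenIndex": 7, "label": "NSUBJ"},
    "lemma": "Google",
}

flat_token = flatten_token(token)
print(flat_token)
```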

2. Sentiment analysis


No Format
{
  "documentSentiment": {
    "magnitude": 0.8,
    "score": 0.8
  },
  "language": "en",
  "sentences": [
    {
      "text": {
        "content": "Enjoy your vacation!",
        "beginOffset": 0
      },
      "sentiment": {
        "magnitude": 0.8,
        "score": 0.8
      }
    }
  ]
}

Flattened:

No Format
{
  "magnitude": 0.8,
  "score": 0.8,
  "language": "en",
  "sentences": [
    { # record
      "content": "Enjoy your vacation!",
      "beginOffset": 0,
      "magnitude": 0.8,
      "score": 0.8
    }
  ]
}

3. Entity analysis


No Format
{
  "entities": [
    {
      "name": "1600 Pennsylvania Ave NW, Washington, DC",
      "type": "ADDRESS",
      "metadata": {
        "country": "US",
        "sublocality": "Fort Lesley J. McNair",
        "locality": "Washington",
        "street_name": "Pennsylvania Avenue Northwest",
        "broad_region": "District of Columbia",
        "narrow_region": "District of Columbia",
        "street_number": "1600"
      },
      "salience": 0,
      "mentions": [
        {
          "text": {
            "content": "1600 Pennsylvania Ave NW, Washington, DC",
            "beginOffset": 60
          },
          "type": "TYPE_UNKNOWN"
        }
      ]
    }
    ...
  ],
  "language": "en"
}

Flattened:

No Format
{
  "entities": [
    { # record
      "name": "1600 Pennsylvania Ave NW, Washington, DC",
      "type": "ADDRESS",
      "metadata": { # as a map<string,string> since fields are dynamic. Do not flatten. Looks better
        "country": "US",
        "sublocality": "Fort Lesley J. McNair",
        "locality": "Washington",
        "street_name": "Pennsylvania Avenue Northwest",
        "broad_region": "District of Columbia",
        "narrow_region": "District of Columbia",
        "street_number": "1600"
      },
      "salience": 0,
      "mentions": [
        { # record
          "content": "1600 Pennsylvania Ave NW, Washington, DC",
          "beginOffset": 60,
          "type": "TYPE_UNKNOWN"
        }
      ]
    },
    ...
  ],
  "language": "en"
}


4. Entity sentiment analysis

No Format
{
   "entities":[
      {
         "mentions":[
            {
               "sentiment":{
                  "magnitude":0.9,
                  "score":0.9
               },
               "text":{
                  "beginOffset":7,
                  "content":"R&B music"
               },
               "type":"COMMON"
            }
         ],
         "metadata":{

         },
         "name":"R&B music",
         "salience":0.5597628,
         "sentiment":{
            "magnitude":0.9,
            "score":0.9
         },
         "type":"WORK_OF_ART"
      }
   ],
   "language":"en"
}

Flattened:

No Format
{
   "entities":[
      { # record
         "mentions":[
            { # record
               "magnitude":0.9,
               "score":0.9,
               "beginOffset":7,
               "content":"R&B music",
               "type":"COMMON"
            }
         ],
         "metadata":{ # as a map<string,string> since fields are dynamic. Do not flatten. Looks better

         },
         "name":"R&B music",
         "salience":0.5597628,
         "magnitude":0.9,
         "score":0.9,
         "type":"WORK_OF_ART"
      }
      ...
   ],
   "language":"en"
}
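
A sketch of the entity flattening shown above: sentiment and text fields are lifted into the enclosing record, while metadata is kept as a map because its fields are dynamic. Both function names are hypothetical and the real transform may differ.

```python
from typing import Any, Dict


def flatten_mention(mention: Dict[str, Any]) -> Dict[str, Any]:
    """Lift the sentiment and text fields of a mention to the top level."""
    flat: Dict[str, Any] = dict(mention.get("sentiment", {}))
    flat.update(mention["text"])
    flat["type"] = mention["type"]
    return flat


def flatten_entity(entity: Dict[str, Any]) -> Dict[str, Any]:
    """Flatten one entity; metadata stays a map since its keys vary."""
    flat: Dict[str, Any] = {
        "name": entity["name"],
        "type": entity["type"],
        "salience": entity["salience"],
        "metadata": dict(entity.get("metadata", {})),
    }
    flat.update(entity.get("sentiment", {}))  # magnitude, score
    flat["mentions"] = [flatten_mention(m) for m in entity.get("mentions", [])]
    return flat


entity = {
    "mentions": [{
        "sentiment": {"magnitude": 0.9, "score": 0.9},
        "text": {"beginOffset": 7, "content": "R&B music"},
        "type": "COMMON",
    }],
    "metadata": {},
    "name": "R&B music",
    "salience": 0.5597628,
    "sentiment": {"magnitude": 0.9, "score": 0.9},
    "type": "WORK_OF_ART",
}

flat_entity = flatten_entity(entity)
print(flat_entity)
```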



5. Classify content

No Format
{
   "categories":[
      {
         "confidence":0.61,
         "name":"/Computers & Electronics"
      },
      {
         "confidence":0.53,
         "name":"/Internet & Telecom/Mobile & Wireless"
      },
      {
         "confidence":0.53,
         "name":"/News"
      }
   ]
}

Flattened:

Does not change.