Analyze: the _analyze API

Saurabh Sharma

Text is a key part of most of the data ingested into Elasticsearch, and whether a field is of type text or keyword, choosing which analyzer to use requires careful consideration.

What is an Analyzer?

An analyzer, which can be either built-in or custom, is just a package that contains three lower-level building blocks:

  1. character filters,
  2. tokenizers, and
  3. token filters.

The main documentation is here.
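The _analyze API lets you try all three building blocks together in a single request, without touching any index. A quick sketch (html_strip, standard, and lowercase are all built-in components):

GET _analyze
{
  "char_filter": ["html_strip"],
  "tokenizer": "standard",
  "filter": ["lowercase"],
  "text": "<b>My simple text!</b>"
}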

NOTE: The analysis settings are not dynamic and can't be updated while the index is open.
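If an analyzer has to be added or changed on an existing index, one common approach (sketched here with a hypothetical index name my_index and analyzer name content) is to close the index, update the settings, and open it again:

POST my_index/_close

PUT my_index/_settings
{
  "analysis": {
    "analyzer": {
      "content": {
        "type": "custom",
        "tokenizer": "whitespace"
      }
    }
  }
}

POST my_index/_open

With that out of the way, the simplest way to see an analyzer in action is to send some text to _analyze: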

GET _analyze
{
  "text": "My simple text!",
  "explain": true
}

Look at the output to understand which analyzer has been used and how it processes the text.

{
  "detail" : {
    "custom_analyzer" : false,
    "analyzer" : {
      "name" : "standard",
      "tokens" : [
        {
          "token" : "my",
          "start_offset" : 0,
          "end_offset" : 2,
          "type" : "<ALPHANUM>",
          "position" : 0,
          "bytes" : "[6d 79]",
          "positionLength" : 1,
          "termFrequency" : 1
        },
        {
          "token" : "simple",
          "start_offset" : 3,
          "end_offset" : 9,
          "type" : "<ALPHANUM>",
          "position" : 1,
          "bytes" : "[73 69 6d 70 6c 65]",
          "positionLength" : 1,
          "termFrequency" : 1
        },
        {
          "token" : "text",
          "start_offset" : 10,
          "end_offset" : 14,
          "type" : "<ALPHANUM>",
          "position" : 2,
          "bytes" : "[74 65 78 74]",
          "positionLength" : 1,
          "termFrequency" : 1
        }
      ]
    }
  }
}

Look at how the standard analyzer (the default when none is specified) is used and how it breaks the original text from

My simple text!

to

"tokens" : [
{
"token" : "my",
"start_offset" : 0,
"end_offset" : 2,
"type" : "",
"position" : 0,
"bytes" : "[6d 79]",
"positionLength" : 1,
"termFrequency" : 1
},
{
"token" : "simple",
"start_offset" : 3,
"end_offset" : 9,
"type" : "",
"position" : 1,
"bytes" : "[73 69 6d 70 6c 65]",
"positionLength" : 1,
"termFrequency" : 1
},
{
"token" : "text",
"start_offset" : 10,
"end_offset" : 14,
"type" : "",
"position" : 2,
"bytes" : "[74 65 78 74]",
"positionLength" : 1,
"termFrequency" : 1
}
]

Notice how the input is broken at every space and the punctuation has been removed: text instead of text!.
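Incidentally, if you drop "explain": true, the response is just the plain token list; that shorter format is what you will see in the tokenizer and filter examples later on:

GET _analyze
{
  "text": "My simple text!"
}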

Changing the analyzer to whitespace:

GET _analyze
{
  "text": "My simple text!",
  "explain": true,
  "analyzer": "whitespace"
}

The output is something like

{
  "detail" : {
    "custom_analyzer" : false,
    "analyzer" : {
      "name" : "whitespace",
      "tokens" : [
        {
          "token" : "My",
          "start_offset" : 0,
          "end_offset" : 2,
          "type" : "word",
          "position" : 0,
          "bytes" : "[4d 79]",
          "positionLength" : 1,
          "termFrequency" : 1
        },
        {
          "token" : "simple",
          "start_offset" : 3,
          "end_offset" : 9,
          "type" : "word",
          "position" : 1,
          "bytes" : "[73 69 6d 70 6c 65]",
          "positionLength" : 1,
          "termFrequency" : 1
        },
        {
          "token" : "text!",
          "start_offset" : 10,
          "end_offset" : 15,
          "type" : "word",
          "position" : 2,
          "bytes" : "[74 65 78 74 21]",
          "positionLength" : 1,
          "termFrequency" : 1
        }
      ]
    }
  }
}

Look at how the output is similar, except the punctuation has not been removed and the case has been retained.

My instead of my as in the previous example.
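To actually use one of these analyzers at index time, you pick it in the field mapping. A minimal sketch, assuming a hypothetical index my_index with a text field called title:

PUT my_index
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "whitespace"
      }
    }
  }
}

Once the mapping exists, _analyze can be pointed at the field, and it picks up whatever analyzer the mapping resolves to:

GET my_index/_analyze
{
  "field": "title",
  "text": "My simple text!"
}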

Other Analyzers: Stop, English, Keyword

Stop

GET _analyze
{
  "text": "My simple and sweet text!",
  "explain": true,
  "analyzer": "stop"
}

The stop words are removed (notice there is no and token below):

{
  "detail" : {
    "custom_analyzer" : false,
    "analyzer" : {
      "name" : "stop",
      "tokens" : [
        {
          "token" : "my",
          "start_offset" : 0,
          "end_offset" : 2,
          "type" : "word",
          "position" : 0,
          "bytes" : "[6d 79]",
          "positionLength" : 1,
          "termFrequency" : 1
        },
        {
          "token" : "simple",
          "start_offset" : 3,
          "end_offset" : 9,
          "type" : "word",
          "position" : 1,
          "bytes" : "[73 69 6d 70 6c 65]",
          "positionLength" : 1,
          "termFrequency" : 1
        },
        {
          "token" : "sweet",
          "start_offset" : 14,
          "end_offset" : 19,
          "type" : "word",
          "position" : 3,
          "bytes" : "[73 77 65 65 74]",
          "positionLength" : 1,
          "termFrequency" : 1
        },
        {
          "token" : "text",
          "start_offset" : 20,
          "end_offset" : 24,
          "type" : "word",
          "position" : 4,
          "bytes" : "[74 65 78 74]",
          "positionLength" : 1,
          "termFrequency" : 1
        }
      ]
    }
  }
}
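The stop filter can also be configured inline in an _analyze request if you want to experiment with your own stop word list; a small sketch (the stop word list here is made up):

GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    "lowercase",
    {
      "type": "stop",
      "stopwords": ["and", "my"]
    }
  ],
  "text": "My simple and sweet text!"
}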

English

GET _analyze
{
  "text": "My simple and sweet text! by working.",
  "explain": true,
  "analyzer": "english"
}

Look at the output below: the stop words and and by are removed, and simple and working are stemmed to simpl and work.

{
  "detail" : {
    "custom_analyzer" : false,
    "analyzer" : {
      "name" : "english",
      "tokens" : [
        {
          "token" : "my",
          "start_offset" : 0,
          "end_offset" : 2,
          "type" : "<ALPHANUM>",
          "position" : 0,
          "bytes" : "[6d 79]",
          "keyword" : false,
          "positionLength" : 1,
          "termFrequency" : 1
        },
        {
          "token" : "simpl",
          "start_offset" : 3,
          "end_offset" : 9,
          "type" : "<ALPHANUM>",
          "position" : 1,
          "bytes" : "[73 69 6d 70 6c]",
          "keyword" : false,
          "positionLength" : 1,
          "termFrequency" : 1
        },
        {
          "token" : "sweet",
          "start_offset" : 14,
          "end_offset" : 19,
          "type" : "<ALPHANUM>",
          "position" : 3,
          "bytes" : "[73 77 65 65 74]",
          "keyword" : false,
          "positionLength" : 1,
          "termFrequency" : 1
        },
        {
          "token" : "text",
          "start_offset" : 20,
          "end_offset" : 24,
          "type" : "<ALPHANUM>",
          "position" : 4,
          "bytes" : "[74 65 78 74]",
          "keyword" : false,
          "positionLength" : 1,
          "termFrequency" : 1
        },
        {
          "token" : "work",
          "start_offset" : 29,
          "end_offset" : 36,
          "type" : "<ALPHANUM>",
          "position" : 6,
          "bytes" : "[77 6f 72 6b]",
          "keyword" : false,
          "positionLength" : 1,
          "termFrequency" : 1
        }
      ]
    }
  }
}

Keyword

GET _analyze
{
  "text": "My simple and sweet text! by working.",
  "explain": true,
  "analyzer": "keyword"
}

The output changes to a single token spanning the entire input:

{
  "detail" : {
    "custom_analyzer" : false,
    "analyzer" : {
      "name" : "keyword",
      "tokens" : [
        {
          "token" : "My simple and sweet text! by working.",
          "start_offset" : 0,
          "end_offset" : 37,
          "type" : "word",
          "position" : 0,
          "bytes" : "[4d 79 20 73 69 6d 70 6c 65 20 61 6e 64 20 73 77 65 65 74 20 74 65 78 74 21 20 62 79 20 77 6f 72 6b 69 6e 67 2e]",
          "positionLength" : 1,
          "termFrequency" : 1
        }
      ]
    }
  }
}

Tokenizers and filters

How about playing with some examples of tokenizers and filters?

GET _analyze
{
  "tokenizer": "standard",
  "filter": ["lowercase"],
  "text": "My simple and sweet exmple at work(ing)!"
}

I am using standard as the tokenizer and lowercase as the filter. Look at how it tokenizes the input.

{
  "tokens" : [
    {
      "token" : "my",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "simple",
      "start_offset" : 3,
      "end_offset" : 9,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "and",
      "start_offset" : 10,
      "end_offset" : 13,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "sweet",
      "start_offset" : 14,
      "end_offset" : 19,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "exmple",
      "start_offset" : 20,
      "end_offset" : 26,
      "type" : "<ALPHANUM>",
      "position" : 4
    },
    {
      "token" : "at",
      "start_offset" : 27,
      "end_offset" : 29,
      "type" : "<ALPHANUM>",
      "position" : 5
    },
    {
      "token" : "work",
      "start_offset" : 30,
      "end_offset" : 34,
      "type" : "<ALPHANUM>",
      "position" : 6
    },
    {
      "token" : "ing",
      "start_offset" : 35,
      "end_offset" : 38,
      "type" : "<ALPHANUM>",
      "position" : 7
    }
  ]
}

Now let's add snowball to the filter list and put the missing a back into example in the text.
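The request for that would look something like this:

GET _analyze
{
  "tokenizer": "standard",
  "filter": ["lowercase", "snowball"],
  "text": "My simple and sweet example at work(ing)!"
}

And the output becomes: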

{
  "tokens" : [
    {
      "token" : "my",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "simpl",
      "start_offset" : 3,
      "end_offset" : 9,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "and",
      "start_offset" : 10,
      "end_offset" : 13,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "sweet",
      "start_offset" : 14,
      "end_offset" : 19,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "exampl",
      "start_offset" : 20,
      "end_offset" : 27,
      "type" : "<ALPHANUM>",
      "position" : 4
    },
    {
      "token" : "at",
      "start_offset" : 28,
      "end_offset" : 30,
      "type" : "<ALPHANUM>",
      "position" : 5
    },
    {
      "token" : "work",
      "start_offset" : 31,
      "end_offset" : 35,
      "type" : "<ALPHANUM>",
      "position" : 6
    },
    {
      "token" : "ing",
      "start_offset" : 36,
      "end_offset" : 39,
      "type" : "<ALPHANUM>",
      "position" : 7
    }
  ]
}

Look how, among other things, it stemmed example to exampl, dropping the trailing e (and simple to simpl again).

Defining a new analyzer

PUT my_index
{
  "settings": {
    "analysis": {
      "char_filter": {
      },
      "filter": {
      },
      "analyzer": {
        "new_analyzer": {}
      }
    }
  }
}
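The empty objects above are just placeholders. As a filled-in sketch of the same request, assume we want new_analyzer to strip HTML, tokenize with standard, lowercase, and stem with snowball (the component names below, like strip_html and english_snowball, are made up for this example):

PUT my_index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "strip_html": {
          "type": "html_strip"
        }
      },
      "filter": {
        "english_snowball": {
          "type": "snowball",
          "language": "English"
        }
      },
      "analyzer": {
        "new_analyzer": {
          "type": "custom",
          "char_filter": ["strip_html"],
          "tokenizer": "standard",
          "filter": ["lowercase", "english_snowball"]
        }
      }
    }
  }
}

Once the index exists, the same _analyze API can be used to test the new analyzer against it:

GET my_index/_analyze
{
  "analyzer": "new_analyzer",
  "text": "My simple and sweet example at work(ing)!"
}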