{"id":637,"date":"2020-05-02T10:21:09","date_gmt":"2020-05-02T10:21:09","guid":{"rendered":"https:\/\/blog.samarthya.me\/wps\/?p=637"},"modified":"2020-05-02T10:21:13","modified_gmt":"2020-05-02T10:21:13","slug":"analyze-_analyze-api","status":"publish","type":"post","link":"https:\/\/blog.samarthya.me\/wps\/2020\/05\/02\/analyze-_analyze-api\/","title":{"rendered":"Analyze: _analyze api"},"content":{"rendered":"<p>Text is a key part of most of the data ingested into <strong>Elasticsearch<\/strong>, and deciding whether a field should be of type <code>text<\/code> or <code>keyword<\/code>, and which analyzer to use, requires careful consideration.<\/p>\n<p>What is an analyzer?<\/p>\n<p>An <em>analyzer<\/em>, which can be either built-in or custom, is just a package that contains three lower-level building blocks:<\/p>\n<ol>\n<li><em>character filters<\/em>,<\/li>\n<li><em>tokenizers<\/em>, and<\/li>\n<li><em>token filters<\/em>.<\/li>\n<\/ol>\n<p>The main documentation is <a href=\"https:\/\/www.elastic.co\/guide\/en\/elasticsearch\/reference\/7.4\/_testing_analyzers.html\">here<\/a>.<\/p>\n<blockquote>\n<p>NOTE: The analysis settings are non-dynamic and can\u2019t be updated while the index is open.<\/p>\n<\/blockquote>\n<pre>GET _analyze<br \/>{<br \/>\"text\": \"My simple text!\",<br \/>\"explain\": true<br \/>}<\/pre>\n<p>Look at the output to see which analyzer was used and how it processed the text.<\/p>\n<pre>{<br \/>\"detail\" : {<br \/>\"custom_analyzer\" : false,<br \/>\"analyzer\" : {<br \/><strong>\"name\" : \"standard\",<\/strong><br \/>\"tokens\" : [<br \/>{<br \/>\"token\" : \"my\",<br \/>\"start_offset\" : 0,<br \/>\"end_offset\" : 2,<br \/>\"type\" : \"&lt;ALPHANUM&gt;\",<br \/>\"position\" : 0,<br \/>\"bytes\" : \"[6d 79]\",<br \/>\"positionLength\" : 1,<br \/>\"termFrequency\" : 1<br \/>},<br \/>{<br \/>\"token\" : \"simple\",<br \/>\"start_offset\" : 3,<br \/>\"end_offset\" : 9,<br \/>\"type\" : \"&lt;ALPHANUM&gt;\",<br \/>\"position\" : 
1,<br \/>\"bytes\" : \"[73 69 6d 70 6c 65]\",<br \/>\"positionLength\" : 1,<br \/>\"termFrequency\" : 1<br \/>},<br \/>{<br \/>\"token\" : \"text\",<br \/>\"start_offset\" : 10,<br \/>\"end_offset\" : 14,<br \/>\"type\" : \"&lt;ALPHANUM&gt;\",<br \/>\"position\" : 2,<br \/>\"bytes\" : \"[74 65 78 74]\",<br \/>\"positionLength\" : 1,<br \/>\"termFrequency\" : 1<br \/>}<br \/>]<br \/>}<br \/>}<br \/>}<\/pre>\n<p>Notice that the <code>standard<\/code> analyzer is used and how it breaks the original text from<\/p>\n<pre><span style=\"color: #ff0000;\"><strong>My simple text!<\/strong><\/span><\/pre>\n\n\n<p>to<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">\"<strong>tokens<\/strong>\" : [\n{\n\"token\" : \"my\",\n\"start_offset\" : 0,\n\"end_offset\" : 2,\n\"type\" : \"&lt;ALPHANUM&gt;\",\n\"position\" : 0,\n\"bytes\" : \"[6d 79]\",\n\"positionLength\" : 1,\n\"termFrequency\" : 1\n},\n{\n\"token\" : \"simple\",\n\"start_offset\" : 3,\n\"end_offset\" : 9,\n\"type\" : \"&lt;ALPHANUM&gt;\",\n\"position\" : 1,\n\"bytes\" : \"[73 69 6d 70 6c 65]\",\n\"positionLength\" : 1,\n\"termFrequency\" : 1\n},\n{\n\"token\" : \"text\",\n\"start_offset\" : 10,\n\"end_offset\" : 14,\n\"type\" : \"&lt;ALPHANUM&gt;\",\n\"position\" : 2,\n\"bytes\" : \"[74 65 78 74]\",\n\"positionLength\" : 1,\n\"termFrequency\" : 1\n}\n]<\/pre>\n\n\n\n<p>The input is broken at every space, and even the punctuation has been removed: <code>text<\/code> instead of <code>text!<\/code>.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Changing the analyzer to Whitespace<\/h3>\n\n\n\n<pre class=\"wp-block-preformatted\">GET _analyze<br>{<br>\"text\": \"My simple text!\",<br>\"explain\": true,<br>\"analyzer\": \"whitespace\"<br>}<\/pre>\n\n\n\n<p>The output looks something like this:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">{\n\"detail\" : {\n\"custom_analyzer\" : false,\n\"analyzer\" : {\n\"name\" : \"<strong>whitespace<\/strong>\",\n\"tokens\" : [\n{\n\"token\" : \"My\",\n\"start_offset\" : 0,\n\"end_offset\" : 2,\n\"type\" : \"word\",\n\"position\" : 
0,\n\"bytes\" : \"[4d 79]\",\n\"positionLength\" : 1,\n\"termFrequency\" : 1\n},\n{\n\"token\" : \"simple\",\n\"start_offset\" : 3,\n\"end_offset\" : 9,\n\"type\" : \"word\",\n\"position\" : 1,\n\"bytes\" : \"[73 69 6d 70 6c 65]\",\n\"positionLength\" : 1,\n\"termFrequency\" : 1\n},\n{\n\"token\" : \"text!\",\n\"start_offset\" : 10,\n\"end_offset\" : 15,\n\"type\" : \"word\",\n\"position\" : 2,\n\"bytes\" : \"[74 65 78 74 21]\",\n\"positionLength\" : 1,\n\"termFrequency\" : 1\n}\n]\n}\n}\n}<\/pre>\n\n\n\n<p>The output is similar, except that the punctuation has not been removed and the case has been retained:<\/p>\n\n\n\n<p><code>My<\/code> instead of the <code>my<\/code> seen in the previous example.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Other analyzers: Stop, English, Keyword<\/h3>\n\n\n\n<h4 class=\"wp-block-heading\">Stop<\/h4>\n\n\n\n<pre class=\"wp-block-preformatted\">GET _analyze<br>{<br>\"text\": \"My simple and sweet text!\",<br>\"explain\": true,<br>\"analyzer\": \"stop\"<br>}<\/pre>\n\n\n\n<p>The stop words are removed (notice that there is no <code>and<\/code> below, and that position 2 is skipped).<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">{\n\"detail\" : {\n\"custom_analyzer\" : false,\n\"analyzer\" : {\n\"name\" : \"stop\",\n\"tokens\" : [\n{\n\"token\" : \"my\",\n\"start_offset\" : 0,\n\"end_offset\" : 2,\n\"type\" : \"word\",\n\"position\" : 0,\n\"bytes\" : \"[6d 79]\",\n\"positionLength\" : 1,\n\"termFrequency\" : 1\n},\n{\n\"token\" : \"simple\",\n\"start_offset\" : 3,\n\"end_offset\" : 9,\n\"type\" : \"word\",\n\"position\" : 1,\n\"bytes\" : \"[73 69 6d 70 6c 65]\",\n\"positionLength\" : 1,\n\"termFrequency\" : 1\n},\n{\n\"token\" : \"sweet\",\n\"start_offset\" : 14,\n\"end_offset\" : 19,\n\"type\" : \"word\",\n\"position\" : 3,\n\"bytes\" : \"[73 77 65 65 74]\",\n\"positionLength\" : 1,\n\"termFrequency\" : 1\n},\n{\n\"token\" : \"text\",\n\"start_offset\" : 20,\n\"end_offset\" : 24,\n\"type\" : \"word\",\n\"position\" : 4,\n\"bytes\" : \"[74 65 78 74]\",\n\"positionLength\" : 
1,\n\"termFrequency\" : 1\n}\n]\n}\n}\n}<\/pre>\n\n\n\n<h4 class=\"wp-block-heading\">English<\/h4>\n\n\n\n<pre class=\"wp-block-preformatted\">GET _analyze\n{\n\"text\": \"My simple and sweet text! by working.\",\n\"explain\": true,\n\"analyzer\": \"<strong>english<\/strong>\"\n}<\/pre>\n\n\n\n<p>The output is shown below. Notice the stemmed tokens <code>simpl<\/code> and <code>work<\/code>, and that the stop words <code>and<\/code> and <code>by<\/code> are gone.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">{\n\"detail\" : {\n\"custom_analyzer\" : false,\n\"analyzer\" : {\n\"name\" : \"english\",\n\"tokens\" : [\n{\n\"token\" : \"my\",\n\"start_offset\" : 0,\n\"end_offset\" : 2,\n\"type\" : \"&lt;ALPHANUM&gt;\",\n\"position\" : 0,\n\"bytes\" : \"[6d 79]\",\n\"keyword\" : false,\n\"positionLength\" : 1,\n\"termFrequency\" : 1\n},\n{\n<strong><span class=\"has-inline-color has-vivid-red-color\">\"token\" : \"simpl\",<\/span><\/strong>\n\"start_offset\" : 3,\n\"end_offset\" : 9,\n\"type\" : \"&lt;ALPHANUM&gt;\",\n\"position\" : 1,\n\"bytes\" : \"[73 69 6d 70 6c]\",\n\"keyword\" : false,\n\"positionLength\" : 1,\n\"termFrequency\" : 1\n},\n{\n\"token\" : \"sweet\",\n\"start_offset\" : 14,\n\"end_offset\" : 19,\n\"type\" : \"&lt;ALPHANUM&gt;\",\n\"position\" : 3,\n\"bytes\" : \"[73 77 65 65 74]\",\n\"keyword\" : false,\n\"positionLength\" : 1,\n\"termFrequency\" : 1\n},\n{\n\"token\" : \"text\",\n\"start_offset\" : 20,\n\"end_offset\" : 24,\n\"type\" : \"&lt;ALPHANUM&gt;\",\n\"position\" : 4,\n\"bytes\" : \"[74 65 78 74]\",\n\"keyword\" : false,\n\"positionLength\" : 1,\n\"termFrequency\" : 1\n},\n{\n<strong><span class=\"has-inline-color has-luminous-vivid-orange-color\">\"token\" : \"work\",<\/span><\/strong>\n\"start_offset\" : 29,\n\"end_offset\" : 36,\n\"type\" : \"&lt;ALPHANUM&gt;\",\n\"position\" : 6,\n\"bytes\" : \"[77 6f 72 6b]\",\n\"keyword\" : false,\n\"positionLength\" : 1,\n\"termFrequency\" : 1\n}\n]\n}\n}\n}<\/pre>\n\n\n\n<h4 class=\"wp-block-heading\">Keyword<\/h4>\n\n\n\n<pre class=\"wp-block-preformatted\">GET _analyze\n{\n\"text\": \"My simple and sweet text! 
by working.\",\n\"explain\": true,\n\"analyzer\": \"keyword\"\n}<\/pre>\n\n\n\n<p>The output changes as shown below; the whole input is kept as a single token.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">{\n\"detail\" : {\n\"custom_analyzer\" : false,\n\"analyzer\" : {\n\"name\" : \"keyword\",\n\"tokens\" : [\n{\n\"token\" : \"My simple and sweet text! by working.\",\n\"start_offset\" : 0,\n\"end_offset\" : 37,\n\"type\" : \"word\",\n\"position\" : 0,\n<strong><span class=\"has-inline-color has-luminous-vivid-orange-color\">\"bytes\" : \"[4d 79 20 73 69 6d 70 6c 65 20 61 6e 64 20 73 77 65 65 74 20 74 65 78 74 21 20 62 79 20 77 6f 72 6b 69 6e 67 2e]\"<\/span><\/strong>,\n\"positionLength\" : 1,\n\"termFrequency\" : 1\n}\n]\n}\n}\n}<\/pre>\n\n\n\n<h2 class=\"wp-block-heading\">Tokenizer and filters<\/h2>\n\n\n\n<p>How about playing with some examples of tokenizers and filters?<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">GET _analyze\n{\n\"tokenizer\": \"standard\",\n\"filter\": [\"lowercase\"],\n\"text\": \"My simple and sweet exmple at work(ing)!\"\n}<\/pre>\n\n\n\n<p>I am using <code>standard<\/code> as the tokenizer and <code>lowercase<\/code> as the filter. 
Look at how it tokenizes the text.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">{\n\"tokens\" : [\n{\n\"token\" : \"my\",\n\"start_offset\" : 0,\n\"end_offset\" : 2,\n\"type\" : \"&lt;ALPHANUM&gt;\",\n\"position\" : 0\n},\n{\n\"token\" : \"simple\",\n\"start_offset\" : 3,\n\"end_offset\" : 9,\n\"type\" : \"&lt;ALPHANUM&gt;\",\n\"position\" : 1\n},\n{\n\"token\" : \"and\",\n\"start_offset\" : 10,\n\"end_offset\" : 13,\n\"type\" : \"&lt;ALPHANUM&gt;\",\n\"position\" : 2\n},\n{\n\"token\" : \"sweet\",\n\"start_offset\" : 14,\n\"end_offset\" : 19,\n\"type\" : \"&lt;ALPHANUM&gt;\",\n\"position\" : 3\n},\n{\n\"token\" : \"exmple\",\n\"start_offset\" : 20,\n\"end_offset\" : 26,\n\"type\" : \"&lt;ALPHANUM&gt;\",\n\"position\" : 4\n},\n{\n\"token\" : \"at\",\n\"start_offset\" : 27,\n\"end_offset\" : 29,\n\"type\" : \"&lt;ALPHANUM&gt;\",\n\"position\" : 5\n},\n{\n\"token\" : \"work\",\n\"start_offset\" : 30,\n\"end_offset\" : 34,\n\"type\" : \"&lt;ALPHANUM&gt;\",\n\"position\" : 6\n},\n{\n\"token\" : \"ing\",\n\"start_offset\" : 35,\n\"end_offset\" : 38,\n\"type\" : \"&lt;ALPHANUM&gt;\",\n\"position\" : 7\n}\n]\n}<\/pre>\n\n\n\n<p>Now add <code>snowball<\/code> to the filter list, and add the missing <code>a<\/code> to <code>example<\/code> in the text. The output becomes:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">{\n\"tokens\" : [\n{\n\"token\" : \"my\",\n\"start_offset\" : 0,\n\"end_offset\" : 2,\n\"type\" : \"&lt;ALPHANUM&gt;\",\n\"position\" : 0\n},\n{\n\"token\" : \"simpl\",\n\"start_offset\" : 3,\n\"end_offset\" : 9,\n\"type\" : \"&lt;ALPHANUM&gt;\",\n\"position\" : 1\n},\n{\n\"token\" : \"and\",\n\"start_offset\" : 10,\n\"end_offset\" : 13,\n\"type\" : \"&lt;ALPHANUM&gt;\",\n\"position\" : 2\n},\n{\n\"token\" : \"sweet\",\n\"start_offset\" : 14,\n\"end_offset\" : 19,\n\"type\" : \"&lt;ALPHANUM&gt;\",\n\"position\" : 3\n},\n{\n\"token\" : \"exampl\",\n\"start_offset\" : 20,\n\"end_offset\" : 27,\n\"type\" : \"&lt;ALPHANUM&gt;\",\n\"position\" : 4\n},\n{\n\"token\" : \"at\",\n\"start_offset\" : 28,\n\"end_offset\" : 30,\n\"type\" : \"&lt;ALPHANUM&gt;\",\n\"position\" : 5\n},\n{\n\"token\" : \"work\",\n\"start_offset\" : 31,\n\"end_offset\" : 35,\n\"type\" : \"&lt;ALPHANUM&gt;\",\n\"position\" : 6\n},\n{\n\"token\" 
: \"ing\",\n\"start_offset\" : 36,\n\"end_offset\" : 39,\n\"type\" : \"&lt;ALPHANUM&gt;\",\n\"position\" : 7\n}\n]\n}<\/pre>\n\n\n\n<p>Look at how the stemmer dropped the trailing <code>e<\/code>: <code>example<\/code> became <code>exampl<\/code> and <code>simple<\/code> became <code>simpl<\/code>.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Defining a new analyzer<\/h3>\n\n\n\n<pre class=\"wp-block-preformatted\">PUT my_index\n{\n  \"settings\": {\n     \"analysis\": {\n       \"char_filter\": {\n        },\n       \"filter\": {\n        },\n       \"analyzer\": {\n           \"new_analyzer\": {\n             \"type\": \"custom\",\n             \"tokenizer\": \"standard\",\n             \"filter\": [\"lowercase\"]\n           }\n        }\n     }\n   }\n}<\/pre>\n","protected":false},"excerpt":{"rendered":"<p>Text is a key part of most of the data ingested into Elasticsearch, and deciding whether a field should be of type text or keyword, and which analyzer to use, requires careful consideration. What is an analyzer? An analyzer, which can be either built-in or custom, is just a package that contains three lower-level building blocks [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":641,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"image","meta":{"_exactmetrics_skip_tracking":false,"_exactmetrics_sitenote_active":false,"_exactmetrics_sitenote_note":"","_exactmetrics_sitenote_category":0,"footnotes":""},"categories":[34],"tags":[66,53],"class_list":["post-637","post","type-post","status-publish","format-image","has-post-thumbnail","hentry","category-technical","tag-analyze","tag-elasticsearch","post_format-post-format-image"],"_links":{"self":[{"href":"https:\/\/blog.samarthya.me\/wps\/wp-json\/wp\/v2\/posts\/637","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/blog.samarthya.me\/wps\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.samarthya.me\/wps\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blog.samarthya.me\/wps\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.samarthya.me\/wps\/wp-json\/wp\/v2\/comments?post=637"}],"vers
ion-history":[{"count":0,"href":"https:\/\/blog.samarthya.me\/wps\/wp-json\/wp\/v2\/posts\/637\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/blog.samarthya.me\/wps\/wp-json\/wp\/v2\/media\/641"}],"wp:attachment":[{"href":"https:\/\/blog.samarthya.me\/wps\/wp-json\/wp\/v2\/media?parent=637"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.samarthya.me\/wps\/wp-json\/wp\/v2\/categories?post=637"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.samarthya.me\/wps\/wp-json\/wp\/v2\/tags?post=637"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}