Elasticsearch ICU now understands emoji!
And that simple change in Elastic 6.4 may have a bigger impact on your indices than you might think.
Elasticsearch 6.4 ships with Lucene 7.4. That is a one-liner in the official release notes, but if you look closer, this new version ships updated ICU data and real support for emoji. And that’s a game changer 😎 (for some!).
International Components for Unicode (ICU) is an open source project of mature C/C++ and Java libraries for Unicode support, software internationalization and software globalization. It’s used everywhere (on your computer, your phone, and probably even your connected fridge).
All usages of icu_tokenizer are impacted. That means everyone using the must-have icu_tokenizer should probably reindex everything, because “🍕” is now a token!
The new behavior of icu_tokenizer
With this simple query we are testing Elasticsearch ICU Tokenizer:
GET /_analyze
{
  "tokenizer": "icu_tokenizer",
  "text": "I live in 🇫🇷 and I'm 👩‍🚀"
}
The 👩‍🚀 emoji is peculiar: it is a combination of the more classic 👩 and 🚀 emoji. The flag of France is also special, as it combines 🇫 and 🇷. So this is not just about splitting Unicode code points properly, but about really understanding emoji.
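To make the "combination" idea concrete, here is a minimal Python sketch that lists the code points behind both sequences. It has nothing to do with Elasticsearch itself; the `describe` helper and the constant names are made up for this illustration, and the strings are written with escapes so the invisible joiner in the astronaut sequence survives copy and paste.

```python
import unicodedata

# Escapes are used on purpose: the zero width joiner is invisible and easy to lose.
WOMAN_ASTRONAUT = "\U0001F469\u200D\U0001F680"  # woman + zero width joiner + rocket
FLAG_FRANCE = "\U0001F1EB\U0001F1F7"            # regional indicators F + R


def describe(sequence):
    """Return a (code point, Unicode name) pair for every code point in the string."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in sequence]


for label, seq in (("woman astronaut", WOMAN_ASTRONAUT), ("flag of France", FLAG_FRANCE)):
    print(label, seq)
    for code_point, name in describe(seq):
        print("   ", code_point, name)
```

A tokenizer that only splits on code points would see three and two separate symbols here; an emoji-aware one has to keep each sequence together as a single token.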
Let’s compare the resulting tokens of this _analyze call with both Elasticsearch 6.3 and Elasticsearch 6.4:
Elasticsearch 6.3
- I
- live
- in
- and
- I’m
Emoji are just dropped like punctuation.
Elasticsearch 6.4
- I
- live
- in
- 🇫🇷
- and
- I’m
- 👩‍🚀
Emoji are kept and understood!
More tokens mean more relevant search results! Here is how to take advantage of this new capability.
Using the power of emoji for search
Now that you have the tokens, you can search for meaning and relevance inside the emoji world. Thousands of new words, new meanings and ways to communicate are available to you.
Your users will be able to search for a pizza place by typing “pizza”, or “🍕”, or both.
To add this capability to your indices, take the CLDR annotations for each emoji and add them as synonyms via a custom token filter. Here is an example:
PUT /emoji-capable
{
  "settings": {
    "analysis": {
      "filter": {
        "english_emoji": {
          "type": "synonym",
          "synonyms_path": "analysis/cldr-emoji-annotation-synonyms-en.txt"
        }
      },
      "analyzer": {
        "english_with_emoji": {
          "tokenizer": "icu_tokenizer",
          "filter": [
            "english_emoji"
          ]
        }
      }
    }
  }
}
You may wonder how to populate the cldr-emoji-annotation-synonyms-en.txt file? Wonder no more! Everything is already done and it looks like this:
🤧 => 🤧, face, gesundheit, sneeze, sneezing face
🧞 => 🧞, djinn, genie
🕺 => 🕺, dance, man, man dancing
👂 => 👂, body, ear
🐅 => 🐅, tiger
🍺 => 🍺, bar, beer, drink, mug
🆘 => 🆘, help, sos, SOS button
👩‍🚒 => 👩‍🚒, firefighter, firetruck, woman
🇮🇪 => 🇮🇪, Ireland
I took the time to generate properly formatted and Elasticsearch-compatible synonym files for all the languages and emoji supported by CLDR, and you can find them on GitHub.
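If you ever need to produce such rules yourself, the line format is easy to rebuild. The following Python sketch is not the generator used for the published files; `synonym_rule` and `SAMPLE_ANNOTATIONS` are hypothetical names, and the tiny dictionary stands in for the real CLDR annotation data.

```python
def synonym_rule(emoji, annotations):
    """Build one explicit-mapping rule: the emoji maps to itself plus its keywords."""
    return f"{emoji} => {', '.join([emoji, *annotations])}"


# Hypothetical in-memory sample; the real keywords come from the CLDR annotation files.
SAMPLE_ANNOTATIONS = {
    "🐅": ["tiger"],
    "🍺": ["bar", "beer", "drink", "mug"],
    "🧞": ["djinn", "genie"],
}

rules = [synonym_rule(emoji, words) for emoji, words in SAMPLE_ANNOTATIONS.items()]
print("\n".join(rules))
```

Keeping the emoji itself on the right-hand side of the rule matters: without it, the original emoji token would be replaced by its keywords instead of being indexed next to them.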
That way when you search for 🐅, “tiger” is also searched, and the other way around.
A complete Emoji Search example
Let’s build a simple index, add some documents and search for them (gosh, I wish found.no/play was still working, it was like a JS Fiddle but for Elasticsearch!):
The index with English-based analysis
PUT /tweets
{
  "settings": {
    "number_of_replicas": 0,
    "number_of_shards": 1,
    "analysis": {
      "filter": {
        "english_emoji": {
          "type": "synonym",
          "synonyms": [
            "🐅 => 🐅, tiger",
            "🍺 => 🍺, bar, beer, drink, mug",
            "🍍 => 🍍, fruit, pineapple",
            "🍕 => 🍕, cheese, pizza, slice"
          ]
        },
        "english_stop": {
          "type": "stop",
          "stopwords": "_english_"
        },
        "english_keywords": {
          "type": "keyword_marker",
          "keywords": ["example"]
        },
        "english_stemmer": {
          "type": "stemmer",
          "language": "english"
        },
        "english_possessive_stemmer": {
          "type": "stemmer",
          "language": "possessive_english"
        }
      },
      "analyzer": {
        "english_with_emoji": {
          "tokenizer": "icu_tokenizer",
          "filter": [
            "english_possessive_stemmer",
            "lowercase",
            "english_emoji",
            "english_stop",
            "english_keywords",
            "english_stemmer"
          ]
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "content": {
          "type": "text",
          "analyzer": "english_with_emoji"
        }
      }
    }
  }
}
Add some documents
POST /tweets/_doc
{
  "author": "NotFunny",
  "content": "Pineapple on pizza, are you kidding me? #peopleAreCrazy"
}

POST /tweets/_doc
{
  "author": "JulFactor93",
  "content": "🍍🍕 is the best #food"
}
And search!
GET /tweets/_search
{
  "query": {
    "match": {
      "content": "pineapple pizza"
    }
  }
}

GET /tweets/_search
{
  "query": {
    "match": {
      "content": "🍍🍕"
    }
  }
}
Both searches return our two documents, because the analyzer expands each emoji into its annotation words at index time and at search time, so words and emoji end up matching each other.
{
  "took": 16,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 0.653994,
    "hits": [
      {
        "_index": "tweets",
        "_type": "_doc",
        "_id": "-PJ172YBCoXstbpG75QM",
        "_score": 0.653994,
        "_source": {
          "author": "JulFactor93",
          "content": "🍍🍕 is the best #food"
        }
      },
      {
        "_index": "tweets",
        "_type": "_doc",
        "_id": "-fJ172YBCoXstbpG9pRv",
        "_score": 0.3971361,
        "_source": {
          "author": "NotFunny",
          "content": "Pineapple on pizza, are you kidding me? #peopleAreCrazy"
        }
      }
    ]
  }
}
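If you want to convince yourself of why both queries hit both tweets without spinning up a cluster, the matching logic can be imitated in a few lines of Python. This is only a toy model and not what Lucene does internally: it ignores stemming, stop words and scoring, assumes single code point emoji, and `toy_analyze` and `search` are names invented for this sketch. The synonym table and the two documents are copied from the example above.

```python
import re

# Explicit mappings copied from the `english_emoji` filter of the `tweets` index.
SYNONYMS = {
    "🐅": ["🐅", "tiger"],
    "🍺": ["🍺", "bar", "beer", "drink", "mug"],
    "🍍": ["🍍", "fruit", "pineapple"],
    "🍕": ["🍕", "cheese", "pizza", "slice"],
}

DOCS = [
    "Pineapple on pizza, are you kidding me? #peopleAreCrazy",
    "🍍🍕 is the best #food",
]


def toy_analyze(text):
    """Rough stand-in for the analyzer: lowercase words, expand known emoji."""
    tokens = set()
    # \w+ grabs whole words; every other non-space symbol is looked at on its own.
    for raw in re.findall(r"\w+|[^\w\s]", text.lower()):
        if raw in SYNONYMS:
            tokens.update(SYNONYMS[raw])  # keep the emoji and add its keywords
        elif raw[0].isalnum():
            tokens.add(raw)               # plain word
        # anything else (punctuation, "#") is dropped
    return tokens


def search(query):
    """Return the documents sharing at least one token with the query."""
    wanted = toy_analyze(query)
    return [doc for doc in DOCS if wanted & toy_analyze(doc)]


for query in ("pineapple pizza", "🍍🍕"):
    print(query, "->", len(search(query)), "hits")
```

Both queries report two hits in this model, for the same reason as in the real index: the emoji tweet carries "pineapple" and "pizza" as extra tokens, and the emoji query is expanded into those same words before matching.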
Final words
This change is great news for me as my Elasticsearch plugin Emoji Search is no longer needed! It’s a relief since building and shipping it is no fun at all. The documentation is not really helpful and you have to compile your plugin for each Elasticsearch release.
Supporting emoji is now easier than ever. If you take a look at this technical article from 2016, your eyes will bleed a little: we had to use a whitespace tokenizer, some char filters and hacks everywhere. Elasticsearch 6.4 improves the search engine’s Unicode support, and we will enjoy it.
Happy emoji searching 🔎!