Search for Emoji with Elasticsearch
Edit from November 2018: please have a look at our new article on emoji search, because Elasticsearch 6.4 now has better support for emoji!
Edit from December 2016: there is now an Elasticsearch plugin for that!
Since 2011 and Unicode 6.0, emoji have been an integral and standardized part of the computing environment. They are increasingly available on users’ keyboards (thanks to Android and iPhone), and as web developers and engineers it is our duty to support them everywhere we can, even if we don’t like them.
There is no reason to believe that users will not want to use an emoji inside their usernames, biographies or even passwords, as they are valid characters in the same way as © or é.
A great many websites and applications are broken right now because MySQL’s utf8 character set can only store a subset of Unicode characters (those encoded on at most three bytes in UTF-8). So if you try to save an emoji in a utf8_general_ci table, it can go badly wrong. The first thing to do is to migrate to the utf8mb4 character set, introduced in MySQL 5.5.3.
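To see why a 3-byte character set is not enough, it helps to count the bytes each character needs in UTF-8. The following is only a minimal Python sketch; the helper names and the sample characters are illustrative assumptions, not part of any MySQL tooling.

```python
# MySQL's legacy "utf8" character set stores at most 3 bytes per character,
# while "utf8mb4" stores up to 4.
MYSQL_UTF8_MAX_BYTES = 3


def utf8_length(char: str) -> int:
    """Return the number of bytes needed to encode one character in UTF-8."""
    return len(char.encode("utf-8"))


def fits_in_mysql_utf8(text: str) -> bool:
    """True when every character of text fits in the 3-byte utf8 charset."""
    return all(utf8_length(c) <= MYSQL_UTF8_MAX_BYTES for c in text)


# e with acute accent, copyright sign, doughnut emoji
for sample in ("\u00e9", "\u00a9", "\U0001F369"):
    print(f"U+{ord(sample):04X}: {utf8_length(sample)} byte(s)")
```

The doughnut emoji needs four bytes, which is exactly what the old utf8 character set cannot hold.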
In this article, I would like to present my solutions for indexing emoji and searching for them in Elasticsearch. The folks at Yelp already do it and it’s pretty sweet: you can search for donuts using a 🍩 emoji!
Let’s dive into Lucene and emoji!
What is an emoji
What I am talking about here are the pictorial symbols presented in a colorful form and used inline in text; as we see them in Twitter, Slack, WhatsApp, git commit messages, …
We must distinguish:
- emoticons: text meant to represent an expression, like ;-) or ¯\_(ツ)_/¯. There is not much we can do here, as there is no standard or semantics to extract from them;
- emoji: pictograms that can be used in text. They can be displayed as images or as text:
  - emoji characters are the text representation of emoji, normal glyphs encoded in fonts like other characters. That’s what we will be searching for;
  - emoji presentation is the graphical representation, only to be considered by the display side. Each system can bring its own images, so they don’t look identical everywhere, but all are supposed to mean the same thing.
The specification is complex and introduces more than just glyphs:
- there is a variation selector character that lets you choose between the text and graphical representations ([U+FE0E](http://www.fileformat.info/info/unicode/char/fe0e/index.htm) and U+FE0F);
- some emoji can be modified via an Emoji Modifier, for example to change the skin tone (U+1F3FB..1F3FF);
- you can combine some emoji with a zero-width joiner (ZWJ) character, to display them as a single image (👨︎ ❤︎ 👨︎ can be displayed as one couple glyph).
In terms of availability, MacOS 10 and iOS 5 were the first to implement the specification (badly). You can expect users to have emoji support on Windows 8+, Android, iPhone, iPad, MacOS… and on Linux of course. This means emoji can be displayed anywhere as emoji characters, and in some places as emoji presentation. I strongly recommend having a look at caniemoji.com.
![An example of an emoji keyboard](/media/original/2016/emoji/2016–03–03 15.17.25.png)
Users can type emoji everywhere. Be ready!
How are we supposed to search them
The specification describes how to search with emoji:
Searching includes both searching for emoji characters in queries, and finding emoji characters in the target. These are most useful when they include the annotations as synonyms or hints. For example, when someone searches for ⛽︎ on yelp.com, they see matches for “gas station”. Conversely, searching for “gas pump” in a search engine could find pages containing ⛽︎.
Annotations are language-specific: searching on yelp.de, someone would expect a search for ⛽︎ to result in matches for “Tankstelle”.
That’s why we can search for donuts on Yelp by typing 🍩︎: it is translated to annotations such as dessert, donut and sweet. After that, it is only a matter of matching text, like a normal search.
Conversely, if I store a tweet containing 🍩︎, I must be able to find it when searching for dessert or donut. So each supported emoji must have a textual equivalent, which of course needs to be translated depending on your content language.
Skin tone modifiers and variation selectors can safely be ignored in search because we are only interested in the glyphs, not the way they are displayed.
But combined emoji should have their own annotations:
- 👨︎ ❤︎ 👨︎ (or its combined form) must not only match for “man” and “love”, but also “couple”;
- 👨︎ 👨︎ 👧︎ (or its combined form) must not only match for “man” and “girl”, but also “family”…
And finally, searching for 🍩︎ should return documents containing the 🍩︎ emoji with a higher rank than the ones that merely talk about desserts: the glyph itself must be a search criterion too.
Elasticsearch analyzer for emoji
The default analyzer in Elasticsearch is called standard. It behaves like this when given a text containing an emoji:
GET /_analyze?analyzer=standard
{
  "text": "Give me a 🍩︎ please."
}

{
  "tokens": [
    { "token": "give", ... },
    { "token": "me", ... },
    { "token": "a", ... },
    { "token": "please", ... }
  ]
}
The emoji is considered as “other” and is removed from the tokens. We are going to need a custom analyzer with a tokenizer that keeps 🍩︎, and whitespace is one of them.
As you can guess, it breaks your content on every whitespace. The produced tokens are Give, me, a, 🍩︎ and please. (with its trailing dot). We could think our job is done, but it’s not: notice the dot at the end of the last token? Punctuation is not removed by the whitespace tokenizer!
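The behaviour is easy to reproduce outside Elasticsearch. As a rough approximation only (Python's str.split() is not the Lucene tokenizer and ignores offsets and token length limits), splitting on whitespace gives the same token list:

```python
text = "Give me a \U0001F369 please."

# str.split() with no argument splits on runs of whitespace, which is roughly
# what the whitespace tokenizer does.
tokens = text.split()
print(tokens)
```

The emoji survives as its own token, and so does the dot glued to the last word.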
Sadly, there is no other tokenizer as smart as standard and as permissive as whitespace. If you feel like writing one, be my guest!
In the meantime, we are going to remove punctuation in our analyzer by adding two token filters. This is not ideal because we will not be very smart about it (sometimes punctuation is part of the token), but this is the best solution I could come up with:
"punctuation_filter": {
  "type": "pattern_replace",
  "pattern": "\\p{Punct}",
  "replacement": ""
},
"remove_empty_filter": {
  "type": "length",
  "min": 1
}
The first token filter removes any punctuation sign (that is: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~) and the second one removes empty tokens.
We also need to clean up modifiers and variation selectors at this stage, before the synonym filter runs, as some hidden characters can be stuck to the real emoji. For example, here is a “Smiling Face With Sunglasses” emoji: 😎︎.
It is encoded like this: \uD83D\uDE0E. Now if I add a variation selector to force the display as text, \uFE0E, my whole token in the analysis process will be \uD83D\uDE0E\uFE0E. The selector is not whitespace, so our whitespace tokenizer didn’t remove it. And the resulting token does not match our synonym either, so we need to get rid of it.
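The \uXXXX notation used here is the UTF-16 form of the characters, which is what Java (and therefore Lucene) regular expressions work with. A short Python sketch, with a hypothetical utf16_units helper, shows how the single code point U+1F60E becomes a surrogate pair and how the selector tags along:

```python
def utf16_units(text: str) -> list:
    """Return the UTF-16 code units of text as \\uXXXX strings."""
    raw = text.encode("utf-16-be")
    return [f"\\u{raw[i]:02X}{raw[i + 1]:02X}" for i in range(0, len(raw), 2)]


sunglasses = "\U0001F60E"              # SMILING FACE WITH SUNGLASSES
with_selector = sunglasses + "\uFE0E"  # same emoji, forced to text presentation

print("".join(utf16_units(sunglasses)))
print("".join(utf16_units(with_selector)))
```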
Here is the list of all the characters likely to be attached to an emoji, and that could break things for our synonym filter:
- \uFE0E: VARIATION SELECTOR-15 (force text representation);
- \uFE0F: VARIATION SELECTOR-16 (force graphic representation);
- \uD83C\uDFFB: EMOJI MODIFIER FITZPATRICK TYPE-1-2 (skin tone);
- \uD83C\uDFFC: EMOJI MODIFIER FITZPATRICK TYPE-3 (skin tone);
- \uD83C\uDFFD: EMOJI MODIFIER FITZPATRICK TYPE-4 (skin tone);
- \uD83C\uDFFE: EMOJI MODIFIER FITZPATRICK TYPE-5 (skin tone);
- \uD83C\uDFFF: EMOJI MODIFIER FITZPATRICK TYPE-6 (skin tone).
There is also \u200D, the ZERO WIDTH JOINER, which is used to merge compatible emoji. We are going to replace it with a space before tokenization. This way, we can index all the members of a grouped emoji separately.
Our final pattern now looks impressive!
"punctuation_filter": {
  "type": "pattern_replace",
  "pattern": "\\p{Punct}|\\uFE0E|\\uFE0F|\\uD83C\\uDFFB|\\uD83C\\uDFFC|\\uD83C\\uDFFD|\\uD83C\\uDFFE|\\uD83C\\uDFFF",
  "replacement": ""
}
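To check what such a pattern strips without spinning up a cluster, here is a rough Python equivalent. It is a sketch under assumptions: Python's re module has no \p{Punct} class, so string.punctuation (the same 32 ASCII characters) stands in for it, and the skin tone modifiers are written as the code point range U+1F3FB..U+1F3FF instead of surrogate pairs.

```python
import re
import string

# Character class: ASCII punctuation, the two variation selectors
# (U+FE0E, U+FE0F) and the five skin tone modifiers (U+1F3FB..U+1F3FF).
STRIP_PATTERN = re.compile(
    "[" + re.escape(string.punctuation) + "\uFE0E\uFE0F\U0001F3FB-\U0001F3FF]"
)


def clean_token(token: str) -> str:
    """Remove punctuation, variation selectors and skin tone modifiers."""
    return STRIP_PATTERN.sub("", token)


print(clean_token("please."))
# A thumbs up with a medium skin tone is reduced to the bare thumbs up.
print(clean_token("\U0001F44D\U0001F3FD") == "\U0001F44D")
```

A token made only of punctuation comes out empty, which is why the length filter is still needed afterwards.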
We then add the filter for our ZWJ and our emoji synonyms, and we are good to go!
PUT /en-emoji
{
  "settings": {
    "analysis": {
      "char_filter": {
        "zwj_char_filter": {
          "type": "mapping",
          "mappings": [
            "\\u200D=>\\u0020"
          ]
        }
      },
      "filter": {
        "english_emoji": {
          "type": "synonym",
          "synonyms_path": "analysis/cldr-emoji-annotation-synonyms-en.txt"
        },
        "punctuation_filter": {
          "type": "pattern_replace",
          "pattern": "\\p{Punct}|\\uFE0E|\\uFE0F|\\uD83C\\uDFFB|\\uD83C\\uDFFC|\\uD83C\\uDFFD|\\uD83C\\uDFFE|\\uD83C\\uDFFF",
          "replacement": ""
        },
        "remove_empty_filter": {
          "type": "length",
          "min": 1
        }
      },
      "analyzer": {
        "english_with_emoji": {
          "char_filter": ["zwj_char_filter"],
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "punctuation_filter",
            "remove_empty_filter",
            "english_emoji"
          ]
        }
      }
    }
  }
}
Of course, you should add stemming, stop words… as you please (look at the core analyzer for English if you need inspiration).
Our new english_emoji synonym token filter reads a file called analysis/cldr-emoji-annotation-synonyms-en.txt. I’m using the Solr format here to tell Elasticsearch that the French flag emoji translates to itself and to france.
# We use explicit mapping
# Because "dessert" is not supposed to index "donut".
🍩︎ => 🍩︎, dessert, donut, sweet
🇫︎🇷︎ => 🇫︎🇷︎, france
Here are some examples of what is going to be indexed with this sample:
GET /en-emoji/_analyze?analyzer=english_with_emoji
{
"text": "Eat dessert in 🇫︎🇷︎"
}
# eat dessert in 🇫︎🇷︎ france
GET /en-emoji/_analyze?analyzer=english_with_emoji
{
"text": "Eat dessert in france"
}
# eat dessert in france
GET /en-emoji/_analyze?analyzer=english_with_emoji
{
"text": "Give me a 🍩︎ please."
}
# give me a 🍩︎ dessert donut sweet please
So if I search for “France”, I get both the documents containing the word “France” and the ones containing the flag emoji!
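The whole analysis chain can be simulated in a few lines of Python to check the expected tokens. This is only a sketch under assumptions: parse_rules and analyze are hypothetical helpers, the two rules are the sample ones from above, string.punctuation stands in for \p{Punct}, and synonyms are flattened into a list whereas the real synonym filter emits them at the same position as the original token.

```python
import re
import string

# Two sample rules in the Solr explicit-mapping format of the synonyms file.
SYNONYM_RULES = """
# comment lines start with a hash
\U0001F369 => \U0001F369, dessert, donut, sweet
\U0001F1EB\U0001F1F7 => \U0001F1EB\U0001F1F7, france
"""

STRIP = re.compile(
    "[" + re.escape(string.punctuation) + "\uFE0E\uFE0F\U0001F3FB-\U0001F3FF]"
)


def parse_rules(rules: str) -> dict:
    """Parse 'source => target1, target2' lines into a dict."""
    synonyms = {}
    for line in rules.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        source, targets = line.split("=>")
        synonyms[source.strip()] = [t.strip() for t in targets.split(",")]
    return synonyms


def analyze(text: str, synonyms: dict) -> list:
    """Mimic the english_with_emoji analyzer, step by step."""
    text = text.replace("\u200D", " ")           # char filter: ZWJ -> space
    tokens = text.split()                         # whitespace tokenizer
    tokens = [t.lower() for t in tokens]          # lowercase filter
    tokens = [STRIP.sub("", t) for t in tokens]   # punctuation_filter
    tokens = [t for t in tokens if len(t) >= 1]   # remove_empty_filter
    output = []
    for token in tokens:                          # synonym filter
        output.extend(synonyms.get(token, [token]))
    return output


synonyms = parse_rules(SYNONYM_RULES)
print(" ".join(analyze("Give me a \U0001F369\uFE0E please.", synonyms)))
```

Because the mapping is explicit, the word dessert is left alone while the doughnut emoji expands to all of its annotations.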
As you may guess, the hardest part here is building the synonyms file. As I didn’t find any on the great internet, I started a repository where I provide synonym dictionaries for all the languages included in the Unicode Common Locale Data Repository!
Version 27 of the CLDR started including emoji annotations, but in a provisional state (not supposed to be used), and we are currently waiting for the stable version 29, with much better content.
You can get all of those emoji synonym dictionaries on GitHub, alongside the scripts used to build them.
I hope it becomes the “go to” destination for building an emoji-capable search in any language. You will also find more complete examples of Elasticsearch implementations there.
Highlight emoji in search results?
With our synonyms and tokenizer, highlighting will work as expected too:
GET en-emoji/tweet/_search
{
"query": {
"match": { "tweet": "donut" }
},
"highlight": {
"fields": { "tweet": {} }
}
}
The response contains:
"highlight": {
"tweet": [
"I love <em>🍩︎!</em>"
]
}
Twitter does not support this, now you do!
Support for emoticons
As I said earlier, emoticons can’t really be supported as they are only made of punctuation. But what if we used an Elasticsearch char_filter to translate :) to 😀︎ before any tokenization even takes place?
Yes, we would be able to search for “smile” in documents containing a simple punctuation smiley!
This can be done like this:
"char_filter": {
"emoticons_char_filter": {
"type": "mapping",
"mappings": [
":)=>😀︎",
":(=>😠︎"
]
}
}
This can be a nice addition to your emoji-enabled search engine! You can head over to GitHub to get a more complete, pre-configured list of emoticon-to-emoji mappings for your analyzers. Of course, the mapping from emoticon to emoji is subject to different interpretations and may need customization. The mappings I suggest are based on this package.
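The effect of such a mapping can be previewed with a plain string replacement. This is a simplified sketch: the two-entry dictionary is hypothetical, and the real mapping char filter matches greedily in a single pass, whereas this version applies the replacements one after the other (longest emoticon first).

```python
# Hypothetical two-entry mapping, mirroring the char_filter shown earlier.
EMOTICON_TO_EMOJI = {
    ":)": "\U0001F600",  # grinning face
    ":(": "\U0001F620",  # angry face
}


def replace_emoticons(text: str) -> str:
    """Replace each emoticon by its emoji, longest emoticons first."""
    for emoticon in sorted(EMOTICON_TO_EMOJI, key=len, reverse=True):
        text = text.replace(emoticon, EMOTICON_TO_EMOJI[emoticon])
    return text


print(replace_emoticons("Nice weather :)"))
```

Once the emoticon has become an emoji, the synonym filter treats it like any other emoji in the text.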
Some vendors also choose to store :alarm_clock: instead of ⏰︎ in their databases. The same recommendation applies: you have to introduce more context into your index to be able to search efficiently with this glyph.
Conclusion
Emoji search is as easy as using synonyms, and it can be a great addition to any website or product you may be building. You may think “who’s lazy enough to type 🍕︎ instead of pizza”, but it’s about much more than just search:
- you can now highlight emoji matching a text search;
- you can find documents similar to another document composed of emoji;
- you can add a real meaning to textual emoticons;
- you don’t just ignore such glyphs but instead can build on their real meaning.
Head over to the emoji-search repository to find all the synonyms and emoticons, and build a better search engine for your users. The analyzer we built here may not be the greatest, but it’s a start. If you have any suggestions, feel free to help!
I would like to finish with a Vulcan salute 🖖 (because yes, this is an official emoji); happy searching!
PS: Oh and if you need some French Elasticsearch consulting or training… We can help 😉.
Resources
- Emoji graphic representations from EmojiOne (which is awesome);
- The full emoji list from Unicode;
- Emojipedia.