How Elasticsearch can be used and optimized, and how to avoid common errors - you know, for search.
Elasticsearch is one of the most powerful indexing, search, and analytics engines built on Apache Lucene. It is an open-source project that quickly became popular among both startups and big companies such as Uber, Walmart, Audi and Netflix (https://www.elastic.co/customers/). But as with many other tools, great power comes with great responsibility.
That’s why, apart from mentioning what makes Elasticsearch so cool, we would like to use this blog post to share a few areas and issues that deserve special attention when you are seriously considering the ELK Stack for your solution.
What is ELK stack?
Elasticsearch is the central component of the ELK Stack (an acronym for Elasticsearch, Logstash and Kibana). The ELK Stack is a set of tools for data ingestion, transformation, storage, monitoring, reporting, analysis and visualization, which together form a comprehensive product used in many business scenarios – from infrastructure monitoring and security incident management to advanced analytical dashboards and enterprise search.
Elasticsearch can handle numerous types of data, including textual, numerical, geospatial, structured, semi-structured and even unstructured data! It is known for its scalability and flexibility, which allow it to process huge volumes of data in near real time, also in distributed, multi-tenant scenarios. It also provides an easy-to-use RESTful API (operating on schema-free JSON documents) and offers client libraries for a large number of programming languages. All these features make Elasticsearch a comprehensive tool that has found applications in many areas of life, including agriculture, education, energy, financial services, government, healthcare, technology and professional services.
ELK Stack – who is who?
Elasticsearch – the core of the Elastic Stack, a search and analytics engine
Kibana – frontend application that provides data visualization, management, and monitoring functionalities
Logstash – implements ingest pipelines that simultaneously collect and transform data – especially log files – and send them to the receiver
Beats – a platform for single-purpose data shippers (Filebeat, Metricbeat, Heartbeat etc.)
Elastic Stack use cases – hall of fame
The first version of Elasticsearch was released in 2010, and since then the project has grown and evolved, gaining recognition and publicity. Today, many well-known companies and institutions use the ELK Stack in their products and services, often without us even knowing about it!
Netflix, for example, uses several clusters, each consisting of hundreds of nodes gathered together. Pretty impressive, right? This architecture is partly used as the foundation of Netflix’s messaging platform (emails, push notifications, text messages) and to monitor trends among users or to analyze security logs/incidents.
Tinder relies on the Elastic Stack when it comes to analysis, visualization and prediction of user patterns and match preferences.
At the University of Oxford, a next-generation SIEM system was built with the ELK Stack. Finally, eBay uses Elasticsearch as its main search engine (over 800 million listings, with results returned in near real time) and Beats to monitor petabytes of logs per day. There are many more examples of companies using Elastic in everyday situations and products.
Be aware – a few areas that need attention when deploying Elasticsearch in your products
However, Elasticsearch is a complex piece of software, and despite its many advantages, it can cause problems if used or operated in the wrong way. (*We have prepared a short glossary at the end of the article to help you follow the terms used in this part.)
#1 – Configuration and setup
The first problems may already occur during the installation and configuration of a cluster. Because we are dealing with a distributed system, every detail of the cluster setup matters. Errors can have many causes, for example mismatched versions across nodes, incorrectly set variables, typos, the wrong deployment mode or faulty networking. Such problems cause nodes to fail bootstrap checks or to fail to form a stable cluster.
#2 – Index settings
This is not the end of the situations when Elasticsearch can give you a hard time. Issues related to index settings are reflected in cluster health (based on the state of its primary and replica shards). There are three statuses:
- green – all shards are assigned;
- yellow – all primary shards are assigned, but some replica shards are not – some data can become unavailable if a node fails;
- red – unassigned primary shards – some data is unreachable. This status means serious issues e.g., problems with space, disks or connectivity errors.
It is worth monitoring the health of the cluster and reacting whenever the status is anything other than green.
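The rules behind the three statuses can be illustrated with a minimal sketch in plain Python. This is not how Elasticsearch computes health internally; the `Shard` type and the `cluster_status` function are hypothetical names used only for illustration. On a real cluster, the status is reported by the `GET _cluster/health` API.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Shard:
    """Hypothetical, simplified view of a single shard copy."""
    index: str
    primary: bool   # True for a primary shard, False for a replica
    assigned: bool  # True if the shard is allocated to a node


def cluster_status(shards: Iterable[Shard]) -> str:
    """Return 'green', 'yellow' or 'red' following the rules described above."""
    status = "green"
    for shard in shards:
        if shard.assigned:
            continue
        if shard.primary:
            return "red"   # an unassigned primary: some data is unreachable
        status = "yellow"  # an unassigned replica: reduced redundancy
    return status


shards = [
    Shard("logs", primary=True, assigned=True),
    Shard("logs", primary=False, assigned=False),
]
print(cluster_status(shards))  # yellow
```

A monitoring job could poll the real health endpoint and raise an alert on anything other than green, which is the reaction recommended above.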
#3 – Security
An important aspect of working with the Elastic Stack is security, which has been enabled by default since version 8.0. Earlier versions do not have the security layer enabled by default, so if you have not set it up and your Elasticsearch is deployed in production mode, you could be in serious trouble! If your search engine is not secured, unauthorized people may gain access to your data and search, modify or even delete it. A secured Elasticsearch can be accessed only by specified users, and it is good practice to create users and give them only the roles necessary for the activities they perform. For example, the user responsible for snapshots should only have access to snapshot-related roles. There are many cluster and index privileges to choose from, and it’s worth taking the time to set them up. Safety first.
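To make the snapshot example more concrete, here is a hedged sketch of the request bodies for a least-privilege role and a user that holds only that role. The role name, user name and password placeholder are hypothetical. The sketch only builds and prints the JSON bodies and does not contact a cluster; the privilege names `create_snapshot` and `monitor_snapshot` should be checked against the security documentation for your version.

```python
import json

# Hypothetical names, used only for illustration.
ROLE_NAME = "snapshot_operator"
USER_NAME = "backup_bot"

# Body for PUT /_security/role/<ROLE_NAME>:
# cluster-level snapshot privileges only, no index privileges at all.
role_body = {
    "cluster": ["create_snapshot", "monitor_snapshot"],
    "indices": [],
}

# Body for PUT /_security/user/<USER_NAME>: the user receives only this role.
user_body = {
    "password": "<set-a-strong-password>",  # placeholder, never hard-code real secrets
    "roles": [ROLE_NAME],
}

print(f"PUT /_security/role/{ROLE_NAME}")
print(json.dumps(role_body, indent=2))
print(f"PUT /_security/user/{USER_NAME}")
print(json.dumps(user_body, indent=2))
```

Keeping the `indices` list empty means this user cannot read or modify documents, which matches the idea of granting only what the task requires.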
#4 – Performance
As data volume grows, more stability and uptime issues occur, affecting the performance of the ELK Stack. For example, the larger the index, the greater the chance that some of its shards will fail, resulting in incomplete or missing results. As the amount of data increases, more resources are needed to maintain the current level of performance; otherwise, problems become more frequent and more serious. Adding another node on a separate virtual machine to the cluster (horizontal scaling) can partially solve such issues. Even adjusting a few settings (e.g., heap size) can improve the overall performance of the application.
If you have reviewed your cluster settings and made the necessary adjustments, and the performance of your application still leaves much to be desired, then the problem may lie in the way the data is processed.
#5 – Mappings and text analyzers
Incorrect mapping or a faulty text analyzer can affect the quality of the results returned by Elasticsearch. Mapping defines how a document and the fields it contains are stored and indexed: for example, it determines which text fields should be analyzed, allows for custom date formats and specifies which fields contain numbers, dates, or geolocations. Analyzers are used to process text when indexing or searching text fields. They allow Elasticsearch to return relevant results rather than only exact matches. An analyzer consists of zero or more character filters (operations on individual characters), exactly one tokenizer (which converts a string into a stream of individual tokens) and zero or more token filters (operations on single tokens). A well-chosen mapping together with an appropriate analyzer creates the necessary foundation for the correct operation of the application. You can also implement your own analyzer – it’s not that hard! Fields that are not analyzed, such as keyword fields, store the original text (which may be normalized, but not tokenized or filtered).
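The three stages of an analyzer (character filter, tokenizer, token filter) can be mimicked in a few lines of plain Python. This is only a conceptual sketch and not Elasticsearch code; all function names are hypothetical and the regular expressions are far simpler than the built-in components.

```python
import re
import unicodedata
from typing import Callable, List

CharFilter = Callable[[str], str]
Tokenizer = Callable[[str], List[str]]
TokenFilter = Callable[[List[str]], List[str]]


def strip_html(text: str) -> str:
    """Character filter: replace anything that looks like an HTML tag with a space."""
    return re.sub(r"<[^>]+>", " ", text)


def word_tokenizer(text: str) -> List[str]:
    """Tokenizer: split the string into runs of word characters."""
    return re.findall(r"\w+", text)


def lowercase(tokens: List[str]) -> List[str]:
    """Token filter: lowercase every token."""
    return [token.lower() for token in tokens]


def fold_accents(tokens: List[str]) -> List[str]:
    """Token filter: drop combining accents, e.g. 'é' becomes 'e'."""
    def fold(token: str) -> str:
        decomposed = unicodedata.normalize("NFKD", token)
        return "".join(c for c in decomposed if not unicodedata.combining(c))
    return [fold(token) for token in tokens]


def analyze(text: str, char_filters: List[CharFilter],
            tokenizer: Tokenizer, token_filters: List[TokenFilter]) -> List[str]:
    """Run the stages in order: character filters, tokenizer, token filters."""
    for char_filter in char_filters:
        text = char_filter(text)
    tokens = tokenizer(text)
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens


tokens = analyze("<b>Crème</b> Brûlée recipes",
                 [strip_html], word_tokenizer, [lowercase, fold_accents])
print(tokens)  # ['creme', 'brulee', 'recipes']
```

Because the same pipeline is applied to the indexed text and to the search input, a search for "creme brulee" can match a document containing "Crème Brûlée", which is the kind of relevant but non-exact match described above.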
#6 – Queries
Elasticsearch has its own language for querying data (Query DSL), which returns results based on relevance scoring (a measure that tells us how well a document matches a search). A few rules are worth keeping in mind when implementing search queries; they will help you write correct queries and may improve the performance of your application:
- remember that clauses in the query context affect the score, while clauses in the filter context do not; the returned results may differ depending on the context used;
- there are dedicated queries for analyzed fields (full-text queries) and others for non-analyzed fields (term-level queries); a term-level query does not analyze the search term, so running it against an analyzed field often returns nothing;
- different queries vary in computational complexity; look for a different approach if a query is not fast enough;
- use pagination to return only the number of documents needed at a time;
- the more complex the query, the greater the chance of error; pay attention to the logical operators you use, because they matter too.
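Several of these rules can be combined in a single request body. The following sketch builds a `bool` query with one clause in the query context and one in the filter context, plus `from`/`size` pagination. The field names `title` and `status` and the helper `build_search_body` are hypothetical; the code only constructs the JSON body and does not send it to a cluster.

```python
import json


def build_search_body(text: str, status: str, page: int, page_size: int = 10) -> dict:
    """Build a search body with a scoring clause, a non-scoring clause and paging."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    return {
        "from": (page - 1) * page_size,  # offset of the first hit to return
        "size": page_size,               # number of hits per page
        "query": {
            "bool": {
                # Query context: full-text match on an analyzed field, affects the score.
                "must": [{"match": {"title": text}}],
                # Filter context: exact term on a non-analyzed field, no scoring.
                "filter": [{"term": {"status": status}}],
            }
        },
    }


body = build_search_body("wireless headphones", "published", page=3)
print(json.dumps(body, indent=2))
```

The resulting dictionary can be sent as the body of a `_search` request. For very deep result sets, `from`/`size` paging becomes expensive, so it is worth checking the documentation for alternatives such as `search_after`.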
A properly selected query, mapping and analyzer are the foundation of success. Together, they make your application return the relevant data that its users care about most. After all, that’s what it’s all about. In addition, if properly configured, they allow you to implement many interesting features such as autocompletion, nested documents, synonyms, similar documents, etc. You can also index the same field in different ways (e.g., ignoring case, accents, or stop words) depending on your needs.
Elastic Stack is a powerful tool that, when used properly, will become your ally – you know, for search. Thanks to its multitude of built-in features, it enables the implementation of complex solutions. However, working with Elasticsearch can sometimes be challenging. If you are facing problems with your Elasticsearch cluster, want to conduct an audit of your solution, or are looking for an experienced partner to build a solution using the ELK Stack, contact us at [email protected].
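Indexing the same field in different ways is usually done with multi-fields in the mapping. The sketch below builds a possible index creation body: `title` is indexed as regular analyzed text, as an exact `keyword` value, and as text processed by a custom analyzer that ignores case, accents and stop words. The field name, sub-field names and the analyzer name `folded_text` are hypothetical, and the code only prints the JSON rather than creating an index.

```python
import json

index_body = {
    "settings": {
        "analysis": {
            "analyzer": {
                # Hypothetical custom analyzer: ignores case, accents and stop words.
                "folded_text": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "asciifolding", "stop"],
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "title": {
                "type": "text",  # analyzed with the default analyzer
                "fields": {
                    # Same source value, analyzed with the custom analyzer.
                    "folded": {"type": "text", "analyzer": "folded_text"},
                    # Same source value, kept as one exact, non-analyzed term.
                    "raw": {"type": "keyword"},
                },
            }
        }
    },
}

print(json.dumps(index_body, indent=2))
```

With such a mapping, a full-text query could target `title` or `title.folded`, while sorting, aggregations and term-level queries could use `title.raw`.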
Elasticsearch – mini glossary
To understand how Elasticsearch works, it helps to know a few basic terms and concepts:
- Cluster – a collection of one or more connected nodes. It is responsible for data redundancy, horizontal scaling and distribution of processes across all nodes in the cluster;
- Node – a single Elasticsearch instance; there are many types of nodes with different sets of roles, e.g., master (cluster management), data (storing indices; data-related operations after indexing) and ingest (data processing and enrichment; activities before indexing). One node can perform multiple roles;
- Index – a logical structure containing a collection of similar/related documents, e.g., movie reviews or product information. It is composed of shards (primary – an original piece of the index; replica – a copy of a primary shard, which allows for data replication and redundancy and also increases search throughput);
- Document – the basic data unit stored in Elasticsearch (serialized to JSON format). It is composed of fields (key-value pairs), each of which has its own data type. The same field can be indexed and searched in different ways using a wide set of mapping properties and text analyzers;
- Inverted index – a hashmap-like data structure that maps each unique search term (words, numbers, phrases etc.) to its locations in all documents containing it. This structure was a real breakthrough in search engine implementation, as it allows for very fast (near real time) full-text searches even on large datasets. Elasticsearch uses multiple distributed inverted indexes by default.
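The idea of an inverted index can be shown with a toy example in plain Python. This is a simplified sketch, not Lucene's actual data structure: it splits text on whitespace, stores only document ids (no positions) and uses hypothetical function names.

```python
from collections import defaultdict
from typing import Dict, Set


def build_inverted_index(docs: Dict[int, str]) -> Dict[str, Set[int]]:
    """Map every lowercased term to the set of document ids that contain it."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return dict(index)


def search(index: Dict[str, Set[int]], query: str) -> Set[int]:
    """Return ids of documents that contain every term of the query (AND semantics)."""
    postings = [index.get(term, set()) for term in query.lower().split()]
    return set.intersection(*postings) if postings else set()


docs = {
    1: "fast search engine",
    2: "distributed search and analytics",
    3: "fast analytics",
}
index = build_inverted_index(docs)
print(sorted(index["analytics"]))            # [2, 3]
print(sorted(search(index, "fast search")))  # [1]
```

A lookup touches only the entries for the query terms instead of scanning every document, which is why this structure stays fast as the dataset grows.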