How OER World Map determines search result order

OER World Map collects a lot of data. This is essential for making data centrally available, but as more is collected, the difficulty of finding a specific item increases, regardless of license or data content. Therefore, as data in OER World Map increases, it is very important to implement efficient and targeted search and ranking algorithms.

Some search algorithms are impressive in their complexity, efficiency and confidentiality; the world's major search engines are clear examples. Of course, a relatively small non-profit project such as the OER World Map cannot develop such a complex search algorithm from its own resources. Nor would this be desirable, because the platform is built on the principles of transparency and openness.

Why is transparency so important?

We assume that users are able to comprehend the search behaviour and what caused the ranking of a given search result. As long as they also feel that no topics, authors, vendors, interests or similar parameters are preferred in determining the rankings, they can trust the result. However, once parts of the algorithm are hidden in the proverbial ‘black box’, there is at least a theoretical possibility that some searchable items might receive preferential treatment (or be discriminated against).

Like the entire code of the OER World Map, our ranking mechanism is implemented as open source. In this way, the OER World Map demonstrates that the same rules and conditions are applied to all resources (services, organizations, people etc.), and that no resource is treated differently from any other.

Of course, every search algorithm includes factors that give individual results a higher weight; otherwise there could be no ordered ranking at all. These factors simply do not depend on specific content, but on universal features such as morphological matching or the length of an entry. The most important constituents of the search ranking are described below (as of September 2016).

The code of the OER World Map

Search in the OER World Map is based on Elasticsearch, which also serves as the main container for data storage. Elasticsearch is an open source search engine based on Apache Lucene. It allows the search mechanisms to be configured via a JSON file, which is called index-config.json in the OER World Map. In this file you can define whether and how individual data fields should be searchable. Currently, Elasticsearch is configured as follows:

  • “name” and “alternateName” are both indexed in their original spelling and in variants, so that searches containing typos can still produce the intended hits.
  • All other fields are indexed in their standard format (as written in the database).
  • From the data model point of view, all resources can be associated with addresses and geo-coordinates.

Within the OER World Map, a search command to Elasticsearch is triggered by the method esQuery() in the Java class ElasticsearchRepository. The following parameters can be controlled by this method:

  • Field boost: the field boost determines which data fields get more weight in the search. Typically, the “name” field in particular is boosted strongly. “alternateName”, for example, can also be boosted, though somewhat less. (The concrete boost values are listed below.)
  • Limitation to a specific partial result: to page through multiple pages of search results, it is useful to display only part of the result list, for example only hits “1 to 10” or “11 to 20”.
  • Sort order: in very special cases, it may make sense to display search results in ascending order, meaning that the results with the lowest score are listed at the top. The OER World Map and Elasticsearch allow both ascending and descending order. The default in the OER World Map is “descending”.
  • Geo-filtering: for completeness, it should be mentioned that search results can be omitted entirely from the results list due to geo-filtering. While the source code of this feature is already written, it is currently not yet activated. Once it is activated, a user can limit the search to a specific geographical area (by displaying a particular map section), and results from outside this area will not appear in the results list.
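
The paging, sort order and geo-filter options described above map naturally onto an Elasticsearch request body. The following Python sketch is not the project's Java code; it only illustrates, under stated assumptions, what such a request might look like. The function name build_search_body, its parameters and the geo field name location.geo are hypothetical.

```python
import json

# Hypothetical name of the geo-point field; the real field name is defined
# in the project's index configuration and may differ.
GEO_FIELD = "location.geo"


def build_search_body(query_text, page=1, page_size=10, order="desc",
                      bounding_box=None):
    """Build an Elasticsearch request body for one page of results."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be positive")
    if order not in ("asc", "desc"):
        raise ValueError("order must be 'asc' or 'desc'")

    # Full-text part of the query.
    query = {"query_string": {"query": query_text}}

    # Optional geo filter: hits outside the box are dropped entirely
    # instead of merely being ranked lower.
    if bounding_box is not None:
        query = {
            "bool": {
                "must": query,
                "filter": {"geo_bounding_box": {GEO_FIELD: bounding_box}},
            }
        }

    return {
        "from": (page - 1) * page_size,  # offset of the first hit on the page
        "size": page_size,               # number of hits per page
        "sort": [{"_score": {"order": order}}],
        "query": query,
    }


if __name__ == "__main__":
    # Second page (hits 11 to 20), best matches first.
    print(json.dumps(build_search_body("open textbooks", page=2), indent=2))
```

A bounding box would be passed as a dictionary with "top_left" and "bottom_right" corners, each holding "lat" and "lon" values, for example taken from the currently visible map section.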

The global preferences of the OER World Map for field boosting are located in the file search.conf. At present, boosting provides the following weighting of fields:

  • “name” by a factor of 9
  • “alternateName” by a factor of 6
  • “provider.name” by a factor of 5
  • “provider.alternateName” by a factor of 4
  • “agent.name” by a factor of 4
  • “agent.alternateName” by a factor of 3
  • “participant.name” by a factor of 2
  • “participant.alternateName” by a factor of 1
  • “memberOf.name” by a factor of 1
  • “memberOf.alternateName” by a factor of 1
  • “member.name” by a factor of 1
  • “member.alternateName” by a factor of 1
  • “articleBody” by a factor of 1
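
One common way to pass such weights to Elasticsearch is the caret notation in the fields list of a query_string query, for example name^9. The Python sketch below assumes that approach; it is not taken from the project's source. The dictionary FIELD_BOOSTS and both helper names are hypothetical, and the last field is written as articleBody on the assumption that this is the actual field name.

```python
# Boost factors as listed above (as of September 2016).
FIELD_BOOSTS = {
    "name": 9,
    "alternateName": 6,
    "provider.name": 5,
    "provider.alternateName": 4,
    "agent.name": 4,
    "agent.alternateName": 3,
    "participant.name": 2,
    "participant.alternateName": 1,
    "memberOf.name": 1,
    "memberOf.alternateName": 1,
    "member.name": 1,
    "member.alternateName": 1,
    "articleBody": 1,  # assumed spelling of the field name
}


def boosted_fields(boosts):
    """Turn {"name": 9, ...} into ["name^9", ...] for Elasticsearch."""
    return [f"{field}^{boost}" for field, boost in boosts.items()]


def build_boosted_query(query_text, boosts=FIELD_BOOSTS):
    """Build a query_string query that searches all boosted fields."""
    return {
        "query": {
            "query_string": {
                "query": query_text,
                "fields": boosted_fields(boosts),
            }
        }
    }


if __name__ == "__main__":
    print(build_boosted_query("open textbooks"))
```

With these values, and all else being equal, a match in “name” contributes nine times as much to a resource's score as the same match in “member.name”.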

Outlook

Due to the continuous development of the OER World Map, details such as boosting factors are going to evolve over time. New search fields might be added, or existing ones removed. It is envisaged that there will be an additional weighting based on “likes” (or some other voting system). The number of links to a resource is a desirable weighting parameter as well. In any case, the quality and reliability of the OER World Map will always be measured by whether search remains transparent and even-handed. OER World Map users can always check and be certain that search results are determined fairly and reasonably.

The code of the OER World Map is hosted on GitHub. For more specific questions, the team of the OER World Map would first refer you to the source code, but is also very happy to answer questions!