
Federated Content Search (CLARIN-FCS): Technical Details

The CLARIN Federated Content Search (CLARIN-FCS) introduces an interface specification that decouples search engine functionality from the applications that use it, and defines data formats for structuring standardized query results (so-called Data Views).

This makes it possible to build user interfaces that access heterogeneous search engines in a uniform way. The most important publicly available user interface is the FCS Aggregator, a search engine for language resources hosted at a variety of institutions.

More general information about the FCS can be found here.

 

Technical Functionality

The CLARIN-FCS specification defines a set of capabilities, an extensible result format, and a set of required operations. CLARIN-FCS is built on the SRU/CQL standard, which is maintained by the Library of Congress and the OASIS standardization consortium. Additional functionality required for CLARIN-FCS is added through SRU/CQL's extension mechanisms.
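Because CLARIN-FCS rides on SRU, a request to an Endpoint is an ordinary HTTP URL carrying SRU parameters. The following minimal sketch shows how such URLs might be assembled. The endpoint URL is hypothetical, the helper `build_sru_url` is invented for illustration, and the extension parameter `x-fcs-endpoint-description` is given as recalled from the Core specification and should be checked against it.

```python
from urllib.parse import urlencode


def build_sru_url(endpoint_url, operation, params=None, extensions=None, version="1.2"):
    """Assemble an SRU request URL (hypothetical helper, not part of any CLARIN library).

    `params` holds standard SRU parameters such as `query` or `maximumRecords`;
    `extensions` holds extra request parameters, which SRU expects to start with "x-".
    """
    query = {"operation": operation, "version": version}
    query.update(params or {})
    for name, value in (extensions or {}).items():
        if not name.startswith("x-"):
            raise ValueError(f"extension parameter must start with 'x-': {name}")
        query[name] = value
    return endpoint_url + "?" + urlencode(query)


# Hypothetical endpoint address, used only for illustration.
ENDPOINT = "https://example.org/fcs-endpoint"

# A searchRetrieve request carrying a CQL query.
search_url = build_sru_url(
    ENDPOINT,
    "searchRetrieve",
    params={"query": '"language resource"', "startRecord": 1, "maximumRecords": 10},
)

# An explain request that uses an extension parameter (assumed name, see lead-in).
explain_url = build_sru_url(
    ENDPOINT,
    "explain",
    extensions={"x-fcs-endpoint-description": "true"},
)

print(search_url)
print(explain_url)
```

Keeping standard and extension parameters in separate arguments mirrors the way CLARIN-FCS layers its additions on top of plain SRU/CQL.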

Specifically, the CLARIN-FCS specification consists of two components: a set of formats and a transport protocol. The Endpoint is a software component that acts as a bridge between a Client, which sends these formats using the transport protocol, and a Search Engine. The Search Engine is a custom software component that allows language resources at a specific institution to be searched. The Endpoint implements the transport protocol and mediates between the CLARIN-FCS-specific formats and the idiosyncrasies of the Search Engines at the individual institutions.

The following figure illustrates the overall architecture:

 

FCS Overall Architecture with Search Portal / Aggregator on Top, sending requests to the FCS endpoints which themselves translate the requests to their own formats to speak to their own (internal) search engines. The FCS Endpoints are with the institutions while the Aggregator is part of the CLARIN infrastructure.
 

 

In general, the workflow in CLARIN-FCS is as follows:

  1. A Client submits a query to an FCS Endpoint.
  2. The Endpoint translates the query from CQL to the query dialect used by the Search Engine and submits the translated query to the Search Engine.
  3. The Search Engine processes the query and generates a result set, i.e. it compiles a set of hits that match the search criterion.
  4. The Endpoint then translates the results from the Search Engine-specific result set format to the CLARIN-FCS result format and sends them to the Client.
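
The four steps above can be sketched as a toy program. Everything in it is hypothetical: `ToySearchEngine` and its `FIND <word>` dialect stand in for an institution's real search engine, `ToyEndpoint` handles only a single quoted CQL term, and the `Hit` class is a simplified stand-in for the CLARIN-FCS result format. None of these names come from the CLARIN libraries.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """Simplified, hypothetical stand-in for one hit in the FCS result format."""
    left: str
    match: str
    right: str


class ToySearchEngine:
    """Hypothetical search engine with its own query dialect: 'FIND <word>'."""

    def __init__(self, sentences):
        self.sentences = sentences

    def run(self, native_query):
        command, _, word = native_query.partition(" ")
        if command != "FIND" or not word:
            raise ValueError(f"unsupported native query: {native_query!r}")
        raw_hits = []
        for sentence in self.sentences:
            tokens = sentence.split()
            for i, token in enumerate(tokens):
                if token.lower() == word.lower():
                    # Engine-specific result format: (left tokens, match, right tokens).
                    raw_hits.append((tokens[:i], token, tokens[i + 1:]))
        return raw_hits


class ToyEndpoint:
    """Mediates between the Client-facing formats and the engine's dialect."""

    def __init__(self, engine):
        self.engine = engine

    def translate_query(self, cql):
        # Step 2: translate (a tiny subset of) CQL into the engine dialect.
        term = cql.strip().strip('"')
        if not term or " " in term:
            raise ValueError("this sketch supports single-term queries only")
        return "FIND " + term

    def translate_results(self, raw_hits):
        # Step 4: convert engine-specific hits into the common result format.
        return [Hit(" ".join(left), match, " ".join(right))
                for left, match, right in raw_hits]

    def search(self, cql):
        native_query = self.translate_query(cql)   # step 2
        raw_hits = self.engine.run(native_query)   # step 3
        return self.translate_results(raw_hits)    # step 4


# Step 1: a Client submits a query to the Endpoint.
engine = ToySearchEngine(["The quick brown fox", "A brown bear sleeps"])
endpoint = ToyEndpoint(engine)
results = endpoint.search('"brown"')
for hit in results:
    print(f"{hit.left} [{hit.match}] {hit.right}")
```

A real Endpoint would parse full CQL (or FCS-QL) and serialize its results as SRU records; the sketch only shows where the two translation steps sit.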
     

Specifications

CLARIN-FCS is defined in two specifications: the Core specification and the supplementary Data View specification. The former defines the general framework; the latter defines additional Data Views, which allow Endpoints to provide resources in more detailed formats.

The CLARIN FCS Core specification is currently available in version 2.0 and can be downloaded as a PDF document. The outdated CLARIN FCS Core specification version 1.0 is available here.

The CLARIN FCS Data View specification defines additional ways to represent resources. It is currently available in version 1.0 and can be downloaded here.
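
To give a feel for what a Data View looks like on the Client side, the following minimal sketch extracts the highlighted hits from a small hand-written fragment modeled on the basic hits Data View. The namespace URIs, the MIME type, and the element names in `SAMPLE` are written from memory and should be treated as assumptions to verify against the specifications; `extract_hits` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Assumed namespace of the hits Data View; verify against the specification.
HITS_NS = "http://clarin.eu/fcs/dataview/hits"

# Hand-written sample fragment, not taken from a real endpoint.
SAMPLE = (
    '<fcs:DataView xmlns:fcs="http://clarin.eu/fcs/resource" '
    'type="application/x-clarin-fcs-hits+xml">'
    '<hits:Result xmlns:hits="http://clarin.eu/fcs/dataview/hits">'
    "The quick <hits:Hit>brown</hits:Hit> fox jumps over the "
    "<hits:Hit>lazy</hits:Hit> dog."
    "</hits:Result>"
    "</fcs:DataView>"
)


def extract_hits(xml_text):
    """Return (full text, list of highlighted hit strings) for one Data View."""
    root = ET.fromstring(xml_text)
    result = root.find(f"{{{HITS_NS}}}Result")
    if result is None:
        return "", []
    # itertext() walks text and tail nodes in document order.
    full_text = "".join(result.itertext())
    hits = [element.text or "" for element in result.findall(f"{{{HITS_NS}}}Hit")]
    return full_text, hits


text, hits = extract_hits(SAMPLE)
print(text)
print(hits)
```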

Miscellaneous resources such as XML schema definitions and examples can be found on GitHub.

 

Helpful Applications & Software Libraries

Search Engine / FCS Aggregator

The FCS Aggregator is CLARIN’s central content search engine. Its source code is available on GitHub.

Software Libraries

The following libraries provide reference implementations of the SRU/CQL and CLARIN-FCS protocols. They are hosted on GitHub.

  • FCS-QL: An implementation of the CLARIN-FCS Core 2.0 query language grammar and parser.
  • SRUServer: An SRU/CQL server implementation conforming to SRU/CQL protocol versions 1.1, 1.2, and (partially) 2.0.
  • SRUClient: An SRU/CQL client implementation conforming to SRU/CQL protocol versions 1.1, 1.2, and (partially) 2.0.
  • FCSSimpleEndpoint: A simple CLARIN-FCS endpoint reference implementation to ease the development of CLARIN-FCS endpoints. Developed by the IDS.
  • FCSSimpleClient: A CLARIN-FCS client implementation supporting Legacy-FCS, FCS 1.0, and (partially) FCS 2.0.

Artifacts are available via CLARIN Nexus.

Testing Endpoint Conformance

Compliance of an endpoint with the CLARIN-FCS specification can be evaluated with the SRU/CQL Conformance Tester (login required). Its source code is available on GitHub.

List of Available Endpoints

The full list of implemented endpoints is available at the CLARIN Centre Registry.

 

Various Implementations

In most cases, the implementing data provider only builds a simple wrapper service that translates between the CLARIN-FCS/SRU protocol and the endpoint's software. However, there are efforts to provide default wrappers (or at least sample implementations) for individual persistence systems such as SQL databases or XML databases.

Non-exhaustive list of implementations:

  • The Korp FCS 2.0 reference endpoint implementation, which uses the Java SRU/FCS libraries listed above.
  • Based on the IDS library, Alex Kislev has implemented a CQP/SRU bridge: any CQP-indexed corpus can be integrated quite easily into the CLARIN Federated Content Search.
  • A PHP implementation of an FCS 2.0 endpoint by the Austrian Centre for Digital Humanities & Cultural Heritage (ARCHE).
  • OCLC announced oclcsrw, an open-source implementation of an SRU 1.2 server that exposes a database interface, allowing implementers to expose their databases via SRU 1.2. Database implementations are separately available for Apache Lucene and DSpace.
  • The ICLTT (Vienna) is developing corpus_shell, a modular framework for publishing heterogeneous distributed language resources that builds on FCS. The system currently contains prototype implementations of an FCS wrapper for MySQL databases (in PHP) and for the DDC search engine (in Perl). An eXist/XQuery-based solution is also being developed, but its code has been moved from corpus_shell to SADE as a module. These implementations are work in progress and do not yet fully conform to the FCS specifications.
  • KonText is an advanced corpus query interface and corpus data integration platform built around the corpus search engine Manatee-open. It supports the CLARIN FCS in version 1.0.