
from zero to elasticsearch in a jiffy

In this article I will cut to the chase and show you how to start using elasticsearch, a distributed search engine that just works. You know, for Search!

Elasticsearch (ES) is based on Lucene; as such, you first need to download the Java Runtime Environment. I’ll assume you have installed it at C:\Dev\Java\jre6.

Next, download elasticsearch from:

http://www.elasticsearch.org/download/

And decompress it, e.g., at C:\Dev\elasticsearch-0.14.4.

Open a Windows Command Prompt window, go into the ES directory, and launch it:

cd c:\Dev\elasticsearch-0.14.4
set JAVA_HOME=C:\Dev\Java\jre6
bin\elasticsearch.bat -f

In a new Windows Command Prompt window, verify that ES is responding correctly. First check the cluster status by opening the following URL:

http://localhost:9200/_cluster/health

You should see something like:

{
  cluster_name: "elasticsearch",
  status: "green",
  timed_out: false,
  number_of_nodes: 1,
  number_of_data_nodes: 1,
  active_primary_shards: 0,
  active_shards: 0,
  relocating_shards: 0,
  initializing_shards: 0,
  unassigned_shards: 0
}
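NB if your browser shows the JSON all on one line, append the pretty=true parameter (the same one we will use later with curl) and ES will format the response for you:

http://localhost:9200/_cluster/health?pretty=true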

And also check the cluster state at:

http://localhost:9200/_cluster/state

Which should return something like:

{
  cluster_name: "elasticsearch",
  master_node: "Sz86lynqSAKwb1X_NHsrIQ",
  blocks: {},
  nodes: {
    Sz86lynqSAKwb1X_NHsrIQ: {
      name: "Dredmund Druid",
      transport_address: "inet[/192.168.1.245:9300]",
      attributes: {}
    }
  },
  metadata: {
    templates: {},
    indices: {}
  },
  routing_table: {
    indices: {}
  },
  routing_nodes: {
    unassigned: [],
    nodes: {}
  },
  allocations: []
}

As ES seems to be working fine, we are now ready to index some documents; I’m going to use a tweet as an example.

Open a Bash shell window (I’m assuming you already have it installed as described at Sane shell environment on Windows).

You also need curl on your PATH; you can install it from:

http://curl.haxx.se/download/libcurl-7.19.3-win32-ssl-msvc.zip

Download a tweet:

curl -o 19529810560688128.json 'http://api.twitter.com/1/statuses/show/19529810560688128.json?include_entities=1'

And check its overall structure:

cat 19529810560688128.json
{
  "text": "why marking an operation as idempotent is important: http:\/\/www.zeroc.com\/faq\/whyIdempotent.html"
...

NB I’m only showing the attribute that is relevant to the search we are going to make. The actual tweet has many more attributes.

Let’s index it:

curl -XPUT -d @19529810560688128.json http://localhost:9200/tweets/tweet/19529810560688128

ES should return something like:

{
  "_id": "19529810560688128",
  "_index": "tweets",
  "_type": "tweet",
  "ok": true
}
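Before searching, we can sanity-check that the document was actually stored; ES has a simple GET API for that, and since it automatically inferred a mapping for our tweet, we can peek at that too:

curl 'http://localhost:9200/tweets/tweet/19529810560688128?pretty=true'
curl 'http://localhost:9200/tweets/_mapping?pretty=true'

The first returns the document wrapped in _index/_type/_id metadata; the second shows the attribute types ES guessed from the JSON.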

We are now ready for our first search! Let’s search for tweets that contain the word “idempotent”. For this we use the q parameter:

curl 'http://localhost:9200/tweets/_search?pretty=true&q=idempotent'

Which returns something like:

{
  "_shards": {
    "failed": 0,
    "successful": 5,
    "total": 5
  },
  "hits": {
    "hits": [
      {
        "_id": "19529810560688128",
        "_index": "tweets",
        "_score": 0.023731936,
        "_source": {
          "text": "why marking an operation as idempotent is important: http:\/\/www.zeroc.com\/faq\/whyIdempotent.html"
          ...
        },
        "_type": "tweet"
      }
    ],
    "max_score": 0.023731936,
    "total": 1
  },
  "took": 1
}

The most important attribute in this result set is the hits.hits array; it contains all the matched document hits.
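As an aside, when you only care about how many documents match, and not the hits themselves, there is also a _count endpoint that accepts the same q parameter:

curl 'http://localhost:9200/tweets/_count?q=idempotent'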

The previous search looked inside all the tweet attributes. We can also search inside a particular attribute, for example, the “text” attribute:

curl 'http://localhost:9200/tweets/_search?pretty=true&q=text:idempotent'

Which should return the same document as before, but this time with a higher score (by default q searches the catch-all _all attribute, which mashes all the attributes together; a match inside the much shorter text attribute weighs more).

ES lets us do much more advanced searches; we can use the simple Lucene Query Parser Syntax (as we did) and we can also use the JSON-based search query DSL.
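For example, the previous attribute search could be expressed in the query DSL as something like this (a minimal sketch using a term query, which matches a single analyzed term; the query goes in the request body):

curl -XPOST 'http://localhost:9200/tweets/_search?pretty=true' -d '{
  "query": {
    "term": { "text": "idempotent" }
  }
}'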

And that’s it! As you can see, the out-of-the-box experience is quite simple and straightforward! Of course, this was just the tip of the iceberg. elasticsearch has many tricks up its sleeve; my favorites are: simplicity, automatic distribution (just keep launching nodes, one or several, it’s all the same; it’s really elastic), high availability (data is replicated; if a node fails, the data will be served from another node) and the Thrift transport (at least one order of magnitude faster than HTTP).
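If you want to see the elastic part with your own eyes, here is a quick experiment (assuming the default multicast discovery works on your network): open yet another Windows Command Prompt window and start a second node exactly like the first one:

cd c:\Dev\elasticsearch-0.14.4
set JAVA_HOME=C:\Dev\Java\jre6
bin\elasticsearch.bat -f

The new node should automatically join the cluster; open http://localhost:9200/_cluster/health again and number_of_nodes should now be 2, with the tweets index shards spread over both nodes.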

You should read the documentation and join the community!

Also check out the various client libraries, and the source code.

Oh, and in case you were wondering, the node names that appear in the logs are based on Marvel characters, e.g.:

[2011-02-17 11:13:47,427][INFO ][node ] [D-Man] {elasticsearch/0.14.4}[6212]: stopping …

[2011-02-17 11:13:53,516][INFO ][node ] [Sublime, John] {elasticsearch/0.14.4}[5520]: initializing …

– RGL