Elasticsearch settings

Here is a list of Elasticsearch settings (under elasticsearch. prefix)`:

Name Default value Documentation
elasticsearch.index job name Index settings for documents
elasticsearch.index_folder job name + _folder Index settings for folders
elasticsearch.bulk_size 100 Bulk settings
elasticsearch.flush_interval "5s" Bulk settings
elasticsearch.byte_size "10mb" Bulk settings
elasticsearch.pipeline null Using Ingest Node Pipeline
elasticsearch.nodes http://127.0.0.1:9200 Node settings
elasticsearch.username null Using Credentials (X-Pack)
elasticsearch.password null Using Credentials (X-Pack)

Index settings

Index settings for documents

By default, FSCrawler will index your data in an index which name is the same as the crawler name (name property) plus _doc suffix, like test_doc. You can change it by setting index field:

{
  "name" : "test",
  "elasticsearch" : {
    "index" : "docs"
  }
}

Index settings for folders

FSCrawler will also index folders in an index which name is the same as the crawler name (name property) plus _folder suffix, like test_folder. You can change it by setting index_folder field:

{
 "name" : "test",
 "elasticsearch" : {
   "index_folder" : "folders"
 }
}

Mappings

When FSCrawler needs to create the doc index, it applies some default settings and mappings which are read from ~/.fscrawler/_default/6/_settings.json. You can read its content from the source.

Settings define an analyzer named fscrawler_path which uses a path hierarchy tokenizer.

FSCrawler applies as well a mapping automatically for the folders which can also be read from the source.

You can also display the index mapping being used with Kibana:

GET docs/_mapping
GET docs_folder/_mapping

Or fall back to the command line:

curl 'http://localhost:9200/docs/_mapping?pretty'
curl 'http://localhost:9200/docs_folder/_mapping?pretty'

Note

FSCrawler is actually applying default index settings depending on the elasticsearch version it is connected to. The default settings definitions are stored in ~/.fscrawler/_default/_mappings:

  • 2/_settings.json: for elasticsearch 2.x series document index settings
  • 2/_settings_folder.json: for elasticsearch 2.x series folder index settings
  • 5/_settings.json: for elasticsearch 5.x series document index settings
  • 5/_settings_folder.json: for elasticsearch 5.x series folder index settings
  • 6/_settings.json: for elasticsearch 6.x series document index settings
  • 6/_settings_folder.json: for elasticsearch 6.x series folder index settings

Note

For versions before 6.x series, the type of the document is doc. From 6.x, the type of the document is _doc.

Creating your own mapping (analyzers)

If you want to define your own index settings and mapping to set analyzers for example, you can either create the index and push the mapping or define a ~/.fscrawler/_default/6/_settings.json document which contains the index settings and mappings you wish before starting the FSCrawler.

The following example uses a french analyzer to index the content field.

{
  "settings": {
    "number_of_shards": 1,
    "index.mapping.total_fields.limit": 2000,
    "analysis": {
      "analyzer": {
        "fscrawler_path": {
          "tokenizer": "fscrawler_path"
        }
      },
      "tokenizer": {
        "fscrawler_path": {
          "type": "path_hierarchy"
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "dynamic_templates": [
        {
          "raw_as_text": {
            "path_match": "meta.raw.*",
            "mapping": {
              "type": "text",
              "fields": {
                "keyword": {
                  "type": "keyword",
                  "ignore_above": 256
                }
              }
            }
          }
        }
      ],
      "properties": {
        "attachment": {
          "type": "binary",
          "doc_values": false
        },
        "attributes": {
          "properties": {
            "group": {
              "type": "keyword"
            },
            "owner": {
              "type": "keyword"
            }
          }
        },
        "content": {
          "type": "text",
          "analyzer": "french"
        },
        "file": {
          "properties": {
            "content_type": {
              "type": "keyword"
            },
            "filename": {
              "type": "keyword",
              "store": true
            },
            "extension": {
              "type": "keyword"
            },
            "filesize": {
              "type": "long"
            },
            "indexed_chars": {
              "type": "long"
            },
            "indexing_date": {
              "type": "date",
              "format": "dateOptionalTime"
            },
            "created": {
              "type": "date",
              "format": "dateOptionalTime"
            },
            "last_modified": {
              "type": "date",
              "format": "dateOptionalTime"
            },
            "last_accessed": {
              "type": "date",
              "format": "dateOptionalTime"
            },
            "checksum": {
              "type": "keyword"
            },
            "url": {
              "type": "keyword",
              "index": false
            }
          }
        },
        "meta": {
          "properties": {
            "author": {
              "type": "text"
            },
            "date": {
              "type": "date",
              "format": "dateOptionalTime"
            },
            "keywords": {
              "type": "text"
            },
            "title": {
              "type": "text"
            },
            "language": {
              "type": "keyword"
            },
            "format": {
              "type": "text"
            },
            "identifier": {
              "type": "text"
            },
            "contributor": {
              "type": "text"
            },
            "coverage": {
              "type": "text"
            },
            "modifier": {
              "type": "text"
            },
            "creator_tool": {
              "type": "keyword"
            },
            "publisher": {
              "type": "text"
            },
            "relation": {
              "type": "text"
            },
            "rights": {
              "type": "text"
            },
            "source": {
              "type": "text"
            },
            "type": {
              "type": "text"
            },
            "description": {
              "type": "text"
            },
            "created": {
              "type": "date",
              "format": "dateOptionalTime"
            },
            "print_date": {
              "type": "date",
              "format": "dateOptionalTime"
            },
            "metadata_date": {
              "type": "date",
              "format": "dateOptionalTime"
            },
            "latitude": {
              "type": "text"
            },
            "longitude": {
              "type": "text"
            },
            "altitude": {
              "type": "text"
            },
            "rating": {
              "type": "byte"
            },
            "comments": {
              "type": "text"
            }
          }
        },
        "path": {
          "properties": {
            "real": {
              "type": "keyword",
              "fields": {
                "tree": {
                  "type": "text",
                  "analyzer": "fscrawler_path",
                  "fielddata": true
                },
                "fulltext": {
                  "type": "text"
                }
              }
            },
            "root": {
              "type": "keyword"
            },
            "virtual": {
              "type": "keyword",
              "fields": {
                "tree": {
                  "type": "text",
                  "analyzer": "fscrawler_path",
                  "fielddata": true
                },
                "fulltext": {
                  "type": "text"
                }
              }
            }
          }
        }
      }
    }
  }
}

Note that if you want to push manually the mapping to elasticsearch you can use the classic REST calls:

# Create index (don't forget to add the fscrawler_path analyzer)
PUT docs
{
  // Same index settings as previously seen
}

Define explicit mapping/settings per job

Let’s say you created a job named job_name and you are sending documents against an elasticsearch cluster running version 6.x.

If you create the following files, they will be picked up at job start time instead of the default ones:

  • ~/.fscrawler/{job_name}/_mappings/6/_settings.json
  • ~/.fscrawler/{job_name}/_mappings/6/_settings_folder.json

Tip

You can do the same for other elasticsearch versions with:

  • ~/.fscrawler/{job_name}/_mappings/2/_settings.json for 2.x series (deprecated)
  • ~/.fscrawler/{job_name}/_mappings/2/_settings_folder.json for 2.x series (deprecated)
  • ~/.fscrawler/{job_name}/_mappings/5/_settings.json for 5.x series
  • ~/.fscrawler/{job_name}/_mappings/5/_settings_folder.json for 5.x series

Replace existing mapping

Unfortunately you can not change the mapping on existing data. Therefore, you’ll need first to remove existing index, which means remove all existing data, and then restart FSCrawler with the new mapping.

You might to try elasticsearch Reindex API though.

Bulk settings

FSCrawler is using bulks to send data to elasticsearch. By default the bulk is executed every 100 operations or every 5 seconds or every 10 megabytes. You can change default settings using bulk_size, byte_size and flush_interval:

{
  "name" : "test",
  "elasticsearch" : {
    "bulk_size" : 1000,
    "byte_size" : "500kb",
    "flush_interval" : "2s"
  }
}

Tip

Elasticsearch has a default limit of 100mb per HTTP request as per elasticsearch HTTP Module documentation.

Which means that if you are indexing a massive bulk of documents, you might hit that limit and FSCrawler will throw an error like entity content is too long [xxx] for the configured buffer limit [104857600].

You can either change this limit on elasticsearch side by setting http.max_content_length to a higher value but please be aware that this will consume much more memory on elasticsearch side.

Or you can decrease the bulk_size or byte_size setting to a smaller value.

Using Ingest Node Pipeline

New in version 2.2.

If you are using an elasticsearch cluster running a 5.0 or superior version, you can use an Ingest Node pipeline to transform documents sent by FSCrawler before they are actually indexed.

For example, if you have the following pipeline:

PUT _ingest/pipeline/fscrawler
{
  "description" : "fscrawler pipeline",
  "processors" : [
    {
      "set" : {
        "field": "foo",
        "value": "bar"
      }
    }
  ]
}

In FSCrawler settings, set the elasticsearch.pipeline option:

{
  "name" : "test",
  "elasticsearch" : {
    "pipeline" : "fscrawler"
  }
}

Note

Folder objects are not sent through the pipeline as they are more internal objects.

Node settings

FSCrawler is using elasticsearch REST layer to send data to your running cluster. By default, it connects to http://127.0.0.1:9200 which is the default when running a local node on your machine.

Of course, in production, you would probably change this and connect to a production cluster:

{
  "name" : "test",
  "elasticsearch" : {
    "nodes" : [
      { "url" : "http://mynode1.mycompany.com:9200" }
    ]
  }
}

If you are using Elasticsearch service by Elastic, you can just use the Cloud ID which is available in the Cloud Console and paste it:

{
  "name" : "test",
  "elasticsearch" : {
    "nodes" : [
      { "cloud_id" : "fscrawler:ZXVyb3BlLXdlc3QxLmdjcC5jbG91ZC5lcy5pbyQxZDFlYTk5Njg4Nzc0NWE2YTJiN2NiNzkzMTUzNDhhMyQyOTk1MDI3MzZmZGQ0OTI5OTE5M2UzNjdlOTk3ZmU3Nw==" }
    ]
  }
}

This ID will be used to automatically generate the right host, port and scheme.

Hint

In the context of Elasticsearch service by Elastic, you will most likely need to provide as well the username and the password. See Using Credentials (X-Pack).

You can define multiple nodes:

{
  "name" : "test",
  "elasticsearch" : {
    "nodes" : [
      { "url" : "http://mynode1.mycompany.com:9200" },
      { "url" : "http://mynode2.mycompany.com:9200" },
      { "url" : "http://mynode3.mycompany.com:9200" }
    ]
  }
}

Note

New in version 2.2: you can use HTTPS instead of default HTTP.

{
  "name" : "test",
  "elasticsearch" : {
    "nodes" : [
      { "url" : "https://CLUSTERID.eu-west-1.aws.found.io:9243" }
    ]
  }
}

For more information, read SSL Configuration.

Using Credentials (X-Pack)

New in version 2.2.

If you secured your elasticsearch cluster with X-Pack, you can provide username and password to FSCrawler:

{
  "name" : "test",
  "elasticsearch" : {
    "username" : "elastic",
    "password" : "changeme"
  }
}

Warning

For the current version, the elasticsearch password is stored in plain text in your job setting file.

A better practice is to only set the username or pass it with --username elastic option when starting FSCrawler.

If the password is not defined, you will be prompted when starting the job:

22:46:42,528 INFO  [f.p.e.c.f.FsCrawler] Password for elastic:

SSL Configuration

In order to ingest documents to Elasticsearch over HTTPS based connection, you need to perform additional configuration steps:

Important

Prerequisite: you need to have root CA chain certificate or Elasticsearch server certificate in DER format. DER format files have a .cer extension.

  1. Logon to server (or client machine) where FSCrawler is running
  2. Run:
keytool -import -alias <alias name> -keystore " <JAVA_HOME>\lib\security\cacerts" -file <Path of Elasticsearch Server certificate or Root certificate>

It will prompt you for the password. Enter the certificate password like changeit.

  1. Make changes to FSCrawler _settings.json file to connect to your Elasticsearch server over HTTPS:
{
  "name" : "test",
  "elasticsearch" : {
    "nodes" : [
      { "url" : "https://localhost:9243" }
    ]
  }
}

Tip

If you can not find keytool, it probably means that you did not add your JAVA_HOME/bin directory to your path.

Generated fields

FSCrawler may create the following fields depending on configuration and available data:

Field Description Example Javadoc
content Extracted content "This is my text!"  
attachment BASE64 encoded binary file BASE64 Encoded document  
meta.author Author if any in "David Pilato" CREATOR
meta.title Title if any in document metadata "My document title" TITLE
meta.date Last modified date "2013-04-04T15:21:35" MODIFIED
meta.keywords Keywords if any in document metadata ["fs","elasticsearch"] KEYWORDS
meta.language Language (can be detected) "fr" LANGUAGE
meta.format Format of the media "application/pdf; version=1.6" FORMAT
meta.identifier URL/DOI/ISBN for example "FOOBAR" IDENTIFIER
meta.contributor Contributor "foo bar" CONTRIBUTOR
meta.coverage Coverage "FOOBAR" COVERAGE
meta.modifier Last author "David Pilato" MODIFIER
meta.creator_tool Tool used to create the resource "HTML2PDF- TCPDF" CREATOR_TOOL
meta.publisher Publisher: person, organisation, service "elastic" PUBLISHER
meta.relation Related resource "FOOBAR" RELATION
meta.rights Information about rights "CC-BY-ND" RIGHTS
meta.source Source for the current document (derivated) "FOOBAR" SOURCE
meta.type Nature or genre of the content "Image" TYPE
meta.description An account of the content "This is a description" DESCRIPTION
meta.created Date of creation "2013-04-04T15:21:35" CREATED
meta.print_date When was the doc last printed? "2013-04-04T15:21:35" PRINT_DATE
meta.metadata_date Last modification of metadata "2013-04-04T15:21:35" METADATA_DATE
meta.latitude The WGS84 Latitude of the Point "N 48° 51' 45.81''" LATITUDE
meta.longitude The WGS84 Longitude of the Point "E 17'15.331''" LONGITUDE
meta.altitude The WGS84 Altitude of the Point "" ALTITUDE
meta.rating A user-assigned rating -1, [0..5] 0 RATING
meta.comments Comments "Comments" COMMENTS
meta.raw An object with all raw metadata "meta.raw.channels": "2"  
file.content_type Content Type "application/vnd.oasis.opendocument.text"  
file.created Creation date "2018-07-30T11:19:23.000+0000"  
file.last_modified Last modification date "2018-07-30T11:19:23.000+0000"  
file.last_accessed Last accessed date "2018-07-30T11:19:23.000+0000"  
file.indexing_date Indexing date "2018-07-30T11:19:30.703+0000"  
file.filesize File size in bytes 1256362  
file.indexed_chars Extracted chars 100000  
file.filename Original file name "mydocument.pdf"  
file.extension Original file name extension "pdf"  
file.url Original file url "file://tmp/otherdir/mydocument.pdf"  
file.checksum Checksum "c32eafae2587bef4b3b32f73743c3c61"  
path.virtual Relative path from "/otherdir/mydocument.pdf"  
path.root MD5 encoded parent path (internal use) "112aed83738239dbfe4485f024cd4ce1"  
path.real Real path name "/tmp/otherdir/mydocument.pdf"  
attributes.owner Owner name "david"  
attributes.group Group name "staff"  
attributes.permissions Permissions 764  
external Additional tags { "tenantId": 22, "projectId": 33 }  

For more information about meta data, please read the TikaCoreProperties.

Here is a typical JSON document generated by the crawler:

{
   "content":"This is a sample text available in page 1\n\nThis second part of the text is in Page 2\n\n",
   "meta":{
      "author":"David Pilato",
      "title":"Test Tika title",
      "date":"2016-07-07T16:37:00.000+0000",
      "keywords":[
         "keyword1",
         "  keyword2"
      ],
      "language":"en",
      "description":"Comments",
      "created":"2016-07-07T16:37:00.000+0000"
   },
   "file":{
      "extension":"odt",
      "content_type":"application/vnd.oasis.opendocument.text",
      "created":"2018-07-30T11:35:08.000+0000",
      "last_modified":"2018-07-30T11:35:08.000+0000",
      "last_accessed":"2018-07-30T11:35:08.000+0000",
      "indexing_date":"2018-07-30T11:35:19.781+0000",
      "filesize":6236,
      "filename":"test.odt",
      "url":"file:///tmp/test.odt"
   },
   "path":{
      "root":"7537e4fb47e553f110a1ec312c2537c0",
      "virtual":"/test.odt",
      "real":"/tmp/test.odt"
   }
}

Search examples

You can use the content field to perform full-text search on

GET docs/_search
{
  "query" : {
    "match" : {
        "content" : "the quick brown fox"
    }
  }
}

You can use meta fields to perform search on.

GET docs/_search
{
  "query" : {
    "term" : {
        "file.filename" : "mydocument.pdf"
    }
  }
}

Or run some aggregations on top of them, like:

GET docs/_search
{
  "size": 0,
  "aggs": {
    "by_extension": {
      "terms": {
        "field": "file.extension"
      }
    }
  }
}