Elasticsearch settings¶
Here is a list of Elasticsearch settings (under the elasticsearch. prefix):
Name | Default value | Documentation |
---|---|---|
elasticsearch.index | job name | Index settings for documents |
elasticsearch.index_folder | job name + _folder | Index settings for folders |
elasticsearch.bulk_size | 100 | Bulk settings |
elasticsearch.flush_interval | "5s" | Bulk settings |
elasticsearch.byte_size | "10mb" | Bulk settings |
elasticsearch.pipeline | null | Using Ingest Node Pipeline |
elasticsearch.nodes | http://127.0.0.1:9200 | Node settings |
elasticsearch.path_prefix | null | Path prefix |
elasticsearch.username | null | Using Credentials (Security) |
elasticsearch.password | null | Using Credentials (Security) |
elasticsearch.ssl_verification | true | Using Credentials (Security) |
Index settings¶
Index settings for documents¶
By default, FSCrawler will index your data in an index whose name is the same as the crawler name (the name property) plus a _doc suffix, like test_doc. You can change it by setting the index field:
name: "test"
elasticsearch:
index: "docs"
Index settings for folders¶
FSCrawler will also index folders in an index whose name is the same as the crawler name (the name property) plus a _folder suffix, like test_folder. You can change it by setting the index_folder field:
name: "test"
elasticsearch:
index_folder: "folders"
Mappings¶
When FSCrawler needs to create the doc index, it applies some default settings and mappings which are read from ~/.fscrawler/_default/7/_settings.json. You can read its content from the source.
Settings define an analyzer named fscrawler_path which uses a path hierarchy tokenizer.
FSCrawler also applies a mapping automatically for the folders, which can likewise be read from the source.
You can also display the index mapping being used with Kibana:
GET docs/_mapping
GET docs_folder/_mapping
Or fall back to the command line:
curl 'http://localhost:9200/docs/_mapping?pretty'
curl 'http://localhost:9200/docs_folder/_mapping?pretty'
Note
FSCrawler is actually applying default index settings depending on the elasticsearch version it is connected to. The default settings definitions are stored in ~/.fscrawler/_default:
- 6/_settings.json: for elasticsearch 6.x series document index settings
- 6/_settings_folder.json: for elasticsearch 6.x series folder index settings
- 7/_settings.json: for elasticsearch 7.x series document index settings
- 7/_settings_folder.json: for elasticsearch 7.x series folder index settings
Creating your own mapping (analyzers)¶
If you want to define your own index settings and mappings, for example to set analyzers, you can either create the index and push the mapping yourself, or define a ~/.fscrawler/_default/7/_settings.json document which contains the index settings and mappings you wish, before starting FSCrawler.
The following example uses a french analyzer to index the content field.
{
"settings": {
"number_of_shards": 1,
"index.mapping.total_fields.limit": 2000,
"analysis": {
"analyzer": {
"fscrawler_path": {
"tokenizer": "fscrawler_path"
}
},
"tokenizer": {
"fscrawler_path": {
"type": "path_hierarchy"
}
}
}
},
"mappings": {
"_doc": {
"dynamic_templates": [
{
"raw_as_text": {
"path_match": "meta.raw.*",
"mapping": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
}
],
"properties": {
"attachment": {
"type": "binary",
"doc_values": false
},
"attributes": {
"properties": {
"group": {
"type": "keyword"
},
"owner": {
"type": "keyword"
}
}
},
"content": {
"type": "text",
"analyzer": "french"
},
"file": {
"properties": {
"content_type": {
"type": "keyword"
},
"filename": {
"type": "keyword",
"store": true
},
"extension": {
"type": "keyword"
},
"filesize": {
"type": "long"
},
"indexed_chars": {
"type": "long"
},
"indexing_date": {
"type": "date",
"format": "dateOptionalTime"
},
"created": {
"type": "date",
"format": "dateOptionalTime"
},
"last_modified": {
"type": "date",
"format": "dateOptionalTime"
},
"last_accessed": {
"type": "date",
"format": "dateOptionalTime"
},
"checksum": {
"type": "keyword"
},
"url": {
"type": "keyword",
"index": false
}
}
},
"meta": {
"properties": {
"author": {
"type": "text"
},
"date": {
"type": "date",
"format": "dateOptionalTime"
},
"keywords": {
"type": "text"
},
"title": {
"type": "text"
},
"language": {
"type": "keyword"
},
"format": {
"type": "text"
},
"identifier": {
"type": "text"
},
"contributor": {
"type": "text"
},
"coverage": {
"type": "text"
},
"modifier": {
"type": "text"
},
"creator_tool": {
"type": "keyword"
},
"publisher": {
"type": "text"
},
"relation": {
"type": "text"
},
"rights": {
"type": "text"
},
"source": {
"type": "text"
},
"type": {
"type": "text"
},
"description": {
"type": "text"
},
"created": {
"type": "date",
"format": "dateOptionalTime"
},
"print_date": {
"type": "date",
"format": "dateOptionalTime"
},
"metadata_date": {
"type": "date",
"format": "dateOptionalTime"
},
"latitude": {
"type": "text"
},
"longitude": {
"type": "text"
},
"altitude": {
"type": "text"
},
"rating": {
"type": "byte"
},
"comments": {
"type": "text"
}
}
},
"path": {
"properties": {
"real": {
"type": "keyword",
"fields": {
"tree": {
"type": "text",
"analyzer": "fscrawler_path",
"fielddata": true
},
"fulltext": {
"type": "text"
}
}
},
"root": {
"type": "keyword"
},
"virtual": {
"type": "keyword",
"fields": {
"tree": {
"type": "text",
"analyzer": "fscrawler_path",
"fielddata": true
},
"fulltext": {
"type": "text"
}
}
}
}
}
}
}
}
Note that if you want to push the mapping to elasticsearch manually, you can use the classic REST calls:
# Create index (don't forget to add the fscrawler_path analyzer)
PUT docs
{
// Same index settings as previously seen
}
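Once the index exists, you can check that the fscrawler_path analyzer behaves as expected by exercising it with the _analyze API. A minimal sketch, using the docs index created above:
POST docs/_analyze
{
  "analyzer": "fscrawler_path",
  "text": "/tmp/otherdir/mydocument.pdf"
}
The path_hierarchy tokenizer should emit one token per path level: /tmp, /tmp/otherdir and /tmp/otherdir/mydocument.pdf.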
Define explicit mapping/settings per job¶
Let’s say you created a job named job_name and you are sending documents against an elasticsearch cluster running version 7.x.
If you create the following files, they will be picked up at job start time instead of the default ones:
~/.fscrawler/{job_name}/_mappings/7/_settings.json
~/.fscrawler/{job_name}/_mappings/7/_settings_folder.json
Tip
You can do the same for other elasticsearch versions with:
- ~/.fscrawler/{job_name}/_mappings/6/_settings.json for 6.x series
- ~/.fscrawler/{job_name}/_mappings/6/_settings_folder.json for 6.x series
Replace existing mapping¶
Unfortunately, you cannot change the mapping of existing data. Therefore, you first need to remove the existing index, which means removing all existing data, and then restart FSCrawler with the new mapping.
You might want to try the elasticsearch Reindex API though.
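For example, a minimal sketch of such a reindex call, assuming the old index is named job_name and you have re-created a job_name_new index with the new mapping (both names are hypothetical):
POST _reindex
{
  "source": { "index": "job_name" },
  "dest": { "index": "job_name_new" }
}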
Bulk settings¶
FSCrawler uses bulk requests to send data to elasticsearch. By default, a bulk is executed every 100 operations, every 5 seconds, or every 10 megabytes, whichever comes first. You can change the default settings using bulk_size, byte_size and flush_interval:
name: "test"
elasticsearch:
bulk_size: 1000
byte_size: "500kb"
flush_interval: "2s"
Tip
Elasticsearch has a default limit of 100mb per HTTP request, as per the elasticsearch HTTP Module documentation. This means that if you are indexing a massive bulk of documents, you might hit that limit and FSCrawler will throw an error like entity content is too long [xxx] for the configured buffer limit [104857600].
You can either raise this limit on the elasticsearch side by setting http.max_content_length to a higher value, but please be aware that this will consume much more memory on the elasticsearch side, or you can decrease the bulk_size or byte_size setting to a smaller value.
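If you choose to raise the limit, it is set in elasticsearch.yml on each node. A sketch (200mb is an arbitrary example value):
http.max_content_length: 200mb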
Using Ingest Node Pipeline¶
New in version 2.2.
If you are using an elasticsearch cluster running version 5.0 or above, you can use an Ingest Node pipeline to transform documents sent by FSCrawler before they are actually indexed.
For example, if you have the following pipeline:
PUT _ingest/pipeline/fscrawler
{
"description" : "fscrawler pipeline",
"processors" : [
{
"set" : {
"field": "foo",
"value": "bar"
}
}
]
}
In FSCrawler settings, set the elasticsearch.pipeline option:
name: "test"
elasticsearch:
pipeline: "fscrawler"
Note
Folder objects are not sent through the pipeline, as they are internal objects.
Node settings¶
FSCrawler uses the elasticsearch REST layer to send data to your running cluster. By default, it connects to http://127.0.0.1:9200, which is the default when running a local node on your machine.
Of course, in production, you would probably change this and connect to a production cluster:
name: "test"
elasticsearch:
nodes:
- url: "http://mynode1.mycompany.com:9200"
If you are using the Elasticsearch service by Elastic, you can just use the Cloud ID which is available in the Cloud Console and paste it:
name: "test"
elasticsearch:
nodes:
- cloud_id: "fscrawler:ZXVyb3BlLXdlc3QxLmdjcC5jbG91ZC5lcy5pbyQxZDFlYTk5Njg4Nzc0NWE2YTJiN2NiNzkzMTUzNDhhMyQyOTk1MDI3MzZmZGQ0OTI5OTE5M2UzNjdlOTk3ZmU3Nw=="
This ID will be used to automatically generate the right host, port and scheme.
Hint
In the context of the Elasticsearch service by Elastic, you will most likely need to provide the username and the password as well. See Using Credentials (Security).
You can define multiple nodes:
name: "test"
elasticsearch:
nodes:
- url: "http://mynode1.mycompany.com:9200"
- url: "http://mynode2.mycompany.com:9200"
- url: "http://mynode3.mycompany.com:9200"
Note
New in version 2.2: you can use HTTPS instead of the default HTTP.
name: "test"
elasticsearch:
nodes:
- url: "https://CLUSTERID.eu-west-1.aws.found.io:9243"
For more information, read SSL Configuration.
Path prefix¶
New in version 2.7: If your elasticsearch is running behind a proxy with url rewriting, you might have to specify a path prefix. This can be done with the path_prefix setting:
name: "test"
elasticsearch:
nodes:
- url: "http://mynode1.mycompany.com:9200"
path_prefix: "/path/to/elasticsearch"
Note
The same path_prefix applies to all nodes.
Using Credentials (Security)¶
New in version 2.2.
If you secured your elasticsearch cluster, you can provide username and password to FSCrawler:
name: "test"
elasticsearch:
username: "elastic"
password: "changeme"
Warning
For the current version, the elasticsearch password is stored in plain text in your job setting file.
A better practice is to set only the username, or to pass it with the --username elastic option when starting FSCrawler.
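For example, a sketch assuming a job named test:
bin/fscrawler test --username elastic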
If the password is not defined, you will be prompted when starting the job:
22:46:42,528 INFO [f.p.e.c.f.FsCrawler] Password for elastic:
If you want to use another user than the default elastic, you will need to grant it some permissions:
- cluster:monitor
- indices:fsc/all
- indices:fsc_folder/all
where fsc is the FSCrawler index name as defined in Index settings for documents.
This can be done by defining the following role:
PUT /_security/role/fscrawler
{
"cluster" : [ "monitor" ],
"indices" : [ {
"names" : [ "fsc", "fsc_folder" ],
"privileges" : [ "all" ]
} ]
}
This can also be done using the Kibana Stack Management Interface.
Then, you can assign this role to the user who will be defined within the username setting.
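A sketch of creating such a user with the role above (the user name fscrawler_user and its password are hypothetical):
PUT /_security/user/fscrawler_user
{
  "password" : "changeme",
  "roles" : [ "fscrawler" ]
}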
SSL Configuration¶
In order to ingest documents into Elasticsearch over an HTTPS-based connection, you need to perform additional configuration steps:
Important
Prerequisite: you need to have the root CA chain certificate or the Elasticsearch server certificate in DER format. DER format files have a .cer extension. Certificate verification can be disabled with the ssl_verification: false option (see the example at the end of this section).
- Log on to the server (or client machine) where FSCrawler is running
- Run:
keytool -import -alias <alias name> -keystore "<JAVA_HOME>\lib\security\cacerts" -file <Path of Elasticsearch Server certificate or Root certificate>
It will prompt you for the password. Enter the keystore password (the default is changeit).
- Edit the FSCrawler _settings.yaml file for your job to connect to your Elasticsearch server over HTTPS:
name: "test"
elasticsearch:
nodes:
- url: "https://localhost:9243"
Tip
If you cannot find keytool, it probably means that you did not add your JAVA_HOME/bin directory to your path.
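If you just want to skip certificate verification instead, for example with a self-signed certificate in a test environment, you can disable it (not recommended in production):
name: "test"
elasticsearch:
  ssl_verification: false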
Generated fields¶
FSCrawler may create the following fields depending on configuration and available data:
Field | Description | Example | Javadoc |
---|---|---|---|
content | Extracted content | "This is my text!" | |
attachment | BASE64 encoded binary file | BASE64 Encoded document | |
meta.author | Author if any in document metadata | "David Pilato" | CREATOR |
meta.title | Title if any in document metadata | "My document title" | TITLE |
meta.date | Last modified date | "2013-04-04T15:21:35" | MODIFIED |
meta.keywords | Keywords if any in document metadata | ["fs","elasticsearch"] | KEYWORDS |
meta.language | Language (can be detected) | "fr" | LANGUAGE |
meta.format | Format of the media | "application/pdf; version=1.6" | FORMAT |
meta.identifier | URL/DOI/ISBN for example | "FOOBAR" | IDENTIFIER |
meta.contributor | Contributor | "foo bar" | CONTRIBUTOR |
meta.coverage | Coverage | "FOOBAR" | COVERAGE |
meta.modifier | Last author | "David Pilato" | MODIFIER |
meta.creator_tool | Tool used to create the resource | "HTML2PDF- TCPDF" | CREATOR_TOOL |
meta.publisher | Publisher: person, organisation, service | "elastic" | PUBLISHER |
meta.relation | Related resource | "FOOBAR" | RELATION |
meta.rights | Information about rights | "CC-BY-ND" | RIGHTS |
meta.source | Source for the current document (derived) | "FOOBAR" | SOURCE |
meta.type | Nature or genre of the content | "Image" | TYPE |
meta.description | An account of the content | "This is a description" | DESCRIPTION |
meta.created | Date of creation | "2013-04-04T15:21:35" | CREATED |
meta.print_date | When was the doc last printed? | "2013-04-04T15:21:35" | PRINT_DATE |
meta.metadata_date | Last modification of metadata | "2013-04-04T15:21:35" | METADATA_DATE |
meta.latitude | The WGS84 Latitude of the Point | "N 48° 51' 45.81''" | LATITUDE |
meta.longitude | The WGS84 Longitude of the Point | "E 2° 17'15.331''" | LONGITUDE |
meta.altitude | The WGS84 Altitude of the Point | "" | ALTITUDE |
meta.rating | A user-assigned rating -1, [0..5] | 0 | RATING |
meta.comments | Comments | "Comments" | COMMENTS |
meta.raw | An object with all raw metadata | "meta.raw.channels": "2" | |
file.content_type | Content Type | "application/vnd.oasis.opendocument.text" | |
file.created | Creation date | "2018-07-30T11:19:23.000+0000" | |
file.last_modified | Last modification date | "2018-07-30T11:19:23.000+0000" | |
file.last_accessed | Last accessed date | "2018-07-30T11:19:23.000+0000" | |
file.indexing_date | Indexing date | "2018-07-30T11:19:30.703+0000" | |
file.filesize | File size in bytes | 1256362 | |
file.indexed_chars | Extracted chars | 100000 | |
file.filename | Original file name | "mydocument.pdf" | |
file.extension | Original file name extension | "pdf" | |
file.url | Original file url | "file://tmp/otherdir/mydocument.pdf" | |
file.checksum | Checksum | "c32eafae2587bef4b3b32f73743c3c61" | |
path.virtual | Relative path from root path | "/otherdir/mydocument.pdf" | |
path.root | MD5 encoded parent path (internal use) | "112aed83738239dbfe4485f024cd4ce1" | |
path.real | Real path name | "/tmp/otherdir/mydocument.pdf" | |
attributes.owner | Owner name | "david" | |
attributes.group | Group name | "staff" | |
attributes.permissions | Permissions | 764 | |
external | Additional tags | { "tenantId": 22, "projectId": 33 } | |
For more information about metadata, please read the TikaCoreProperties.
Here is a typical JSON document generated by the crawler:
{
"content":"This is a sample text available in page 1\n\nThis second part of the text is in Page 2\n\n",
"meta":{
"author":"David Pilato",
"title":"Test Tika title",
"date":"2016-07-07T16:37:00.000+0000",
"keywords":[
"keyword1",
" keyword2"
],
"language":"en",
"description":"Comments",
"created":"2016-07-07T16:37:00.000+0000"
},
"file":{
"extension":"odt",
"content_type":"application/vnd.oasis.opendocument.text",
"created":"2018-07-30T11:35:08.000+0000",
"last_modified":"2018-07-30T11:35:08.000+0000",
"last_accessed":"2018-07-30T11:35:08.000+0000",
"indexing_date":"2018-07-30T11:35:19.781+0000",
"filesize":6236,
"filename":"test.odt",
"url":"file:///tmp/test.odt"
},
"path":{
"root":"7537e4fb47e553f110a1ec312c2537c0",
"virtual":"/test.odt",
"real":"/tmp/test.odt"
}
}
Search examples¶
You can use the content field to perform a full-text search:
GET docs/_search
{
"query" : {
"match" : {
"content" : "the quick brown fox"
}
}
}
You can also search on meta fields:
GET docs/_search
{
"query" : {
"term" : {
"file.filename" : "mydocument.pdf"
}
}
}
Or run some aggregations on top of them, like:
GET docs/_search
{
"size": 0,
"aggs": {
"by_extension": {
"terms": {
"field": "file.extension"
}
}
}
}
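Since path.virtual.tree is indexed with the fscrawler_path analyzer and has fielddata enabled, you can also aggregate documents per directory level. A sketch:
GET docs/_search
{
  "size": 0,
  "aggs": {
    "by_directory": {
      "terms": {
        "field": "path.virtual.tree"
      }
    }
  }
}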