Job file specifications
Expected files
FSCrawler expects to find a job directory in the ~/.fscrawler directory or in the directory
you defined with the --config_dir CLI option (see CLI options). The job file can be one of the following:
- a yaml file named _settings.yaml
- a json file named _settings.json
- a list of files within a directory named _settings
When using a directory, FSCrawler merges all the files found in it. This means you can split your settings across multiple files, such as:
- my_job_fs.yaml, which contains the file system settings
- my_job_elasticsearch.yaml, which contains the elasticsearch settings
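As a minimal sketch of the split layout described above, the following script creates a _settings directory containing two files. The job name my_job, the temporary config directory and the setting values are assumptions chosen for illustration; only the file names come from the text above.

```shell
# Assumption: a throwaway config directory is used here instead of ~/.fscrawler
config_dir="$(mktemp -d)"
# Assumption: the job is called "my_job"
settings_dir="$config_dir/my_job/_settings"
mkdir -p "$settings_dir"

# File system settings (placeholder path)
cat > "$settings_dir/my_job_fs.yaml" <<'EOF'
fs:
  url: "/path/to/docs"
EOF

# Elasticsearch settings (placeholder URL)
cat > "$settings_dir/my_job_elasticsearch.yaml" <<'EOF'
elasticsearch:
  urls:
    - "https://127.0.0.1:9200"
EOF

# List the files that would be merged
ls "$settings_dir"
```

Under these assumptions, the job could then be started with something like bin/fscrawler my_job --config_dir "$config_dir".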
Using placeholders
Added in version 2.10.
FSCrawler supports placeholders in the job file. This is useful when you want to use environment variables in your job file. For example, you can define the following job file:
fs:
  url: "${HOME}/docs"
elasticsearch:
  nodes:
    - url: "${ES_NODE1:=https://127.0.0.1:9200}"
  api_key: "${ES_API_KEY}"
When FSCrawler starts, it replaces ${HOME}, ${ES_NODE1} and ${ES_API_KEY} with their respective values.
Each value is read from the environment variables, then from the Java system properties if no environment variable is found.
If no value is found, FSCrawler uses the default value after the := if there is one; otherwise it fails to start.
In the previous example, both ${HOME} and ${ES_API_KEY} are mandatory, while ${ES_NODE1} is optional and
defaults to https://127.0.0.1:9200 if not set.
FSCrawler uses the gestalt-config project to handle placeholders. You can read more about string substitution in the gestalt-config documentation.
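Because a missing mandatory placeholder prevents FSCrawler from starting, it can help to check the environment first. The helper below is a hypothetical sketch, not part of FSCrawler: check_placeholders is an invented name, it only inspects environment variables (not Java system properties), and it treats an empty ES_NODE1 as unset, which is an assumption.

```shell
# Hypothetical pre-flight helper; not part of FSCrawler.
# It only inspects environment variables, not Java system properties.
check_placeholders() {
    for var in HOME ES_API_KEY; do
        # printenv exits with a non-zero status when the variable is not in the environment
        if ! printenv "$var" >/dev/null; then
            echo "missing mandatory variable: $var" >&2
            return 1
        fi
    done
    # ES_NODE1 is optional: print the value that would apply.
    # Assumption: an empty value is treated like an unset one here.
    echo "ES_NODE1=${ES_NODE1:-https://127.0.0.1:9200}"
}
```

A possible use is to chain it before the launch command, for example check_placeholders && bin/fscrawler test.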
Default placeholders
FSCrawler supports a set of default placeholders that you can define using environment variables.
Each placeholder name is the prefix FSCRAWLER_ followed by the setting name. For example,
fs.url can be set using the environment variable FSCRAWLER_FS_URL or the system property -Dfs.url.
As an example, you can run:
FSCRAWLER_NAME=foo \
FSCRAWLER_FS_URL=/tmp/test \
FSCRAWLER_ELASTICSEARCH_API-KEY=VnVhQ2ZHY0JDZGJrUW0tZTVhT3g6dWkybHAyYXhUTm1zeWFrdzl0dk5udw== \
bin/fscrawler test
or:
FS_JAVA_OPTS="-Dname=foo -Dfs.url=/tmp/test -Delasticsearch.api-key=VnVhQ2ZHY0JDZGJrUW0tZTVhT3g6dWkybHAyYXhUTm1zeWFrdzl0dk5udw==" \
bin/fscrawler test
Note
If you also define some of these settings in the job file, the values in the job file override the environment variables and system properties.
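The naming rule for default placeholders can be sketched with a small shell function. setting_to_env is a hypothetical helper, not part of FSCrawler, and it only covers simple dotted names such as fs.url; how settings containing underscores or hyphens are mapped is not assumed here.

```shell
# Hypothetical helper, not part of FSCrawler: derives the environment
# variable name for a simple dotted setting name such as "fs.url".
setting_to_env() {
    # Upper-case the letters and turn dots into underscores, then add the prefix
    printf 'FSCRAWLER_%s\n' "$(printf '%s' "$1" | tr 'a-z.' 'A-Z_')"
}

setting_to_env "fs.url"   # prints FSCRAWLER_FS_URL
setting_to_env "name"     # prints FSCRAWLER_NAME
```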
Example job file specification
The job file (~/.fscrawler/test/_settings.yaml) for the job name test must comply with the following YAML specification:
# optional: the name of the crawler. Defaults to the job directory name.
name: "test"
# required
fs:
  # define a "local" file path crawler, if running inside a docker container this must be the path INSIDE the container (/tmp/es)
  url: "/path/to/docs"
  follow_symlinks: false
  remove_deleted: true
  continue_on_error: false
  # scan every 5 minutes for changes in url defined above
  update_rate: "5m"
  # optional: define includes and excludes, "~" files are excluded by default if not defined below
  includes:
    - "*.doc"
    - "*.xls"
  excludes:
    - "resume.doc"
  # optional: do not send big files to TIKA
  ignore_above: "512mb"
  # special handling of JSON files, should only be used if ALL files are JSON
  json_support: false
  add_as_inner_object: false
  # special handling of XML files, should only be used if ALL files are XML
  xml_support: false
  # use MD5 from filename (instead of filename) if set to false
  filename_as_id: true
  # include size of file in index
  add_filesize: true
  # include user/group of file only if needed
  attributes_support: false
  # collect ACL metadata when available
  acl_support: false
  # do you REALLY want to store every file as a copy in the index? Then set this to true
  store_source: false
  # you may want to store (partial) content of the file (see indexed_chars)
  index_content: true
  # how much data from the content of the file should be indexed (and stored inside the index), set to 0 if you need checksum, but no content at all to be indexed
  #indexed_chars: "0"
  indexed_chars: "10000.0"
  # usually file metadata will be stored in separate fields, if you want to keep the original set, set this to true
  raw_metadata: false
  # optional: add checksum meta (requires index_content to be set to true)
  checksum: "MD5"
  # recommended, but will create another index
  index_folders: true
  lang_detect: false
  ocr.pdf_strategy: noocr
  #ocr:
  #  language: "eng"
  #  path: "/path/to/tesseract/if/not/available/in/PATH"
  #  data_path: "/path/to/tesseract/tessdata/if/needed"
# optional: add static metadata tags to documents
tags:
  metaFilename: "meta_tags.json" # default is ".meta.yml"
  # optional: add static metadata to all indexed documents
  staticMetaFilename: "/path/to/static_metadata.json"
# optional: specify a crawler provider (default is "local")
# available providers: "local", "ftp", "ssh"
# provider: "ssh"
# optional: only required if you want to SSH/FTP to another server to index documents from there
server:
  hostname: "localhost"
  port: 22
  username: "dadoonet"
  password: "password"
  pem_path: "/path/to/pemfile"
# required
elasticsearch:
  urls:
    - "https://127.0.0.1:9200"
  bulk_size: 1000
  flush_interval: "5s"
  byte_size: "10mb"
  # choose one of the 2 following options:
  # 1 - Using Api Key
  api_key: "VnVhQ2ZHY0JDZGJrUW0tZTVhT3g6dWkybHAyYXhUTm1zeWFrdzl0dk5udw=="
  # 2 - Using username/password (not recommended / deprecated)
  username: "elastic"
  password: "password"
  # optional, defaults to the name property
  index: "test_docs"
  # optional, defaults to "test_folders", used when fs.index_folders is set to true
  index_folder: "test_fold"
  # optional, defaults to "true"
  push_templates: "true"
  # optional, defaults to "true", used with Elasticsearch 8.17+ with a trial or enterprise license
  semantic_search: "true"
# only used when started with --rest option
rest:
  url: "http://127.0.0.1:8080/fscrawler"
Here is a list of existing top level settings:
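| Name | Documentation |
|---|---|
| name | Name of the crawler; defaults to the job directory name |
| fs | File system crawler settings (required) |
| tags | Static metadata tags added to documents |
| server | SSH/FTP server settings, only needed to index documents from another server |
| elasticsearch | Elasticsearch connection and indexing settings (required) |
| rest | REST service settings, only used when started with the --rest option |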
You can define your job settings either in _settings.yaml (using .yaml extension) or
in _settings.json (using .json extension).