Job file specification
The job file must comply with the following YAML specification:
name: "job_name"
fs:
  url: "/path/to/docs"
  update_rate: "5m"
  includes:
  - "*.doc"
  - "*.xls"
  excludes:
  - "resume.doc"
  json_support: false
  filename_as_id: true
  add_filesize: true
  remove_deleted: true
  add_as_inner_object: false
  store_source: true
  index_content: true
  indexed_chars: "10000.0"
  attributes_support: false
  raw_metadata: true
  xml_support: false
  index_folders: true
  lang_detect: false
  continue_on_error: false
  pdf_ocr: true
  ocr:
    language: "eng"
    path: "/path/to/tesseract/if/not/available/in/PATH"
    data_path: "/path/to/tesseract/tessdata/if/needed"
server:
  hostname: "localhost"
  port: 22
  username: "dadoonet"
  password: "password"
  protocol: "SSH"
  pem_path: "/path/to/pemfile"
elasticsearch:
  nodes:
  # With Cloud ID
  - cloud_id: "CLOUD_ID"
  # With URL
  - url: "http://127.0.0.1:9200"
  index: "docs"
  bulk_size: 1000
  flush_interval: "5s"
  byte_size: "10mb"
  username: "elastic"
  password: "password"
rest:
  url: "https://127.0.0.1:8080/fscrawler"
Here is a list of the existing top-level settings:
Name | Documentation |
---|---|
name (mandatory field) | The most simple crawler |
fs | Local FS settings |
elasticsearch | Elasticsearch settings |
server | SSH settings |
rest | REST service |
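Since the table marks only name as mandatory, a much smaller job file than the full specification should be enough to get started. The following is a minimal sketch, not a definitive configuration: it assumes that every omitted section (server, elasticsearch, rest) falls back to its default value, and the fs.url shown is just the placeholder path from the specification above.

```yaml
# Minimal job file sketch: only "name" is mandatory.
name: "job_name"
# Optional: directory to crawl (placeholder path; a default is assumed if omitted).
fs:
  url: "/path/to/docs"
```

Starting from a small file like this and adding only the settings that need to differ from the defaults keeps the job file easier to review.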
New in version 2.7.
You can define your job settings either in YAML (using the .yaml extension) or in JSON (using the .json extension).