REST service
FSCrawler can expose a REST service running at http://127.0.0.1:8080/.
To activate it, launch FSCrawler with the --rest option.
General settings
Added in version 2.10.
For all the APIs on this page, you can pass parameters in different ways.
You can use a query string parameter:
curl "http://127.0.0.1:8080/API?param1=foo&param2=bar"
You can use a header parameter:
curl -H "param1: foo" -H "param2: bar" "http://127.0.0.1:8080/API"
The rest of this documentation will assume using a query string parameter unless stated otherwise.
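The two ways of passing parameters can be sketched in Python with the standard library only. This is a minimal illustration, not part of FSCrawler: build_request is a hypothetical helper, and /API, param1 and param2 are the placeholders used above. The requests are built but not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

BASE_URL = "http://127.0.0.1:8080"  # default REST address used on this page


def build_request(path, params, use_headers=False):
    """Return an unsent Request carrying params in the query string or as headers."""
    if use_headers:
        # Each parameter becomes one HTTP header ("name: value").
        return Request(BASE_URL + path, headers=dict(params))
    # Otherwise the parameters are URL-encoded into the query string.
    return Request(BASE_URL + path + "?" + urlencode(params))


query_req = build_request("/API", {"param1": "foo", "param2": "bar"})
header_req = build_request("/API", {"param1": "foo", "param2": "bar"}, use_headers=True)
print(query_req.full_url)
print(header_req.header_items())
```

Passing either object to urllib.request.urlopen would send it, which requires a FSCrawler instance started with --rest.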
FSCrawler status
To get an overview of the running service, call the GET / endpoint:
curl http://127.0.0.1:8080/
It will give you a response similar to:
{
"ok" : true,
"version" : "2.10-SNAPSHOT",
"elasticsearch" : "9.4.1",
"settings" : {
"name" : "fscrawler",
"fs" : {
"url" : "/tmp/es",
"update_rate" : "15m",
"excludes" : [ "*/~*" ],
"json_support" : false,
"add_as_inner_object" : false,
"xml_support" : false,
"follow_symlinks" : false,
"remove_deleted" : true,
"continue_on_error" : false,
"filename_as_id" : false,
"add_filesize" : true,
"attributes_support" : false,
"store_source" : false,
"index_content" : true,
"acl_support" : false,
"raw_metadata" : true,
"index_folders" : true,
"lang_detect" : false,
"ocr" : {
"enabled" : true,
"language" : "eng",
"output_type" : "txt",
"pdf_strategy" : "ocr_and_text",
"page_seg_mode" : 1,
"preserve_interword_spacing" : false
}
},
"server" : {
"port" : 0,
"protocol" : "local"
},
"elasticsearch" : {
"urls" : [ "http://es-fscrawler:9200" ],
"index" : "fscrawler_docs",
"index_folder" : "fscrawler_folder",
"bulk_size" : 100,
"flush_interval" : "5s",
"byte_size" : "10mb",
"username" : "elastic",
"ssl_verification" : true,
"push_templates" : true,
"semantic_search" : true
},
"rest" : {
"url" : "http://127.0.0.1:8080",
"enable_cors" : false
},
"tags" : {
"meta_filename" : ".meta.yml"
}
}
}
Uploading a binary document
To upload a binary, you can call POST /_document endpoint:
echo "This is my text" > test.txt
curl -F "file=@test.txt" "http://127.0.0.1:8080/_document"
It will give you a response similar to:
{
"ok" : true,
"filename" : "test.txt",
"url" : "http://127.0.0.1:9200/fscrawler-rest-tests_doc/_doc/dd18bf3a8ea2a3e53e2661c7fb53534"
}
The url field is the Elasticsearch address of the indexed
document. If you call:
curl "http://127.0.0.1:9200/fscrawler-rest-tests_doc/_doc/dd18bf3a8ea2a3e53e2661c7fb53534?pretty"
You will get back your document as it has been stored by Elasticsearch:
{
"_index" : "fscrawler-rest-tests_doc",
"_type" : "_doc",
"_id" : "dd18bf3a8ea2a3e53e2661c7fb53534",
"_version" : 1,
"found" : true,
"_source" : {
"content" : "This file contains some words.\n",
"meta" : {
"raw" : {
"X-Parsed-By" : "org.apache.tika.parser.DefaultParser",
"Content-Encoding" : "ISO-8859-1",
"Content-Type" : "text/plain; charset=ISO-8859-1"
}
},
"file" : {
"extension" : "txt",
"content_type" : "text/plain; charset=ISO-8859-1",
"indexing_date" : "2017-01-04T21:01:08.043",
"filename" : "test.txt"
},
"path" : {
"virtual" : "test.txt",
"real" : "test.txt"
}
}
}
If you started FSCrawler in debug mode, or if you pass the
debug=true query parameter, the response also includes the extracted
document in a doc field:
echo "This is my text" > test.txt
curl -F "file=@test.txt" "http://127.0.0.1:8080/_document?debug=true"
It will give you a response similar to:
{
"ok" : true,
"filename" : "test.txt",
"url" : "http://127.0.0.1:9200/fscrawler-rest-tests_doc/_doc/dd18bf3a8ea2a3e53e2661c7fb53534",
"doc" : {
"content" : "This file contains some words.\n",
"meta" : {
"raw" : {
"X-Parsed-By" : "org.apache.tika.parser.DefaultParser",
"Content-Encoding" : "ISO-8859-1",
"Content-Type" : "text/plain; charset=ISO-8859-1"
}
},
"file" : {
"extension" : "txt",
"content_type" : "text/plain; charset=ISO-8859-1",
"indexing_date" : "2017-01-04T14:05:10.325",
"filename" : "test.txt"
},
"path" : {
"virtual" : "test.txt",
"real" : "test.txt"
}
}
}
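For clients that cannot shell out to curl, the upload can be sketched in Python. This is a minimal illustration under stated assumptions: build_upload_request is a hypothetical helper, the part is declared as application/octet-stream, and only the form field name file comes from the curl examples above. The request is built but not sent.

```python
import uuid
from urllib.request import Request


def build_upload_request(url, filename, content, fields=None):
    """Build a multipart/form-data POST with one part named "file", like curl -F."""
    boundary = uuid.uuid4().hex
    lines = []
    # Optional plain form fields come first.
    for name, value in (fields or {}).items():
        lines += [
            f"--{boundary}".encode(),
            f'Content-Disposition: form-data; name="{name}"'.encode(),
            b"",
            str(value).encode(),
        ]
    # The binary document itself, followed by the closing boundary.
    lines += [
        f"--{boundary}".encode(),
        f'Content-Disposition: form-data; name="file"; filename="{filename}"'.encode(),
        b"Content-Type: application/octet-stream",
        b"",
        content,
        f"--{boundary}--".encode(),
        b"",
    ]
    body = b"\r\n".join(lines)
    headers = {"Content-Type": f"multipart/form-data; boundary={boundary}"}
    return Request(url, data=body, headers=headers, method="POST")


upload_req = build_upload_request(
    "http://127.0.0.1:8080/_document?debug=true", "test.txt", b"This is my text\n"
)
print(upload_req.get_method(), upload_req.full_url)
```

Sending upload_req with urllib.request.urlopen and decoding the body with json.loads should yield the ok, filename and url fields shown above, assuming FSCrawler is running with --rest.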
Uploading a binary document from a 3rd party service
Added in version 2.10.
You can also ask FSCrawler to fetch a document from a 3rd party service and index it into Elasticsearch. So far, FSCrawler supports the following services:
local: reads a file from the server where FSCrawler is running (a local file)
http: reads a file from a URL
s3: reads a file from an S3 compatible service
ssh: reads a file from an SSH/SFTP server
ftp: reads a file from an FTP server
To upload a binary from a 3rd party service, you can call POST /_document endpoint and pass
a JSON document which describes the service settings:
curl -XPOST http://127.0.0.1:8080/_document -H 'Content-Type: application/json' -d '{
"type": "<TYPE>",
"<TYPE>": {
// Settings for the <TYPE>
}
}'
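The JSON shape above, where the value of type is repeated as the key that holds the settings, can be sketched in Python. build_fetch_request is a hypothetical helper, not an FSCrawler API; the example uses the http service from the list above, and the request is built but not sent.

```python
import json
from urllib.request import Request


def build_fetch_request(service_type, settings, base_url="http://127.0.0.1:8080"):
    """Build the POST /_document request for a 3rd party service."""
    # The service name appears twice: as the "type" value and as the settings key.
    payload = {"type": service_type, service_type: settings}
    return Request(
        base_url + "/_document",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


fetch_req = build_fetch_request("http", {"url": "https://www.elastic.co/robots.txt"})
print(fetch_req.data.decode("utf-8"))
```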
Local plugin
The local plugin reads a file from the server where FSCrawler is running (a local file).
It needs the following parameter:
url: link to the local file
For example, we can read the file bar.txt from the /path/to/foo directory with:
curl -XPOST http://127.0.0.1:8080/_document -H 'Content-Type: application/json' -d '{
"type": "local",
"local": {
"url": "/path/to/foo/bar.txt"
}
}'
Note
For security reasons, the local plugin can only read files
which are under the path defined in the job settings file under
fs.url.
HTTP plugin
The http plugin reads a file from a given URL.
It needs the following parameter:
url: link to the file
For example, we can read the file robots.txt from the https://www.elastic.co/ website with:
curl -XPOST http://127.0.0.1:8080/_document -H 'Content-Type: application/json' -d '{
"type": "http",
"http": {
"url": "https://www.elastic.co/robots.txt"
}
}'
S3 plugin
The s3 plugin reads a file from an S3 compatible service.
It needs the following parameters:
url: URL of the S3 service
bucket: bucket name
object: object to read from the bucket
access_key: access key (or login)
secret_key: secret key (or password)
For example, we can read the file foo.txt from the bucket foo running on https://s3.amazonaws.com/ with:
curl -XPOST http://127.0.0.1:8080/_document -H 'Content-Type: application/json' -d '{
"type": "s3",
"s3": {
"url": "https://s3.amazonaws.com",
"bucket": "foo",
"object": "foo.txt",
"access_key": "ACCESS",
"secret_key": "SECRET"
}
}'
If you are using Minio, you can use:
curl -XPOST http://127.0.0.1:8080/_document -H 'Content-Type: application/json' -d '{
"type": "s3",
"s3": {
"url": "http://localhost:9000",
"bucket": "foo",
"object": "foo.txt",
"access_key": "minioadmin",
"secret_key": "minioadmin"
}
}'
SSH plugin
The ssh plugin reads a file from an SSH/SFTP server.
It accepts the following parameters:
path (required): path to the file on the remote server
hostname (optional): SSH server hostname. If not provided, uses the server.hostname from job settings.
port (optional): SSH server port. If not provided, uses the server.port from job settings.
username (optional): SSH username. If not provided, uses the server.username from job settings.
password (optional): SSH password. If not provided, uses the server.password from job settings.
pem_path (optional): path to the PEM key file for key-based authentication. If not provided, uses the server.pem_path from job settings.
For example, we can read the file document.pdf from an SSH server with:
curl -XPOST http://127.0.0.1:8080/_document -H 'Content-Type: application/json' -d '{
"type": "ssh",
"ssh": {
"hostname": "my-ssh-server.example.com",
"port": 22,
"username": "myuser",
"password": "mypassword",
"path": "/home/myuser/documents/document.pdf"
}
}'
If you have already configured the SSH server settings in your job _settings.yaml file:
name: "my_job"
server:
  hostname: "my-ssh-server.example.com"
  port: 22
  username: "myuser"
  password: "mypassword"
  protocol: "SSH"
You can simplify the REST call by only providing the path:
curl -XPOST http://127.0.0.1:8080/_document -H 'Content-Type: application/json' -d '{
"type": "ssh",
"ssh": {
"path": "/home/myuser/documents/document.pdf"
}
}'
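The fallback rule (request values first, then the server section of the job settings) can be illustrated with a short Python sketch. It only models the rule as described above and is not FSCrawler's own code; effective_ssh_settings is a hypothetical name.

```python
def effective_ssh_settings(request_settings, job_server_settings):
    """Merge request values over job defaults; only "path" is required."""
    if "path" not in request_settings:
        raise ValueError("path is required")
    optional = ("hostname", "port", "username", "password", "pem_path")
    # Start from the job settings, keeping only the keys the plugin accepts.
    merged = {key: job_server_settings[key] for key in optional if key in job_server_settings}
    # Anything given in the REST call overrides the job default.
    merged.update(request_settings)
    return merged


job_server = {
    "hostname": "my-ssh-server.example.com",
    "port": 22,
    "username": "myuser",
    "password": "mypassword",
    "protocol": "SSH",
}
ssh_settings = effective_ssh_settings({"path": "/home/myuser/documents/document.pdf"}, job_server)
print(ssh_settings["hostname"], ssh_settings["port"])
```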
FTP plugin
The ftp plugin reads a file from an FTP server.
It accepts the following parameters:
path (required): path to the file on the remote server
hostname (optional): FTP server hostname. If not provided, uses the server.hostname from job settings.
port (optional): FTP server port. If not provided, uses the server.port from job settings.
username (optional): FTP username. If not provided, uses the server.username from job settings.
password (optional): FTP password. If not provided, uses the server.password from job settings.
For example, we can read the file document.pdf from an FTP server with:
curl -XPOST http://127.0.0.1:8080/_document -H 'Content-Type: application/json' -d '{
"type": "ftp",
"ftp": {
"hostname": "ftp.example.com",
"port": 21,
"username": "myuser",
"password": "mypassword",
"path": "/documents/document.pdf"
}
}'
If you have already configured the FTP server settings in your job _settings.yaml file:
name: "my_job"
server:
  hostname: "ftp.example.com"
  port: 21
  username: "myuser"
  password: "mypassword"
  protocol: "FTP"
You can simplify the REST call by only providing the path:
curl -XPOST http://127.0.0.1:8080/_document -H 'Content-Type: application/json' -d '{
"type": "ftp",
"ftp": {
"path": "/documents/document.pdf"
}
}'
Simulate Upload
To get back the extracted content and its metadata without
indexing anything into Elasticsearch, use the simulate=true query
parameter:
echo "This is my text" > test.txt
curl -F "file=@test.txt" "http://127.0.0.1:8080/_document?debug=true&simulate=true"
Document ID
By default, FSCrawler encodes the filename to generate an id. This
means that if you send 2 files with the same filename test.txt, the
second one will overwrite the first one because they both share the
same ID.
You can force any id you wish by adding id=YOUR_ID as a parameter:
echo "This is my text" > test.txt
curl -F "file=@test.txt" "http://127.0.0.1:8080/_document?id=my-test"
You can also pass the id parameter within the form data:
echo "This is my text" > test.txt
curl -F "file=@test.txt" -F "id=my-test" "http://127.0.0.1:8080/_document"
There is a special id named _auto_, which tells Elasticsearch to
generate the ID itself. Sending the same file twice will then
result in 2 different indexed documents.
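The three id rules can be simulated in Python with a dictionary standing in for the index. This is an illustration only: resolve_id is a hypothetical function, SHA-256 is a stand-in because the text does not say which encoding FSCrawler applies to the filename, and a random UUID stands in for an id generated by Elasticsearch.

```python
import hashlib
import uuid


def resolve_id(filename, requested_id=None):
    """Pick the document id following the rules described above."""
    if requested_id == "_auto_":
        return uuid.uuid4().hex  # stand-in for an Elasticsearch-generated id
    if requested_id:
        return requested_id  # id forced by the caller
    # Stand-in encoding: deterministic, so the same filename gives the same id.
    return hashlib.sha256(filename.encode("utf-8")).hexdigest()


fake_index = {}
for requested in (None, None, "_auto_", "_auto_", "my-test"):
    fake_index[resolve_id("test.txt", requested)] = "This is my text"

# Two default uploads collapse into one document, two _auto_ uploads stay
# separate, and the forced id adds one more.
print(len(fake_index))
```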
Remove a document
Added in version 2.10.
To remove a document, you can call DELETE /_document endpoint.
If you only know the filename, you can pass it to FSCrawler using the filename field:
curl -X DELETE "http://127.0.0.1:8080/_document?filename=test.txt"
It will give you a response similar to:
{
"ok": true,
"filename": "test.txt",
"index": "rest",
"id": "dd18bf3a8ea2a3e53e2661c7fb53534"
}
If you know the document id, you can pass it to FSCrawler within the url:
curl -X DELETE "http://127.0.0.1:8080/_document/dd18bf3a8ea2a3e53e2661c7fb53534"
If the document does not exist, you will get the following response:
{
"ok": false,
"message": "Can not remove document [rest/test.txt]: Can not remove document rest/dd18bf3a8ea2a3e53e2661c7fb53534 cause: NOT_FOUND",
"filename": "test.txt",
"index": "rest",
"id": "dd18bf3a8ea2a3e53e2661c7fb53534"
}
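Both forms of the delete call, by filename in the query string or by id in the URL path, can be sketched in Python. build_delete_request is a hypothetical helper; the requests are built but not sent.

```python
from urllib.parse import quote, urlencode
from urllib.request import Request

BASE_URL = "http://127.0.0.1:8080"  # default REST address used on this page


def build_delete_request(filename=None, doc_id=None):
    """Build DELETE /_document by filename (query string) or by id (URL path)."""
    if (filename is None) == (doc_id is None):
        raise ValueError("pass exactly one of filename or doc_id")
    if doc_id is not None:
        url = BASE_URL + "/_document/" + quote(doc_id, safe="")
    else:
        url = BASE_URL + "/_document?" + urlencode({"filename": filename})
    return Request(url, method="DELETE")


by_name = build_delete_request(filename="test.txt")
by_id = build_delete_request(doc_id="dd18bf3a8ea2a3e53e2661c7fb53534")
print(by_name.get_method(), by_name.full_url)
print(by_id.get_method(), by_id.full_url)
```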
Specifying an elasticsearch index
By default, FSCrawler creates documents in the index defined in the _settings.yaml file.
However, with the REST service you can ask FSCrawler to use a different index by setting the index
parameter:
echo "This is my text" > test.txt
curl -F "file=@test.txt" "http://127.0.0.1:8080/_document?index=my-index"
curl -X DELETE "http://127.0.0.1:8080/_document?filename=test.txt&index=my-index"
When uploading, you can also pass the index parameter within the form data:
echo "This is my text" > test.txt
curl -F "file=@test.txt" -F "index=my-index" "http://127.0.0.1:8080/_document"
Enabling CORS
To enable Cross-Origin Resource Sharing (CORS), set enable_cors: true
under rest in your job settings. Doing so will enable the relevant access headers
on all REST service resource responses (for example / and /_document).
You can check if CORS is enabled with:
curl -I http://127.0.0.1:8080/
The response header should contain Access-Control-Allow-* parameters like:
Access-Control-Allow-Origin: *
Access-Control-Allow-Headers: origin, content-type, accept, authorization
Access-Control-Allow-Credentials: true
Access-Control-Allow-Methods: GET, POST, PUT, PATCH, DELETE, OPTIONS, HEAD
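The same check can be automated by filtering the response headers. The sketch below works on any mapping of header names to values; cors_allow_headers and cors_enabled are hypothetical helpers, and treating the presence of Access-Control-Allow-Origin as the signal is an assumption, not an FSCrawler rule.

```python
def cors_allow_headers(headers):
    """Return the Access-Control-Allow-* headers of a response, names lower-cased."""
    prefix = "access-control-allow-"
    return {k.lower(): v for k, v in headers.items() if k.lower().startswith(prefix)}


def cors_enabled(headers):
    """Treat CORS as enabled when Access-Control-Allow-Origin is present."""
    return "access-control-allow-origin" in cors_allow_headers(headers)


sample_headers = {
    "Access-Control-Allow-Origin": "*",
    "Access-Control-Allow-Headers": "origin, content-type, accept, authorization",
    "Access-Control-Allow-Credentials": "true",
    "Access-Control-Allow-Methods": "GET, POST, PUT, PATCH, DELETE, OPTIONS, HEAD",
    "Content-Length": "0",  # unrelated header, included only to show the filtering
}
print(cors_enabled(sample_headers))
print(sorted(cors_allow_headers(sample_headers)))
```

With a live server, the headers attribute of the response returned by urllib.request.urlopen can be passed in directly, since it also provides items().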
REST settings
Here is a list of REST service settings (under rest. prefix):
| Name | Environment Variable | Default value | Documentation |
|---|---|---|---|
| rest.url | | http://127.0.0.1:8080 | Rest Service URL |
| rest.enable_cors | | false | Enables or disables Cross-Origin Resource Sharing globally for all resources |
Tip
Most Local FS settings (under fs.* in the
settings file) also affect the REST service, e.g. fs.indexed_chars.
Local FS settings that do not affect the REST service are those such
as url, update_rate, includes, excludes.
By default, the REST service runs at http://127.0.0.1:8080/.
You can change this with the rest settings:
name: "test"
rest:
  url: "http://192.168.0.1:8180/my_fscrawler"
This also means that if you run more than one FSCrawler instance on the same machine, you must give each one a different port; otherwise they will conflict.
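To check up front whether two rest.url values would collide, the URLs can be split into host, port and path. A minimal sketch with the standard library; describe_rest_url and same_listen_address are hypothetical helper names, and comparing host and port only is a simplification.

```python
from urllib.parse import urlsplit


def describe_rest_url(url):
    """Split a rest.url value into the parts that matter for binding."""
    parts = urlsplit(url)
    return {"host": parts.hostname, "port": parts.port, "path": parts.path or "/"}


def same_listen_address(url_a, url_b):
    """Return True when both URLs use the same host and port."""
    a, b = describe_rest_url(url_a), describe_rest_url(url_b)
    return (a["host"], a["port"]) == (b["host"], b["port"])


custom = describe_rest_url("http://192.168.0.1:8180/my_fscrawler")
print(custom)
print(same_listen_address("http://127.0.0.1:8080", "http://127.0.0.1:8080/other"))
print(same_listen_address("http://127.0.0.1:8080", "http://127.0.0.1:8081"))
```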
Crawler control
Added in version 2.10.
FSCrawler provides REST endpoints to control the crawler, allowing you to pause, resume, and monitor the crawling process. This is particularly useful for large file systems where crawling may take a long time and you want to be able to stop and resume later.
Getting crawler status
To get the current status of the crawler, you can call GET /_crawler/status:
curl http://127.0.0.1:8080/_crawler/status
It will give you a response similar to:
{
"state" : "RUNNING",
"scan_id" : "abc123-def456",
"current_path" : "/data/documents/subfolder",
"pending_directories" : 42,
"completed_directories" : 158,
"files_processed" : 1523,
"files_deleted" : 12,
"scan_start_time" : "2024-01-15T10:30:00",
"scan_end_time" : null,
"next_check" : null,
"elapsed_time" : "15m 32s",
"retry_count" : 0,
"last_error" : null
}
When a scan is completed, the response will also include the scan_end_time and next_check fields:
{
"state" : "COMPLETED",
"files_processed" : 2500,
"files_deleted" : 25,
"scan_start_time" : "2024-01-15T10:30:00",
"scan_end_time" : "2024-01-15T11:45:00",
"next_check" : "2024-01-15T12:00:00",
"elapsed_time" : "1h 15m"
}
The possible states are:
RUNNING: The crawler is actively processing files
PAUSED: The crawler is between runs or has been explicitly paused (see below)
STOPPED: The crawler is not running
COMPLETED: The crawler has finished its scan successfully
ERROR: The crawler encountered an error and stopped
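A client can map the state field onto an enumeration so that an unexpected value fails loudly. A minimal Python sketch; CrawlerState and parse_state are hypothetical client-side names, not part of FSCrawler.

```python
import json
from enum import Enum


class CrawlerState(Enum):
    """The states listed above, as reported by GET /_crawler/status."""

    RUNNING = "RUNNING"
    PAUSED = "PAUSED"
    STOPPED = "STOPPED"
    COMPLETED = "COMPLETED"
    ERROR = "ERROR"


def parse_state(status_json):
    """Extract the state from a status response body; raises ValueError if unknown."""
    return CrawlerState(json.loads(status_json)["state"])


state = parse_state('{"state": "RUNNING", "files_processed": 1523}')
print(state.name)
```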
Behavior between runs
After each crawl run, the crawler enters a pause and waits for the next run. The next run starts when either:
You call POST /_crawler/resume (run on demand), or
The configured update_rate time has elapsed (automatic run).
This behavior is the same whether you use the REST service or not. So you can trigger a run
at any time with resume, or let the crawler run automatically at the scheduled interval.
If you explicitly call POST /_crawler/pause while the crawler is in that “between runs”
wait, the crawler will not start the next run when the time elapses; it will only start
when you call POST /_crawler/resume. This lets you truly pause and control when the next
run happens.
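The rules above reduce to a small decision function. The sketch below only models the documented behavior and is not FSCrawler's implementation; should_start_next_run and its parameters are hypothetical names, and durations are expressed in seconds.

```python
def should_start_next_run(resume_called, explicitly_paused, waited_seconds, update_rate_seconds):
    """Decide whether the next crawl run starts, per the rules described above."""
    if resume_called:
        return True  # run on demand, whether or not the crawler was explicitly paused
    if explicitly_paused:
        return False  # the timer is ignored until resume is called
    return waited_seconds >= update_rate_seconds  # automatic run


FIFTEEN_MINUTES = 15 * 60  # matches the update_rate of "15m" shown earlier on this page
print(should_start_next_run(False, False, 900, FIFTEEN_MINUTES))
print(should_start_next_run(False, True, 3600, FIFTEEN_MINUTES))
```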
Pausing the crawler
To pause the crawler, call POST /_crawler/pause:
curl -X POST http://127.0.0.1:8080/_crawler/pause
The crawler will save its current progress (checkpoint) and pause. While explicitly paused,
it will not automatically start the next run when update_rate elapses; you must call
POST /_crawler/resume to start the next run. You can also safely stop FSCrawler while
it is paused; after a restart, it continues from where it left off once you call
resume.
Success response (200):
{
"ok" : true,
"message" : "Crawler paused. Checkpoint saved."
}
If the crawler is already paused, you get 200 with:
{
"ok" : true,
"message" : "Crawler is already paused."
}
Error response (400) when the crawler is not running:
{
"ok" : false,
"message" : "Crawler is not running"
}
Resuming the crawler
To resume a paused crawler, call POST /_crawler/resume:
curl -X POST http://127.0.0.1:8080/_crawler/resume
Success response (200) when resuming from pause:
{
"ok" : true,
"message" : "Crawler resumed."
}
If the crawler is not paused, you get 200 with no action taken:
{
"ok" : true,
"message" : "Crawler is not paused. No action needed."
}
Error response (400) when the crawler is closed:
{
"ok" : false,
"message" : "Crawler is closed. Cannot resume."
}
Clearing the checkpoint
If you want to force a fresh start and ignore any saved progress, you can clear the checkpoint file. The crawler must be paused or stopped first:
curl -X DELETE http://127.0.0.1:8080/_crawler/checkpoint
Success response (200):
{
"ok" : true,
"message" : "Checkpoint cleared"
}
Error response (400) when the crawler is running and not paused:
{
"ok" : false,
"message" : "Cannot clear checkpoint while crawler is running. Pause or stop it first."
}
Error response (404) when there is no active crawler (e.g. started with --loop 0):
{
"ok" : false,
"message" : "Failed to clear checkpoint as we don't have a checkpoint handler. This probably means there's no active crawler. Did you start with --loop 0?"
}
On I/O error, the server returns 500 with a message starting with Failed to clear checkpoint:.
Note
You can also clear the checkpoint using the --restart command line option
when starting FSCrawler. See CLI options for more details.
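The crawler control endpoints described above can be gathered in one small client-side table. A minimal Python sketch: CRAWLER_ENDPOINTS and crawler_request are hypothetical names, and the requests are built but not sent.

```python
from urllib.request import Request

# HTTP method and path for each control action documented above.
CRAWLER_ENDPOINTS = {
    "status": ("GET", "/_crawler/status"),
    "pause": ("POST", "/_crawler/pause"),
    "resume": ("POST", "/_crawler/resume"),
    "clear_checkpoint": ("DELETE", "/_crawler/checkpoint"),
}


def crawler_request(action, base_url="http://127.0.0.1:8080"):
    """Build the unsent request for one crawler control action."""
    method, path = CRAWLER_ENDPOINTS[action]
    return Request(base_url + path, method=method)


for action in CRAWLER_ENDPOINTS:
    req = crawler_request(action)
    print(action, req.get_method(), req.full_url)
```

When such a request is sent with urllib.request.urlopen, the 400, 404 and 500 responses listed above surface as urllib.error.HTTPError; its code attribute holds the status and read() returns the JSON body.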
Automatic resume after crash
If FSCrawler crashes or is forcefully terminated during a crawl, it will automatically resume from the last saved checkpoint when restarted. The checkpoint is saved periodically (every 100 files by default) and whenever the crawler state changes.
Network error recovery
FSCrawler automatically handles network errors with exponential backoff retry. If a network error occurs, it will:
Save the current checkpoint
Wait with exponential backoff (starting from 1 second, doubling each retry)
Attempt to reconnect
Resume from the failed directory
After 10 consecutive failures, the crawler will stop with an error state. You can then fix the network issue and restart FSCrawler to resume from the checkpoint.
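The retry schedule can be sketched as a generator. This is an illustration under assumptions the text does not confirm: that the delay has no upper cap and that one wait is associated with each of the 10 failures. backoff_delays is a hypothetical name.

```python
def backoff_delays(max_failures=10, initial_seconds=1.0):
    """Yield the wait before each retry: the initial delay, doubled every time."""
    delay = initial_seconds
    for _ in range(max_failures):
        yield delay
        delay *= 2


delays = list(backoff_delays())
print(delays[:5])
print(sum(delays))
```

Under these assumptions the last wait is 512 seconds and the total wait before the crawler gives up is a little over 17 minutes.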