REST service 

FSCrawler can expose a REST service running at http://127.0.0.1:8080/. To activate it, launch FSCrawler with the --rest option.

General settings 

Added in version 2.10.

For all the APIs on this page, you can pass parameters in different ways.

You can use a query string parameter:

curl "http://127.0.0.1:8080/API?param1=foo&param2=bar"

You can use a header parameter:

curl -H "param1: foo" -H "param2: bar" "http://127.0.0.1:8080/API"

The rest of this documentation will assume using a query string parameter unless stated otherwise.
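
As an illustration of the two styles, here is a minimal Python sketch that builds the same call both ways with the standard library. The helper names with_query_params and with_header_params are made up for this example, and /API is the placeholder path used above, not a real endpoint.

```python
from urllib.parse import urlencode
from urllib.request import Request

BASE_URL = "http://127.0.0.1:8080"  # default REST address from this page


def with_query_params(path, params):
    """Build a request that carries the parameters in the query string."""
    return Request(f"{BASE_URL}{path}?{urlencode(params)}")


def with_header_params(path, params):
    """Build a request that carries the same parameters as HTTP headers."""
    return Request(f"{BASE_URL}{path}", headers=dict(params))


# Nothing is sent until a request is passed to urllib.request.urlopen().
query_req = with_query_params("/API", {"param1": "foo", "param2": "bar"})
header_req = with_header_params("/API", {"param1": "foo", "param2": "bar"})
```

Both objects target the same resource; only the place where the parameters travel differs.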

FSCrawler status 

To get an overview of the running service, call the GET / endpoint:

curl http://127.0.0.1:8080/

It will give you a response similar to:

{
  "ok" : true,
  "version" : "2.10-SNAPSHOT",
  "elasticsearch" : "9.4.3",
  "settings" : {
    "name" : "fscrawler",
    "fs" : {
      "url" : "/tmp/es",
      "update_rate" : "15m",
      "excludes" : [ "*/~*" ],
      "json_support" : false,
      "add_as_inner_object" : false,
      "xml_support" : false,
      "follow_symlinks" : false,
      "remove_deleted" : true,
      "continue_on_error" : false,
      "filename_as_id" : false,
      "add_filesize" : true,
      "attributes_support" : false,
      "store_source" : false,
      "index_content" : true,
      "acl_support" : false,
      "raw_metadata" : true,
      "index_folders" : true,
      "lang_detect" : false,
      "ocr" : {
        "enabled" : true,
        "language" : "eng",
        "output_type" : "txt",
        "pdf_strategy" : "ocr_and_text",
        "page_seg_mode" : 1,
        "preserve_interword_spacing" : false
      }
    },
    "server" : {
      "port" : 0,
      "protocol" : "local"
    },
    "elasticsearch" : {
      "urls" : [ "http://es-fscrawler:9200" ],
      "index" : "fscrawler_docs",
      "index_folder" : "fscrawler_folder",
      "bulk_size" : 100,
      "flush_interval" : "5s",
      "byte_size" : "10mb",
      "username" : "elastic",
      "ssl_verification" : true,
      "push_templates" : true,
      "semantic_search" : true
    },
    "rest" : {
      "url" : "http://127.0.0.1:8080",
      "enable_cors" : false
    },
    "tags" : {
      "meta_filename" : ".meta.yml"
    }
  }
}

Uploading a binary document 

To upload a binary, call the POST /_document endpoint:

echo "This is my text" > test.txt
curl -F "file=@test.txt" "http://127.0.0.1:8080/_document"

It will give you a response similar to:

{
  "ok" : true,
  "filename" : "test.txt",
  "url" : "http://127.0.0.1:9200/fscrawler-rest-tests_doc/_doc/dd18bf3a8ea2a3e53e2661c7fb53534"
}

The url field is the Elasticsearch address of the indexed document. If you call:

curl "http://127.0.0.1:9200/fscrawler-rest-tests_doc/_doc/dd18bf3a8ea2a3e53e2661c7fb53534?pretty"

You will get back your document as it has been stored by Elasticsearch:

{
  "_index" : "fscrawler-rest-tests_doc",
  "_type" : "_doc",
  "_id" : "dd18bf3a8ea2a3e53e2661c7fb53534",
  "_version" : 1,
  "found" : true,
  "_source" : {
    "content" : "This file contains some words.\n",
    "meta" : {
      "raw" : {
        "X-Parsed-By" : "org.apache.tika.parser.DefaultParser",
        "Content-Encoding" : "ISO-8859-1",
        "Content-Type" : "text/plain; charset=ISO-8859-1"
      }
    },
    "file" : {
      "extension" : "txt",
      "content_type" : "text/plain; charset=ISO-8859-1",
      "indexing_date" : "2017-01-04T21:01:08.043",
      "filename" : "test.txt"
    },
    "path" : {
      "virtual" : "test.txt",
      "real" : "test.txt"
    }
  }
}

If you started FSCrawler in debug mode or if you pass the debug=true query parameter, the response also contains the extracted document in a doc field:

echo "This is my text" > test.txt
curl -F "file=@test.txt" "http://127.0.0.1:8080/_document?debug=true"

It will give you a response similar to:

{
  "ok" : true,
  "filename" : "test.txt",
  "url" : "http://127.0.0.1:9200/fscrawler-rest-tests_doc/_doc/dd18bf3a8ea2a3e53e2661c7fb53534",
  "doc" : {
    "content" : "This file contains some words.\n",
    "meta" : {
      "raw" : {
        "X-Parsed-By" : "org.apache.tika.parser.DefaultParser",
        "Content-Encoding" : "ISO-8859-1",
        "Content-Type" : "text/plain; charset=ISO-8859-1"
      }
    },
    "file" : {
      "extension" : "txt",
      "content_type" : "text/plain; charset=ISO-8859-1",
      "indexing_date" : "2017-01-04T14:05:10.325",
      "filename" : "test.txt"
    },
    "path" : {
      "virtual" : "test.txt",
      "real" : "test.txt"
    }
  }
}
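
The curl -F upload used in this section can also be prepared from Python. The sketch below builds the multipart/form-data body by hand with the standard library only. The function name build_upload_request is hypothetical, and the application/octet-stream part type is an assumption of this example rather than something FSCrawler requires.

```python
import uuid
from urllib.parse import urlencode
from urllib.request import Request


def build_upload_request(filename, content, params=None,
                         base_url="http://127.0.0.1:8080"):
    """Build a multipart POST /_document request with one "file" part."""
    boundary = uuid.uuid4().hex  # random separator between form parts
    disposition = (
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
    )
    body = b"".join([
        f"--{boundary}\r\n".encode("ascii"),
        disposition.encode("utf-8"),
        b"Content-Type: application/octet-stream\r\n\r\n",
        content,
        f"\r\n--{boundary}--\r\n".encode("ascii"),
    ])
    url = f"{base_url}/_document"
    if params:
        # Optional query string parameters such as debug=true
        url += "?" + urlencode(params)
    return Request(
        url,
        data=body,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )


# Same file and debug flag as the curl example; nothing is sent yet.
upload = build_upload_request("test.txt", b"This is my text\n", {"debug": "true"})
```

Passing the returned object to urllib.request.urlopen would send it to a running FSCrawler REST service.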

Uploading a binary document from a 3rd party service 

Added in version 2.10.

You can also ask FSCrawler to fetch a document from a 3rd party service and index it into Elasticsearch. So far, FSCrawler supports the following services:

local: reads a file from the server where FSCrawler is running (a local file)
http: reads a file from a URL
s3: reads a file from an S3 compatible service
ssh: reads a file from an SSH/SFTP server
ftp: reads a file from an FTP server

To upload a binary from a 3rd party service, call the POST /_document endpoint and pass a JSON document which describes the service settings:

curl -XPOST http://127.0.0.1:8080/_document -H 'Content-Type: application/json' -d '{
  "type": "<TYPE>",
  "<TYPE>": {
    // Settings for the <TYPE>
  }
}'
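
The template above repeats the type name as the key of the settings object. A small Python sketch of that shape follows; build_fetch_request is a hypothetical helper name, and the example URL is only illustrative of the http type listed above.

```python
import json
from urllib.request import Request


def build_fetch_request(service_type, settings,
                        base_url="http://127.0.0.1:8080"):
    """Build the POST /_document request for a 3rd party service."""
    # The service name appears twice: as the "type" value and as the
    # key that holds the settings for that service.
    payload = {"type": service_type, service_type: settings}
    return Request(
        f"{base_url}/_document",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Nothing is sent until the request is passed to urllib.request.urlopen().
fetch_req = build_fetch_request("http", {"url": "https://www.elastic.co/robots.txt"})
```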

Local plugin 

The local plugin reads a file from the server where FSCrawler is running (a local file). It needs the following parameter:

url: link to the local file

For example, we can read the file bar.txt from the /path/to/foo directory with:

curl -XPOST http://127.0.0.1:8080/_document -H 'Content-Type: application/json' -d '{
  "type": "local",
  "local": {
    "url": "/path/to/foo/bar.txt"
  }
}'

Note

For security reasons, the local plugin can only read files which are under the path defined in the job settings file under fs.url.

HTTP plugin 

The http plugin reads a file from a given URL. It needs the following parameter:

url: link to the file

For example, we can read the file robots.txt from the https://www.elastic.co/ website with:

curl -XPOST http://127.0.0.1:8080/_document -H 'Content-Type: application/json' -d '{
  "type": "http",
  "http": {
    "url": "https://www.elastic.co/robots.txt"
  }
}'

S3 plugin 

The s3 plugin reads a file from an S3 compatible service. It needs the following parameters:

url: URL of the S3 service
bucket: bucket name
object: object to read from the bucket
access_key: access key (or login)
secret_key: secret key (or password)

For example, we can read the file foo.txt from the bucket foo running on https://s3.amazonaws.com/ with:

curl -XPOST http://127.0.0.1:8080/_document -H 'Content-Type: application/json' -d '{
  "type": "s3",
  "s3": {
    "url": "https://s3.amazonaws.com",
    "bucket": "foo",
    "object": "foo.txt",
    "access_key": "ACCESS",
    "secret_key": "SECRET"
  }
}'

If you are using MinIO, you can use:

curl -XPOST http://127.0.0.1:8080/_document -H 'Content-Type: application/json' -d '{
  "type": "s3",
  "s3": {
    "url": "http://localhost:9000",
    "bucket": "foo",
    "object": "foo.txt",
    "access_key": "minioadmin",
    "secret_key": "minioadmin"
  }
}'

SSH plugin 

The ssh plugin reads a file from an SSH/SFTP server. It accepts the following parameters:

path (required): path to the file on the remote server
hostname (optional): SSH server hostname. If not provided, uses the server.hostname from job settings.
port (optional): SSH server port. If not provided, uses the server.port from job settings.
username (optional): SSH username. If not provided, uses the server.username from job settings.
password (optional): SSH password. If not provided, uses the server.password from job settings.
pem_path (optional): path to the PEM key file for key-based authentication. If not provided, uses the server.pem_path from job settings.

For example, we can read the file document.pdf from an SSH server with:

curl -XPOST http://127.0.0.1:8080/_document -H 'Content-Type: application/json' -d '{
  "type": "ssh",
  "ssh": {
    "hostname": "my-ssh-server.example.com",
    "port": 22,
    "username": "myuser",
    "password": "mypassword",
    "path": "/home/myuser/documents/document.pdf"
  }
}'

If you have already configured the SSH server settings in your job _settings.yaml file:

name: "my_job"
server:
  hostname: "my-ssh-server.example.com"
  port: 22
  username: "myuser"
  password: "mypassword"
  protocol: "SSH"

You can simplify the REST call by only providing the path:

curl -XPOST http://127.0.0.1:8080/_document -H 'Content-Type: application/json' -d '{
  "type": "ssh",
  "ssh": {
    "path": "/home/myuser/documents/document.pdf"
  }
}'
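
The fallback rule for the optional fields can be pictured with the following Python sketch. It only models the behavior described in the parameter list above; it is not FSCrawler's implementation, and effective_ssh_settings is a name invented for this example.

```python
def effective_ssh_settings(request_settings, job_server_settings):
    """Fill missing optional fields from the job's server settings."""
    if "path" not in request_settings:
        raise ValueError("path is required")
    optional = ("hostname", "port", "username", "password", "pem_path")
    merged = {"path": request_settings["path"]}
    for key in optional:
        # A value in the REST call wins over the job settings value.
        value = request_settings.get(key, job_server_settings.get(key))
        if value is not None:
            merged[key] = value
    return merged


job_server = {
    "hostname": "my-ssh-server.example.com",
    "port": 22,
    "username": "myuser",
    "password": "mypassword",
    "protocol": "SSH",
}
merged = effective_ssh_settings(
    {"path": "/home/myuser/documents/document.pdf"}, job_server
)
```

With only path in the REST call, merged ends up with the hostname, port, username and password taken from the job settings.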

FTP plugin 

The ftp plugin reads a file from an FTP server. It accepts the following parameters:

path (required): path to the file on the remote server
hostname (optional): FTP server hostname. If not provided, uses the server.hostname from job settings.
port (optional): FTP server port. If not provided, uses the server.port from job settings.
username (optional): FTP username. If not provided, uses the server.username from job settings.
password (optional): FTP password. If not provided, uses the server.password from job settings.

For example, we can read the file document.pdf from an FTP server with:

curl -XPOST http://127.0.0.1:8080/_document -H 'Content-Type: application/json' -d '{
  "type": "ftp",
  "ftp": {
    "hostname": "ftp.example.com",
    "port": 21,
    "username": "myuser",
    "password": "mypassword",
    "path": "/documents/document.pdf"
  }
}'

If you have already configured the FTP server settings in your job _settings.yaml file:

name: "my_job"
server:
  hostname: "ftp.example.com"
  port: 21
  username: "myuser"
  password: "mypassword"
  protocol: "FTP"

You can simplify the REST call by only providing the path:

curl -XPOST http://127.0.0.1:8080/_document -H 'Content-Type: application/json' -d '{
  "type": "ftp",
  "ftp": {
    "path": "/documents/document.pdf"
  }
}'

Simulate Upload 

If you want to get back the extracted content and its metadata without indexing anything into Elasticsearch, use the simulate=true query parameter:

echo "This is my text" > test.txt
curl -F "file=@test.txt" "http://127.0.0.1:8080/_document?debug=true&simulate=true"

Document ID

By default, FSCrawler encodes the filename to generate an id. This means that if you send 2 files with the same filename test.txt, the second one will overwrite the first one because they will both share the same ID.

You can force any id you wish by adding id=YOUR_ID as a parameter:

echo "This is my text" > test.txt
curl -F "file=@test.txt" "http://127.0.0.1:8080/_document?id=my-test"

You can also pass the id parameter within the form data:

echo "This is my text" > test.txt
curl -F "file=@test.txt" -F "id=my-test" "http://127.0.0.1:8080/_document"

There is a special id named _auto_: the ID is then generated by Elasticsearch, so sending the same file twice results in 2 different indexed documents.

Additional tags 

You can add custom tags to the document, for example to filter on them later (projectId or tenantId are typical examples). These tags can be assigned to an external object field. As the JSON below shows, you can also overwrite the content field; the meta, file and path fields can be overwritten as well. To upload a binary with additional tags, call the POST /_document endpoint:

{
  "content": "OVERWRITE CONTENT",
  "external": {
    "tenantId": 23,
    "projectId": 34,
    "description": "these are additional tags"
  }
}

echo "This is my text" > test.txt
echo "{\"content\":\"OVERWRITE CONTENT\",\"external\":{\"tenantId\": 23,\"projectId\": 34,\"description\":\"these are additional tags\"}}" > tags.txt
curl -F "file=@test.txt" -F "tags=@tags.txt" "http://127.0.0.1:8080/_document"
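
Escaping the JSON by hand in the shell is error-prone. As an alternative sketch, the same tags document can be produced with Python's json module and then written to tags.txt or sent as the tags form field. The helper name build_tags is hypothetical.

```python
import json


def build_tags(external, **standard_fields):
    """Return the JSON text to send as the tags part of the upload."""
    # Standard FSCrawler fields (for example content) sit at the top level,
    # custom tags go under "external".
    tags = dict(standard_fields)
    tags["external"] = external
    return json.dumps(tags)


tags_json = build_tags(
    {"tenantId": 23, "projectId": 34, "description": "these are additional tags"},
    content="OVERWRITE CONTENT",
)
```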

The external field does not have to be a flat structure. This is a more advanced example:

{
  "external": {
    "tenantId" : 23,
    "company": "shoe company",
    "projectId": 34,
    "project": "business development",
    "daysOpen": [
      "Mon",
      "Tue",
      "Wed",
      "Thu",
      "Fri"
    ],
    "products": [
      {
        "brand": "nike",
        "size": 41,
        "sub": "Air MAX"
      },
      {
        "brand": "reebok",
        "size": 43,
        "sub": "Pump"
      }
    ]
  }
}

You can use this technique to add, for example, the size of the file you are uploading:

echo "This is my text" > test.txt
curl -F "file=@test.txt" \
  -F "tags={\"file\":{\"filesize\":$(ls -l test.txt | awk '{print $5}')}}" \
  "http://127.0.0.1:8080/_document"

Attention

Outside the external field, only standard FSCrawler fields can be set.

Remove a document 

Added in version 2.10.

To remove a document, call the DELETE /_document endpoint.

If you only know the filename, you can pass it to FSCrawler using the filename parameter:

curl -X DELETE "http://127.0.0.1:8080/_document?filename=test.txt"

It will give you a response similar to:

{
  "ok": true,
  "filename": "test.txt",
  "index": "rest",
  "id": "dd18bf3a8ea2a3e53e2661c7fb53534"
}

If you know the document id, you can pass it to FSCrawler within the url:

curl -X DELETE "http://127.0.0.1:8080/_document/dd18bf3a8ea2a3e53e2661c7fb53534"

If the document does not exist, you will get the following response:

{
  "ok": false,
  "message": "Can not remove document [rest/test.txt]: Can not remove document rest/dd18bf3a8ea2a3e53e2661c7fb53534 cause: NOT_FOUND",
  "filename": "test.txt",
  "index": "rest",
  "id": "dd18bf3a8ea2a3e53e2661c7fb53534"
}
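
The two ways of addressing a document (by filename or by id) can be sketched in Python as follows. build_delete_request is a hypothetical helper; it only prepares the request and does not contact the server.

```python
from urllib.parse import quote, urlencode
from urllib.request import Request


def build_delete_request(filename=None, doc_id=None,
                         base_url="http://127.0.0.1:8080"):
    """Build a DELETE /_document request from a document id or a filename."""
    if doc_id is not None:
        # Known id: it goes into the URL path.
        url = f"{base_url}/_document/{quote(doc_id, safe='')}"
    elif filename is not None:
        # Only the filename is known: it goes into the query string.
        url = f"{base_url}/_document?{urlencode({'filename': filename})}"
    else:
        raise ValueError("either filename or doc_id is required")
    return Request(url, method="DELETE")


by_name = build_delete_request(filename="test.txt")
by_id = build_delete_request(doc_id="dd18bf3a8ea2a3e53e2661c7fb53534")
```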

Specifying an elasticsearch index 

By default, FSCrawler creates documents in the index defined in the _settings.yaml file. However, with the REST service, you can ask FSCrawler to use a different index by setting the index parameter:

echo "This is my text" > test.txt
curl -F "file=@test.txt" "http://127.0.0.1:8080/_document?index=my-index"
curl -X DELETE "http://127.0.0.1:8080/_document?filename=test.txt&index=my-index"

When uploading, you can also pass the index parameter within the form data:

echo "This is my text" > test.txt
curl -F "file=@test.txt" -F "index=my-index" "http://127.0.0.1:8080/_document"

Enabling CORS 

To enable Cross-Origin Resource Sharing (CORS), set enable_cors: true under rest in your job settings. Doing so enables the relevant access headers on all REST service resource responses (for example / and /_document).

You can check if CORS is enabled with:

curl -I http://127.0.0.1:8080/

The response headers should contain Access-Control-Allow-* entries such as:

Access-Control-Allow-Origin: *
Access-Control-Allow-Headers: origin, content-type, accept, authorization
Access-Control-Allow-Credentials: true
Access-Control-Allow-Methods: GET, POST, PUT, PATCH, DELETE, OPTIONS, HEAD

REST settings 

Here is a list of REST service settings (under the rest. prefix):

Name	Environment Variable	Default value	Documentation
`rest.url`	`FSCRAWLER_REST_URL`	`http://127.0.0.1:8080`	Rest Service URL
`rest.enable_cors`	`FSCRAWLER_REST_ENABLE_CORS`	`false`	Enables or disables Cross-Origin Resource Sharing globally for all resources

Tip

Most Local FS settings (under fs.* in the settings file) also affect the REST service, e.g. fs.indexed_chars. Local FS settings that do not affect the REST service are those such as url, update_rate, includes, excludes.

REST service is running at http://127.0.0.1:8080/ by default.

You can change it using rest settings:

name: "test"
rest:
  url: "http://192.168.0.1:8180/my_fscrawler"

This also means that if you run more than one instance of FSCrawler locally, you must give each instance a different port to avoid a conflict.

Crawler control 

Added in version 2.10.

FSCrawler provides REST endpoints to control the crawler, allowing you to pause, resume, and monitor the crawling process. This is particularly useful for large file systems where crawling may take a long time and you want to be able to stop and resume later.

Getting crawler status 

To get the current status of the crawler, you can call GET /_crawler/status:

curl http://127.0.0.1:8080/_crawler/status

It will give you a response similar to:

{
  "state" : "RUNNING",
  "scan_id" : "abc123-def456",
  "current_path" : "/data/documents/subfolder",
  "pending_directories" : 42,
  "completed_directories" : 158,
  "files_processed" : 1523,
  "files_deleted" : 12,
  "scan_start_time" : "2024-01-15T10:30:00",
  "scan_end_time" : null,
  "next_check" : null,
  "elapsed_time" : "15m 32s",
  "retry_count" : 0,
  "last_error" : null
}

When a scan is completed, the scan_end_time and next_check fields are populated:

{
  "state" : "COMPLETED",
  "files_processed" : 2500,
  "files_deleted" : 25,
  "scan_start_time" : "2024-01-15T10:30:00",
  "scan_end_time" : "2024-01-15T11:45:00",
  "next_check" : "2024-01-15T12:00:00",
  "elapsed_time" : "1h 15m"
}

The possible states are:

RUNNING: The crawler is actively processing files
PAUSED: The crawler is between runs or has been explicitly paused (see below)
STOPPED: The crawler is not running
COMPLETED: The crawler has finished its scan successfully
ERROR: The crawler encountered an error and stopped
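
A client can poll the status endpoint until the crawler leaves the RUNNING state. The sketch below is one way to do it. The grouping of COMPLETED, STOPPED and ERROR as stop states is a choice made for this example, and fetch_status is any callable you supply that returns the parsed JSON of GET /_crawler/status.

```python
import time

# States at which this example stops polling (an assumption of the sketch).
STOP_STATES = {"COMPLETED", "STOPPED", "ERROR"}


def wait_for_crawler(fetch_status, interval=5.0, max_polls=120,
                     sleep=time.sleep):
    """Poll fetch_status() until the reported state is in STOP_STATES."""
    for _ in range(max_polls):
        status = fetch_status()
        if status.get("state") in STOP_STATES:
            return status
        sleep(interval)  # wait before asking again
    raise TimeoutError("crawler did not reach a stop state in time")
```

The sleep parameter is injectable so that the loop can be exercised without real waiting.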

Behavior between runs 

After each crawl run, the crawler enters a pause and waits for the next run. The next run starts when either:

You call POST /_crawler/resume (run on demand), or
The configured update_rate time has elapsed (automatic run).

This behavior is the same whether you use the REST service or not. So you can trigger a run at any time with resume, or let the crawler run automatically at the scheduled interval.

If you explicitly call POST /_crawler/pause while the crawler is in that “between runs” wait, the crawler will not start the next run when the time elapses; it will only start when you call POST /_crawler/resume. This lets you truly pause and control when the next run happens.
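
The rules above can be condensed into a small decision function. This is only a model of the documented behavior, written for illustration; the function and parameter names are invented and do not come from FSCrawler.

```python
def should_start_next_run(explicitly_paused, resume_called,
                          waited_seconds, update_rate_seconds):
    """Decide whether the next crawl run starts, per the rules above."""
    if resume_called:
        # POST /_crawler/resume always triggers a run on demand.
        return True
    if explicitly_paused:
        # After POST /_crawler/pause, the timer alone never starts a run.
        return False
    # Otherwise the run starts once update_rate has elapsed.
    return waited_seconds >= update_rate_seconds
```

For example, with update_rate set to 15m (900 seconds), a crawler that has waited 900 seconds starts its next run unless it was explicitly paused.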

Pausing the crawler 

To pause the crawler, call POST /_crawler/pause:

curl -X POST http://127.0.0.1:8080/_crawler/pause

The crawler will save its current progress (checkpoint) and pause. While explicitly paused, it will not automatically start the next run when update_rate elapses; you must call POST /_crawler/resume to start the next run. You can also safely stop FSCrawler while paused; when you restart FSCrawler, it will resume from where it left off when you call resume.

Success response (200):

{
  "ok" : true,
  "message" : "Crawler paused. Checkpoint saved."
}

If the crawler is already paused, you get 200 with:

{
  "ok" : true,
  "message" : "Crawler is already paused."
}

Error response (400) when the crawler is not running:

{
  "ok" : false,
  "message" : "Crawler is not running"
}

Resuming the crawler 

To resume a paused crawler, call POST /_crawler/resume:

curl -X POST http://127.0.0.1:8080/_crawler/resume

Success response (200) when resuming from pause:

{
  "ok" : true,
  "message" : "Crawler resumed."
}

If the crawler is not paused, you get 200 with no action taken:

{
  "ok" : true,
  "message" : "Crawler is not paused. No action needed."
}

Error response (400) when the crawler is closed:

{
  "ok" : false,
  "message" : "Crawler is closed. Cannot resume."
}

Clearing the checkpoint 

If you want to force a fresh start and ignore any saved progress, you can clear the checkpoint file. The crawler must be paused or stopped first:

curl -X DELETE http://127.0.0.1:8080/_crawler/checkpoint

Success response (200):

{
  "ok" : true,
  "message" : "Checkpoint cleared"
}

Error response (400) when the crawler is running and not paused:

{
  "ok" : false,
  "message" : "Cannot clear checkpoint while crawler is running. Pause or stop it first."
}

Error response (404) when there is no active crawler (e.g. started with --loop 0):

{
  "ok" : false,
  "message" : "Failed to clear checkpoint as we don't have a checkpoint handler. This probably means there's no active crawler. Did you start with --loop 0?"
}

On I/O error, the server returns 500 with a message starting with Failed to clear checkpoint:.
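
For scripting, the crawler control endpoints described on this page can be gathered in one lookup table. The sketch below only prepares the requests; control_request and the action names are hypothetical.

```python
from urllib.request import Request

# HTTP method and path of each control endpoint documented on this page.
CONTROL_ENDPOINTS = {
    "status": ("GET", "/_crawler/status"),
    "pause": ("POST", "/_crawler/pause"),
    "resume": ("POST", "/_crawler/resume"),
    "clear_checkpoint": ("DELETE", "/_crawler/checkpoint"),
}


def control_request(action, base_url="http://127.0.0.1:8080"):
    """Build the request for one crawler control action."""
    method, path = CONTROL_ENDPOINTS[action]
    return Request(f"{base_url}{path}", method=method)


pause_req = control_request("pause")
clear_req = control_request("clear_checkpoint")
```

When such a request is sent with urllib.request.urlopen, the 400, 404 and 500 responses are raised as urllib.error.HTTPError, and the JSON message can be read from the exception with its read() method.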

Note

You can also clear the checkpoint using the --restart command line option when starting FSCrawler. See CLI options for more details.

Automatic resume after crash 

If FSCrawler crashes or is forcefully terminated during a crawl, it will automatically resume from the last saved checkpoint when restarted. The checkpoint is saved periodically (every 100 files by default) and whenever the crawler state changes.

Network error recovery 

FSCrawler automatically handles network errors with exponential backoff retry. If a network error occurs, it will:

Save the current checkpoint
Wait with exponential backoff (starting from 1 second, doubling each retry)
Attempt to reconnect
Resume from the failed directory

After 10 consecutive failures, the crawler will stop with an error state. You can then fix the network issue and restart FSCrawler to resume from the checkpoint.
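
To make the retry pattern concrete, here is a generic exponential backoff loop in Python. It is a sketch of the technique, not FSCrawler's code: the function name is invented, ConnectionError stands in for any network error, and it assumes that no wait happens after the final failure and that the delay has no upper bound, neither of which is stated above.

```python
import time


def run_with_backoff(operation, max_failures=10, first_delay=1.0,
                     sleep=time.sleep):
    """Call operation(), retrying on ConnectionError with doubling delays."""
    delay = first_delay
    for failure_count in range(1, max_failures + 1):
        try:
            return operation()
        except ConnectionError:
            if failure_count == max_failures:
                raise  # give up: the caller sees the last network error
            sleep(delay)  # 1s, then 2s, then 4s, ...
            delay *= 2
```

The sleep parameter can be replaced by a recording function to inspect the delays without waiting.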