REST service

FSCrawler can expose a REST service running at http://127.0.0.1:8080/. To activate it, launch FSCrawler with the --rest option.

General settings

Added in version 2.10.

For all the APIs on this page, you can pass parameters in different ways.

You can use a query string parameter:

curl "http://127.0.0.1:8080/API?param1=foo&param2=bar"

You can use a header parameter:

curl -H "param1: foo" -H "param2: bar" "http://127.0.0.1:8080/API"

The rest of this documentation will assume using a query string parameter unless stated otherwise.
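As a rough client-side illustration of the two styles above, the following Python sketch builds (but does not send) one request of each kind with the standard library only. The /API path and the param1/param2 names are the placeholders from the curl examples, and the helper names are made up for this example.

```python
from urllib.parse import urlencode
from urllib.request import Request

BASE_URL = "http://127.0.0.1:8080"  # default REST address used on this page


def with_query_params(path, params):
    """Build a request that carries the parameters in the query string."""
    return Request(f"{BASE_URL}{path}?{urlencode(params)}")


def with_header_params(path, params):
    """Build a request that carries the same parameters as HTTP headers."""
    return Request(f"{BASE_URL}{path}", headers=params)


# Nothing is sent here; the objects only describe the requests.
query_req = with_query_params("/API", {"param1": "foo", "param2": "bar"})
header_req = with_header_params("/API", {"param1": "foo", "param2": "bar"})
```

Passing either object to urllib.request.urlopen would send it to a running FSCrawler instance.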

FSCrawler status

To get an overview of the running service, you can call the GET / endpoint:

curl http://127.0.0.1:8080/

It will give you a response similar to:

{
  "ok" : true,
  "version" : "2.10-SNAPSHOT",
  "elasticsearch" : "9.4.1",
  "settings" : {
    "name" : "fscrawler",
    "fs" : {
      "url" : "/tmp/es",
      "update_rate" : "15m",
      "excludes" : [ "*/~*" ],
      "json_support" : false,
      "add_as_inner_object" : false,
      "xml_support" : false,
      "follow_symlinks" : false,
      "remove_deleted" : true,
      "continue_on_error" : false,
      "filename_as_id" : false,
      "add_filesize" : true,
      "attributes_support" : false,
      "store_source" : false,
      "index_content" : true,
      "acl_support" : false,
      "raw_metadata" : true,
      "index_folders" : true,
      "lang_detect" : false,
      "ocr" : {
        "enabled" : true,
        "language" : "eng",
        "output_type" : "txt",
        "pdf_strategy" : "ocr_and_text",
        "page_seg_mode" : 1,
        "preserve_interword_spacing" : false
      }
    },
    "server" : {
      "port" : 0,
      "protocol" : "local"
    },
    "elasticsearch" : {
      "urls" : [ "http://es-fscrawler:9200" ],
      "index" : "fscrawler_docs",
      "index_folder" : "fscrawler_folder",
      "bulk_size" : 100,
      "flush_interval" : "5s",
      "byte_size" : "10mb",
      "username" : "elastic",
      "ssl_verification" : true,
      "push_templates" : true,
      "semantic_search" : true
    },
    "rest" : {
      "url" : "http://127.0.0.1:8080",
      "enable_cors" : false
    },
    "tags" : {
      "meta_filename" : ".meta.yml"
    }
  }
}
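If you only need a few of these fields in a script, a small helper can pick them out. The sketch below assumes the response has the shape shown above; summarize_status is a hypothetical helper name, and the sample string is a shortened copy of that response rather than live output.

```python
import json


def summarize_status(body):
    """Extract a few fields from the JSON returned by GET /."""
    status = json.loads(body)
    settings = status.get("settings", {})
    return {
        "ok": status.get("ok", False),
        "version": status.get("version"),
        "job": settings.get("name"),
        "rest_url": settings.get("rest", {}).get("url"),
    }


# Shortened copy of the sample response shown above.
sample = (
    '{"ok": true, "version": "2.10-SNAPSHOT", '
    '"settings": {"name": "fscrawler", '
    '"rest": {"url": "http://127.0.0.1:8080", "enable_cors": false}}}'
)
summary = summarize_status(sample)
```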

Uploading a binary document

To upload a binary, you can call the POST /_document endpoint:

echo "This is my text" > test.txt
curl -F "file=@test.txt" "http://127.0.0.1:8080/_document"

It will give you a response similar to:

{
  "ok" : true,
  "filename" : "test.txt",
  "url" : "http://127.0.0.1:9200/fscrawler-rest-tests_doc/_doc/dd18bf3a8ea2a3e53e2661c7fb53534"
}

The url field is the Elasticsearch address of the indexed document. If you call:

curl "http://127.0.0.1:9200/fscrawler-rest-tests_doc/_doc/dd18bf3a8ea2a3e53e2661c7fb53534?pretty"

You will get back your document as it has been stored by Elasticsearch:

{
  "_index" : "fscrawler-rest-tests_doc",
  "_type" : "_doc",
  "_id" : "dd18bf3a8ea2a3e53e2661c7fb53534",
  "_version" : 1,
  "found" : true,
  "_source" : {
    "content" : "This file contains some words.\n",
    "meta" : {
      "raw" : {
        "X-Parsed-By" : "org.apache.tika.parser.DefaultParser",
        "Content-Encoding" : "ISO-8859-1",
        "Content-Type" : "text/plain; charset=ISO-8859-1"
      }
    },
    "file" : {
      "extension" : "txt",
      "content_type" : "text/plain; charset=ISO-8859-1",
      "indexing_date" : "2017-01-04T21:01:08.043",
      "filename" : "test.txt"
    },
    "path" : {
      "virtual" : "test.txt",
      "real" : "test.txt"
    }
  }
}
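A client that wants the index name and document id separately can split the url value returned by the upload. This is a minimal sketch that assumes the index/_doc/id layout seen in the sample response; index_and_id is a hypothetical helper, not part of FSCrawler.

```python
from urllib.parse import urlparse


def index_and_id(document_url):
    """Split a document URL of the form .../<index>/_doc/<id>."""
    segments = urlparse(document_url).path.strip("/").split("/")
    if len(segments) != 3 or segments[1] != "_doc":
        raise ValueError(f"unexpected document URL layout: {document_url}")
    return segments[0], segments[2]


index, doc_id = index_and_id(
    "http://127.0.0.1:9200/fscrawler-rest-tests_doc/_doc/dd18bf3a8ea2a3e53e2661c7fb53534"
)
```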

If you started FSCrawler in debug mode, or if you pass the debug=true query parameter, the response also includes the extracted document under the doc field:

echo "This is my text" > test.txt
curl -F "file=@test.txt" "http://127.0.0.1:8080/_document?debug=true"

It will give you a response similar to:

{
  "ok" : true,
  "filename" : "test.txt",
  "url" : "http://127.0.0.1:9200/fscrawler-rest-tests_doc/_doc/dd18bf3a8ea2a3e53e2661c7fb53534",
  "doc" : {
    "content" : "This file contains some words.\n",
    "meta" : {
      "raw" : {
        "X-Parsed-By" : "org.apache.tika.parser.DefaultParser",
        "Content-Encoding" : "ISO-8859-1",
        "Content-Type" : "text/plain; charset=ISO-8859-1"
      }
    },
    "file" : {
      "extension" : "txt",
      "content_type" : "text/plain; charset=ISO-8859-1",
      "indexing_date" : "2017-01-04T14:05:10.325",
      "filename" : "test.txt"
    },
    "path" : {
      "virtual" : "test.txt",
      "real" : "test.txt"
    }
  }
}

Uploading a binary document from a 3rd party service

Added in version 2.10.

You can also ask FSCrawler to fetch a document from a 3rd party service and index it into Elasticsearch. FSCrawler currently supports the following services:

  • local: reads a file from the server where FSCrawler is running (a local file)

  • http: reads a file from a URL

  • s3: reads a file from an S3 compatible service

  • ssh: reads a file from an SSH/SFTP server

  • ftp: reads a file from an FTP server

To upload a binary from a 3rd party service, you can call the POST /_document endpoint and pass a JSON document which describes the service settings:

curl -XPOST http://127.0.0.1:8080/_document -H 'Content-Type: application/json' -d '{
  "type": "<TYPE>",
  "<TYPE>": {
    // Settings for the <TYPE>
  }
}'
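The body always repeats the service type as the key of the settings object. The following Python sketch shows that pattern by building (not sending) such a request with the standard library; third_party_request is a hypothetical helper name, and the http settings are taken from the HTTP plugin example on this page.

```python
import json
from urllib.request import Request


def third_party_request(service_type, settings, base_url="http://127.0.0.1:8080"):
    """Build a POST /_document request for a 3rd party service."""
    # The settings object is stored under a key equal to the service type.
    body = {"type": service_type, service_type: settings}
    return Request(
        f"{base_url}/_document",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = third_party_request("http", {"url": "https://www.elastic.co/robots.txt"})
payload = json.loads(request.data)
```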

Local plugin

The local plugin reads a file from the server where FSCrawler is running (a local file). It needs the following parameter:

  • url: link to the local file

For example, we can read the file bar.txt from the /path/to/foo directory with:

curl -XPOST http://127.0.0.1:8080/_document -H 'Content-Type: application/json' -d '{
  "type": "local",
  "local": {
    "url": "/path/to/foo/bar.txt"
  }
}'

Note

For security reasons, the local plugin can only read files which are under the path defined in the job settings file under fs.url.
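A client can run a similar check before sending a request, to avoid calls that would be rejected. The sketch below is only a client-side approximation written for this page: it is not FSCrawler's own check, it assumes both paths are absolute, and it does not resolve symbolic links.

```python
import os.path


def is_under(root, candidate):
    """Return True if candidate is root itself or lies below it."""
    root = os.path.normpath(root)
    candidate = os.path.normpath(candidate)  # collapses "." and ".." segments
    return os.path.commonpath([root, candidate]) == root


inside = is_under("/tmp/es", "/tmp/es/docs/bar.txt")
escaped = is_under("/tmp/es", "/tmp/es/../secret.txt")
sibling = is_under("/tmp/es", "/tmp/esx/bar.txt")
```

Here /tmp/es stands for the fs.url value of the job, as in the sample settings shown earlier on this page.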

HTTP plugin

The http plugin reads a file from a given URL. It needs the following parameter:

  • url: link to the file

For example, we can read the file robots.txt from the https://www.elastic.co/ website with:

curl -XPOST http://127.0.0.1:8080/_document -H 'Content-Type: application/json' -d '{
  "type": "http",
  "http": {
    "url": "https://www.elastic.co/robots.txt"
  }
}'

S3 plugin

The s3 plugin reads a file from an S3 compatible service. It needs the following parameters:

  • url: URL of the S3 service

  • bucket: bucket name

  • object: object to read from the bucket

  • access_key: access key (or login)

  • secret_key: secret key (or password)

For example, we can read the file foo.txt from the bucket foo running on https://s3.amazonaws.com/ with:

curl -XPOST http://127.0.0.1:8080/_document -H 'Content-Type: application/json' -d '{
  "type": "s3",
  "s3": {
    "url": "https://s3.amazonaws.com",
    "bucket": "foo",
    "object": "foo.txt",
    "access_key": "ACCESS",
    "secret_key": "SECRET"
  }
}'

If you are using MinIO, you can use:

curl -XPOST http://127.0.0.1:8080/_document -H 'Content-Type: application/json' -d '{
  "type": "s3",
  "s3": {
    "url": "http://localhost:9000",
    "bucket": "foo",
    "object": "foo.txt",
    "access_key": "minioadmin",
    "secret_key": "minioadmin"
  }
}'

SSH plugin

The ssh plugin reads a file from an SSH/SFTP server. It accepts the following parameters:

  • path (required): path to the file on the remote server

  • hostname (optional): SSH server hostname. If not provided, uses the server.hostname from job settings.

  • port (optional): SSH server port. If not provided, uses the server.port from job settings.

  • username (optional): SSH username. If not provided, uses the server.username from job settings.

  • password (optional): SSH password. If not provided, uses the server.password from job settings.

  • pem_path (optional): path to the PEM key file for key-based authentication. If not provided, uses the server.pem_path from job settings.

For example, we can read the file document.pdf from an SSH server with:

curl -XPOST http://127.0.0.1:8080/_document -H 'Content-Type: application/json' -d '{
  "type": "ssh",
  "ssh": {
    "hostname": "my-ssh-server.example.com",
    "port": 22,
    "username": "myuser",
    "password": "mypassword",
    "path": "/home/myuser/documents/document.pdf"
  }
}'

If you have already configured the SSH server settings in your job _settings.yaml file:

name: "my_job"
server:
  hostname: "my-ssh-server.example.com"
  port: 22
  username: "myuser"
  password: "mypassword"
  protocol: "SSH"

You can simplify the REST call by only providing the path:

curl -XPOST http://127.0.0.1:8080/_document -H 'Content-Type: application/json' -d '{
  "type": "ssh",
  "ssh": {
    "path": "/home/myuser/documents/document.pdf"
  }
}'
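The fallback described above can be pictured as a simple merge in which values from the request take precedence over server.* values from the job settings. The sketch below only illustrates that documented behavior; it is not FSCrawler's code, and effective_ssh_settings is a hypothetical name.

```python
OPTIONAL_KEYS = ("hostname", "port", "username", "password", "pem_path")


def effective_ssh_settings(request_settings, job_server_settings):
    """Combine the request settings with the job's server settings."""
    if "path" not in request_settings:
        raise ValueError("path is required")
    resolved = {"path": request_settings["path"]}
    for key in OPTIONAL_KEYS:
        if key in request_settings:
            resolved[key] = request_settings[key]  # request value wins
        elif key in job_server_settings:
            resolved[key] = job_server_settings[key]  # fall back to job settings
    return resolved


job_server = {
    "hostname": "my-ssh-server.example.com",
    "port": 22,
    "username": "myuser",
    "password": "mypassword",
}
resolved = effective_ssh_settings(
    {"path": "/home/myuser/documents/document.pdf"}, job_server
)
```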

FTP plugin

The ftp plugin reads a file from an FTP server. It accepts the following parameters:

  • path (required): path to the file on the remote server

  • hostname (optional): FTP server hostname. If not provided, uses the server.hostname from job settings.

  • port (optional): FTP server port. If not provided, uses the server.port from job settings.

  • username (optional): FTP username. If not provided, uses the server.username from job settings.

  • password (optional): FTP password. If not provided, uses the server.password from job settings.

For example, we can read the file document.pdf from an FTP server with:

curl -XPOST http://127.0.0.1:8080/_document -H 'Content-Type: application/json' -d '{
  "type": "ftp",
  "ftp": {
    "hostname": "ftp.example.com",
    "port": 21,
    "username": "myuser",
    "password": "mypassword",
    "path": "/documents/document.pdf"
  }
}'

If you have already configured the FTP server settings in your job _settings.yaml file:

name: "my_job"
server:
  hostname: "ftp.example.com"
  port: 21
  username: "myuser"
  password: "mypassword"
  protocol: "FTP"

You can simplify the REST call by only providing the path:

curl -XPOST http://127.0.0.1:8080/_document -H 'Content-Type: application/json' -d '{
  "type": "ftp",
  "ftp": {
    "path": "/documents/document.pdf"
  }
}'

Simulate Upload

If you want to get back the extracted content and its metadata without indexing the document into Elasticsearch, you can use the simulate=true query parameter:

echo "This is my text" > test.txt
curl -F "file=@test.txt" "http://127.0.0.1:8080/_document?debug=true&simulate=true"

Document ID

By default, FSCrawler encodes the filename to generate an id. This means that if you send 2 files with the same filename test.txt, the second one will overwrite the first one because they share the same ID.

You can force any id you wish by adding id=YOUR_ID as a parameter:

echo "This is my text" > test.txt
curl -F "file=@test.txt" "http://127.0.0.1:8080/_document?id=my-test"

You can pass the id parameter within the form data:

echo "This is my text" > test.txt
curl -F "file=@test.txt" -F "id=my-test" "http://127.0.0.1:8080/_document"

There is a special id named _auto_, which tells Elasticsearch to generate the ID automatically. This means that sending the same file twice will result in 2 different indexed documents.

Additional tags

You can add custom tags to a document, for example a projectId or tenantId that you later want to filter on. Put these tags under the external object field. As the JSON below shows, you can also overwrite the content field; the meta, file and path fields can be overwritten as well. To upload a binary with additional tags, you can call the POST /_document endpoint and pass the tags in a tags form field:

{
  "content": "OVERWRITE CONTENT",
  "external": {
    "tenantId": 23,
    "projectId": 34,
    "description": "these are additional tags"
  }
}

echo "This is my text" > test.txt
echo "{\"content\":\"OVERWRITE CONTENT\",\"external\":{\"tenantId\": 23,\"projectId\": 34,\"description\":\"these are additional tags\"}}" > tags.txt
curl -F "file=@test.txt" -F "tags=@tags.txt" "http://127.0.0.1:8080/_document"

The external field does not have to be a flat structure. This is a more advanced example:

{
  "external": {
    "tenantId" : 23,
    "company": "shoe company",
    "projectId": 34,
    "project": "business development",
    "daysOpen": [
      "Mon",
      "Tue",
      "Wed",
      "Thu",
      "Fri"
    ],
    "products": [
      {
        "brand": "nike",
        "size": 41,
        "sub": "Air MAX"
      },
      {
        "brand": "reebok",
        "size": 43,
        "sub": "Pump"
      }
    ]
  }
}

You can use this technique to add, for example, the size of the file you are uploading:

echo "This is my text" > test.txt
curl -F "file=@test.txt" \
  -F "tags={\"file\":{\"filesize\":$(ls -l test.txt | awk '{print $5}')}}" \
  "http://127.0.0.1:8080/_document"

Attention

Only standard FSCrawler fields can be set outside the external field.
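When the tags are produced by a program rather than by hand, building the JSON with a library avoids the shell quoting seen above. The following Python sketch writes a sample file, measures it, and produces a tags document with file.filesize plus an external object; build_tags is a hypothetical helper, and the tag values are the ones used in the examples on this page.

```python
import json
import os
import tempfile


def build_tags(file_path, external=None):
    """Return the JSON text to send in the tags form field."""
    tags = {"file": {"filesize": os.path.getsize(file_path)}}
    if external:
        tags["external"] = external  # custom tags live under "external"
    return json.dumps(tags)


with tempfile.TemporaryDirectory() as workdir:
    sample_path = os.path.join(workdir, "test.txt")
    with open(sample_path, "w", newline="\n") as handle:
        handle.write("This is my text\n")  # same content as the echo command
    tags_json = build_tags(sample_path, {"tenantId": 23, "projectId": 34})
```

The resulting string can be saved to a file and passed with -F "tags=@tags.txt" as in the curl example above.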

Remove a document

Added in version 2.10.

To remove a document, you can call the DELETE /_document endpoint.

If you only know the filename, you can pass it to FSCrawler using the filename parameter:

curl -X DELETE "http://127.0.0.1:8080/_document?filename=test.txt"

It will give you a response similar to:

{
  "ok": true,
  "filename": "test.txt",
  "index": "rest",
  "id": "dd18bf3a8ea2a3e53e2661c7fb53534"
}

If you know the document id, you can pass it to FSCrawler in the URL:

curl -X DELETE "http://127.0.0.1:8080/_document/dd18bf3a8ea2a3e53e2661c7fb53534"

If the document does not exist, you will get the following response:

{
  "ok": false,
  "message": "Can not remove document [rest/test.txt]: Can not remove document rest/dd18bf3a8ea2a3e53e2661c7fb53534 cause: NOT_FOUND",
  "filename": "test.txt",
  "index": "rest",
  "id": "dd18bf3a8ea2a3e53e2661c7fb53534"
}
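The two ways of addressing a document can be wrapped in one helper. This Python sketch builds (but does not send) the corresponding DELETE request; delete_request is a hypothetical name, and requiring exactly one of the two arguments is a choice made for this example.

```python
from urllib.parse import quote, urlencode
from urllib.request import Request

BASE_URL = "http://127.0.0.1:8080"


def delete_request(filename=None, doc_id=None):
    """Build a DELETE request addressed by filename or by document id."""
    if (filename is None) == (doc_id is None):
        raise ValueError("pass exactly one of filename or doc_id")
    if doc_id is not None:
        # The id goes into the URL path.
        url = f"{BASE_URL}/_document/{quote(doc_id, safe='')}"
    else:
        # The filename goes into the query string.
        url = f"{BASE_URL}/_document?{urlencode({'filename': filename})}"
    return Request(url, method="DELETE")


by_name = delete_request(filename="test.txt")
by_id = delete_request(doc_id="dd18bf3a8ea2a3e53e2661c7fb53534")
```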

Specifying an Elasticsearch index

By default, FSCrawler creates documents in the index defined in the _settings.yaml file. However, when using the REST service, you can ask FSCrawler to use a different index by setting the index parameter:

echo "This is my text" > test.txt
curl -F "file=@test.txt" "http://127.0.0.1:8080/_document?index=my-index"
curl -X DELETE "http://127.0.0.1:8080/_document?filename=test.txt&index=my-index"

When uploading, you can pass the index parameter within the form data:

echo "This is my text" > test.txt
curl -F "file=@test.txt" -F "index=my-index" "http://127.0.0.1:8080/_document"
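The query parameters used for uploads on this page (debug, simulate, id and index) can be assembled with a small helper. The sketch below is illustrative only: upload_url is a hypothetical name, and combining id and index in a single call is an assumption, since this page only shows them separately.

```python
from urllib.parse import urlencode

ALLOWED = ("debug", "simulate", "id", "index")


def upload_url(base_url="http://127.0.0.1:8080", **params):
    """Build the POST /_document URL with the given query parameters."""
    unknown = set(params) - set(ALLOWED)
    if unknown:
        raise ValueError(f"unsupported parameters: {sorted(unknown)}")
    query = {}
    for name in ALLOWED:
        value = params.get(name)
        if value is None:
            continue
        # Booleans are written as true/false, as in the curl examples.
        query[name] = str(value).lower() if isinstance(value, bool) else str(value)
    url = f"{base_url}/_document"
    return f"{url}?{urlencode(query)}" if query else url


plain = upload_url()
dry_run = upload_url(debug=True, simulate=True)
targeted = upload_url(id="my-test", index="my-index")
```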

Enabling CORS

To enable Cross-Origin Resource Sharing (CORS), set enable_cors: true under rest in your job settings. Doing so enables the relevant access headers on all REST service resource responses (for example / and /_document).

You can check if CORS is enabled with:

curl -I http://127.0.0.1:8080/

The response headers should contain Access-Control-Allow-* entries such as:

Access-Control-Allow-Origin: *
Access-Control-Allow-Headers: origin, content-type, accept, authorization
Access-Control-Allow-Credentials: true
Access-Control-Allow-Methods: GET, POST, PUT, PATCH, DELETE, OPTIONS, HEAD
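The same check can be scripted. This sketch filters a mapping of response headers for the Access-Control-Allow-* family; it works on a plain dictionary so that it can run without a server, and the sample values are copied from the output above. Both function names are hypothetical.

```python
def cors_headers(response_headers):
    """Return only the Access-Control-Allow-* headers (case-insensitive)."""
    prefix = "access-control-allow-"
    return {
        name: value
        for name, value in response_headers.items()
        if name.lower().startswith(prefix)
    }


def cors_enabled(response_headers):
    """Treat the presence of any such header as a sign that CORS is on."""
    return bool(cors_headers(response_headers))


sample_headers = {
    "Content-Type": "application/json",
    "Access-Control-Allow-Origin": "*",
    "Access-Control-Allow-Credentials": "true",
}
found = cors_headers(sample_headers)
```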

REST settings

Here is a list of REST service settings (under the rest. prefix):

Name              Environment Variable        Default value          Documentation
----------------  --------------------------  ---------------------  ----------------------------------------------------------------------------
rest.url          FSCRAWLER_REST_URL          http://127.0.0.1:8080  Rest Service URL
rest.enable_cors  FSCRAWLER_REST_ENABLE_CORS  false                  Enables or disables Cross-Origin Resource Sharing globally for all resources

Tip

Most Local FS settings (under fs.* in the settings file) also affect the REST service, e.g. fs.indexed_chars. Local FS settings that do not affect the REST service are those such as url, update_rate, includes, excludes.

The REST service runs at http://127.0.0.1:8080/ by default.

You can change this with the rest settings:

name: "test"
rest:
  url: "http://192.168.0.1:8180/my_fscrawler"

If you run more than one instance of FSCrawler on the same machine, you must give each instance a different port, otherwise they will conflict.

Crawler control

Added in version 2.10.

FSCrawler provides REST endpoints to control the crawler, allowing you to pause, resume, and monitor the crawling process. This is particularly useful for large file systems where crawling may take a long time and you want to be able to stop and resume later.

Getting crawler status

To get the current status of the crawler, you can call GET /_crawler/status:

curl http://127.0.0.1:8080/_crawler/status

It will give you a response similar to:

{
  "state" : "RUNNING",
  "scan_id" : "abc123-def456",
  "current_path" : "/data/documents/subfolder",
  "pending_directories" : 42,
  "completed_directories" : 158,
  "files_processed" : 1523,
  "files_deleted" : 12,
  "scan_start_time" : "2024-01-15T10:30:00",
  "scan_end_time" : null,
  "next_check" : null,
  "elapsed_time" : "15m 32s",
  "retry_count" : 0,
  "last_error" : null
}

When a scan is completed, the scan_end_time and next_check fields are populated:

{
  "state" : "COMPLETED",
  "files_processed" : 2500,
  "files_deleted" : 25,
  "scan_start_time" : "2024-01-15T10:30:00",
  "scan_end_time" : "2024-01-15T11:45:00",
  "next_check" : "2024-01-15T12:00:00",
  "elapsed_time" : "1h 15m"
}

The possible states are:

  • RUNNING: The crawler is actively processing files

  • PAUSED: The crawler is between runs or has been explicitly paused (see below)

  • STOPPED: The crawler is not running

  • COMPLETED: The crawler has finished its scan successfully

  • ERROR: The crawler encountered an error and stopped
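A monitoring script might map the state field to an enumeration and derive a rough progress figure from the directory counters. The sketch below assumes the response shape shown above. Note that the ratio is only indicative, because the number of pending directories may presumably grow as the crawler discovers new ones. CrawlerState and describe are hypothetical names.

```python
import json
from enum import Enum


class CrawlerState(Enum):
    RUNNING = "RUNNING"
    PAUSED = "PAUSED"
    STOPPED = "STOPPED"
    COMPLETED = "COMPLETED"
    ERROR = "ERROR"


def describe(status_body):
    """Summarize the JSON returned by GET /_crawler/status."""
    status = json.loads(status_body)
    pending = status.get("pending_directories") or 0
    completed = status.get("completed_directories") or 0
    total = pending + completed
    return {
        "state": CrawlerState(status["state"]),
        # Share of known directories already handled, or None if unknown.
        "directory_progress": completed / total if total else None,
        "files_processed": status.get("files_processed", 0),
    }


sample = (
    '{"state": "RUNNING", "pending_directories": 42, '
    '"completed_directories": 158, "files_processed": 1523}'
)
report = describe(sample)
```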

Behavior between runs

After each crawl run, the crawler enters a pause and waits for the next run. The next run starts when either:

  • You call POST /_crawler/resume (run on demand), or

  • The configured update_rate time has elapsed (automatic run).

This behavior is the same whether you use the REST service or not. So you can trigger a run at any time with resume, or let the crawler run automatically at the scheduled interval.

If you explicitly call POST /_crawler/pause while the crawler is in that “between runs” wait, the crawler will not start the next run when the time elapses; it will only start when you call POST /_crawler/resume. This lets you truly pause and control when the next run happens.
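The rules above can be condensed into a single decision function. This is a simplified model written for this page, not FSCrawler's implementation; the function and parameter names are hypothetical, and times are expressed in seconds for convenience.

```python
def next_run_starts(seconds_since_last_run, update_rate_seconds,
                    explicitly_paused, resume_called):
    """Decide whether the next crawl run should start now."""
    if resume_called:
        return True  # POST /_crawler/resume always triggers a run
    if explicitly_paused:
        return False  # an explicit pause disables the automatic run
    return seconds_since_last_run >= update_rate_seconds


UPDATE_RATE = 15 * 60  # the 15m update_rate from the sample settings

waiting = next_run_starts(60, UPDATE_RATE, explicitly_paused=False, resume_called=False)
automatic = next_run_starts(900, UPDATE_RATE, explicitly_paused=False, resume_called=False)
held = next_run_starts(900, UPDATE_RATE, explicitly_paused=True, resume_called=False)
on_demand = next_run_starts(60, UPDATE_RATE, explicitly_paused=True, resume_called=True)
```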

Pausing the crawler

To pause the crawler, call POST /_crawler/pause:

curl -X POST http://127.0.0.1:8080/_crawler/pause

The crawler will save its current progress (checkpoint) and pause. While explicitly paused, it will not automatically start the next run when update_rate elapses; you must call POST /_crawler/resume to start the next run. You can also safely stop FSCrawler while paused; when you restart FSCrawler, it will resume from where it left off when you call resume.

Success response (200):

{
  "ok" : true,
  "message" : "Crawler paused. Checkpoint saved."
}

If the crawler is already paused, you get 200 with:

{
  "ok" : true,
  "message" : "Crawler is already paused."
}

Error response (400) when the crawler is not running:

{
  "ok" : false,
  "message" : "Crawler is not running"
}

Resuming the crawler

To resume a paused crawler, call POST /_crawler/resume:

curl -X POST http://127.0.0.1:8080/_crawler/resume

Success response (200) when resuming from pause:

{
  "ok" : true,
  "message" : "Crawler resumed."
}

If the crawler is not paused, you get 200 with no action taken:

{
  "ok" : true,
  "message" : "Crawler is not paused. No action needed."
}

Error response (400) when the crawler is closed:

{
  "ok" : false,
  "message" : "Crawler is closed. Cannot resume."
}

Clearing the checkpoint

If you want to force a fresh start and ignore any saved progress, you can clear the checkpoint file. The crawler must be paused or stopped first:

curl -X DELETE http://127.0.0.1:8080/_crawler/checkpoint

Success response (200):

{
  "ok" : true,
  "message" : "Checkpoint cleared"
}

Error response (400) when the crawler is running and not paused:

{
  "ok" : false,
  "message" : "Cannot clear checkpoint while crawler is running. Pause or stop it first."
}

Error response (404) when there is no active crawler (e.g. started with --loop 0):

{
  "ok" : false,
  "message" : "Failed to clear checkpoint as we don't have a checkpoint handler. This probably means there's no active crawler. Did you start with --loop 0?"
}

On I/O error, the server returns 500 with a message starting with Failed to clear checkpoint:.
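For scripts that call this endpoint, the status codes listed above can be translated into short explanations. The mapping below simply restates this section; the names are hypothetical and any other status code is reported as unexpected.

```python
CHECKPOINT_OUTCOMES = {
    200: "checkpoint cleared",
    400: "crawler is running; pause or stop it first",
    404: "no active crawler (for example started with --loop 0)",
    500: "I/O error while clearing the checkpoint",
}


def explain_checkpoint_response(status_code):
    """Return a short explanation for a DELETE /_crawler/checkpoint status."""
    return CHECKPOINT_OUTCOMES.get(status_code, f"unexpected status {status_code}")


cleared = explain_checkpoint_response(200)
busy = explain_checkpoint_response(400)
other = explain_checkpoint_response(418)
```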

Note

You can also clear the checkpoint using the --restart command line option when starting FSCrawler. See CLI options for more details.

Automatic resume after crash

If FSCrawler crashes or is forcefully terminated during a crawl, it will automatically resume from the last saved checkpoint when restarted. The checkpoint is saved periodically (every 100 files by default) and whenever the crawler state changes.

Network error recovery

FSCrawler automatically handles network errors with exponential backoff retry. If a network error occurs, it will:

  1. Save the current checkpoint

  2. Wait with exponential backoff (starting from 1 second, doubling each retry)

  3. Attempt to reconnect

  4. Resume from the failed directory

After 10 consecutive failures, the crawler will stop with an error state. You can then fix the network issue and restart FSCrawler to resume from the checkpoint.
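The retry loop described above can be sketched in a few lines of Python. This is a generic illustration of exponential backoff, not FSCrawler's code. It assumes that the delay is never capped, that network errors surface as OSError, and that the tenth consecutive failure ends the loop without a final wait; checkpointing is left out. All names are hypothetical.

```python
import time


class TooManyFailures(Exception):
    """Raised when the operation keeps failing."""


def run_with_backoff(operation, max_failures=10, initial_delay=1.0, sleep=time.sleep):
    """Call operation until it succeeds, doubling the wait after each failure."""
    failures = 0
    delay = initial_delay
    while True:
        try:
            return operation()
        except OSError as error:  # ConnectionError and timeouts are OSErrors
            failures += 1
            if failures >= max_failures:
                raise TooManyFailures(f"gave up after {failures} failures") from error
            sleep(delay)
            delay *= 2  # 1s, 2s, 4s, ...


# Demo: an operation that fails three times, then succeeds.
attempts = {"count": 0}


def flaky():
    attempts["count"] += 1
    if attempts["count"] <= 3:
        raise ConnectionError("simulated network error")
    return "directory crawled"


waits = []  # record the delays instead of really sleeping
result = run_with_backoff(flaky, sleep=waits.append)
```

Injecting the sleep function keeps the demo instantaneous; with the default time.sleep the same call would wait 7 seconds in total.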