Download FSCrawler

Depending on your Elasticsearch cluster version, you can download FSCrawler 2.7 using the following links:

The filename ends with .zip.

Warning

This is a SNAPSHOT version. You can also download a stable version from Maven Central:

The distribution contains:

$ tree
.
├── LICENSE
├── NOTICE
├── README.md
├── bin
│   ├── fscrawler
│   └── fscrawler.bat
└── lib
    ├── ... All needed jars

Running as a Service on Windows

Create a fscrawlerRunner.bat file as:

REM Point to your Java installation and give the JVM 2g of heap
set JAVA_HOME=c:\Program Files\Java\jdk1.8.0_144
set FS_JAVA_OPTS=-Xmx2g -Xms2g
REM Start FSCrawler with its configuration directory, appending output to a log file
/Elastic/fscrawler/bin/fscrawler.bat --config_dir /Elastic/fscrawler data >> /Elastic/logs/fscrawler.log 2>&1

Then use fscrawlerRunner.bat to create your Windows service, for example with a service wrapper as sketched below.
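
The documentation does not prescribe a specific tool for registering the service. As a minimal sketch, assuming a service wrapper such as NSSM is installed and the batch file lives at C:\Elastic\fscrawler\fscrawlerRunner.bat (both assumptions, not part of the FSCrawler docs):

REM Register and start the service via NSSM (assumed paths and service name)
nssm install fscrawler "C:\Elastic\fscrawler\fscrawlerRunner.bat"
nssm start fscrawler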

Upgrade FSCrawler

After a version upgrade, you may need to upgrade a mapping or reindex an entire index before starting FSCrawler. Read the following upgrade instructions carefully.

To update FSCrawler, just download the new version, unzip it in another directory, and launch it as usual. It will still pick up settings from the configuration directory. Of course, you first need to stop any running instances.

Upgrade to 2.2

  • fscrawler comes with new default mappings for files. They have better defaults, consuming less disk space and CPU at index time. You should remove the existing files in ~/.fscrawler/_default/_mappings before starting the new version so the default mappings get updated. If you manually modified the mapping files, reapply your modifications to the new sample files.
  • excludes is now set by default for new jobs to ["~*"]. In previous versions, any file or directory containing a ~ was excluded. This means that if your jobs define any exclusion rules, you need to add *~* to get back the exact previous behavior (see the sketch after this list).
  • If you were indexing json or xml documents with the filename_as_id option set, we previously removed the file name suffix, so 1.json was indexed with _id 1. With this new version, we no longer remove the suffix, so the _id for that document is now 1.json.
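
As a hedged illustration of the excludes point above, a job settings file restoring the pre-2.2 exclusion behavior could look like this (the job name test is an assumption):

{
  "name" : "test",
  "fs": {
    "excludes": [
      "*~*"
    ]
  }
}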

Upgrade to 2.3

  • fscrawler comes with a new mapping for folders. The change is really tiny, so you can skip this step if you wish. We basically removed the unused name field from the folder mapping.
  • The way FSCrawler computes path.virtual for docs has changed: it now includes the filename. Instead of /path/to you will now get /path/to/file.txt.
  • The way FSCrawler computes virtual for folders is now consistent with what you see for files.
  • path.encoded in documents and encoded in folders have been removed, as FSCrawler does not need them after all.
  • OCR integration is now properly activated for PDF documents. This can be time-, CPU- and memory-consuming though. You can explicitly disable it by setting fs.pdf_ocr to false.
  • All dates are now indexed in elasticsearch in UTC instead of without any time zone. For example, we previously indexed a date like 2017-05-19T13:24:47.000, which produced bad results when you were located in a time zone other than UTC. It is now indexed as 2017-05-19T13:24:47.000+0000.
  • In order to be compatible with the upcoming elasticsearch 6.0, we need to get rid of types, as only one type per index is supported. This means we now create two indices named job_name and job_name_folder instead of one index job_name with two types doc and folder. If you are upgrading from FSCrawler 2.2, you need to reindex your existing data, either by deleting the old index and running FSCrawler again, or by using the reindex API as follows:
# Create folder index job_name_folder based on existing folder data
POST _reindex
{
  "source": {
    "index": "job_name",
    "type": "folder"
  },
  "dest": {
    "index": "job_name_folder"
  }
}
# Remove old folder data from job_name index
POST job_name/folder/_delete_by_query
{
  "query": {
    "match_all": {}
  }
}

Note that you first need to create the right settings and mappings so you can then run the reindex job. You can do that by launching bin/fscrawler job_name --loop 0.

Better yet, you can run bin/fscrawler job_name --upgrade and let FSCrawler do all that for you. Note that this can take a long time.

Also be aware that some APIs used by the upgrade action are only available starting with elasticsearch 2.3 (reindex) or elasticsearch 5.0 (delete by query). If you are running a version older than 5.0, you first need to upgrade elasticsearch.

This procedure only applies if you did not previously set the elasticsearch.type setting (its default value was doc). If you did, then you also need to reindex the existing documents to the default _doc type as per elasticsearch 6.x (or doc for the 5.x series):

# Copy data from your old type to the default _doc type
POST _reindex
{
  "source": {
    "index": "job_name",
    "type": "your_type_here"
  },
  "dest": {
    "index": "job_name",
    "type": "_doc"
  }
}
# Remove old type data from job_name index
POST job_name/your_type_here/_delete_by_query
{
  "query": {
    "match_all": {}
  }
}

But note that this last step can take a very long time and will generate a lot of IO on your disk. It might be easier in such a case to restart FSCrawler from scratch.

  • As seen in the previous point, we now have two indices instead of a single one. This means the elasticsearch.index setting has been split into elasticsearch.index and elasticsearch.index_folder. By default, they are set to the crawler name and the crawler name plus _folder, respectively. Note that the upgrade feature performs this change for you.
  • fscrawler has now removed the mapping files doc.json and folder.json. The doc mapping is merged into the _settings.json file, and the folder mapping is now part of _settings_folder.json. This means you can remove the old files to avoid confusion. Simply remove the existing files in ~/.fscrawler/_default before starting the new version so the default files are created again, as sketched below.
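
As a minimal shell sketch of that cleanup, assuming the default configuration directory and a job named job_name:

# Remove the old default files so FSCrawler recreates them on the next start
rm -rf ~/.fscrawler/_default
# Recreate the default files, settings and mappings without crawling
bin/fscrawler job_name --loop 0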

Upgrade to 2.4

  • No specific steps are needed. Just note that the mapping changed as we support more metadata. It might be useful to run steps similar to those for the 2.2 upgrade.

Upgrade to 2.5

  • A bug was causing a lot of data to go over the wire each time FSCrawler ran. To fix this issue, we changed the default mapping and set store: true on the field file.filename (a mapping fragment is sketched at the end of this list). If this field is not stored and remove_deleted is true (the default), FSCrawler will fail while crawling your documents. You need to create the new mapping accordingly and reindex your existing data, either by deleting the old index and running FSCrawler again, or by using the reindex API as follows:

    # Backup old index data
    POST _reindex
    {
      "source": {
        "index": "job_name"
      },
      "dest": {
        "index": "job_name_backup"
      }
    }
    # Remove job_name index
    DELETE job_name
    

    Restart FSCrawler with the following command. It will just create the right mapping again.

    $ bin/fscrawler job_name --loop 0
    

    Then restore old data:

    POST _reindex
    {
      "source": {
        "index": "job_name_backup"
      },
      "dest": {
        "index": "job_name"
      }
    }
    # Remove backup index
    DELETE job_name_backup
    

    The default FSCrawler mapping changed for the meta.raw.* fields. It might be better to reindex your data.

  • The excludes parameter is now also used for directory names. But this new implementation brings a breaking change if you were using excludes previously: in the previous implementation, the regular expression was only applied to the filename; it is now applied to the full virtual path name.

    For example if you have a /tmp dir as follows:

    /tmp
    └── folder
        ├── foo.txt
        └── bar.txt
    

    Previously, excluding foo.txt excluded the virtual file /folder/foo.txt. If you still want to exclude any file named foo.txt whatever its directory, you now need to specify */foo.txt:

    {
      "name" : "test",
      "fs": {
        "excludes": [
          "*/foo.txt"
        ]
      }
    }
    

    For more information, read Includes and excludes.

  • For new indices, FSCrawler now uses _doc as the default type name for clusters running elasticsearch 6.x or later.
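
As a hedged fragment of the mapping change described in the first point of this list (only store: true on file.filename is stated in the text; the keyword type is an assumption):

{
  "properties": {
    "file": {
      "properties": {
        "filename": {
          "type": "keyword",
          "store": true
        }
      }
    }
  }
}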

Upgrade to 2.6

  • FSCrawler now comes with multiple distributions, depending on the elasticsearch cluster version you are targeting.
  • elasticsearch.nodes settings using host, port or scheme have been replaced by an easier notation using a url setting like http://127.0.0.1:9200. You will need to modify your existing settings and use the new notation if warned; see the sketch after this list.
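
As a hedged before/after sketch of that settings change (the node address is only an example):

# Old notation (before 2.6)
"elasticsearch" : {
  "nodes" : [
    { "host" : "127.0.0.1", "port" : 9200, "scheme" : "HTTP" }
  ]
}
# New notation (2.6 and later)
"elasticsearch" : {
  "nodes" : [
    { "url" : "http://127.0.0.1:9200" }
  ]
}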

Upgrade to 2.7

  • FSCrawler now comes with an elasticsearch 7.x implementation.
  • FSCrawler also supports the YAML format for jobs (it is now the default).
  • The elasticsearch 6.x implementation does not support elasticsearch versions prior to 6.7. If you are using an older version, it's better to upgrade; otherwise you need to "hack" the distribution and replace all elasticsearch/lucene jars with the 6.6 versions.
  • FSCrawler no longer follows symbolic links. You need to explicitly set fs.follow_symlink to true if you wish to revert to the previous behavior (see the sketch below).
  • The mapping for elasticsearch 6.x can no longer contain the type name.
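
As a hedged example combining the YAML job format with the symlink setting above (the job name test is an assumption):

name: "test"
fs:
  # Assumption: restore the pre-2.7 behavior of following symbolic links
  follow_symlink: true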