Download FSCrawler
Depending on your Elasticsearch cluster version, you can download FSCrawler 2.9 using the following links:
- fscrawler-es7-2.9 for Elasticsearch V7.
- fscrawler-es6-2.9 for Elasticsearch V6.
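The choice between the two distributions can be sketched as a small shell helper. This is only an illustration: the function name `fscrawler_artifact` is hypothetical, and it covers only the two 2.9 distributions listed above.

```shell
#!/bin/sh
# Print the FSCrawler 2.9 distribution name matching an Elasticsearch version.
# The function name is hypothetical; only the two distributions listed in
# this page are covered.
fscrawler_artifact() {
    es_major="${1%%.*}"   # keep only the part before the first dot
    case "$es_major" in
        7) echo "fscrawler-es7-2.9" ;;
        6) echo "fscrawler-es6-2.9" ;;
        *) echo "no FSCrawler 2.9 distribution listed for Elasticsearch $1" >&2
           return 1 ;;
    esac
}
```

For example, `fscrawler_artifact 7.10.2` prints `fscrawler-es7-2.9`. The cluster version itself can be read from the `version.number` field of the JSON document returned by the root endpoint of your cluster.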
Tip
Version 2.9 is a stable release. Other versions are available from Maven Central:
- fscrawler-es7-* for Elasticsearch V7.
- fscrawler-es6-* for Elasticsearch V6.
You can also download a SNAPSHOT version from Sonatype:
- fscrawler-es7-* for Elasticsearch V7.
- fscrawler-es6-* for Elasticsearch V6.
The distribution contains:
$ tree
.
├── LICENSE
├── NOTICE
├── README.md
├── bin
│   ├── fscrawler
│   └── fscrawler.bat
├── config
│   └── log4j2.xml
└── lib
    ├── ... All needed jars
Using docker
Pull the Docker image:
docker pull dadoonet/fscrawler
Note
This image is large (more than 1.2 GB) because it contains Tesseract and
all the trained language data.
If you do not need OCR at all, you can pull the smaller
dadoonet/fscrawler:noocr image (around 530 MB) instead:
docker pull dadoonet/fscrawler:noocr
Let's say your documents are located in the ~/tmp directory and you want to store your FSCrawler jobs in ~/.fscrawler.
You can run FSCrawler with:
docker run -it --rm -v ~/.fscrawler:/root/.fscrawler -v ~/tmp:/tmp/es:ro dadoonet/fscrawler fscrawler job_name
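If your documents or jobs live elsewhere, the same command can be assembled from parameters. The helper below is a hypothetical sketch (the name `fscrawler_docker_cmd` is not part of FSCrawler); it only prints the command so you can review it before running it, and it assumes the paths contain no spaces.

```shell
#!/bin/sh
# Print the docker run command shown above for a given documents directory,
# jobs directory and job name. Nothing is executed.
# Hypothetical helper; assumes the paths contain no whitespace.
fscrawler_docker_cmd() {
    docs_dir="$1"; jobs_dir="$2"; job="$3"
    # -v jobs_dir:/root/.fscrawler  keeps the job settings on the host
    # -v docs_dir:/tmp/es:ro        exposes the documents read-only
    echo "docker run -it --rm -v $jobs_dir:/root/.fscrawler -v $docs_dir:/tmp/es:ro dadoonet/fscrawler fscrawler $job"
}
```

Calling `fscrawler_docker_cmd ~/tmp ~/.fscrawler job_name` prints the command above with `~` expanded by your shell.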
On the first run, if the job does not exist yet in ~/.fscrawler, FSCrawler will ask you if you want to create it:
10:16:53,880 INFO [f.p.e.c.f.c.BootstrapChecks] Memory [Free/Total=Percent]: HEAP [67.3mb/876.5mb=7.69%], RAM [2.1gb/3.8gb=55.43%], Swap [1023.9mb/1023.9mb=100.0%].
10:16:53,899 WARN [f.p.e.c.f.c.FsCrawlerCli] job [job_name] does not exist
10:16:53,900 INFO [f.p.e.c.f.c.FsCrawlerCli] Do you want to create it (Y/N)?
y
10:16:56,745 INFO [f.p.e.c.f.c.FsCrawlerCli] Settings have been created in [/root/.fscrawler/job_name/_settings.yaml]. Please review and edit before relaunch
Note
The configuration file is actually stored on your machine in ~/.fscrawler/job_name/_settings.yaml.
Remember to change the URL of your Elasticsearch instance: inside the container, the default
127.0.0.1 refers to the container itself, so FSCrawler cannot reach an Elasticsearch instance
running on the host through it. Use the actual IP address of the host instead.
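One way to make that change from the command line is sketched below. It assumes the generated settings file contains the literal address 127.0.0.1; the function name `set_es_host` and the address 192.168.1.10 used in the example are placeholders, not values from FSCrawler.

```shell
#!/bin/sh
# Replace the default 127.0.0.1 address in a job settings file.
# $1: path to _settings.yaml
# $2: address of the host running Elasticsearch (placeholder value below)
# Assumes the file contains the literal string 127.0.0.1.
set_es_host() {
    settings="$1"; host="$2"
    tmp="$settings.tmp"
    # Write to a temporary file first, then move it over the original.
    sed "s/127\.0\.0\.1/$host/g" "$settings" > "$tmp" && mv "$tmp" "$settings"
}
```

For example: `set_es_host ~/.fscrawler/job_name/_settings.yaml 192.168.1.10`. Review the file afterwards, since the substitution touches every occurrence of the address.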
Using docker compose
In this section, the following directory layout is assumed:
.
├── config
│   └── job_name
│       └── _settings.yaml
├── data
│   └── <your files>
├── logs
│   └── <fscrawler logs>
└── docker-compose.yml
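The layout above can be created with a few commands. The helper below is a sketch; the name `init_layout` is hypothetical, and the two files are created as empty placeholders that you still need to fill in.

```shell
#!/bin/sh
# Create the directory layout assumed in this section under a base directory
# (defaults to the current directory). Hypothetical helper.
init_layout() {
    base="${1:-.}"
    mkdir -p "$base/config/job_name" "$base/data" "$base/logs"
    # Create empty placeholders only if the files do not exist yet.
    [ -e "$base/config/job_name/_settings.yaml" ] || : > "$base/config/job_name/_settings.yaml"
    [ -e "$base/docker-compose.yml" ] || : > "$base/docker-compose.yml"
}
```

Run `init_layout .` in the directory where you want to keep your compose project. Existing files are left untouched.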
For example, to connect to a Docker container named elasticsearch, modify your _settings.yaml as follows:
name: "job_name"
elasticsearch:
  nodes:
  - url: "http://elasticsearch:9200"
Then prepare the following docker-compose.yml:
version: '3'
services:
  # Elasticsearch Cluster
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:$ELASTIC_VERSION
    container_name: elasticsearch
    environment:
      - bootstrap.memory_lock=true
      - discovery.type=single-node
    restart: always
    ulimits:
      memlock:
        soft: -1
        hard: -1
    volumes:
      - data:/usr/share/elasticsearch/data
    ports:
      - 9200:9200
    networks:
      - fscrawler_net
  # FSCrawler
  fscrawler:
    image: dadoonet/fscrawler:$FSCRAWLER_VERSION
    container_name: fscrawler
    restart: always
    volumes:
      # Job settings (config/job_name/_settings.yaml in the layout above)
      - ${PWD}/config:/root/.fscrawler
      - ${PWD}/logs:/usr/share/fscrawler/logs
      # Documents to index, mounted read-only
      - ${PWD}/data:/tmp/es:ro
    depends_on:
      - elasticsearch
    # The job name must match the directory name under config/
    command: fscrawler --rest job_name
    networks:
      - fscrawler_net
volumes:
  data:
    driver: local
networks:
  fscrawler_net:
    driver: bridge
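The compose file refers to $ELASTIC_VERSION and $FSCRAWLER_VERSION, which Docker Compose substitutes from the shell environment or from a .env file in the project directory. A possible way to write that file is sketched below; the function name `write_compose_env` is hypothetical, and the version values in the usage example are assumptions that you should check against the image tags actually published.

```shell
#!/bin/sh
# Write a .env file next to docker-compose.yml so that Docker Compose can
# substitute $ELASTIC_VERSION and $FSCRAWLER_VERSION. Hypothetical helper.
# $1: project directory, $2: Elasticsearch image tag, $3: FSCrawler image tag
write_compose_env() {
    dir="$1"; es_version="$2"; fscrawler_version="$3"
    printf 'ELASTIC_VERSION=%s\nFSCRAWLER_VERSION=%s\n' \
        "$es_version" "$fscrawler_version" > "$dir/.env"
}
```

For example, `write_compose_env . 7.17.0 2.9` writes both variables to `./.env`. Both tags are illustrative only.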
Then, you can run Elasticsearch.
docker-compose up -d elasticsearch
docker-compose logs -f elasticsearch
Wait for Elasticsearch to start.
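Instead of watching the logs, the wait can be scripted with a small retry loop. This is a generic sketch: the function name `wait_until` is hypothetical, and probing http://localhost:9200 assumes the port mapping from the compose file above. A successful HTTP response only shows that the node answers requests, not that the cluster is fully healthy.

```shell
#!/bin/sh
# Retry a command once per second until it succeeds or the attempts run out.
# Usage: wait_until <max_attempts> <command> [args...]
# Hypothetical helper; returns 0 on success, 1 if every attempt failed.
wait_until() {
    attempts="$1"; shift
    i=0
    while [ "$i" -lt "$attempts" ]; do
        if "$@" > /dev/null 2>&1; then
            return 0
        fi
        i=$((i + 1))
        sleep 1
    done
    return 1
}
```

For example, `wait_until 60 curl -fsS http://localhost:9200` retries for up to about a minute before giving up.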
After starting Elasticsearch, you can run FSCrawler.
docker-compose up fscrawler
Running as a Service on Windows
Create a fscrawlerRunner.bat file with the following content:
set JAVA_HOME=c:\Program Files\Java\jdk15.0.1
set FS_JAVA_OPTS=-Xmx2g -Xms2g
/Elastic/fscrawler/bin/fscrawler.bat --config_dir /Elastic/fscrawler data >> /Elastic/logs/fscrawler.log 2>&1
Then use fscrawlerRunner.bat to create your Windows service.