Version 2.10
Breaking changes
If you want to exclude a specific folder, you need to add a wildcard character at the end of the folder name. For example, to exclude the folder /tmp/foo, you need to use /tmp/foo/*. Thanks to dadoonet.
The way Docker images are run has changed. You no longer need to specify the fscrawler binary, so running
docker run -it -v ~/.fscrawler:/root/.fscrawler -v /documents:/tmp/es:ro dadoonet/fscrawler job_name
is enough. Thanks to dadoonet.
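To illustrate why the trailing wildcard matters for folder exclusion, here is a minimal sketch. It uses Python's fnmatch module as a stand-in, which is an assumption made for illustration only and not FSCrawler's actual matcher; the helper name is_excluded is hypothetical.

```python
from fnmatch import fnmatchcase


def is_excluded(path, patterns):
    """Return True if the path matches any of the exclusion patterns.

    Hypothetical helper: glob matching via fnmatch stands in for
    whatever matching FSCrawler performs internally.
    """
    return any(fnmatchcase(path, pattern) for pattern in patterns)


# The bare folder name does not match files inside the folder...
without_wildcard = is_excluded("/tmp/foo/bar.txt", ["/tmp/foo"])
# ...while the pattern with a trailing wildcard does.
with_wildcard = is_excluded("/tmp/foo/bar.txt", ["/tmp/foo/*"])
```

Under this simplified model, a pattern without the wildcard only matches the folder path itself, so files below it would still be crawled.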
New
Add support for automatic semantic search when using Elasticsearch 8.17+ with a trial or enterprise license. See Semantic search. Warning: this might slow down the ingestion process. Thanks to dadoonet.
Add support for Elastic Cloud Serverless. Thanks to dadoonet.
Using the REST API _document endpoint, you can now fetch a document from the local dir, from an HTTP website or from an S3 bucket. See REST service. Thanks to dadoonet.
You can now remove a document in Elasticsearch using the FSCrawler _document endpoint. See REST service. Thanks to dadoonet.
Implement our own HTTP Client for Elasticsearch. Thanks to dadoonet.
Add an option to set the path to a custom Tika config file. See Local FS settings. Thanks to iadcode.
Support for Index Templates. See Mappings. Thanks to dadoonet.
Support for Aliases. You can now index to an alias. Thanks to dadoonet.
Support for Access Token and API Keys instead of Basic Authentication. See Using Credentials (Security). Thanks to dadoonet.
Allow loading external jars. This adds a new external directory from where jars can be loaded to the FSCrawler JVM. For example, you could provide your own Custom Tika Parser code. See Directory layout. Thanks to dadoonet.
Add temporal information in folder index. Thanks to bdauvissat.
Add support for external metadata files while crawling, defaults to .meta.yml. See External Tags. Thanks to dadoonet.
Fix
fs.ocr.enabled was always false. Thanks to ywjung.
Do not hide YAML parsing errors. Thanks to dadoonet.
Fix duration parsing for the day unit d. Thanks to dadoonet.
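As an illustration of what parsing a duration with the day unit d involves, here is a minimal sketch. The accepted syntax (an integer followed by ms, s, m, h or d) and the function name parse_duration are assumptions made for this example; this is not FSCrawler's implementation.

```python
import re
from datetime import timedelta

# Assumed unit suffixes for this sketch, mapped to timedelta keyword arguments.
_UNITS = {
    "ms": "milliseconds",
    "s": "seconds",
    "m": "minutes",
    "h": "hours",
    "d": "days",
}
_PATTERN = re.compile(r"(\d+)(ms|s|m|h|d)")


def parse_duration(text):
    """Parse strings such as '15m' or '2d' into a timedelta (hypothetical helper)."""
    match = _PATTERN.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"Unsupported duration: {text!r}")
    value, unit = match.groups()
    return timedelta(**{_UNITS[unit]: int(value)})


two_days = parse_duration("2d")
```

A parser that omits the d entry from its unit table would reject or misread values such as 2d, which is the kind of problem this fix addresses.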
Deprecated
The _upload REST endpoint has been deprecated. Please use the _document endpoint instead. Thanks to dadoonet.
Support for Elasticsearch 6.x is deprecated. Thanks to dadoonet.
Support for Basic Authentication is deprecated. You should use API keys instead. Thanks to dadoonet.
Updated
Files are now sorted by date in reverse order, so the most recent files should be indexed first. Thanks to dadoonet.
Add full support for Elasticsearch 8.17.3, Elasticsearch 7.17.23 and Elasticsearch 6.8.23. Thanks to dadoonet.
Update to Tika 2.9.3. Thanks to dadoonet.
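The new file ordering (by date, most recent first) can be sketched as follows. The file names, dates and the newest_first helper are made up for illustration, and the sketch assumes the date in question is the modification date; it is not FSCrawler's code.

```python
from datetime import datetime


def newest_first(files):
    """Return (name, modified) pairs sorted from most recent to oldest."""
    return sorted(files, key=lambda item: item[1], reverse=True)


# Hypothetical directory listing with modification dates.
files = [
    ("report.pdf", datetime(2024, 1, 5)),
    ("notes.txt", datetime(2025, 3, 1)),
    ("old.doc", datetime(2019, 7, 20)),
]
ordered = [name for name, _ in newest_first(files)]
```

With this ordering, recently changed documents become searchable before older ones during a long crawl.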
Removed
Remove the specific distributions depending on Elastic version. Thanks to dadoonet.
Thanks to @dadoonet, @ywjung, @iadcode, @bdauvissat for this release!