REST service¶
New in version 2.2.
FSCrawler can expose a REST service running at http://127.0.0.1:8080/fscrawler.
To activate it, launch FSCrawler with the --rest option.
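For example, assuming a job named job_name (replace it with your own job name), the service can be started with:

```shell
# Start the crawler with the REST service enabled
bin/fscrawler job_name --rest
```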
FSCrawler status¶
To get an overview of the running service, you can call the GET / endpoint:
curl http://127.0.0.1:8080/fscrawler/
It will give you a response similar to:
{
"ok" : true,
"version" : "2.2",
"elasticsearch" : "5.1.1",
"settings" : {
"name" : "fscrawler-rest-tests",
"fs" : {
"url" : "/tmp/es",
"update_rate" : "15m",
"json_support" : false,
"filename_as_id" : false,
"add_filesize" : true,
"remove_deleted" : true,
"store_source" : false,
"index_content" : true,
"attributes_support" : false,
"raw_metadata" : true,
"xml_support" : false,
"index_folders" : true,
"lang_detect" : false
},
"elasticsearch" : {
"nodes" : [ {
"url" : "http://127.0.0.1:9200"
} ],
"index" : "fscrawler-rest-tests_doc",
"index_folder" : "fscrawler-rest-tests_folder",
"bulk_size" : 100,
"flush_interval" : "5s",
"byte_size" : "10mb",
"username" : "elastic"
},
"rest" : {
"url" : "http://127.0.0.1:8080/fscrawler",
"enable_cors": false
}
}
}
Uploading a binary document¶
To upload a binary, you can call the POST /_upload endpoint:
echo "This is my text" > test.txt
curl -F "file=@test.txt" "http://127.0.0.1:8080/fscrawler/_upload"
It will give you a response similar to:
{
"ok" : true,
"filename" : "test.txt",
"url" : "http://127.0.0.1:9200/fscrawler-rest-tests_doc/doc/dd18bf3a8ea2a3e53e2661c7fb53534"
}
The url represents the elasticsearch address of the indexed document. If you call:
curl http://127.0.0.1:9200/fscrawler-rest-tests_doc/doc/dd18bf3a8ea2a3e53e2661c7fb53534?pretty
You will get back your document as it has been stored by elasticsearch:
{
"_index" : "fscrawler-rest-tests_doc",
"_type" : "_doc",
"_id" : "dd18bf3a8ea2a3e53e2661c7fb53534",
"_version" : 1,
"found" : true,
"_source" : {
"content" : "This is my text\n",
"meta" : {
"raw" : {
"X-Parsed-By" : "org.apache.tika.parser.DefaultParser",
"Content-Encoding" : "ISO-8859-1",
"Content-Type" : "text/plain; charset=ISO-8859-1"
}
},
"file" : {
"extension" : "txt",
"content_type" : "text/plain; charset=ISO-8859-1",
"indexing_date" : "2017-01-04T21:01:08.043",
"filename" : "test.txt"
},
"path" : {
"virtual" : "test.txt",
"real" : "test.txt"
}
}
}
If you started FSCrawler in debug mode with --debug, or if you pass the debug=true query parameter, the response will be much more complete:
echo "This is my text" > test.txt
curl -F "file=@test.txt" "http://127.0.0.1:8080/fscrawler/_upload?debug=true"
It will give you a response similar to:
{
"ok" : true,
"filename" : "test.txt",
"url" : "http://127.0.0.1:9200/fscrawler-rest-tests_doc/doc/dd18bf3a8ea2a3e53e2661c7fb53534",
"doc" : {
"content" : "This is my text\n",
"meta" : {
"raw" : {
"X-Parsed-By" : "org.apache.tika.parser.DefaultParser",
"Content-Encoding" : "ISO-8859-1",
"Content-Type" : "text/plain; charset=ISO-8859-1"
}
},
"file" : {
"extension" : "txt",
"content_type" : "text/plain; charset=ISO-8859-1",
"indexing_date" : "2017-01-04T14:05:10.325",
"filename" : "test.txt"
},
"path" : {
"virtual" : "test.txt",
"real" : "test.txt"
}
}
}
Simulate Upload¶
If you want to get back the extracted content and its metadata without indexing into elasticsearch, you can use the simulate=true query parameter:
echo "This is my text" > test.txt
curl -F "file=@test.txt" "http://127.0.0.1:8080/fscrawler/_upload?debug=true&simulate=true"
Document ID¶
By default, FSCrawler encodes the filename to generate an id, which means that if you send 2 files with the same filename test.txt, the second one will overwrite the first one because they will both share the same ID.
You can force any id you wish by adding id=YOUR_ID in the form data:
echo "This is my text" > test.txt
curl -F "file=@test.txt" -F "id=my-test" "http://127.0.0.1:8080/fscrawler/_upload"
There is a special id named _auto_, for which the ID will be autogenerated by elasticsearch. It means that sending the same file twice will result in 2 different documents being indexed.
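For example, uploading the same file twice with id=_auto_ indexes two separate documents (a sketch, assuming the service runs at the default URL):

```shell
echo "This is my text" > test.txt
# Each call below gets its own elasticsearch-generated ID,
# so two documents end up in the index
curl -F "file=@test.txt" -F "id=_auto_" "http://127.0.0.1:8080/fscrawler/_upload"
curl -F "file=@test.txt" -F "id=_auto_" "http://127.0.0.1:8080/fscrawler/_upload"
```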
Additional tags¶
You can add custom tags to the document, in case you want to filter on them (for example projectId or tenantId). These tags can be assigned to an external object field. As you can see in the json, you are also able to overwrite the content field. The meta, file and path fields can be overwritten as well.
To upload a binary with additional tags, you can call the POST /_upload endpoint:
{
"content": "OVERWRITE CONTENT",
"external": {
"tenantId": 23,
"projectId": 34,
"description": "these are additional tags"
}
}
echo "This is my text" > test.txt
echo "{\"content\":\"OVERWRITE CONTENT\",\"external\":{\"tenantId\": 23,\"projectId\": 34,\"description\":\"these are additional tags\"}}" > tags.txt
curl -F "file=@test.txt" -F "tags=@tags.txt" "http://127.0.0.1:8080/fscrawler/_upload"
The external field does not have to be a flat structure. Here is a more advanced example:
{
"external": {
"tenantId" : 23,
"company": "shoe company",
"projectId": 34,
"project": "business development",
"daysOpen": [
"Mon",
"Tue",
"Wed",
"Thu",
"Fri"
],
"products": [
{
"brand": "nike",
"size": 41,
"sub": "Air MAX"
},
{
"brand": "reebok",
"size": 43,
"sub": "Pump"
}
]
}
}
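Uploading such a nested structure works the same way as the flat example. Here is a sketch that writes the JSON to a file first (the field values are the ones from the example above, shortened for brevity):

```shell
echo "This is my text" > test.txt
# Write the nested tags structure to a file
cat > tags.txt << 'EOF'
{
  "external": {
    "tenantId": 23,
    "company": "shoe company",
    "projectId": 34,
    "project": "business development"
  }
}
EOF
# Upload the file together with its tags
curl -F "file=@test.txt" -F "tags=@tags.txt" "http://127.0.0.1:8080/fscrawler/_upload"
```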
Attention
Only standard FSCrawler fields can be set outside the external field.
Specifying an elasticsearch index¶
By default, FSCrawler creates documents in the index defined in the _settings.yaml file.
However, using the REST service, it is possible to have FSCrawler use a different index, by adding index=YOUR_INDEX in the form data:
echo "This is my text" > test.txt
curl -F "file=@test.txt" -F "index=my-index" "http://127.0.0.1:8080/fscrawler/_upload"
Enabling CORS¶
To enable Cross-Origin Resource Sharing, you will need to set enable_cors: true under rest in your job settings. Doing so will enable the relevant access headers on all REST service resource responses (for example /fscrawler and /fscrawler/_upload).
You can check if CORS is enabled with:
curl -I http://127.0.0.1:8080/fscrawler/
The response header should contain Access-Control-Allow-*
parameters like:
Access-Control-Allow-Origin: *
Access-Control-Allow-Headers: origin, content-type, accept, authorization
Access-Control-Allow-Credentials: true
Access-Control-Allow-Methods: GET, POST, PUT, PATCH, DELETE, OPTIONS, HEAD
REST settings¶
Here is a list of REST service settings (under the rest. prefix):

Name | Default value | Documentation
---|---|---
rest.url | http://127.0.0.1:8080/fscrawler | REST service URL
rest.enable_cors | false | Enable or disable Cross-Origin Resource Sharing globally for all resources
Tip
Most Local FS settings (under fs.* in the settings file) also affect the REST service, e.g. fs.indexed_chars. Local FS settings that do not affect the REST service are url, update_rate, includes and excludes.
The REST service runs at http://127.0.0.1:8080/fscrawler by default.
You can change it using the rest settings:
name: "test"
rest:
url: "http://192.168.0.1:8180/my_fscrawler"
It also means that if you are running more than one instance of FSCrawler locally, you must change the port, as the default port would otherwise conflict.
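For instance, a second local instance could be given its own port in its job settings (a sketch; the job name and port here are arbitrary):

```yaml
name: "test2"
rest:
  url: "http://127.0.0.1:8280/my_fscrawler2"
```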