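ElasticSearch environment setup
Omitted.
Chinese word segmentation
Install the plugin: https://github.com/medcl/elasticsearch-analysis-ik
Testing the analyzers:
ik_max_word splits the text at the finest granularity;
ik_smart splits the text at the coarsest granularity.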
http:
    {
      "analyzer": "ik_max_word",
      "text": "绝地求生是最好玩的游戏"
    }
and
    {
      "analyzer": "ik_smart",
      "text": "绝地求生是最好玩的游戏"
    }
and
    {
      "analyzer": "standard",
      "text": "绝地求生是最好玩的游戏"
    }
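The three request bodies above differ only in the analyzer name, so they can be generated rather than typed by hand. A minimal Python sketch, assuming the bodies are POSTed to an `_analyze` endpoint; the helper name `analyze_body` is hypothetical and nothing is sent over the network here:

```python
import json

# Analyzer names taken from the text above; "standard" serves as the baseline.
ANALYZERS = ["ik_max_word", "ik_smart", "standard"]


def analyze_body(analyzer: str, text: str) -> str:
    """Serialize one _analyze request body, keeping Chinese characters readable."""
    return json.dumps({"analyzer": analyzer, "text": text}, ensure_ascii=False)


if __name__ == "__main__":
    for name in ANALYZERS:
        print(analyze_body(name, "绝地求生是最好玩的游戏"))
```

Comparing the token lists returned for the three bodies is a quick way to see how much finer ik_max_word splits than ik_smart.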
Creating the index
http://192.168.10.74:9200/ik-index PUT
Use the ik_max_word analyzer:
    {
      "settings": {
        "analysis": {
          "analyzer": {
            "ik": {
              "tokenizer": "ik_max_word"
            }
          }
        }
      },
      "mappings": {
        "article": {
          "dynamic": true,
          "properties": {
            "subject": {
              "type": "string",
              "analyzer": "ik_max_word"
            },
            "content": {
              "type": "string",
              "analyzer": "ik_max_word"
            }
          }
        }
      }
    }
Adding data
Omitted.
Querying:
http://192.168.10.74:9200/index/_search POST
    {
      "query": {
        "match": {
          "subject": "合肥送餐冲突"
        }
      },
      "highlight": {
        "pre_tags": ["<span style = 'color:red'>"],
        "post_tags": ["</span>"],
        "fields": {"subject": {}}
      }
    }
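The query above pairs a `match` clause with highlighting on the same field. A minimal Python sketch that builds this body for any field and search text; the function name `highlight_query` is an assumption of this example, and the request itself is not sent:

```python
import json


def highlight_query(field: str, keywords: str) -> dict:
    """Build a match query on `field` with red <span> highlighting on that field."""
    return {
        "query": {"match": {field: keywords}},
        "highlight": {
            "pre_tags": ["<span style = 'color:red'>"],
            "post_tags": ["</span>"],
            "fields": {field: {}},
        },
    }


if __name__ == "__main__":
    body = highlight_query("subject", "合肥送餐冲突")
    print(json.dumps(body, ensure_ascii=False, indent=2))
```

Keeping the highlighted field and the queried field in one parameter avoids the easy mistake of highlighting a field the query never touched.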
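Hot-updating the dictionary
In IKAnalyzer.cfg.xml, set the remote extension dictionary (the remote_ext_dict entry described in the IK plugin README) to a URL such as:
http://localhost/hotload.dic
Place the dictionary file on a static resource server so that it is reachable at that URL.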
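According to the IK plugin README, the plugin polls the remote dictionary URL and reloads it when the `Last-Modified` or `ETag` response header changes; the file itself is UTF-8 text with one word per line. A minimal Python sketch of the body and headers a server would need to return. The function names and the SHA-256 based ETag are assumptions of this example, not part of the plugin:

```python
import hashlib
from email.utils import formatdate
from typing import Dict, List


def dictionary_body(words: List[str]) -> bytes:
    """Encode the hot-load dictionary: UTF-8, one word per line."""
    return ("\n".join(words) + "\n").encode("utf-8")


def dictionary_headers(body: bytes, modified_epoch: float) -> Dict[str, str]:
    """Headers for the dictionary response.

    The ETag is derived from the content, so it changes whenever a word is
    added or removed (a choice made for this sketch; any changing value works).
    """
    return {
        "Content-Type": "text/plain; charset=utf-8",
        "Last-Modified": formatdate(modified_epoch, usegmt=True),
        "ETag": '"' + hashlib.sha256(body).hexdigest() + '"',
    }
```

A static server such as nginx normally sets both headers on its own when the file is replaced, so code like this is only needed when the dictionary is served by an application.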
Synonym configuration
http://192.168.10.74:9200/synonyms-ik-index PUT
    {
      "settings": {
        "analysis": {
          "analyzer": {
            "by_smart": {
              "type": "custom",
              "tokenizer": "ik_smart",
              "filter": [
                "by_tfr",
                "by_sfr"
              ],
              "char_filter": [
                "by_cfr"
              ]
            },
            "by_max_word": {
              "type": "custom",
              "tokenizer": "ik_max_word",
              "filter": [
                "by_tfr",
                "by_sfr"
              ],
              "char_filter": [
                "by_cfr"
              ]
            }
          },
          "filter": {
            "by_tfr": {
              "type": "stop",
              "stopwords": [
                " "
              ]
            },
            "by_sfr": {
              "type": "synonym",
              "synonyms_path": "synonyms.dic"
            }
          },
          "char_filter": {
            "by_cfr": {
              "type": "mapping",
              "mappings": [
                "| => |"
              ]
            }
          }
        }
      },
      "mappings": {
        "article": {
          "dynamic": true,
          "properties": {
            "subject": {
              "type": "string",
              "analyzer": "by_max_word",
              "search_analyzer": "by_smart"
            },
            "content": {
              "type": "string",
              "analyzer": "by_max_word",
              "search_analyzer": "by_smart"
            }
          }
        }
      }
    }
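The `synonyms_path` setting above points at a file whose contents the post does not show. The synonym token filter accepts the Solr format, in which a comma-separated line declares equivalent terms and `a => b` declares a one-way mapping. A minimal Python sketch that renders such lines, assuming purely for illustration that 吃鸡 and 绝地求生 are meant to be equivalent; the helper names are hypothetical:

```python
from typing import List


def equivalent(*terms: str) -> str:
    """Solr-format line declaring that all terms are interchangeable."""
    return ",".join(terms)


def one_way(sources: List[str], target: str) -> str:
    """Solr-format line rewriting each source term to the target term."""
    return ",".join(sources) + " => " + target


def render_synonyms(lines: List[str]) -> str:
    """Join rule lines into the text of a synonyms file (write it as UTF-8)."""
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Illustrative rule only; the real synonyms.dic is not shown in the post.
    print(render_synonyms([equivalent("吃鸡", "绝地求生")]), end="")
```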
Testing synonyms
http://192.168.10.74:9200/synonyms-ik-index/_analyze POST
    {
      "analyzer": "by_smart",
      "text": "绝地求生是最好玩的游戏"
    }
- Querying by synonym
http://192.168.10.74:9200/synonyms-ik-index/_search POST
    {
      "query": {
        "match": {
          "subject": "吃鸡"
        }
      },
      "highlight": {
        "pre_tags": [
          "<span style = 'color:red'>"
        ],
        "post_tags": [
          "</span>"
        ],
        "fields": {
          "subject": {}
        }
      }
    }
Data import/export: elasticdump
GitHub: https://github.com/taskrabbit/elasticsearch-dump
File search
Documentation: https://www.elastic.co/guide/en/elasticsearch/plugins/5.3/using-ingest-attachment.html
Install the plugin
./bin/elasticsearch-plugin install ingest-attachment
Create the single_attachment pipeline
http://192.168.10.74:9200/_ingest/pipeline/single_attachment PUT
    {
      "description": "Extract single attachment information",
      "processors": [
        {
          "attachment": {
            "field": "data",
            "indexed_chars": -1,
            "ignore_missing": true
          }
        },
        {
          "remove": {
            "field": "data"
          }
        }
      ]
    }
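field: the field that holds the attachment content. The content must be base64-encoded (base64 is an encoding, not encryption).
target_field: the field that will hold the extracted attachment information (author, date, type).
indexed_chars: the maximum number of characters extracted from the file; the default is 100000. Set it to -1 for no limit (note that with -1, a very large file can exhaust memory, and the upload may then be processed incompletely).
indexed_chars_field: a field in the document whose value overrides indexed_chars, so the limit can be set per document, for example according to file size.
properties: the attachment properties to store. Allowed values: content, title, name, author, keywords, date, content_type, content_length, language.
ignore_missing: defaults to false. If set to true and the field named by field does not exist, the attachment is not parsed and the document is kept unchanged.
The pipeline above additionally has a remove processor, which deletes the base64 content of the data field once the attachment has been extracted.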
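The options described above map directly onto keys of the attachment processor. A minimal Python sketch that assembles a pipeline body from them; the function `attachment_pipeline` and its `drop_source` flag are names invented for this example, and the body would still have to be PUT to `_ingest/pipeline/<name>`:

```python
import json
from typing import List, Optional


def attachment_pipeline(
    field: str = "data",
    indexed_chars: int = 100000,
    properties: Optional[List[str]] = None,
    ignore_missing: bool = False,
    drop_source: bool = True,
) -> dict:
    """Build an ingest pipeline body around the attachment processor."""
    attachment = {
        "field": field,
        "indexed_chars": indexed_chars,
        "ignore_missing": ignore_missing,
    }
    if properties is not None:
        # Store only the listed properties, e.g. ["content", "content_type"].
        attachment["properties"] = properties
    processors = [{"attachment": attachment}]
    if drop_source:
        # Remove the base64 payload once its content has been extracted.
        processors.append({"remove": {"field": field}})
    return {
        "description": "Extract single attachment information",
        "processors": processors,
    }


if __name__ == "__main__":
    print(json.dumps(attachment_pipeline(indexed_chars=-1, ignore_missing=True), indent=2))
```

Called with `indexed_chars=-1` and `ignore_missing=True`, it produces the same processors as the single_attachment pipeline shown earlier.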
Pipeline for multiple attachments
    {
      "description": "Pipeline for multiple attachments",
      "processors": [
        {
          "foreach": {
            "field": "attachments",
            "processor": {
              "attachment": {
                "field": "data",
                "indexed_chars": -1,
                "ignore_missing": true
              }
            }
          }
        }
      ]
    }
- Deleting the pipeline
http://192.168.10.74:9200/_ingest/pipeline/single_attachment DELETE
- Creating the index
http://192.168.10.74:9200/file_attachment/ PUT
    {
      "settings": {
        "analysis": {
          "analyzer": {
            "ik": {
              "tokenizer": "ik_max_word"
            }
          }
        }
      },
      "mappings": {
        "attachment": {
          "properties": {
            "filename": {
              "type": "text",
              "analyzer": "ik_max_word"
            },
            "data": {
              "type": "text"
            },
            "time": {
              "type": "string"
            },
            "attachment.content": {
              "type": "text",
              "analyzer": "ik_max_word"
            }
          }
        }
      }
    }
- Adding data
http://192.168.10.74:9200/file_attachment/attachment/1?pipeline=single_attachment&refresh=true&pretty=1/ POST
    {
      "filename": "测试文档.txt",
      "time": "2018-06-13 15:14:00",
      "data": "6L+Z5piv56ys5LiA5Liq55So5LqO5rWL6K+V5paH5pys6ZmE5Lu255qE5YaF5a6577yb5paH5Lu25qC85byP5Li6dHh0LOaWh+acrOS4uuS4reaWhw=="
    }
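The `data` value in the request above is the file content encoded as base64. A minimal Python sketch of producing such a document body from raw bytes; the function name and the sample file are made up for this example:

```python
import base64
import json


def attachment_document(filename: str, time: str, content: bytes) -> dict:
    """Build the document body; `data` carries the file bytes as base64 text."""
    return {
        "filename": filename,
        "time": time,
        "data": base64.b64encode(content).decode("ascii"),
    }


if __name__ == "__main__":
    # Sample content invented for this sketch; in practice read the file in "rb" mode.
    doc = attachment_document("example.txt", "2018-06-13 15:14:00", "hello attachment".encode("utf-8"))
    print(json.dumps(doc, ensure_ascii=False, indent=2))
```

Reading the file in binary mode matters: encoding text that was decoded with the wrong charset would index garbled content.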
- Querying documents
http://192.168.10.74:9200/file_attachment/_search POST
    {
      "query": {
        "match": {
          "attachment.content": "测试"
        }
      },
      "highlight": {
        "pre_tags": [
          "<span style = 'color:red'>"
        ],
        "post_tags": [
          "</span>"
        ],
        "fields": {
          "attachment.content": {}
        }
      }
    }
Note: if an nginx static resource directory is used to store the files, and txt, html, pdf and similar files should be downloaded directly instead of being opened in the browser, the following can be added to the nginx configuration:
    server {
        listen       80;
        server_name  localhost;

        location / {
            root   html;
            if ($request_filename ~* ^.*?\.(txt|doc|pdf|rar|gz|zip|docx|exe|xlsx|ppt|pptx|jpg|png|html|xml)$) {
                add_header Content-Disposition attachment;
                add_header Content-Type 'APPLICATION/OCTET-STREAM';
            }
            index  index.html index.htm;
        }

        error_page   500 502 503 504  /50x.html;
        location = /50x.html {
            root   html;
        }
    }
The key part is:
    if ($request_filename ~* ^.*?\.(txt|doc|pdf|rar|gz|zip|docx|exe|xlsx|ppt|pptx|jpg|png|html|xml)$) {
        add_header Content-Disposition attachment;
        add_header Content-Type 'APPLICATION/OCTET-STREAM';
    }
Alternatively, it can be handled like this:
    if ($args ~ "target=download") {
        add_header Content-Disposition 'attachment';
        add_header Content-Type 'APPLICATION/OCTET-STREAM';
    }
With this variant, adding the target=download parameter to a GET request is enough to trigger a download.
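On the client side, the second variant only requires appending the parameter to the file URL. A minimal Python sketch, where the function name `download_url` and the example URLs are assumptions of this example:

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def download_url(url: str) -> str:
    """Return `url` with target=download set in its query string."""
    parts = urlsplit(url)
    # Keep existing parameters, dropping any previous "target" value.
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True) if k != "target"]
    query.append(("target", "download"))
    return urlunsplit(parts._replace(query=urlencode(query)))


if __name__ == "__main__":
    print(download_url("http://localhost/files/report.pdf"))
```

The same file can then be linked twice: once plainly for in-browser preview and once through `download_url` for a forced download.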
Office suite research
Setting up the OpenOffice service
Installation steps
Download the rpm package from the official site: https://www.openoffice.org/download/
Extract it, enter /zh-CN/RPMS/, and install the rpm packages: rpm -ivh *.rpm
After installation, a desktop-integration directory is created. Enter it; since my system is CentOS, I installed: rpm -ivh openoffice4.1.5-redhat-menus-4.1.5-9789.noarch.rpm
After installation, the program is located under /opt/openoffice4
Start: /opt/openoffice4/program/soffice -headless -accept="socket,host=0.0.0.0,port=8100;urp;" -nofirststartwizard &
Problems encountered
libXext.so.6: cannot open shared object file: No such file or directory
Fix: yum install libXext.x86_64
no suitable windowing system found, exiting.
Fix: yum groupinstall "X Window System"
Then start it again and check the listening port: netstat -lnp |grep 8100
The service is now listening.
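The same port check can be done from application code before attempting a conversion. A minimal Python sketch that tests whether a TCP connection to the soffice port succeeds; the host 127.0.0.1 is an assumption (the service above binds 0.0.0.0, port 8100), and a successful connection only shows that something is listening, not that it is soffice:

```python
import socket


def is_listening(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    print("soffice reachable:", is_listening("127.0.0.1", 8100))
```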
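Known issues
Support for many Chinese fonts is poor; many Chinese characters and special characters cannot be displayed.
Setting up the LibreOffice service
Installation steps
Download the rpm installation package for Linux
Extract the package into a directory
Install
$ sudo yum install ./RPMS/*.rpm   # install all rpm packages of the main installer
$ sudo yum install ./RPMS/*.rpm   # install all rpm packages of the Chinese language pack
$ sudo yum install ./RPMS/*.rpm   # install all rpm packages of the Chinese offline help
Uninstall
$ sudo apt-get remove --purge libreoffice6.x-*   # remove all packages matching libreoffice6.x-*; --purge also removes the related configuration files
This is the Debian/Ubuntu form of the command; on a yum-based system such as CentOS, yum remove would be used instead.
Summary
Installing LibreOffice did not run into as many problems as OpenOffice, its support for Chinese characters is better, and the official site also offers Chinese fonts for download.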
Connecting to and calling the Office service from Spring Boot
    // Relies on two fields defined elsewhere in the controller (not shown in the post):
    // `remoteAddr`, the base URL of the file server, and `converter`, the document converter.
    public Object preview(@PathVariable String fileName) {
        try {
            Resource resource = new UrlResource(remoteAddr + fileName);
            // PDF files need no conversion.
            if (FilenameUtils.getExtension(resource.getFilename()).equalsIgnoreCase("pdf")) {
                return "Is the PDF file";
            }
            try (ByteArrayOutputStream baos = new ByteArrayOutputStream()) {

                final DocumentFormat targetFormat =
                        DefaultDocumentFormatRegistry.getFormatByExtension("pdf");
                // Convert from the source format (inferred from the file extension) to PDF, in memory.
                converter
                        .convert(resource.getInputStream())
                        .as(
                                DefaultDocumentFormatRegistry.getFormatByExtension(
                                        FilenameUtils.getExtension(resource.getFilename())))
                        .to(baos)
                        .as(targetFormat)
                        .execute();

                // Return the PDF bytes with the matching Content-Type.
                final HttpHeaders headers = new HttpHeaders();
                headers.setContentType(MediaType.parseMediaType(targetFormat.getMediaType()));
                return new ResponseEntity<>(baos.toByteArray(), headers, HttpStatus.OK);

            } catch (OfficeException | IOException e) {
                e.printStackTrace();
                return "convert error: " + e.getMessage();
            }
        } catch (IOException e) {
            e.printStackTrace();
            return "File does not exist;";
        }
    }
Setting up the Collabora Office service
Official site: https://www.collaboraoffice.com/solutions/collabora-office/
Setting up the Collabora CODE service
The official recommendation is to install it with Docker
Docker
    $ docker pull collabora/code
    $ docker run -t -d -p 127.0.0.1:9980:9980 -e "domain=<your-dot-escaped-domain>" \
        -e "username=admin" -e "password=S3cRet" --restart always --cap-add MKNOD collabora/code
Linux packages
    # import the signing key
    wget https://www.collaboraoffice.com/repos/CollaboraOnline/CODE-centos7/repodata/repomd.xml.key && rpm --import repomd.xml.key
    # add the repository URL to yum
    yum-config-manager --add-repo https://www.collaboraoffice.com/repos/CollaboraOnline/CODE-centos7
    # perform the installation
    yum install loolwsd CODE-brand
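Online collaboration on Office documents
This requires a domain name and an SSL certificate; it has not yet been investigated in practice.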