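Research notes on office suites

Elasticsearch environment setup

Implementing Chinese word segmentation

  1. Install the plugin: https://github.com/medcl/elasticsearch-analysis-ik

  2. Test the analyzers:

ik_max_word splits the text at the finest granularity;
ik_smart splits it at the coarsest granularity.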

http://192.168.10.74:9200/_analyze/ POST
{
"analyzer": "ik_max_word",
"text": "绝地求生是最好玩的游戏"
}


{
"analyzer": "ik_smart",
"text": "绝地求生是最好玩的游戏"
}


{
"analyzer": "standard",
"text": "绝地求生是最好玩的游戏"
}
  1. Create the index

    http://192.168.10.74:9200/ik-index PUT
    Use the ik_max_word analyzer for the fields

{
  "settings": {
    "analysis": {
      "analyzer": {
        "ik": {
          "tokenizer": "ik_max_word"
        }
      }
    }
  },
  "mappings": {
    "article": {
      "dynamic": true,
      "properties": {
        "subject": {
          "type": "text",
          "analyzer": "ik_max_word"
        },
        "content": {
          "type": "text",
          "analyzer": "ik_max_word"
        }
      }
    }
  }
}

  1. Add data

  2. Query:
    http://192.168.10.74:9200/index/_search POST

    {
    "query": {
    "match": {
    "subject": "合肥送餐冲突"
    }
    },
    "highlight": {
    "pre_tags": ["<span style = 'color:red'>"],
    "post_tags": ["</span>"],
    "fields": {"subject": {}}
    }
    }
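  3. Hot-reloading the dictionary
    Configure the remote dictionary entry (remote_ext_dict) in IKAnalyzer.cfg.xml with this URL:

    http://localhost/hotload.dic

    and place hotload.dic on the static file server.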

  4. Synonym configuration
    http://192.168.10.74:9200/synonyms-ik-index PUT

{
  "settings": {
    "analysis": {
      "analyzer": {
        "by_smart": {
          "type": "custom",
          "tokenizer": "ik_smart",
          "filter": [
            "by_tfr",
            "by_sfr"
          ],
          "char_filter": [
            "by_cfr"
          ]
        },
        "by_max_word": {
          "type": "custom",
          "tokenizer": "ik_max_word",
          "filter": [
            "by_tfr",
            "by_sfr"
          ],
          "char_filter": [
            "by_cfr"
          ]
        }
      },
      "filter": {
        "by_tfr": {
          "type": "stop",
          "stopwords": [
            " "
          ]
        },
        "by_sfr": {
          "type": "synonym",
          "synonyms_path": "synonyms.dic"
        }
      },
      "char_filter": {
        "by_cfr": {
          "type": "mapping",
          "mappings": [
            "| => |"
          ]
        }
      }
    }
  },
  "mappings": {
    "article": {
      "dynamic": true,
      "properties": {
        "subject": {
          "type": "text",
          "analyzer": "by_max_word",
          "search_analyzer": "by_smart"
        },
        "content": {
          "type": "text",
          "analyzer": "by_max_word",
          "search_analyzer": "by_smart"
        }
      }
    }
  }
}
  1. Test the synonyms

    http://192.168.10.74:9200/synonyms-ik-index/_analyze POST

{
"analyzer": "by_smart",
"text": "绝地求生是最好玩的游戏"
}
  1. Query using a synonym
    http://192.168.10.74:9200/synonyms-ik-index/_search POST
{
"query": {
"match": {
"subject": "吃鸡"
}
},
"highlight": {
"pre_tags": [
"<span style = 'color:red'>"
],
"post_tags": [
"</span>"
],
"fields": {
"subject": {}
}
}
}
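
The _analyze and _search requests in this section can be sent from any HTTP client. As a rough sketch, assuming Java 11 or later (for java.net.http), they could be built as follows. The class and method names are made up for illustration, and the escaping covers only backslashes and double quotes, so a real project would more likely use a JSON library.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical helper, not part of Elasticsearch or of the original notes.
final class EsRequestSketch {

    /** Escapes backslashes and double quotes only; not a full JSON encoder. */
    static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    /** Body for POST .../_analyze with the given analyzer name and text. */
    static String analyzeBody(String analyzer, String text) {
        return "{\"analyzer\":\"" + escape(analyzer) + "\",\"text\":\"" + escape(text) + "\"}";
    }

    /** Body for POST .../_search: a match query plus highlighting on the same field. */
    static String matchBody(String field, String text) {
        return "{\"query\":{\"match\":{\"" + escape(field) + "\":\"" + escape(text) + "\"}},"
                + "\"highlight\":{\"fields\":{\"" + escape(field) + "\":{}}}}";
    }

    /** Builds a JSON POST request; the body is sent as UTF-8. */
    static HttpRequest post(String url, String jsonBody) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(jsonBody))
                .build();
    }

    /** Sends the request and returns the raw response body. */
    static String send(HttpRequest request) throws Exception {
        return HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString())
                .body();
    }
}
```

For example, send(post("http://192.168.10.74:9200/synonyms-ik-index/_analyze", analyzeBody("by_smart", text))) would return the token list as a JSON string, provided the server is reachable.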

Data import/export: elasticdump

GitHub: https://github.com/taskrabbit/elasticsearch-dump

Implementing file search

  1. Documentation: https://www.elastic.co/guide/en/elasticsearch/plugins/5.3/using-ingest-attachment.html

  2. Install the plugin
    ./bin/elasticsearch-plugin install ingest-attachment

  3. Create the single_attachment pipeline
    http://192.168.10.74:9200/_ingest/pipeline/single_attachment PUT

{
"description": "Extract single attachment information",
"processors": [
{
"attachment": {
"field": "data",
"indexed_chars": -1,
"ignore_missing": true
}
},
{
"remove": {
"field": "data"
}
}
]
}

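field: the field that holds the attachment content (it must be Base64-encoded)

target_field: the field that receives the extracted attachment information (author, date, content type)

indexed_chars: the maximum number of characters to extract; the default is 100000. Set it to -1 for no limit (note that with -1, a very large file can exhaust memory and end up only partially processed)

indexed_chars_field: a field whose value overrides indexed_chars for that document, so the limit can be chosen per file, for example according to its size.

properties: which attachment properties to store; possible values are content, title, name, author, keywords, date, content_type, content_length, language

ignore_missing: defaults to false; when set to true and the field named above does not exist, the attachment is not parsed and the document is still kept

The pipeline above also adds a remove processor that deletes the Base64 content of data once the attachment has been extracted.

Pipeline for multiple attachments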

{
  "description": "Extract multiple attachments",
  "processors": [
    {
      "foreach": {
        "field": "attachments",
        "processor": {
          "attachment": {
            "target_field": "_ingest._value.attachment",
            "field": "_ingest._value.data",
            "indexed_chars": -1,
            "ignore_missing": true
          }
        }
      }
    }
  ]
}
  1. Delete the pipeline

http://192.168.10.74:9200/_ingest/pipeline/single_attachment DELETE

  1. Create the index
    http://192.168.10.74:9200/file_attachment/ PUT
{
  "settings": {
    "analysis": {
      "analyzer": {
        "ik": {
          "tokenizer": "ik_max_word"
        }
      }
    }
  },
  "mappings": {
    "attachment": {
      "properties": {
        "filename": {
          "type": "text",
          "analyzer": "ik_max_word"
        },
        "data": {
          "type": "text"
        },
        "time": {
          "type": "keyword"
        },
        "attachment.content": {
          "type": "text",
          "analyzer": "ik_max_word"
        }
      }
    }
  }
}
  1. Add data
    http://192.168.10.74:9200/file_attachment/attachment/1?pipeline=single_attachment&refresh=true&pretty=1 POST
{
"filename": "测试文档.txt",
"time": "2018-06-13 15:14:00",
"data": "6L+Z5piv56ys5LiA5Liq55So5LqO5rWL6K+V5paH5pys6ZmE5Lu255qE5YaF5a6577yb5paH5Lu25qC85byP5Li6dHh0LOaWh+acrOS4uuS4reaWhw=="
}
  1. Search the documents
    http://192.168.10.74:9200/file_attachment/_search POST
{
"query": {
"match": {
"attachment.content": "测试"
}
},
"highlight": {
"pre_tags": [
"<span style = 'color:red'>"
],
"post_tags": [
"</span>"
],
"fields": {
"attachment.content": {}
}
}
}
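
Since the attachment processor expects the data field to contain Base64-encoded file content, the document has to be prepared on the client side before it is posted. A minimal sketch of that step using only the JDK is shown below; the class and method names are hypothetical, and the JSON is assembled by hand purely to keep the example self-contained.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Base64;

// Hypothetical helper for preparing documents for the attachment pipeline.
final class AttachmentDocSketch {

    /** Base64-encodes raw file bytes for the "data" field. */
    static String encode(byte[] content) {
        return Base64.getEncoder().encodeToString(content);
    }

    /** Decodes a Base64 value back to UTF-8 text; handy for checking plain-text samples. */
    static String decodeText(String base64) {
        return new String(Base64.getDecoder().decode(base64), StandardCharsets.UTF_8);
    }

    /** Reads a file from disk and returns its Base64 form. */
    static String encodeFile(Path file) throws IOException {
        return encode(Files.readAllBytes(file));
    }

    /** Builds a document with the same three fields as the sample request. */
    static String document(String filename, String time, byte[] content) {
        return "{\"filename\":\"" + escape(filename) + "\","
                + "\"time\":\"" + escape(time) + "\","
                + "\"data\":\"" + encode(content) + "\"}";
    }

    /** Escapes backslashes and double quotes only; not a full JSON encoder. */
    private static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }
}
```

Applying decodeText to the data value of the sample document recovers the text of the test file, which is a quick way to confirm that a payload was encoded rather than corrupted.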

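Note: when an nginx static directory is used to store the files and you want txt, html, pdf and similar files to be downloaded instead of opened in the browser, add the following to the nginx configuration: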

server {
listen 80;
server_name localhost;

#charset koi8-r;

#access_log logs/host.access.log main;

location / {
root html;
if ($request_filename ~* ^.*?\.(txt|doc|pdf|rar|gz|zip|docx|exe|xlsx|ppt|pptx|jpg|png|html|xml)$){
add_header Content-Disposition attachment;
add_header Content-Type 'APPLICATION/OCTET-STREAM';
}
index index.html index.htm;
}

#error_page 404 /404.html;

# redirect server error pages to the static page /50x.html
#
error_page 500 502 503 504 /50x.html;
location = /50x.html {
root html;
}

# proxy the PHP scripts to Apache listening on 127.0.0.1:80
#
#location ~ \.php$ {
# proxy_pass http://127.0.0.1;
#}

# pass the PHP scripts to FastCGI server listening on 127.0.0.1:9000
#
#location ~ \.php$ {
# root html;
# fastcgi_pass 127.0.0.1:9000;
# fastcgi_index index.php;
# fastcgi_param SCRIPT_FILENAME /scripts$fastcgi_script_name;
# include fastcgi_params;
#}

# deny access to .htaccess files, if Apache's document root
# concurs with nginx's one
#
#location ~ /\.ht {
# deny all;
#}
}

The key part is:
if ($request_filename ~* ^.*?\.(txt|doc|pdf|rar|gz|zip|docx|exe|xlsx|ppt|pptx|jpg|png|html|xml)$){
add_header Content-Disposition attachment;
add_header Content-Type 'APPLICATION/OCTET-STREAM';
}
Alternatively, it can be handled like this:
if ($args ~ "target=download") {
add_header Content-Disposition 'attachment';
add_header Content-Type 'APPLICATION/OCTET-STREAM';
}

With this variant, adding the target=download parameter to a GET request is enough to trigger a download.
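
One detail of the extension pattern is worth checking: in a regular expression, a dot matches any character unless it is escaped. The Java sketch below compares the two forms on a shortened extension list. Java's regex engine only stands in for the PCRE engine used by nginx here, and the class name and sample paths are invented for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical demonstration class; the extension list is shortened on purpose.
final class DownloadPatternSketch {

    // Unescaped dot before the group: any character may precede the extension.
    static final Pattern LOOSE =
            Pattern.compile("^.*?.(txt|doc|pdf)$", Pattern.CASE_INSENSITIVE);

    // Escaped dot: a literal "." must precede the extension.
    static final Pattern STRICT =
            Pattern.compile("\\.(txt|doc|pdf)$", Pattern.CASE_INSENSITIVE);

    static boolean looseMatch(String path) {
        return LOOSE.matcher(path).find();
    }

    static boolean strictMatch(String path) {
        return STRICT.matcher(path).find();
    }
}
```

Both patterns accept /files/report.txt, but only the loose one also accepts a path such as /files/notatxt, which has no extension at all.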

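Office suite research

Setting up the OpenOffice service

Installation steps

  1. Download the rpm package from the official site: https://www.openoffice.org/download/

  2. Extract it, go to /zh-CN/RPMS/, and install the rpm packages: rpm -ivh *.rpm

  3. After installation a desktop-integration directory is created. Enter it; since my system is CentOS, I chose to install: rpm -ivh openoffice4.1.5-redhat-menus-4.1.5-9789.noarch.rpm

  4. Once installed, OpenOffice is located under /opt/openoffice4
    Start it: /opt/openoffice4/program/soffice -headless -accept="socket,host=0.0.0.0,port=8100;urp;" -nofirststartwizard &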

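Problems encountered

  1. libXext.so.6: cannot open shared object file: No such file or directory
    Fix: yum install libXext.x86_64

  2. no suitable windowing system found, exiting.
    Fix: yum groupinstall "X Window System"

After that, start the service again and check the listening port with netstat -lnp | grep 8100.
It now works.

Remaining issues

Support for many Chinese fonts is poor, and many Chinese and special characters cannot be displayed.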

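Setting up the LibreOffice service

Installation steps

  1. Download the rpm installation packages for Linux

  2. Extract the packages into a directory

  3. Install
    $ sudo yum install ./RPMS/*.rpm   # all rpm packages of the main installer
    $ sudo yum install ./RPMS/*.rpm   # all rpm packages of the Chinese language pack
    $ sudo yum install ./RPMS/*.rpm   # all rpm packages of the Chinese offline help

  4. Uninstall
    $ sudo apt-get remove --purge libreoffice6.x-*   # removes every package matching libreoffice6.x-*; --purge also removes the related configuration files
    (apt-get applies to Debian/Ubuntu; on an rpm-based system such as CentOS, yum remove is the counterpart.)

Summary

Installing LibreOffice did not run into as many problems as OpenOffice did, its support for Chinese characters is better, and the official site also offers the corresponding Chinese fonts for download.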

Connecting to and calling the Office service from Spring Boot

// Relies on two fields of the enclosing controller: converter (a JODConverter
// DocumentConverter) and remoteAddr (the base URL of the file server).
public Object preview(@PathVariable String fileName) {
    try {
        Resource resource = new UrlResource(remoteAddr + fileName);
        String extension = FilenameUtils.getExtension(resource.getFilename());
        // PDF files need no conversion; the constant-first comparison is null-safe.
        if ("pdf".equalsIgnoreCase(extension)) {
            return "Is the PDF file";
        }
        try (ByteArrayOutputStream baos = new ByteArrayOutputStream()) {
            final DocumentFormat targetFormat =
                    DefaultDocumentFormatRegistry.getFormatByExtension("pdf");
            // Convert the source document to PDF in memory.
            converter
                    .convert(resource.getInputStream())
                    .as(DefaultDocumentFormatRegistry.getFormatByExtension(extension))
                    .to(baos)
                    .as(targetFormat)
                    .execute();

            // Return the PDF bytes with the matching media type.
            final HttpHeaders headers = new HttpHeaders();
            headers.setContentType(MediaType.parseMediaType(targetFormat.getMediaType()));
            return new ResponseEntity<>(baos.toByteArray(), headers, HttpStatus.OK);
        } catch (OfficeException | IOException e) {
            e.printStackTrace();
            return "convert error: " + e.getMessage();
        }
    } catch (IOException e) {
        // Thrown when the resource URL is malformed.
        e.printStackTrace();
        return "File does not exist;";
    }
}
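
The decision of whether a file has to be converted at all can be isolated from the controller and tested on its own. The sketch below is one possible way to do that with only the JDK; the class and method names are hypothetical, and it mirrors the PDF check in the method above rather than reproducing the FilenameUtils implementation.

```java
import java.util.Locale;

// Hypothetical helper; mirrors the extension check of the preview method.
final class PreviewDecisionSketch {

    /** Returns the lower-cased extension without the dot, or "" when there is none. */
    static String extensionOf(String fileName) {
        if (fileName == null) {
            return "";
        }
        int separator = Math.max(fileName.lastIndexOf('/'), fileName.lastIndexOf('\\'));
        int dot = fileName.lastIndexOf('.');
        // A dot inside a directory name (before the last separator) is not an extension.
        if (dot <= separator) {
            return "";
        }
        return fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    /** Only files that are not already PDF need to go through the office converter. */
    static boolean needsConversion(String fileName) {
        return !"pdf".equals(extensionOf(fileName));
    }
}
```

With this in place, the controller could call needsConversion(fileName) before opening the remote resource, which also avoids a round trip for files that are already PDF.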

Setting up the Collabora Office service

Official site: https://www.collaboraoffice.com/solutions/collabora-office/

Setting up the Collabora CODE service

The official recommendation is to install it with Docker.

Docker
$ docker pull collabora/code
$ docker run -t -d -p 127.0.0.1:9980:9980 -e "domain=<your-dot-escaped-domain>" \
-e "username=admin" -e "password=S3cRet" --restart always --cap-add MKNOD collabora/code
Linux packages
# import the signing key
wget https://www.collaboraoffice.com/repos/CollaboraOnline/CODE-centos7/repodata/repomd.xml.key && rpm --import repomd.xml.key
# add the repository URL to yum
yum-config-manager --add-repo https://www.collaboraoffice.com/repos/CollaboraOnline/CODE-centos7
# perform the installation
yum install loolwsd CODE-brand

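Online document collaboration with office suites

This requires a domain name and an SSL certificate; I have not yet looked into it in practice.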