Solr Database Import and Chinese Word Segmentation (Part 2)
Reference documents
MySQL database import (a DataImportHandler config sketch follows after these links):
http://www.cnblogs.com/blueskyli/p/7128400.html
ANSJ tokenizer configuration:
https://blog.csdn.net/baidu_26550817/article/details/73181081
Importing data via Java:
https://www.cnblogs.com/xiao-zhang-blogs/p/7339476.html
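The MySQL import linked above goes through Solr's DataImportHandler. As a minimal sketch (the database name, table, credentials, and JDBC URL below are placeholder assumptions, not values taken from the article), the data-config.xml referenced by solrconfig.xml could look roughly like this:

<!-- data-config.xml: pull rows from MySQL into Solr fields (values are illustrative) -->
<dataConfig>
  <dataSource type="JdbcDataSource"
              driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost:3306/test"
              user="root"
              password="root"/>
  <document>
    <!-- map each row of the hypothetical city table onto the id and cityname fields -->
    <entity name="city" query="SELECT id, cityname FROM city">
      <field column="id" name="id"/>
      <field column="cityname" name="cityname"/>
    </entity>
  </document>
</dataConfig>

The handler itself is registered in solrconfig.xml (the dataimporthandler and MySQL driver jars must be on the lib path), after which a full import can be triggered from the admin UI or with /dataimport?command=full-import:

<!-- solrconfig.xml: register the DataImportHandler and point it at the config above -->
<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">data-config.xml</str>
  </lst>
</requestHandler>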
Introducing Chinese word segmentation
1. Copy the ext, dist, contrib, and similar folders from the source installation, and update the lib paths in solrconfig.xml to point to them.
The copied folders sit at the same path as the collection, so the lib entries can point at them directly (see the sketch below).
Copy the jar solr-6.6.2\contrib\analysis-extras\lucene-libs\lucene-analyzers-smartcn-6.6.2.jar into webapp/solr/WEB-INF/lib.
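A sketch of the lib directives this step adds to solrconfig.xml; the relative paths and regexes are illustrative and depend on where the copied contrib and dist folders actually sit:

<!-- solrconfig.xml: load the smartcn / analysis-extras jars (paths are illustrative) -->
<lib dir="${solr.install.dir:../../../..}/contrib/analysis-extras/lib" regex=".*\.jar"/>
<lib dir="${solr.install.dir:../../../..}/contrib/analysis-extras/lucene-libs" regex=".*\.jar"/>
<lib dir="${solr.install.dir:../../../..}/dist/" regex="solr-analysis-extras-\d.*\.jar"/>

Copying the smartcn jar into WEB-INF/lib as described above is an alternative way to get it on the classpath; either route works as long as the core can load HMMChineseTokenizerFactory.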
Set the field attributes:
<fieldType name="solr_cnAnalyzer" class="solr.TextField" positionIncrementGap="100"><analyzer type="index"><tokenizer class="org.apache.lucene.analysis.cn.smart.HMMChineseTokenizerFactory"/></analyzer><analyzer type="query"><tokenizer class="org.apache.lucene.analysis.cn.smart.HMMChineseTokenizerFactory"/></analyzer></fieldType><field name="cityname" type="solr_cnAnalyzer" indexed="true" stored="true"/>测试分词效果,Analysis Field Value (Index)(中国九江), 进行分词查询
Using the Chinese tokenizer that ships with Solr 6
https://blog.csdn.net/programmeryu/article/details/72828561
With the default text_general type, every Chinese character becomes a separate token; characters are never joined into words:

<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100" multiValued="true">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" expand="true" ignoreCase="true" synonyms="synonyms.txt"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
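For a side-by-side check in the Analysis screen, a field backed by text_general can be declared next to the smartcn-backed cityname field; the field name below is just a placeholder:

<!-- illustrative field using the built-in text_general type, for comparison only -->
<field name="cityname_std" type="text_general" indexed="true" stored="true"/>

Analyzing 中国九江 against this field should produce the single-character tokens 中 / 国 / 九 / 江, while the solr_cnAnalyzer field above groups them into words.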
Setting up AnsjTokenizer Chinese segmentation
https://blog.csdn.net/lzx1104/article/details/51438136###
- Download ansj_seg-5.1.0.jar and nlp-lang-1.7.3.jar from the Maven repository, and also download the project source from GitHub.
- The extension is made in the Lucene 5 plugin project inside ansj_seg (ansj_seg/plugin/ansj_lucene5_plug), so the source has to be downloaded first and must build cleanly with Maven.
- Add the solr/AnsjTokenizerFactory Java file.
- Compile it into the jar ansj_lucene5_plug-5.1.2.0.jar.
- Add all three jars to the lib directory.
- Grab ansj_seg-master\library\default.dic and set it as the segmentation dictionary; custom terms can also be added to it. It can be placed directly under the conf directory and then referenced via the stopwords attribute.
- Configure managed-schema (a field declaration sketch follows below):

<fieldType name="text_ansj" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="org.ansj.solr.AnsjTokenizerFactory" isQuery="false" stopwords="default.dic"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="org.ansj.solr.AnsjTokenizerFactory"/>
  </analyzer>
</fieldType>
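A field still has to reference the new type before anything is indexed with it; a minimal sketch, assuming a hypothetical field name:

<!-- illustrative field backed by the Ansj type -->
<field name="content_ansj" type="text_ansj" indexed="true" stored="true"/>

After reloading the core, the Analysis screen can be used in the same way as before to confirm that Ansj segments 中国九江 into dictionary words rather than single characters.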