Contents
  1. References
  2. Importing data via Java
  3. Importing data from MySQL
  4. Adding Chinese word segmentation

References

Importing data via Java

https://www.cnblogs.com/xiao-zhang-blogs/p/7339476.html

import java.io.IOException;

import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public void addDoc() throws SolrServerException, IOException {
    HttpSolrClient solr = new HttpSolrClient(SOLR_URL + "helloworld");
    try {
        // Build a document. Every field added on the client side
        // must already be defined in the server-side schema.
        SolrInputDocument document = new SolrInputDocument();
        document.addField("id", "1");
        document.addField("name", "你好");
        document.addField("description", "前进的中国你好");
        solr.add(document);
        solr.commit();
    } finally {
        solr.close();
    }
}
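
To verify that the document was indexed, the same client can run a query. A minimal sketch, assuming the same SOLR_URL constant and helloworld core as above (the queryDoc name and the search term are illustrative):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public void queryDoc() throws SolrServerException, IOException {
    HttpSolrClient solr = new HttpSolrClient(SOLR_URL + "helloworld");
    try {
        // Search the description field for a term added above.
        SolrQuery query = new SolrQuery("description:你好");
        QueryResponse response = solr.query(query);
        for (SolrDocument doc : response.getResults()) {
            System.out.println(doc.getFieldValue("id") + " -> " + doc.getFieldValue("name"));
        }
    } finally {
        solr.close();
    }
}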

Importing data from MySQL

http://www.cnblogs.com/blueskyli/p/7128400.html
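
The linked post uses Solr's DataImportHandler (DIH). As a minimal sketch of the moving parts, assuming a hypothetical city table in a local MySQL database mydb (all names and credentials below are illustrative): first register the handler in solrconfig.xml, with the solr-dataimporthandler jar from dist/ and the MySQL JDBC driver on the classpath:

<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">data-config.xml</str>
  </lst>
</requestHandler>

Then describe the SQL-to-schema mapping in conf/data-config.xml:

<dataConfig>
  <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost:3306/mydb" user="root" password="root"/>
  <document>
    <!-- Each returned row becomes one Solr document; columns map to schema fields. -->
    <entity name="city" query="SELECT id, cityname FROM city">
      <field column="id" name="id"/>
      <field column="cityname" name="cityname"/>
    </entity>
  </document>
</dataConfig>

A full import can then be triggered from the Dataimport tab of the admin UI or by requesting /dataimport?command=full-import.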

Adding Chinese word segmentation

1. Copy the ext, dist, contrib, and related folders from the source installation, then update the <lib> directives in solrconfig.xml to point at them.
These folders sit at the same level as the collection directory, so relative paths work directly:

<lib dir="../contrib/extraction/lib" regex=".*\.jar" />
<lib dir="../dist/" regex="solr-cell-\d.*\.jar" />
<lib dir="../contrib/clustering/lib/" regex=".*\.jar" />
<lib dir="../dist/" regex="solr-clustering-\d.*\.jar" />
<lib dir="../contrib/langid/lib/" regex=".*\.jar" />
<lib dir="../dist/" regex="solr-langid-\d.*\.jar" />
<lib dir="../contrib/velocity/lib" regex=".*\.jar" />
<lib dir="../dist/" regex="solr-velocity-\d.*\.jar" />

  1. Import the data; see this blog post for reference:
    http://www.cnblogs.com/blueskyli/p/7128400.html

  2. Set up a simple Chinese tokenizer:
    https://www.cnblogs.com/LUA123/p/7783102.html

  • Copy the solr-6.6.2\contrib\analysis-extras\lucene-libs\lucene-analyzers-smartcn-6.6.2.jar jar into webapp/solr/WEB-INF/lib.

  • Define the field type and field:

    <fieldType name="solr_cnAnalyzer" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
    <tokenizer class="org.apache.lucene.analysis.cn.smart.HMMChineseTokenizerFactory"/>
    </analyzer>
    <analyzer type="query">
    <tokenizer class="org.apache.lucene.analysis.cn.smart.HMMChineseTokenizerFactory"/>
    </analyzer>
    </fieldType>
    <field name="cityname" type="solr_cnAnalyzer" indexed="true" stored="true"/>
  • Test the segmentation on the admin UI's Analysis page: choose the field, enter 中国九江 as the Field Value (Index), and run the analysis; a sample search query follows below.
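
As an end-to-end check, a search against the analyzed field should match on segmented words rather than only the exact string. A minimal sketch, assuming a core named helloworld and the cityname field defined above:

http://localhost:8983/solr/helloworld/select?q=cityname:九江

With the HMM tokenizer, an indexed value like 中国九江 should be split into word-level tokens such as 中国 and 九江, so the query term 九江 matches.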

  1. Using the Chinese tokenizer that ships with Solr 6:
    https://blog.csdn.net/programmeryu/article/details/72828561
    Its segmentation emits every Chinese character as its own token and never joins characters into words. A field definition that uses this type is sketched after this list.

    <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100" multiValued="true">
    <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
    <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" expand="true" ignoreCase="true" synonyms="synonyms.txt"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
    </fieldType>
  2. Setting up AnsjTokenizer for Chinese word segmentation:
    https://blog.csdn.net/lzx1104/article/details/51438136###

  • Download ansj_seg-5.1.0.jar and nlp-lang-1.7.3.jar from the Maven repository, and fetch the project source from GitHub.
    The extension is made in the ansj_seg Lucene 5 plugin project (ansj_seg/plugin/ansj_lucene5_plug); download the source first and make sure it builds cleanly under Maven.
  • Add a solr/AnsjTokenizerFactory Java source file.
  • Build it into a jar: ansj_lucene5_plug-5.1.2.0.jar.
  • Copy all three jars into the lib directory.
  • Take ansj_seg-master\library\default.dic and configure it as the segmentation dictionary. Custom terms can be added to it; the file can be placed directly under the conf directory and then referenced via the stopwords attribute.
  • Configure managed-schema:
    <fieldType name="text_ansj" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
    <tokenizer class="org.ansj.solr.AnsjTokenizerFactory" isQuery="false" stopwords="default.dic"/>
    </analyzer>
    <analyzer type="query">
    <tokenizer
    class="org.ansj.solr.AnsjTokenizerFactory"/>
    </analyzer>
    </fieldType>
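
To put either field type to use, point a field at it in managed-schema, just as was done for solr_cnAnalyzer above; the field names below are illustrative:

<field name="description_general" type="text_general" indexed="true" stored="true"/>
<field name="description_ansj" type="text_ansj" indexed="true" stored="true"/>

After editing managed-schema, reload the core (or restart Solr) so the new analyzers take effect.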