
Integration of Apache Nutch 1.12 and Solr 5.4.1 failed

I have successfully crawled several websites and created two segments with Nutch. I have also installed and started the Solr service.

But when I try to index this crawled data into Solr, it does not work.

I tried this command:

bin/nutch index http://localhost:8983/solr crawl/crawldb/ -linkdb crawl/linkdb/ crawl/segments/* 

Output:

The input path at crawldb is not a segment... skipping 
Segment dir is complete: crawl/segments/20161214143435. 
Segment dir is complete: crawl/segments/20161214144230. 
Indexer: starting at 2016-12-15 10:55:35 
Indexer: deleting gone documents: false 
Indexer: URL filtering: false 
Indexer: URL normalizing: false 
Active IndexWriters : 
SOLRIndexWriter 
    solr.server.url : URL of the SOLR instance 
    solr.zookeeper.hosts : URL of the Zookeeper quorum 
    solr.commit.size : buffer size when sending to SOLR (default 1000) 
    solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml) 
    solr.auth : use authentication (default false) 
    solr.auth.username : username for authentication 
    solr.auth.password : password for authentication 


Indexer: java.io.IOException: No FileSystem for scheme: http 
    at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2385) 
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2392) 
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:89) 
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2431) 
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2413) 
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:368) 
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296) 
    at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:256) 
    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228) 
    at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:45) 
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:304) 
    at org.apache.hadoop.mapreduce.JobSubmitter.writeOldSplits(JobSubmitter.java:520) 
    at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:512) 
    at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:394) 
    at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1285) 
    at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1282) 
    at java.security.AccessController.doPrivileged(Native Method) 
    at javax.security.auth.Subject.doAs(Subject.java:422) 
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548) 
    at org.apache.hadoop.mapreduce.Job.submit(Job.java:1282) 
    at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:562) 
    at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:557) 
    at java.security.AccessController.doPrivileged(Native Method) 
    at javax.security.auth.Subject.doAs(Subject.java:422) 
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548) 
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:557) 
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:548) 
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:833) 
    at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:145) 
    at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:228) 
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70) 
    at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:237) 
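
For reference, the "No FileSystem for scheme: http" error indicates that Hadoop tried to open the Solr URL as an input path (the first line of the output also shows the crawldb being skipped as "not a segment"). In Nutch 1.12 the Solr URL is normally passed as a property rather than as a positional argument; a minimal sketch of that form, assuming a core named nutch (the core name is an assumption, not taken from the command above):

# pass the Solr URL via -D instead of as the first argument
bin/nutch index -D solr.server.url=http://localhost:8983/solr/nutch crawl/crawldb/ -linkdb crawl/linkdb/ crawl/segments/*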

And I also tried this command:

bin/nutch solrindex http://localhost:8983/solr crawl/crawldb/ -linkdb crawl/linkdb/ crawl/segments/* 

Output:

Segment dir is complete: crawl/segments/20161214143435. 
Segment dir is complete: crawl/segments/20161214144230. 
Indexer: starting at 2016-12-15 10:54:07 
Indexer: deleting gone documents: false 
Indexer: URL filtering: false 
Indexer: URL normalizing: false 
Active IndexWriters : 
SOLRIndexWriter 
    solr.server.url : URL of the SOLR instance 
    solr.zookeeper.hosts : URL of the Zookeeper quorum 
    solr.commit.size : buffer size when sending to SOLR (default 1000) 
    solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml) 
    solr.auth : use authentication (default false) 
    solr.auth.username : username for authentication 
    solr.auth.password : password for authentication 


Indexing 250/250 documents 
Deleting 0 documents 
Indexing 250/250 documents 
Deleting 0 documents 
Indexer: java.io.IOException: Job failed! 
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:836) 
    at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:145) 
    at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:228) 
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70) 
    at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:237) 
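
The generic "Job failed!" message from the local job runner hides the real cause; it normally shows up in Nutch's Hadoop log (see the log excerpt in the edit below). A quick way to look at it, assuming the default logs/hadoop.log location inside the Nutch runtime directory:

# show the most recent entries of the Nutch log
tail -n 100 logs/hadoop.log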

Before that, I copied the nutch/conf/schema.xml file into /Nutch/solr-5.4.1/server/solr/configsets/data_driven_schema_configs/conf and renamed it to managed-schema, as suggested.
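
For comparison, the usual sequence on Solr 5.x is to create a dedicated core from a configset that contains Nutch's schema and then index against that core's URL. A rough sketch, assuming the core is called nutch and using NUTCH_HOME/SOLR_HOME as placeholder paths (all three are assumptions):

# copy Nutch's schema into the configset; Solr 5 manages it as "managed-schema"
cp $NUTCH_HOME/conf/schema.xml $SOLR_HOME/server/solr/configsets/data_driven_schema_configs/conf/managed-schema
# create a core backed by this configset
$SOLR_HOME/bin/solr create -c nutch -d data_driven_schema_configs
# index against the core URL (note the core name at the end)
bin/nutch solrindex http://localhost:8983/solr/nutch crawl/crawldb/ -linkdb crawl/linkdb/ crawl/segments/*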

What could my possible mistakes be? Thanks in advance!

Edit

This is my Nutch log:

........................... 
........................... 
2016-12-15 10:15:48,355 INFO indexer.IndexerMapReduce - IndexerMapReduce: crawldb: crawl/crawldb 
2016-12-15 10:15:48,355 INFO indexer.IndexerMapReduce - IndexerMapReduce: linkdb: crawl/linkdb 
2016-12-15 10:15:48,355 INFO indexer.IndexerMapReduce - IndexerMapReduces: adding segment: crawl/segments/20161214143435 
2016-12-15 10:15:48,378 INFO indexer.IndexerMapReduce - IndexerMapReduces: adding segment: crawl/segments/20161214144230 
2016-12-15 10:15:49,120 WARN conf.Configuration - file:/tmp/hadoop-kaidul/mapred/staging/kaidul1333791357/.staging/job_local1333791357_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring. 
2016-12-15 10:15:49,122 WARN conf.Configuration - file:/tmp/hadoop-kaidul/mapred/staging/kaidul1333791357/.staging/job_local1333791357_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring. 
2016-12-15 10:15:49,180 WARN conf.Configuration - file:/tmp/hadoop-kaidul/mapred/local/localRunner/kaidul/job_local1333791357_0001/job_local1333791357_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring. 
2016-12-15 10:15:49,181 WARN conf.Configuration - file:/tmp/hadoop-kaidul/mapred/local/localRunner/kaidul/job_local1333791357_0001/job_local1333791357_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring. 
2016-12-15 10:15:49,406 INFO anchor.AnchorIndexingFilter - Anchor deduplication is: off 
2016-12-15 10:15:50,930 INFO indexer.IndexWriters - Adding org.apache.nutch.indexwriter.solr.SolrIndexWriter 
2016-12-15 10:15:51,137 INFO solr.SolrMappingReader - source: content dest: content 
2016-12-15 10:15:51,137 INFO solr.SolrMappingReader - source: title dest: title 
2016-12-15 10:15:51,137 INFO solr.SolrMappingReader - source: host dest: host 
2016-12-15 10:15:51,137 INFO solr.SolrMappingReader - source: segment dest: segment 
2016-12-15 10:15:51,137 INFO solr.SolrMappingReader - source: boost dest: boost 
2016-12-15 10:15:51,137 INFO solr.SolrMappingReader - source: digest dest: digest 
2016-12-15 10:15:51,137 INFO solr.SolrMappingReader - source: tstamp dest: tstamp 
2016-12-15 10:15:51,243 INFO solr.SolrIndexWriter - Indexing 250/250 documents 
2016-12-15 10:15:51,243 INFO solr.SolrIndexWriter - Deleting 0 documents 
2016-12-15 10:15:51,384 INFO solr.SolrIndexWriter - Indexing 250/250 documents 
2016-12-15 10:15:51,384 INFO solr.SolrIndexWriter - Deleting 0 documents 
2016-12-15 10:15:51,414 WARN mapred.LocalJobRunner - job_local1333791357_0001 
java.lang.Exception: org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://localhost:8983/solr: Expected mime type application/octet-stream but got text/html. <html> 
<head> 
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/> 
<title>Error 404 Not Found</title> 
</head> 
<body><h2>HTTP ERROR 404</h2> 
<p>Problem accessing /solr/update. Reason: 
<pre> Not Found</pre></p><hr><i><small>Powered by Jetty://</small></i><hr/> 

</body> 
</html> 

    at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462) 
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:529) 
............................ 
............................. 
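
The 404 on /solr/update in this log suggests that Solr is reachable but that no core is addressed in the URL; on Solr 5 the update handler lives under a core, e.g. /solr/<core>/update. One way to check, assuming the core is named nutch (an assumption):

# a search response instead of a 404 means the core exists and is reachable
curl "http://localhost:8983/solr/nutch/select?q=*:*&rows=0"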

Answer


The problem was a version incompatibility between Solr, Nutch and HBase. This article worked perfectly for me.