Integration of Apache Nutch 1.12 with Solr 5.4.1 fails

I have successfully crawled several websites with Nutch and created two segments. I have also installed and started the Solr service. But when I try to index the crawled data into Solr, it does not work.

I tried this command:
bin/nutch index http://localhost:8983/solr crawl/crawldb/ -linkdb crawl/linkdb/ crawl/segments/*
Output:
The input path at crawldb is not a segment... skipping
Segment dir is complete: crawl/segments/20161214143435.
Segment dir is complete: crawl/segments/20161214144230.
Indexer: starting at 2016-12-15 10:55:35
Indexer: deleting gone documents: false
Indexer: URL filtering: false
Indexer: URL normalizing: false
Active IndexWriters :
SOLRIndexWriter
solr.server.url : URL of the SOLR instance
solr.zookeeper.hosts : URL of the Zookeeper quorum
solr.commit.size : buffer size when sending to SOLR (default 1000)
solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
solr.auth : use authentication (default false)
solr.auth.username : username for authentication
solr.auth.password : password for authentication
Indexer: java.io.IOException: No FileSystem for scheme: http
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2385)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2392)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:89)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2431)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2413)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:368)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:256)
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228)
at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:45)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:304)
at org.apache.hadoop.mapreduce.JobSubmitter.writeOldSplits(JobSubmitter.java:520)
at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:512)
at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:394)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1285)
at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1282)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:1282)
at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:562)
at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:557)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:557)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:548)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:833)
at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:145)
at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:228)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:237)
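From the stack trace it looks like the URL may have been treated as an input path by Hadoop (hence "No FileSystem for scheme: http"). If Nutch 1.12 expects the Solr URL as a property rather than a positional argument, the invocation would presumably look like this instead (the core name `mycore` is only a placeholder, not something from my setup):

```shell
# Pass the Solr endpoint as a -D property so it is not parsed as an input path
bin/nutch index -Dsolr.server.url=http://localhost:8983/solr/mycore \
  crawl/crawldb/ -linkdb crawl/linkdb/ crawl/segments/*
```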
And I also tried this command:
bin/nutch solrindex http://localhost:8983/solr crawl/crawldb/ -linkdb crawl/linkdb/ crawl/segments/*
Output:
Segment dir is complete: crawl/segments/20161214143435.
Segment dir is complete: crawl/segments/20161214144230.
Indexer: starting at 2016-12-15 10:54:07
Indexer: deleting gone documents: false
Indexer: URL filtering: false
Indexer: URL normalizing: false
Active IndexWriters :
SOLRIndexWriter
solr.server.url : URL of the SOLR instance
solr.zookeeper.hosts : URL of the Zookeeper quorum
solr.commit.size : buffer size when sending to SOLR (default 1000)
solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
solr.auth : use authentication (default false)
solr.auth.username : username for authentication
solr.auth.password : password for authentication
Indexing 250/250 documents
Deleting 0 documents
Indexing 250/250 documents
Deleting 0 documents
Indexer: java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:836)
at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:145)
at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:228)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:237)
Before this, I copied the nutch/conf/schema.xml file into /Nutch/solr-5.4.1/server/solr/configsets/data_driven_schema_configs/conf and renamed it to managed-schema, as suggested.
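The copy-and-rename step above was roughly the following (paths as described; `$NUTCH_HOME` stands in for my Nutch installation directory):

```shell
# Replace the configset's managed-schema with Nutch's schema.xml
cp "$NUTCH_HOME/conf/schema.xml" \
   /Nutch/solr-5.4.1/server/solr/configsets/data_driven_schema_configs/conf/managed-schema
```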
What could my possible mistakes be? Thanks in advance!
Edit:

This is my Nutch log:
...........................
...........................
2016-12-15 10:15:48,355 INFO indexer.IndexerMapReduce - IndexerMapReduce: crawldb: crawl/crawldb
2016-12-15 10:15:48,355 INFO indexer.IndexerMapReduce - IndexerMapReduce: linkdb: crawl/linkdb
2016-12-15 10:15:48,355 INFO indexer.IndexerMapReduce - IndexerMapReduces: adding segment: crawl/segments/20161214143435
2016-12-15 10:15:48,378 INFO indexer.IndexerMapReduce - IndexerMapReduces: adding segment: crawl/segments/20161214144230
2016-12-15 10:15:49,120 WARN conf.Configuration - file:/tmp/hadoop-kaidul/mapred/staging/kaidul1333791357/.staging/job_local1333791357_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
2016-12-15 10:15:49,122 WARN conf.Configuration - file:/tmp/hadoop-kaidul/mapred/staging/kaidul1333791357/.staging/job_local1333791357_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
2016-12-15 10:15:49,180 WARN conf.Configuration - file:/tmp/hadoop-kaidul/mapred/local/localRunner/kaidul/job_local1333791357_0001/job_local1333791357_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
2016-12-15 10:15:49,181 WARN conf.Configuration - file:/tmp/hadoop-kaidul/mapred/local/localRunner/kaidul/job_local1333791357_0001/job_local1333791357_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
2016-12-15 10:15:49,406 INFO anchor.AnchorIndexingFilter - Anchor deduplication is: off
2016-12-15 10:15:50,930 INFO indexer.IndexWriters - Adding org.apache.nutch.indexwriter.solr.SolrIndexWriter
2016-12-15 10:15:51,137 INFO solr.SolrMappingReader - source: content dest: content
2016-12-15 10:15:51,137 INFO solr.SolrMappingReader - source: title dest: title
2016-12-15 10:15:51,137 INFO solr.SolrMappingReader - source: host dest: host
2016-12-15 10:15:51,137 INFO solr.SolrMappingReader - source: segment dest: segment
2016-12-15 10:15:51,137 INFO solr.SolrMappingReader - source: boost dest: boost
2016-12-15 10:15:51,137 INFO solr.SolrMappingReader - source: digest dest: digest
2016-12-15 10:15:51,137 INFO solr.SolrMappingReader - source: tstamp dest: tstamp
2016-12-15 10:15:51,243 INFO solr.SolrIndexWriter - Indexing 250/250 documents
2016-12-15 10:15:51,243 INFO solr.SolrIndexWriter - Deleting 0 documents
2016-12-15 10:15:51,384 INFO solr.SolrIndexWriter - Indexing 250/250 documents
2016-12-15 10:15:51,384 INFO solr.SolrIndexWriter - Deleting 0 documents
2016-12-15 10:15:51,414 WARN mapred.LocalJobRunner - job_local1333791357_0001
java.lang.Exception: org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://localhost:8983/solr: Expected mime type application/octet-stream but got text/html. <html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
<title>Error 404 Not Found</title>
</head>
<body><h2>HTTP ERROR 404</h2>
<p>Problem accessing /solr/update. Reason:
<pre> Not Found</pre></p><hr><i><small>Powered by Jetty://</small></i><hr/>
</body>
</html>
at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:529)
............................
.............................
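The 404 on /solr/update in the log makes me suspect the base URL is missing a core name. To see which cores Solr actually has, one could query the CoreAdmin API (assuming the default admin endpoint on my local instance):

```shell
# List the cores known to this Solr instance and their status
curl "http://localhost:8983/solr/admin/cores?action=STATUS&wt=json"
```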