
I am using boilerpipe with pyspark to extract text from HTML. However, there is a problem I have not been able to solve. I have a list of 50k items. I take a batch of 1000 items, process it, and save the resulting RDD to HDFS. The error I ran into is this:

ERROR:root:Exception while sending command. 
Traceback (most recent call last): 
    File "/home/hadoopuser/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 883, in send_command 
    response = connection.send_command(command) 
    File "/home/hadoopuser/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1040, in send_command 
    "Error while receiving", e, proto.ERROR_ON_RECEIVE) 
Py4JNetworkError: Error while receiving 
Traceback (most recent call last): 
    File "/home/hadoopuser/CommonCrawl_Spark/CommonCrawl_Spark/all.py", line 265, in <module> 
    x = get_data(line[:-1],c) 
    File "/home/hadoopuser/CommonCrawl_Spark/CommonCrawl_Spark/all.py", line 208, in get_data 
    sc.parallelize(warcrecords).repartition(72).map(lambda s: classify(s)).saveAsTextFile(file_name) 
    File "/home/hadoopuser/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1552, in saveAsTextFile 
    File "/home/hadoopuser/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__ 
    File "/home/hadoopuser/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 327, in get_return_value 
py4j.protocol.Py4JError: An error occurred while calling o40.saveAsTextFile 
17/09/19 18:11:10 INFO SparkContext: Invoking stop() from shutdown hook 
17/09/19 18:11:10 INFO SparkUI: Stopped Spark web UI at http://192.168.0.255:4040 
17/09/19 18:11:10 INFO DAGScheduler: Job 0 failed: saveAsTextFile at NativeMethodAccessorImpl.java:0, took 14.746797 s 
17/09/19 18:11:10 INFO DAGScheduler: ResultStage 1 (saveAsTextFile at NativeMethodAccessorImpl.java:0) failed in 7.906 s due to Stage cancelled because SparkContext was shut down 
17/09/19 18:11:10 ERROR LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerStageCompleted([email protected]) 
17/09/19 18:11:10 ERROR LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerJobEnd(0,1505824870317,JobFailed(org.apache.spark.SparkException: Job 0 cancelled because SparkContext was shut down)) 
17/09/19 18:11:10 INFO StandaloneSchedulerBackend: Shutting down all executors 
17/09/19 18:11:10 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asking each executor to shut down 
17/09/19 18:11:10 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped! 
17/09/19 18:11:10 INFO MemoryStore: MemoryStore cleared 
17/09/19 18:11:10 INFO BlockManager: BlockManager stopped 
17/09/19 18:11:10 INFO BlockManagerMaster: BlockManagerMaster stopped 
17/09/19 18:11:10 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped! 
17/09/19 18:11:10 INFO SparkContext: Successfully stopped SparkContext 
17/09/19 18:11:10 INFO ShutdownHookManager: Shutdown hook called 
17/09/19 18:11:10 INFO ShutdownHookManager: Deleting directory /tmp/spark-35ea0cd4-4b78-408b-8c3a-9966c1f84763/pyspark-b73e541b-1182-4449-96bc-26eabca1803d 
17/09/19 18:11:10 INFO ShutdownHookManager: Deleting directory /tmp/spark-35ea0cd4-4b78-408b-8c3a-9966c1f84763 

In the resulting file on HDFS the first 1000 items are saved, but as it keeps going it throws the error above. What is the problem?
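For reference, a minimal sketch of the setup described above, assuming a plain Python loop drives the batches; classify, warcrecords, repartition(72) and saveAsTextFile come from the traceback, while load_records, the output path, and the loop itself are hypothetical:

# Sketch of the batching workflow described in the question.
# load_records() and the output path are assumptions for illustration.
from pyspark import SparkContext

sc = SparkContext(appName="boilerpipe-batches")

records = load_records()   # hypothetical helper returning the full list of ~50k items
batch_size = 1000

for start in range(0, len(records), batch_size):
    warcrecords = records[start:start + batch_size]
    file_name = "hdfs:///output/batch_%d" % start   # hypothetical output path
    sc.parallelize(warcrecords) \
        .repartition(72) \
        .map(lambda s: classify(s)) \
        .saveAsTextFile(file_name)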


Is that the complete traceback? –


the part where the error began and ended –


The last line: 'An error occurred while calling o40.saveAsTextFile'. Go and find the error. Look at the Spark History Server executor/driver logs –

Answer


Removing this line from the code did the trick. I still don't know why.

from boilerpipe.extract import Extractor
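If the extractor is still needed, one possible workaround (a sketch only, not the original code) is to import boilerpipe lazily inside the function that runs on the executors instead of at module level in the driver script. The body of classify below is an assumption for illustration:

# Sketch: lazy import of boilerpipe inside the executor-side function.
# The original classify implementation is not shown in the question.
def classify(html):
    from boilerpipe.extract import Extractor   # imported on the executor, not the driver
    extractor = Extractor(extractor='ArticleExtractor', html=html)
    return extractor.getText()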