Ich benutze Kesselpipe, um Text aus HTML zu bekommen. Es gibt jedoch ein Problem, das ich nicht lösen konnte. Ich habe eine Liste von 50k Elementen. Ich erstelle eine Menge von 1000 Elementen und bearbeite sie dann und speichere die resultierende RDD in hdfs. Der Fehler, den ich erlebt habe, ist dies:mit Kesselpipe mit pyspark
ERROR:root:Exception while sending command.
Traceback (most recent call last):
File "/home/hadoopuser/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 883, in send_command
response = connection.send_command(command)
File "/home/hadoopuser/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1040, in send_command
"Error while receiving", e, proto.ERROR_ON_RECEIVE)
Py4JNetworkError: Error while receiving
Traceback (most recent call last):
File "/home/hadoopuser/CommonCrawl_Spark/CommonCrawl_Spark/all.py", line 265, in <module>
x = get_data(line[:-1],c)
File "/home/hadoopuser/CommonCrawl_Spark/CommonCrawl_Spark/all.py", line 208, in get_data
sc.parallelize(warcrecords).repartition(72).map(lambda s: classify(s)).saveAsTextFile(file_name)
File "/home/hadoopuser/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1552, in saveAsTextFile
File "/home/hadoopuser/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
File "/home/hadoopuser/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 327, in get_return_value
py4j.protocol.Py4JError: An error occurred while calling o40.saveAsTextFile
17/09/19 18:11:10 INFO SparkContext: Invoking stop() from shutdown hook
17/09/19 18:11:10 INFO SparkUI: Stopped Spark web UI at http://192.168.0.255:4040
17/09/19 18:11:10 INFO DAGScheduler: Job 0 failed: saveAsTextFile at NativeMethodAccessorImpl.java:0, took 14.746797 s
17/09/19 18:11:10 INFO DAGScheduler: ResultStage 1 (saveAsTextFile at NativeMethodAccessorImpl.java:0) failed in 7.906 s due to Stage cancelled because SparkContext was shut down
17/09/19 18:11:10 ERROR LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerStageCompleted([email protected])
17/09/19 18:11:10 ERROR LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerJobEnd(0,1505824870317,JobFailed(org.apache.spark.SparkException: Job 0 cancelled because SparkContext was shut down))
17/09/19 18:11:10 INFO StandaloneSchedulerBackend: Shutting down all executors
17/09/19 18:11:10 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asking each executor to shut down
17/09/19 18:11:10 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
17/09/19 18:11:10 INFO MemoryStore: MemoryStore cleared
17/09/19 18:11:10 INFO BlockManager: BlockManager stopped
17/09/19 18:11:10 INFO BlockManagerMaster: BlockManagerMaster stopped
17/09/19 18:11:10 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
17/09/19 18:11:10 INFO SparkContext: Successfully stopped SparkContext
17/09/19 18:11:10 INFO ShutdownHookManager: Shutdown hook called
17/09/19 18:11:10 INFO ShutdownHookManager: Deleting directory /tmp/spark-35ea0cd4-4b78-408b-8c3a-9966c1f84763/pyspark-b73e541b-1182-4449-96bc-26eabca1803d
17/09/19 18:11:10 INFO ShutdownHookManager: Deleting directory /tmp/spark-35ea0cd4-4b78-408b-8c3a-9966c1f84763
In der hdfs Datei Resultierende ersten 1000 Elemente werden gespeichert, sondern gehen weiter wirft sie den obigen Fehler. Was ist das Problem?
Ist das der vollständige Traceback? –
der Teil, wo der Fehler begann und endete –
Die letzte Zeile. 'Beim Aufruf ist ein Fehler aufgetreten. Geh und finde den Fehler. Sehen Sie sich die Spark History Server Executor/Treiber Protokolle –