Loading data from HDFS - Spark Scala

I have a self-contained application built with SBT, and I want to load my data from HDFS. I used this command:

val loadfiles1 = sc.textFile("hdfs:///tmp/MySimpleProject/file1.dat")

but it produces the following error:

[error] (run-main-0) java.io.IOException: Incomplete HDFS URI, no host: hdfs:/tmp/MyProjectSpark/file1.dat 
java.io.IOException: Incomplete HDFS URI, no host: hdfs:/tmp/MyProjectSpark/file1.dat 
     at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:133) 
     at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2433) 
     at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:88) 
     at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2467) 
     at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2449) 
     at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:367) 
     at org.apache.hadoop.fs.Path.getFileSystem(Path.java:287) 
     at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:221) 
     at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:270) 
     at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199) 
     at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248) 
     at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246) 
     at scala.Option.getOrElse(Option.scala:120) 
     at org.apache.spark.rdd.RDD.partitions(RDD.scala:246) 
     at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) 
     at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248) 
     at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246) 
     at scala.Option.getOrElse(Option.scala:120) 
     at org.apache.spark.rdd.RDD.partitions(RDD.scala:246) 
     at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) 
     at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:248) 
     at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:246) 
     at scala.Option.getOrElse(Option.scala:120) 
     at org.apache.spark.rdd.RDD.partitions(RDD.scala:246) 
     at org.apache.spark.SparkContext.runJob(SparkContext.scala:1930) 
     at org.apache.spark.rdd.RDD.count(RDD.scala:1134) 
     at app$.main(App.scala:33) 
     at app.main(App.scala) 
     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) 
     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) 
     at java.lang.reflect.Method.invoke(Method.java:606) 
[trace] Stack trace suppressed: run last compile:run for the full output. 
16/12/23 15:19:16 ERROR ContextCleaner: Error in cleaning thread 
java.lang.InterruptedException 
     at java.lang.Object.wait(Native Method) 
     at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:135) 
     at org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply$mcV$sp(ContextCleaner.scala:175) 
     at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1249) 
     at org.apache.spark.ContextCleaner.org$apache$spark$ContextCleaner$$keepCleaning(ContextCleaner.scala:172) 
     at org.apache.spark.ContextCleaner$$anon$1.run(ContextCleaner.scala:67) 
16/12/23 15:19:16 ERROR Utils: uncaught error in thread SparkListenerBus, stopping SparkContext 
java.lang.InterruptedException 
     at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:996) 
     at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1303) 
     at java.util.concurrent.Semaphore.acquire(Semaphore.java:317) 
     at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(LiveListenerBus.scala:80) 
     at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:79) 
     at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(LiveListenerBus.scala:79) 
     at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58) 
     at org.apache.spark.scheduler.LiveListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(LiveListenerBus.scala:78) 
     at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1249) 
     at org.apache.spark.scheduler.LiveListenerBus$$anon$1.run(LiveListenerBus.scala:77) 
16/12/23 15:19:16 INFO SparkUI: Stopped Spark web UI at http://10.0.2.15:4040 
java.lang.RuntimeException: Nonzero exit code: 1 
     at scala.sys.package$.error(package.scala:27) 
[trace] Stack trace suppressed: run last compile:run for the full output. 
[error] (compile:run) Nonzero exit code: 1 
[error] Total time: 10 s, completed Dec 23, 2016 3:19:17 PM 
16/12/23 15:19:17 INFO DiskBlockManager: Shutdown hook called 
16/12/23 15:19:17 INFO ShutdownHookManager: Shutdown hook called 
16/12/23 15:19:17 INFO ShutdownHookManager: Deleting directory /tmp/spark-515b242b-7450-4215-9831-8e6976cb41ba 
16/12/23 15:19:17 INFO ShutdownHookManager: Deleting directory /tmp/spark-515b242b-7450-4215-9831-8e6976cb41ba/userFiles-ee18e822-55c7-4613-b3f7-03e5a4c896e1

Why all these errors? I only want to load a file from HDFS. The SparkContext configuration is as follows:

val conf = new SparkConf().setAppName("My first project hadoop spark").setMaster("local[4]") 
val sc = new SparkContext(conf)

And the HDFS configuration in the file core-site.xml is as follows:

<property>
  <name>fs.defaultFS</name>
  <value>hdfs://sandbox.hortonworks.com:8020</value>
  <final>true</final>
</property>

Thanks.

2016-12-23 Alicia

Do you need 3 slashes at the start (or only 2)? '///'
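To illustrate the slash question, here is a minimal sketch using only `java.net.URI` from the JDK (no Spark or Hadoop involved). With three slashes the authority part between `//` and the next `/` is empty, so the parsed URI has no host, which matches the "no host" wording of the error. The object name `HdfsUriCheck`, the helper `hostAndPort`, and the `localhost:8020` address are illustrative assumptions, not part of the original application.

```scala
import java.net.URI

object HdfsUriCheck {
  // Parse a URI string and return (host, port) as java.net.URI sees them.
  // The host is None when the authority part is empty; the port is -1 when unset.
  def hostAndPort(s: String): (Option[String], Int) = {
    val u = new URI(s)
    (Option(u.getHost), u.getPort)
  }

  def main(args: Array[String]): Unit = {
    // Three slashes: scheme, empty authority, then the path.
    println(hostAndPort("hdfs:///tmp/MySimpleProject/file1.dat"))
    // prints (None,-1)

    // Two slashes followed by host:port (localhost is an assumed NameNode address).
    println(hostAndPort("hdfs://localhost:8020/tmp/MySimpleProject/file1.dat"))
    // prints (Some(localhost),8020)
  }
}
```

So three slashes are valid URI syntax, but they only work when something else supplies the default filesystem host.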

The stack trace says it clearly:

Incomplete HDFS URI, no host: hdfs:/tmp/MyProjectSpark/file1.dat

Specify the HDFS NameNode host and, optionally, the port (the default is 8020; specify it if yours differs).

Something like this (assuming localhost is your NameNode):

hdfs://localhost:8020/tmp/MyProjectSpark/file1.dat
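Applied to the application from the question, this could look roughly like the sketch below. It assumes the NameNode address is the `fs.defaultFS` value quoted in the question (`hdfs://sandbox.hortonworks.com:8020`) and that the file lives at the path used in the question's `textFile` call. The object name `LoadFromHdfs` and the variable names are made up for illustration. It needs a reachable HDFS to actually run.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LoadFromHdfs {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("My first project hadoop spark")
      .setMaster("local[4]")
    val sc = new SparkContext(conf)

    // Assumption: NameNode host and port taken from fs.defaultFS in the question.
    val nameNode = "hdfs://sandbox.hortonworks.com:8020"
    // Assumption: the path from the question's textFile call.
    val filePath = "/tmp/MySimpleProject/file1.dat"

    // Option 1: pass a fully qualified URI that includes host and port.
    val loadfiles1 = sc.textFile(nameNode + filePath)
    println(loadfiles1.count())

    // Option 2: set the default filesystem on the context's Hadoop configuration,
    // so that a path without scheme and host resolves against the NameNode.
    sc.hadoopConfiguration.set("fs.defaultFS", nameNode)
    val loadfiles2 = sc.textFile(filePath)
    println(loadfiles2.count())

    sc.stop()
  }
}
```

Option 1 is the direct form of the advice above. Option 2 is an alternative that may be handy when the application is started from SBT and the cluster's configuration files are not on its classpath, which is one possible reason the `fs.defaultFS` setting was not picked up.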

2016-12-23 16:07:17 code

This was already [answered](http://stackoverflow.com/a/32197116/647053) in the linked comment. This is a duplicate.

Thanks. Can you tell me how I can find out how many servers my application is running on, and how I can access or view them? I learned that there is a Spark UI (user interface) at http://:4040 where I can change the memory size of the executors and the driver and see my cluster. I tried the URL http://127.0.0.1:4040, but the page is not accessible. How can I find out which node is my driver? Thank you. – Alicia

In the Spark UI you can see all the executors used by your job. The Executor ID/host column shows which node/server each executor runs on. Go to the Executors tab at the top (the last tab); the last entry tells you where the driver is running. – code
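The same information can also be printed from inside the application, which may help when the UI is not reachable. The sketch below reuses the configuration from the question. The object name `WhereIsMyDriver` is an illustrative assumption. Note that the UI on port 4040 is only served while the application is running. The log in the question shows it being stopped at http://10.0.2.15:4040 when the job failed, which may be why http://127.0.0.1:4040 did not respond.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WhereIsMyDriver {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("My first project hadoop spark")
      .setMaster("local[4]")
    val sc = new SparkContext(conf)

    // Host name the driver is registered under, if present in the configuration.
    println(sc.getConf.getOption("spark.driver.host"))

    // One "host:port" entry per block manager known to the driver.
    // This list includes the driver itself.
    sc.getExecutorMemoryStatus.keys.foreach(println)

    sc.stop()
  }
}
```

With `setMaster("local[4]")` everything runs inside a single JVM on one machine, so only one entry should appear. On a real cluster the list would contain one entry per executor plus the driver.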
