Hadoop Streaming API with a Python mapper script: file not found

I have the same problem as the user in this question: Hadoop Streaming - Unable to find file error

As in this other question (hadoop streaming with python modules), I am using a zip file that contains additional Python code which I import from my mapper. In the script file posted below, you can see the zip file on line 21; it is referenced in the call to the Hadoop Streaming API JAR file on line 26. I am not using a pickle file, as the StackOverflow question mentioned above does.
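For context, Python can import pure-Python modules directly from a zip archive once the archive is on sys.path, which is presumably how a mapper would pick up the code in pyimagesearch.zip. The following is a minimal, self-contained sketch of that mechanism and not the actual mapper: the archive name demo_modules.zip, the module helper_mod, the function describe, and the sample input line are all hypothetical.

```python
import importlib
import os
import sys
import tempfile
import zipfile


def build_demo_zip(directory):
    """Create a small zip archive holding one hypothetical module."""
    zip_path = os.path.join(directory, "demo_modules.zip")
    with zipfile.ZipFile(zip_path, "w") as archive:
        archive.writestr(
            "helper_mod.py",
            "def describe(line):\n    return line.strip().upper()\n",
        )
    return zip_path


def import_from_zip(zip_path, module_name):
    """Put the archive on sys.path and import a module from it."""
    if zip_path not in sys.path:
        sys.path.insert(0, zip_path)
    importlib.invalidate_caches()
    return importlib.import_module(module_name)


with tempfile.TemporaryDirectory() as tmp:
    demo_zip = build_demo_zip(tmp)
    helper = import_from_zip(demo_zip, "helper_mod")
    # Hypothetical input line, standing in for one record read from stdin.
    result = helper.describe("  ukbench00000.jpg\n")
    print(result)
```

Inside a Hadoop task, the archive would be referenced by its bare file name rather than by a temporary path, assuming it was shipped with -files.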

I decided to post my problem in a new thread, with additional details that did not seem appropriate for a comment on that page.

The Hadoop Streaming API throws a Java file-not-found exception when I run my script. The interesting part is that it works in pseudo-distributed mode, but it does not work on a cluster of several nodes (I have a 4-node cluster on AWS).

I have rwx permissions on the mapper file, and deploy.sh, which is called on line 7 below, sets rwx permissions on the zip file that it also generates.
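A quick way to double-check this before submitting the job is sketched below. It assumes the mapper is meant to be executed directly by the operating system, which requires an execute bit and a shebang line. The helper check_mapper_script and the file demo_mapper.py are hypothetical names used only for this illustration, and the check assumes a Unix-like system.

```python
import os
import stat
import tempfile


def check_mapper_script(path):
    """List problems that would stop the OS from executing the script directly."""
    if not os.path.isfile(path):
        return ["missing file: " + path]
    problems = []
    # The current user needs execute permission on the file.
    if not os.access(path, os.X_OK):
        problems.append("not executable")
    # Without a shebang the kernel does not know which interpreter to start.
    with open(path, "rb") as handle:
        if not handle.readline().startswith(b"#!"):
            problems.append("no shebang line")
    return problems


with tempfile.TemporaryDirectory() as tmp:
    script = os.path.join(tmp, "demo_mapper.py")
    with open(script, "w") as handle:
        handle.write("#!/usr/bin/env python\nimport sys\n")
    os.chmod(script, 0o644)          # readable and writable, not executable
    before = check_mapper_script(script)
    os.chmod(script, stat.S_IRWXU)   # rwx for the owner
    after = check_mapper_script(script)

print(before, after)
```

Note that this only verifies the file on the machine that submits the job; it says nothing about whether the path exists on the worker nodes.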

Is something wrong in my call to the Hadoop Streaming API, or does the problem lie somewhere in my Python code? (Note that the code is from http://gurus.pyimagesearch.com and that I have tested it successfully in pseudo-distributed mode.)

Here is the script file that I am running:

1 #!/bin/sh 
2 
3 # grab the current working directory 
4 BASE=$(pwd) 
5 
6 # create the latest deployable package 
7 sbin/deploy.sh 
8 
9 # change directory to where Hadoop lives 
10 cd $HADOOP_HOME 
11 
12 # (potentially optional): turn off safe mode 
13 bin/hdfs dfsadmin -safemode leave 
14 
15 # remove the previous output directory 
16 bin/hdfs dfs -rm -r /user/ubuntu/ukbench/output 
17 
18 # define the set of local files that need to be present to run the Hadoop 
19 # job -- comma separate each file path 
20 FILES="${BASE}/feature_extractor_mapper.py,\ 
21 ${BASE}/deploy/pyimagesearch.zip" 
22 
23 # run the job on Hadoop 
24 bin/hadoop jar share/hadoop/tools/lib/hadoop-streaming-*.jar \ 
25  -D mapreduce.job.reduces=0 \ 
26  -files ${FILES} \ 
27  -mapper ${BASE}/feature_extractor_mapper.py \ 
28  -input /user/ubuntu/ukbench/input/ukbench_dataset.txt \ 
29  -output /user/ubuntu/ukbench/output

And this is the stack trace from running the script:

[email protected]:~/high_throughput_feature_extraction$ jobs/feature_extractor_mapper.sh 
Safe mode is OFF 
17/02/08 18:10:46 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 0 minutes, Emptier interval = 0 minutes. 
Deleted /user/ubuntu/ukbench/output 
packageJobJar: [/tmp/hadoop-unjar2327603386373063535/] [] /tmp/streamjob380494102161319103.jar tmpDir=null 
17/02/08 18:10:48 INFO client.RMProxy: Connecting to ResourceManager at *I REMOVED THIS* 
17/02/08 18:10:48 INFO client.RMProxy: Connecting to ResourceManager at *I REMOVED THIS* 
17/02/08 18:10:49 INFO mapred.FileInputFormat: Total input paths to process : 1 
17/02/08 18:10:49 INFO mapreduce.JobSubmitter: number of splits:10 
17/02/08 18:10:49 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1486574928548_0004 
17/02/08 18:10:50 INFO impl.YarnClientImpl: Submitted application application_1486574928548_0004 
17/02/08 18:10:50 INFO mapreduce.Job: The url to track the job: http://*I REMOVED THIS*.compute.amazonaws.com:8088/proxy/application_1486574928548_0004/ 
17/02/08 18:10:50 INFO mapreduce.Job: Running job: job_1486574928548_0004 
17/02/08 18:10:57 INFO mapreduce.Job: Job job_1486574928548_0004 running in uber mode : false 
17/02/08 18:10:57 INFO mapreduce.Job: map 0% reduce 0% 
17/02/08 18:11:12 INFO mapreduce.Job: Task Id : attempt_1486574928548_0004_m_000009_0, Status : FAILED 
Error: java.lang.RuntimeException: Error in configuring object 
     at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:112) 
     at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:78) 
     at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:136) 
     at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:449) 
     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343) 
     at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164) 
     at java.security.AccessController.doPrivileged(Native Method) 
     at javax.security.auth.Subject.doAs(Subject.java:422) 
     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657) 
     at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158) 
Caused by: java.lang.reflect.InvocationTargetException 
     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) 
     at java.lang.reflect.Method.invoke(Method.java:498) 
     at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:109) 
     ... 9 more 
Caused by: java.lang.RuntimeException: Error in configuring object 
     at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:112) 
     at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:78) 
     at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:136) 
     at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:38) 
     ... 14 more 
Caused by: java.lang.reflect.InvocationTargetException 
     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) 
     at java.lang.reflect.Method.invoke(Method.java:498) 
     at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:109) 
     ... 17 more 
Caused by: java.lang.RuntimeException: configuration exception 
     at org.apache.hadoop.streaming.PipeMapRed.configure(PipeMapRed.java:222) 
     at org.apache.hadoop.streaming.PipeMapper.configure(PipeMapper.java:66) 
     ... 22 more 
Caused by: java.io.IOException: Cannot run program "/home/ubuntu/high_throughput_feature_extraction/feature_extractor_mapper.py": error=2, No such file or directory 
     at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048) 
     at org.apache.hadoop.streaming.PipeMapRed.configure(PipeMapRed.java:209) 
     ... 23 more 
Caused by: java.io.IOException: error=2, No such file or directory 
     at java.lang.UNIXProcess.forkAndExec(Native Method) 
     at java.lang.UNIXProcess.<init>(UNIXProcess.java:247) 
     at java.lang.ProcessImpl.start(ProcessImpl.java:134) 
     at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029) 
     ... 24 more 

Container killed by the ApplicationMaster. 
Container killed on request. Exit code is 143 
Container exited with a non-zero exit code 143


2017-02-08 drhoffma

I figured out the solution to my own problem, which had been plaguing me for quite some time.

For some reason, on line 27 below, it does not like the full path, and it does like the Python script name in quotation marks.

I made a few other changes as well. Here is a summary of all of them:

- Comment out line 10, which changes into the Hadoop installation directory.
- Remove the full path references on lines 20 and 21 (since I am no longer in the Hadoop directory; see the previous bullet).
- Reference the $HADOOP_HOME directory on line 24. If you are on Cloudera, the path to your streaming JAR file will be different, so keep that in mind.
- Line 27: remove the full path, since I am in the directory where this file lives, and also put the .py file name in quotation marks.

I hope this helps other people!

1 #!/bin/sh 
2 
3 # grab the current working directory 
4 BASE=$(pwd) 
5 
6 # create the latest deployable package 
7 sbin/deploy.sh 
8 
9 # change directory to where Hadoop lives 
10 #cd $HADOOP_HOME 
11 
12 # (potentially optional): turn off safe mode 
13 hdfs dfsadmin -safemode leave 
14 
15 # remove the previous output directory 
16 hdfs dfs -rm -r /user/ubuntu/ukbench/output 
17 
18 # define the set of local files that need to be present to run the Hadoop 
19 # job -- comma separate each file path 
20 FILES="feature_extractor_mapper.py,\ 
21 deploy/pyimagesearch.zip" 
22 
23 # run the job on Hadoop 
24 ${HADOOP_HOME}/bin/hadoop jar ${HADOOP_HOME}/share/hadoop/tools/lib/hadoop-streaming-*.jar \ 
25  -D mapreduce.job.reduces=0 \ 
26  -files ${FILES} \ 
27  -mapper "feature_extractor_mapper.py" \ 
28  -input /user/ubuntu/ukbench/input/ukbench_dataset.txt \ 
29  -output /user/ubuntu/ukbench/output
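A plausible explanation, consistent with the stack trace above: files passed with -files are made available in each task's working directory under their base name, so an absolute path taken from the submitting machine does not exist on the other nodes. In pseudo-distributed mode everything runs on one machine, so that absolute path happens to exist. The sketch below mimics this rule as a pre-submission check. The helpers shipped_names and mapper_resolves_on_workers are hypothetical, not part of Hadoop, and the sketch assumes that no "#" link names are used in the -files list.

```python
import posixpath
import shlex


def shipped_names(files_option):
    """Base names under which the -files entries appear in a task's working directory."""
    entries = (entry.strip() for entry in files_option.split(","))
    return {posixpath.basename(entry) for entry in entries if entry}


def mapper_resolves_on_workers(mapper_command, files_option):
    """True if the mapper executable is a bare name that -files ships."""
    executable = shlex.split(mapper_command)[0]
    return "/" not in executable and executable in shipped_names(files_option)


# Path taken from the stack trace in the question.
base = "/home/ubuntu/high_throughput_feature_extraction"

original_ok = mapper_resolves_on_workers(
    base + "/feature_extractor_mapper.py",
    base + "/feature_extractor_mapper.py," + base + "/deploy/pyimagesearch.zip",
)
fixed_ok = mapper_resolves_on_workers(
    "feature_extractor_mapper.py",
    "feature_extractor_mapper.py,deploy/pyimagesearch.zip",
)
print(original_ok, fixed_ok)
```

The first call corresponds to the -mapper and FILES values of the original script, the second to the values of the corrected one.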


2017-02-08 21:22:35 drhoffma
