
toString the data Pyspark Dataframe

I am trying to do some regex operations on a column. To do the basic lowercase operation, I did as given below:

df.select('name').map(lambda x: x.lower()) 

Here df is a DataFrame, and the operation throws an exception when I call the collect() operation.

Why does this command throw an exception when collecting the resulting PipelinedRDD?

Am I missing something?

The exception is quite long:

17/07/07 13:51:41 INFO SparkContext: Starting job: collect at <stdin>:1 
17/07/07 13:51:41 INFO DAGScheduler: Got job 55 (collect at <stdin>:1) with 4 output partitions 
17/07/07 13:51:41 INFO DAGScheduler: Final stage: ResultStage 63 (collect at <stdin>:1) 
17/07/07 13:51:41 INFO DAGScheduler: Parents of final stage: List() 
17/07/07 13:51:41 INFO DAGScheduler: Missing parents: List() 
17/07/07 13:51:41 INFO DAGScheduler: Submitting ResultStage 63 (PythonRDD[195] at collect at <stdin>:1), which has no missing parents 
17/07/07 13:51:41 INFO MemoryStore: Block broadcast_61 stored as values in memory (estimated size 11.7 KB, free 11.7 KB) 
17/07/07 13:51:41 INFO MemoryStore: Block broadcast_61_piece0 stored as bytes in memory (estimated size 6.7 KB, free 18.3 KB) 
17/07/07 13:51:41 INFO BlockManagerInfo: Added broadcast_61_piece0 in memory on localhost:54574 (size: 6.7 KB, free: 511.1 MB) 
17/07/07 13:51:41 INFO SparkContext: Created broadcast 61 from broadcast at DAGScheduler.scala:1006 
17/07/07 13:51:41 INFO DAGScheduler: Submitting 4 missing tasks from ResultStage 63 (PythonRDD[195] at collect at <stdin>:1) 
17/07/07 13:51:41 INFO TaskSchedulerImpl: Adding task set 63.0 with 4 tasks 
17/07/07 13:51:41 INFO TaskSetManager: Starting task 0.0 in stage 63.0 (TID 744, localhost, partition 0,PROCESS_LOCAL, 2097 bytes) 
17/07/07 13:51:41 INFO TaskSetManager: Starting task 1.0 in stage 63.0 (TID 745, localhost, partition 1,PROCESS_LOCAL, 2131 bytes) 
17/07/07 13:51:41 INFO TaskSetManager: Starting task 2.0 in stage 63.0 (TID 746, localhost, partition 2,PROCESS_LOCAL, 2132 bytes) 
17/07/07 13:51:41 INFO TaskSetManager: Starting task 3.0 in stage 63.0 (TID 747, localhost, partition 3,PROCESS_LOCAL, 2131 bytes) 
17/07/07 13:51:41 INFO Executor: Running task 0.0 in stage 63.0 (TID 744) 
17/07/07 13:51:41 INFO Executor: Running task 1.0 in stage 63.0 (TID 745) 
17/07/07 13:51:41 INFO Executor: Running task 2.0 in stage 63.0 (TID 746) 
17/07/07 13:51:41 INFO Executor: Running task 3.0 in stage 63.0 (TID 747) 
17/07/07 13:51:41 INFO PythonRunner: Times: total = 22, boot = 14, init = 7, finish = 1 
17/07/07 13:51:41 INFO PythonRunner: Times: total = 11, boot = 7, init = 3, finish = 1 
17/07/07 13:51:41 INFO PythonRunner: Times: total = 8, boot = 4, init = 3, finish = 1 
17/07/07 13:51:41 INFO PythonRunner: Times: total = 56, boot = 30, init = 26, finish = 0 
17/07/07 13:51:41 ERROR Executor: Exception in task 2.0 in stage 63.0 (TID 746) 
org.apache.spark.api.python.PythonException: Traceback (most recent call last): 
    File "/usr/local/spark/python/lib/pyspark.zip/pyspark/worker.py", line 111, in main 
    process() 
    File "/usr/local/spark/python/lib/pyspark.zip/pyspark/worker.py", line 106, in process 
    serializer.dump_stream(func(split_index, iterator), outfile) 
    File "/usr/local/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 263, in dump_stream 
    vs = list(itertools.islice(iterator, batch)) 
    File "<stdin>", line 1, in <lambda> 
    File "/usr/local/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 1272, in __getattr__ 
    raise AttributeError(item) 
AttributeError: lower 

    at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166) 
    at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207) 
    at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125) 
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70) 
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) 
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) 
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) 
    at org.apache.spark.scheduler.Task.run(Task.scala:89) 
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213) 
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) 
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) 
    at java.lang.Thread.run(Thread.java:745) 
17/07/07 13:51:41 ERROR Executor: Exception in task 3.0 in stage 63.0 (TID 747) 
org.apache.spark.api.python.PythonException: Traceback (most recent call last): 
    File "/usr/local/spark/python/lib/pyspark.zip/pyspark/worker.py", line 111, in main 
    process() 
    File "/usr/local/spark/python/lib/pyspark.zip/pyspark/worker.py", line 106, in process 
    serializer.dump_stream(func(split_index, iterator), outfile) 
    File "/usr/local/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 263, in dump_stream 
    vs = list(itertools.islice(iterator, batch)) 
    File "<stdin>", line 1, in <lambda> 
    File "/usr/local/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 1272, in __getattr__ 
    raise AttributeError(item) 
AttributeError: lower 

    at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166) 
    at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207) 
    at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125) 
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70) 
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) 
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) 
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) 
    at org.apache.spark.scheduler.Task.run(Task.scala:89) 
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213) 
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) 
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) 
    at java.lang.Thread.run(Thread.java:745) 
17/07/07 13:51:41 ERROR Executor: Exception in task 0.0 in stage 63.0 (TID 744) 
org.apache.spark.api.python.PythonException: Traceback (most recent call last): 
    File "/usr/local/spark/python/lib/pyspark.zip/pyspark/worker.py", line 111, in main 
    process() 
    File "/usr/local/spark/python/lib/pyspark.zip/pyspark/worker.py", line 106, in process 
    serializer.dump_stream(func(split_index, iterator), outfile) 
    File "/usr/local/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 263, in dump_stream 
    vs = list(itertools.islice(iterator, batch)) 
    File "<stdin>", line 1, in <lambda> 
    File "/usr/local/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 1272, in __getattr__ 
    raise AttributeError(item) 
AttributeError: lower 

    at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166) 
    at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207) 
    at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125) 
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70) 
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) 
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) 
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) 
    at org.apache.spark.scheduler.Task.run(Task.scala:89) 
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213) 
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) 
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) 
    at java.lang.Thread.run(Thread.java:745) 
17/07/07 13:51:41 ERROR Executor: Exception in task 1.0 in stage 63.0 (TID 745) 
org.apache.spark.api.python.PythonException: Traceback (most recent call last): 
    File "/usr/local/spark/python/lib/pyspark.zip/pyspark/worker.py", line 111, in main 
    process() 
    File "/usr/local/spark/python/lib/pyspark.zip/pyspark/worker.py", line 106, in process 
    serializer.dump_stream(func(split_index, iterator), outfile) 
    File "/usr/local/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 263, in dump_stream 
    vs = list(itertools.islice(iterator, batch)) 
    File "<stdin>", line 1, in <lambda> 
    File "/usr/local/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 1272, in __getattr__ 
    raise AttributeError(item) 
AttributeError: lower 

    at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166) 
    at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207) 
    at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125) 
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70) 
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) 
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) 
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) 
    at org.apache.spark.scheduler.Task.run(Task.scala:89) 
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213) 
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) 
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) 
    at java.lang.Thread.run(Thread.java:745) 
17/07/07 13:51:41 WARN TaskSetManager: Lost task 1.0 in stage 63.0 (TID 745, localhost): org.apache.spark.api.python.PythonException: Traceback (most recent call last): 
    File "/usr/local/spark/python/lib/pyspark.zip/pyspark/worker.py", line 111, in main 
    process() 
    File "/usr/local/spark/python/lib/pyspark.zip/pyspark/worker.py", line 106, in process 
    serializer.dump_stream(func(split_index, iterator), outfile) 
    File "/usr/local/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 263, in dump_stream 
    vs = list(itertools.islice(iterator, batch)) 
    File "<stdin>", line 1, in <lambda> 
    File "/usr/local/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 1272, in __getattr__ 
    raise AttributeError(item) 
AttributeError: lower 

    at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166) 
    at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207) 
    at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125) 
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70) 
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) 
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) 
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) 
    at org.apache.spark.scheduler.Task.run(Task.scala:89) 
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213) 
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) 
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) 
    at java.lang.Thread.run(Thread.java:745) 

17/07/07 13:51:41 ERROR TaskSetManager: Task 1 in stage 63.0 failed 1 times; aborting job 
17/07/07 13:51:41 INFO TaskSchedulerImpl: Removed TaskSet 63.0, whose tasks have all completed, from pool 
17/07/07 13:51:41 INFO TaskSetManager: Lost task 0.0 in stage 63.0 (TID 744) on executor localhost: org.apache.spark.api.python.PythonException (Traceback (most recent call last): 
    File "/usr/local/spark/python/lib/pyspark.zip/pyspark/worker.py", line 111, in main 
    process() 
    File "/usr/local/spark/python/lib/pyspark.zip/pyspark/worker.py", line 106, in process 
    serializer.dump_stream(func(split_index, iterator), outfile) 
    File "/usr/local/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 263, in dump_stream 
    vs = list(itertools.islice(iterator, batch)) 
    File "<stdin>", line 1, in <lambda> 
    File "/usr/local/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 1272, in __getattr__ 
    raise AttributeError(item) 
AttributeError: lower 
) [duplicate 1] 
17/07/07 13:51:41 INFO TaskSchedulerImpl: Removed TaskSet 63.0, whose tasks have all completed, from pool 
17/07/07 13:51:41 INFO TaskSetManager: Lost task 3.0 in stage 63.0 (TID 747) on executor localhost: org.apache.spark.api.python.PythonException (Traceback (most recent call last): 
    File "/usr/local/spark/python/lib/pyspark.zip/pyspark/worker.py", line 111, in main 
    process() 
    File "/usr/local/spark/python/lib/pyspark.zip/pyspark/worker.py", line 106, in process 
    serializer.dump_stream(func(split_index, iterator), outfile) 
    File "/usr/local/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 263, in dump_stream 
    vs = list(itertools.islice(iterator, batch)) 
    File "<stdin>", line 1, in <lambda> 
    File "/usr/local/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 1272, in __getattr__ 
    raise AttributeError(item) 
AttributeError: lower 
) [duplicate 2] 
17/07/07 13:51:41 INFO TaskSchedulerImpl: Removed TaskSet 63.0, whose tasks have all completed, from pool 
17/07/07 13:51:41 INFO TaskSetManager: Lost task 2.0 in stage 63.0 (TID 746) on executor localhost: org.apache.spark.api.python.PythonException (Traceback (most recent call last): 
    File "/usr/local/spark/python/lib/pyspark.zip/pyspark/worker.py", line 111, in main 
    process() 
    File "/usr/local/spark/python/lib/pyspark.zip/pyspark/worker.py", line 106, in process 
    serializer.dump_stream(func(split_index, iterator), outfile) 
    File "/usr/local/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 263, in dump_stream 
    vs = list(itertools.islice(iterator, batch)) 
    File "<stdin>", line 1, in <lambda> 
    File "/usr/local/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 1272, in __getattr__ 
    raise AttributeError(item) 
AttributeError: lower 
) [duplicate 3] 
17/07/07 13:51:41 INFO TaskSchedulerImpl: Removed TaskSet 63.0, whose tasks have all completed, from pool 
17/07/07 13:51:41 INFO TaskSchedulerImpl: Cancelling stage 63 
17/07/07 13:51:41 INFO DAGScheduler: ResultStage 63 (collect at <stdin>:1) failed in 0.114 s 
17/07/07 13:51:41 INFO DAGScheduler: Job 55 failed: collect at <stdin>:1, took 0.120849 s 
Traceback (most recent call last): 
    File "<stdin>", line 1, in <module> 
    File "/usr/local/spark/python/pyspark/rdd.py", line 771, in collect 
    port = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd()) 
    File "/usr/local/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in __call__ 
    File "/usr/local/spark/python/pyspark/sql/utils.py", line 45, in deco 
    return f(*a, **kw) 
    File "/usr/local/spark/python/lib/py4j-0.9-src.zip/py4j/protocol.py", line 308, in get_return_value 
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe. 
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 63.0 failed 1 times, most recent failure: Lost task 1.0 in stage 63.0 (TID 745, localhost): org.apache.spark.api.python.PythonException: Traceback (most recent call last): 
    File "/usr/local/spark/python/lib/pyspark.zip/pyspark/worker.py", line 111, in main 
    process() 
    File "/usr/local/spark/python/lib/pyspark.zip/pyspark/worker.py", line 106, in process 
    serializer.dump_stream(func(split_index, iterator), outfile) 
    File "/usr/local/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 263, in dump_stream 
    vs = list(itertools.islice(iterator, batch)) 
    File "<stdin>", line 1, in <lambda> 
    File "/usr/local/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 1272, in __getattr__ 
    raise AttributeError(item) 
AttributeError: lower 

    at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166) 
    at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207) 
    at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125) 
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70) 
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) 
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) 
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) 
    at org.apache.spark.scheduler.Task.run(Task.scala:89) 
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213) 
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) 
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) 
    at java.lang.Thread.run(Thread.java:745) 

Driver stacktrace: 
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431) 
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419) 
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418) 
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) 
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) 
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418) 
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799) 
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799) 
    at scala.Option.foreach(Option.scala:236) 
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:799) 
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1640) 
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599) 
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588) 
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) 
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620) 
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1832) 
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1845) 
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1858) 
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1929) 
    at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:927) 
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150) 
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111) 
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:316) 
    at org.apache.spark.rdd.RDD.collect(RDD.scala:926) 
    at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:405) 
    at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala) 
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) 
    at java.lang.reflect.Method.invoke(Method.java:498) 
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231) 
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381) 
    at py4j.Gateway.invoke(Gateway.java:259) 
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133) 
    at py4j.commands.CallCommand.execute(CallCommand.java:79) 
    at py4j.GatewayConnection.run(GatewayConnection.java:209) 
    at java.lang.Thread.run(Thread.java:745) 
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last): 
    File "/usr/local/spark/python/lib/pyspark.zip/pyspark/worker.py", line 111, in main 
    process() 
    File "/usr/local/spark/python/lib/pyspark.zip/pyspark/worker.py", line 106, in process 
    serializer.dump_stream(func(split_index, iterator), outfile) 
    File "/usr/local/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 263, in dump_stream 
    vs = list(itertools.islice(iterator, batch)) 
    File "<stdin>", line 1, in <lambda> 
    File "/usr/local/spark/python/lib/pyspark.zip/pyspark/sql/types.py", line 1272, in __getattr__ 
    raise AttributeError(item) 
AttributeError: lower 

    at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166) 
    at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207) 
    at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125) 
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70) 
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) 
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) 
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) 
    at org.apache.spark.scheduler.Task.run(Task.scala:89) 
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213) 
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) 
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) 
    ... 1 more 

Which exception do you get? What is the error message? – DNA

Answer


You can use the built-in functions for DataFrames instead of converting it to an RDD:

from pyspark.sql.functions import * 
from pyspark.sql import Row 

df = sqlContext.createDataFrame([Row("John"), Row("Mary")], ["name"]) 

lower_df = df.withColumn("name", lower(col("name"))) 

Then, to apply a regex, you can use regexp_replace(str, pattern, replacement):

result = lower_df.withColumn("new_col", regexp_replace("name", "o", "\*")) 

result.show() 

+----+-------+ 
|name|new_col| 
+----+-------+ 
|john| j*hn| 
|mary| mary| 
+----+-------+ 
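
Since the goal in the question was a regex operation, here is a hedged sketch with an actual regex pattern (the pattern and the clean_name column are illustrative, not from the original post), assuming the same lower_df as above:

cleaned = lower_df.withColumn("clean_name", regexp_replace(col("name"), "[^a-z0-9]", "")) 
cleaned.show() 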

For more information on the available functions, see the pyspark docs.
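
For completeness: the original error occurs because df.select('name') yields Row objects rather than plain strings, so x.lower() goes through Row.__getattr__ and raises AttributeError: lower. If you do want to stay with the RDD API, a minimal sketch (assuming the name column holds non-null strings) would be:

# access the field on the Row before calling the string method 
lower_names = df.select('name').rdd.map(lambda row: row['name'].lower()) 
print(lower_names.collect()) 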


Traceback (most recent call last): File "<stdin>", line 1, in <module> NameError: name 'lower' is not defined –


I am using Spark 1.6 –


@JackDaniel have you imported 'from pyspark.sql.functions import *'? –
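
For reference, a minimal sketch that sidesteps this NameError by importing the needed functions explicitly (assuming Spark 1.6 and the same df as in the answer; regexp_replace is available from Spark 1.5 onwards):

# explicit imports avoid relying on the wildcard import 
from pyspark.sql.functions import lower, col, regexp_replace 

lower_df = df.withColumn("name", lower(col("name"))) 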
