2017-03-25

U-SQL with Python to convert JSON to CSV in Azure Data Lake Store

We need to convert some large files in Azure Data Lake Store from nested JSON to CSV. Since the Python modules pandas and numpy are supported in Azure Data Lake Analytics in addition to the standard modules, I believe this can be done with Python. Does anyone have Python code that does this?

Source format:

{"Loc":"TDM","Topic":"location","LocMac":"location/fe:7a:xx:xx:xx:xx","seq":"296083773","timestamp":1488986751,"op":"OP_UPDATE","topicSeq":"46478211","sourceId":"AFBWmHSe","location":{"staEthMac":{"addr":"/xxxxx"},"staLocationX":1643.8915,"staLocationY":571.04205,"errorLevel":1076,"associated":0,"campusId":"n5THo6IINuOSVZ/cTidNVA==","buildingId":"7hY/xx==","floorId":"xxxxxxxxxx+BYoo0A==","hashedStaEthMac":"xxxx/pMVyK4Gu9qG6w=","locAlgorithm":"ALGORITHM_ESTIMATION","unit":"FEET"},"EventProcessedUtcTime":"2017-03-08T15:35:02.3847947Z","PartitionId":3,"EventEnqueuedUtcTime":"2017-03-08T15:35:03.7510000Z","IoTHub":{"MessageId":null,"CorrelationId":null,"ConnectionDeviceId":"xxxxx","ConnectionDeviceGenerationId":"636243184116591838","EnqueuedTime":"0001-01-01T00:00:00.0000000","StreamId":null}}

Expected output:

TDM,location,location/80:7a:bf:d4:d6:50,974851970,1490004475,OP_UPDATE,151002334,xxxxxxx,ghq/1NZQ,977.7259,638.8827,490,1,n5THo6IINuOSVZ/cTidNVA==,7HY/jVh9NRqqxF6gbqT7Jw==,LV/ZiQRQMS2wwKiKTvYNBQ==,H5rrAD/jg1Fnkmo1Zmquau/Qn1U=,ALGORITHM_ESTIMATION,FEET

Answer


As I understand your description, your key need is how to convert data in Azure Data Lake Store from JSON format to CSV format in Python with the pandas/numpy packages. I looked at your source data and assumed that the JSON contains no array types, then wrote the following code to convert the sample data.

Here is my sample code for a single JSON object string. For reference, I added some comments to explain my idea. The key is a method that converts the structure {"A": 0, "B": {"C": 1}} into the structure [["A", "B.C"], [0, 1]].

import json 
import pandas as pd 

# Source Data string 
json_raw = '''{"Loc":"TDM","Topic":"location","LocMac":"location/fe:7a:xx:xx:xx:xx","seq":"296083773","timestamp":1488986751,"op":"OP_UPDATE","topicSeq":"46478211","sourceId":"AFBWmHSe","location":{"staEthMac":{"addr":"/xxxxx"},"staLocationX":1643.8915,"staLocationY":571.04205,"errorLevel":1076,"associated":0,"campusId":"n5THo6IINuOSVZ/cTidNVA==","buildingId":"7hY/xx==","floorId":"xxxxxxxxxx+BYoo0A==","hashedStaEthMac":"xxxx/pMVyK4Gu9qG6w=","locAlgorithm":"ALGORITHM_ESTIMATION","unit":"FEET"},"EventProcessedUtcTime":"2017-03-08T15:35:02.3847947Z","PartitionId":3,"EventEnqueuedUtcTime":"2017-03-08T15:35:03.7510000Z","IoTHub":{"MessageId":null,"CorrelationId":null,"ConnectionDeviceId":"xxxxx","ConnectionDeviceGenerationId":"636243184116591838","EnqueuedTime":"0001-01-01T00:00:00.0000000","StreamId":null}}''' 

# Load source data string to a Python dict 
json_data = json.loads(json_raw) 

# The key method `flatten` converts a nested `dict` into a 2D list:
# {"A": 0, "B": {"C": 1}} -> [["A", "B.C"], [0, 1]]
def flatten(data, prefix=None):
    keys = []
    values = []
    for key, value in data.items():
        # Build the dotted column name, e.g. "location.staEthMac.addr"
        name = key if prefix is None else prefix + "." + key
        if isinstance(value, dict):
            # Recurse once and reuse both result lists
            sub_keys, sub_values = flatten(value, name)
            keys.extend(sub_keys)
            values.extend(sub_values)
        else:
            keys.append(name)
            values.append(value)
    return [keys, values]

list2D = flatten(json_data)
# One row of values, with the dotted keys as column names
df = pd.DataFrame([list2D[1]], columns=list2D[0])

# To extract `Loc`, `Topic`, or nested items such as `location.staEthMac.addr`, list their column names.
selected = ["Loc", "Topic"]
# Use the `selected` list to pick the columns you want (`.ix` was removed from pandas; use `.loc`).
result = df.loc[:, selected]
# Convert the DataFrame to a CSV string; `to_csv` also handles non-string values such as numbers and nulls
csv_raw = result.to_csv(index=False, header=False)
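The same flattening idea can be applied to a file with many records. The following is only a minimal sketch, assuming that each input line holds one JSON object (the question does not confirm this layout) and that there are still no arrays. It uses the standard-library `json` and `csv` modules instead of pandas; the names `json_lines_to_csv`, `sample`, and `columns` are hypothetical and chosen for illustration.

```python
import csv
import io
import json


def flatten(data, prefix=None):
    """Flatten nested dicts into one dict with dotted keys (arrays are not handled)."""
    flat = {}
    for key, value in data.items():
        name = key if prefix is None else prefix + "." + key
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat


def json_lines_to_csv(lines, columns):
    """Convert an iterable of JSON object strings to CSV text with the given columns."""
    out = io.StringIO()
    writer = csv.DictWriter(
        out,
        fieldnames=columns,
        extrasaction="ignore",  # drop flattened keys that were not selected
        restval="",             # leave a cell empty when a record lacks a column
        lineterminator="\n",
    )
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        writer.writerow(flatten(json.loads(line)))
    return out.getvalue()


# Hypothetical sample records, shortened from the source format in the question
sample = [
    '{"Loc": "TDM", "seq": "296083773", "location": {"staEthMac": {"addr": "/xxxxx"}, "unit": "FEET"}}',
    '{"Loc": "TDM", "seq": "296083774", "location": {"staEthMac": {"addr": "/yyyyy"}, "unit": "FEET"}}',
]
columns = ["Loc", "seq", "location.staEthMac.addr", "location.unit"]
csv_text = json_lines_to_csv(sample, columns)
print(csv_text)
```

Because the writer handles one record at a time, `lines` can also be an open file object, so a large file would not have to be loaded into memory at once.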

I hope it helps.
