2013-09-25 14 views
6

Python - find duplicates in a list of dictionaries and group them

I'm not a programmer and I'm also new to Python. I have a list of dicts coming from a JSON file:

# JSON file (film.json) 
[{"year": ["1999"], "director": ["Wachowski"], "film": ["The Matrix"], "price": ["19,00"]}, 
{"year": ["1994"], "director": ["Tarantino"], "film": ["Pulp Fiction"], "price": ["20,00"]}, 
{"year": ["2003"], "director": ["Tarantino"], "film": ["Kill Bill vol.1"], "price": ["10,00"]}, 
{"year": ["2003"], "director": ["Wachowski"], "film": ["The Matrix Reloaded"], "price": ["9,99"]}, 
{"year": ["1994"], "director": ["Tarantino"], "film": ["Pulp Fyction"], "price": ["15,00"]}, 
{"year": ["1994"], "director": ["E. de Souza"], "film": ["Street Fighter"], "price": ["2,00"]}, 
{"year": ["1999"], "director": ["Wachowski"], "film": ["The Matrix"], "price": ["20,00"]}, 
{"year": ["1982"], "director": ["Ridley Scott"], "film": ["Blade Runner"], "price": ["19,99"]}] 

I load the JSON file with:

import json 
json_file = open('film.json') 
f = json.load(json_file) 

but after that I'm not able to find the repeated entries in f and separate them into groups by film title. This is what I'm trying to achieve:

## result grouped by 'film' 
#group 1 
{"year": ["1999"], "director": ["Wachowski"], "film": ["The Matrix"], "price": ["19,00"]} 
{"year": ["1999"], "director": ["Wachowski"], "film": ["The Matrix"], "price": ["20,00"]} 
#group 2 
{"year": ["1994"], "director": ["Tarantino"], "film": ["Pulp Fiction"], "price": ["20,00"]} 
{"year": ["1994"], "director": ["Tarantino"], "film": ["Pulp Fyction"], "price": ["15,00"]} 
#group X 
... 

Or better:

new_dict = { 'group1':[[],[],...] , 'group2':[[],[],...] , 'groupX':[...] } 

At the moment I'm experimenting with nested for loops, but without luck.

Thanks.

Note: "Pulp Fyction" is a deliberate misspelling, intended for a future implementation with fuzzy string matching; for now I only need a 'duplicates grouper'.

Note 2: I'm using Python 2.x

+0

What are you grouping on? Title alone? Title + director + year? – wim

+0

http://docs.python.org/2/library/itertools.html#itertools.groupby – dm03514

+0

Why don't you name your groups after the film? –

Answers

8

Because your data is not sorted, use a collections.defaultdict() object to materialize a list for each new key, then key by film title:

from collections import defaultdict 

grouped = defaultdict(list) 

for film in f: 
    grouped[film['film'][0]].append(film) 

The film['film'][0] value is used to group the films. You would need to create a canonical version of that key if you want a more complex title grouping.
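
As a rough sketch of what such a canonical key could look like: the helper name canonical_key and the choice of title + director + year are assumptions for illustration, not something the question specifies, and the oddly spaced second title is a made-up variant added to show the normalization.

```python
from collections import defaultdict

# Small inline sample in the same shape as film.json.
# The second title is a hypothetical variant with odd casing and spacing.
films = [
    {"year": ["1999"], "director": ["Wachowski"], "film": ["The Matrix"], "price": ["19,00"]},
    {"year": ["1999"], "director": ["Wachowski"], "film": [" the  matrix "], "price": ["20,00"]},
    {"year": ["2003"], "director": ["Wachowski"], "film": ["The Matrix Reloaded"], "price": ["9,99"]},
]


def canonical_key(film):
    """Hypothetical helper: normalized (title, director, year) tuple."""
    # Lower-case the title and collapse runs of whitespace.
    title = " ".join(film["film"][0].lower().split())
    return (title, film["director"][0].lower(), film["year"][0])


grouped = defaultdict(list)
for film in films:
    grouped[canonical_key(film)].append(film)
```

Tuples are hashable, so they work as dictionary keys just like the plain title string does.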

Demo:

>>> from collections import defaultdict 
>>> import json 
>>> with open('film.json') as film_file: 
...  f = json.load(film_file) 
... 
>>> grouped = defaultdict(list) 
>>> for film in f: 
...  grouped[film['film'][0]].append(film) 
... 
>>> grouped 
defaultdict(<type 'list'>, {u'Street Fighter': [{u'director': [u'E. de Souza'], u'price': [u'2,00'], u'film': [u'Street Fighter'], u'year': [u'1994']}], u'Pulp Fiction': [{u'director': [u'Tarantino'], u'price': [u'20,00'], u'film': [u'Pulp Fiction'], u'year': [u'1994']}], u'Pulp Fyction': [{u'director': [u'Tarantino'], u'price': [u'15,00'], u'film': [u'Pulp Fyction'], u'year': [u'1994']}], u'The Matrix': [{u'director': [u'Wachowski'], u'price': [u'19,00'], u'film': [u'The Matrix'], u'year': [u'1999']}, {u'director': [u'Wachowski'], u'price': [u'20,00'], u'film': [u'The Matrix'], u'year': [u'1999']}], u'Blade Runner': [{u'director': [u'Ridley Scott'], u'price': [u'19,99'], u'film': [u'Blade Runner'], u'year': [u'1982']}], u'Kill Bill vol.1': [{u'director': [u'Tarantino'], u'price': [u'10,00'], u'film': [u'Kill Bill vol.1'], u'year': [u'2003']}], u'The Matrix Reloaded': [{u'director': [u'Wachowski'], u'price': [u'9,99'], u'film': [u'The Matrix Reloaded'], u'year': [u'2003']}]}) 
>>> from pprint import pprint 
>>> pprint(dict(grouped)) 
{u'Blade Runner': [{u'director': [u'Ridley Scott'], 
        u'film': [u'Blade Runner'], 
        u'price': [u'19,99'], 
        u'year': [u'1982']}], 
u'Kill Bill vol.1': [{u'director': [u'Tarantino'], 
         u'film': [u'Kill Bill vol.1'], 
         u'price': [u'10,00'], 
         u'year': [u'2003']}], 
u'Pulp Fiction': [{u'director': [u'Tarantino'], 
        u'film': [u'Pulp Fiction'], 
        u'price': [u'20,00'], 
        u'year': [u'1994']}], 
u'Pulp Fyction': [{u'director': [u'Tarantino'], 
        u'film': [u'Pulp Fyction'], 
        u'price': [u'15,00'], 
        u'year': [u'1994']}], 
u'Street Fighter': [{u'director': [u'E. de Souza'], 
         u'film': [u'Street Fighter'], 
         u'price': [u'2,00'], 
         u'year': [u'1994']}], 
u'The Matrix': [{u'director': [u'Wachowski'], 
        u'film': [u'The Matrix'], 
        u'price': [u'19,00'], 
        u'year': [u'1999']}, 
       {u'director': [u'Wachowski'], 
        u'film': [u'The Matrix'], 
        u'price': [u'20,00'], 
        u'year': [u'1999']}], 
u'The Matrix Reloaded': [{u'director': [u'Wachowski'], 
          u'film': [u'The Matrix Reloaded'], 
          u'price': [u'9,99'], 
          u'year': [u'2003']}]} 
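
For comparison, here is a minimal sketch of why the unsorted data matters if you reach for itertools.groupby instead: it only merges adjacent items, so the list has to be sorted by the same key first. The title helper and the shortened inline sample are assumptions for illustration.

```python
from itertools import groupby

# Shortened inline sample in the same shape as film.json.
films = [
    {"year": ["1999"], "director": ["Wachowski"], "film": ["The Matrix"], "price": ["19,00"]},
    {"year": ["1994"], "director": ["Tarantino"], "film": ["Pulp Fiction"], "price": ["20,00"]},
    {"year": ["1999"], "director": ["Wachowski"], "film": ["The Matrix"], "price": ["20,00"]},
]


def title(film):
    """Hypothetical key function: the single title string of a film entry."""
    return film["film"][0]


# Without sorting, the same title shows up as two separate groups.
unsorted_keys = [key for key, _ in groupby(films, key=title)]

# Sorting by the grouping key first puts equal titles next to each other.
grouped = {key: list(group)
           for key, group in groupby(sorted(films, key=title), key=title)}
```

The defaultdict loop avoids the sort entirely, which is why it is the simpler choice here.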

Grouping the films by Soundex code would be as simple as:

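from collections import defaultdict
from itertools import groupby, islice

# Soundex digit for each consonant; vowels map to None so they separate runs.
_codes = ('bfpv', 'cgjkqsxz', 'dt', 'l', 'mn', 'r')
_sounds = {c: str(i) for i, code in enumerate(_codes, 1) for c in code}
_sounds.update(dict.fromkeys('aeiouy'))

def soundex(word, _sounds=_sounds):
    # Collapse adjacent letters with the same code; unmapped characters are skipped.
    grouped = groupby(_sounds[c] for c in word.lower() if c in _sounds)
    if _sounds.get(word[0].lower()):
        next(grouped)  # remove first group.
    # Keep the first three digit codes, skipping the vowel (None) runs.
    sdx = ''.join([k for k, g in islice((g for g in grouped if g[0]), 3)])
    return word[0].upper() + format(sdx, '<03')  # pad with zeros to three digits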

grouped_by_soundex = defaultdict(list) 
for film in f: 
    grouped_by_soundex[soundex(film['film'][0])].append(film) 

resulting in:

>>> pprint(dict(grouped_by_soundex)) 
{u'B436': [{u'director': [u'Ridley Scott'], 
      u'film': [u'Blade Runner'], 
      u'price': [u'19,99'], 
      u'year': [u'1982']}], 
u'K414': [{u'director': [u'Tarantino'], 
      u'film': [u'Kill Bill vol.1'], 
      u'price': [u'10,00'], 
      u'year': [u'2003']}], 
u'P412': [{u'director': [u'Tarantino'], 
      u'film': [u'Pulp Fiction'], 
      u'price': [u'20,00'], 
      u'year': [u'1994']}, 
      {u'director': [u'Tarantino'], 
      u'film': [u'Pulp Fyction'], 
      u'price': [u'15,00'], 
      u'year': [u'1994']}], 
u'S363': [{u'director': [u'E. de Souza'], 
      u'film': [u'Street Fighter'], 
      u'price': [u'2,00'], 
      u'year': [u'1994']}], 
u'T536': [{u'director': [u'Wachowski'], 
      u'film': [u'The Matrix'], 
      u'price': [u'19,00'], 
      u'year': [u'1999']}, 
      {u'director': [u'Wachowski'], 
      u'film': [u'The Matrix Reloaded'], 
      u'price': [u'9,99'], 
      u'year': [u'2003']}, 
      {u'director': [u'Wachowski'], 
      u'film': [u'The Matrix'], 
      u'price': [u'20,00'], 
      u'year': [u'1999']}]} 
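
Note that Soundex puts "The Matrix" and "The Matrix Reloaded" into the same T536 bucket. As one possible alternative, here is a rough sketch using difflib.SequenceMatcher from the standard library. The function names similar and group_fuzzy, the 0.85 threshold, and the greedy "compare against the first title of each group" strategy are all assumptions for illustration; a real data set would need the threshold tuned.

```python
from difflib import SequenceMatcher

# Titles taken from film.json, in file order.
titles = [
    "The Matrix", "Pulp Fiction", "Kill Bill vol.1", "The Matrix Reloaded",
    "Pulp Fyction", "Street Fighter", "The Matrix", "Blade Runner",
]


def similar(a, b, threshold=0.85):
    """Hypothetical helper: True if the two titles are close enough."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def group_fuzzy(titles, threshold=0.85):
    """Greedy grouping: each title joins the first group whose
    representative (its first member) is similar enough."""
    groups = []
    for t in titles:
        for group in groups:
            if similar(group[0], t, threshold):
                group.append(t)
                break
        else:
            # No existing group matched, so start a new one.
            groups.append([t])
    return groups


groups = group_fuzzy(titles)
```

With this threshold the misspelled "Pulp Fyction" joins "Pulp Fiction", while "The Matrix Reloaded" stays separate from "The Matrix". The comparison is quadratic in the worst case, so it is only a sketch for small lists.
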
0

If this were a one-off and I were in a hurry, I'd do it like this. Assuming, for this example, that your list of dictionaries is lod, and that the film title is always a list with a single element:

new_dict = {k: [d for d in lod if d.get('film')[0] == k] for k in set(d.get('film')[0] for d in lod)}

To make it more readable and to explain what it does, here is the same thing broken out; again, the list of dictionaries is lod:

#get all the unique film names 
# note: the [0] is because the title is a list, and set doesn't work with lists,
# so we're just taking the first element for this example.
films = set(d.get('film')[0] for d in lod) 


#create a dictionary 
new_dict = {} 

#iterate over the unique film names 
for k in films: 
    #make a list of all the films that match the name we're on 
    filmswiththisname = [d for d in lod if d.get('film')[0] == k] 
    #add the list of films to the new dictionary with the film name as the key. 
    new_dict[k] = filmswiththisname 
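
The loop above rescans lod once per title, which is fine for a one-off. If only the titles that really occur more than once are wanted (the "duplicates" from the question title), a possible follow-up is to filter on the group size. This is a sketch with a shortened inline lod; the name duplicates is an assumption for illustration.

```python
# Shortened inline sample in the same shape as film.json.
lod = [
    {"year": ["1999"], "director": ["Wachowski"], "film": ["The Matrix"], "price": ["19,00"]},
    {"year": ["1982"], "director": ["Ridley Scott"], "film": ["Blade Runner"], "price": ["19,99"]},
    {"year": ["1999"], "director": ["Wachowski"], "film": ["The Matrix"], "price": ["20,00"]},
]

# Same grouping as above: one list of entries per unique film title.
films = set(d.get('film')[0] for d in lod)
new_dict = {k: [d for d in lod if d.get('film')[0] == k] for k in films}

# Keep only the titles that have more than one entry.
duplicates = {k: v for k, v in new_dict.items() if len(v) > 1}
```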