Basic text similarity through WordNet synsets for taxonomy mapping/merging

I want to implement a basic text-similarity routine based on semantic distance, using WordNet and NLTK in Python. The idea: expand two concepts/phrases/categories A and B with synsets, hyponyms, hypernyms, meronyms and metonyms, then compute the distance between the two resulting vectors, a and b. I am not sure yet how I will compute it, perhaps as a cosine distance.
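One way to read "cosine distance between the two vectors" is to treat each expanded list as a bag of terms and compare the counts. The sketch below assumes exactly that; the function name `cosine_distance` and the three term lists are made up for illustration and are written as plain synset-name strings, so nothing here depends on NLTK.

```python
from collections import Counter
from math import sqrt


def cosine_distance(terms_a, terms_b):
    """Return 1 - cosine similarity of two bags of terms (e.g. synset names)."""
    a, b = Counter(terms_a), Counter(terms_b)
    # Only terms present in both bags contribute to the dot product.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # an empty bag shares nothing with any other bag
    return 1.0 - dot / (norm_a * norm_b)


# Hypothetical expanded term lists.
a = ["school.n.01", "teacher.n.01", "educational_institution.n.01"]
b = ["education.n.01", "school.n.01", "educational_institution.n.01"]
c = ["business.n.01", "company.n.01", "enterprise.n.02"]

print(cosine_distance(a, b))  # smaller: two of three terms are shared
print(cosine_distance(a, c))  # 1.0: no shared terms
```

Note that this measure only rewards exact overlap between the two expanded lists; it knows nothing about how close two different synsets are in the hierarchy.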

In most cases my input data consists not of phrases but of proper names or nouns (product names with brand or product categories). For example, I would like to determine that a "resort" is a "luxury hotel", or that "black caviar" is "gourmet": A = "black caviar", B = "gourmet".

To what extent could this work, and how do I walk up and down WordNet to make it a bit more sophisticated than going one level up and down with hyponyms/hypernyms?
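Walking more than one level can be sketched as a breadth-first search with a depth limit. The example below runs on a hypothetical mini-taxonomy (the `HYPERNYMS` dict and the helper names are invented for illustration). With NLTK, the `neighbours` callable could instead return `s.hypernyms() + s.hyponyms()` for a synset `s`.

```python
from collections import deque

# Hypothetical mini-taxonomy: node -> its direct hypernyms (parents).
HYPERNYMS = {
    "resort_hotel": ["hotel"],
    "luxury_hotel": ["hotel"],
    "hotel": ["building"],
    "building": ["structure"],
}

# Derive the opposite (hyponym) direction from the same edges.
HYPONYMS = {}
for child, parents in HYPERNYMS.items():
    for parent in parents:
        HYPONYMS.setdefault(parent, []).append(child)


def up_and_down(node):
    """Direct neighbours in both directions of the hierarchy."""
    return HYPERNYMS.get(node, []) + HYPONYMS.get(node, [])


def expand(start, neighbours, max_depth):
    """Breadth-first walk; maps every reached node to its distance from start."""
    seen = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if seen[node] == max_depth:
            continue  # do not walk beyond the depth limit
        for nxt in neighbours(node):
            if nxt not in seen:
                seen[nxt] = seen[node] + 1
                queue.append(nxt)
    return seen


print(expand("resort_hotel", up_and_down, 2))
```

Because the walk goes up and then down again, two steps already reach siblings such as `luxury_hotel`. Keeping the distance for each node makes it possible to weight far-away terms less, for example by `1 / (1 + distance)`, instead of treating the whole expansion as a flat list.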

I am looking for a simple baseline solution that works well enough, without sophisticated things like Whoosh or similar.

Should I use something better than WordNet?


UPDATE:

I process each noun phrase in the following way (with NLTK and WordNet):

1. For each word in a phrase I collect its synsets (nouns only), then extend them with the synsets of the hypernyms and hyponyms of every element. For now I put all synsets into one list and ignore the hierarchy.
2. I repeat the process for the keywords that describe each of my categories.
3. Now I have a list of synsets for each category and for my target. I simply compute a distance from the target to each category (cosine, or Wu and Palmer's distance). I collect the pairwise distances between my two vectors, sum them, and normalize by the number of keywords that describe the category or the target. Then I choose the minimum distance.
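The scoring in step 3 can be sketched on a toy tree as follows. Everything here is illustrative: the `PARENT` hierarchy and the keyword lists are hypothetical, and `wup` is a hand-written Wu-Palmer score using the convention that the root has depth 1. NLTK offers `Synset.wup_similarity` for real synsets. Note that Wu-Palmer is a similarity, so higher means closer; the sketch therefore picks the category with the maximum mean score, which is the same as the minimum of `1 - score`.

```python
# Hypothetical toy hierarchy: node -> parent (None marks the root).
PARENT = {
    "entity": None,
    "organization": "entity",
    "institution": "organization",
    "school": "institution",
    "university": "institution",
    "company": "organization",
    "person": "entity",
    "teacher": "person",
    "child": "person",
}


def path_to_root(node):
    """List of nodes from `node` up to the root, node first."""
    path = []
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path


def depth(node):
    return len(path_to_root(node))  # the root has depth 1


def wup(a, b):
    """Wu-Palmer similarity: 2 * depth(lcs) / (depth(a) + depth(b))."""
    ancestors_b = set(path_to_root(b))
    # First shared node on the way up from `a` is the deepest common ancestor.
    lcs = next(n for n in path_to_root(a) if n in ancestors_b)
    return 2 * depth(lcs) / (depth(a) + depth(b))


def mean_pairwise_similarity(xs, ys):
    """Sum of all pairwise scores divided by the number of pairs."""
    return sum(wup(x, y) for x in xs for y in ys) / (len(xs) * len(ys))


target = ["school", "child", "teacher"]
categories = {
    "business": ["company", "organization"],
    "education": ["school", "university"],
}
scores = {name: mean_pairwise_similarity(target, words)
          for name, words in categories.items()}
best = max(scores, key=scores.get)
print(scores, best)
```

Dividing by the number of pairs rather than by the number of keywords keeps categories with long expanded lists from winning merely because they contribute more terms to the sum.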

This sounds rather simple and inefficient. What is the next step to make it better?

I am interested in building it from scratch; it is also the best exercise for understanding how things work and how it should be done.


Example: word_list - target: ['school', 'kids', 'teacher']

categories: [['business', 'organization', 'company'], ['education', 'school', 'university']]

Extended list for target concept 'education', 3 keywords: [Synset('school.n.01'), Synset('school.n.02'), Synset('school.n.03'), Synset('school.n.04'), Synset('school.n.05'), Synset('school.n.06'), Synset('school.n.07'), Synset('child.n.01'), Synset('kid.n.02'), Synset('kyd.n.01'), Synset('child.n.02'), Synset('kid.n.05'), Synset('teacher.n.01'), Synset('teacher.n.02'), Synset('educational_institution.n.01'), Synset('building.n.01'), Synset('education.n.03'), Synset('body.n.02'), Synset('time_period.n.01'), Synset('educational_institution.n.01'), Synset('animal_group.n.01'), Synset('academy.n.03'), Synset('alma_mater.n.01'), Synset('conservatory.n.01'), Synset('correspondence_school.n.01'), Synset('crammer.n.03'), Synset('dance_school.n.01'), Synset('dancing_school.n.01'), Synset('day_school.n.02'), Synset('direct-grant_school.n.01'), Synset('driving_school.n.01'), Synset('finishing_school.n.01'), Synset('flying_school.n.01'), Synset('grade_school.n.01'), Synset('graduate_school.n.01'), Synset('language_school.n.01'), Synset('night_school.n.01'), Synset('nursing_school.n.01'), Synset('private_school.n.01'), Synset('public_school.n.01'), Synset('religious_school.n.01'), Synset('riding_school.n.01'), Synset('secondary_school.n.01'), Synset('secretarial_school.n.01'), Synset('sunday_school.n.01'), Synset('technical_school.n.01'), Synset('training_school.n.01'), Synset('veterinary_school.n.01'), Synset('conservatory.n.02'), Synset('day_school.n.03'), Synset('art_nouveau.n.01'), Synset('ashcan_school.n.01'), Synset('deconstructivism.n.01'), Synset('historical_school.n.01'), Synset('lake_poets.n.01'), Synset('pointillism.n.01'), Synset('secession.n.01')]

Extended list for category concept 'business', 3 keywords, 223 in extended list: [Synset('business.n.01'), Synset('commercial_enterprise.n.02'), Synset('occupation.n.01'), Synset('business.n.04'), Synset('business.n.05'), Synset('business.n.06'), Synset('business.n.07'), Synset('clientele.n.01'), Synset('business.n.09'), Synset('organization.n.01'), Synset('arrangement.n.03'), Synset('administration.n.02'), Synset('organization.n.04'), Synset('organization.n.05'), Synset('organization.n.06'), Synset('constitution.n.02'), Synset('company.n.01'), Synset('company.n.02'), Synset('company.n.03'), Synset('company.n.04'), Synset('caller.n.01'), Synset('company.n.06'), Synset('party.n.03'), Synset("ship's_company.n.01"), Synset('company.n.09'), Synset('enterprise.n.02'), Synset('commerce.n.01'), Synset('activity.n.01'), Synset('concern.n.04'), Synset('aim.n.02'), Synset('business_activity.n.01'), Synset('sector.n.02'), Synset('people.n.01'), Synset('acting.n.01'), Synset('social_group.n.01'), Synset('structure.n.03'), Synset('body.n.02'), Synset('administration.n.01'), Synset('orderliness.n.01'), Synset('activity.n.01'), Synset('beginning.n.05'), Synset('institution.n.01'), Synset('army_unit.n.01'), Synset('friendship.n.01'), Synset('organization.n.01'), Synset('visitor.n.01'), Synset('social_gathering.n.01'), Synset('set.n.05'), Synset('complement.n.03'), Synset('unit.n.03'), Synset('agency.n.02'), Synset('brokerage.n.02'), Synset('carrier.n.05'), Synset('chain.n.04'), Synset('firm.n.01'), Synset('franchise.n.02'), Synset('manufacturer.n.01'), Synset('partnership.n.01'), Synset('processor.n.01'), Synset('shipbuilder.n.03'), Synset('underperformer.n.02'), Synset('advertising.n.02'), Synset('agribusiness.n.01'), Synset('butchery.n.02'), Synset('construction.n.07'), Synset('discount_business.n.01'), Synset('employee-owned_enterprise.n.01'), Synset('field.n.06'), Synset('finance.n.01'), Synset('fishing.n.02'), Synset('industry.n.02'), Synset('packaging.n.01'), Synset('printing.n.02'), Synset('publication.n.04'), Synset('real-estate_business.n.01'), Synset('storage.n.03'), Synset('tourism.n.01'), Synset('transportation.n.05'), Synset('venture.n.03'), Synset('accountancy.n.01'), Synset('appointment.n.05'), Synset('career.n.01'), Synset('catering.n.01'), Synset('confectionery.n.03'), Synset('employment.n.02'), Synset('farming.n.02'), Synset('game.n.10'), Synset('metier.n.02'), Synset('photography.n.03'), Synset('position.n.06'), Synset('profession.n.02'), Synset('sport.n.02'), Synset('trade.n.02'), Synset('treadmill.n.03'), Synset('occasions.n.01'), Synset('land-office_business.n.01'), Synset('trade.n.03'), Synset('big_business.n.01'), Synset('shtik.n.02'), Synset('adhocracy.n.01'), Synset('affiliate.n.02'), Synset('alliance.n.03'), Synset('association.n.01'), Synset('blue.n.03'), Synset('bureaucracy.n.03'), Synset('company.n.04'), Synset('defense.n.09'), Synset('deputation.n.01'), Synset('enterprise.n.02'), Synset('establishment.n.05'), Synset('federation.n.01'), Synset('fiefdom.n.02'), Synset('fire_brigade.n.01'), Synset('force.n.04'), Synset('girl_scouts.n.01'), Synset('grey.n.04'), Synset('hierarchy.n.02'), Synset('host.n.06'), Synset('institution.n.01'), Synset('line_of_defense.n.01'), Synset('line_organization.n.01'), Synset('machine.n.03'), Synset('machine.n.05'), Synset('musical_organization.n.01'), Synset('nongovernmental_organization.n.01'), Synset('party.n.01'), Synset('peace_corps.n.01'), Synset('polity.n.02'), Synset('pool.n.03'), Synset('professional_organization.n.01'), Synset('quango.n.01'), Synset('tammany_hall.n.01'), Synset('union.n.01'), Synset('unit.n.03'), Synset('calendar.n.01'), Synset('classification_system.n.01'), Synset('contrivance.n.04'), Synset('coordinate_system.n.01'), Synset('data_structure.n.01'), Synset('design.n.02'), Synset('distribution.n.01'), Synset('genetic_map.n.01'), Synset('kinship_system.n.01'), Synset('lattice.n.01'), Synset('living_arrangement.n.01'), Synset('ontology.n.01'), Synset('county_council.n.01'), Synset('curia.n.01'), Synset('executive.n.02'), Synset('government_officials.n.01'), Synset('judiciary.n.01'), Synset('management.n.02'), Synset('top_brass.n.01'), Synset('nonprofit_organization.n.01'), Synset('rationalization.n.04'), Synset('reorganization.n.01'), Synset('self-organization.n.01'), Synset('syndication.n.01'), Synset('listing.n.02'), Synset('order.n.15'), Synset('randomization.n.01'), Synset('systematization.n.01'), Synset('territorialization.n.01'), Synset('collectivization.n.01'), Synset('colonization.n.01'), Synset('communization.n.02'), Synset('federation.n.03'), Synset('unionization.n.01'), Synset('broadcasting_company.n.01'), Synset('bureau_de_change.n.01'), Synset('car_company.n.01'), Synset('closed_shop.n.01'), Synset('corporate_investor.n.01'), Synset('distributor.n.03'), Synset('dot-com.n.01'), Synset('drug_company.n.01'), Synset('east_india_company.n.01'), Synset('electronics_company.n.01'), Synset('film_company.n.01'), Synset('food_company.n.01'), Synset('furniture_company.n.01'), Synset('holding_company.n.01'), Synset('joint-stock_company.n.01'), Synset('limited_company.n.01'), Synset('livery_company.n.01'), Synset('mining_company.n.01'), Synset('mover.n.04'), Synset('oil_company.n.01'), Synset('open_shop.n.01'), Synset('packaging_company.n.01'), Synset('pipeline_company.n.01'), Synset('printing_concern.n.01'), Synset('record_company.n.01'), Synset('service.n.04'), Synset('shipper.n.02'), Synset('shipping_company.n.01'), Synset('steel_company.n.01'), Synset('stock_company.n.01'), Synset('subsidiary_company.n.01'), Synset('target_company.n.01'), Synset('think_tank.n.01'), Synset('transportation_company.n.01'), Synset('union_shop.n.01'), Synset('white_knight.n.01'), Synset('trainband.n.01'), Synset('freemasonry.n.01'), Synset('ballet_company.n.01'), Synset('chorus.n.05'), Synset('circus.n.01'), Synset('minstrel_show.n.01'), Synset('minstrelsy.n.01'), Synset('opera_company.n.01'), Synset('theater_company.n.01'), Synset('attendance.n.03'), Synset('cohort.n.01'), Synset('number.n.07'), Synset('fatigue_party.n.01'), Synset('landing_party.n.01'), Synset('party_to_the_action.n.01'), Synset('rescue_party.n.01'), Synset('search_party.n.01'), Synset('stretcher_party.n.01'), Synset('war_party.n.01')]

Extended list for category concept 'education', 97 synsets: [Synset('education.n.01'), Synset('education.n.02'), Synset('education.n.03'), Synset('education.n.04'), Synset('education.n.05'), Synset('department_of_education.n.01'), Synset('school.n.01'), Synset('school.n.02'), Synset('school.n.03'), Synset('school.n.04'), Synset('school.n.05'), Synset('school.n.06'), Synset('school.n.07'), Synset('university.n.01'), Synset('university.n.02'), Synset('university.n.03'), Synset('activity.n.01'), Synset('content.n.05'), Synset('learning.n.01'), Synset('profession.n.02'), Synset('upbringing.n.01'), Synset('executive_department.n.01'), Synset('educational_institution.n.01'), Synset('building.n.01'), Synset('education.n.03'), Synset('body.n.02'), Synset('time_period.n.01'), Synset('educational_institution.n.01'), Synset('animal_group.n.01'), Synset('body.n.02'), Synset('establishment.n.04'), Synset('educational_institution.n.01'), Synset('coeducation.n.01'), Synset('continuing_education.n.01'), Synset('course.n.01'), Synset('elementary_education.n.01'), Synset('extension.n.04'), Synset('extracurricular_activity.n.01'), Synset('higher_education.n.01'), Synset('secondary_education.n.01'), Synset('team_teaching.n.01'), Synset('work-study_program.n.01'), Synset('enlightenment.n.01'), Synset('eruditeness.n.01'), Synset('experience.n.01'), Synset('foundation.n.04'), Synset('physical_education.n.01'), Synset('acculturation.n.03'), Synset('mastering.n.01'), Synset('school.n.03'), Synset('self-education.n.01'), Synset('special_education.n.01'), Synset('vocational_training.n.01'), Synset('teaching.n.01'), Synset('academy.n.03'), Synset('alma_mater.n.01'), Synset('conservatory.n.01'), Synset('correspondence_school.n.01'), Synset('crammer.n.03'), Synset('dance_school.n.01'), Synset('dancing_school.n.01'), Synset('day_school.n.02'), Synset('direct-grant_school.n.01'), Synset('driving_school.n.01'), Synset('finishing_school.n.01'), Synset('flying_school.n.01'), Synset('grade_school.n.01'), Synset('graduate_school.n.01'), Synset('language_school.n.01'), Synset('night_school.n.01'), Synset('nursing_school.n.01'), Synset('private_school.n.01'), Synset('public_school.n.01'), Synset('religious_school.n.01'), Synset('riding_school.n.01'), Synset('secondary_school.n.01'), Synset('secretarial_school.n.01'), Synset('sunday_school.n.01'), Synset('technical_school.n.01'), Synset('training_school.n.01'), Synset('veterinary_school.n.01'), Synset('conservatory.n.02'), Synset('day_school.n.03'), Synset('art_nouveau.n.01'), Synset('ashcan_school.n.01'), Synset('deconstructivism.n.01'), Synset('historical_school.n.01'), Synset('lake_poets.n.01'), Synset('pointillism.n.01'), Synset('secession.n.01'), Synset('gown.n.02'), Synset('varsity.n.01'), Synset('city_university.n.01'), Synset('oxbridge.n.01'), Synset('redbrick_university.n.01'), Synset('multiversity.n.01'), Synset('open_university.n.01')]

Extended list for my target, 57 synsets: [Synset('school.n.01'), Synset('school.n.02'), Synset('school.n.03'), Synset('school.n.04'), Synset('school.n.05'), Synset('school.n.06'), Synset('school.n.07'), Synset('child.n.01'), Synset('kid.n.02'), Synset('kyd.n.01'), Synset('child.n.02'), Synset('kid.n.05'), Synset('teacher.n.01'), Synset('teacher.n.02'), Synset('educational_institution.n.01'), Synset('building.n.01'), Synset('education.n.03'), Synset('body.n.02'), Synset('time_period.n.01'), Synset('educational_institution.n.01'), Synset('animal_group.n.01'), Synset('academy.n.03'), Synset('alma_mater.n.01'), Synset('conservatory.n.01'), Synset('correspondence_school.n.01'), Synset('crammer.n.03'), Synset('dance_school.n.01'), Synset('dancing_school.n.01'), Synset('day_school.n.02'), Synset('direct-grant_school.n.01'), Synset('driving_school.n.01'), Synset('finishing_school.n.01'), Synset('flying_school.n.01'), Synset('grade_school.n.01'), Synset('graduate_school.n.01'), Synset('language_school.n.01'), Synset('night_school.n.01'), Synset('nursing_school.n.01'), Synset('private_school.n.01'), Synset('public_school.n.01'), Synset('religious_school.n.01'), Synset('riding_school.n.01'), Synset('secondary_school.n.01'), Synset('secretarial_school.n.01'), Synset('sunday_school.n.01'), Synset('technical_school.n.01'), Synset('training_school.n.01'), Synset('veterinary_school.n.01'), Synset('conservatory.n.02'), Synset('day_school.n.03'), Synset('art_nouveau.n.01'), Synset('ashcan_school.n.01'), Synset('deconstructivism.n.01'), Synset('historical_school.n.01'), Synset('lake_poets.n.01'), Synset('pointillism.n.01'), Synset('secession.n.01')]


I have 3 vectors: target (57), business (223) and education (97).

Now I compute the pairwise Wu and Palmer distances between target and business and divide by 57 x 223 = 12711; between target and education, I divide by 57 x 97 = 5529.
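The normalisation amounts to dividing the sum of all pairwise scores by the number of pairs, so that a longer expanded list does not score differently just because it contributes more terms. A minimal sketch of that arithmetic, using the list sizes quoted above; the helper name `mean_pairwise` and the `sizes` dict are made up for illustration.

```python
def mean_pairwise(total, n_target, n_category):
    """Average pairwise score: the sum of scores divided by the number of pairs."""
    return total / (n_target * n_category)


# List sizes quoted in the post.
sizes = {"target": 57, "business": 223, "education": 97}

pairs_business = sizes["target"] * sizes["business"]
pairs_education = sizes["target"] * sizes["education"]
print(pairs_business, pairs_education)  # 12711 5529
```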

Target to business distance: 2305.709117171037 / 5529 = 0.9125370052417936
Target to education distance: 5045.417101981877 / 12711 = 0.39693313680921066

The minimum distance is to education. That is the correct answer.

Answer

WordNet plus some similarity measure can be a solution. You can also use Word2Vec to determine the semantic distance between the words you get from the WordNet synset/*nym lookup.
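A rough sketch of the Word2Vec idea: average the vectors of the keywords on each side and compare the averages by cosine similarity. The 3-dimensional vectors below are invented stand-ins for real embeddings, and the helper names are hypothetical. In practice the vectors would come from a trained Word2Vec model, loaded with whatever library you prefer.

```python
from math import sqrt

# Hypothetical toy vectors standing in for trained Word2Vec embeddings.
VECTORS = {
    "school": [0.9, 0.1, 0.0],
    "teacher": [0.8, 0.2, 0.1],
    "university": [0.85, 0.05, 0.1],
    "company": [0.1, 0.9, 0.2],
    "business": [0.0, 1.0, 0.1],
}


def mean_vector(words):
    """Component-wise average of the vectors of all known words."""
    known = [VECTORS[w] for w in words if w in VECTORS]
    if not known:
        raise ValueError("none of the words has a vector")
    return [sum(column) / len(known) for column in zip(*known)]


def cosine_similarity(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (sqrt(sum(x * x for x in u)) * sqrt(sum(y * y for y in v)))


target = mean_vector(["school", "teacher"])
categories = {
    "education": ["school", "university"],
    "business": ["business", "company"],
}
scores = {name: cosine_similarity(target, mean_vector(words))
          for name, words in categories.items()}
best = max(scores, key=scores.get)
print(scores, best)
```

Words without a vector are skipped here; for product and brand names that may happen often, which is one reason to combine this with the WordNet expansion rather than replace it.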

Maybe someone could help with a specific library (at the moment nothing comes to mind that could be used directly).