2017-11-26 2 views
2

Painfully slow Postgres query with WHERE on many adjacent rows

I have the following table in PostgreSQL (psql). It has about 2 billion rows in total.

id word  lemma  pos    textid source  
1 Stuffing stuff  vvg    190568 AN   
2 her  her  appge   190568 AN   
3 key  key  nn1    190568 AN   
4 into  into  ii    190568 AN   
5 the  the  at    190568 AN   
6 lock  lock  nn1    190568 AN   
7 she  she  appge   190568 AN   
8 pushed  push  vvd    190568 AN   
9 her  her  appge   190568 AN   
10 way  way  nn1    190568 AN   
11 into  into  ii    190568 AN   
12 the  the  appge   190568 AN   
13 house  house  nn1    190568 AN   
14 .      .    190568 AN   
15 She  she  appge   190568 AN   
16 had  have  vhd    190568 AN   
17 also  also  rr    190568 AN   
18 cajoled cajole  vvd    190568 AN   
19 her  her  appge   190568 AN   
20 way  way  nn1    190568 AN   
21 into  into  ii    190568 AN   
22 the  the  at    190568 AN   
23 home  home  nn1    190568 AN   
24 .      .    190568 AN   
.. ...  ...  ..    ...  .. 

I would like to create the following table, which shows all "way" constructions with the words side by side, plus some data from the columns "source", "lemma" and "pos".

source  word word  word  lemma  pos  word  word  word  word  word  lemma  pos  word  word  
AN   lock she  pushed  push  vvd  her  way  into  the  house  house  nn1  .   she 
AN   had also  cajoled cajole  vvd  her  way  into  the  home  home  nn1  .   A   
AN   tried to   force  force  vvi  her  way  into  the  palace  palace  nn1  ,   officials 

Here is the code I am using:

copy(
SELECT c1.source, c1.word, c2.word, c3.word, c4.word, c4.lemma, c4.pos, c5.word, c6.word, c7.word, c8.word, c9.word, c9.lemma, c9.pos, c10.word, c11.word 

FROM 

orderedflatcorpus AS c1, orderedflatcorpus AS c2, orderedflatcorpus AS c3, orderedflatcorpus AS c4, orderedflatcorpus AS c5, orderedflatcorpus AS c6, orderedflatcorpus AS c7, orderedflatcorpus AS c8, orderedflatcorpus AS c9, orderedflatcorpus AS c10, orderedflatcorpus AS c11 

WHERE 

c1.word LIKE '%' AND 
c2.word LIKE '%' AND 
c3.word LIKE '%' AND 
c4.pos LIKE 'v%' AND 
c5.pos = 'appge' AND 
c6.lemma = 'way' AND 
c7.pos LIKE 'i%' AND 
c8.word = 'the' AND 
c9.pos LIKE 'n%' AND 
c10.word LIKE '%' AND 
c11.word LIKE '%' 

AND 

c1.id + 1 = c2.id AND c1.id + 2 = c3.id AND c1.id + 3 = c4.id AND c1.id + 4 = c5.id AND c1.id + 5 = c6.id AND c1.id + 6 = c7.id AND c1.id + 7 = c8.id AND c1.id + 8 = c9.id AND c1.id + 9 = c10.id AND c1.id + 10 = c11.id 

ORDER BY c1.id 
) 
TO 
'/home/postgres/Results/OUTPUT.csv' 
DELIMITER E'\t' 
csv header; 

The query takes almost 9 hours to run over the two billion rows (the result has about 19,000 rows).

What could I do to improve performance?

The columns "word", "pos" and "lemma" already have a btree index.

Should I stick with my code and simply use a more powerful server with more cores, a faster CPU and more RAM (mine has only 8 GB of RAM, only 2 cores and 2.8 GHz)? Or would you recommend a different, more efficient SQL query?

Thanks!

+2

What is the intent of `c1.word LIKE '%'`? That expression is always true. – Bohemian

+0

I only want to make sure that the word is extracted; it does not matter which word it is. LIKE '%' is meant as a wildcard. Would you suggest a different approach? – Znusgy

+0

'%' also matches the empty string. Maybe you mean '_%' (at least one character)? – Bohemian

Answers

0

I recommend using modern join syntax, which may also fix the problem:

SELECT 
    c1.source, c1.word, c2.word, c3.word, c4.word, c4.lemma, c4.pos, c5.word, c6.word, c7.word, c8.word, c9.word, c9.lemma, c9.pos, c10.word, c11.word 
FROM orderedflatcorpus AS c1 
JOIN orderedflatcorpus AS c2 ON c1.id + 1 = c2.id 
JOIN orderedflatcorpus AS c3 ON c1.id + 2 = c3.id 
JOIN orderedflatcorpus AS c4 ON c1.id + 3 = c4.id 
JOIN orderedflatcorpus AS c5 ON c1.id + 4 = c5.id 
JOIN orderedflatcorpus AS c6 ON c1.id + 5 = c6.id 
JOIN orderedflatcorpus AS c7 ON c1.id + 6 = c7.id 
JOIN orderedflatcorpus AS c8 ON c1.id + 7 = c8.id 
JOIN orderedflatcorpus AS c9 ON c1.id + 8 = c9.id 
JOIN orderedflatcorpus AS c10 ON c1.id + 9 = c10.id 
JOIN orderedflatcorpus AS c11 ON c1.id + 10 = c11.id 
WHERE c4.pos LIKE 'v%' 
AND c5.pos = 'appge' 
AND c6.lemma = 'way' 
AND c7.pos LIKE 'i%' 
AND c8.word = 'the' 
AND c9.pos LIKE 'n%' 

Notes:

  • Redundant LIKEs removed.
  • ORDER BY removed, since it is very expensive. CSV files (like table rows) do not need an ordering to be valid. If you absolutely need an ordering, use command-line tools to sort the file after running the query.
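
The "sort it after the export" idea from the notes can be sketched in a few lines of Python. This is only an illustration under assumptions: it presumes the export contains an `id` column (hypothetical here, since the query above does not select `c1.id`, so it would have to be added), and the helper name `sort_tsv_by_id` is made up. With roughly 19,000 result rows, sorting in memory should be cheap.

```python
import csv
import io


def sort_tsv_by_id(text: str, key_column: str = "id") -> str:
    """Sort tab-separated rows by a numeric column, keeping the header first."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    header = next(reader)                      # first line is the CSV header
    idx = header.index(key_column)             # position of the sort column
    rows = sorted(reader, key=lambda row: int(row[idx]))
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return out.getvalue()


# Toy export with the rows in arbitrary order (hypothetical data).
sample = "id\tsource\tword\n15\tAN\tShe\n6\tAN\tlock\n"
sorted_text = sort_tsv_by_id(sample)
print(sorted_text)
```

For a real file you would read the text from the exported CSV and write the result back out; a command-line sort that skips the header line does the same job.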
+0

Thank you very much, this query takes five hours instead of nine. Excellent work. – Znusgy

+0

Try my (first) solution; it will probably take only a fraction of that. The second one could be ten times faster, but it will take some time to get the DDL changed. – wildplasser

+0

I will try yours right away! – Znusgy

1

Step 1: use a window function to fetch the adjacent records, avoiding the painful self-join (12 tables is very close to the limit where geqo takes over):


copy(
WITH stuff AS (
    SELECT c1.id , c1.source, c1.word 
    , LEAD (c1.word, 1) OVER (www) AS c2w 
    , LEAD (c1.word, 2) OVER (www) AS c3w 
    , LEAD (c1.word, 3) OVER (www) AS c4w 
    , LEAD (c1.lemma, 3) OVER (www) AS c4l 
    , LEAD (c1.pos, 3) OVER (www) AS c4p 
    , LEAD (c1.pos, 4) OVER (www) AS c5p 
    , LEAD (c1.word, 4) OVER (www) AS c5w 
    , LEAD (c1.word, 5) OVER (www) AS c6w 
    , LEAD (c1.lemma, 5) OVER (www) AS c6l 
    , LEAD (c1.word, 6) OVER (www) AS c7w 
    , LEAD (c1.pos, 6) OVER (www) AS c7p 
    , LEAD (c1.word, 7) OVER (www) AS c8w 
    , LEAD (c1.word, 8) OVER (www) AS c9w 
    , LEAD (c1.lemma, 8) OVER (www) AS c9l 
    , LEAD (c1.pos, 8) OVER (www) AS c9p 
    , LEAD (c1.word, 9) OVER (www) AS c10w 
    , LEAD (c1.word, 10) OVER (www) AS c11w 
    FROM orderedflatcorpus AS c1 
    WINDOW www AS (ORDER BY id) 
    ) 
SELECT id , source, word 
    , c2w 
    , c3w 
    , c4w 
    , c4l 
    , c4p 
    , c5w 
    , c6w 
    , c7w 
    , c8w 
    , c9w 
    , c9l 
    , c9p 
    , c10w 
    , c11w 
FROM stuff 
WHERE 1=1 
AND c4p LIKE 'v%' 
AND c5p = 'appge' 
AND c6l = 'way' 
AND c7p LIKE 'i%' 
AND c8w = 'the' 
AND c9p LIKE 'n%' 
ORDER BY id 
) 
-- TO '/home/postgres/Results/OUTPUT.csv' DELIMITER E'\t' csv header; 
TO '/tmp/OUTPUT2.csv' DELIMITER E'\t' csv header; 
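
To see what the LEAD-based lookup does, here is a small runnable sketch. It uses Python's built-in sqlite3 (which needs SQLite 3.25+ for window functions) as a stand-in for PostgreSQL; the table name `corpus` and the toy rows are assumptions modeled on the sample in the question. Note that `LEAD(..., n)` picks the n-th following row in `id` order, so it only matches the `id + n` self-join when the ids have no gaps.

```python
import sqlite3

# Toy rows (id, word, lemma, pos) modeled on the question's sample data.
rows = [
    (5, "the", "the", "at"), (6, "lock", "lock", "nn1"),
    (7, "she", "she", "appge"), (8, "pushed", "push", "vvd"),
    (9, "her", "her", "appge"), (10, "way", "way", "nn1"),
    (11, "into", "into", "ii"), (12, "the", "the", "at"),
    (13, "house", "house", "nn1"), (14, ".", ".", "."),
    (15, "She", "she", "appge"),
]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE corpus (id INTEGER PRIMARY KEY, word TEXT, lemma TEXT, pos TEXT)")
con.executemany("INSERT INTO corpus VALUES (?, ?, ?, ?)", rows)

# One pass over the table: each row carries its neighbours along via LEAD,
# and the outer query filters on those carried-along columns.
query = """
SELECT id, word, c4w, c5w, c6w, c7w, c8w, c9w
FROM (
    SELECT id, word
    , LEAD(word, 3)  OVER www AS c4w, LEAD(pos, 3)   OVER www AS c4p
    , LEAD(word, 4)  OVER www AS c5w, LEAD(pos, 4)   OVER www AS c5p
    , LEAD(word, 5)  OVER www AS c6w, LEAD(lemma, 5) OVER www AS c6l
    , LEAD(word, 6)  OVER www AS c7w, LEAD(pos, 6)   OVER www AS c7p
    , LEAD(word, 7)  OVER www AS c8w
    , LEAD(word, 8)  OVER www AS c9w, LEAD(pos, 8)   OVER www AS c9p
    FROM corpus
    WINDOW www AS (ORDER BY id)
) AS stuff
WHERE c4p LIKE 'v%' AND c5p = 'appge' AND c6l = 'way'
  AND c7p LIKE 'i%' AND c8w = 'the' AND c9p LIKE 'n%'
ORDER BY id
"""
matches = con.execute(query).fetchall()
print(matches)
```

On this toy table the only match starts at id 5 ("the lock she pushed her way into the house").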

Step 2: [data model] The {word, lemma, pos} columns appear to form a low-cardinality group; you could squeeze them out into a separate token/lemma/pos table:


-- An index to speedup the unique extraction and final update 
    -- (the index will be dropped automatically 
    -- once the columns are dropped) 
    CREATE INDEX ON tmp.orderedflatcorpus (word, lemma, pos); 

    ANALYZE tmp.orderedflatcorpus; 
    -- table containing the "squeezed out" domain 
    CREATE TABLE tmp.words AS 
    SELECT DISTINCT word, lemma, pos 
    FROM tmp.orderedflatcorpus 
      ; 
    ALTER TABLE tmp.words 
    ADD COLUMN id SERIAL NOT NULL PRIMARY KEY; 

    ALTER TABLE tmp.words 
    ADD UNIQUE (word , lemma, pos); 

    -- The original table needs an FK "link" to the new table 
    ALTER TABLE tmp.orderedflatcorpus 
     ADD column words_id INTEGER -- NOT NULL 
     REFERENCES tmp.words(id) 
     ; 
    -- FK constraints are helped a lot by a supportive index. 
    CREATE INDEX orderedflatcorpus_words_id_fk ON tmp.orderedflatcorpus (words_id) 
    ; 
    ANALYZE tmp.orderedflatcorpus; 
    ANALYZE tmp.words; 
    -- Initialize the FK column in the original table. 
    -- we need NOT DISTINCT FROM here, since the joined 
    -- columns could contain NULLs , which MUST compare equal. 
    -- ------------------------------------------------------ 
    UPDATE tmp.orderedflatcorpus dst 
     SET words_id = src.id 
     FROM tmp.words src 
    WHERE src.word IS NOT DISTINCT FROM dst.word 
     AND dst.lemma IS NOT DISTINCT FROM src.lemma 
     AND dst.pos IS NOT DISTINCT FROM src.pos 
      ; 
    ALTER TABLE tmp.orderedflatcorpus 
    DROP column word 
    , DROP column lemma 
    , DROP column pos 
      ; 
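
The same "squeeze out" step can be tried on a toy table. This is a sketch with Python's built-in sqlite3 standing in for PostgreSQL; the table names and rows are made up, and SQLite's `IS` operator plays the role of `IS NOT DISTINCT FROM`. It also shows why the NULL-safe comparison matters: with plain `=`, rows whose lemma is NULL would never find their entry in the words table.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE corpus (id INTEGER PRIMARY KEY, word TEXT, lemma TEXT, pos TEXT);
INSERT INTO corpus VALUES
  (1, 'her', 'her', 'appge'), (2, 'way', 'way', 'nn1'),
  (3, '.', NULL, '.'), (4, 'her', 'her', 'appge'), (5, '.', NULL, '.');

-- The "squeezed out" domain: one row per distinct (word, lemma, pos).
CREATE TABLE words (id INTEGER PRIMARY KEY, word TEXT, lemma TEXT, pos TEXT);
INSERT INTO words (word, lemma, pos)
  SELECT DISTINCT word, lemma, pos FROM corpus;

-- Link column in the original table (FK constraint omitted in this sketch).
ALTER TABLE corpus ADD COLUMN words_id INTEGER;
UPDATE corpus SET words_id = (
  SELECT w.id FROM words w
  WHERE w.word IS corpus.word       -- NULL-safe comparison
    AND w.lemma IS corpus.lemma
    AND w.pos IS corpus.pos);
""")

n_words = con.execute("SELECT COUNT(*) FROM words").fetchone()[0]
unmapped = con.execute(
    "SELECT COUNT(*) FROM corpus WHERE words_id IS NULL").fetchone()[0]
# How many rows would find no partner if we compared with plain '='?
unmatched_with_equals = con.execute("""
  SELECT COUNT(*) FROM corpus c WHERE NOT EXISTS (
    SELECT 1 FROM words w
    WHERE w.word = c.word AND w.lemma = c.lemma AND w.pos = c.pos)
""").fetchone()[0]
print(n_words, unmapped, unmatched_with_equals)
```

Five corpus rows collapse to three word entries, every row gets a words_id, and the two rows with a NULL lemma are the ones a plain `=` comparison would miss.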

And the new query, with a JOIN to the words table:


copy(
WITH stuff AS (
    SELECT c1.id , c1.source, w.word 
    , LEAD (w.word, 1) OVER (www) AS c2w 
    , LEAD (w.word, 2) OVER (www) AS c3w 
    , LEAD (w.word, 3) OVER (www) AS c4w 
    , LEAD (w.lemma, 3) OVER (www) AS c4l 
    , LEAD (w.pos, 3) OVER (www) AS c4p 
    , LEAD (w.pos, 4) OVER (www) AS c5p 
    , LEAD (w.word, 4) OVER (www) AS c5w 
    , LEAD (w.word, 5) OVER (www) AS c6w 
    , LEAD (w.lemma, 5) OVER (www) AS c6l 
    , LEAD (w.word, 6) OVER (www) AS c7w 
    , LEAD (w.pos, 6) OVER (www) AS c7p 
    , LEAD (w.word, 7) OVER (www) AS c8w 
    , LEAD (w.word, 8) OVER (www) AS c9w 
    , LEAD (w.lemma, 8) OVER (www) AS c9l 
    , LEAD (w.pos, 8) OVER (www) AS c9p 
    , LEAD (w.word, 9) OVER (www) AS c10w 
    , LEAD (w.word, 10) OVER (www) AS c11w 
    FROM orderedflatcorpus AS c1 
    JOIN words w ON w.id=c1.words_id 
    WINDOW www AS (ORDER BY c1.id) 
    ) 
SELECT id , source, word 
    , c2w , c3w 
    , c4w , c4l , c4p 
    , c5w 
    , c6w 
    , c7w 
    , c8w 
    , c9w , c9l , c9p 
    , c10w 
    , c11w 
FROM stuff 
WHERE 1=1 
AND c4p LIKE 'v%' 
AND c5p = 'appge' 
AND c6l = 'way' 
AND c7p LIKE 'i%' 
AND c8w = 'the' 
AND c9p LIKE 'n%' 
ORDER BY id 
) 
-- TO '/home/postgres/Results/OUTPUT.csv' DELIMITER E'\t' csv header; 
TO '/tmp/OUTPUT3.csv' DELIMITER E'\t' csv header; 

Note: I get two rows in the output, because I relaxed the conditions a bit too much ...


Update: the first query, avoiding the CTE


copy(
SELECT id , source, word 
     , c2w 
     , c3w 
     , c4w 
     , c4l 
     , c4p 
     , c5w 
     , c6w 
     , c7w 
     , c8w 
     , c9w 
     , c9l 
     , c9p 
     , c10w 
     , c11w 
FROM (
     SELECT c1.id , c1.source, c1.word 
     , LEAD (c1.word, 1) OVER (www) AS c2w 
     , LEAD (c1.word, 2) OVER (www) AS c3w 
     , LEAD (c1.word, 3) OVER (www) AS c4w 
     , LEAD (c1.lemma, 3) OVER (www) AS c4l 
     , LEAD (c1.pos, 3) OVER (www) AS c4p 
     , LEAD (c1.pos, 4) OVER (www) AS c5p 
     , LEAD (c1.word, 4) OVER (www) AS c5w 
     , LEAD (c1.word, 5) OVER (www) AS c6w 
     , LEAD (c1.lemma, 5) OVER (www) AS c6l 
     , LEAD (c1.word, 6) OVER (www) AS c7w 
     , LEAD (c1.pos, 6) OVER (www) AS c7p 
     , LEAD (c1.word, 7) OVER (www) AS c8w 
     , LEAD (c1.word, 8) OVER (www) AS c9w 
     , LEAD (c1.lemma, 8) OVER (www) AS c9l 
     , LEAD (c1.pos, 8) OVER (www) AS c9p 
     , LEAD (c1.word, 9) OVER (www) AS c10w 
     , LEAD (c1.word, 10) OVER (www) AS c11w 
     FROM orderedflatcorpus AS c1 
     WINDOW www AS (ORDER BY id) 
     ) stuff 
WHERE 1=1 
AND c4p LIKE 'v%' 
AND c5p = 'appge' 
AND c6l = 'way' 
AND c7p LIKE 'i%' 
AND c8w = 'the' 
AND c9p LIKE 'n%' 
ORDER BY id 
) 
-- TO '/home/postgres/Results/OUTPUT.csv' DELIMITER E'\t' csv header; 
TO '/tmp/OUTPUT2a.csv' DELIMITER E'\t' csv header; 

[a similar transformation could be applied to the second query]


UPDATE2: the subquery version for the two-table variant.


-- copy(
-- EXPLAIN ANALYZE 
SELECT c1i, c1s, c1w 
     , c2w , c3w 
     , c4w , c4l , c4p 
     , c5w 
     , c6w 
     , c7w 
     , c8w 
     , c9w , c9l , c9p 
     , c10w 
     , c11w 
FROM (
     SELECT c1.id AS c1i 
     , c1.source AS c1s 
     , w1.word AS c1w 
     , LEAD (w1.word, 1) OVER www AS c2w 
     , LEAD (w1.word, 2) OVER www AS c3w 
     , LEAD (w1.word, 3) OVER www AS c4w 
     , LEAD (w1.lemma, 3) OVER www AS c4l 
     , LEAD (w1.pos, 3) OVER www AS c4p 
     , LEAD (w1.pos, 4) OVER www AS c5p 
     , LEAD (w1.word, 4) OVER www AS c5w 
     , LEAD (w1.word, 5) OVER www AS c6w 
     , LEAD (w1.lemma, 5) OVER www AS c6l 
     , LEAD (w1.word, 6) OVER www AS c7w 
     , LEAD (w1.pos, 6) OVER www AS c7p 
     , LEAD (w1.word, 7) OVER www AS c8w 
     , LEAD (w1.word, 8) OVER www AS c9w 
     , LEAD (w1.lemma, 8) OVER www AS c9l 
     , LEAD (w1.pos, 8) OVER www AS c9p 
     , LEAD (w1.word, 9) OVER www AS c10w 
     , LEAD (w1.word, 10) OVER www AS c11w 
     FROM orderedflatcorpus c1 
     JOIN words w1 ON w1.id=c1.words_id 
     WHERE 1=1 
/*  These *could* be used to prune out unmatched items, but I could not get it to work ... 
     AND EXISTS (SELECT *FROM orderedflatcorpus c4 JOIN words w4 ON w4.id=c4.words_id 
       WHERE c4.id = 3+c1.id -- AND w4.pos LIKE 'v%' 
       ) -- OMG 
     AND EXISTS (SELECT *FROM orderedflatcorpus c5 JOIN words w5 ON w5.id=c5.words_id 
       WHERE c5.id = 4+c1.id -- AND w5.pos = 'appge' 
       ) -- OMG 
     AND EXISTS (SELECT *FROM orderedflatcorpus c7 JOIN words w7 ON w7.id=c7.words_id 
       WHERE c7.id = 6+c1.id -- AND w7.pos LIKE 'i%' 
       ) -- OMG 
     AND EXISTS (SELECT *FROM orderedflatcorpus c9 JOIN words w9 ON w9.id=c9.words_id 
       WHERE c9.id = 8+c1.id -- AND w9.pos LIKE 'n%' 
       ) -- OMG 
     AND EXISTS (SELECT *FROM orderedflatcorpus c8 JOIN words w8 ON w8.id=c8.words_id 
       WHERE c8.id = 7+c1.id -- AND w8.word = 'the' 
       ) -- OMG 
*/ 
     WINDOW www AS (ORDER BY c1.id ROWS BETWEEN CURRENT ROW AND 10 FOLLOWING) 
     ) stuff 
WHERE 1=1 
AND c4p LIKE 'v%' 
AND c5p = 'appge' 
AND c6l = 'way' 
AND c7p LIKE 'i%' 
AND c8w = 'the' 
AND c9p LIKE 'n%' 
ORDER BY c1i 
     ; 
    --) 
-- TO '/home/postgres/Results/OUTPUT.csv' DELIMITER E'\t' csv header; 
-- TO '/tmp/OUTPUT3b.csv' DELIMITER E'\t' csv header; 
+0

Dear wildplasser, thank you very much for your solutions; I really like them because they are so elegant. Unfortunately, your step 1 took almost nine hours, which is surely not down to your code but to my slow server. I believe the bottleneck is the ORDER BY in your window definition: sorting 2 billion rows takes forever. – Znusgy

+1

No, it is not the "ORDER BY id" (assuming id is the PK); IMHO it is the materialization of the CTE instead (see the plan). Replacing the CTE with a subselect *could* help. – wildplasser

0

Let's try reformatting your query a bit and see what we can see. The first thing to do is to change it to use ANSI-style joins, so that we can clearly see what the relationships are:

SELECT c1.source, c1.word, c2.word, c3.word, c4.word, 
     c4.lemma, c4.pos, c5.word, c6.word, c7.word, 
     c8.word, c9.word, c9.lemma, c9.pos, c10.word, c11.word 
    FROM orderedflatcorpus c1 
    INNER JOIN orderedflatcorpus c2 
    ON c2.ID = c1.ID + 1 AND 
     c2.WORD LIKE '%' 
    INNER JOIN orderedflatcorpus c3 
    ON c3.ID = c1.ID + 2 AND 
     c3.WORD LIKE '%' 
    INNER JOIN orderedflatcorpus c4 
    ON c4.ID = c1.ID + 3 AND 
     c4.pos LIKE 'v%' 
    INNER JOIN orderedflatcorpus c5 
    ON c5.ID = c1.ID + 4 AND 
     c5.pos = 'appge' 
    INNER JOIN orderedflatcorpus c6 
    ON c6.ID = c1.ID + 5 AND 
     c6.lemma = 'way' 
    INNER JOIN orderedflatcorpus c7 
    ON c7.ID = c1.ID + 6 AND 
     c7.pos LIKE 'i%' 
    INNER JOIN orderedflatcorpus c8 
    ON c8.ID = c1.ID + 7 AND 
     c8.word = 'the' 
    INNER JOIN orderedflatcorpus c9 
    ON c9.ID = c1.ID + 8 AND 
     c9.pos LIKE 'n%' 
    INNER JOIN orderedflatcorpus c10 
    ON c10.ID = c1.ID + 9 AND 
     c10.WORD LIKE '%' 
    INNER JOIN orderedflatcorpus c11 
    ON c11.ID = c1.ID + 10 AND 
     c11.WORD LIKE '%' 
WHERE c1.WORD LIKE '%' 
ORDER BY c1.id 

OK, first of all: all those LIKEs are killing this query. Let's eliminate them where we can. I am going to assume here that WORD cannot be NULL in ORDEREDFLATCORPUS, so all the LIKE '%' conditions can be eliminated:

SELECT c1.source, c1.word, c2.word, c3.word, c4.word, 
     c4.lemma, c4.pos, c5.word, c6.word, c7.word, 
     c8.word, c9.word, c9.lemma, c9.pos, c10.word, c11.word 
    FROM orderedflatcorpus c1 
    INNER JOIN orderedflatcorpus c2 
    ON c2.ID = c1.ID + 1 
    INNER JOIN orderedflatcorpus c3 
    ON c3.ID = c1.ID + 2 
    INNER JOIN orderedflatcorpus c4 
    ON c4.ID = c1.ID + 3 AND 
     c4.pos LIKE 'v%' 
    INNER JOIN orderedflatcorpus c5 
    ON c5.ID = c1.ID + 4 AND 
     c5.pos = 'appge' 
    INNER JOIN orderedflatcorpus c6 
    ON c6.ID = c1.ID + 5 AND 
     c6.lemma = 'way' 
    INNER JOIN orderedflatcorpus c7 
    ON c7.ID = c1.ID + 6 AND 
     c7.pos LIKE 'i%' 
    INNER JOIN orderedflatcorpus c8 
    ON c8.ID = c1.ID + 7 AND 
     c8.word = 'the' 
    INNER JOIN orderedflatcorpus c9 
    ON c9.ID = c1.ID + 8 AND 
     c9.pos LIKE 'n%' 
    INNER JOIN orderedflatcorpus c10 
    ON c10.ID = c1.ID + 9 
    INNER JOIN orderedflatcorpus c11 
    ON c11.ID = c1.ID + 10 
ORDER BY c1.id 

If WORD can be NULL, then you need to use:

SELECT c1.source, c1.word, c2.word, c3.word, c4.word, 
     c4.lemma, c4.pos, c5.word, c6.word, c7.word, 
     c8.word, c9.word, c9.lemma, c9.pos, c10.word, c11.word 
    FROM orderedflatcorpus c1 
    INNER JOIN orderedflatcorpus c2 
    ON c2.ID = c1.ID + 1 AND 
     c2.WORD IS NOT NULL 
    INNER JOIN orderedflatcorpus c3 
    ON c3.ID = c1.ID + 2 AND 
     c3.WORD IS NOT NULL 
    INNER JOIN orderedflatcorpus c4 
    ON c4.ID = c1.ID + 3 AND 
     c4.pos LIKE 'v%' 
    INNER JOIN orderedflatcorpus c5 
    ON c5.ID = c1.ID + 4 AND 
     c5.pos = 'appge' 
    INNER JOIN orderedflatcorpus c6 
    ON c6.ID = c1.ID + 5 AND 
     c6.lemma = 'way' 
    INNER JOIN orderedflatcorpus c7 
    ON c7.ID = c1.ID + 6 AND 
     c7.pos LIKE 'i%' 
    INNER JOIN orderedflatcorpus c8 
    ON c8.ID = c1.ID + 7 AND 
     c8.word = 'the' 
    INNER JOIN orderedflatcorpus c9 
    ON c9.ID = c1.ID + 8 AND 
     c9.pos LIKE 'n%' 
    INNER JOIN orderedflatcorpus c10 
    ON c10.ID = c1.ID + 9 AND 
     c10.WORD IS NOT NULL 
    INNER JOIN orderedflatcorpus c11 
    ON c11.ID = c1.ID + 10 AND 
     c11.WORD IS NOT NULL 
WHERE c1.WORD IS NOT NULL 
ORDER BY c1.id 
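
The replacement of LIKE '%' by IS NOT NULL can be checked on a toy table. This is a sketch using Python's built-in sqlite3 as a stand-in for PostgreSQL; the table `t` and its three rows are made up. The pattern '%' matches every non-NULL string, including the empty one, while '_%' requires at least one character.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, word TEXT)")
# One ordinary word, one empty string, one NULL.
con.executemany("INSERT INTO t VALUES (?, ?)", [(1, "way"), (2, ""), (3, None)])


def ids(where):
    """Return the ids of the rows that satisfy the given WHERE condition."""
    return [r[0] for r in con.execute(f"SELECT id FROM t WHERE {where} ORDER BY id")]


like_any = ids("word LIKE '%'")        # every non-NULL row, empty string included
not_null = ids("word IS NOT NULL")     # the same set of rows
like_one_plus = ids("word LIKE '_%'")  # only rows with at least one character
print(like_any, not_null, like_one_plus)
```

So the LIKE '%' conditions filter out nothing except NULLs, which is why they can be dropped entirely when WORD is declared NOT NULL.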

Next, this query needs to filter as much as possible as early as possible. The database query optimizer may be able to figure this out, but we will give it some help by putting the equijoins first in the join list, and then adjusting the ID calculations to reflect the information we get first:

SELECT c1.source, c1.word, c2.word, c3.word, c4.word, 
     c4.lemma, c4.pos, c5.word, c6.word, c7.word, 
     c8.word, c9.word, c9.lemma, c9.pos, c10.word, c11.word 
    -- PostgreSQL has no DUAL table, so the query starts directly from c5, 
    -- the row carrying the selective pos = 'appge' condition (see WHERE). 
    FROM orderedflatcorpus c5 
    INNER JOIN orderedflatcorpus c6 
    ON c6.ID = c5.ID + 1 AND 
     c6.lemma = 'way' 
    INNER JOIN orderedflatcorpus c8 
    ON c8.ID = c5.ID + 3 AND 
     c8.word = 'the' 
    INNER JOIN orderedflatcorpus c1 
    ON c1.ID = c5.ID - 4 
    INNER JOIN orderedflatcorpus c2 
    ON c2.ID = c5.ID - 3 
    INNER JOIN orderedflatcorpus c3 
    ON c3.ID = c5.ID - 2 
    INNER JOIN orderedflatcorpus c4 
    ON c4.ID = c5.ID - 1 AND 
     c4.pos LIKE 'v%' 
    INNER JOIN orderedflatcorpus c7 
    ON c7.ID = c5.ID + 2 AND 
     c7.pos LIKE 'i%' 
    INNER JOIN orderedflatcorpus c9 
    ON c9.ID = c5.ID + 4 AND 
     c9.pos LIKE 'n%' 
    INNER JOIN orderedflatcorpus c10 
    ON c10.ID = c5.ID + 5 
    INNER JOIN orderedflatcorpus c11 
    ON c11.ID = c5.ID + 6 
WHERE c5.pos = 'appge' 
ORDER BY c1.id 
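
Since the ID arithmetic is easy to get wrong, here is a small check that the c5-anchored form returns the same rows as the c1-anchored one. It is a sketch with Python's built-in sqlite3 standing in for PostgreSQL; the table name `corpus` and the toy rows are assumptions modeled on the question's sample, and c2, c3, c10 and c11 are left out for brevity because they carry no filter. It says nothing about speed, which depends on the real planner and data.

```python
import sqlite3

# Toy rows (id, word, lemma, pos) modeled on the question's sample data.
rows = [
    (5, "the", "the", "at"), (6, "lock", "lock", "nn1"),
    (7, "she", "she", "appge"), (8, "pushed", "push", "vvd"),
    (9, "her", "her", "appge"), (10, "way", "way", "nn1"),
    (11, "into", "into", "ii"), (12, "the", "the", "at"),
    (13, "house", "house", "nn1"), (14, ".", ".", "."),
    (15, "She", "she", "appge"),
]
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE corpus (id INTEGER PRIMARY KEY, word TEXT, lemma TEXT, pos TEXT)")
con.executemany("INSERT INTO corpus VALUES (?, ?, ?, ?)", rows)

# Offsets counted from c1, as in the original query.
anchored_on_c1 = con.execute("""
SELECT c1.id, c4.word, c9.word
FROM corpus c1
JOIN corpus c4 ON c4.id = c1.id + 3 AND c4.pos LIKE 'v%'
JOIN corpus c5 ON c5.id = c1.id + 4 AND c5.pos = 'appge'
JOIN corpus c6 ON c6.id = c1.id + 5 AND c6.lemma = 'way'
JOIN corpus c7 ON c7.id = c1.id + 6 AND c7.pos LIKE 'i%'
JOIN corpus c8 ON c8.id = c1.id + 7 AND c8.word = 'the'
JOIN corpus c9 ON c9.id = c1.id + 8 AND c9.pos LIKE 'n%'
ORDER BY c1.id
""").fetchall()

# Offsets counted from c5, with the equality filters listed first.
anchored_on_c5 = con.execute("""
SELECT c1.id, c4.word, c9.word
FROM corpus c5
JOIN corpus c6 ON c6.id = c5.id + 1 AND c6.lemma = 'way'
JOIN corpus c8 ON c8.id = c5.id + 3 AND c8.word = 'the'
JOIN corpus c1 ON c1.id = c5.id - 4
JOIN corpus c4 ON c4.id = c5.id - 1 AND c4.pos LIKE 'v%'
JOIN corpus c7 ON c7.id = c5.id + 2 AND c7.pos LIKE 'i%'
JOIN corpus c9 ON c9.id = c5.id + 4 AND c9.pos LIKE 'n%'
WHERE c5.pos = 'appge'
ORDER BY c1.id
""").fetchall()
print(anchored_on_c1, anchored_on_c5)
```

Both forms find the single construction starting at id 5, so the shifted offsets agree on this sample.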

Next, we need to consider which indexes would be most useful. I think the following indexes would be worth having:

(ID) 
(ID, WORD) 
(ID, LEMMA) 
(ID, POS) 

Put these indexes in place, run this query, and see if it helps. Also check the ID calculations: I think I got them right, but ... :-)

Good luck.
