2016-08-01

Oracle: consolidate or re-normalize sets of rows

We have a table of values that has been expanded into a denormalized set, and I need to re-normalize it by finding the smallest number of reference sets.

A simplified version of the source data looks something like this:

Period Group Item Seq 
------ ----- ---- --- 
    1  A  1 1 
    1  A  2 2 
    1  A  3 3 
    1  B  1 1 
    1  B  2 2 
    1  B  3 3 
    1  C  1 1 
    1  C  4 2 
    1  C  5 3 
    1  D  2 1 
    1  D  1 2 
    1  D  3 3 
    1  E  1 1 
    1  E  2 2 
    1  F  2 1 
    1  F  1 2 
    1  F  3 3 

I want to extract the minimal number of lists defined in the data and assign each period and group a reference to its list. A list consists of an ordered sequence of items. Here are the 4 lists defined in the data above:

List Item Seq 
---- ---- --- 
    1  2 1 
    1  1 2 
    1  3 3 
    2  1 1 
    2  2 2 
    2  3 3 
    3  1 1 
    3  4 2 
    3  5 3 
    4  1 1 
    4  2 2 

and the output I want to achieve:

Period Group List 
------ ----- ---- 
    1  A  2 
    1  B  2 
    1  C  3 
    1  D  1 
    1  E  4 
    1  F  1 

I have a solution that uses ORA_HASH and LISTAGG to generate a hash over the items of each group, but it fails when the number of items in a group exceeds 400. The resulting error is ORA-01489: result of string concatenation is too long.

I am looking for a general solution that works regardless of the number of items in a group in a given period.

Items are identified by an integer value below 100,000. Realistically, we will never see more than 4000 items in a group.

This is logically similar to what works for up to 400 item records per group:

WITH  
the_source_data as (
    select 1 as the_period, 'A' as the_group, 1 as the_item, 1 as the_seq from dual union 
    select 1 as the_period, 'A' as the_group, 2 as the_item, 2 as the_seq from dual union 
    select 1 as the_period, 'A' as the_group, 3 as the_item, 3 as the_seq from dual union 
    select 1 as the_period, 'B' as the_group, 1 as the_item, 1 as the_seq from dual union 
    select 1 as the_period, 'B' as the_group, 2 as the_item, 2 as the_seq from dual union 
    select 1 as the_period, 'B' as the_group, 3 as the_item, 3 as the_seq from dual union 
    select 1 as the_period, 'C' as the_group, 1 as the_item, 1 as the_seq from dual union 
    select 1 as the_period, 'C' as the_group, 4 as the_item, 2 as the_seq from dual union 
    select 1 as the_period, 'C' as the_group, 5 as the_item, 3 as the_seq from dual union 
    select 1 as the_period, 'D' as the_group, 2 as the_item, 1 as the_seq from dual union 
    select 1 as the_period, 'D' as the_group, 1 as the_item, 2 as the_seq from dual union 
    select 1 as the_period, 'D' as the_group, 3 as the_item, 3 as the_seq from dual union 
    select 1 as the_period, 'E' as the_group, 1 as the_item, 1 as the_seq from dual union 
    select 1 as the_period, 'E' as the_group, 2 as the_item, 2 as the_seq from dual union 
    select 1 as the_period, 'F' as the_group, 2 as the_item, 1 as the_seq from dual union 
    select 1 as the_period, 'F' as the_group, 1 as the_item, 2 as the_seq from dual union 
    select 1 as the_period, 'F' as the_group, 3 as the_item, 3 as the_seq from dual   
),  
cte_list_hash as (
select 
    the_period, 
    the_group, 
    ora_hash(listagg(to_char(the_item, '00000')||to_char(the_seq, '0000')) within group (order by the_seq)) as list_hash 
from 
    the_source_data 
group by 
    the_period, 
    the_group 
), 
cte_unique_lists as 
(
select 
    list_hash, 
    min(the_period) keep (dense_rank first order by the_period, the_group) as the_period, 
    min(the_group) keep (dense_rank first order by the_period, the_group) as the_group 
from 
    cte_list_hash 
group by 
    list_hash 
), 
cte_list_base as 
(
select  
    the_period, 
    the_group, 
    list_hash, 
    rownum as the_list   
from 
    cte_unique_lists 
) 
select 
    A.the_period, 
    A.the_group, 
    B.the_list 
from 
    cte_list_hash A 
    inner join 
    cte_list_base B 
     on A.list_hash = B.list_hash; 
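
As a rough check on where the limit comes from, the query below measures how many characters each item adds to the LISTAGG string. It assumes the default 4000-byte cap on a LISTAGG result (that is, MAX_STRING_SIZE is not set to EXTENDED) and single-byte characters; both are assumptions, not something confirmed for this database.

```sql
-- TO_CHAR with a '0' format model reserves a leading position for the sign,
-- so '00000' yields 6 characters and '0000' yields 5.
select length(to_char(1, '00000') || to_char(1, '0000'))              as chars_per_item, -- 11
       floor(4000 / length(to_char(1, '00000') || to_char(1, '0000'))) as max_items      -- 363
from   dual;
```

Under those assumptions the ceiling is a few hundred items per group, which is in line with the failures seen at around 400. Trimming the padding would raise the ceiling but not remove it.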

Any help pointing me in the right direction would be greatly appreciated.

Answer

Here is one way to get your results without using LISTAGG and without the ORA-01489 error.

The main caveat is that it numbers the lists differently from your example, but that numbering seemed arbitrary to me. This version numbers them based on the ordinal position of the first period/group that uses the list. That is, the list used by group A in period 1 would be "List #1", for example.

I threw in some sample data for period 2, just to make sure that case is handled correctly as well.

Hopefully the comments in the SQL below explain the approach clearly enough.

Finally, I have no idea how long this will run on a large data set. The cross join may be problematic.

WITH  
the_source_data as (
    select 1 as the_period, 'A' as the_group, 1 as the_item, 1 as the_seq from dual union 
    select 1 as the_period, 'A' as the_group, 2 as the_item, 2 as the_seq from dual union 
    select 1 as the_period, 'A' as the_group, 3 as the_item, 3 as the_seq from dual union 
    select 1 as the_period, 'B' as the_group, 1 as the_item, 1 as the_seq from dual union 
    select 1 as the_period, 'B' as the_group, 2 as the_item, 2 as the_seq from dual union 
    select 1 as the_period, 'B' as the_group, 3 as the_item, 3 as the_seq from dual union 
    select 1 as the_period, 'C' as the_group, 1 as the_item, 1 as the_seq from dual union 
    select 1 as the_period, 'C' as the_group, 4 as the_item, 2 as the_seq from dual union 
    select 1 as the_period, 'C' as the_group, 5 as the_item, 3 as the_seq from dual union 
    select 1 as the_period, 'D' as the_group, 2 as the_item, 1 as the_seq from dual union 
    select 1 as the_period, 'D' as the_group, 1 as the_item, 2 as the_seq from dual union 
    select 1 as the_period, 'D' as the_group, 3 as the_item, 3 as the_seq from dual union 
    select 1 as the_period, 'E' as the_group, 1 as the_item, 1 as the_seq from dual union 
    select 1 as the_period, 'E' as the_group, 2 as the_item, 2 as the_seq from dual union 
    select 1 as the_period, 'F' as the_group, 2 as the_item, 1 as the_seq from dual union 
    select 1 as the_period, 'F' as the_group, 1 as the_item, 2 as the_seq from dual union 
    select 1 as the_period, 'F' as the_group, 3 as the_item, 3 as the_seq from dual union 
    select 2 as the_period, 'F' as the_group, 1 as the_item, 1 as the_seq from dual union 
    select 2 as the_period, 'F' as the_group, 4 as the_item, 2 as the_seq from dual union 
    select 2 as the_period, 'F' as the_group, 5 as the_item, 3 as the_seq from dual   

), 
-- This CTE counts the number of rows in each period and group. We need this to avoid matching a long list to a shorter list that 
-- happens to share the same order, as far as it goes. 
sd2 as (
select sd.*, count(*) over (partition by sd.the_period, sd.the_group) cnt from the_source_data sd), 
-- This CTE joins every row to every other row and then filters on matching item#, seq, and list length; 
-- it then counts the number of matches by period and group (cnt3) 
sd3 as ( 
select sd2a.the_period, sd2a.the_group, sd2a.the_item, sd2a.the_seq, sd2a.cnt, 
sd2b.the_period the_period2, sd2b.the_group the_group2, sd2b.the_item the_item2, sd2b.the_seq the_seq2, sd2b.cnt cnt2 
, count(*) over (partition by sd2a.the_period, sd2a.the_group, sd2b.the_period, sd2b.the_group) cnt3 
from sd2 sd2a cross join sd2 sd2b 
where sd2b.the_item= sd2a.the_item 
and  sd2b.the_seq = sd2a.the_seq 
and  sd2a.cnt = sd2b.cnt), 
-- This CTE filters to period, groups that had the same number of matches as elements in the original period, group. I.e., it 
-- filters to perfect list matches: all elements the same, in the same order, and the list lengths are the same. 
-- for each, it gets the first period and group # that share the list 
sd4 as ( 
select the_period, the_group, --min(the_group2) over (partition by the_period, the_group) first_in_group 
min(the_period2) keep (DENSE_RANK FIRST ORDER BY the_period2, the_group2) OVER (partition by the_period, the_group) first_period, 
min(the_group2) keep (DENSE_RANK FIRST ORDER BY the_period2, the_group2) OVER (partition by the_period, the_group) first_group 
from sd3 where cnt = cnt3) 
-- We'll arbitrarily name the lists based on the ordinal position of the first period and group that uses the list. 
select distinct the_period, the_group, dense_rank() over (partition by null order by first_period, first_group) list 
from sd4 
order by 1,2 
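
If the self-join turns out to be too slow, a different technique (not the one used above) is to hash each row separately and add the hashes up per group, so no long string is ever built. The following is a minimal sketch under stated assumptions: the sample data is trimmed to three groups, the names list_sig, item_cnt and sig are made up for illustration, and a sum of ORA_HASH values can in principle collide for two different lists, so candidate matches would still need a row-by-row check before being trusted.

```sql
-- Sketch only. the_source_data is trimmed to three groups; A and B hold the same list,
-- D holds the same items in a different order.
with the_source_data as (
    select 1 as the_period, 'A' as the_group, 1 as the_item, 1 as the_seq from dual union all
    select 1, 'A', 2, 2 from dual union all
    select 1, 'B', 1, 1 from dual union all
    select 1, 'B', 2, 2 from dual union all
    select 1, 'D', 2, 1 from dual union all
    select 1, 'D', 1, 2 from dual
),
-- One signature row per period and group: the item count plus a sum of per-row hashes.
list_sig as (
    select the_period,
           the_group,
           count(*) as item_cnt,
           -- the_seq is part of the hashed text, so the position of each item
           -- contributes to the signature
           sum(ora_hash(to_char(the_item) || ':' || to_char(the_seq))) as sig
    from   the_source_data
    group by the_period, the_group
)
-- Groups with the same count and signature get the same list number.
select the_period,
       the_group,
       dense_rank() over (order by item_cnt, sig) as the_list
from   list_sig
order by the_period, the_group;
```

Groups A and B always receive the same list number here because their rows are identical. The numbering follows the signature values rather than the first period and group that use a list, so it will not match the numbering produced by the query above.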

Thanks Matthew, this works perfectly and is fast enough for the number of rows we have in the table. Kudos!
