Django query with annotation and conditional count too slow
I have a query with annotations, counts and conditional expressions that runs very slowly; it takes forever.
I have two models: one stores Instagram publications and the other stores Twitter publications. Each publication also has a FK to another model that represents a hexagonal geographic area within a city.
Publication [FK] -> HexCityArea
TwitterPublication [FK] -> HexCityArea
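For reference, the relevant parts of the models look roughly like this (a sketch only; the field names are inferred from the generated SQL further down and may not match exactly):
from django.contrib.gis.db import models

class HexagonalCityArea(models.Model):
    geom = models.PolygonField()
    city = models.ForeignKey('City', on_delete=models.CASCADE)

class Publication(models.Model):  # Instagram publications
    hexagon = models.ForeignKey(HexagonalCityArea, on_delete=models.CASCADE)
    location = models.ForeignKey('InstagramLocation', on_delete=models.CASCADE)
    publication_date = models.DateTimeField()

class TwitterPublication(models.Model):
    hexagon = models.ForeignKey(HexagonalCityArea, on_delete=models.CASCADE)
    location = models.ForeignKey('TwitterLocation', on_delete=models.CASCADE)
    publication_date = models.DateTimeField()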
I am trying to count the publications for each hexagon, but the publications are pre-filtered by other fields such as the date, so the code is:
from django.db.models import Case, Count, IntegerField, When

instagram_publications_ids = list(instagram_publications.values_list('id', flat=True))
twitter_publications_ids = list(twitter_publications.values_list('id', flat=True))
print "\n[HEXAGONS QUERY]> List of publications ids insta\n %s \n" % instagram_publications.query
print instagram_publications.explain()
print "\n[HEXAGONS QUERY]> List of publications ids twitter\n %s \n" % twitter_publications.query
print twitter_publications.explain()
# Get count of publications by hexagon
resultant_hexagons = HexagonalCityArea.objects.filter(city=city).annotate(
instagram_count=Count(Case(
When(publication__id__in=instagram_publications_ids, then=1),
output_field=IntegerField(),
))
).annotate(
twitter_count=Count(Case(
When(twitterpublication__id__in=twitter_publications_ids, then=1),
output_field=IntegerField(),
))
)#filter(instagram_count__gt=0).filter(twitter_count__gt=0) # Discard empty hexagons
# For debug only
print "\n[HEXAGONS QUERY]> Count of publications\n %s \n" % resultant_hexagons.query
print resultant_hexagons.explain()
resultant_hexagons_list = list(resultant_hexagons)
# Iterate remaining hexagons
city_hexagons = [h for h in resultant_hexagons_list if h.instagram_count > 0 or h.twitter_count > 0]
As you can see, I first get the list of IDs of the selected publications and then use it to count only those publications.
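For context, instagram_publications and twitter_publications are already-filtered querysets built along these lines (a rough sketch reconstructed from the generated SQL further down; the exact lookups and the start_date/end_date variables are assumptions):
instagram_publications = Publication.objects.filter(
    location__spot__city__name='Durban',
    publication_date__range=(start_date, end_date),
)
twitter_publications = TwitterPublication.objects.filter(
    location__spot__city__name='Durban',
    publication_date__range=(start_date, end_date),
)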
One problem I see is that the list of IDs is very long, around 28,000 elements. However, if I do not use the list of IDs I do not get the desired results: the count condition does not work properly and all publications of the city are counted.
I tried to avoid this as follows:
resultant_hexagons = HexagonalCityArea.objects.filter(city=city).annotate(
instagram_count=Count(Case(
When(publication__in=instagram_publications, then=1),
output_field=IntegerField(),
))
).annotate(
twitter_count=Count(Case(
When(twitterpublication__in=twitter_publications, then=1),
output_field=IntegerField(),
))
).filter(instagram_count__gt=0).filter(twitter_count__gt=0) # Discard empty hexagons
# For debug only
print "\n[HEXAGONS QUERY]> Count of publications\n %s \n" % resultant_hexagons.query
print resultant_hexagons.explain()
Here is the generated SQL:
SELECT
"instanalysis_hexagonalcityarea"."id",
"instanalysis_hexagonalcityarea"."created",
"instanalysis_hexagonalcityarea"."modified",
"instanalysis_hexagonalcityarea"."geom",
"instanalysis_hexagonalcityarea"."city_id",
COUNT(
CASE
WHEN
"instanalysis_publication"."id" IN
(
SELECT
U0."id"
FROM
"instanalysis_publication" U0
INNER JOIN
"instanalysis_instagramlocation" U1
ON (U0."location_id" = U1."id")
INNER JOIN
"instanalysis_spot" U2
ON (U1."spot_id" = U2."id")
INNER JOIN
"instanalysis_city" U3
ON (U2."city_id" = U3."id")
WHERE
(
U3."name" = Durban
AND U0."publication_date" >= 2016-12-01 00:00:00+01:00
AND U0."publication_date" <= 2016-12-11 00:00:00+01:00
)
)
THEN
1
ELSE
NULL
END
) AS "instagram_count", COUNT(
CASE
WHEN
"instanalysis_twitterpublication"."id" IN
(
SELECT
U0."id"
FROM
"instanalysis_twitterpublication" U0
INNER JOIN
"instanalysis_twitterlocation" U1
ON (U0."location_id" = U1."id")
INNER JOIN
"instanalysis_spot" U2
ON (U1."spot_id" = U2."id")
INNER JOIN
"instanalysis_city" U3
ON (U2."city_id" = U3."id")
WHERE
(
U3."name" = Durban
AND U0."publication_date" >= 2016-12-01 00:00:00+01:00
AND U0."publication_date" <= 2016-12-11 00:00:00+01:00
)
)
THEN
1
ELSE
NULL
END
) AS "twitter_count"
FROM
"instanalysis_hexagonalcityarea"
LEFT OUTER JOIN
"instanalysis_publication"
ON ("instanalysis_hexagonalcityarea"."id" = "instanalysis_publication"."hexagon_id")
LEFT OUTER JOIN
"instanalysis_twitterpublication"
ON ("instanalysis_hexagonalcityarea"."id" = "instanalysis_twitterpublication"."hexagon_id")
WHERE
"instanalysis_hexagonalcityarea"."city_id" = 7
GROUP BY
"instanalysis_hexagonalcityarea"."id"
HAVING
(COUNT(
CASE
WHEN
"instanalysis_publication"."id" IN
(
SELECT
U0."id"
FROM
"instanalysis_publication" U0
INNER JOIN
"instanalysis_instagramlocation" U1
ON (U0."location_id" = U1."id")
INNER JOIN
"instanalysis_spot" U2
ON (U1."spot_id" = U2."id")
INNER JOIN
"instanalysis_city" U3
ON (U2."city_id" = U3."id")
WHERE
(
U3."name" = Durban
AND U0."publication_date" >= 2016-12-01 00:00:00+01:00
AND U0."publication_date" <= 2016-12-11 00:00:00+01:00
)
)
THEN
1
ELSE
NULL
END
) > 0
AND COUNT(
CASE
WHEN
"instanalysis_twitterpublication"."id" IN
(
SELECT
U0."id"
FROM
"instanalysis_twitterpublication" U0
INNER JOIN
"instanalysis_twitterlocation" U1
ON (U0."location_id" = U1."id")
INNER JOIN
"instanalysis_spot" U2
ON (U1."spot_id" = U2."id")
INNER JOIN
"instanalysis_city" U3
ON (U2."city_id" = U3."id")
WHERE
(
U3."name" = Durban
AND U0."publication_date" >= 2016-12-01 00:00:00+01:00
AND U0."publication_date" <= 2016-12-11 00:00:00+01:00
)
)
THEN
1
ELSE
NULL
END
) > 0)
This is considerably faster; see the EXPLAIN ANALYZE output below:
GroupAggregate (cost=1.14..743590.08 rows=3300 width=184) (actual time=5186.606..46907.530 rows=334 loops=1)
Group Key: instanalysis_hexagonalcityarea.id
Filter: ((count(CASE WHEN (hashed SubPlan 3) THEN 1 ELSE NULL::integer END) > 0) AND (count(CASE WHEN (hashed SubPlan 4) THEN 1 ELSE NULL::integer END) > 0))
Rows Removed by Filter: 2966
-> Merge Left Join (cost=1.14..320194.96 rows=7166797 width=184) (actual time=4851.792..17369.232 rows=70436610 loops=1)
Merge Cond: (instanalysis_hexagonalcityarea.id = instanalysis_publication.hexagon_id)
-> Merge Left Join (cost=0.71..21686.40 rows=49328 width=180) (actual time=109.033..164.451 rows=30857 loops=1)
Merge Cond: (instanalysis_hexagonalcityarea.id = instanalysis_twitterpublication.hexagon_id)
-> Index Scan using instanalysis_hexagonalcityarea_pkey on instanalysis_hexagonalcityarea (cost=0.29..591.47 rows=3300 width=176) (actual time=22.783..23.878 rows=3300 loops=1)
Filter: (city_id = 7)
Rows Removed by Filter: 7282
-> Index Scan using instanalysis_twitterpublication_5c78aecb on instanalysis_twitterpublication (cost=0.42..64392.25 rows=504291 width=8) (actual time=0.018..111.677 rows=170305 loops=1)
-> Materialize (cost=0.43..501402.61 rows=3754731 width=8) (actual time=0.011..6788.670 rows=71922153 loops=1)
-> Index Scan using instanalysis_publication_5c78aecb on instanalysis_publication (cost=0.43..492015.78 rows=3754731 width=8) (actual time=0.005..4034.838 rows=1778030 loops=1)
SubPlan 1
-> Nested Loop (cost=0.72..105061.24 rows=27624 width=4) (actual time=0.326..74.024 rows=21824 loops=1)
-> Nested Loop (cost=0.29..620.11 rows=2767 width=4) (actual time=0.024..2.915 rows=3374 loops=1)
-> Nested Loop (cost=0.00..143.13 rows=504 width=4) (actual time=0.016..0.618 rows=829 loops=1)
Join Filter: (u2.city_id = u3.id)
Rows Removed by Join Filter: 3350
-> Seq Scan on instanalysis_city u3 (cost=0.00..1.10 rows=1 width=4) (actual time=0.004..0.006 rows=1 loops=1)
Filter: ((name)::text = 'Durban'::text)
Rows Removed by Filter: 7
-> Seq Scan on instanalysis_spot u2 (cost=0.00..89.79 rows=4179 width=8) (actual time=0.001..0.242 rows=4179 loops=1)
-> Index Scan using instanalysis_instagramlocation_e72b53d4 on instanalysis_instagramlocation u1 (cost=0.29..0.89 rows=6 width=8) (actual time=0.001..0.002 rows=4 loops=829)
Index Cond: (spot_id = u2.id)
-> Index Scan using instanalysis_publication_e274a5da on instanalysis_publication u0 (cost=0.43..37.45 rows=30 width=8) (actual time=0.006..0.021 rows=6 loops=3374)
Index Cond: (location_id = u1.id)
Filter: ((publication_date >= '2016-11-30 23:00:00+00'::timestamp with time zone) AND (publication_date <= '2016-12-10 23:00:00+00'::timestamp with time zone))
Rows Removed by Filter: 80
SubPlan 2
-> Hash Join (cost=2595.62..25893.51 rows=9013 width=4) (actual time=22.511..73.141 rows=6220 loops=1)
Hash Cond: (u0_1.location_id = u1_1.id)
-> Seq Scan on instanalysis_twitterpublication u0_1 (cost=0.00..22927.36 rows=74772 width=8) (actual time=15.212..59.628 rows=75775 loops=1)
Filter: ((publication_date >= '2016-11-30 23:00:00+00'::timestamp with time zone) AND (publication_date <= '2016-12-10 23:00:00+00'::timestamp with time zone))
Rows Removed by Filter: 428516
-> Hash (cost=2348.24..2348.24 rows=19790 width=4) (actual time=6.538..6.538 rows=15589 loops=1)
Buckets: 32768 Batches: 1 Memory Usage: 805kB
-> Nested Loop (cost=0.70..2348.24 rows=19790 width=4) (actual time=0.023..5.052 rows=15589 loops=1)
-> Nested Loop (cost=0.28..39.28 rows=504 width=4) (actual time=0.015..0.186 rows=829 loops=1)
-> Seq Scan on instanalysis_city u3_1 (cost=0.00..1.10 rows=1 width=4) (actual time=0.003..0.004 rows=1 loops=1)
Filter: ((name)::text = 'Durban'::text)
Rows Removed by Filter: 7
-> Index Scan using instanalysis_spot_c7141997 on instanalysis_spot u2_1 (cost=0.28..33.14 rows=504 width=8) (actual time=0.010..0.124 rows=829 loops=1)
Index Cond: (city_id = u3_1.id)
-> Index Scan using instanalysis_twitterlocation_e72b53d4 on instanalysis_twitterlocation u1_1 (cost=0.42..3.93 rows=65 width=8) (actual time=0.001..0.004 rows=19 loops=829)
Index Cond: (spot_id = u2_1.id)
SubPlan 3
-> Nested Loop (cost=0.72..105061.24 rows=27624 width=4) (actual time=0.348..80.863 rows=21824 loops=1)
-> Nested Loop (cost=0.29..620.11 rows=2767 width=4) (actual time=0.028..3.507 rows=3374 loops=1)
-> Nested Loop (cost=0.00..143.13 rows=504 width=4) (actual time=0.016..0.646 rows=829 loops=1)
Join Filter: (u2_2.city_id = u3_2.id)
Rows Removed by Join Filter: 3350
-> Seq Scan on instanalysis_city u3_2 (cost=0.00..1.10 rows=1 width=4) (actual time=0.003..0.004 rows=1 loops=1)
Filter: ((name)::text = 'Durban'::text)
Rows Removed by Filter: 7
-> Seq Scan on instanalysis_spot u2_2 (cost=0.00..89.79 rows=4179 width=8) (actual time=0.001..0.276 rows=4179 loops=1)
-> Index Scan using instanalysis_instagramlocation_e72b53d4 on instanalysis_instagramlocation u1_2 (cost=0.29..0.89 rows=6 width=8) (actual time=0.001..0.003 rows=4 loops=829)
Index Cond: (spot_id = u2_2.id)
-> Index Scan using instanalysis_publication_e274a5da on instanalysis_publication u0_2 (cost=0.43..37.45 rows=30 width=8) (actual time=0.007..0.022 rows=6 loops=3374)
Index Cond: (location_id = u1_2.id)
Filter: ((publication_date >= '2016-11-30 23:00:00+00'::timestamp with time zone) AND (publication_date <= '2016-12-10 23:00:00+00'::timestamp with time zone))
Rows Removed by Filter: 80
SubPlan 4
-> Hash Join (cost=2595.62..25893.51 rows=9013 width=4) (actual time=41.392..92.680 rows=6220 loops=1)
Hash Cond: (u0_3.location_id = u1_3.id)
-> Seq Scan on instanalysis_twitterpublication u0_3 (cost=0.00..22927.36 rows=74772 width=8) (actual time=32.641..78.020 rows=75775 loops=1)
Filter: ((publication_date >= '2016-11-30 23:00:00+00'::timestamp with time zone) AND (publication_date <= '2016-12-10 23:00:00+00'::timestamp with time zone))
Rows Removed by Filter: 428516
-> Hash (cost=2348.24..2348.24 rows=19790 width=4) (actual time=7.907..7.907 rows=15589 loops=1)
Buckets: 32768 Batches: 1 Memory Usage: 805kB
-> Nested Loop (cost=0.70..2348.24 rows=19790 width=4) (actual time=0.044..6.136 rows=15589 loops=1)
-> Nested Loop (cost=0.28..39.28 rows=504 width=4) (actual time=0.026..0.220 rows=829 loops=1)
-> Seq Scan on instanalysis_city u3_3 (cost=0.00..1.10 rows=1 width=4) (actual time=0.006..0.008 rows=1 loops=1)
Filter: ((name)::text = 'Durban'::text)
Rows Removed by Filter: 7
-> Index Scan using instanalysis_spot_c7141997 on instanalysis_spot u2_3 (cost=0.28..33.14 rows=504 width=8) (actual time=0.016..0.135 rows=829 loops=1)
Index Cond: (city_id = u3_3.id)
-> Index Scan using instanalysis_twitterlocation_e72b53d4 on instanalysis_twitterlocation u1_3 (cost=0.42..3.93 rows=65 width=8) (actual time=0.001..0.005 rows=19 loops=829)
Index Cond: (spot_id = u2_3.id)
Planning time: 50.735 ms
Execution time: 46908.482 ms
The problem is that I do not get what I want: it seems to count more publications than it should. The publications are pre-filtered by date, and I only want to count how many of those filtered publications fall in each hexagon, but it seems to count all publications per hexagon, as if the When clause had no effect.
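For what it is worth, one alternative I have been considering is to count from the publication side, so that the two reverse relations are never joined in the same query (the Merge Left Join in the plan above materialises about 70 million rows, so the two joins may be multiplying each other). A minimal sketch, assuming the FK on both publication models is named hexagon, as the hexagon_id column in the SQL suggests:
from django.db.models import Count

# Group the already-filtered publications by their hexagon FK, one query
# per source, and build {hexagon_id: count} dicts.
insta_counts = dict(
    instagram_publications.values('hexagon')      # GROUP BY hexagon_id
                          .annotate(n=Count('id'))
                          .values_list('hexagon', 'n')
)
twitter_counts = dict(
    twitter_publications.values('hexagon')
                        .annotate(n=Count('id'))
                        .values_list('hexagon', 'n')
)

# Keep only hexagons that have at least one publication of either kind
city_hexagons = [
    h for h in HexagonalCityArea.objects.filter(city=city)
    if insta_counts.get(h.id, 0) > 0 or twitter_counts.get(h.id, 0) > 0
]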
Thanks for your help.
Why is [count aggregate](https://docs.djangoproject.com/en/1.10/topics/db/aggregation/#generating-aggregates-for-each-item-in-a-queryset) not an option? In theory, two aggregate queries with count should be much more efficient than a union query with an IN clause. – Marat
Thanks for your comment, @Marat. It is much faster, yes, but the problem is that I get incorrect results. I have updated the post with the SQL and the EXPLAIN ANALYZE output. –
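A minimal sketch of what the comment might be suggesting (not necessarily its exact intent): push the publication filters into the hexagon query itself and count the joined rows, one query per source, so that no IN (...) subquery is needed. The start_date/end_date variables and lookup names are assumptions:
from django.db.models import Count

# Because filter() comes before annotate(), Count('publication') only counts
# the publication rows matching the date filter (Django's documented
# annotate/filter ordering behaviour). The city filter on the publications
# via location__spot__city is assumed to be implied by the hexagon's city.
insta_counts = dict(
    HexagonalCityArea.objects.filter(
        city=city,
        publication__publication_date__range=(start_date, end_date),
    ).annotate(n=Count('publication')).values_list('id', 'n')
)
twitter_counts = dict(
    HexagonalCityArea.objects.filter(
        city=city,
        twitterpublication__publication_date__range=(start_date, end_date),
    ).annotate(n=Count('twitterpublication')).values_list('id', 'n')
)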