In case of the OSA scenario, there are lots of different filters and due to the possibility that a column might contain several values, like categories, arrays must be used for a proper data model. The problem is that special indexes for each array needs to be created and this apparently introduces noticeable overhead in case of query times.
Since GIN indexes can be thought of inverted lists, the hypothesis is that a multi column that contains all filters in a single array might help to reduce the cardinality.
The experiment it as follows and the result is that the multi query should run much faster without lower costs:
create table oneshot_multi as select * from oneshot; alter table oneshot_multi add column multi text[]; update oneshot set multi=(picalike_gender || picalike_category || clusters::text[] || array[shop_id] || updated_weeks::text[]); create index super_index on oneshot_multi using gin(multi); VACUUM (ANALYZE, VERBOSE) oneshot_multi;
Hint: the multi column is of type text and thus integers must be casted to strings.
Then we can evaluate the same query with the two modes:
explain (analyze) SELECT picalike_id, hamming(bitset, b'00010111110110011010110001011001111010101011001110101000001111000100111100110101110001000000001111011111111011100111100101010100010110011011001011110110110111001101010010001101101111100110101110111011100101111000101011001101100000001101001010101110111101101011101011110000100111000010101101100101010110110111101110100011111110110000110010100101100110001011011010000000010101010100011010111001101111001111011001000011010110101010111100010100111101100101000001001001000011011000000100011000011001111011010011010000') AS dist FROM oneshot WHERE 1=1 AND shop_id IN ('pundc_de_crawler', 'soliver_de_crawler') AND picalike_groups && array[86] AND clusters @> array[70] AND hamming(bitset, b'00010111110110011010110001011001111010101011001110101000001111000100111100110101110001000000001111011111111011100111100101010100010110011011001011110110110111001101010010001101101111100110101110111011100101111000101011001101100000001101001010101110111101101011101011110000100111000010101101100101010110110111101110100011111110110000110010100101100110001011011010000000010101010100011010111001101111001111011001000011010110101010111100010100111101100101000001001001000011011000000100011000011001111011010011010000') <= 103 ORDER BY dist ASC LIMIT 10;
and now the new one:
explain (analyze) SELECT picalike_id, hamming(bitset, b'00010111110110011010110001011001111010101011001110101000001111000100111100110101110001000000001111011111111011100111100101010100010110011011001011110110110111001101010010001101101111100110101110111011100101111000101011001101100000001101001010101110111101101011101011110000100111000010101101100101010110110111101110100011111110110000110010100101100110001011011010000000010101010100011010111001101111001111011001000011010110101010111100010100111101100101000001001001000011011000000100011000011001111011010011010000') AS dist FROM oneshot_multi WHERE 1=1 AND multi && array['pundc_de_crawler', 'soliver_de_crawler'] AND multi && array['fashion_top_t-shirt'] AND multi @> array['70'] AND hamming(bitset, b'00010111110110011010110001011001111010101011001110101000001111000100111100110101110001000000001111011111111011100111100101010100010110011011001011110110110111001101010010001101101111100110101110111011100101111000101011001101100000001101001010101110111101101011101011110000100111000010101101100101010110110111101110100011111110110000110010100101100110001011011010000000010101010100011010111001101111001111011001000011010110101010111100010100111101100101000001001001000011011000000100011000011001111011010011010000') <= 103 ORDER BY dist ASC LIMIT 10;
HINT: “@>” does not make sense for the multi approach … cluster needs to be like this c_{no}_{id}
Keywords: postgresql, psql pgsql
Due to the nature of GIN indexes, 'index only' scans are not possible.
We switched from a oneshot table to a materialized view and it worked as expected. The query runtime is the same as on the table and we can refresh it concurrently. However, after some time, the query runtime, compared to a 'pure table' query is about 8-10 times longer. The index data for the materialized view is properly updated and enforced with vaccum analyze, but this does not help.
As a temporary kludge, we created a second materialized view + indexes + vacuum analyze and then we dropped the old object and rename the test object:
alter materialized view oneshot_test rename to oneshot;
But this is NO solution and we need to find out why this happens in the first place.
https://www.alibabacloud.com/blog/optimizing-real-time-tagging-on-postgresql_594689
Especially the paragraph 'Method 1: GIN Composite Indexes'