Postgres

We are migrating v5 to Postgres, since PostgreSQL is now a hard requirement. This page contains useful information, hints, recipes and other material related to the database.

System setup

A Debian/Ubuntu-like system is assumed:

[postgres is automatically selected due to dependency resolution]

sudo apt-get install postgresql-10-similarity postgresql-10-jsquery python3-psycopg2

Links for the individual packages:
https://github.com/eulerto/pg_similarity
https://github.com/postgrespro/jsquery
https://www.psycopg.org/docs/

To check that everything went fine, restart the service and inspect its status:
sudo su
service postgresql restart
service postgresql status
lsof -nPi | grep postgres

The configuration can be found here: /etc/postgresql/10/main/postgresql.conf

To combine a local DB with Docker, we have to ensure that the server listens on all interfaces, not just localhost. This requires a small change in '/etc/postgresql/10/main/postgresql.conf':

listen_addresses = '*'

Save the changes and restart the service:
service postgresql restart

To allow connections from a Docker network, the following row has to be added to /etc/postgresql/10/main/pg_hba.conf (root privileges required):
host all all 172.17.0.2/16 md5

Maintenance

From time to time it is important to reorganize a database, a process known as 'vacuuming'. See the documentation for more details: https://www.postgresql.org/docs/current/routine-vacuuming.html

It is also recommended [1] to perform a vacuum plus analyze whenever a lot of insert/delete statements have been performed:

VACUUM (ANALYZE) tablename;

In contrast to a plain VACUUM over the whole database, the statement above has to be issued for each table individually. The process is I/O expensive and thus should be scheduled accordingly.

[1] https://www.postgresql.org/docs/current/sql-vacuum.html

Creating a DB

The initial work needs to be done as the 'postgres' user:
sudo su
su postgres
createdb products

Now open the psql client in there; meta-commands start with "\":
psql products

\?      help
\quit   leave the psql prompt
\d      show tables

Creating a user

A new user must be created whose name matches the system user. For our setup, we create a new user with superuser privileges:
psql products
CREATE USER timo WITH PASSWORD 'timo';
GRANT ALL PRIVILEGES ON DATABASE products to timo;
ALTER USER timo WITH SUPERUSER;

Transferring Data

There are two tools that can be used:

pg_dump
pg_restore

The dump includes both metadata and the actual data.
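A minimal sketch of a dump/restore round trip, assuming the 'products' database from above and a hypothetical target database 'products_copy'; the custom format (-Fc) is what pg_restore expects:

pg_dump -Fc products > products.dump
createdb products_copy
pg_restore -d products_copy products.dump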

Multi Threading / Python

The psycopg2 docs say that psycopg2 is “level 2 thread safe”, which means that connections can be shared across threads, but cursors can't. Therefore, a connection can be shared but each thread needs a new cursor.

AsyncIO / Python

asyncpg offers great features, idiotproof pooling and great performance.

Cheat Sheet

https://zaiste.net/posts/postgresql-primer-for-busy-people/

Searching in Arrays

There are different ways to search in arrays, but by far the most popular one for our use-cases is the containment ('contains') operator '@>'. For example, let shop_category be {A,B,C} and you want to retrieve all entries with category A:
select * from dingens_table where shop_category @> array['A'];

So, what is going on here: the '@>' means that 'A' is contained in the array. It should be noted that the syntax matters: it is not array["A"] but array['A'] (single quotes). All patterns must be wrapped in an array.

In case we want all matches with two categories {A,B} the pattern changes to:
shop_category @> array['A', 'B']

The order is not important, so it does not matter whether 'A' comes first. But note that this is not an OR-query ('A' or 'B'); it means the subset {A, B} must be present. For ORs, it must be written explicitly: category @> array['A'] OR category @> array['B'].

An alternative syntax is: select * from filters where 5990 = any(prices)

Warning: the claim needs to be checked, but according to some posts 'ANY' does not use a GIN index

See https://www.postgresql.org/docs/9.6/functions-array.html for a reference

In case the data that is stored in the array can be enumerated, there is a way to convert it internally to integers via 'enum' types. See this post for more details: https://dba.stackexchange.com/questions/250923/how-to-optimize-query-for-text-array-in-postgresql-using-operator
or the PostgreSQL documentation: https://www.postgresql.org/docs/12/datatype-enum.html
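A minimal sketch of such an enum type, reusing the shop_category values from the example above (table and column names are only illustrative):

CREATE TYPE shop_category_t AS ENUM ('A', 'B', 'C');

-- hypothetical variant of the table storing enum values instead of text
CREATE TABLE dingens_table_enum (
    picalike_id   text,
    shop_category shop_category_t[]
);

-- containment queries keep working, e.g.:
-- SELECT * FROM dingens_table_enum WHERE shop_category @> array['A']::shop_category_t[];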

Efficiency

Indexing arrays is a bit special and might not work as expected: according to the documentation the whole array is indexed, not necessarily on a per-item basis. This can lead to performance penalties when a contains query is used. As a workaround, it is possible to use the intarray extension (strings can often be converted to ints by hashing or enumerating) to get a performance boost.

The syntax of the column definition does not change, but the index command is slightly different:
CREATE INDEX features_picalike_groups_idx ON features USING GIN ("picalike_groups" gin__int_ops);
instead of
CREATE INDEX features_picalike_groups_idx ON features USING GIN ("picalike_groups");

For this to work, the extension needs to be created once:
CREATE EXTENSION intarray;

As expected, this only works on integer columns and only for GIN indexes. All kinds of contains queries should now be much faster.

Additional links on how to work efficiently with arrays: https://stackoverflow.com/questions/4058731/can-postgresql-index-array-columns
https://stackoverflow.com/questions/34740721/faster-search-in-intarray-column
https://dba.stackexchange.com/questions/105316/how-to-speed-up-order-by-sorting-when-using-gin-index-in-postgresql
https://stackoverflow.com/questions/45777844/postgres-why-isnt-this-query-using-the-gin-index-on-the-array-value

Partitioning Tables

A partitioned table is separated into smaller tables by a key or a condition. For very big tables, performance can increase with partitioning:
- faster deletes - you might even drop just a sub-table/partition
- faster concurrent inserts into multiple partitions at once
- smaller indexes per partition
- partitions can even be queried individually

However, partitioned data has a huge performance penalty if queries need to access multiple partitions. Therefore partitioning is only applicable if the partitions can be laid out in a way that performance-critical queries stay within a single partition. Examples would be:
- billing data per customer
- timeseries data that is only evaluated within a year/month

Another advantage of partitioning is the possibility to use storage tiers. In a timeseries, the most recent year might be very hot data, so it would be stored on SSDs. Data from previous years would only be useful under certain conditions, is therefore cold, and can be stored on HDDs.
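A minimal sketch of declarative partitioning (available since PostgreSQL 10), using a hypothetical timeseries table partitioned by year:

CREATE TABLE measurements (
    shop_id  int,
    logdate  date,
    value    numeric
) PARTITION BY RANGE (logdate);

CREATE TABLE measurements_2020 PARTITION OF measurements
    FOR VALUES FROM ('2020-01-01') TO ('2021-01-01');

CREATE TABLE measurements_2021 PARTITION OF measurements
    FOR VALUES FROM ('2021-01-01') TO ('2022-01-01');

-- dropping a whole partition is much cheaper than a DELETE on a big table
DROP TABLE measurements_2020;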

Optimization / Lessons learned

When it comes to analyzing query runtime, cardinality reduction, or which index is actually used, EXPLAIN (ANALYZE) is by far the most important tool to understand queries. With a better understanding it is often possible to rephrase a query, adjust indexes or types, or even change the underlying data model.
The output is explained in this post: https://thoughtbot.com/blog/reading-an-explain-analyze-query-plan
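A minimal example; table and filter are only illustrative:

EXPLAIN (ANALYZE, BUFFERS)
    SELECT picalike_id FROM oneshot WHERE picalike_category @> array[42];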

The question is whether to hold everything in a single table or to divide the data into logical components. The latter has the advantage that the design is much clearer and easier, but the drawback is that joining tables can be a bottleneck for 'real-time' queries.

We evaluated a design where we use a materialized view with concurrent updates, so queries can still be served during a refresh, and there are no drawbacks. Thus, we stay with the old design of storing data in dedicated tables, and use the view to combine them.

Parallel Queries:
https://blog.programster.org/postgresql-parallel-queries
https://wiki.postgresql.org/wiki/Parallel_Query
There are quite a few config settings to modify which we need to evaluate carefully.

Materialized Views

Instead of an ordinary view, a materialized one behaves almost exactly like a table object, with the exception that you cannot update it directly. But in contrast to a table, the view can be generated on the fly by applying filter criteria on various tables.

The rough idea is that you aggregate a specific view of the data and refresh it periodically to ensure that it always has fresh data. For example, an update once a night to have fresh data in the morning, or after some accumulated updates of the base tables.

For a concurrent refresh, the only restriction is to have a column with a unique index, which is almost always given for referential integrity. The concurrent mode allows you to still query the old(!) content, and the switch is done when the refresh process has finished. That means the update is _not_ done incrementally.
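A minimal sketch, assuming hypothetical base tables 'features' and 'filters' that share a picalike_id column:

CREATE MATERIALIZED VIEW combined AS
    SELECT f.picalike_id, f.bitset, fl.prices
    FROM features f
    JOIN filters fl USING (picalike_id);

-- a unique index is required for the concurrent refresh
CREATE UNIQUE INDEX combined_picalike_id_idx ON combined (picalike_id);

REFRESH MATERIALIZED VIEW CONCURRENTLY combined;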

References:
https://www.postgresql.org/docs/12/sql-refreshmaterializedview.html
https://dba.stackexchange.com/questions/86779/refresh-materalized-view-incrementally-in-postgresql


JSON

Learning by example:
products=# select * from meta where metadata->'category' @> '[["Taschen"]]';
products=# select metadata->>'category' from meta where metadata->'category' @> '[["Taschen"]]';
products=# explain select metadata->>'category' from meta where metadata->'category' @> '[["Taschen"]]' limit 1;

These queries run against a valid product blob; the 'category' is a list of lists, which is why we need '[['.

The '?' operator checks whether a key exists in a JSON object. In our case we store the attributes of all products in 'scores'; to check whether the attribute 'black' is present, the query looks like this:

select count(*) FROM attributes a, filters f where a.picalike_id=f.picalike_id and a.metadata->'scores' ? 'black';

Specialized Indexes

In case there are queries which are highly specialized, for instance a huge customer that always fires the same query type, it is possible to build a partial index that is conditioned on a subset of the data. The syntax is as follows:
CREATE INDEX orders_unbilled_index ON orders (order_nr) WHERE billed is not true;

The documentation can be found here: https://www.postgresql.org/docs/current/indexes-partial.html

Block Range Indexes

A special type of index is the block range index (BRIN) that can be used for timestamps and many other things. Details can be found here: https://www.postgresql.org/docs/9.5/brin-intro.html
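A minimal sketch, reusing the hypothetical timeseries table from the partitioning section:

CREATE INDEX measurements_logdate_brin_idx ON measurements USING BRIN (logdate);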

Compound Indexes

The ugly truth first: a compound index might help, but often single-column indexes are as efficient as the multi-column solution, and it is not encouraged to use it for three (or more) columns. It might still make sense if the 'base columns' are present in every query. More details can be found here:
https://www.postgresql.org/docs/12/indexes-multicolumn.html

Depending on the kind of query, it makes sense to index multiple columns. This might speed up query execution, but it has the problem that when queries change, the index may not be used as expected. As usual, EXPLAIN is the way to go to ensure that the query plan looks exactly as planned and that the index is used at all.

We stumbled over slower execution times for OSA queries. What makes them special? We have a reference shop and a picalike_category, with the aim to query a list of competitors (other shops) with the same picalike_category. Furthermore, we have one or more cluster IDs. For the sake of simplicity, we just use the first cluster. Then a typical query looks like this:
select picalike_id, ham_dist(ref_bitset, bitset) from oneshot where cluster @> array[x] and picalike_category @> array[y] and shop_id IN (customer_list)
The EXPLAIN output clearly shows that there is a merge phase (AND) for the individual indexes, which is why we decided to create a compound index for picalike_category/cluster. We verified with EXPLAIN and also a realistic OSA query evaluation that the index indeed boosts performance.

In general, compound indexes can also be used for columns with different types, but not all types are equally useful for indexes. For instance, we use intarray to perform a more efficient 'contains' (@>) operation for the cluster and the category.

Bottom line: whenever multiple filters are always present in a query, it makes sense to think about indexes that span multiple columns. Also ensure that the index reduces the cardinality; otherwise, in our case, sorting and the distance calculation have to work on a larger rowset.

In PostgreSQL, multi-column indexes are nothing special and can be created like this:

CREATE INDEX oneshot_group_cluster_idx on oneshot USING GIN("clusters" gin__int_ops, "picalike_groups" gin__int_ops);

The special 'gin__int_ops' flag just indicates that the intarray extension should be used, but it also works without it:
CREATE INDEX oneshot_group_cluster_idx on oneshot USING GIN("clusters" , "shop_category");

But it should be noted that GIN indexes on non-integer data are not optimal when it comes to query performance [1].

[1] https://stackoverflow.com/questions/4058731/can-postgresql-index-array-columns

Feature

It should be noted that GIN multi-column indexes are also used when just one of the fields is present in the query. This is in contrast to BTREE indexes and is not an error.

Limitations

It is not possible to combine different index types in a compound index, e.g. BTREE (default) and GIN.
Open question: is it possible to use a compound index on columns that are queried with different operators (e.g. '=' and '@>')?

Recheck Conditions

When lossy heap blocks show up in the output ('Heap Blocks: lossy=...'), this indicates that work_mem is too low. In general, though, the Recheck Cond line is always present. For more details:
https://stackoverflow.com/questions/50959814/what-does-recheck-cond-in-explain-result-mean
https://dba.stackexchange.com/questions/106264/recheck-cond-line-in-query-plans-with-a-bitmap-index-scan

Planning Your Query

To really understand the output of EXPLAIN, one must know the different scan types and how joins are executed. In general, joins can be executed with different methods:

- nested loop join
- hash join
- merge join

And there are different scan types:

- sequential scan
- index scan / index-only scan
- bitmap index scan + bitmap heap scan

It should be noted that the operators <=, >= do not support efficient joins, see the Stack Overflow posting in the references. This is relevant for us if we use range queries, like 'price'.

References:
http://tuning.postgres.de/postgresql_joins_planen
https://en.wikipedia.org/wiki/Bitmap_index
https://stackoverflow.com/questions/6592626/what-is-a-bitmap-heap-scan-in-a-query-plan
https://stackoverflow.com/questions/46510860/postgresql-avoid-nested-loop-with-join

Docker

Using extensions is a bit more complicated since, for instance, postgresql-x-similarity is not available from the Debian/Alpine repositories. An example can be found in v5_backend:

RUN apt-get update && apt-get upgrade -y && apt-get -y install \
  postgresql-server-dev-12 git make gcc \
  && rm -rf /var/lib/apt/lists/* \
  && git clone https://github.com/eulerto/pg_similarity.git \
  && cd pg_similarity \
  && USE_PGXS=1 make \
  && USE_PGXS=1 make install

HINT: 'postgresql-server-dev-12' must match the major version of PostgreSQL in the docker image, which would be 13 by default, but we are still using 12.

This command downloads and builds the extension and also registers it. The key to success is that the code already follows the extension structure of PostgreSQL.

Furthermore, the default shared memory size is way too low, thus always pass something like the following to docker run:

--shm-size 1G

where the actual value depends on the RAM to use for the instance.

Permissions

Since we store the data outside the docker container, we need to ensure that the permissions are set properly. To do this, we map the postgres user from the container to an existing user outside of docker. The uid/gid has to be set during the build process:

RUN addgroup --gid $GROUP_ID user
RUN adduser --disabled-password --gecos "" --uid $USER_ID --gid $GROUP_ID user
RUN usermod -o -g $GROUP_ID -u $USER_ID postgres
We create a new user/group with the provided arguments and overwrite the uid/gid of the existing postgres user with these values. The fact that the uid is not unique any longer is of no concern, since we never use the new user at all.
docker build --build-arg USER_ID=$(id -u) --build-arg GROUP_ID=$(id -g)
In the Dockerfile, each variable must be declared as 'ARG USER_ID' for example.

==== Setup ====

Docker image (with quick reference): https://hub.docker.com/_/postgres

Download image and run container with:
docker run --name <container_name> -e POSTGRES_PASSWORD=<password> -d postgres
-> this will download the latest stable release (12.4 as of 14.09.2020) and start a container with the default user "postgres".

Connect:
docker exec -it <container_name> psql -U <user>
-> where you can run psql commands as usual.

Run commands from outside of the container:
docker exec -it <container_name> psql -U <user> -c "SQL_command;"

Setup scripts (compare quick reference): If you would like to do additional initialization in an image derived from this one, add one or more *.sql, *.sql.gz, or *.sh scripts under /docker-entrypoint-initdb.d (creating the directory if necessary). After the entrypoint calls initdb to create the default postgres user and database, it will run any *.sql files, run any executable *.sh scripts, and source any non-executable *.sh scripts found in that directory to do further initialization before starting the service. Scripts in /docker-entrypoint-initdb.d are only run if you start the container with a data directory that is empty. Examples for user setup may also be found in the quick reference.
To check the connection use:
psql --db products --host localhost --port 5401 --user docker

==== Additional Infos ====

For some images a kludge might be required to install psycopg2 via pip:
RUN pip install psycopg2-binary
To be successful, we need to install some additional packages:
RUN apk update && apk add postgresql-dev gcc python3-dev musl-dev
===== RAM-Disk =====

To emulate the memory tables from MariaDB, there are several options. First, it should be noted that the table is not updated often. There are no spontaneous inserts, just a batch insert to update the features. Thus, we are mainly concerned with read performance. One option would be to store the database entirely on a RAM disk, but as pointed out, in case of 'read-only' workloads there are other options, like 'shared_buffers'. Still, it would be possible to create a tablespace on a RAM disk, but we need to figure out if this really boosts the performance. The general recipe looks like this:
mount -t tmpfs -o size=1500M none /ramdisk
mkdir /ramdisk/pg
chown postgres.postgres /ramdisk/pg
chmod 700 /ramdisk/pg
create tablespace foo_ramdisk location '/ramdisk/pg';
create database foo_test tablespace = foo_ramdisk;
create table my_ram_table(..) TABLESPACE foo_ramdisk;

==== Should you use RAM disk? ====

There are some voices against RAM disks, the main issues (seem to) being data safety (having parts of the data on a RAM disk and a crash there will influence/corrupt the complete system), limited storage (if more storage than allowed is required → error) and limited advantages (e.g. compared to a big enough shared buffer).
https://stackoverflow.com/questions/50701725/postgresql-on-a-ram-disk
http://blog.vorona.ca/postgresql-on-ram-disk-does-not-worth-the-trouble.html
https://schinckel.net/2019/09/25/speeding-up-postgres-using-a-ram-disk/
https://www.2ndquadrant.com/en/blog/postgresql-no-tablespaces-on-ramdisks/
We evaluated the query performance with a single feature table and a query that determines the hamming distance with respect to some category filters. The runtime for the normal database and the database on the RAM disk did not differ. This is not conclusive, but evidence that RAM disks do not provide advantages out-of-the-box.

===== Generated Columns =====

To model range queries, for instance on the price, generated columns can be used: https://www.postgresql.org/docs/12/ddl-generated-columns.html
The idea is to automatically extract min/max from a given array column. However, generated columns are not available in older PSQL versions and thus, we currently implement a fallback until we have a docker solution.
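A minimal sketch, assuming the 'filters' table with an integer 'prices' array from above; the helper function and the new column name are only illustrative. The generation expression may only use functions declared IMMUTABLE:

CREATE FUNCTION int_array_min(int[]) RETURNS int
    LANGUAGE sql IMMUTABLE AS $$ SELECT min(x) FROM unnest($1) AS x $$;

ALTER TABLE filters
    ADD COLUMN price_min int GENERATED ALWAYS AS (int_array_min(prices)) STORED;

-- range queries on the price can now use a plain BTREE index
CREATE INDEX filters_price_min_idx ON filters (price_min);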
===== Tuning =====

work_mem (integer)
Sets the maximum amount of memory to be used by a query operation (such as a sort or hash table) before writing to temporary disk files. Sort operations are used for ORDER BY, DISTINCT, and merge joins. Hash tables are used in hash joins, hash-based aggregation, and hash-based processing of IN subqueries.

shared_buffers (integer)
Sets the amount of memory the database server uses for shared memory buffers.
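A minimal sketch of how these settings can be changed; the values are only illustrative and depend on the available RAM:

-- per session
SET work_mem = '256MB';

-- persistent, cluster-wide (written to postgresql.auto.conf)
ALTER SYSTEM SET work_mem = '256MB';
ALTER SYSTEM SET shared_buffers = '4GB';
SELECT pg_reload_conf();  -- shared_buffers only takes effect after a restart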
There is also a tool that helps to optimize settings based on a pre-defined configuration: https://pgtune.leopard.in.ua/#/

===== Reference Project =====

To see PostgreSQL in action, visit the v5_backend git: https://git.picalike.corpex-kunden.de/incubator/v5-backend

===== Johnny's Postgres Wisdoms =====

==== With statement (Common Table Expression (CTE)) ====

Example from the documentation:
<code>
WITH regional_sales AS (
    SELECT region, SUM(amount) AS total_sales
    FROM orders
    GROUP BY region
), top_regions AS (
    SELECT region
    FROM regional_sales
    WHERE total_sales > (SELECT SUM(total_sales)/10 FROM regional_sales)
)
SELECT region, product, SUM(quantity) AS product_units, SUM(amount) AS product_sales
FROM orders
WHERE region IN (SELECT region FROM top_regions)
GROUP BY region, product;
</code>
The WITH statement allows for more readable, easier to optimize queries in complex situations. The use of the WITH statement can usually replace sub-selects or inline views. Especially when referring to the same inline view multiple times, PostgreSQL is able to do a CTE scan instead of a normal table scan, which ends up being faster.

See the official docs: https://www.postgresql.org/docs/13/queries-with.html

==== Recursive queries ====

Example from the documentation:
<code>
WITH RECURSIVE t(n) AS (
    VALUES (1)
    UNION ALL
    SELECT n+1 FROM t WHERE n < 100
)
SELECT sum(n) FROM t;
</code>
Imagine a recursive query that unions as long as a certain condition is satisfied. In practice, although being kind of complex, it allows answering complex questions, for example how often a product has been reduced in a given timeframe while ignoring unchanged prices or price raises.

==== Index Tuning with Compound Indexes ====

This is a poor man's guide to index tuning.

== Usefulness/meaningfulness of an index ==

Let's say there's a table having two columns and 1,000,000 rows. Column 1 has 5000 distinct values and the second one has two. Assume the data is kind of regularly distributed and most of your queries involve a value of column 1. Then a lookup in the index of column 1 would restrict the resultset to 200 rows. An index on column 2 would restrict the resultset only to 500,000 rows. You didn't gain much from the index in that case. This makes an index on column 1 more useful. An index is more useful the shorter the path to a small resultset is.

== Indexes and sorting ==

Indexes can - even partially when inside a compound index - improve sorting performance. For that, the index direction has to match the sort direction. For effective sorting with a compound index, the sort fields should be at the start of the index fields or come after the fields every targeted query uses.

For me there are two types of indexing goals:

1. Satisfy a query fully from the index if possible. Queries that don't return many columns - especially columns not holding that many distinct values - are easy to satisfy fully from an index. The index has to include all fields returned and queried. A very big index with a lot of columns might make sense, but usually you won't gain anything from an index that's as big as the table. The order of the indexed fields is important. You start with fields making a meaningful difference on the left - something that's very distinct (in terms of the filter condition). Then you work your way to the right.

2. An index that helps you to find the data. The fields you need are either all or the most meaningful fields of the WHERE clause. As above, go from left to right by meaningfulness.
Create indexes only for slow queries and large tables. If you index everything, your insert performance will suck.

=== Distinct On ===

Example:
<code>
select distinct on (username) username, login_date as last_login
from user_logs
where action = 'login'
order by username, login_date desc
</code>
Docs: https://www.postgresql.org/docs/13/sql-select.html

You can distinct not only on the full row, but also on certain fields. For that, the ordering of the results has to match the DISTINCT ON clause. The sorting after the initial sort determines which values to keep.

=== Materialized Views ===

Docs: https://www.postgresql.org/docs/9.3/sql-creatematerializedview.html

Materialized views are easy to use and are a good fit for data that doesn't change often but is accessed frequently. An example would be a large dataset that's updated weekly. Queries often go to a grouped subquery that's then filtered or further aggregated. In this case it could make sense to create a materialized view for the aggregated data and refresh it manually when the data is updated.

=== Killing Queries ===

In case some queries issued in parallel are very complex and overload the system, it is possible to gracefully ask all active queries to cancel. HINT: the following query really cancels *ALL* active queries with no exception.
SELECT pg_cancel_backend(pid) FROM pg_stat_activity WHERE state = 'active' and pid <> pg_backend_pid();
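A variant restricted to a single database, in case only one application should be affected (the database name is only illustrative); pg_terminate_backend() can be used instead to drop the whole connection rather than just cancelling the running query:

SELECT pg_cancel_backend(pid)
FROM pg_stat_activity
WHERE state = 'active'
  AND datname = 'products'
  AND pid <> pg_backend_pid();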