MonetDB Evaluation 2022
I used a docker image for it: https://hub.docker.com/r/monetdb/monetdb
docker pull monetdb/monetdb
Default username/password: monetdb/monetdb
Some performance tips: https://www.monetdb.org/documentation-Jul2021/admin-guide/performance-tips/performance-tips/
Python Client pymonetdb: https://pymonetdb.readthedocs.io/en/latest/api.html
There is a blocker with respect to the transfer performance: https://github.com/gijzelaerr/pymonetdb/issues/94
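To put a number on the transfer cost, a small timing harness around any DB-API cursor can help. The following is only a sketch: `monetdb_connection` and `measure_fetch` are hypothetical helper names, `localhost` and port 50000 (MonetDB's default) are assumptions about the local Docker setup, and the `demo` database and `monetdb`/`monetdb` credentials are the ones used in this evaluation.

```python
import time


def monetdb_connection(hostname="localhost", port=50000, database="demo"):
    """Open a pymonetdb connection (assumed host/port; needs a running server)."""
    import pymonetdb  # imported lazily so the timing helper works without it

    return pymonetdb.connect(
        username="monetdb",
        password="monetdb",
        hostname=hostname,
        port=port,
        database=database,
    )


def measure_fetch(cursor, query, batch_size=10_000):
    """Run `query` and return (row_count, seconds) for fetching all rows."""
    start = time.perf_counter()
    cursor.execute(query)
    rows = 0
    while True:
        # fetchmany is part of the DB-API, so any compliant cursor works here
        batch = cursor.fetchmany(batch_size)
        if not batch:
            break
        rows += len(batch)
    return rows, time.perf_counter() - start
```

Calling `measure_fetch(monetdb_connection().cursor(), "SELECT * FROM some_table")` with a real table name in place of the placeholder `some_table` gives a rows-per-second figure that can be compared across client versions.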
Evaluating query plans / performance:
- PLAN is the relational query plan
- EXPLAIN is the MAL (MonetDB Assembly Language) execution plan [without costs]
- TRACE executes the query and reports timings per MAL instruction, not per SQL statement
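The three modifiers are simply prefixed to the query, so they are easy to apply from a client. A minimal sketch, assuming a DB-API cursor (for example from pymonetdb) and assuming the client returns the modifier output as ordinary rows; `with_modifier` and `show` are hypothetical helper names.

```python
MODIFIERS = ("PLAN", "EXPLAIN", "TRACE")


def with_modifier(query, modifier):
    """Prefix a SQL query with one of the PLAN/EXPLAIN/TRACE modifiers."""
    keyword = modifier.strip().upper()
    if keyword not in MODIFIERS:
        raise ValueError(f"unknown modifier: {modifier!r}")
    # normalise the trailing semicolon so the result is one clean statement
    return f"{keyword} {query.strip().rstrip(';')};"


def show(cursor, query, modifier):
    """Execute the modified query on a DB-API cursor and return its rows."""
    cursor.execute(with_modifier(query, modifier))
    return cursor.fetchall()
```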
There is an embedded version that does not need a server and stores the data on disk: https://www.monetdb.org/documentation-Jul2021/dev-guide/mbedded-python/
Activating Python 3 in the docker image
Install required packages
docker exec -it --user root monetdb /bin/bash
yum install MonetDB-python3 python3-numpy
Do the database settings
monetdb stop demo
monetdb set embedpy3=yes demo
monetdb start demo
monetdb get embedpy3 demo
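The same settings can be scripted from the host instead of typing them into an interactive shell. This is a rough sketch: the container name `monetdb` and the database `demo` come from the commands above, running as root mirrors the shell session and is an assumption, and `embedpy3_commands` and `run_all` are hypothetical helper names.

```python
import shlex


def embedpy3_commands(database="demo", container="monetdb"):
    """Build the docker exec calls that enable embedded Python 3 for a database."""
    steps = [
        ["monetdb", "stop", database],
        ["monetdb", "set", "embedpy3=yes", database],
        ["monetdb", "start", database],
        ["monetdb", "get", "embedpy3", database],  # verify the setting
    ]
    return [["docker", "exec", "--user", "root", container, *step] for step in steps]


def run_all(commands):
    """Run the commands in order and stop at the first failure."""
    import subprocess

    for command in commands:
        print("+", shlex.join(command))
        subprocess.run(command, check=True)
```

`run_all(embedpy3_commands())` then performs the stop/set/start/get sequence in one go.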
Restrictions
- VACUUM: we now disallow vacuum on system tables. The vacuum function isn't safe enough for these tables. A better vacuum solution for the system tables is needed. [..]
Memory usage keeps growing: https://www.monetdb.org/documentation-Jul2021/admin-guide/system-resources/memory-footprint/
Bottom line: we need cgroups to limit the memory size (Docker applies them automatically), and we need to make sure the process is not killed by the OOM killer when it reaches that limit.
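To check which limit actually applies to the server, the effective cgroup memory limit can be read from inside the container. A minimal sketch, assuming the standard cgroup v2 and v1 file locations; `parse_memory_limit` and `container_memory_limit` are hypothetical helper names.

```python
from pathlib import Path

# Usual locations of the memory limit for cgroup v2 and cgroup v1.
CGROUP_LIMIT_FILES = (
    Path("/sys/fs/cgroup/memory.max"),
    Path("/sys/fs/cgroup/memory/memory.limit_in_bytes"),
)


def parse_memory_limit(raw):
    """Return the limit in bytes, or None if cgroup v2 reports "max" (unlimited)."""
    value = raw.strip()
    return None if value == "max" else int(value)


def container_memory_limit():
    """Return the first memory limit found, or None if no file exists."""
    for path in CGROUP_LIMIT_FILES:
        if path.exists():
            # cgroup v1 reports "unlimited" as a very large number, not "max"
            return parse_memory_limit(path.read_text())
    return None
```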
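Kludge: prevent the process from being OOM-killed when it hits the memory limit (needs root; -1000 is only valid for oom_score_adj, the older oom_adj accepts -17 to 15):
echo -1000 > /proc/[monetdb_pid]/oom_score_adj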
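The same kludge can be applied from a script. This sketch assumes the kernel's `oom_score_adj` interface, which accepts values from -1000 to 1000 where -1000 exempts the process from the OOM killer; lowering the value normally requires root. `oom_score_adj_path` and `set_oom_score_adj` are hypothetical helper names.

```python
def oom_score_adj_path(pid):
    """Return the procfs file that holds the OOM score adjustment for `pid`."""
    return f"/proc/{int(pid)}/oom_score_adj"


def set_oom_score_adj(pid, score=-1000):
    """Write `score` for `pid`; the range check runs before touching procfs."""
    if not -1000 <= score <= 1000:
        raise ValueError("oom_score_adj must be between -1000 and 1000")
    with open(oom_score_adj_path(pid), "w") as handle:
        handle.write(str(score))
```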
Snippets
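Define a Python function that calculates the distance between a needle (the query) and the database rows (strings). The function sums absolute differences, which equals the Hamming distance only for 0/1 vectors. Without an array type, the data has to be stored as a string and converted to numpy on every call, which is not efficient.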
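CREATE FUNCTION python_hamdist(strings CLOB, needle CLOB) RETURNS INTEGER LANGUAGE PYTHON {
    # parse the space-separated needle into a float32 vector
    c = numpy.fromstring(needle, sep=" ", dtype=numpy.float32)
    # parse every row the same way; all rows must have the needle's length
    qq = numpy.array([numpy.fromstring(q, sep=" ") for q in strings], dtype=numpy.float32)
    # sum of absolute differences per row
    return numpy.abs(qq - c).sum(axis=1).tolist()
};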
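The UDF sums absolute element-wise differences, which matches the Hamming distance when the vectors contain only 0 and 1. For checking expected values outside MonetDB, the same logic can be mirrored in plain Python. This is only a sketch without numpy; `parse_vector` and `abs_diff_distances` are hypothetical names.

```python
def parse_vector(text):
    """Turn a space-separated string such as "1 0 1" into a list of floats."""
    return [float(token) for token in text.split()]


def abs_diff_distances(strings, needle):
    """Return the sum of absolute differences between each row and the needle."""
    target = parse_vector(needle)
    distances = []
    for row in strings:
        vector = parse_vector(row)
        if len(vector) != len(target):
            raise ValueError("row and needle must have the same length")
        distances.append(sum(abs(a - b) for a, b in zip(vector, target)))
    return distances


print(abs_diff_distances(["1 0 1", "0 0 0"], "1 1 1"))  # [1.0, 3.0]
```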
Bottom Line
The array data types we need for the sim DB are not available. In addition, there are many recent issue reports about the performance of the Python client and the Python interface, and I was not able to find any reports that were not written by the MonetDB developers themselves. The risk of running a live MonetDB is therefore high, since tests with smaller datasets already showed limitations such as import speed and data transfer performance from the DB to the caller.
References:
https://github.com/MonetDB/MonetDB/issues/4048
https://stackoverflow.com/questions/65074614/monetdb-full-disk-how-to-manually-free-space
https://stackoverflow.com/questions/65079976/monetdb-set-specific-embedded-python-version
https://www.monetdb.org/documentation-Jul2021/dev-guide/mbedded-python/