This page aims to benchmark various database-like tools popular in open-source data science. It runs regularly against the very latest versions of these packages and updates automatically. We provide this as a service to both the developers of these packages and their users. We hope to add joins and updates, with a focus on ordered operations, which are hard to achieve in (unordered) SQL. We also hope to add more solutions over time, although the most interesting solutions do not yet seem mature enough. See README.md for detailed status.

We limit the scope to what can be achieved on a single machine. Laptop-size memory (8GB) and server-size memory (250GB) are in scope. Out-of-memory processing using local disk such as NVMe is in scope. Multi-node systems such as Spark running in single-machine mode are in scope, too. Machines are getting bigger: EC2 X1 has 2TB RAM, and a 1TB NVMe disk is under $300. If you can perform the task on a single machine, then perhaps you should. To our knowledge, nobody has yet compared this software in this way and published the results.

We also include the syntax being timed alongside the timing. This way you can immediately see whether you perform these tasks yourself and whether the timing differences matter to you. A 10x difference may be irrelevant if that’s just 1s vs 0.1s on your data size. The intention is that you click the tab for the size of data you have.
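To make the point about relative versus absolute differences concrete, here is a minimal sketch of reporting a query's label alongside its timing. It is not the benchmark's actual code: the helper names `timed` and `compare`, the column names `id1` and `v1`, and the toy data are all assumptions made for illustration, and only the Python standard library is used.

```python
import time
from collections import defaultdict


def sum_v1_by_id1(rows):
    """Group rows by 'id1' and sum 'v1' (an illustrative groupby question)."""
    totals = defaultdict(int)
    for row in rows:
        totals[row["id1"]] += row["v1"]
    return dict(totals)


def timed(label, func, *args):
    """Run func(*args) once and return (label, result, elapsed seconds)."""
    start = time.perf_counter()
    result = func(*args)
    elapsed = time.perf_counter() - start
    return label, result, elapsed


def compare(fast_s, slow_s):
    """Return (ratio, absolute gap in seconds) between two timings."""
    return slow_s / fast_s, slow_s - fast_s


# Hypothetical toy data; the real benchmark uses far larger inputs.
rows = [{"id1": f"id{i % 3}", "v1": i} for i in range(9)]

# Report the syntax (here just a label) next to its timing.
label, result, elapsed = timed("sum v1 by id1", sum_v1_by_id1, rows)
print(f"{label}: {elapsed:.6f}s")

# The 1s vs 0.1s case from the text: a large ratio, a small absolute gap.
ratio, gap = compare(0.1, 1.0)
print(f"{ratio:.0f}x slower, but only {gap:.1f}s in absolute terms")
```

Whether the ratio or the absolute gap matters depends on how often the query runs and on your data size, which is why the timings are shown per data size.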

Because we have been asked many times to do so, the first task and initial motivation for this page was to update the benchmark designed and run by Matt Dowle (creator of data.table) in 2014 here. The methodology and reproducible code can be obtained there. The exact code for this report and the benchmark script can be found at h2oai/db-benchmark, created by Jan Gorecki and funded by H2O.ai. If you have questions or feedback, feel free to file an issue there.

Groupby

The plot below presents just a single input data set and a basic set of questions. Complete results of the groupby task benchmark can be found in the h2oai.github.io/db-benchmark/groupby.html report.

0.5 GB

5 GB

50 GB

Environment configuration

The listed solutions were run using the following versions of languages:
- R 3.5.1
- Python 3.6
- Julia 1.0.2

Component Value
cpu_model Intel(R) Xeon(R) CPU E5-2660 v3 @ 2.60GHz
cpu_cores 20
memory_model DIMM DDR4 Synchronous 2133 MHz
memory_gb 125.8
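A component/value listing like the one above can be gathered programmatically. The following is a rough sketch using only the Python standard library, not the script that produced this report. The function name `describe_environment` and the `python_version` and `machine` keys are assumptions for illustration. Note that `os.cpu_count()` reports logical CPUs, which may not match how `cpu_cores` was counted here, and that the memory lookup relies on `os.sysconf` names that are not available on every platform.

```python
import os
import platform


def describe_environment():
    """Collect a few environment facts as a component -> value mapping."""
    try:
        # Works on Linux and some other Unix systems; not available everywhere.
        total_bytes = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
        memory_gb = round(total_bytes / 1024 ** 3, 1)
    except (AttributeError, ValueError, OSError):
        memory_gb = None  # unknown on this platform

    return {
        "python_version": platform.python_version(),
        "machine": platform.machine(),
        "cpu_cores": os.cpu_count(),  # logical CPUs; may be None
        "memory_gb": memory_gb,
    }


# Print one "component value" pair per line.
for component, value in describe_environment().items():
    print(component, value)
```

Recording these values with every run makes it easier to judge whether timings from two reports are comparable.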

The benchmark run took around 33.8 hours.

This report was generated on: 2019-01-14 22:21:57 PST.