This page presents results of h2oai.github.io/db-benchmark groupby task benchmark for various datasizes and various data characteristis (cardinality, percentage of missing values, pre-sorted input). There are 10 different questions run for each input data, questions are categorized into two groups. Basic questions refers to set of 5 questions designed by Matt Dowle (creator of data.table) in 2014 here. Advanced questions are 5 new questions meant to cover more complex queries, which are also less obvious to optimize.

Groupby

Below timings are presented for a single dataset case having random order, no NAs (missing values) and particular cardinality factor (group size question 1 k=100). To see timings for other cases scroll down to full timings table. If a solution is missing on particular data size timings table refer to benchplot for a reason and check its speed on smaller data size tab.

0.5 GB

Set of basic questions

Set of advanced questions

Details table

Plots of all cases can be found at G1_1e7_1e2_0_0.basic.png, G1_1e7_1e1_0_0.basic.png, G1_1e7_2e0_0_0.basic.png, G1_1e7_1e2_0_1.basic.png, G1_1e7_1e2_0_0.advanced.png, G1_1e7_1e1_0_0.advanced.png, G1_1e7_2e0_0_0.advanced.png, G1_1e7_1e2_0_1.advanced.png. Below first run timings.

rows q1_grp.size NA_pct pre_sorted question data.table dplyr pandas pydatatable spark dask juliadf clickhouse cudf
1e7 100 0 unsorted sum v1 by id1 0.112 0.249 0.246 0.285 1.609 0.106 4.511 0.153 0.503
1e7 100 0 unsorted sum v1 by id1:id2 0.106 0.521 0.689 0.401 2.585 0.195 0.573 0.116 0.190
1e7 100 0 unsorted sum v1 mean v3 by id3 0.132 0.808 0.509 0.843 1.861 1.659 1.102 0.177 2.047
1e7 100 0 unsorted mean v1:v3 by id4 0.262 0.644 0.552 0.509 0.775 0.246 0.662 0.053 0.083
1e7 100 0 unsorted sum v1:v3 by id6 0.202 1.479 0.895 0.225 1.697 0.333 1.110 0.124 0.214
1e7 100 0 unsorted median v3 sd v3 by id2 id4 0.960 1.562 1.352 NA NA NA 2.695 0.416 NA
1e7 100 0 unsorted max v1 - min v2 by id2 id4 0.383 0.891 0.914 0.158 1.392 0.269 1.979 0.150 NA
1e7 100 0 unsorted largest two v3 by id2 id4 0.735 5.907 3.790 1.378 3.068 67.364 1.844 1.003 NA
1e7 100 0 unsorted regression v1 v2 by id2 id4 0.725 1.237 12.255 NA 1.340 NA 1.983 0.148 NA
1e7 100 0 unsorted sum v3 count by id1:id6 1.690 NA NA 9.933 4.973 14.308 3.122 1.776 2.998
1e7 10 0 unsorted sum v1 by id1 0.152 0.114 0.194 0.154 1.738 0.080 4.557 0.053 0.448
1e7 10 0 unsorted sum v1 by id1:id2 0.192 0.231 0.523 0.291 1.984 0.145 0.325 0.159 0.140
1e7 10 0 unsorted sum v1 mean v3 by id3 0.267 2.771 0.847 2.813 2.685 12.816 1.812 0.310 2.367
1e7 10 0 unsorted mean v1:v3 by id4 0.248 0.462 0.546 0.128 0.720 0.206 0.671 0.050 0.058
1e7 10 0 unsorted sum v1:v3 by id6 0.226 6.589 1.563 0.489 2.090 1.536 2.120 0.236 0.215
1e7 10 0 unsorted median v3 sd v3 by id2 id4 0.825 0.758 0.951 NA NA NA 2.289 0.115 NA
1e7 10 0 unsorted max v1 - min v2 by id2 id4 0.360 0.411 0.703 0.113 0.698 0.158 1.520 0.145 NA
1e7 10 0 unsorted largest two v3 by id2 id4 0.603 5.264 3.388 2.636 2.744 4.267 1.475 1.217 NA
1e7 10 0 unsorted regression v1 v2 by id2 id4 0.435 0.522 2.539 NA 0.733 NA 1.535 0.091 NA
1e7 10 0 unsorted sum v3 count by id1:id6 2.202 NA NA 6.271 4.990 29.638 4.018 1.831 3.216
1e7 2 0 unsorted sum v1 by id1 0.187 0.129 0.195 0.139 1.691 0.076 4.834 0.051 0.425
1e7 2 0 unsorted sum v1 by id1:id2 0.159 0.181 0.563 0.203 1.271 0.144 0.313 0.150 NA
1e7 2 0 unsorted sum v1 mean v3 by id3 0.476 6.734 2.168 3.744 3.092 24.303 3.895 0.763 NA
1e7 2 0 unsorted mean v1:v3 by id4 0.297 0.302 0.546 0.121 0.811 0.261 0.696 0.053 NA
1e7 2 0 unsorted sum v1:v3 by id6 1.421 14.817 4.605 1.581 2.231 4.114 3.410 0.493 NA
1e7 2 0 unsorted median v3 sd v3 by id2 id4 0.498 0.636 0.919 NA NA NA 2.395 0.365 NA
1e7 2 0 unsorted max v1 - min v2 by id2 id4 0.167 0.272 0.705 0.107 0.655 0.215 1.466 0.141 NA
1e7 2 0 unsorted largest two v3 by id2 id4 0.492 5.481 3.219 0.569 4.586 4.080 1.693 1.843 NA
1e7 2 0 unsorted regression v1 v2 by id2 id4 0.309 0.388 2.014 NA 0.603 NA 1.413 0.185 NA
1e7 2 0 unsorted sum v3 count by id1:id6 4.256 NA NA 4.802 4.911 43.042 5.987 1.760 NA
1e7 100 0 sorted sum v1 by id1 0.048 0.105 0.202 0.097 1.708 0.075 4.594 0.035 0.433
1e7 100 0 sorted sum v1 by id1:id2 0.049 0.153 0.560 0.298 1.744 0.160 0.316 0.076 0.110
1e7 100 0 sorted sum v1 mean v3 by id3 0.125 0.725 0.418 0.774 2.074 1.435 1.113 0.158 0.972
1e7 100 0 sorted mean v1:v3 by id4 0.224 0.617 0.587 0.250 0.806 0.188 0.710 0.047 0.057
1e7 100 0 sorted sum v1:v3 by id6 0.219 1.544 0.914 0.250 2.047 0.366 1.164 0.120 0.191
1e7 100 0 sorted median v3 sd v3 by id2 id4 0.544 1.058 1.015 NA NA NA 2.344 0.233 NA
1e7 100 0 sorted max v1 - min v2 by id2 id4 0.165 0.392 0.739 0.117 1.207 0.199 1.485 0.082 NA
1e7 100 0 sorted largest two v3 by id2 id4 0.651 5.566 3.821 1.818 2.987 69.792 1.526 0.983 NA
1e7 100 0 sorted regression v1 v2 by id2 id4 0.530 0.757 12.028 NA 1.227 NA 1.452 0.106 NA
1e7 100 0 sorted sum v3 count by id1:id6 0.817 NA NA 9.040 5.591 13.654 3.210 1.766 2.969
1e7 NA NA NA NA 22.451 NA NA NA NA NA 83.919 17.733 NA

5 GB

Set of basic questions

Set of advanced questions

Details table

Plots of all cases can be found at G1_1e8_1e2_0_0.basic.png, G1_1e8_1e1_0_0.basic.png, G1_1e8_2e0_0_0.basic.png, G1_1e8_1e2_0_1.basic.png, G1_1e8_1e2_0_0.advanced.png, G1_1e8_1e1_0_0.advanced.png, G1_1e8_2e0_0_0.advanced.png, G1_1e8_1e2_0_1.advanced.png. Below first run timings.

rows q1_grp.size NA_pct pre_sorted question data.table dplyr pandas pydatatable spark dask juliadf clickhouse cudf
1e8 100 0 unsorted sum v1 by id1 0.952 2.021 2.092 2.366 3.299 0.603 6.117 0.329 NA
1e8 100 0 unsorted sum v1 by id1:id2 0.994 5.999 6.726 3.946 3.526 1.454 4.912 0.774 NA
1e8 100 0 unsorted sum v1 mean v3 by id3 1.417 16.788 7.476 7.934 6.172 164.928 14.982 1.592 NA
1e8 100 0 unsorted mean v1:v3 by id4 1.533 8.347 6.083 1.791 1.330 1.607 3.817 0.295 NA
1e8 100 0 unsorted sum v1:v3 by id6 1.612 38.866 13.724 3.241 4.813 6.555 18.782 1.511 NA
1e8 100 0 unsorted median v3 sd v3 by id2 id4 9.655 14.860 13.964 NA NA NA 17.343 1.589 NA
1e8 100 0 unsorted max v1 - min v2 by id2 id4 3.913 9.364 8.726 1.020 2.210 1.198 14.075 0.736 NA
1e8 100 0 unsorted largest two v3 by id2 id4 6.475 74.672 44.655 91.260 12.631 142.403 13.927 10.547 NA
1e8 100 0 unsorted regression v1 v2 by id2 id4 5.661 10.908 44.889 NA 2.774 NA 16.225 0.942 NA
1e8 100 0 unsorted sum v3 count by id1:id6 16.357 NA NA 69.427 39.200 591.459 36.504 18.391 NA
1e8 10 0 unsorted sum v1 by id1 1.282 1.387 2.165 1.130 2.130 0.660 6.986 0.313 NA
1e8 10 0 unsorted sum v1 by id1:id2 1.444 2.713 5.359 2.109 3.093 1.082 1.918 1.484 NA
1e8 10 0 unsorted sum v1 mean v3 by id3 4.816 35.655 15.711 26.065 11.192 465.753 34.221 3.747 NA
1e8 10 0 unsorted mean v1:v3 by id4 2.366 4.571 6.068 0.937 1.307 1.909 3.924 0.324 NA
1e8 10 0 unsorted sum v1:v3 by id6 6.678 86.364 30.189 4.667 7.646 28.810 31.544 2.809 NA
1e8 10 0 unsorted median v3 sd v3 by id2 id4 9.056 9.451 9.526 NA NA NA 12.678 0.871 NA
1e8 10 0 unsorted max v1 - min v2 by id2 id4 3.663 4.495 6.950 0.916 1.819 0.970 9.802 1.230 NA
1e8 10 0 unsorted largest two v3 by id2 id4 9.757 70.819 39.384 7.136 13.522 64.726 9.637 13.461 NA
1e8 10 0 unsorted regression v1 v2 by id2 id4 4.745 5.932 27.667 NA 2.075 NA 11.175 0.569 NA
1e8 10 0 unsorted sum v3 count by id1:id6 27.037 NA NA 60.030 29.037 765.439 46.002 17.345 NA
1e8 2 0 unsorted sum v1 by id1 1.424 1.692 1.959 0.815 2.918 0.634 9.735 0.337 NA
1e8 2 0 unsorted sum v1 by id1:id2 1.474 2.137 5.146 1.301 2.114 1.074 1.992 1.211 NA
1e8 2 0 unsorted sum v1 mean v3 by id3 32.028 104.242 25.572 42.031 16.986 824.347 46.478 6.591 NA
1e8 2 0 unsorted mean v1:v3 by id4 2.289 3.182 5.430 0.905 1.374 1.696 4.442 0.400 NA
1e8 2 0 unsorted sum v1:v3 by id6 6.062 221.063 59.172 15.378 14.238 67.683 39.776 4.662 NA
1e8 2 0 unsorted median v3 sd v3 by id2 id4 4.833 6.433 8.703 NA NA NA 15.465 3.352 NA
1e8 2 0 unsorted max v1 - min v2 by id2 id4 1.683 3.329 6.741 0.962 1.602 1.075 11.500 0.963 NA
1e8 2 0 unsorted largest two v3 by id2 id4 4.911 67.282 36.880 2.011 39.979 35.800 13.843 27.562 NA
1e8 2 0 unsorted regression v1 v2 by id2 id4 3.354 4.067 20.036 NA 1.824 NA 12.383 1.434 NA
1e8 2 0 unsorted sum v3 count by id1:id6 19.269 NA NA 49.392 37.913 1334.087 63.819 16.699 NA
1e8 100 0 sorted sum v1 by id1 0.357 0.934 2.314 0.740 2.846 0.641 6.369 0.239 NA
1e8 100 0 sorted sum v1 by id1:id2 0.346 1.327 5.413 1.796 3.157 1.191 1.936 0.520 NA
1e8 100 0 sorted sum v1 mean v3 by id3 1.163 15.528 5.790 8.191 6.494 145.566 14.976 1.557 NA
1e8 100 0 sorted mean v1:v3 by id4 1.576 7.932 5.980 3.468 1.245 1.901 3.892 0.577 NA
1e8 100 0 sorted sum v1:v3 by id6 1.357 38.213 14.063 3.410 4.882 7.944 19.458 1.443 NA
1e8 100 0 sorted median v3 sd v3 by id2 id4 5.476 7.314 9.485 NA NA NA 11.900 0.784 NA
1e8 100 0 sorted max v1 - min v2 by id2 id4 1.396 3.327 7.009 0.809 2.050 1.348 8.186 0.456 NA
1e8 100 0 sorted largest two v3 by id2 id4 4.405 68.761 43.433 88.453 10.757 120.713 8.935 9.814 NA
1e8 100 0 sorted regression v1 v2 by id2 id4 3.137 4.497 31.760 NA 2.550 NA 8.782 0.700 NA
1e8 100 0 sorted sum v3 count by id1:id6 7.697 NA NA 65.407 38.164 560.748 34.043 18.710 NA
1e8 NA NA NA NA 223.650 NA NA NA NA NA 652.481 176.870 NA

50 GB

Set of basic questions

Set of advanced questions

Details table

Plots of all cases can be found at G1_1e9_1e2_0_0.basic.png, G1_1e9_1e1_0_0.basic.png, G1_1e9_2e0_0_0.basic.png, G1_1e9_1e2_0_1.basic.png, G1_1e9_1e2_0_0.advanced.png, G1_1e9_1e1_0_0.advanced.png, G1_1e9_2e0_0_0.advanced.png, G1_1e9_1e2_0_1.advanced.png. Below first run timings.

rows q1_grp.size NA_pct pre_sorted question data.table dplyr pandas pydatatable spark dask juliadf clickhouse cudf
1e9 100 0 unsorted sum v1 by id1 9.021 21.925 NA 38.844 18.094 NA NA 62.072 NA
1e9 100 0 unsorted sum v1 by id1:id2 12.687 64.117 NA 30.137 24.915 NA NA 22.363 NA
1e9 100 0 unsorted sum v1 mean v3 by id3 13.200 237.633 NA 97.510 74.070 NA NA 83.883 NA
1e9 100 0 unsorted mean v1:v3 by id4 15.705 86.245 NA 26.040 8.631 NA NA 51.899 NA
1e9 100 0 unsorted sum v1:v3 by id6 13.559 526.225 NA 20.695 63.283 NA NA 38.575 NA
1e9 100 0 unsorted median v3 sd v3 by id2 id4 110.517 168.679 NA NA NA NA NA 23.691 NA
1e9 100 0 unsorted max v1 - min v2 by id2 id4 52.961 120.665 NA 13.851 17.525 NA NA 8.487 NA
1e9 100 0 unsorted largest two v3 by id2 id4 70.839 944.723 NA 237.878 143.411 NA NA 190.131 NA
1e9 100 0 unsorted regression v1 v2 by id2 id4 68.585 138.197 NA NA 23.240 NA NA 83.530 NA
1e9 100 0 unsorted sum v3 count by id1:id6 181.196 NA NA 693.345 979.816 NA NA NA NA
1e9 10 0 unsorted sum v1 by id1 11.234 18.183 NA 10.708 11.418 NA NA 4.087 NA
1e9 10 0 unsorted sum v1 by id1:id2 15.703 30.487 NA 22.134 22.099 NA NA 16.544 NA
1e9 10 0 unsorted sum v1 mean v3 by id3 26.566 498.536 NA 294.185 100.284 NA NA 130.989 NA
1e9 10 0 unsorted mean v1:v3 by id4 20.350 51.176 NA 8.774 7.428 NA NA 50.131 NA
1e9 10 0 unsorted sum v1:v3 by id6 29.170 1398.059 NA 166.351 86.649 NA NA 53.850 NA
1e9 10 0 unsorted median v3 sd v3 by id2 id4 93.772 NA NA NA NA NA NA 28.188 NA
1e9 10 0 unsorted max v1 - min v2 by id2 id4 35.253 NA NA 10.386 11.613 NA NA 14.402 NA
1e9 10 0 unsorted largest two v3 by id2 id4 65.956 NA NA 27.336 142.066 NA NA 245.911 NA
1e9 10 0 unsorted regression v1 v2 by id2 id4 50.316 NA NA NA 16.276 NA NA 31.897 NA
1e9 10 0 unsorted sum v3 count by id1:id6 332.623 NA NA 644.808 1102.860 NA NA NA NA
1e9 2 0 unsorted sum v1 by id1 12.883 12.322 NA 11.006 11.409 NA NA 2.956 NA
1e9 2 0 unsorted sum v1 by id1:id2 14.910 20.133 NA 14.474 13.880 NA NA 12.332 NA
1e9 2 0 unsorted sum v1 mean v3 by id3 NA NA NA 473.725 219.396 NA NA 192.526 NA
1e9 2 0 unsorted mean v1:v3 by id4 NA NA NA 11.653 7.748 NA NA NA NA
1e9 2 0 unsorted sum v1:v3 by id6 NA NA NA 133.137 146.303 NA NA NA NA
1e9 2 0 unsorted median v3 sd v3 by id2 id4 NA NA NA NA NA NA NA NA NA
1e9 2 0 unsorted max v1 - min v2 by id2 id4 NA NA NA 9.632 10.657 NA NA NA NA
1e9 2 0 unsorted largest two v3 by id2 id4 NA NA NA 19.127 481.419 NA NA NA NA
1e9 2 0 unsorted regression v1 v2 by id2 id4 NA NA NA NA 16.503 NA NA NA NA
1e9 2 0 unsorted sum v3 count by id1:id6 NA NA NA 531.451 676.685 NA NA NA NA
1e9 100 0 sorted sum v1 by id1 3.174 11.080 NA 7.183 10.299 NA NA 2.365 NA
1e9 100 0 sorted sum v1 by id1:id2 3.935 15.872 NA 13.146 13.103 NA NA 4.698 NA
1e9 100 0 sorted sum v1 mean v3 by id3 10.726 205.809 NA 78.085 80.457 NA NA 23.689 NA
1e9 100 0 sorted mean v1:v3 by id4 17.506 90.892 NA 18.015 7.809 NA NA 15.090 NA
1e9 100 0 sorted sum v1:v3 by id6 14.758 532.316 NA 22.069 60.178 NA NA 25.721 NA
1e9 100 0 sorted median v3 sd v3 by id2 id4 79.429 90.552 NA NA NA NA NA 8.184 NA
1e9 100 0 sorted max v1 - min v2 by id2 id4 27.530 48.317 NA 8.756 14.102 NA NA 5.001 NA
1e9 100 0 sorted largest two v3 by id2 id4 53.228 876.008 NA 238.032 98.020 NA NA 120.091 NA
1e9 100 0 sorted regression v1 v2 by id2 id4 45.746 62.967 NA NA 18.575 NA NA 45.185 NA
1e9 100 0 sorted sum v3 count by id1:id6 71.841 NA NA 523.283 416.538 NA NA NA NA
1e9 NA NA NA NA NA NA NA NA NA NA NA NA NA

Notes

Environment configuration

Component Value
cpu_model Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz
cpu_cores 40
memory_model DIMM Synchronous 2133 MHz
memory_gb 125.78

Groupby benchmark run took around 62.5 hours.

Report was generated on: 2019-08-16 11:44:24 PDT.