This page presents results of h2oai.github.io/db-benchmark groupby task benchmark for various datasizes and various data characteristis (cardinality, percentage of missing values, pre-sorted input). There are 10 different questions run for each input data, questions are categorized into two groups. Basic questions refers to set of 5 questions designed by Matt Dowle (creator of data.table) in 2014 here. Advanced questions are 5 new questions meant to cover more complex queries, which are also less obvious to optimize.

Groupby

Below timings are presented for a single dataset case having random order, no NAs (missing values) and particular cardinality factor (group size question 1 k=100). To see timings for other cases scroll down to full timings table. If a solution is missing on particular data size timings table refer to benchplot for a reason and check its speed on smaller data size tab.

0.5 GB

Set of basic questions

Set of advanced questions

Details table

Plots of all cases can be found at G1_1e7_1e2_0_0.basic.png, G1_1e7_1e1_0_0.basic.png, G1_1e7_2e0_0_0.basic.png, G1_1e7_1e2_0_1.basic.png, G1_1e7_1e2_0_0.advanced.png, G1_1e7_1e1_0_0.advanced.png, G1_1e7_2e0_0_0.advanced.png, G1_1e7_1e2_0_1.advanced.png. Below first run timings.

rows q1_grp.size NA_pct pre_sorted question data.table dplyr pandas pydatatable spark dask juliadf clickhouse cudf
1e7 100 0 unsorted sum v1 by id1 0.137 0.253 0.186 0.199 2.191 0.122 4.800 0.076 0.378
1e7 100 0 unsorted sum v1 by id1:id2 0.126 0.465 0.637 0.450 2.116 0.217 0.621 0.130 0.203
1e7 100 0 unsorted sum v1 mean v3 by id3 0.135 0.706 0.501 1.090 1.707 1.622 1.147 0.183 1.722
1e7 100 0 unsorted mean v1:v3 by id4 0.211 0.544 0.525 0.296 0.866 0.234 0.713 0.039 0.061
1e7 100 0 unsorted sum v1:v3 by id6 0.158 1.402 0.888 0.286 1.539 0.367 1.174 0.137 0.090
1e7 100 0 unsorted median v3 sd v3 by id2 id4 0.983 1.573 1.316 NA NA NA 2.746 0.240 NA
1e7 100 0 unsorted max v1 - min v2 by id2 id4 0.402 0.864 0.814 0.168 0.924 0.257 1.993 0.122 NA
1e7 100 0 unsorted largest two v3 by id2 id4 0.921 5.723 3.725 2.566 2.600 68.827 1.969 1.032 NA
1e7 100 0 unsorted regression v1 v2 by id2 id4 0.864 1.194 12.005 NA 1.345 NA 2.062 0.160 NA
1e7 100 0 unsorted sum v3 count by id1:id6 1.619 NA NA 8.784 5.186 15.276 3.544 1.658 5.206
1e7 10 0 unsorted sum v1 by id1 0.162 0.113 0.271 0.162 1.258 0.089 4.788 0.049 0.318
1e7 10 0 unsorted sum v1 by id1:id2 0.126 0.206 0.560 0.218 2.313 0.141 0.339 0.158 0.151
1e7 10 0 unsorted sum v1 mean v3 by id3 0.225 2.712 0.846 2.640 2.627 14.234 1.914 0.390 2.251
1e7 10 0 unsorted mean v1:v3 by id4 0.252 0.456 0.548 0.161 1.020 0.256 0.704 0.042 0.064
1e7 10 0 unsorted sum v1:v3 by id6 0.248 6.597 1.599 0.518 2.043 1.428 2.135 0.265 0.116
1e7 10 0 unsorted median v3 sd v3 by id2 id4 0.793 0.718 0.953 NA NA NA 2.310 0.113 NA
1e7 10 0 unsorted max v1 - min v2 by id2 id4 0.274 0.409 0.743 0.143 0.639 0.174 1.535 0.150 NA
1e7 10 0 unsorted largest two v3 by id2 id4 0.645 5.259 3.375 2.343 2.833 3.880 1.516 1.088 NA
1e7 10 0 unsorted regression v1 v2 by id2 id4 0.441 0.526 2.566 NA 0.737 NA 1.589 0.074 NA
1e7 10 0 unsorted sum v3 count by id1:id6 1.982 NA NA 6.856 5.012 28.723 4.134 1.729 5.654
1e7 2 0 unsorted sum v1 by id1 0.177 0.135 0.200 0.184 1.364 0.086 4.995 0.051 0.298
1e7 2 0 unsorted sum v1 by id1:id2 0.151 0.181 0.555 0.210 0.847 0.141 0.320 0.174 NA
1e7 2 0 unsorted sum v1 mean v3 by id3 0.532 6.441 2.128 5.501 4.834 29.277 4.044 0.734 NA
1e7 2 0 unsorted mean v1:v3 by id4 0.252 0.300 0.553 0.245 0.826 0.224 0.736 0.044 NA
1e7 2 0 unsorted sum v1:v3 by id6 0.905 16.943 4.679 1.349 2.365 4.256 3.486 0.550 NA
1e7 2 0 unsorted median v3 sd v3 by id2 id4 0.511 0.579 0.911 NA NA NA 2.504 0.316 NA
1e7 2 0 unsorted max v1 - min v2 by id2 id4 0.148 0.270 0.683 0.166 0.574 0.205 1.530 0.121 NA
1e7 2 0 unsorted largest two v3 by id2 id4 0.559 4.997 3.241 0.544 4.361 3.560 1.756 1.887 NA
1e7 2 0 unsorted regression v1 v2 by id2 id4 0.367 0.383 2.008 NA 0.738 NA 1.496 0.174 NA
1e7 2 0 unsorted sum v3 count by id1:id6 4.120 NA NA 7.684 5.102 45.176 6.206 1.670 NA
1e7 100 0 sorted sum v1 by id1 0.066 0.079 0.206 0.168 1.347 0.088 4.782 0.035 0.293
1e7 100 0 sorted sum v1 by id1:id2 0.043 0.143 0.576 0.327 2.521 0.165 0.354 0.088 0.120
1e7 100 0 sorted sum v1 mean v3 by id3 0.127 0.665 0.408 0.879 2.318 1.684 1.143 0.197 0.747
1e7 100 0 sorted mean v1:v3 by id4 0.191 0.549 0.572 0.546 0.823 0.227 0.723 0.042 0.060
1e7 100 0 sorted sum v1:v3 by id6 0.154 1.389 0.943 0.267 1.382 0.367 1.185 0.143 0.091
1e7 100 0 sorted median v3 sd v3 by id2 id4 0.765 1.013 1.028 NA NA NA 2.327 0.110 NA
1e7 100 0 sorted max v1 - min v2 by id2 id4 0.183 0.394 0.758 0.146 1.218 0.217 1.528 0.084 NA
1e7 100 0 sorted largest two v3 by id2 id4 0.628 5.418 3.856 1.567 3.057 68.095 1.541 0.891 NA
1e7 100 0 sorted regression v1 v2 by id2 id4 0.538 0.732 12.105 NA 1.337 NA 1.455 0.093 NA
1e7 100 0 sorted sum v3 count by id1:id6 0.704 NA NA 7.879 5.733 13.908 3.347 1.720 5.249
1e7 NA NA NA NA 21.825 NA NA NA NA NA 87.191 16.959 NA

5 GB

Set of basic questions

Set of advanced questions

Details table

Plots of all cases can be found at G1_1e8_1e2_0_0.basic.png, G1_1e8_1e1_0_0.basic.png, G1_1e8_2e0_0_0.basic.png, G1_1e8_1e2_0_1.basic.png, G1_1e8_1e2_0_0.advanced.png, G1_1e8_1e1_0_0.advanced.png, G1_1e8_2e0_0_0.advanced.png, G1_1e8_1e2_0_1.advanced.png. Below first run timings.

rows q1_grp.size NA_pct pre_sorted question data.table dplyr pandas pydatatable spark dask juliadf clickhouse cudf
1e8 100 0 unsorted sum v1 by id1 1.047 2.045 1.995 2.710 6.645 0.642 6.420 0.343 NA
1e8 100 0 unsorted sum v1 by id1:id2 0.968 6.121 6.647 4.441 3.827 1.526 4.806 0.810 NA
1e8 100 0 unsorted sum v1 mean v3 by id3 1.373 16.510 7.637 10.369 6.280 136.337 15.102 23.358 NA
1e8 100 0 unsorted mean v1:v3 by id4 1.547 8.077 5.786 2.012 1.298 1.767 4.081 0.250 NA
1e8 100 0 unsorted sum v1:v3 by id6 1.404 38.068 14.172 3.591 5.656 7.892 19.472 1.598 NA
1e8 100 0 unsorted median v3 sd v3 by id2 id4 9.497 14.250 14.044 NA NA NA 18.015 1.632 NA
1e8 100 0 unsorted max v1 - min v2 by id2 id4 4.432 9.181 8.809 1.516 2.260 1.583 14.653 0.783 NA
1e8 100 0 unsorted largest two v3 by id2 id4 6.369 73.909 44.726 76.225 11.920 137.981 15.232 10.099 NA
1e8 100 0 unsorted regression v1 v2 by id2 id4 5.295 10.701 44.874 NA 2.936 NA 16.334 0.917 NA
1e8 100 0 unsorted sum v3 count by id1:id6 15.636 NA NA 77.451 40.787 492.245 34.606 19.165 NA
1e8 10 0 unsorted sum v1 by id1 1.310 1.413 1.989 1.281 2.874 0.697 6.759 0.366 NA
1e8 10 0 unsorted sum v1 by id1:id2 1.236 2.684 5.300 3.825 2.859 1.168 2.033 1.330 NA
1e8 10 0 unsorted sum v1 mean v3 by id3 2.160 36.410 15.075 24.241 11.953 476.123 33.976 4.822 NA
1e8 10 0 unsorted mean v1:v3 by id4 2.204 4.533 5.936 1.000 1.222 1.886 3.724 2.190 NA
1e8 10 0 unsorted sum v1:v3 by id6 2.262 84.416 29.795 5.236 8.314 28.326 30.698 3.202 NA
1e8 10 0 unsorted median v3 sd v3 by id2 id4 9.119 9.235 9.510 NA NA NA 12.484 1.194 NA
1e8 10 0 unsorted max v1 - min v2 by id2 id4 3.870 4.473 7.095 1.217 1.552 1.222 9.390 1.353 NA
1e8 10 0 unsorted largest two v3 by id2 id4 6.071 69.254 40.090 6.852 12.726 57.127 9.342 12.725 NA
1e8 10 0 unsorted regression v1 v2 by id2 id4 4.874 5.681 28.342 NA 2.040 NA 10.991 0.492 NA
1e8 10 0 unsorted sum v3 count by id1:id6 25.395 NA NA 66.355 38.262 752.139 46.329 18.095 NA
1e8 2 0 unsorted sum v1 by id1 1.421 1.259 1.999 1.684 2.881 0.659 9.821 0.309 NA
1e8 2 0 unsorted sum v1 by id1:id2 1.543 1.997 5.279 1.847 1.779 1.112 1.997 1.235 NA
1e8 2 0 unsorted sum v1 mean v3 by id3 30.574 73.534 24.969 51.446 15.696 787.090 45.733 8.969 NA
1e8 2 0 unsorted mean v1:v3 by id4 2.138 2.941 5.501 1.642 1.304 1.776 4.287 2.775 NA
1e8 2 0 unsorted sum v1:v3 by id6 5.925 151.004 59.711 11.660 11.327 66.593 38.874 4.899 NA
1e8 2 0 unsorted median v3 sd v3 by id2 id4 4.996 5.948 8.640 NA NA NA 15.543 2.665 NA
1e8 2 0 unsorted max v1 - min v2 by id2 id4 1.749 3.118 6.963 1.545 1.536 1.271 11.059 1.153 NA
1e8 2 0 unsorted largest two v3 by id2 id4 4.996 66.960 37.532 2.241 33.392 33.485 13.872 26.693 NA
1e8 2 0 unsorted regression v1 v2 by id2 id4 3.326 4.074 20.407 NA 2.280 NA 11.494 0.612 NA
1e8 2 0 unsorted sum v3 count by id1:id6 18.381 NA NA 74.677 35.944 1227.718 65.239 15.733 NA
1e8 100 0 sorted sum v1 by id1 0.348 0.982 2.228 0.967 2.044 0.641 6.170 0.240 NA
1e8 100 0 sorted sum v1 by id1:id2 0.448 1.361 5.386 2.992 3.018 1.221 2.033 0.613 NA
1e8 100 0 sorted sum v1 mean v3 by id3 1.212 15.252 6.464 9.281 6.164 142.964 14.225 1.903 NA
1e8 100 0 sorted mean v1:v3 by id4 1.564 8.270 6.044 1.962 1.223 1.933 3.818 0.252 NA
1e8 100 0 sorted sum v1:v3 by id6 1.461 37.942 15.268 3.317 4.761 7.776 19.537 1.750 NA
1e8 100 0 sorted median v3 sd v3 by id2 id4 5.331 7.465 9.819 NA NA NA 11.629 0.757 NA
1e8 100 0 sorted max v1 - min v2 by id2 id4 1.405 3.259 7.203 1.089 2.215 1.662 8.319 0.523 NA
1e8 100 0 sorted largest two v3 by id2 id4 4.362 68.336 44.275 75.531 10.873 122.201 8.648 9.018 NA
1e8 100 0 sorted regression v1 v2 by id2 id4 3.151 4.492 31.808 NA 2.585 NA 8.814 0.717 NA
1e8 100 0 sorted sum v3 count by id1:id6 7.376 NA NA 59.768 39.428 481.175 32.898 18.576 NA
1e8 NA NA NA NA 207.776 NA NA NA NA NA 648.457 204.116 NA

50 GB

Set of basic questions

Set of advanced questions

Details table

Plots of all cases can be found at G1_1e9_1e2_0_0.basic.png, G1_1e9_1e1_0_0.basic.png, G1_1e9_2e0_0_0.basic.png, G1_1e9_1e2_0_1.basic.png, G1_1e9_1e2_0_0.advanced.png, G1_1e9_1e1_0_0.advanced.png, G1_1e9_2e0_0_0.advanced.png, G1_1e9_1e2_0_1.advanced.png. Below first run timings.

rows q1_grp.size NA_pct pre_sorted question data.table dplyr pandas pydatatable spark dask juliadf clickhouse cudf
1e9 100 0 unsorted sum v1 by id1 9.612 22.761 NA 23.499 12.445 NA NA 84.182 NA
1e9 100 0 unsorted sum v1 by id1:id2 9.297 66.515 NA 47.602 24.133 NA NA 33.629 NA
1e9 100 0 unsorted sum v1 mean v3 by id3 12.358 240.707 NA 137.216 83.709 NA NA 102.252 NA
1e9 100 0 unsorted mean v1:v3 by id4 14.370 88.961 NA 19.851 8.170 NA NA 40.064 NA
1e9 100 0 unsorted sum v1:v3 by id6 13.716 527.388 NA 23.304 60.479 NA NA 42.913 NA
1e9 100 0 unsorted median v3 sd v3 by id2 id4 119.386 176.374 NA NA NA NA NA 20.194 NA
1e9 100 0 unsorted max v1 - min v2 by id2 id4 54.512 124.054 NA 15.293 17.471 NA NA 7.612 NA
1e9 100 0 unsorted largest two v3 by id2 id4 73.191 960.340 NA 194.755 137.292 NA NA 187.408 NA
1e9 100 0 unsorted regression v1 v2 by id2 id4 80.065 139.877 NA NA 22.949 NA NA 90.058 NA
1e9 100 0 unsorted sum v3 count by id1:id6 180.351 NA NA 770.134 750.028 NA NA NA NA
1e9 10 0 unsorted sum v1 by id1 13.962 17.435 NA 12.822 11.843 NA NA 2.998 NA
1e9 10 0 unsorted sum v1 by id1:id2 12.715 30.620 NA 37.072 17.130 NA NA 12.302 NA
1e9 10 0 unsorted sum v1 mean v3 by id3 26.337 489.815 NA 325.870 127.824 NA NA 109.570 NA
1e9 10 0 unsorted mean v1:v3 by id4 22.629 49.797 NA 10.488 7.899 NA NA 46.934 NA
1e9 10 0 unsorted sum v1:v3 by id6 29.554 1305.122 NA 167.254 82.998 NA NA 54.523 NA
1e9 10 0 unsorted median v3 sd v3 by id2 id4 95.793 NA NA NA NA NA NA 20.956 NA
1e9 10 0 unsorted max v1 - min v2 by id2 id4 36.055 NA NA 12.743 11.580 NA NA 16.459 NA
1e9 10 0 unsorted largest two v3 by id2 id4 72.213 NA NA 29.408 143.484 NA NA 222.414 NA
1e9 10 0 unsorted regression v1 v2 by id2 id4 48.399 NA NA NA 13.608 NA NA 32.588 NA
1e9 10 0 unsorted sum v3 count by id1:id6 326.642 NA NA 732.003 764.813 NA NA NA NA
1e9 2 0 unsorted sum v1 by id1 14.459 12.608 NA 12.718 11.845 NA NA 2.380 NA
1e9 2 0 unsorted sum v1 by id1:id2 16.023 20.185 NA 16.502 14.138 NA NA 11.529 NA
1e9 2 0 unsorted sum v1 mean v3 by id3 NA NA NA 626.073 137.651 NA NA 174.785 NA
1e9 2 0 unsorted mean v1:v3 by id4 NA NA NA 19.743 7.803 NA NA NA NA
1e9 2 0 unsorted sum v1:v3 by id6 NA NA NA 129.454 113.318 NA NA NA NA
1e9 2 0 unsorted median v3 sd v3 by id2 id4 NA NA NA NA NA NA NA NA NA
1e9 2 0 unsorted max v1 - min v2 by id2 id4 NA NA NA 13.858 11.176 NA NA NA NA
1e9 2 0 unsorted largest two v3 by id2 id4 NA NA NA 22.245 479.219 NA NA NA NA
1e9 2 0 unsorted regression v1 v2 by id2 id4 NA NA NA NA 16.534 NA NA NA NA
1e9 2 0 unsorted sum v3 count by id1:id6 NA NA NA 814.960 593.575 NA NA NA NA
1e9 100 0 sorted sum v1 by id1 3.124 12.730 NA 8.885 10.655 NA NA 1.692 NA
1e9 100 0 sorted sum v1 by id1:id2 3.286 16.919 NA 15.842 13.028 NA NA 3.945 NA
1e9 100 0 sorted sum v1 mean v3 by id3 10.274 202.092 NA 89.565 91.454 NA NA 23.992 NA
1e9 100 0 sorted mean v1:v3 by id4 14.938 90.268 NA 19.933 8.246 NA NA 11.897 NA
1e9 100 0 sorted sum v1:v3 by id6 13.393 533.077 NA 23.314 62.228 NA NA 22.901 NA
1e9 100 0 sorted median v3 sd v3 by id2 id4 74.885 89.974 NA NA NA NA NA 8.162 NA
1e9 100 0 sorted max v1 - min v2 by id2 id4 28.426 49.192 NA 11.478 14.757 NA NA 4.880 NA
1e9 100 0 sorted largest two v3 by id2 id4 52.623 863.251 NA 192.886 98.977 NA NA 112.487 NA
1e9 100 0 sorted regression v1 v2 by id2 id4 43.096 59.314 NA NA 18.350 NA NA 27.457 NA
1e9 100 0 sorted sum v3 count by id1:id6 73.564 NA NA 542.407 430.035 NA NA NA NA
1e9 NA NA NA NA NA NA NA NA NA NA NA NA NA

Notes

Environment configuration

Component Value
cpu_model Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz
cpu_cores 40
memory_model DIMM Synchronous 2133 MHz
memory_gb 125.78
gpu_model GeForce GTX 1080 Ti
gpu_num 2
gpu_gb 21.83

Groupby benchmark run took around 62.7 hours.

Report was generated on: 2019-10-11 03:23:20 PDT.