This page presents results of h2oai.github.io/db-benchmark groupby task benchmark for various datasizes and various data characteristis (cardinality, percentage of missing values, pre-sorted input). There are 10 different questions run for each input data, questions are categorized into two groups. Basic questions refers to set of 5 questions designed by Matt Dowle (creator of data.table) in 2014 here. Advanced questions are 5 new questions meant to cover more complex queries, which are also less obvious to optimize.

Groupby

Below timings are presented for a single dataset case having random order, no NAs (missing values) and particular cardinality factor (group size question 1 k=100). To see timings for other cases scroll down to full timings table. If a solution is missing on particular data size timings table refer to benchplot for a reason and check its speed on smaller data size tab.

0.5 GB

Set of basic questions

Set of advanced questions

Details table

Plots of all cases can be found at G1_1e7_1e2_0_0.basic.png, G1_1e7_1e1_0_0.basic.png, G1_1e7_2e0_0_0.basic.png, G1_1e7_1e2_0_1.basic.png, G1_1e7_1e2_0_0.advanced.png, G1_1e7_1e1_0_0.advanced.png, G1_1e7_2e0_0_0.advanced.png, G1_1e7_1e2_0_1.advanced.png. Below first run timings.

rows q1_grp.size NA_pct pre_sorted question data.table dplyr pandas pydatatable spark dask juliadf
1e7 100 0 unsorted sum v1 by id1 0.092 0.308 0.173 0.157 1.192 0.085 4.143
1e7 100 0 unsorted sum v1 by id1:id2 0.089 0.799 0.595 0.722 1.403 0.175 0.462
1e7 100 0 unsorted sum v1 mean v3 by id3 0.124 1.522 0.372 0.859 1.342 1.321 2.377
1e7 100 0 unsorted mean v1:v3 by id4 0.157 0.545 0.409 0.101 0.603 0.176 0.650
1e7 100 0 unsorted sum v1:v3 by id6 0.132 1.639 0.855 0.155 0.972 0.257 1.147
1e7 100 0 unsorted median v3 sd v3 by id2 id4 0.929 1.692 1.707 NA NA NA 3.089
1e7 100 0 unsorted max v1 - min v2 by id2 id4 0.402 0.990 14.290 0.241 0.806 47.913 1.999
1e7 100 0 unsorted largest two v3 by id2 id4 0.922 6.392 3.468 0.766 2.795 71.710 1.885
1e7 100 0 unsorted regression v1 v2 by id2 id4 1.116 1.765 13.791 NA 1.060 30.688 2.774
1e7 100 0 unsorted sum v3 count by id1:id6 1.419 51.391 8.505 9.696 6.488 11.241 3.343
1e7 10 0 unsorted sum v1 by id1 0.147 0.225 0.145 0.154 0.952 0.068 3.650
1e7 10 0 unsorted sum v1 by id1:id2 0.109 0.445 0.415 0.344 0.931 0.114 0.323
1e7 10 0 unsorted sum v1 mean v3 by id3 0.201 6.371 0.894 1.857 2.329 11.945 4.494
1e7 10 0 unsorted mean v1:v3 by id4 0.227 0.399 0.403 0.093 0.782 0.168 0.798
1e7 10 0 unsorted sum v1:v3 by id6 0.205 5.750 1.643 0.384 1.800 1.316 2.355
1e7 10 0 unsorted median v3 sd v3 by id2 id4 0.824 0.900 1.288 NA NA NA 3.787
1e7 10 0 unsorted max v1 - min v2 by id2 id4 0.272 0.624 3.102 0.192 0.525 52.272 2.314
1e7 10 0 unsorted largest two v3 by id2 id4 0.544 5.373 3.027 1.424 2.765 3.302 2.393
1e7 10 0 unsorted regression v1 v2 by id2 id4 0.399 0.641 2.494 NA 0.742 3.119 3.121
1e7 10 0 unsorted sum v3 count by id1:id6 2.361 56.595 8.500 6.137 6.372 22.851 6.452
1e7 2 0 unsorted sum v1 by id1 0.152 0.249 0.144 0.147 0.974 0.060 3.658
1e7 2 0 unsorted sum v1 by id1:id2 0.153 0.430 0.410 0.272 0.705 0.125 0.465
1e7 2 0 unsorted sum v1 mean v3 by id3 0.403 17.619 2.115 2.430 2.914 23.016 9.886
1e7 2 0 unsorted mean v1:v3 by id4 0.231 0.264 0.404 0.119 1.041 0.167 1.218
1e7 2 0 unsorted sum v1:v3 by id6 0.615 12.285 4.440 0.408 2.617 3.057 7.938
1e7 2 0 unsorted median v3 sd v3 by id2 id4 0.483 0.681 1.226 NA NA NA 6.233
1e7 2 0 unsorted max v1 - min v2 by id2 id4 0.176 0.403 1.873 0.198 0.537 36.912 3.093
1e7 2 0 unsorted largest two v3 by id2 id4 0.479 5.690 2.827 0.421 3.895 2.632 4.610
1e7 2 0 unsorted regression v1 v2 by id2 id4 0.312 0.497 2.019 NA 0.720 2.807 4.373
1e7 2 0 unsorted sum v3 count by id1:id6 6.574 69.601 10.675 4.210 6.166 36.273 10.941
1e7 100 0 sorted sum v1 by id1 0.044 0.213 0.145 0.153 1.356 0.072 4.106
1e7 100 0 sorted sum v1 by id1:id2 0.045 0.396 0.412 0.290 1.194 0.140 0.292
1e7 100 0 sorted sum v1 mean v3 by id3 0.120 1.457 0.356 0.639 1.490 1.351 2.228
1e7 100 0 sorted mean v1:v3 by id4 0.143 0.504 0.408 0.087 0.624 0.202 0.606
1e7 100 0 sorted sum v1:v3 by id6 0.129 1.611 0.863 0.152 0.996 0.279 1.213
1e7 100 0 sorted median v3 sd v3 by id2 id4 0.543 1.127 1.302 NA NA NA 2.576
1e7 100 0 sorted max v1 - min v2 by id2 id4 0.193 0.501 12.300 0.167 0.754 42.094 1.361
1e7 100 0 sorted largest two v3 by id2 id4 0.768 5.744 3.499 0.698 2.411 71.681 1.504
1e7 100 0 sorted regression v1 v2 by id2 id4 0.841 1.178 12.495 NA 1.055 29.287 1.957
1e7 100 0 sorted sum v3 count by id1:id6 0.553 42.175 5.847 6.055 6.419 10.656 3.370
1e7 NA NA NA NA 23.628 306.991 129.836 NA NA NA 123.184

5 GB

Set of basic questions

Set of advanced questions

Details table

Plots of all cases can be found at G1_1e8_1e2_0_0.basic.png, G1_1e8_1e1_0_0.basic.png, G1_1e8_2e0_0_0.basic.png, G1_1e8_1e2_0_1.basic.png, G1_1e8_1e2_0_0.advanced.png, G1_1e8_1e1_0_0.advanced.png, G1_1e8_2e0_0_0.advanced.png, G1_1e8_1e2_0_1.advanced.png. Below first run timings.

rows q1_grp.size NA_pct pre_sorted question data.table dplyr pandas pydatatable spark dask juliadf
1e8 100 0 unsorted sum v1 by id1 0.878 3.093 1.438 1.251 2.395 0.554 8.376
1e8 100 0 unsorted sum v1 by id1:id2 0.869 9.210 5.886 8.969 4.488 1.318 3.600
1e8 100 0 unsorted sum v1 mean v3 by id3 1.210 38.198 7.274 10.898 8.311 133.275 35.868
1e8 100 0 unsorted mean v1:v3 by id4 1.496 7.622 4.048 0.834 1.760 1.606 3.025
1e8 100 0 unsorted sum v1:v3 by id6 1.272 39.317 15.182 2.234 6.305 7.537 19.709
1e8 100 0 unsorted median v3 sd v3 by id2 id4 10.214 17.175 17.508 NA NA NA 22.082
1e8 100 0 unsorted max v1 - min v2 by id2 id4 4.205 10.655 62.466 1.996 3.150 NA 18.524
1e8 100 0 unsorted largest two v3 by id2 id4 7.303 74.452 42.422 88.120 18.480 NA 18.530
1e8 100 0 unsorted regression v1 v2 by id2 id4 5.818 12.106 46.950 NA 4.222 NA 22.833
1e8 100 0 unsorted sum v3 count by id1:id6 16.899 NA 210.419 73.726 56.777 NA 51.274
1e8 10 0 unsorted sum v1 by id1 1.268 2.683 1.416 1.171 5.262 0.537 9.418
1e8 10 0 unsorted sum v1 by id1:id2 1.157 5.014 4.036 3.229 3.243 0.982 1.921
1e8 10 0 unsorted sum v1 mean v3 by id3 2.213 81.520 16.552 21.902 14.485 443.816 71.452
1e8 10 0 unsorted mean v1:v3 by id4 1.941 3.968 3.999 0.795 1.596 1.600 4.008
1e8 10 0 unsorted sum v1:v3 by id6 2.299 76.195 34.108 3.332 10.958 24.140 54.156
1e8 10 0 unsorted median v3 sd v3 by id2 id4 8.839 9.870 12.697 NA NA NA 37.122
1e8 10 0 unsorted max v1 - min v2 by id2 id4 3.066 5.589 34.185 1.733 2.109 NA 24.164
1e8 10 0 unsorted largest two v3 by id2 id4 5.701 68.650 35.648 4.600 16.921 NA 31.843
1e8 10 0 unsorted regression v1 v2 by id2 id4 4.359 6.559 26.435 NA 3.120 NA 28.283
1e8 10 0 unsorted sum v3 count by id1:id6 21.280 NA 112.222 68.924 55.813 NA 67.777
1e8 2 0 unsorted sum v1 by id1 1.293 2.417 1.432 1.221 2.876 0.517 16.341
1e8 2 0 unsorted sum v1 by id1:id2 1.439 4.551 4.017 2.306 2.821 0.974 3.416
1e8 2 0 unsorted sum v1 mean v3 by id3 6.081 199.405 28.017 28.581 20.397 780.387 106.054
1e8 2 0 unsorted mean v1:v3 by id4 2.047 2.583 4.008 1.081 1.703 1.564 7.669
1e8 2 0 unsorted sum v1:v3 by id6 6.240 146.987 67.300 4.864 16.613 51.412 92.149
1e8 2 0 unsorted median v3 sd v3 by id2 id4 4.726 6.831 11.970 NA NA NA 51.534
1e8 2 0 unsorted max v1 - min v2 by id2 id4 1.718 4.085 18.676 1.606 2.215 363.197 35.220
1e8 2 0 unsorted largest two v3 by id2 id4 4.683 65.305 34.225 3.132 38.026 27.853 50.601
1e8 2 0 unsorted regression v1 v2 by id2 id4 3.100 4.960 19.684 NA 3.248 30.841 43.544
1e8 2 0 unsorted sum v3 count by id1:id6 64.890 689.287 140.629 45.901 56.494 1082.639 113.471
1e8 100 0 sorted sum v1 by id1 0.352 2.180 1.462 1.152 2.670 0.528 8.263
1e8 100 0 sorted sum v1 by id1:id2 0.380 4.215 4.065 2.449 3.354 1.051 1.604
1e8 100 0 sorted sum v1 mean v3 by id3 1.003 31.643 5.289 7.259 8.361 133.472 33.275
1e8 100 0 sorted mean v1:v3 by id4 1.573 7.414 4.079 0.873 1.646 1.595 3.033
1e8 100 0 sorted sum v1:v3 by id6 1.274 39.110 14.160 2.653 6.006 7.506 20.363
1e8 100 0 sorted median v3 sd v3 by id2 id4 5.426 8.058 12.165 NA NA NA 16.096
1e8 100 0 sorted max v1 - min v2 by id2 id4 1.812 4.461 32.968 1.610 2.858 NA 11.029
1e8 100 0 sorted largest two v3 by id2 id4 5.113 69.170 40.819 88.579 15.888 NA 12.933
1e8 100 0 sorted regression v1 v2 by id2 id4 3.449 5.885 30.025 NA 3.633 NA 12.219
1e8 100 0 sorted sum v3 count by id1:id6 6.382 NA 77.548 46.457 54.679 NA 50.223
1e8 NA NA NA NA 225.268 NA 1247.429 NA NA NA 1223.002

50 GB

Set of basic questions

Set of advanced questions

Details table

Plots of all cases can be found at G1_1e9_1e2_0_0.basic.png, G1_1e9_1e1_0_0.basic.png, G1_1e9_2e0_0_0.basic.png, G1_1e9_1e2_0_1.basic.png, G1_1e9_1e2_0_0.advanced.png, G1_1e9_1e1_0_0.advanced.png, G1_1e9_2e0_0_0.advanced.png, G1_1e9_1e2_0_1.advanced.png. Below first run timings.

rows q1_grp.size NA_pct pre_sorted question data.table dplyr pandas pydatatable spark dask juliadf
1e9 100 0 unsorted sum v1 by id1 9.588 31.658 NA 13.050 17.371 NA 49.317
1e9 100 0 unsorted sum v1 by id1:id2 9.972 92.808 NA 78.502 37.007 NA 38.392
1e9 100 0 unsorted sum v1 mean v3 by id3 16.030 511.756 NA 176.168 197.669 NA 680.282
1e9 100 0 unsorted mean v1:v3 by id4 14.731 73.674 NA 8.631 12.482 NA 26.012
1e9 100 0 unsorted sum v1:v3 by id6 17.962 529.403 NA 26.510 86.704 NA NA
1e9 100 0 unsorted median v3 sd v3 by id2 id4 120.951 189.748 NA NA NA NA NA
1e9 100 0 unsorted max v1 - min v2 by id2 id4 57.580 126.371 NA 21.087 27.698 NA NA
1e9 100 0 unsorted largest two v3 by id2 id4 83.903 871.827 NA 147.504 199.847 NA NA
1e9 100 0 unsorted regression v1 v2 by id2 id4 76.823 143.393 NA NA 35.168 NA NA
1e9 100 0 unsorted sum v3 count by id1:id6 181.221 NA NA 847.809 745.225 NA NA
1e9 10 0 unsorted sum v1 by id1 11.802 25.619 NA 11.690 16.916 NA NA
1e9 10 0 unsorted sum v1 by id1:id2 13.024 53.757 NA 35.408 25.240 NA NA
1e9 10 0 unsorted sum v1 mean v3 by id3 31.518 1057.684 NA 307.005 212.430 NA NA
1e9 10 0 unsorted mean v1:v3 by id4 20.135 39.260 NA 8.055 12.365 NA NA
1e9 10 0 unsorted sum v1:v3 by id6 33.578 977.941 NA 32.788 141.910 NA NA
1e9 10 0 unsorted median v3 sd v3 by id2 id4 89.289 NA NA NA NA NA NA
1e9 10 0 unsorted max v1 - min v2 by id2 id4 30.333 NA NA 16.174 18.323 NA NA
1e9 10 0 unsorted largest two v3 by id2 id4 64.319 NA NA 34.750 188.234 NA NA
1e9 10 0 unsorted regression v1 v2 by id2 id4 44.521 NA NA NA 26.805 NA NA
1e9 10 0 unsorted sum v3 count by id1:id6 254.895 NA NA 757.490 806.852 NA NA
1e9 2 0 unsorted sum v1 by id1 11.591 23.708 NA 11.734 16.286 NA NA
1e9 2 0 unsorted sum v1 by id1:id2 12.626 45.135 NA 22.640 25.719 NA NA
1e9 2 0 unsorted sum v1 mean v3 by id3 69.301 NA NA 373.598 311.432 NA NA
1e9 2 0 unsorted mean v1:v3 by id4 NA NA NA 9.458 12.549 NA NA
1e9 2 0 unsorted sum v1:v3 by id6 NA NA NA 69.303 186.184 NA NA
1e9 2 0 unsorted median v3 sd v3 by id2 id4 NA NA NA NA NA NA NA
1e9 2 0 unsorted max v1 - min v2 by id2 id4 NA NA NA 18.188 18.005 NA NA
1e9 2 0 unsorted largest two v3 by id2 id4 NA NA NA 32.894 487.745 NA NA
1e9 2 0 unsorted regression v1 v2 by id2 id4 NA NA NA NA 21.143 NA NA
1e9 2 0 unsorted sum v3 count by id1:id6 NA NA NA 538.923 602.290 NA NA
1e9 100 0 sorted sum v1 by id1 2.930 24.034 NA 11.746 14.551 NA 46.729
1e9 100 0 sorted sum v1 by id1:id2 3.447 41.597 NA 24.174 20.823 NA 14.714
1e9 100 0 sorted sum v1 mean v3 by id3 8.992 392.030 NA 81.802 105.391 NA 552.577
1e9 100 0 sorted mean v1:v3 by id4 14.451 73.812 NA 8.242 11.660 NA 25.722
1e9 100 0 sorted sum v1:v3 by id6 17.857 529.226 NA 26.144 85.502 NA NA
1e9 100 0 sorted median v3 sd v3 by id2 id4 80.412 85.326 NA NA NA NA NA
1e9 100 0 sorted max v1 - min v2 by id2 id4 26.054 54.065 NA 16.123 22.771 NA NA
1e9 100 0 sorted largest two v3 by id2 id4 54.412 833.108 NA 138.117 148.612 NA NA
1e9 100 0 sorted regression v1 v2 by id2 id4 39.546 64.009 NA NA 30.214 NA NA
1e9 100 0 sorted sum v3 count by id1:id6 56.045 NA NA 462.378 561.954 NA NA
1e9 NA NA NA NA NA NA NA NA NA NA NA

Environment configuration

Component Value
cpu_model Intel(R) Xeon(R) CPU E5-2660 v3 @ 2.60GHz
cpu_cores 20
memory_model DIMM DDR4 Synchronous 2133 MHz
memory_gb 125.8

Groupby benchmark run took around 43 hours.

Report was generated on: 2019-03-13 08:48:58 PDT.