This page presents results of h2oai.github.io/db-benchmark groupby task benchmark for various datasizes and various data characteristis (cardinality, percentage of missing values, pre-sorted input). There are 10 different questions run for each input data, questions are categorized into two groups. Basic questions refers to set of 5 questions designed by Matt Dowle (creator of data.table) in 2014 here. Advanced questions are 5 new questions meant to cover more complex queries, which are also less obvious to optimize.

Groupby

Below timings are presented for a single dataset case having random order, no NAs (missing values) and particular cardinality factor (group size question 1 k=100). To see timings for other cases scroll down to full timings table. If a solution is missing on particular data size timings table refer to benchplot for a reason and check its speed on smaller data size tab.

0.5 GB

Set of basic questions

Set of advanced questions

Details table

Plots of all cases can be found at G1_1e7_1e2_0_0.basic.png, G1_1e7_1e1_0_0.basic.png, G1_1e7_2e0_0_0.basic.png, G1_1e7_1e2_0_1.basic.png, G1_1e7_1e2_0_0.advanced.png, G1_1e7_1e1_0_0.advanced.png, G1_1e7_2e0_0_0.advanced.png, G1_1e7_1e2_0_1.advanced.png. Below first run timings.

rows q1_grp.size NA_pct pre_sorted question data.table dplyr pandas pydatatable spark dask juliadf
1e7 100 0 unsorted sum v1 by id1 0.097 0.456 0.242 0.408 1.609 0.099 4.310
1e7 100 0 unsorted sum v1 by id1:id2 0.115 0.799 0.628 5.821 2.585 0.172 0.634
1e7 100 0 unsorted sum v1 mean v3 by id3 0.133 1.324 0.410 26.662 1.861 1.337 1.116
1e7 100 0 unsorted mean v1:v3 by id4 0.176 0.598 0.500 0.279 0.775 0.175 0.762
1e7 100 0 unsorted sum v1:v3 by id6 0.148 1.448 0.873 0.537 1.697 0.291 1.205
1e7 100 0 unsorted median v3 sd v3 by id2 id4 0.901 163.153 2.169 NA NA NA 2.830
1e7 100 0 unsorted max v1 - min v2 by id2 id4 0.374 384.168 0.789 0.707 1.392 0.223 2.191
1e7 100 0 unsorted largest two v3 by id2 id4 0.895 6.136 3.615 50.794 3.068 67.177 1.976
1e7 100 0 unsorted regression v1 v2 by id2 id4 1.357 383.928 12.342 NA 1.340 NA 2.899
1e7 100 0 unsorted sum v3 count by id1:id6 1.669 61.846 10.477 505.593 4.973 14.254 3.620
1e7 10 0 unsorted sum v1 by id1 0.163 0.266 0.177 0.269 1.738 0.089 4.321
1e7 10 0 unsorted sum v1 by id1:id2 0.136 0.482 0.528 1.220 1.984 0.135 0.374
1e7 10 0 unsorted sum v1 mean v3 by id3 0.244 6.689 0.832 237.474 2.685 11.785 2.238
1e7 10 0 unsorted mean v1:v3 by id4 0.242 0.461 0.504 0.144 0.720 0.212 0.790
1e7 10 0 unsorted sum v1:v3 by id6 0.240 6.021 1.610 61.237 2.090 1.458 2.140
1e7 10 0 unsorted median v3 sd v3 by id2 id4 0.825 3.138 1.684 NA NA NA 2.397
1e7 10 0 unsorted max v1 - min v2 by id2 id4 0.288 4.956 0.634 0.300 0.698 0.148 1.742
1e7 10 0 unsorted largest two v3 by id2 id4 0.594 5.546 3.228 113.734 2.744 3.600 1.542
1e7 10 0 unsorted regression v1 v2 by id2 id4 0.477 4.698 2.411 NA 0.733 NA 2.416
1e7 10 0 unsorted sum v3 count by id1:id6 2.539 65.435 10.920 372.879 4.990 25.659 4.691
1e7 2 0 unsorted sum v1 by id1 0.157 0.295 0.181 0.239 1.691 0.076 4.586
1e7 2 0 unsorted sum v1 by id1:id2 0.168 0.490 0.470 0.485 1.271 0.165 0.367
1e7 2 0 unsorted sum v1 mean v3 by id3 0.533 18.085 2.028 189.587 3.092 25.886 4.617
1e7 2 0 unsorted mean v1:v3 by id4 0.250 0.348 0.474 0.128 0.811 0.236 0.901
1e7 2 0 unsorted sum v1:v3 by id6 0.812 13.501 4.445 36.082 2.231 4.040 3.601
1e7 2 0 unsorted median v3 sd v3 by id2 id4 0.591 0.848 1.576 NA NA NA 2.528
1e7 2 0 unsorted max v1 - min v2 by id2 id4 0.151 0.522 0.610 0.286 0.655 0.218 1.842
1e7 2 0 unsorted largest two v3 by id2 id4 0.562 5.691 2.934 5.790 4.586 3.708 1.760
1e7 2 0 unsorted regression v1 v2 by id2 id4 0.289 0.647 1.907 NA 0.603 NA 2.486
1e7 2 0 unsorted sum v3 count by id1:id6 6.080 79.565 12.609 88.821 4.911 42.034 8.081
1e7 100 0 sorted sum v1 by id1 0.057 0.271 0.179 0.365 1.708 0.086 4.278
1e7 100 0 sorted sum v1 by id1:id2 0.049 0.464 0.510 5.376 1.744 0.159 0.382
1e7 100 0 sorted sum v1 mean v3 by id3 0.135 1.334 0.394 24.434 2.074 1.568 1.121
1e7 100 0 sorted mean v1:v3 by id4 0.183 0.599 0.483 0.284 0.806 0.244 0.747
1e7 100 0 sorted sum v1:v3 by id6 0.126 1.498 0.867 0.544 2.047 0.310 1.244
1e7 100 0 sorted median v3 sd v3 by id2 id4 0.559 150.555 1.717 NA NA NA 2.366
1e7 100 0 sorted max v1 - min v2 by id2 id4 0.209 252.464 0.652 0.619 1.207 0.184 1.661
1e7 100 0 sorted largest two v3 by id2 id4 0.821 5.894 3.594 46.963 2.987 66.018 1.535
1e7 100 0 sorted regression v1 v2 by id2 id4 0.980 263.195 11.703 NA 1.227 NA 2.300
1e7 100 0 sorted sum v3 count by id1:id6 0.691 53.478 7.607 590.816 5.591 13.783 3.593
1e7 NA NA NA NA 25.016 1951.292 109.513 NA NA NA 94.190

5 GB

Set of basic questions

Set of advanced questions

Details table

Plots of all cases can be found at G1_1e8_1e2_0_0.basic.png, G1_1e8_1e1_0_0.basic.png, G1_1e8_2e0_0_0.basic.png, G1_1e8_1e2_0_1.basic.png, G1_1e8_1e2_0_0.advanced.png, G1_1e8_1e1_0_0.advanced.png, G1_1e8_2e0_0_0.advanced.png, G1_1e8_1e2_0_1.advanced.png. Below first run timings.

rows q1_grp.size NA_pct pre_sorted question data.table dplyr pandas pydatatable spark dask juliadf
1e8 100 0 unsorted sum v1 by id1 0.869 3.435 1.755 3.349 3.299 0.678 5.925
1e8 100 0 unsorted sum v1 by id1:id2 0.912 9.239 6.122 34.204 3.526 1.447 5.220
1e8 100 0 unsorted sum v1 mean v3 by id3 1.388 36.771 6.815 124.044 6.172 137.014 14.870
1e8 100 0 unsorted mean v1:v3 by id4 1.482 8.013 4.676 2.553 1.330 1.798 4.245
1e8 100 0 unsorted sum v1:v3 by id6 1.335 37.418 13.645 32.712 4.813 7.657 19.492
1e8 100 0 unsorted median v3 sd v3 by id2 id4 9.423 5843.789 24.690 NA NA NA 18.488
1e8 100 0 unsorted max v1 - min v2 by id2 id4 3.973 NA 8.093 3.672 2.210 1.279 15.901
1e8 100 0 unsorted largest two v3 by id2 id4 7.765 NA 41.946 NA 12.631 137.914 14.702
1e8 100 0 unsorted regression v1 v2 by id2 id4 5.669 NA 42.686 NA 2.774 NA 18.229
1e8 100 0 unsorted sum v3 count by id1:id6 16.197 NA 219.293 NA 39.200 481.106 36.499
1e8 10 0 unsorted sum v1 by id1 1.264 3.029 1.778 2.119 2.130 0.677 6.649
1e8 10 0 unsorted sum v1 by id1:id2 1.419 5.555 4.778 8.228 3.093 1.092 2.532
1e8 10 0 unsorted sum v1 mean v3 by id3 2.376 82.747 14.286 1701.685 11.192 471.830 35.235
1e8 10 0 unsorted mean v1:v3 by id4 2.306 4.490 4.752 2.014 1.307 1.696 4.364
1e8 10 0 unsorted sum v1:v3 by id6 2.556 76.155 29.273 24.924 7.646 27.897 31.433
1e8 10 0 unsorted median v3 sd v3 by id2 id4 8.041 74.429 20.245 NA NA NA 13.436
1e8 10 0 unsorted max v1 - min v2 by id2 id4 3.961 114.749 6.265 2.491 1.819 1.009 11.689
1e8 10 0 unsorted largest two v3 by id2 id4 6.176 71.842 37.409 121.396 13.522 58.440 10.533
1e8 10 0 unsorted regression v1 v2 by id2 id4 4.568 108.550 26.206 NA 2.075 NA 14.127
1e8 10 0 unsorted sum v3 count by id1:id6 20.077 NA 126.945 NA 29.037 732.665 52.737
1e8 2 0 unsorted sum v1 by id1 1.344 3.008 1.800 1.623 2.918 0.583 9.371
1e8 2 0 unsorted sum v1 by id1:id2 1.518 5.190 4.774 3.154 2.114 1.078 2.530
1e8 2 0 unsorted sum v1 mean v3 by id3 6.243 199.836 23.128 1808.385 16.986 739.291 56.440
1e8 2 0 unsorted mean v1:v3 by id4 2.174 2.925 4.712 1.209 1.374 1.867 6.044
1e8 2 0 unsorted sum v1:v3 by id6 5.960 151.119 57.860 43.061 14.238 66.452 40.565
1e8 2 0 unsorted median v3 sd v3 by id2 id4 4.949 7.995 19.468 NA NA NA 17.622
1e8 2 0 unsorted max v1 - min v2 by id2 id4 1.460 4.787 6.195 2.169 1.602 1.001 16.637
1e8 2 0 unsorted largest two v3 by id2 id4 5.136 68.253 35.025 12.316 39.979 33.537 15.783
1e8 2 0 unsorted regression v1 v2 by id2 id4 3.239 6.430 19.541 NA 1.824 NA 17.035
1e8 2 0 unsorted sum v3 count by id1:id6 64.161 NA 152.277 885.626 37.913 1134.763 85.066
1e8 100 0 sorted sum v1 by id1 0.325 2.535 1.815 1.859 2.846 0.697 6.232
1e8 100 0 sorted sum v1 by id1:id2 0.400 4.230 4.806 10.556 3.157 1.153 2.557
1e8 100 0 sorted sum v1 mean v3 by id3 1.090 30.514 5.228 216.029 6.494 144.941 15.437
1e8 100 0 sorted mean v1:v3 by id4 1.439 8.857 4.773 2.610 1.245 1.823 4.246
1e8 100 0 sorted sum v1:v3 by id6 1.362 38.287 14.231 63.664 4.882 6.641 19.584
1e8 100 0 sorted median v3 sd v3 by id2 id4 5.292 3037.997 20.308 NA NA NA 12.496
1e8 100 0 sorted max v1 - min v2 by id2 id4 1.652 NA 6.300 3.262 2.050 1.289 10.242
1e8 100 0 sorted largest two v3 by id2 id4 4.693 NA 41.689 NA 10.757 120.610 9.942
1e8 100 0 sorted regression v1 v2 by id2 id4 3.357 NA 29.751 NA 2.550 NA 11.040
1e8 100 0 sorted sum v3 count by id1:id6 6.890 NA 109.522 NA 38.164 461.749 36.642
1e8 NA NA NA NA 224.441 NA 1204.861 NA NA NA 731.817

50 GB

Set of basic questions

Set of advanced questions

Details table

Plots of all cases can be found at G1_1e9_1e2_0_0.basic.png, G1_1e9_1e1_0_0.basic.png, G1_1e9_2e0_0_0.basic.png, G1_1e9_1e2_0_1.basic.png, G1_1e9_1e2_0_0.advanced.png, G1_1e9_1e1_0_0.advanced.png, G1_1e9_2e0_0_0.advanced.png, G1_1e9_1e2_0_1.advanced.png. Below first run timings.

rows q1_grp.size NA_pct pre_sorted question data.table dplyr pandas pydatatable spark dask juliadf
1e9 100 0 unsorted sum v1 by id1 8.195 41.233 NA 51.171 18.094 NA NA
1e9 100 0 unsorted sum v1 by id1:id2 11.106 108.462 NA 168.352 24.915 NA NA
1e9 100 0 unsorted sum v1 mean v3 by id3 13.292 527.502 NA 2060.718 74.070 NA NA
1e9 100 0 unsorted mean v1:v3 by id4 16.907 94.488 NA 40.236 8.631 NA NA
1e9 100 0 unsorted sum v1:v3 by id6 13.035 539.830 NA 85.019 63.283 NA NA
1e9 100 0 unsorted median v3 sd v3 by id2 id4 111.254 NA NA NA NA NA NA
1e9 100 0 unsorted max v1 - min v2 by id2 id4 53.947 NA NA 37.854 17.525 NA NA
1e9 100 0 unsorted largest two v3 by id2 id4 74.272 NA NA NA 143.411 NA NA
1e9 100 0 unsorted regression v1 v2 by id2 id4 73.545 NA NA NA 23.240 NA NA
1e9 100 0 unsorted sum v3 count by id1:id6 177.704 NA NA NA 979.816 NA NA
1e9 10 0 unsorted sum v1 by id1 11.327 32.903 NA 19.783 11.418 NA NA
1e9 10 0 unsorted sum v1 by id1:id2 13.001 67.317 NA 77.581 22.099 NA NA
1e9 10 0 unsorted sum v1 mean v3 by id3 28.024 1129.544 NA NA 100.284 NA NA
1e9 10 0 unsorted mean v1:v3 by id4 22.360 53.460 NA NA 7.428 NA NA
1e9 10 0 unsorted sum v1:v3 by id6 27.769 NA NA NA 86.649 NA NA
1e9 10 0 unsorted median v3 sd v3 by id2 id4 92.940 NA NA NA NA NA NA
1e9 10 0 unsorted max v1 - min v2 by id2 id4 35.388 NA NA NA 11.613 NA NA
1e9 10 0 unsorted largest two v3 by id2 id4 69.499 NA NA NA 142.066 NA NA
1e9 10 0 unsorted regression v1 v2 by id2 id4 46.012 NA NA NA 16.276 NA NA
1e9 10 0 unsorted sum v3 count by id1:id6 253.374 NA NA NA 1102.860 NA NA
1e9 2 0 unsorted sum v1 by id1 13.534 30.670 NA 16.639 11.409 NA NA
1e9 2 0 unsorted sum v1 by id1:id2 15.546 53.885 NA 30.520 13.880 NA NA
1e9 2 0 unsorted sum v1 mean v3 by id3 NA NA NA NA 219.396 NA NA
1e9 2 0 unsorted mean v1:v3 by id4 NA NA NA NA 7.748 NA NA
1e9 2 0 unsorted sum v1:v3 by id6 NA NA NA NA 146.303 NA NA
1e9 2 0 unsorted median v3 sd v3 by id2 id4 NA NA NA NA NA NA NA
1e9 2 0 unsorted max v1 - min v2 by id2 id4 NA NA NA NA 10.657 NA NA
1e9 2 0 unsorted largest two v3 by id2 id4 NA NA NA NA 481.419 NA NA
1e9 2 0 unsorted regression v1 v2 by id2 id4 NA NA NA NA 16.503 NA NA
1e9 2 0 unsorted sum v3 count by id1:id6 NA NA NA NA 676.685 NA NA
1e9 100 0 sorted sum v1 by id1 3.143 27.987 NA 16.833 10.299 NA NA
1e9 100 0 sorted sum v1 by id1:id2 4.427 50.633 NA 47.846 13.103 NA NA
1e9 100 0 sorted sum v1 mean v3 by id3 10.361 384.564 NA 2668.894 80.457 NA NA
1e9 100 0 sorted mean v1:v3 by id4 15.947 86.465 NA 36.934 7.809 NA NA
1e9 100 0 sorted sum v1:v3 by id6 14.370 514.740 NA 74.700 60.178 NA NA
1e9 100 0 sorted median v3 sd v3 by id2 id4 81.280 NA NA NA NA NA NA
1e9 100 0 sorted max v1 - min v2 by id2 id4 28.377 NA NA 29.759 14.102 NA NA
1e9 100 0 sorted largest two v3 by id2 id4 53.611 NA NA NA 98.020 NA NA
1e9 100 0 sorted regression v1 v2 by id2 id4 46.182 NA NA NA 18.575 NA NA
1e9 100 0 sorted sum v3 count by id1:id6 67.799 NA NA NA 416.538 NA NA
1e9 NA NA NA NA NA NA NA NA NA NA NA

Environment configuration

Component Value
cpu_model Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz
cpu_cores 40
memory_model DIMM Synchronous 2133 MHz
memory_gb 125.78

Groupby benchmark run took around 54 hours.

Report was generated on: 2019-06-02 13:53:58 PDT.