If you have worked with spreadsheets a bit, you must have encountered lookup functions. The most common one is the VLOOKUP, which can be used to couple two tables:
What not everyone knows it that you can use LOOKUPs in another way as well, to classify data. That looks as follows:
By default, Excel (and most other spreadsheet systems too) return approximate matches as above: when the precise value is not found, the user gets the closest value below the search value. You can override this behavior by adding FALSE as fourth parameter. When doing that, Excel will return #N/A instead of an approximate match. I explain it in more detail in the video below:
I found the default very interesting because I have seen a lot of spreadsheets over the last few years, and my experience said that the first use is much more common. Since we have the wonderful Enron dataset on our hands now, we can finally check it out. In this post I give a quick overview of the results, but if you want all the details, have a look at the full paper.
What lookup use is more common?
As suspected, the use of the exact setting is more common than using TRUE or omitting the final parameter.
For the VLOOKUP functions the numbers are 5,198 approximate versus 7,010 exact and for HLKOOP those numbers are 976 versus 1,047. So in total, of the total 14,231 lookup functions in the entire Enron set, 43% is approximate and 57% is exact.
Are the approximate lookups needed?
Although it is already remarkable that there are more formulas not using the default than doing so, I must admit I was still surprised. Almost half of the uses was approximate? This did not comply with my ideas of what people used the VLOOKUp for. So, I dived into the approximate match sheets deeper, and found that, in many of the cases, approximate was used, but it was not needed (i.e. the search value occurred in the range precisely)
An example.
This type of use was common, in a staggering 4,792 (77.6%) of all approximate matches, the approximate setting was technically not needed. There is a benefit in using the approximate match, it is quicker (as binary search is used) but for small spreadsheets the difference is negligible, and one could wonder if most spreadsheet users know this.
The most likely explanation here is that users used the default, without changing it.
Who cares if people use the default?
Well, the interesting thing here is that there are risks associated with using the approximate match. It uses binary search and therefore, it requires the search range to be sorted. If not, funky things can happen. So, as a final research question, we looked into formulas using approximate match over an unsorted search range. Turns out, 23% of all approximate lookups searches over an unsorted search range. Often this is smelly, but harmless, but in 11 of the cases we found actual errors.
An remarkably deceiving example:
So, in short: VLOOKUPs default seems to be chosen in the wrong way, often people use the default setting while exact match would also do and using the default setting over an unsorted range can result in errors. Want to know more? Full paper is here:
Encore: The history of LOOKUP functions
Of course one would now wonder: why is the default for VLOOKUP chosen as it is? We have to go way back for that. Visical, widely considered to be the first spreadsheet program, already supported a lookup function among the 20 functions it supported. Interestingly enough, it did not support an IF formula, so the LOOKUP was likely used in those days to model decision tables:
In the first version of Excel, they supported the VLOOKUP function like the lookup function in VisiCalc, there was no parameter and everything was always approximate. This remained the case upto Excel 4.
You can read the preprint for all the details, to be plublished at VL/HCC 2015, or have a look at the slides I presented there:
Did you also investigate whether the users were doing some pre-processing before doing an exact match, where instead no preprocessing combined with an approximate match would suffice?
It wouldn’t surprise me if the users aren’t actually fully aware of the API’s abilities. 😉