A Technical Guide to Digital Investigation and Litigation Support https://doi.org/10.1016/B978-1-59749-296-6.00007-9
When shopping online, you will sometimes see a suggestion of the following form: "Customers who bought item X also bought item Y." This suggestion is an example of an association rule. To derive it, you first have to know which items on the market most frequently co-occur in customers' shopping baskets, and this is where the FP-Growth algorithm comes in. FP-Growth is an efficient algorithm for finding frequently co-occurring items in a transaction database. To understand how it works, let's start with some terminology, using a customer transaction as an example:
Here, the words "basket" and "transaction" are used interchangeably, because we identify the customer's shopping basket with the items that were purchased. To make these definitions concrete, consider the following transaction database:
Nine distinct items are for sale, and there are four baskets (transactions), each with a varying number of items. The most frequent item, product7, appears four times in the transaction database. Each of the following itemsets occurs twice: (product1, product7), (product2, product7), (product6, product7).
An FP-tree data structure can be created efficiently, and it compresses the data so much that, in many cases, even large databases fit into main memory. In the example above, the FP-tree would have product7, the most frequently occurring product, next to the root, with branches from product7 to product1, product2, and product6. If we insist that a product must appear more than once in the transaction database, then the remaining products are excluded from the FP-tree. The transaction database might have started out as a 4 x 9 (transactions x products) data table with many zero entries, but it is now reduced to a minimal tree that captures only the relevant frequency data.
Even with an efficient tree structure, the number of itemsets considered by the algorithm can grow very large. If necessary, you can use the parameter max number of itemsets to reduce runtime and memory usage.
Remember that online shopping is merely an example; the FP-Growth algorithm can be applied to any problem that can be formulated in terms of items, itemsets, and baskets (transactions). The typical setting for the algorithm is a large transaction database (many baskets) in which each basket contains only a small number of items compared to the set of all items.
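
The FP-tree construction described above can be sketched in a few lines of Python. This is a minimal illustration, not the Operator's implementation: the baskets are hypothetical, chosen only to reproduce the counts quoted above (nine items, four baskets, product7 in all four, and the three pairs occurring twice each), and the names `Node`, `build_fp_tree`, `show`, and `min_count` are invented for this example. The sketch also leaves out the header table and node links that a full FP-Growth implementation uses for mining.

```python
from collections import Counter

# Hypothetical baskets, invented to match the counts described in the text.
transactions = [
    ["product1", "product3", "product7"],
    ["product1", "product2", "product4", "product7"],
    ["product2", "product5", "product6", "product7"],
    ["product6", "product7", "product8", "product9"],
]


class Node:
    """One FP-tree node: an item, its count, and its children keyed by item."""

    def __init__(self, item):
        self.item = item
        self.count = 0
        self.children = {}


def build_fp_tree(baskets, min_count):
    # Pass 1: count how many baskets contain each item.
    counts = Counter(item for basket in baskets for item in set(basket))
    frequent = {item for item, n in counts.items() if n >= min_count}

    root = Node(None)
    # Pass 2: insert each basket, keeping only the frequent items,
    # ordered by descending count (ties broken by item name).
    for basket in baskets:
        items = sorted(frequent & set(basket), key=lambda i: (-counts[i], i))
        node = root
        for item in items:
            node = node.children.setdefault(item, Node(item))
            node.count += 1
    return root, counts


def show(node, depth=0):
    """Print the tree, one node per line, indented by depth."""
    for child in node.children.values():
        print("  " * depth + f"{child.item} ({child.count})")
        show(child, depth + 1)


# "Must appear more than once" corresponds to min_count=2 here.
root, counts = build_fp_tree(transactions, min_count=2)
show(root)
```

With these assumed baskets, product7 is the only child of the root, and product1, product2, and product6 branch off from it; the five products that occur only once never enter the tree.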
In general, the concept of "minimum support" creates a cutoff that defines which itemsets count as frequent. If an item or an itemset appears in only a few baskets, it is excluded via the parameters min support or min frequency. Excluding infrequently occurring items and itemsets helps to compress the data and improves the statistical significance of the results. On the other hand, if the value for min support or min frequency is set too high, the algorithm may find zero itemsets.
Hence, this Operator provides two major modes, selected via the checkbox find min number of itemsets: (1) if unchecked, the Operator uses a fixed minimum support value; (2) if checked, it uses a dynamic minimum support value to ensure that the result includes a minimum number of itemsets.
FP-Growth supports several different formats for the input data. Please note the following requirements:
For the columns, the three available input formats are illustrated in the second tutorial, together with necessary pre-processing. Here's the summary:
The process shows a market basket analysis. A data set containing transactions is loaded using the Retrieve Operator. A breakpoint is inserted here so that you can view the ExampleSet. We have to do some preprocessing using the Aggregate Operator to mold the ExampleSet into an acceptable input format. A breakpoint is inserted before the FP-Growth Operator so that you can view the input data. The FP-Growth Operator is applied to generate frequent itemsets. Finally, the Create Association Rules Operator is used to create rules from the frequent itemsets. The frequent itemsets and the association rules can be viewed in the Results View. Run this process with different values of the parameters to get a better understanding of this Operator.
The input formats of the FP-Growth Operator
Data is loaded and transformed to three different input formats. A breakpoint is inserted before the FP-Growth Operators so that you can see the input data in each of these formats. The FP-Growth Operator is used and the resulting itemsets can be viewed in the Results View. The results are all the same because the input data is the same, despite the difference in formats.
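
Before experimenting with the parameters in these processes, it can help to see the two minimum-support modes side by side. The sketch below is an illustration under stated assumptions, not the Operator's implementation: it enumerates itemsets by brute force rather than with FP-Growth, the baskets are hypothetical, and the function names as well as the `factor` and `max_tries` parameters are invented here. How the Operator actually lowers the minimum support in dynamic mode is not described in this text, so the multiplicative reduction is only one plausible choice.

```python
from itertools import combinations

# Hypothetical baskets: four transactions over nine products.
baskets = [
    {"product1", "product3", "product7"},
    {"product1", "product2", "product4", "product7"},
    {"product2", "product5", "product6", "product7"},
    {"product6", "product7", "product8", "product9"},
]


def frequent_itemsets(baskets, min_support):
    """Fixed mode: return every itemset whose support (the fraction of
    baskets containing it) is at least min_support. Brute force only."""
    items = sorted(set().union(*baskets))
    result = {}
    for size in range(1, len(items) + 1):
        found_any = False
        for candidate in combinations(items, size):
            support = sum(set(candidate) <= b for b in baskets) / len(baskets)
            if support >= min_support:
                result[candidate] = support
                found_any = True
        if not found_any:
            break  # no frequent itemset of this size, so none larger either
    return result


def with_min_number(baskets, min_support, min_itemsets, factor=0.8, max_tries=10):
    """Dynamic mode (assumed strategy): shrink min_support by `factor`
    until at least `min_itemsets` itemsets are found or tries run out."""
    for _ in range(max_tries):
        found = frequent_itemsets(baskets, min_support)
        if len(found) >= min_itemsets:
            break
        min_support *= factor
    return found, min_support


fixed = frequent_itemsets(baskets, 0.9)
dynamic, used_support = with_min_number(baskets, 0.9, min_itemsets=5)
print(len(fixed), len(dynamic), round(used_support, 4))
```

With a fixed minimum support of 0.9, only the single item that appears in every basket survives. In the dynamic sketch, the threshold drops until the items and pairs that appear in half of the baskets qualify as well.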