DBMiner Enterprise Version E0.8


DBMiner Tutorial

This tutorial will guide you in using DBMiner to mine three typical kinds of knowledge - associations, classifications and clusters - from a data cube created by the OLAP Manager. As you go through the tutorial, you will perform all the operations necessary for mining these three kinds of knowledge. For information about fully exploiting the power of DBMiner, see the online documentation.

This tutorial uses the existing FoodMart data cube from the OLAP Manager. You will first set up connections to an OLAP server in order to gain access to the data cube, and then you will be guided through mining associations, classifications, clusters from this cube.

Section 1.   Logging In
Section 2.   Mining Associations from a Data Cube
Section 3.   Mining Classifications from a Data Cube
Section 4.   Mining Clusters from a Data Cube


  1. Logging In

  2. Register Server

    DBMiner allows users to register various OLAP servers on-line. For this tutorial session, select your own server's name when you are logging in (If you don't know the server's name, consult your system administrator).

    Suppose your local OLAP server is called CHOPIN. After you start up DBMiner (If necessary, consult your system administrator on how to start up DBMiner), first highlight the icon OLAP servers and right-click on it to register a new server. The window could be as follows.

    The login dialog box will be prompted as shown in the following picture.

    After you enter your OLAP server name (for this tutorial enter CHOPIN), click OK and wait for the server name CHOPIN to appear under OLAP Servers in the left pane in DBMiner.

    Since the data cube FoodMart comes with SQL Server 7.0, it should be already be loaded on your current server. You can view a data cube's schema, dimensions, measures, etc., using the Data Cube Viewer module in DBMiner. Please see the on-line documentation for further details.



  3. Mining Associations from a Data Cube

    1. About Association

    2. Association mining on a set of data looks for values in different dimensions (attributes) that commonly occur together, suggesting an association between them.

      In DBMiner, three kinds of associations could be possibly mined:

      Inter-dimensional association. Associations among or across two or more dimensions.

      Customer-Country("Canada") => Product-SubCategory("Coffee")
      i.e. Canadian customers are likely to buy coffee.

      Intra-dimensional association. Associations present within one dimension grouped by another one or several dimensions. For example, if you want to find out which products customers in Canada are likely to purchase together:

      Within Customer-Country("Canada"):
           Product-ProductName("CarryBags") => Product-ProductName("Tents")
      i.e. Customers in Canada, who buy carry-bags, are also likely to buy tents.

      Hybrid association. Associations combining elements of both inter- and intra-dimensional association mining. For example,

      Within Customer-Country("Canada"):
           Product("Carry Bags") => Product("Tents"), Time("Q3")
      i.e. Customers in Canada, who buy carry-bags, also tend to buy tents and do so most often in the 3rd quarter of the year (Jul, Aug, Sep).

      In this tutorial, we focus on the first kind of association (i.e. Inter-dimensional association). For details on the other two types, see the on-line documentation.

    3. Submitting an Association Task

    4. In DBMiner, you do not need to write a mining query. Instead a mining wizard with a graphical user interface, is provided to help you specify your query.

      Suppose you want to submit an association query as follows:

      MINE Inter-dimensional association
      WITH RESPECT TO Customers, Education Level, Gender, Marital Status, Product
      FROM CUBE FoodMart-Sales
      SET Minimum support 10%
      SET Minimum confidence 50%
      

      The following steps will take you through an association mining task.

      Step (a):
      In order to input such a query, follow the menu route Mining -> Mining Wizard to the invoke the mining wizard, which will appear as shown below.

      In the Data Source window, expand the data source FoodMart, and then select the data cube Sales. After specifying the data cube, click on the Next button.

      Step (b):
      Select the mining task you want to perform from the dialog box shown below. For now, select the mining task "Association", and click on Next.

      N.B:  Do not click on the Finish button underneath the mining tasks since DBMinr will then look for all possible association rules across all dimensions. Doing so will take too much time since the FoodMart database is huge.

      Step (c):
      This step allows you to select dimensions and measures you want to mine on. The wizard box is shown below.

      For the query shown above, we need five dimensions. Double-clicking on a dimension will select it into the right pane. Select Customer, Education Level, Gender, Marital Status, Product.

      Use the default measure as selected by the system. Before go to the next step, make sure your wizard box is like the one shown below. Click on the button Next to go to the next wizard step.

      Step (d):
      In the window shown below, the wizard allows you to choose between inter-dimensional, intra-dimensional, and hybrid association mining in DBMiner. In this tutorial, we focus on the inter-dimensional association which is the default, so no changes are necessary here.

      You may choose to set constraints for the mining process. Although we will not add any constraint in this tutorial, we show below the dialog box for adding constraints for your reference.

      Association rules can be focused by specifying one or more constraints. A constraint specifies a dimension value that must appear in any association rule subsequently generated. For instance, if five constraints are chosen, then each generated rule must contain at least one of these five constraints in either its body or its head. You may further specify that only these constraints are used in rule generation, in which case the rules generated will consist only of clauses containing one or more of these constraints. This is done by checking the "Use only of these constraints" in the top right corner of the window shown as above. If the dialog box for adding constraints is open, cancel it. This returns you the wizard dialog box.

      Click on the button Next to go to the next wizard dialog box.

      Step (e):
      In this step, you can specify the minimum support and minimum confidence. In this mining task, we set the minimum support to be 10% and the minimum confidence 50%. After setting these thresholds, the wizard window should be like the one as follows.

      Click on the button Next to continue.

      Step (f):
      This is the final step in this mining wizard. The wizard box shown below displays the query just created to be used in the mining process. Verify your query. If you find the specifications of the query do not meet your requirements, you may click on the Back button to re-specify them.

      Now click on the Finish button to trigger the mining process.

    5. Viewing Association Mining Process

    6. During the mining process, the system displays some statistical information about this process, as shown below. You can re-display this statistical information later by following the menu route Associator -> Statistics.

      The mined rules are presented in spreadsheet format (shown below), much like that in Microsoft Excel.

      Frequent itemsets can be displayed by following the menu route Associator -> Frequent Itemset.

    7. Analyzing and Modifying Association Results

    8. You may analyze the association results in several different ways. The user can choose from the following display bar for these options which can be found at the top of the right pane.

      Each of these menu buttons has the following functionality:

      Mined rules can be analyzed using 3D Bar Chart View shown below. By clicking on the button you can display the following plane view which shows the association rules generated from the above frequent itemsets.

      Each rule contains the rule head (RHS) and rule body (LHS) which are represented by labeled axes. By pointing to any label on either axes, the text information for that label will be shown in the left corner of the screen as shown below.

      Bars in the above figure represent the support and confidence of rules, which are displayed at the intersection of the relevant values. To view a rule in text form, simply place the cursor over a particular bar. The rule will be displayed. For example,the figure below displays a rule in text form.

      Note that the height of a bar represents support of a rule (the higher the bar, the greater the support), while the color of a bar represents confidence. Bars in red indicate rules having the most confidence. The legend at the top right corner illustrates this.

      Another way to view the rules is via the 3D Ball Graph View which can be accessed by clicking on the button . See on-line documentation for details.

      You can also drill through to the raw data, which in this context represents data on the lowest level of the cube. Please see on-line documentation for details.

      By clicking on the button , the following dialog box appears. You can change the settings for the mining process, such as the minimum support and the minimum confidence. After you change the mining settings, the whole mining process will restart from the beginning.



  4. Mining Classifications from a Data Cube

    1. About Classification

    2. Classification mining analyzes a set of training data (i.e. a set of objects whose class labels are known) and constructs a model for each class based on the features in the data. A set of classification rules are generated by the classification process, and these can be used to classify future data, as well as develop a better understanding of each class in the database.

      In the classification process, attribute relevance analysis is very important. It is performed according to the analysis of an uncertainty measurement, which determines how relevant an attribute is to the chosen classification attribute. Only a few of the most relevant attributes are retained for the classification analysis and the weakly relevant or irrelevant ones are not further considered.

      In DBMiner, three thresholds are used to tackle noise and exceptional data and facilitate statistical analysis.

    3. Submitting a Classification Task

    4. Just as submitting an association query using the mining wizard, you can also use it to submit a classification query.

      Suppose you want to submit a classification query as follows.

      MINE Classification Rules
      ANALYZE Gender
      ON DIMENSIONS Customers, Education Level, Product, Promotion Media
      FROM CUBE FoodMart-Sales
      SET Classification Threshold = 90.00%
      SET Noise Threshold = 2.00%
      SET Train Set Threshold = 100.00%
      
      

      The following steps will take you through a classification task.

      Step (a):
      In order to input such a query, follow the menu route Mining -> Mining Wizard to invoke the mining wizard, which will appear as follows.

      In the Data Source window, expand the data source FoodMart, and then select the data cube Sales. After specifying the data cube, click on the button Next for the next step.

      Step (b) The wizard dialog box for this step is shown below. Select Classification as the mining task, then click on the Next button. Notice that the button Finish is grayed-out, because a classification attribute or dependent variable must be specified before mining can proceed.

      Step (c) In this step you will select the attribute you want to analyze by double-clicking on it. You may only select one attribute for analysis. For this tutorial choose the attribute, Gender (see wizard dialog box below). Click on the button Next to proceed to the next step.

      Step (d):
      You will select the attributes you want to analyze against, using the wizard box below. Double-clicking on an attribute will select it to the right pane. Select the attributes Customer, Education Level, Product, and Promotion Media. For this tutorial, do not change the default measure already selected.

      After these operations, the wizard box should be like the one shown as follows. Click on the button Next.

      Step (e):
      In this wizard step, you can specify the mining settings for classification, i.e. the classification threshold, the noise threshold, and the training/test set size. For this tutorial, accept the default values selected by the system. Click on the button Next to proceed to the next step.

      Step (f):
      This is the final step of this mining wizard which displays the classification query to be submitted. If you find the query displayed does not meet your requirement, you can click on the button Back to go back to change it.

      Click on the button Finish to trigger the classification mining process.

    5. View Classification Mining Process

    6. During the mining process, the following window will appear showing you the mining process statistics such as when the mining process began, when it ended, etc. You can also display this window by following the menu route Classifier->Mining Process.

      The mining result will be displayed using a tree structure, as shown below.

      You can also access the above decision tree by following the menu route Classifier->Decision Tree.

    7. Analyzing and Modify Classification Results

    8. Once the classification rules are formed, it is possible to view them in several different ways. The user can choose from the following display bar. All the functionalities provided by these menu buttons can also be obtained by accessing the menu Classifier. We show some of them below.

      Each of these menu buttons has the following functionality:

      By clicking the button , you can view the 3D graphical form of the generated decision tree. It is a hierarchical structure consisting of a set of pie charts and the branches (links) between them. Each pie chart indicates the distribution of the classes (of the classification attribute) at that particular node. For instance, in the above figure, each pie chart represents the distribution proportion of the both classes in "Gender". With generalization and a good classification algorithm, the generated classification tree is relatively small, and represents a good classification result (i.e. many leaves of the tree contain one majority class).

      To give you a clear impression of the classification process, the classifier displays a complete tree first, as shown in the above figure. You may click on any non-leaf node in the tree to expand it (showing all of its children nodes) or shrink it (hiding all its children nodes). For instance if, you click on the first + sign under the root node, the following window shows you an expanded tree.

      To view the actual distribution values at a particular node in the tree, simply place the cursor over it and examine the values that appear in the prompted box. For instance, if you move you mouse point over the pie beside the [Customers].[City] (Altadena), the following window shows up.

      By clicking on the button , the view similar to the following window will appear. It shows the classification tree in a tabular form.

      By clicking on the button, you can change the settings for classification mining. The dialog box is shown below. You may change the mining thresholds, and the hierarchical levels of dimensions. After you click on the button OK, the whole classification mining process will restart from the beginning.

      In analyzing classification results in DBMiner, you can also drill through to the raw data (i.e. data at the lowest level of the data cube). Please see on-line documentation for details.



  5. Mining Clusters from a Data Cube

    1. About Clustering

    2. Clustering groups all the value of one dimension by the values of the second dimension. In DBMiner, only two cube dimensions can be chosen in a mining session since the clustering space is 2-dimensional. The underlying algorithm used in DBMiner is the k-means method. For detailed information about the k-means method, see the on-line documentation.

      In DBMiner, there are several parameters that can be adjusted to tune the mining process.

    3. Submitting a Clustering Task

    4. Just as you can use the mining wizard in DBMiner to specify a query for an association or a classification task, you can also use it to specify a clustering task.

      Suppose you want to submit a clustering query as follows.

      MINE Clustering
      ON DIMENSION Customer WITH WEIGHT 1.00
      ON DIMENSION Store WITH WEIGHT 1.00
      FROM CUBE FoodMart-Sales
      SET Number of clusters = 4
      SET Maximum clustering passes = unlimited
      
      

      The following steps will take you through a clustering task.

      Step (a):
      In order to input such a query, follow the menu route Mining -> Mining Wizard to invoke the mining wizard. The mining wizard box will appear as follows.

      In the Data Source window, expand the data source FoodMart, and then select the data cube Sales. After specifying the data cube, click on the button Next for the next step.

      Step (b):
      The wizard window for this step is shown below. Specify Clustering as the mining task, and then click on the button Next to proceed to the next step. Note that the button Finish is grayed-out in this step, because you must select two dimensions in the next step before cluster mining can begin.

      Step (c):
      This step allows to select two dimensions to do the clustering. The wizard box is shown as below. For this tutorial select the two dimensions, Customers and Store, by double-clicking on them. Do not change the default measure value selected by the system. Click on the button Next to continue.

      Step (d):
      In this step, you will specify the mining settings. Specify the corresponding parameters according to the clustering query you want to submit (shown above). When it is done, the window should be like the one below. Press Next to proceed to the next step.

      Step (e):
      This is the final step in the mining wizard for clustering. The wizard box is shown below, displays the clustering query to be submitted. If this query does not meet your requirements, you can click on the button Back to go back to change it.

      Click on the button Finish to trigger the clustering mining process.

    5. View Classification Mining Process

    6. After you trigger the mining process, the following window will appear, showing you the mining process statistics, such as when the mining process began, when it ended, etc. You can also display this window by following the menu route Cluster->Mining Process.

      The mining result can then be displayed using a 2-dimensional plane, as shown below.

      You can also open the above window for viewing the generated clusters, by following the menu route Cluster->Clustering View.

    7. Analyzing and Modify Clustering Results

    8. Once the clusters are generated, it is possible to view them in several different ways. The user can choose from the following display bar. All the functionalities provided by these menu buttons can also be obtained by accessing the Cluster menu.


      Each of these menu buttons has the following functionality:

      Clicking on the button will display the 2-dimensional plane view shown above. Each plot represents a point in the original data set. Each plot shape (e.g. triangle, square, etc.) represents elements of one cluster. The red cross marked inside a group of plots are the centeres of each cluster.

      By clicking on the button , the clusters will be displayed in grid or spreadsheet form. The column "Counts" indicates the sum of the measure values (i.e. sales count) of each cell in this cluster.

      By clicking on the button , you can change the settings for clustering mining. The dialog box is shown below. In this dialog box, you may change the number of clusters, the dimension weights, the maximum number of clustering passes, and the filter threshold. After you click on the button OK, the whole clustering mining process will restart from the very beginning.