Tuesday, 8 November 2016

SQL : ColumnStore (part III) : enforce uniqueness (SQL Server 2014)

Introduction

Columnstores in SQL Server 2014 cannot enforce primary keys, and you can't create a unique nonclustered index either, in contrast with SQL Server 2016, where you are able to create a unique nonclustered index on the ColumnStore table. That's a pity, because I have to work with SQL Server 2014. Now, how can you enforce a semi primary key on a ColumnStore table? Well, there seems to be a way to do that with an indexed view together with a unique clustered index. In this blogpost I'll show you the way. For this blogpost I've used a blogpost on MSDN, where you can find more information about this.

Enforce uniqueness with Indexed view

For this blogpost I've created a simple table.
       
/*
DROP TABLE testIndexedVIEW
GO
*/
CREATE TABLE testIndexedVIEW
(TestID int)
GO     
 

The next step is creating a ColumnStore table with the following statement:
       
CREATE CLUSTERED COLUMNSTORE INDEX CCI_testIndexedVIEW ON testIndexedVIEW;  
GO        
 

Next step is building the indexed view on this table (WITH SCHEMABINDING):
       
/* 
DROP VIEW dbo.testIndexedViewUniqueness
GO
*/

CREATE VIEW dbo.testIndexedViewUniqueness
WITH SCHEMABINDING
AS 
SELECT TestID
FROM dbo.testIndexedVIEW
GO      
 

And now create a Unique Clustered Index on this Indexed view:
       

CREATE UNIQUE CLUSTERED INDEX UNCI_testIndexedViewUniqueness ON testIndexedViewUniqueness(TestID)
GO
       
 

That's it. The columnstore, the indexed view and the unique clustered index are in place. This works because every insert into the base table must also maintain the unique clustered index on the view, so a duplicate key fails there. Now it's time to test this by inserting the same value twice into the column covered by the unique clustered index.
       
INSERT INTO [dbo].testIndexedVIEW
           (TestID)
     VALUES
           (1)

INSERT INTO [dbo].testIndexedVIEW
           (TestID)
     VALUES
           (1)     
 

Resulting in an error message.

Msg 2601, Level 14, State 1, Line 21
Cannot insert duplicate key row in object 'dbo.testIndexedVIEW' with unique index 'UNCI_testIndexedViewUniqueness'. The duplicate key value is (1).
The statement has been terminated.
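
As a quick sanity check (my own addition, assuming the objects created above): a new, distinct value is still accepted, so only real duplicates are blocked.

-- A distinct value inserts fine; only duplicates are rejected by the
-- unique clustered index on the indexed view.
INSERT INTO [dbo].testIndexedVIEW (TestID) VALUES (2)
GO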

Conclusion

It's not possible to create a unique index on a ColumnStore table in SQL Server 2014. With an indexed view together with a unique clustered index you can still enforce the uniqueness of the key fields.

Greetz,

Hennie

Monday, 7 November 2016

SQL : ColumnStore Index (part II) : Indexing

Introduction

In my first blogpost about the ColumnStore Index, I wrote about creating and dropping a columnstore. Please take a look at my first blogpost: SQL : ColumnStore Index (part I). In this blogpost I'll continue and observe some details about the ColumnStore Index, particularly the differences between SQL Server 2014 and SQL Server 2016 when building nonclustered indexes on ColumnStores.

Creating indexes on the ColumnStore 

So, the table Sale (imported from the WorldWideImporters demo database) is not a rowstore anymore but a ColumnStore. Let's experiment with some indexing. Can we create a clustered index on top of the ColumnStore Index? That doesn't seem very logical (and it isn't), but let's try and see what happens.

       
CREATE CLUSTERED INDEX [CI_CustomerKey] ON Sale([Customer Key])
       

Resulting in :

Msg 1902, Level 16, State 3, Line 15
Cannot create more than one clustered index on table 'Sale'. Drop the existing clustered index 'CCI_Sale' before creating another.

So, this generates an error, as expected. Let's try to create a nonclustered index on the ColumnStore (in SQL Server 2016):

       
CREATE NONCLUSTERED INDEX [CI_CustomerKey] ON Sale([Customer Key])


And experiment with a query and see what happens.

       
SELECT [Customer Key] FROM Sale WHERE [Customer Key] = 6
       

And now it uses the nonclustered index:



But when I try to do this on SQL Server 2014, it will give an error:

Msg 35303, Level 16, State 1, Line 1
CREATE INDEX statement failed because a nonclustered index cannot be created on a table that has a clustered columnstore index. Consider replacing the clustered columnstore index with a nonclustered columnstore index.
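
The error message itself hints at the SQL Server 2014 alternative: keep the table as a rowstore and create a nonclustered columnstore index on it. A minimal sketch of that route (keep in mind that in SQL Server 2014 a nonclustered columnstore index makes the table read-only for as long as it exists):

-- SQL Server 2014 alternative (sketch): drop the clustered columnstore
-- and add a nonclustered columnstore index on the rowstore table.
-- Note: the table becomes read-only while this index exists.
DROP INDEX CCI_Sale ON Sale
GO
CREATE NONCLUSTERED COLUMNSTORE INDEX NCCI_Sale ON Sale([Customer Key])
GO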

So, creating a (unique) nonclustered index on SQL Server 2014 is not possible, but in SQL Server 2016 it is easily done.

Conclusion

It is not possible to create (unique) nonclustered indexes on a Clustered ColumnStore in SQL Server 2014. In SQL Server 2016 it is a piece of cake.

Greetz,

Hennie


Sunday, 6 November 2016

SQL : ColumnStore Index (part I)

Introduction

ColumnStore indexes are a powerful addition to SQL Server. They started with SQL Server 2012 and have continued to evolve into a better ColumnStore Index today. I wanted to learn more about them, so I did some investigation of the ColumnStore Index. This blogpost is not an in-depth blogpost but a reflection on some basics of the columnstore index.

Now, I'm working with SQL Server 2016, where it is possible to create a columnstore table directly with the CREATE TABLE statement. Unfortunately my client is working on SQL Server 2014 and therefore I won't use that option.

Playground

First, let's start with getting some data from the WorldWideImporters database.

SELECT * 
INTO Sale
FROM WideWorldImportersDW.[Fact].[Sale]


Create the ColumnStore

Now we have a table Sale in a database (I've called it ColumnStoreTest). Because we are pretending that we are working on SQL Server 2014, we create the ColumnStore Index with the CREATE CLUSTERED COLUMNSTORE INDEX command:

       
CREATE CLUSTERED COLUMNSTORE INDEX CCI_Sale ON [Sale];  
GO  


How do we know whether we created a ColumnStore table? Well, go to the properties of the table and click on the Storage tab, and there you can see the CompressionType as ColumnStore (actually, I'm not sure whether this is also shown in SQL Server 2014).
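
If you don't trust the GUI (or it differs per version), you can also check this with the catalog views; a small query like this shows the index type and the compression of the table:

-- Show index type and compression for the Sale table
SELECT i.name, i.type_desc, p.data_compression_desc
FROM sys.indexes i
JOIN sys.partitions p
  ON p.object_id = i.object_id
 AND p.index_id  = i.index_id
WHERE i.object_id = OBJECT_ID('Sale')
GO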


Drop the ColumnStore

Next step is dropping the ColumnStore with the DROP INDEX statement:

       
DROP INDEX CCI_Sale ON [Sale]
GO  
 

This will drop the index and the table will return to a rowstore. The CompressionType is set to None.



Conclusion 

First discovery of the ColumnStore Index: how to build and how to drop a columnstore index.

Greetz,

Hennie



Wednesday, 2 November 2016

WorldWideImporters Data warehouse Datamodels

Introduction

Just a small blogpost about the data models of the WideWorldImporters data warehouse (WideWorldImportersDW). I just wanted to get an overview of the data warehouse models of the WideWorldImporters demo database of SQL Server 2016. There seem to be six facts:
  • Order
  • Sale 
  • Purchase
  • Stock Holding
  • Movement
  • Transaction
WideWorldImportersDW is the main database for data warehousing and analytics (OLAP – OnLine Analytics Processing). The data in this database is derived from the transactional database WideWorldImporters, but it uses a schema that is specifically optimized for analytics.

Order

This is the order fact of the WideWorldImporters database and there are a couple of (role playing) dimensions here:
  • Order date and Picked Date (Date)
  • SalesPerson and Picker (Employee)
  • City
  • Date
  • Customer
  • StockItem





Sale

The sales fact contains almost the same dimensions as the order fact. There are a couple of (role playing) dimensions here:
  • InvoiceDate and DeliveryDate (Date).
  • SalesPerson (Employee).
  • City.
  • Date.
  • Customer and BillToCustomer (Customer).
  • StockItem


Purchase

The purchase Fact has the following dimensions :
  • Date
  • Supplier
  • StockItem



Stock holding 

The StockHolding is a bit of a weird fact, in my opinion. I don't know why and how, but there seems to be one dimension:
  • Stock holding

Movement

The movement fact has the following dimensions:
  • Date
  • StockItem
  • Customer
  • Supplier
  • TransactionType




Transaction

And the last one, Transaction has the following dimensions:

  • Date
  • Customer and BillToCustomer (customer)
  • Supplier
  • TransactionType
  • PaymentMethod



Conclusion

A simple overview of the WideWorldImporters demo database of SQL Server 2016.

Greetz,

Hennie

Sunday, 30 October 2016

New, stopped, active and returned products

Introduction

I want to develop a report on the evolution of products. The question is whether a product that is sold is: a new product in a given period (innovative product), a product that is not sold anymore in the given period (stopped product), a product that is actively bought in the given period (active product), or a product that was sold in the previous period and is also sold in the current period (returned product).

Now, yet another interesting thing about this approach is that you can also use it in customer behavior analysis for customer retention, attrition, churn and loyalty.

Now, there is something to say against this approach because it's not based on a formal definition, but I accept that for this particular situation. The customer is happy with a year-to-year comparison between products. So, in this lab situation the logic depends on the filter context (in this case the calendar year slicer).

Lab 

I have made a small workbook with some sample data to experiment with the DAX expressions. This is the data that is used for building the expressions:


And this is the model.



New products

First build the measure SumNetRevenue and the previous year calculation NR-1Year. A new product is a product that is sold in the current period (year) and was not sold in the previous period. You could argue about that definition, I know, but for me it is good enough, for now.

SumNetRevenue:=SUM(Fact[Revenue])
       
NR-1Year:=CALCULATE([SumNetRevenue];
                     FILTER(ALL(Period);
                            Period[Year] = MAX(Period[Year]) - 1
                            )
                    )

#NewProducts:=COUNTROWS(FILTER(VALUES(Fact[ProductID]);[NR-1Year] = 0))


Stopped products

A stopped product is a product that is sold in the previous period but is not sold in the current period. This is the DAX expression:

       
#LostProducts:=COUNTROWS(FILTER(Product; ([#ActiveProducts] = 0 && [NR-1Year] <> 0)))       
 


Active products

An active product is a product that is sold in the current year (and could have been sold in previous years):

       
#ActiveProducts:=DISTINCTCOUNT(Fact[ProductID])


Returned products

A returned product is a product that was sold in the previous period and is also sold in the current period. This DAX expression is borrowed from the DAX Patterns of Alberto Ferrari and Marco Russo.

       
#ReturningProduct:=COUNTROWS (
    CALCULATETABLE (
        VALUES ( Fact[ProductID] );
        VALUES ( Fact[ProductID] );
        FILTER (
            ALL ( Period );
           Period[PeriodID] < MIN (Period[PeriodID] )
        )
    )
)
 

Conclusion

Interesting DAX expressions to use in customer or product analysis.

Greetz,

Hennie

Friday, 8 July 2016

DAX : Ranking the Stars with RankX

Introduction

I have to work on a ranking problem and therefore I decided to study the RANKX() function in DAX a bit further. The idea is that the RANKX function determines the rank of a certain item in a complete list of items.

I've used a blogpost of Rob Collie (RANKX-apalooza: Within and Across Groups, With Filters, Etc.) as a source for this blogpost.

RankX function

As stated in the RANKX MSDN help information, the RANKX function returns the ranking of a number in a list of numbers for each row in the table argument.

The structure of the DAX function is :

       
RANKX(<table>, <expression>[, <value>[, <order>[, <ties>]]])    

So there are five parameters:

  • table. A DAX expression that returns a table.
  • expression. The DAX expression that returns a scalar, evaluated for each row of the table.
  • value. Any DAX expression that returns a single scalar value whose rank is to be found.
  • order. Ascending (1/ASC) or descending (0/DESC, the default).
  • ties. How the numbering should continue after a tie (Skip, the default, or Dense).


The Data

I've created a data model with three tables: DimPeriod, DimProduct and a Fact Sales table and this is how they are connected:



This is the data in the FactSales table :


This is the data in DimPeriod table:


This is the data in the DimProduct table:



Let's begin

First create the explicit SUM expression that will be used in the other examples. This seems necessary, according to Kasper de Jonge: "The reason for that is that SUM(Sales[SalesAmount]) is not able to be calculated against each and every row for ALL(Regions[Region]) but will be evaluated against the total result. If you would wrap a CALCULATE around the measure it would be able to evaluate the value for each and every row. A measure is automatically wrapped around CALCULATE() by the engine."

     
NR:=SUM(FactSales[NetRevenue])


My first Rankx() function

The first RANKX expression was this one. It calculates the position of a product in a list.

       
OverallRankProduct:=IF(HASONEVALUE(DimProduct[ProductID]);
                                   RANKX(ALL(DimProduct); [NR]);
                                   BLANK()
                      )
       

The IF(HASONEVALUE(),…,BLANK()) test prevents ranking subtotals as if they were Products.

In a first draft of my product table I had two products with the same name: 100 in the Chopper category and 100 in the Plane category, and that led to a blank OverallRankStore record in a simple pivottable (on the right).


I replaced the HASONEVALUE with a check with the COUNTROWS.

 OverallRankProduct:=IF(COUNTROWS(VALUES(DimProduct[ProductName]))=1;
                        RANKX(ALL(DimProduct); [NR]);
                        BLANK()
                       )                        

I adjusted my DimProduct table in order to have unique product names, and here you can see a couple of pivottables. A Chopper model 100 is ranked first in 2016 and a Plane model 70 is ranked 8.



So this is ranking while neglecting the hierarchies: it is ranking across the hierarchy.

Ranking with a hierarchy

Here is an expression with a ranking of products within an upper level in a hierarchy, in this case Category:

       
OverallRankProductWithinCategory:=IF(HASONEVALUE(DimProduct[ProductID]);
                                     RANKX(ALL(DimProduct[ProductName]); [NR]);
                                     BLANK()
                                     )
 

And this is the result of 2016 and here you can see that in the Plane category product 747 is ranked first, 70 is ranked second and 500 is ranked third.


This is the result of 2017:


Now if I change the hierarchy to Brand it works too. For instance, Boeing has two products, 100 and 747, and they are ranked 1 and 2.


This is a printscreen for 2016


Ranking products with same names

Here again is an example of products with the same name, using HASONEVALUE:


By using IF with HASONEVALUE, the RANKX is not calculated for the non-unique product.



By using the following expression, the non unique record is shown and the Total is also empty.

       
OverallRankProductNonUnique2:=IF(COUNTROWS(VALUES(DimProduct[ProductName]))=1;
                                 RANKX(ALL(DimProduct); [NR])
                                )
 




So finally, the following pivottables show different variations of the data.

This pivottable is with MainCategory and Category:


And this PivotTable is with Brand.



Now if we want to see the ranking of the categories within the MainCategories, then we need another kind of expression.

OverallRankProductCategory:=IF(HASONEVALUE(DimProduct[Category]);
                               RANKX(ALL(DimProduct[Category]); 
                                     CALCULATE([NR]; ALL(DimProduct[MainCategory]))
                                    );
                               BLANK()
                             )
 

And this results in the following PivotTable:


Ranking based on a selected items in a slicer

It is also possible to rank items based on the selection in slicers.

       
RankProductSEL:=IF(COUNTROWS(VALUES(DimProduct[ProductName]))=1;
                   RANKX(ALLSELECTED(DimProduct); [NR]);
                   BLANK()
                  )

And here you can see the result. I've only selected plane 300, 70 and 747 and these are ranked 3, 2 and 1.


Ranking all the stars

In the following example I've turned off all filters and the products are ranked over every value in the table. Even the year is neglected.

RankProductEver:=IF(COUNTROWS(VALUES(DimProduct[ProductName]))=1;
                    RANKX(ALL(DimProduct); CALCULATE([NR]; ALL(DimPeriod)));
                    BLANK()
                   )

Resulting in:


Conclusion

An interesting function to use in DAX to rank items on their own, but also within a hierarchy.

Greetz,

Hennie

Wednesday, 22 June 2016

First impression of SQL Server R Services (part I)

Introduction

I wanted to know more about R Services, so I decided to walk through the tutorial provided by Microsoft. There are a couple of lessons you can follow to understand R Services. So this will be the first blogpost about R Services, and many will follow during my discovery of R Services. R Services is a powerful addition to SQL Server, and one of the advantages is that you don't need to load your data to the client anymore. You can push your code to the R Services of SQL Server and take advantage of a powerful server.

So this blogpost is about the first step (Lesson 1: Prepare the Data (Data Science End-to-End Walkthrough)), and that is getting everything ready for analysis: installing packages, getting the data, loading the data into SQL Server, and installing some of the code in SQL Server.

The data is a representative sampling of the New York City taxi data set, which contains records of over 173 million individual trips in 2013, including the fares and tip amounts paid for each trip.
For more information about how this data was originally collected, and how you can get the full data set, see http://chriswhong.com/open-data/foil_nyc_taxi/.

To make the data easier to work with, the Microsoft data science team performed downsampling to get just 1% of the data.
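
I don't know how the data science team did this exactly, but as an illustration: in T-SQL you could approximate such a 1% downsample with TABLESAMPLE, using the sample table that is created later in this walkthrough (the target table name below is made up):

-- Rough 1% downsample sketch. TABLESAMPLE works on data pages, so the
-- percentage is approximate; use ROW_NUMBER() or a checksum filter for
-- an exact row count.
SELECT *
INTO nyctaxi_sample_1pct
FROM nyctaxi_sample TABLESAMPLE (1 PERCENT)
GO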

Walkthrough the steps

First I opened RStudio, created a new script and copied in the code for installing the packages.



if (!('ggmap' %in% rownames(installed.packages()))){  
  install.packages('ggmap')  
}  
if (!('mapproj' %in% rownames(installed.packages()))){  
  install.packages('mapproj')  
}  
if (!('ROCR' %in% rownames(installed.packages()))){  
  install.packages('ROCR')  
}  
if (!('RODBC' %in% rownames(installed.packages()))){  
  install.packages('RODBC')  
}  
 
 

And I executed this code for installing the packages on the server.

       
# Install required packages on the server
install.packages("ggmap", lib=grep("Program Files", .libPaths(), value=TRUE)[1])  
install.packages("mapproj", lib=grep("Program Files", .libPaths(), value=TRUE)[1])       
 

The next step was downloading the data from GitHub. The code I copied from the Microsoft site gave some errors because there were some missing slashes.

       
$source = 'https://raw.githubusercontent.com/Azure/Azure-MachineLearning-DataScience/master/Misc/RSQL/Download_Scripts_R_Walkthrough.ps1'
$ps1_dest = "$pwd\Download_Scripts_R_Walkthrough.ps1"
$wc = New-Object System.Net.WebClient
$wc.DownloadFile($source, $ps1_dest)
.\Download_Scripts_R_Walkthrough.ps1 –DestDir 'C:\tempR'      
 

And this is a printscreen of the entry and the execution.


Once the script has finished, the files are in the TempR folder:


The next step was to install the scripts in SQL Server. This is done with the RunSQL_R_Walkthrough.ps1 PowerShell script.

      
.\RunSQL_R_Walkthrough.ps1

Input the database server name (the full address): WIN-11283AFOULP
Input the name of the database you want to create: nyctaxi
Input the user name which has the previlige to create the database:
Input the password of user name which has the previlige to create the databas
Input the path to the csv file you want to upload to the database: c:\tempR       
 

Here is a printscreen of this process.



The script is executed and the results are shown in the PowerShell window. I've grabbed the dump and placed it here.

Start creating database and table on your SQL Server, and uploading data to the table. It may take a while...
---------------------------------------------
Executed SQL packet:
DECLARE @db_name varchar(255), @tb_name varchar(255)
DECLARE @create_db_template varchar(max), @create_tb_template varchar(max), @create_tb_template2 varchar(max)
DECLARE @sql_script varchar(max)
set @db_name = 'nyctaxi'
SET @tb_name = 'nyctaxi_sample'
SET @create_db_template = 'create database {db_name}'
SET @create_tb_template = '
use {db_name}
CREATE TABLE {tb_name}
(
       medallion varchar(50) not null,
       hack_license varchar(50)  not null,
       vendor_id char(3),
       rate_code char(3),
       store_and_fwd_flag char(3),
       pickup_datetime datetime  not null,
       dropoff_datetime datetime,
       passenger_count int,
       trip_time_in_secs bigint,
       trip_distance float,
       pickup_longitude varchar(30),
       pickup_latitude varchar(30),
       dropoff_longitude varchar(30),
       dropoff_latitude varchar(30),
       payment_type char(3),
       fare_amount float,
       surcharge float,
       mta_tax float,
       tolls_amount float,
       total_amount float,
       tip_amount float,
       tipped int,
       tip_class int
)
CREATE CLUSTERED COLUMNSTORE INDEX [nyc_cci] ON {tb_name} WITH (DROP_EXISTING = OFF)
'

SET @create_tb_template2 = '
use {db_name}
CREATE TABLE nyc_taxi_models
(
        model varbinary(max) not null
)
'

-- Create database
SET @sql_script = REPLACE(@create_db_template, '{db_name}', @db_name)
EXECUTE(@sql_script)

-- Create table
SET @sql_script = REPLACE(@create_tb_template, '{db_name}', @db_name)
SET @sql_script = REPLACE(@sql_script, '{tb_name}', @tb_name)
EXECUTE(@sql_script)

-- Create the table to persist the trained model
SET @sql_script = REPLACE(@create_tb_template2, '{db_name}', @db_name)
EXECUTE(@sql_script)

Execution succesful
-----------------------------------------
C:\tempR\create-db-tb-upload-data.sql execution done
start loading the data to SQL Server table...
Unable to load BCP resource DLL.  BCP cannot continue.
This step (creating database and tables, and uploading data to table) takes 27.19 seconds.
Start running the .sql files to register all stored procedures used in this walkthrough...
---------------------------------------------
Executed SQL packet:
USE [nyctaxi]

Execution succesful
---------------------------------------------
Executed SQL packet:
/****** Object:  StoredProcedure [dbo].[PersistModel]    Script Date: 11/05/2015 23:04:39 ******/
SET ANSI_NULLS ON

Execution succesful
---------------------------------------------
Executed SQL packet:
SET QUOTED_IDENTIFIER ON

Execution succesful
---------------------------------------------
Executed SQL packet:

IF EXISTS (SELECT * FROM sys.objects WHERE type = 'P' AND name = 'PersistModel')
  DROP PROCEDURE PersistModel

Execution succesful
---------------------------------------------
Executed SQL packet:
-- =============================================
-- Author:              <Author,,Name>
-- Create date: <Create Date,,>
-- Description: Store model in database. Input is a binary representation of R Model object passed as Hex string
-- =============================================
CREATE PROCEDURE [dbo].[PersistModel]
@m nvarchar(max)
AS
BEGIN
        -- SET NOCOUNT ON added to prevent extra result sets from
        -- interfering with SELECT statements.
        SET NOCOUNT ON;
        insert into nyc_taxi_models (model) values (convert(varbinary(max),@m,2))
END

Execution succesful
-----------------------------------------
C:\tempR\PersistModel.sql execution done
---------------------------------------------
Executed SQL packet:
USE [nyctaxi]

Execution succesful
---------------------------------------------
Executed SQL packet:

/****** Object:  StoredProcedure [dbo].[PredictTip]    Script Date: 10/29/2015 4:36:55 PM ******/
SET ANSI_NULLS ON

Execution succesful
---------------------------------------------
Executed SQL packet:

SET QUOTED_IDENTIFIER ON

Execution succesful
---------------------------------------------
Executed SQL packet:

IF EXISTS (SELECT * FROM sys.objects WHERE type = 'P' AND name = 'PredictTipBatchMode')
  DROP PROCEDURE PredictTipBatchMode

Execution succesful
---------------------------------------------
Executed SQL packet:

CREATE PROCEDURE [dbo].[PredictTipBatchMode] @inquery nvarchar(max)
AS
BEGIN

  DECLARE @lmodel2 varbinary(max) = (SELECT TOP 1
    model
  FROM nyc_taxi_models);
  EXEC sp_execute_external_script @language = N'R',
                                  @script = N'
mod <- unserialize(as.raw(model));
print(summary(mod))
OutputDataSet<-rxPredict(modelObject = mod, data = InputDataSet, outData = NULL,
          predVarNames = "Score", type = "response", writeModelVars = FALSE, overwrite = TRUE);
str(OutputDataSet)
print(OutputDataSet)
',
                                  @input_data_1 = @inquery,
                                  @params = N'@model varbinary(max)',
                                  @model = @lmodel2
  WITH RESULT SETS ((Score float));

END


Execution succesful
-----------------------------------------
C:\tempR\PredictTipBatchMode.sql execution done
---------------------------------------------
Executed SQL packet:
USE [nyctaxi]

Execution succesful
---------------------------------------------
Executed SQL packet:

/****** Object:  StoredProcedure [dbo].[PredictTipSingleMode]    Script Date: 10/31/2015 5:44:43 PM ******/
SET ANSI_NULLS ON

Execution succesful
---------------------------------------------
Executed SQL packet:

SET QUOTED_IDENTIFIER ON

Execution succesful
---------------------------------------------
Executed SQL packet:

IF EXISTS (SELECT * FROM sys.objects WHERE type = 'P' AND name = 'PredictTipSingleMode')
  DROP PROCEDURE PredictTipSingleMode

Execution succesful
---------------------------------------------
Executed SQL packet:


CREATE PROCEDURE [dbo].[PredictTipSingleMode] @passenger_count int = 0,
@trip_distance float = 0,
@trip_time_in_secs int = 0,
@pickup_latitude float = 0,
@pickup_longitude float = 0,
@dropoff_latitude float = 0,
@dropoff_longitude float = 0
AS
BEGIN

  DECLARE @inquery nvarchar(max) = N'
  SELECT * FROM [dbo].[fnEngineerFeatures](
  @passenger_count,
@trip_distance,
@trip_time_in_secs,
@pickup_latitude,
@pickup_longitude,
@dropoff_latitude,
@dropoff_longitude)
        '
  DECLARE @lmodel2 varbinary(max) = (SELECT TOP 1
    model
  FROM nyc_taxi_models);
  EXEC sp_execute_external_script @language = N'R',
                                  @script = N'

mod <- unserialize(as.raw(model));
print(summary(mod))
OutputDataSet<-rxPredict(modelObject = mod, data = InputDataSet, outData = NULL,
          predVarNames = "Score", type = "response", writeModelVars = FALSE, overwrite = TRUE);
str(OutputDataSet)
print(OutputDataSet)
',
                                  @input_data_1 = @inquery,
                                  @params = N'@model varbinary(max),@passenger_count int,@trip_distance float,
@trip_time_in_secs int ,
@pickup_latitude float ,
@pickup_longitude float ,
@dropoff_latitude float ,
@dropoff_longitude float',
                                  @model = @lmodel2,
                                                                    @passenger_count =@passenger_count ,
                                                                        @trip_distance=@trip_distance,
                                                                        @trip_time_in_secs=@trip_time_in_secs,
                                                                        @pickup_latitude=@pickup_latitude,
                                                                        @pickup_longitude=@pickup_longitude,
                                                                        @dropoff_latitude=@dropoff_latitude,
                                                                        @dropoff_longitude=@dropoff_longitude
  WITH RESULT SETS ((Score float));

END



Execution succesful
-----------------------------------------
C:\tempR\PredictTipSingleMode.sql execution done
---------------------------------------------
Executed SQL packet:
USE [nyctaxi]

Execution succesful
---------------------------------------------
Executed SQL packet:

/****** Object:  UserDefinedFunction [dbo].[fnCalculateDistance]    Script Date: 10/29/2015 4:33:52 PM ******/
SET ANSI_NULLS ON

Execution succesful
---------------------------------------------
Executed SQL packet:

SET QUOTED_IDENTIFIER ON

Execution succesful
---------------------------------------------
Executed SQL packet:

IF EXISTS (SELECT * FROM sys.objects WHERE type IN ('FN', 'IF') AND name = 'fnCalculateDistance')
  DROP FUNCTION fnCalculateDistance

Execution succesful
---------------------------------------------
Executed SQL packet:

CREATE FUNCTION [dbo].[fnCalculateDistance] (@Lat1 float, @Long1 float, @Lat2 float, @Long2 float)
-- User-defined function calculate the direct distance between two geographical coordinates.
RETURNS float
AS
BEGIN
  DECLARE @distance decimal(28, 10)
  -- Convert to radians
  SET @Lat1 = @Lat1 / 57.2958
  SET @Long1 = @Long1 / 57.2958
  SET @Lat2 = @Lat2 / 57.2958
  SET @Long2 = @Long2 / 57.2958
  -- Calculate distance
  SET @distance = (SIN(@Lat1) * SIN(@Lat2)) + (COS(@Lat1) * COS(@Lat2) * COS(@Long2 - @Long1))
  --Convert to miles
  IF @distance <> 0
  BEGIN
    SET @distance = 3958.75 * ATAN(SQRT(1 - POWER(@distance, 2)) / @distance);
  END
  RETURN @distance
END

Execution succesful
-----------------------------------------
C:\tempR\fnCalculateDistance.sql execution done
---------------------------------------------
Executed SQL packet:
USE [nyctaxi]

Execution succesful
---------------------------------------------
Executed SQL packet:

/****** Object:  UserDefinedFunction [dbo].[fnEngineerFeatures]    Script Date: 10/29/2015 4:36:07 PM ******/
SET ANSI_NULLS ON

Execution succesful
---------------------------------------------
Executed SQL packet:

SET QUOTED_IDENTIFIER ON

Execution succesful
---------------------------------------------
Executed SQL packet:

IF EXISTS (SELECT * FROM sys.objects WHERE type IN ('FN', 'IF') AND name = 'fnEngineerFeatures')
  DROP FUNCTION fnEngineerFeatures

Execution succesful
---------------------------------------------
Executed SQL packet:

CREATE FUNCTION [dbo].[fnEngineerFeatures] (
@passenger_count int = 0,
@trip_distance float = 0,
@trip_time_in_secs int = 0,
@pickup_latitude float = 0,
@pickup_longitude float = 0,
@dropoff_latitude float = 0,
@dropoff_longitude float = 0)
RETURNS TABLE
AS
  RETURN
  (
  -- Add the SELECT statement with parameter references here
  SELECT
    @passenger_count AS passenger_count,
    @trip_distance AS trip_distance,
    @trip_time_in_secs AS trip_time_in_secs,
    [dbo].[fnCalculateDistance](@pickup_latitude, @pickup_longitude, @dropoff_latitude, @dropoff_longitude) AS direct_di
stance

  )

Execution succesful
-----------------------------------------
C:\tempR\fnEngineerFeatures.sql execution done
Completed registering all stored procedures used in this walkthrough.
This step (registering all stored procedures) takes 22.5 seconds.
Plug in the database server name, database name, user name and password into the R script file
This step (plugging in database information) takes 0.2 seconds.    
 

Two tables are created: nyc_taxi_models and nyctaxi_sample.
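
Just to get a feel for what has been registered: once a model has been trained and persisted (that happens in a later lesson), scoring a single trip would look something like this (the parameter values are made up):

-- Hypothetical call of the registered scoring procedure; it only works
-- after a model has been stored in nyc_taxi_models.
EXEC [dbo].[PredictTipSingleMode]
     @passenger_count    = 1,
     @trip_distance      = 2.5,
     @trip_time_in_secs  = 631,
     @pickup_latitude    = 40.763958,
     @pickup_longitude   = -73.973373,
     @dropoff_latitude   = 40.782139,
     @dropoff_longitude  = -73.977303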




Now the table needs to be filled with the data. There seems to be one datafile available but there are two tables created. Looking at the script above, the second table (nyc_taxi_models) is created to persist the trained R model later on, so only nyctaxi_sample is loaded from the datafile.


The data is loaded and inserted into the table nyctaxi_sample and contains 1,703,957 rows.


The right number of rows in the table.
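
You can verify that count yourself with a simple query (assuming the database and table names from the script):

-- Double-check the number of loaded rows
USE nyctaxi
GO
SELECT COUNT(*) AS NumberOfRows FROM nyctaxi_sample
GO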


Conclusion

This is the first blogpost about R Services and it is too early to say something meaningful about it. In the next blogpost I'll elaborate more on R Services and the functions needed.

Greetz,

Hennie



PowerPivot : Discovering prediction of sales with simulation

Introduction

I was wondering whether I could add predictive analytics into my PowerPivot Workbooks. If I could predict the sales based on the sales in the former month or based on the same period last year or perhaps both, it would be a neat addition to my workbooks.

Now, the problem lies in the fact that statistical support in PowerPivot 2013 is very limited. In Excel you have statistical functions like NORM.DIST() and NORM.INV() that can help you build predictions (with simulation) in workbooks. But in PowerPivot 2010 and 2013 practically no support for statistics is available. Indeed, we can calculate the average and the standard deviation, but if you want to go a step further and use, for instance, a normal distribution function to calculate chances, you are out of luck.

So this blogpost is a kind of discovery-based blogpost where I try to find some solutions that would help predict sales with the aid of Excel 2013 and PowerPivot 2013. I found out that there are also (custom) functions in SQL Server that can help you build simulations, but that is for a later blogpost.

Simulation explained

Simulation is the ability to understand a system and the implications of its changes over time, including forecasting. Simulation is the mimicking of the operation of a real system, such as day-to-day operations in an organization. By using simulation you can quickly find out how a department or an organization might behave in the future. Simulation in a probabilistic model means that the outcome of the model changes every time you run it, because of a certain randomness in the formulas. This is in contrast with deterministic models, where the output is the same every time you run the model.

Simulation doesn't show you what will occur; instead, it will show you many scenarios that might occur!
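
As a small taste of the T-SQL route I mention in this post (a sketch of my own, not part of the workbook): you can draw a normally distributed prediction with the Box-Muller transform, given an average and a standard deviation. Every execution gives a different outcome, which is exactly the probabilistic behaviour described above.

-- Probabilistic draw via Box-Muller: x = avg + stdev * SQRT(-2*LN(u1)) * COS(2*pi*u2)
-- The average and standard deviation below are assumed example values.
DECLARE @avg float = 1000, @stdev float = 100
DECLARE @u1 float = 1.0 - RAND(), @u2 float = RAND()   -- uniform draws, @u1 in (0,1]
SELECT @avg + @stdev * SQRT(-2 * LOG(@u1)) * COS(2 * PI() * @u2) AS PredictedSales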

The data sets

I've created a couple of data sets in this model. The first one is the sales fact for a certain organization. For simplicity, I've only included a date and a sales amount for the current month. Now, we could extend this with another period if we like, for instance same period last year, but I won't explain that in more detail. This blogpost is just about getting the idea.

So this is the calendar dimension with a date as key and a daynumber as an attribute.



This is the sales data, and it could be stored in a fact table.


Now for using the data in the simulation we need to calculate the average and the standard deviation. Luckily PowerPivot 2013 can calculate this!

       
AvgSales:=CALCULATE(AVERAGE(Sales[Sales]); ALL(Sales))

StanDevSales:=CALCULATE(STDEV.P(Sales[Sales]); ALL(Sales))       
 

Now, because PowerPivot does not have statistical functions, I use Excel to calculate the predictions. So I pull the average and standard deviation into Excel and calculate the simulation per day there.


With this code for Average Sales and for the standard deviation:

       
=CUBEVALUE("ThisWorkbookDataModel"; "[Measures].[AvgSales]")
       
=CUBEVALUE("ThisWorkbookDataModel";"[Measures].[StanDevSales]")


The following formula is used in the table to calculate the predicted sales based on the average and standard deviation:

       
=NORM.INV(RAND();$E$4;$E$5)       
 

Now I pull this information back into PowerPivot in order to integrate it later in a PivotChart.


And now I can build this data model


And now I can draw this graph with the information :


And if we refresh the data you can see that it is a probabilistic model: the prediction is simulated differently every time the simulation executes. This is illustrated with the grey line.




Review of the solution

Now, there are some serious drawbacks to this solution. If I want to reload the data with new calculations, PowerPivot reports an error. The column with the predicted sales keeps showing "GETTING DATA".




If the button "Refresh Selected" is pressed, the error disappears and the data is refreshed again. Now, this is not very feasible in an automated environment, is it?



Another disadvantage is passing the average and the standard deviation to Excel with CUBEVALUE. I haven't found another solution yet, but I need to reference these cells with the average and standard deviation in the NORM.INV() function in Excel. This is fine for high-level sales analysis, but what if we want to include products in the simulation? Referencing every product with CUBEVALUE is time consuming and prone to errors when products are introduced and others decay.

Yet another problem arises when the data is not normally distributed. In the diagram below the predicted sales is even negative!


Conclusion

This was an interesting exercise to understand simulation a bit further. It seems that support for this in PowerPivot 2013 is minimal. Perhaps PowerPivot 2016 or Power BI v2 (Desktop) are better tooling, although I've read that there is still no support for Excel-like NORM.INV() and NORM.DIST() in Power BI v2. SQL Server 2016 has an R engine included. But I think it is possible to automate this with SQL Server T-SQL functions as well. That is for a future blogpost.

Regarding the simulation, this example assumes that the data is normally distributed, but there are, for instance, products that are just introduced and have another pattern (exponential?). Now, if you want to predict trustworthy figures, a good model is needed. And the problem is: what is a good model? As someone said, "All models are wrong but some are less wrong"; you have to strive with a maximum of effort to bring the model to perfection. Perhaps you have to apply the 80/20 rule: 20% of effort will give you 80% correctness.

In my opinion you first need to test what kind of distribution the data has. Is it normally distributed, is it skewed? Are there perhaps seasonal aspects in the data? And can I give a reliability number to a prediction? So, a nice discovery of the subject, but not ready for a real situation.

Greetz,
Hennie

Tuesday, 21 June 2016

Error : This PivotTable report is invalid

Introduction

I was struggling with a problem with a PivotTable in Excel for hours. Although I can't exactly pinpoint the source of the problem, I would like to share this problem and a possible solution with you.

The Problem

The situation is that I have a workbook with multiple sheets, and every sheet shows another view of the data, for instance from a product hierarchy and a customer hierarchy. Now, the client wanted to use slicers to select all the pivot tables and pivot charts on multiple sheets in the workbook. I have shown the situation below.


Now the real pain. The following error message kept coming back with everything I tried: "This PivotTable report is invalid. Try refreshing the Data...".


It seems to be a problem in Excel and not in PowerPivot, which is running in the background.

The solution

After trying to turn off the report connections (and not succeeding), I decided to remove the second tabsheet and start from scratch, and right away the problems vanished.


As a side effect the other slicers performed a lot better now.

Update 04-06-2016 (NL): Today I ran into this error again. This time I deleted a hierarchy that was in the pivottable and put it back again, and it worked again. Maybe the cause could be that I was editing the underlying query in PowerPivot.

Conclusion

Cross selecting (Rob Collie already used cross filtering in another context) with slicers to multiple pivot tables can be a real pain. So don't make it too complex or you'll end up in a swamp of errors in Excel.

Greetz,
Hennie