Introduction
I wanted to know more about R Services and I decided to walkhrough the tutorial provided by
Microsoft. There are a couple of lessons you can follow to understand R Services. So this will the first blogpost about R Services and many will follow during my discovery of R services. R services is a powerful addition to SQL Server and one of the advantages is that you don't need to load your data to the client anymore. You can puss your code to the R Services of SQL Server and have advantages of a powerful server.
So this blogpost is about the first step (Lesson 1: Prepare the Data (Data science End-to-End walkthrough) and that is getting everything ready for analysis: installing packages, getting the data, loading the data in SQL Server, installing some of the code in SQL Server.
The data is a representative sampling of the New York City taxi data set, which contains records of over 173 million individual trips in 2013, including the fares and tip amounts paid for each trip.
For more information about how this data was originally collected, and how you can get the full data set, see http://chriswhong.com/open-data/foil_nyc_taxi/.
To make the data easier to work with, the Microsoft data science team performed downsampling to get just 1% of the data.
Walkthrough the steps
First I opened R Studio and created a new script and copied the code for installing the packages.
if (!('ggmap' %in% rownames(installed.packages()))){
install.packages('ggmap')
}
if (!('mapproj' %in% rownames(installed.packages()))){
install.packages('mapproj')
}
if (!('ROCR' %in% rownames(installed.packages()))){
install.packages('ROCR')
}
if (!('RODBC' %in% rownames(installed.packages()))){
install.packages('RODBC')
}
And I executed this code for installing the packages on the server.
# Install required packages on the server
install.packages("ggmap", lib=grep("Program Files", .libPaths(), value=TRUE)[1])
install.packages("mapproj", lib=grep("Program Files", .libPaths(), value=TRUE)[1])
The next step was downloading the data from Github. The code I copied from the Micoroft site gave some errors because there were some missing slashes.
$source = 'https://raw.githubusercontent.com/Azure/Azure-MachineLearning-DataScience/master/Misc/RSQL/Download_Scripts_R_Walkthrough.ps1'
$ps1_dest = "$pwd\Download_Scripts_R_Walkthrough.ps1"
$wc = New-Object System.Net.WebClient
$wc.DownloadFile($source, $ps1_dest)
.\Download_Scripts_R_Walkthrough.ps1 –DestDir 'C:\tempR'
And this a printscreen of the entry and the execution.
Once the script is finished the files are in the TempR folder:
Next step was to install the scripts in SQL Server. This is done with Run_R_Walkthrough.ps1 Powershell script.
.\RunSQL_R_Walkthrough.ps1
Input the database server name (the full address): WIN-11283AFOULP
Input the name of the database you want to create: nyctaxi
Input the user name which has the previlige to create the database:
Input the password of user name which has the previlige to create the databas
Input the path to the csv file you want to upload to the database: c:\tempR
Here a printscreen of this process.
The script is executed and results are shown in the powershell window. I've grabbed the dump and placed it here.
Start creating database and table on your SQL Server, and uploading data to the table. It may take a while...
---------------------------------------------
Executed SQL packet:
DECLARE @db_name varchar(255), @tb_name varchar(255)
DECLARE @create_db_template varchar(max), @create_tb_template varchar(max), @create_tb_template2 varchar(max)
DECLARE @sql_script varchar(max)
set @db_name = 'nyctaxi'
SET @tb_name = 'nyctaxi_sample'
SET @create_db_template = 'create database {db_name}'
SET @create_tb_template = '
use {db_name}
CREATE TABLE {tb_name}
(
medallion varchar(50) not null,
hack_license varchar(50) not null,
vendor_id char(3),
rate_code char(3),
store_and_fwd_flag char(3),
pickup_datetime datetime not null,
dropoff_datetime datetime,
passenger_count int,
trip_time_in_secs bigint,
trip_distance float,
pickup_longitude varchar(30),
pickup_latitude varchar(30),
dropoff_longitude varchar(30),
dropoff_latitude varchar(30),
payment_type char(3),
fare_amount float,
surcharge float,
mta_tax float,
tolls_amount float,
total_amount float,
tip_amount float,
tipped int,
tip_class int
)
CREATE CLUSTERED COLUMNSTORE INDEX [nyc_cci] ON {tb_name} WITH (DROP_EXISTING = OFF)
'
SET @create_tb_template2 = '
use {db_name}
CREATE TABLE nyc_taxi_models
(
model varbinary(max) not null
)
'
-- Create database
SET @sql_script = REPLACE(@create_db_template, '{db_name}', @db_name)
EXECUTE(@sql_script)
-- Create table
SET @sql_script = REPLACE(@create_tb_template, '{db_name}', @db_name)
SET @sql_script = REPLACE(@sql_script, '{tb_name}', @tb_name)
EXECUTE(@sql_script)
-- Create the table to persist the trained model
SET @sql_script = REPLACE(@create_tb_template2, '{db_name}', @db_name)
EXECUTE(@sql_script)
Execution succesful
-----------------------------------------
C:\tempR\create-db-tb-upload-data.sql execution done
start loading the data to SQL Server table...
Unable to load BCP resource DLL. BCP cannot continue.
This step (creating database and tables, and uploading data to table) takes 27.19 seconds.
Start running the .sql files to register all stored procedures used in this walkthrough...
---------------------------------------------
Executed SQL packet:
USE [nyctaxi]
Execution succesful
---------------------------------------------
Executed SQL packet:
/****** Object: StoredProcedure [dbo].[PersistModel] Script Date: 11/05/2015 23:04:39 ******/
SET ANSI_NULLS ON
Execution succesful
---------------------------------------------
Executed SQL packet:
SET QUOTED_IDENTIFIER ON
Execution succesful
---------------------------------------------
Executed SQL packet:
IF EXISTS (SELECT * FROM sys.objects WHERE type = 'P' AND name = 'PersistModel')
DROP PROCEDURE PersistModel
Execution succesful
---------------------------------------------
Executed SQL packet:
-- =============================================
-- Author: <Author,,Name>
-- Create date: <Create Date,,>
-- Description: Store model in database. Input is a binary representation of R Model object passed as Hex string
-- =============================================
CREATE PROCEDURE [dbo].[PersistModel]
@m nvarchar(max)
AS
BEGIN
-- SET NOCOUNT ON added to prevent extra result sets from
-- interfering with SELECT statements.
SET NOCOUNT ON;
insert into nyc_taxi_models (model) values (convert(varbinary(max),@m,2))
END
Execution succesful
-----------------------------------------
C:\tempR\PersistModel.sql execution done
---------------------------------------------
Executed SQL packet:
USE [nyctaxi]
Execution succesful
---------------------------------------------
Executed SQL packet:
/****** Object: StoredProcedure [dbo].[PredictTip] Script Date: 10/29/2015 4:36:55 PM ******/
SET ANSI_NULLS ON
Execution succesful
---------------------------------------------
Executed SQL packet:
SET QUOTED_IDENTIFIER ON
Execution succesful
---------------------------------------------
Executed SQL packet:
IF EXISTS (SELECT * FROM sys.objects WHERE type = 'P' AND name = 'PredictTipBatchMode')
DROP PROCEDURE PredictTipBatchMode
Execution succesful
---------------------------------------------
Executed SQL packet:
CREATE PROCEDURE [dbo].[PredictTipBatchMode] @inquery nvarchar(max)
AS
BEGIN
DECLARE @lmodel2 varbinary(max) = (SELECT TOP 1
model
FROM nyc_taxi_models);
EXEC sp_execute_external_script @language = N'R',
@script = N'
mod <- unserialize(as.raw(model));
print(summary(mod))
OutputDataSet<-rxPredict(modelObject = mod, data = InputDataSet, outData = NULL,
predVarNames = "Score", type = "response", writeModelVars = FALSE, overwrite = TRUE);
str(OutputDataSet)
print(OutputDataSet)
',
@input_data_1 = @inquery,
@params = N'@model varbinary(max)',
@model = @lmodel2
WITH RESULT SETS ((Score float));
END
Execution succesful
-----------------------------------------
C:\tempR\PredictTipBatchMode.sql execution done
---------------------------------------------
Executed SQL packet:
USE [nyctaxi]
Execution succesful
---------------------------------------------
Executed SQL packet:
/****** Object: StoredProcedure [dbo].[PredictTipSingleMode] Script Date: 10/31/2015 5:44:43 PM ******/
SET ANSI_NULLS ON
Execution succesful
---------------------------------------------
Executed SQL packet:
SET QUOTED_IDENTIFIER ON
Execution succesful
---------------------------------------------
Executed SQL packet:
IF EXISTS (SELECT * FROM sys.objects WHERE type = 'P' AND name = 'PredictTipSingleMode')
DROP PROCEDURE PredictTipSingleMode
Execution succesful
---------------------------------------------
Executed SQL packet:
CREATE PROCEDURE [dbo].[PredictTipSingleMode] @passenger_count int = 0,
@trip_distance float = 0,
@trip_time_in_secs int = 0,
@pickup_latitude float = 0,
@pickup_longitude float = 0,
@dropoff_latitude float = 0,
@dropoff_longitude float = 0
AS
BEGIN
DECLARE @inquery nvarchar(max) = N'
SELECT * FROM [dbo].[fnEngineerFeatures](
@passenger_count,
@trip_distance,
@trip_time_in_secs,
@pickup_latitude,
@pickup_longitude,
@dropoff_latitude,
@dropoff_longitude)
'
DECLARE @lmodel2 varbinary(max) = (SELECT TOP 1
model
FROM nyc_taxi_models);
EXEC sp_execute_external_script @language = N'R',
@script = N'
mod <- unserialize(as.raw(model));
print(summary(mod))
OutputDataSet<-rxPredict(modelObject = mod, data = InputDataSet, outData = NULL,
predVarNames = "Score", type = "response", writeModelVars = FALSE, overwrite = TRUE);
str(OutputDataSet)
print(OutputDataSet)
',
@input_data_1 = @inquery,
@params = N'@model varbinary(max),@passenger_count int,@trip_distance float,
@trip_time_in_secs int ,
@pickup_latitude float ,
@pickup_longitude float ,
@dropoff_latitude float ,
@dropoff_longitude float',
@model = @lmodel2,
@passenger_count =@passenger_count ,
@trip_distance=@trip_distance,
@trip_time_in_secs=@trip_time_in_secs,
@pickup_latitude=@pickup_latitude,
@pickup_longitude=@pickup_longitude,
@dropoff_latitude=@dropoff_latitude,
@dropoff_longitude=@dropoff_longitude
WITH RESULT SETS ((Score float));
END
Execution succesful
-----------------------------------------
C:\tempR\PredictTipSingleMode.sql execution done
---------------------------------------------
Executed SQL packet:
USE [nyctaxi]
Execution succesful
---------------------------------------------
Executed SQL packet:
/****** Object: UserDefinedFunction [dbo].[fnCalculateDistance] Script Date: 10/29/2015 4:33:52 PM ******/
SET ANSI_NULLS ON
Execution succesful
---------------------------------------------
Executed SQL packet:
SET QUOTED_IDENTIFIER ON
Execution succesful
---------------------------------------------
Executed SQL packet:
IF EXISTS (SELECT * FROM sys.objects WHERE type IN ('FN', 'IF') AND name = 'fnCalculateDistance')
DROP FUNCTION fnCalculateDistance
Execution succesful
---------------------------------------------
Executed SQL packet:
CREATE FUNCTION [dbo].[fnCalculateDistance] (@Lat1 float, @Long1 float, @Lat2 float, @Long2 float)
-- User-defined function calculate the direct distance between two geographical coordinates.
RETURNS float
AS
BEGIN
DECLARE @distance decimal(28, 10)
-- Convert to radians
SET @Lat1 = @Lat1 / 57.2958
SET @Long1 = @Long1 / 57.2958
SET @Lat2 = @Lat2 / 57.2958
SET @Long2 = @Long2 / 57.2958
-- Calculate distance
SET @distance = (SIN(@Lat1) * SIN(@Lat2)) + (COS(@Lat1) * COS(@Lat2) * COS(@Long2 - @Long1))
--Convert to miles
IF @distance <> 0
BEGIN
SET @distance = 3958.75 * ATAN(SQRT(1 - POWER(@distance, 2)) / @distance);
END
RETURN @distance
END
Execution succesful
-----------------------------------------
C:\tempR\fnCalculateDistance.sql execution done
---------------------------------------------
Executed SQL packet:
USE [nyctaxi]
Execution succesful
---------------------------------------------
Executed SQL packet:
/****** Object: UserDefinedFunction [dbo].[fnEngineerFeatures] Script Date: 10/29/2015 4:36:07 PM ******/
SET ANSI_NULLS ON
Execution succesful
---------------------------------------------
Executed SQL packet:
SET QUOTED_IDENTIFIER ON
Execution succesful
---------------------------------------------
Executed SQL packet:
IF EXISTS (SELECT * FROM sys.objects WHERE type IN ('FN', 'IF') AND name = 'fnEngineerFeatures')
DROP FUNCTION fnEngineerFeatures
Execution succesful
---------------------------------------------
Executed SQL packet:
CREATE FUNCTION [dbo].[fnEngineerFeatures] (
@passenger_count int = 0,
@trip_distance float = 0,
@trip_time_in_secs int = 0,
@pickup_latitude float = 0,
@pickup_longitude float = 0,
@dropoff_latitude float = 0,
@dropoff_longitude float = 0)
RETURNS TABLE
AS
RETURN
(
-- Add the SELECT statement with parameter references here
SELECT
@passenger_count AS passenger_count,
@trip_distance AS trip_distance,
@trip_time_in_secs AS trip_time_in_secs,
[dbo].[fnCalculateDistance](@pickup_latitude, @pickup_longitude, @dropoff_latitude, @dropoff_longitude) AS direct_di
stance
)
Execution succesful
-----------------------------------------
C:\tempR\fnEngineerFeatures.sql execution done
Completed registering all stored procedures used in this walkthrough.
This step (registering all stored procedures) takes 22.5 seconds.
Plug in the database server name, database name, user name and password into the R script file
This step (plugging in database information) takes 0.2 seconds.
Two tables are created, nyc_taxi_models and nyctaxi_sample
Now the table needs to be filled with the data. There seems to be one datafile available but there are two tables created. I don't know why.
The data is loaded and inserted in the table nyctaxi_sample and contains 1.703.957 rows.
The right number in table.
Conclusion
The first blogpost about R Services and it is to early to say something about R Services. In the next blogpost I'll elaborate more about R Services and the functions needed.
Greet,
Hennie