Tuesday, July 27, 2010

Geocoding text-based spatial data with SSIS packages.

On the SQL Server Performance site I found an interesting example by Siddharth Mehta of geocoding spatial data. In his example he geocodes some famous places around the world into a latitude and a longitude. I decided to try this and to learn more about geocoding. Below you can see the result.

These are the steps you have to take for this small demo:
1. Create test data.
2. Use a geocode processing engine (Bing or Google).
3. Build a geocode web service wrapper class.
4. Build an SSIS package.
5. Write the results to a table.

1. Create the test data

In his example, Siddharth Mehta used some famous objects from around the world. I decided to add some places and ran a few tests with them. Below you can see the starting point I used.
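
As a minimal sketch, the test data could look like the table below. The column names AddressObject, AddressCity and AddressCountry come from the script code further on; everything else is my own assumption:

CREATE TABLE dbo.Address
(
    AddressId      int IDENTITY(1,1) PRIMARY KEY,
    AddressObject  nvarchar(100) NOT NULL,  -- e.g. the name of a landmark
    AddressCity    nvarchar(100) NOT NULL,
    AddressCountry nvarchar(100) NOT NULL
);

INSERT INTO dbo.Address (AddressObject, AddressCity, AddressCountry)
VALUES ('Eiffel Tower', 'Paris', 'France'),
       ('Big Ben', 'London', 'United Kingdom'),
       ('Colosseum', 'Rome', 'Italy');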


2. Use a geocode processing engine

We need a geocode processing engine such as Bing Maps or Google Maps: an engine that takes text-based location information as input and returns the latitude and longitude of the corresponding location. These web-based applications expose web services that can be consumed by applications; our purpose here is to geocode the text-based data in our "Address" table.
For this sample application we will use the web service exposed by Bing Maps. First we need to subscribe to the service. Navigate to www.bingmapsportal.com, sign up and create a key, which will be your gateway to the web services exposed by Bing Maps. In the screenshot below you can see that I have created my application/key pair.


3. Build a geocode web service wrapper class

As the article states, create a client or wrapper class that can be used in the SSIS package; this makes it easy to call the web service. The .NET Framework ships with the wsdl.exe tool, which generates such a proxy class. Installing the .NET Framework was not enough in my case: even after installing the .NET Framework SDK I couldn't find the wsdl tool. I had to install Visual Studio Express before the wsdl tool was available. Maybe I missed something.

Do this:
wsdl /out:C:/geocode.cs http://dev.virtualearth.net/webservices/v1/geocodeservice/geocodeservice.svc?wsdl

This creates a geocoding proxy class in C#, which will be used in the SSIS package.


4. Build an SSIS package

The next step is building the SSIS package. There are four steps in this package:
a. Read the textual spatial information.
b. Add two columns: latitude and longitude.
c. Retrieve the information with the proxy class and the Bing web service.
d. Create a destination.

a. Read the textual spatial information.
  • Create an SSIS package and give it a name.
  • Create an OLE DB connection to the Address table.
  • Add a data flow to the package and configure it to read the Address table.
b. Add two columns: latitude and longitude.
  • Create a Derived Column transformation.
  • Add two fields: Latitude and Longitude.
  • Use the Unicode string data type (DT_WSTR) and set the length to 100.



c. Retrieve the information with the proxy class and the Bing web service.

  • Add a Script Component.
  • Select Transformation as the component type.
  • Select Visual C# as the scripting language.
  • Edit the Latitude and Longitude columns and make them read/write.
  • In the project explorer choose "Add Existing Item" and select the proxy class geocode.cs which you created with the wsdl tool.
  • Add references to System.Web.Services and System.Xml to the project.
  • Add the following code to the script: the Input0_ProcessInputRow method plus a GeocodeAddress helper.
public override void Input0_ProcessInputRow(Input0Buffer Row)
{
    string latitude = "", longitude = "";

    // Geocode the concatenated address text
    GeocodeAddress(Row.AddressObject + ", " + Row.AddressCountry + ", " + Row.AddressCity, ref latitude, ref longitude);

    // Write the coordinates back to the new columns
    Row.Latitude = latitude;
    Row.Longitude = longitude;
}

private void GeocodeAddress(string address, ref string latitude, ref string longitude)
{
    string key = @"xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx";

    GeocodeRequest geocodeRequest = new GeocodeRequest();

    // Set the credentials using a valid Bing Maps key
    geocodeRequest.Credentials = new Credentials();
    geocodeRequest.Credentials.ApplicationId = key;

    // Set the full address query
    geocodeRequest.Query = address;

    // Set the options to only return high confidence results
    ConfidenceFilter[] filters = new ConfidenceFilter[1];
    filters[0] = new ConfidenceFilter();
    filters[0].MinimumConfidence = Confidence.High;

    // Add the filters to the options
    GeocodeOptions geocodeOptions = new GeocodeOptions();
    geocodeOptions.Filters = filters;
    geocodeRequest.Options = geocodeOptions;

    // Make the geocode request
    GeocodeService geocodeService = new GeocodeService();
    GeocodeResponse geocodeResponse = geocodeService.Geocode(geocodeRequest);

    // Take the first result, if any
    if (geocodeResponse.Results.Length > 0)
    {
        latitude = geocodeResponse.Results[0].Locations[0].Latitude.ToString();
        longitude = geocodeResponse.Results[0].Locations[0].Longitude.ToString();
    }
}

  • Replace xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx with the key that Bing gave you in the Bing Maps Portal.
  • Not sure if this is needed, but do a build.
d. Create a destination
  • Add a destination component and write the results to a table.


5. Write the results to a table.


And here are the results:

Conclusion:

This example shows how to transform textual spatial information into geocoded spatial information. I tried some small places and they all got a latitude and a longitude. Interesting stuff; I will try more of it and blog about it.


Hennie

Thursday, July 22, 2010

Installing SQL Server 2008 R2 Sample databases and FILESTREAM

When installing the sample databases on a SQL Server 2008 R2 installation, I ran into an issue with the FILESTREAM settings. You can download the sample databases, along with lots of other great stuff, from the CodePlex site. During installation I saw that the AdventureWorks 2008R2 OLTP database couldn't be installed.



At first the web page told me that you have to configure FILESTREAM. Okay, I did:

EXEC sp_configure filestream_access_level, 1
RECONFIGURE
 
After that I tried it again. No go.

Then, somewhere on the page, I found a link to another page, which linked to yet another page, which told me to tick the checkbox "Enable FILESTREAM for Transact-SQL access" in the properties of the SQL Server service. You can reach this via the Configuration Manager:


You can do this with the following steps:
1. On the Start menu, point to All Programs, point to Microsoft SQL Server 2008 R2, point to Configuration Tools, and then click SQL Server Configuration Manager.
2. In the list of services, right-click SQL Server Services, and then click Open.
3. In the SQL Server Configuration Manager snap-in, locate the instance of SQL Server on which you want to enable FILESTREAM.
4. Right-click the instance, and then click Properties.
5. In the SQL Server Properties dialog box, click the FILESTREAM tab.
6. Select the Enable FILESTREAM for Transact-SQL access check box.
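
A quick way to verify both levels afterwards is to query the server properties (my own check, not part of the original instructions):

SELECT SERVERPROPERTY('FilestreamConfiguredLevel') AS ConfiguredLevel, -- the sp_configure setting
       SERVERPROPERTY('FilestreamEffectiveLevel')  AS EffectiveLevel;  -- what is actually in effect, given the service-level setting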

And the results are:



You have to install the Analysis Services databases separately in BIDS.

CU,
Hennie

Tuesday, July 13, 2010

Data aggregation implementation

This is the second article in the series about MDM, based on the Microsoft article “Master Data Management from a technical perspective” (April 2010). As I mentioned in the first post, there are four implementation possibilities for Master Data Management:

• Master data registry implementation.
• Data aggregation implementation.
• System-of-record-only implementation.
• Complete enterprise master data management implementation.

This blog is about the 'data aggregation implementation', also called the 'watered down' method. The article states: 'Data aggregation implementations involve the creation of a new system that is neither the system of entry nor the system of record but a downstream system used to aggregate the attributes of multiple systems and pass them to lower level subscribing systems.' This means that one system is taken as the most business-critical, leading system. Its data is loaded into the MDM hub, and the other systems use it as the base to compare their own data against.



At first, data aggregation seemed a bit strange to me. Aggregated data is data (measures) rolled up to a higher level (summarized, maximized or whatever). So what is meant by a data aggregation implementation? In this context, data aggregation means: a single view of the master data. This is in contrast with the multi-view approach, which is well suited for companies that want the 'right view' presented to the right people and processes at the right time. In a data aggregation implementation, multiple values are unified into one view: the aggregated data. A toy example of such a unification follows below.
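
As an illustration of unifying multiple values into one view in SQL (all table, column and rule names below are my own assumptions, not from the Microsoft article):

-- Build the single aggregated view of a customer from two source systems.
-- Simple survivorship rule: take the CRM value when present, otherwise the ERP value.
SELECT m.CustomerKey,
       COALESCE(crm.PhoneNumber,  erp.PhoneNumber)  AS PhoneNumber,
       COALESCE(crm.EmailAddress, erp.EmailAddress) AS EmailAddress
FROM dbo.MasterCustomer m
LEFT JOIN staging.CrmCustomer crm ON crm.CustomerKey = m.CustomerKey
LEFT JOIN staging.ErpCustomer erp ON erp.CustomerKey = m.CustomerKey;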

Advantages:
• Quick wins for an organization.
• Risk to the mission-critical application can be mitigated.
• Less critical applications can begin to source their master data from the master data management application, solving any integration issues that arise without major ramifications for the organization.
• Integration processes can be tested and modified in an iterative fashion.

Disadvantages:
• Lack of control. It is very difficult for an initial master data management project to get all of the necessary stakeholders to relinquish control of their data to a new system immediately.

So this is it again....
 
Greetz,
Hennie

Sunday, July 11, 2010

Master data registry implementation

Reading the Microsoft article “Master Data Management from a technical perspective” (April 2010), I noticed that there are four implementation possibilities for Master Data Management:
  • Master data registry implementation
  • Data aggregation implementation
  • System-of-record-only implementation
  • Complete enterprise master data management implementation.
I want to describe the first type of implementation in this blog: the master data registry implementation. In a registry implementation, each system remains in control of its own data. All system data records are mapped in the master data registry. Data maps show the relationships between the keys of two different systems. The Microsoft article distinguishes two variants:
  • One to one: every record in the main system will have only one corresponding record in the secondary system.
  • One to many: every record in the main system will have one or more corresponding records in the secondary system.


This approach is the opposite of the repository approach, in which the master data is stored in the MDM hub. Here, the master data is maintained in the application databases, and the MDM hub contains lists of keys that can be used to find all the related records for a particular master data item. For example, if there are records for a particular customer in the CRM system, the ERP system and another database, the MDM hub would contain a mapping of the keys of these three records to a common key; a sketch follows below.
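
As a minimal sketch, such a key mapping could look like this (table and column names are my own illustration, not from the article):

CREATE TABLE dbo.CustomerKeyMap
(
    MasterCustomerKey int         NOT NULL,  -- the common key in the MDM hub
    SourceSystem      varchar(20) NOT NULL,  -- e.g. 'CRM', 'ERP'
    SourceKey         varchar(50) NOT NULL,  -- the key of the record in that source system
    PRIMARY KEY (SourceSystem, SourceKey)
);

-- One and the same customer as known in three different systems:
INSERT INTO dbo.CustomerKeyMap (MasterCustomerKey, SourceSystem, SourceKey)
VALUES (1, 'CRM', 'C-1001'),
       (1, 'ERP', '55-DE-88'),
       (1, 'WEB', 'cust_9f2');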

Advantages:
  • The changes to application code to implement this model are usually minimal.
  • Current application users generally don't need to be aware of the MDM system.

Disadvantage:
  • Every query against MDM data is a distributed query across all the entries for the desired data in all the application databases. A query against a particular customer is probably not unreasonable, but if you want a list of all customers who have ordered a particular product in the last six months, you may need to do a distributed join across tables from five or even ten databases. Doing this kind of large, distributed query efficiently is pretty difficult; a sketch of such a query follows below.
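
Using the hypothetical key map sketched above and linked servers for the application databases, such a distributed query could look like this (all server, database, table and column names are assumptions):

-- Customers who ordered product 'P-123' in the last six months,
-- collected from two application databases via four-part linked server names.
SELECT k.MasterCustomerKey
FROM dbo.CustomerKeyMap k
JOIN CRMSRV.CrmDb.dbo.Orders o
  ON k.SourceSystem = 'CRM' AND o.CustomerId = k.SourceKey
WHERE o.ProductCode = 'P-123'
  AND o.OrderDate >= DATEADD(MONTH, -6, GETDATE())
UNION
SELECT k.MasterCustomerKey
FROM dbo.CustomerKeyMap k
JOIN ERPSRV.ErpDb.dbo.SalesLines s
  ON k.SourceSystem = 'ERP' AND s.CustomerNo = k.SourceKey
WHERE s.ItemCode = 'P-123'
  AND s.PostingDate >= DATEADD(MONTH, -6, GETDATE());
-- ...and so on for every other application database that holds orders.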

An interesting approach! Before I read the article, I assumed there was only one possible implementation type: the repository approach. So there are multiple types of MDM implementations. I will blog about the other types later...

Also interesting are the System of Record (SOR) and the System of Entry (SOE). SOR means that most, if not all, systems receive their data from this source; when conflicts arise, this system is considered primary.

SOE means that this system is the primary point of data entry, either through direct entry or through services that update the data in near real time. This is interesting because, when studying Data Vault, Linstedt states that the system of record moves from the source systems to the data warehouse (because source systems change). So how to put this in perspective?

Greetz,
Hennie