<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	>
<channel>
	<title>Comments for Data Reflections</title>
	<atom:link href="http://blogs.tallan.com/datareflections/comments/feed/" rel="self" type="application/rss+xml" />
	<link>http://blogs.tallan.com/datareflections</link>
	<description>Discussions of data warehousing, transformation and analysis by the Business Intelligence Team at Tallan.</description>
	<pubDate>Mon, 22 Mar 2010 00:05:55 +0000</pubDate>
	<generator>http://wordpress.org/?v=2.7</generator>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
		<item>
		<title>Comment on Introduction to SAP Business Objects Data Integrator ETL tool by JD</title>
		<link>http://blogs.tallan.com/datareflections/2008/11/06/introduction-to-sap-business-objects-data-integrator-etl-tool/comment-page-1/#comment-3336</link>
		<dc:creator>JD</dc:creator>
		<pubDate>Thu, 26 Feb 2009 20:06:22 +0000</pubDate>
		<guid isPermaLink="false">http://www.datareflections.net/?p=205#comment-3336</guid>
		<description>Really practical presentation. I know it's really variable, but I was wondering how long it would take to develop an ETL job using BODI to load a 15-column table from formatted flat files, given that no data transformation is needed? I'd just like a rough time estimate.
Thank you!</description>
		<content:encoded><![CDATA[<p>Really practical presentation. I know it&#8217;s really variable, but I was wondering how long it would take to develop an ETL job using BODI to load a 15-column table from formatted flat files, given that no data transformation is needed? I&#8217;d just like a rough time estimate.<br />
Thank you!</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Introduction to SAP Business Objects Data Integrator ETL tool by Savio</title>
		<link>http://blogs.tallan.com/datareflections/2008/11/06/introduction-to-sap-business-objects-data-integrator-etl-tool/comment-page-1/#comment-17</link>
		<dc:creator>Savio</dc:creator>
		<pubDate>Wed, 12 Nov 2008 12:52:17 +0000</pubDate>
		<guid isPermaLink="false">http://www.datareflections.net/?p=205#comment-17</guid>
		<description>One thing I did not mention in my presentation or my post is that using a tool provides an additional benefit: it makes your ETL jobs database-platform agnostic. This allows you to move your code from one database platform to another with minimal changes; typically the only change necessary is the connection information. I have actually worked on a project where the test environment had a SQL Server backend and production had an Oracle backend. I was able to move my code from test to production by simply changing the connection information. The caveat is that if you choose to write your own stored procedures or SQL, you will need to rewrite them when you move to another platform, as DI will not translate this kind of code.</description>
		<content:encoded><![CDATA[<p>One thing I did not mention in my presentation or my post is that using a tool provides an additional benefit: it makes your ETL jobs database-platform agnostic. This allows you to move your code from one database platform to another with minimal changes; typically the only change necessary is the connection information. I have actually worked on a project where the test environment had a SQL Server backend and production had an Oracle backend. I was able to move my code from test to production by simply changing the connection information. The caveat is that if you choose to write your own stored procedures or SQL, you will need to rewrite them when you move to another platform, as DI will not translate this kind of code.</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on SSIS data management fundamentals by achandler</title>
		<link>http://blogs.tallan.com/datareflections/2008/10/22/ssis-data-management-fundamentals/comment-page-1/#comment-12</link>
		<dc:creator>achandler</dc:creator>
		<pubDate>Sun, 26 Oct 2008 03:54:57 +0000</pubDate>
		<guid isPermaLink="false">http://www.datareflections.net/?p=192#comment-12</guid>
		<description>Coming out of Eric's presentation, we discussed several SSIS "pain points". Here's a follow-up with info on the issues we discussed.

&lt;b&gt;Connection Managers&lt;/b&gt; are key elements of SSIS packages. This article contains a decent overview and links: http://www.mssqltips.com/tip.asp?tip=1147

In an enterprise setting, it's important to be able to move an SSIS package from Dev, through QA, and into Production without modifications. In the case of Connection Managers, this is accomplished by either creating SSIS configurations, or using the /CONF flag with dtexec on the command line.

Here's a Brian Knight video on SSIS Configurations: http://www.jumpstarttv.com/Media.aspx?vid=202

Here's the MSDN article covering dtexec options: http://msdn.microsoft.com/en-us/library/ms162810(SQL.90).aspx

Eric pointed out that &lt;b&gt;Metadata Issues&lt;/b&gt; arise frequently in SSIS dataflow tasks. Fixing them is not particularly elegant. Here's a link describing the errors you can encounter and the solution: http://followtheheard.blogspot.com/2007/10/ssis-external-metadata-refresh.html

Terry did not disappoint. His reputation for the arcane was reinforced when he mentioned &lt;b&gt;IMEX=1&lt;/b&gt;. Apparently, this can be added to Excel connection strings to enforce the interpretation of mixed data within the same columns. Clear as mud? This may help: http://support.microsoft.com/default.aspx?scid=kb;en-us;194124

&lt;b&gt;SSIS Data Flows require a destination&lt;/b&gt;, which can be problematic when you're on a deadline and don't have control of the database environment. Konesans has a useful free component you can download here: http://www.konesans.com/trashdest.aspx

&lt;b&gt;Script Task&lt;/b&gt;! I believe it was Kenneth who mentioned how painful they can be. You'd think a Script Task would lend itself to great Visual Studio debugging integration. Sadly, you'd be wrong. Here's the MSDN breakdown: http://msdn.microsoft.com/en-us/library/ms140033.aspx</description>
		<content:encoded><![CDATA[<p>Coming out of Eric&#8217;s presentation, we discussed several SSIS &#8220;pain points&#8221;. Here&#8217;s a follow-up with info on the issues we discussed.</p>
<p><b>Connection Managers</b> are key elements of SSIS packages. This article contains a decent overview and links: <a href="http://www.mssqltips.com/tip.asp?tip=1147" rel="nofollow">http://www.mssqltips.com/tip.asp?tip=1147</a></p>
<p>In an enterprise setting, it&#8217;s important to be able to move an SSIS package from Dev, through QA, and into Production without modifications. In the case of Connection Managers, this is accomplished by either creating SSIS configurations, or using the /CONF flag with dtexec on the command line.</p>
<p>Here&#8217;s a Brian Knight video on SSIS Configurations: <a href="http://www.jumpstarttv.com/Media.aspx?vid=202" rel="nofollow">http://www.jumpstarttv.com/Media.aspx?vid=202</a></p>
<p>Here&#8217;s the MSDN article covering dtexec options: <a href="http://msdn.microsoft.com/en-us/library/ms162810(SQL.90).aspx" rel="nofollow">http://msdn.microsoft.com/en-us/library/ms162810(SQL.90).aspx</a></p>
<p>Eric pointed out that <b>Metadata Issues</b> arise frequently in SSIS dataflow tasks. Fixing them is not particularly elegant. Here&#8217;s a link describing the errors you can encounter and the solution: <a href="http://followtheheard.blogspot.com/2007/10/ssis-external-metadata-refresh.html" rel="nofollow">http://followtheheard.blogspot.com/2007/10/ssis-external-metadata-refresh.html</a></p>
<p>Terry did not disappoint. His reputation for the arcane was reinforced when he mentioned <b>IMEX=1</b>. Apparently, this can be added to Excel connection strings to enforce the interpretation of mixed data within the same columns. Clear as mud? This may help: <a href="http://support.microsoft.com/default.aspx?scid=kb;en-us;194124" rel="nofollow">http://support.microsoft.com/default.aspx?scid=kb;en-us;194124</a></p>
<p><b>SSIS Data Flows require a destination</b>, which can be problematic when you&#8217;re on a deadline and don&#8217;t have control of the database environment. Konesans has a useful free component you can download here: <a href="http://www.konesans.com/trashdest.aspx" rel="nofollow">http://www.konesans.com/trashdest.aspx</a></p>
<p><b>Script Task</b>! I believe it was Kenneth who mentioned how painful they can be. You&#8217;d think a Script Task would lend itself to great Visual Studio debugging integration. Sadly, you&#8217;d be wrong. Here&#8217;s the MSDN breakdown: <a href="http://msdn.microsoft.com/en-us/library/ms140033.aspx" rel="nofollow">http://msdn.microsoft.com/en-us/library/ms140033.aspx</a></p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Developing a Dimensional Model w/ conformed dimensions and facts by Terry</title>
		<link>http://blogs.tallan.com/datareflections/2008/09/23/developing-a-dimensional-model-w-conformed-dimensions-and-facts/comment-page-1/#comment-11</link>
		<dc:creator>Terry</dc:creator>
		<pubDate>Thu, 25 Sep 2008 19:25:24 +0000</pubDate>
		<guid isPermaLink="false">http://www.datareflections.net/?p=154#comment-11</guid>
		<description>Here is a good article http://www.essentialstrategies.com/publications/modeling/makingrd.htm</description>
		<content:encoded><![CDATA[<p>Here is a good article <a href="http://www.essentialstrategies.com/publications/modeling/makingrd.htm" rel="nofollow">http://www.essentialstrategies.com/publications/modeling/makingrd.htm</a></p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Developing a Dimensional Model w/ conformed dimensions and facts by Terry</title>
		<link>http://blogs.tallan.com/datareflections/2008/09/23/developing-a-dimensional-model-w-conformed-dimensions-and-facts/comment-page-1/#comment-10</link>
		<dc:creator>Terry</dc:creator>
		<pubDate>Thu, 25 Sep 2008 19:19:13 +0000</pubDate>
		<guid isPermaLink="false">http://www.datareflections.net/?p=154#comment-10</guid>
		<description>Tino, in Figure 2 you have not represented Store and SalesPerson as a many-to-many relationship.  I think your representation will result in a 1-to-1 relationship.</description>
		<content:encoded><![CDATA[<p>Tino, in Figure 2 you have not represented Store and SalesPerson as a many-to-many relationship.  I think your representation will result in a 1-to-1 relationship.</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on Developing a Dimensional Model w/ conformed dimensions and facts by Savio</title>
		<link>http://blogs.tallan.com/datareflections/2008/09/23/developing-a-dimensional-model-w-conformed-dimensions-and-facts/comment-page-1/#comment-8</link>
		<dc:creator>Savio</dc:creator>
		<pubDate>Tue, 23 Sep 2008 17:04:22 +0000</pubDate>
		<guid isPermaLink="false">http://www.datareflections.net/?p=154#comment-8</guid>
		<description>I think this would be a good place to talk about what Kimball refers to as the bus architecture. It can be a useful technique to represent the conformed and non-conformed dimensions. I've used it before in customer presentations as a tool to explain exactly what a conformed dimension is and why it is important.</description>
		<content:encoded><![CDATA[<p>I think this would be a good place to talk about what Kimball refers to as the bus architecture. It can be a useful technique to represent the conformed and non-conformed dimensions. I&#8217;ve used it before in customer presentations as a tool to explain exactly what a conformed dimension is and why it is important.</p>
]]></content:encoded>
	</item>
	<item>
		<title>Comment on SCD via SQL Stored Procedure by Terry</title>
		<link>http://blogs.tallan.com/datareflections/2008/09/08/scd-via-sql-stored-procedure/comment-page-1/#comment-7</link>
		<dc:creator>Terry</dc:creator>
		<pubDate>Tue, 16 Sep 2008 14:41:37 +0000</pubDate>
		<guid isPermaLink="false">http://www.datareflections.net/?p=126#comment-7</guid>
		<description>binary_checksum()...
http://msdn.microsoft.com/en-us/library/ms173784.aspx
The SQL Server checksum functions can be used to detect changes to records.  So if I am looking to see whether a value changed on a row, I might code something that looks like


UPDATE a
   SET ...
  FROM targetTable a
  JOIN sourceTable b
    ON a.keys = b.keys
 WHERE binary_checksum( a.col1
                      , a.col2)
    &lt;&gt; binary_checksum( b.col1
                      , b.col2)

This is far more efficient than 

UPDATE a
   SET ...
  FROM targetTable a
  JOIN sourceTable b
    ON a.keys = b.keys
 WHERE ((a.col1 &lt;&gt; b.col1)
    OR  (a.col2 &lt;&gt; b.col2))

It can be made even more efficient by storing the checksum as a column in your fact/dimension table and indexing it.

BUT...   MS does not guarantee that rows that have changed will not produce the same checksum.  So if you use it, you have to be careful to Test, Test, Test.  This can occur for only a small percentage of changes, but it can occur.

Further caveat...   NULLs in your expression list are ignored.  Let's say your list looks something like this

SELECT binary_checksum(1,2,NULL,3)
It will return 291 as the result

Similarly, the following will also produce 291 as the result set...
SELECT binary_checksum(1,NULL,2,3)

Simply casting the columns to a datatype will usually fix this, but I usually coalesce the values with a value unlikely to occur in the dataset.  So my update statement above might look something like this

DECLARE @ValueIfNullStr varchar(10)
      , @ValueIfNullNbr int
      
SELECT @ValueIfNullStr = '#^#@'
     , @ValueIfNullNbr = -2000000000

UPDATE a
   SET ...
  FROM targetTable a
  JOIN sourceTable b
    ON a.keys = b.keys
 WHERE binary_checksum( coalesce(a.col1,@ValueIfNullStr)
                      , coalesce(a.col2,@ValueIfNullNbr))
    &lt;&gt; binary_checksum( coalesce(b.col1,@ValueIfNullStr)
                      , coalesce(b.col2,@ValueIfNullNbr))</description>
		<content:encoded><![CDATA[<p>binary_checksum()&#8230;<br />
<a href="http://msdn.microsoft.com/en-us/library/ms173784.aspx" rel="nofollow">http://msdn.microsoft.com/en-us/library/ms173784.aspx</a><br />
The SQL Server checksum functions can be used to detect changes to records.  So if I am looking to see whether a value changed on a row, I might code something that looks like</p>
<p>UPDATE a<br />
   SET &#8230;<br />
  FROM targetTable a<br />
  JOIN sourceTable b<br />
    ON a.keys = b.keys<br />
 WHERE binary_checksum( a.col1<br />
                      , a.col2)<br />
    &lt;&gt; binary_checksum( b.col1<br />
                      , b.col2)</p>
<p>This is far more efficient than </p>
<p>UPDATE a<br />
   SET &#8230;<br />
  FROM targetTable a<br />
  JOIN sourceTable b<br />
    ON a.keys = b.keys<br />
 WHERE ((a.col1 &lt;&gt; b.col1)<br />
    OR  (a.col2 &lt;&gt; b.col2))</p>
<p>It can be made even more efficient by storing the checksum as a column in your fact/dimension table and indexing it.</p>
<p>BUT&#8230;   MS does not guarantee that rows that have changed will not produce the same checksum.  So if you use it, you have to be careful to Test, Test, Test.  This can occur for only a small percentage of changes, but it can occur.</p>
<p>Further caveat&#8230;   NULLs in your expression list are ignored.  Let&#8217;s say your list looks something like this</p>
<p>SELECT binary_checksum(1,2,NULL,3)<br />
It will return 291 as the result</p>
<p>Similarly, the following will also produce 291 as the result set&#8230;<br />
SELECT binary_checksum(1,NULL,2,3)</p>
<p>Simply casting the columns to a datatype will usually fix this, but I usually coalesce the values with a value unlikely to occur in the dataset.  So my update statement above might look something like this</p>
<p>DECLARE @ValueIfNullStr varchar(10)<br />
      , @ValueIfNullNbr int</p>
<p>SELECT @ValueIfNullStr = '#^#@'<br />
     , @ValueIfNullNbr = -2000000000</p>
<p>UPDATE a<br />
   SET &#8230;<br />
  FROM targetTable a<br />
  JOIN sourceTable b<br />
    ON a.keys = b.keys<br />
 WHERE binary_checksum( coalesce(a.col1,@ValueIfNullStr)<br />
                      , coalesce(a.col2,@ValueIfNullNbr))<br />
    &lt;&gt; binary_checksum( coalesce(b.col1,@ValueIfNullStr)<br />
                      , coalesce(b.col2,@ValueIfNullNbr))</p>
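<p>To make the NULL caveat concrete, here is a minimal Python sketch. It does not call SQL Server: row_hash is a hypothetical stand-in (SHA-256 over the non-null values) that only mimics the "NULLs are skipped" behavior described above, and the sentinel is the numeric one from the DECLARE above.</p>

```python
import hashlib

NULL_NBR = -2000000000  # sentinel for missing numbers, as in the T-SQL above


def row_hash(values):
    """Hypothetical stand-in for a row checksum that skips None values.

    This is NOT SQL Server's binary_checksum algorithm; it only mimics
    the behavior where NULL entries do not contribute to the result.
    """
    kept = [repr(v) for v in values if v is not None]
    return hashlib.sha256("|".join(kept).encode("utf-8")).hexdigest()


def coalesced(values, sentinel):
    """Replace None with a sentinel so the position of a missing value still counts."""
    return [sentinel if v is None else v for v in values]


row_a = [1, 2, None, 3]
row_b = [1, None, 2, 3]

# Skipping the missing value makes two different rows hash identically.
collide = row_hash(row_a) == row_hash(row_b)

# Coalescing first keeps the two rows distinguishable.
distinct = row_hash(coalesced(row_a, NULL_NBR)) != row_hash(coalesced(row_b, NULL_NBR))

print(collide, distinct)
```

<p>The sentinel only helps if it really never occurs in the data, so it is worth checking the column for it before relying on this.</p>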
]]></content:encoded>
	</item>
	<item>
		<title>Comment on SCD via SQL Stored Procedure by Terry</title>
		<link>http://blogs.tallan.com/datareflections/2008/09/08/scd-via-sql-stored-procedure/comment-page-1/#comment-6</link>
		<dc:creator>Terry</dc:creator>
		<pubDate>Tue, 16 Sep 2008 14:08:20 +0000</pubDate>
		<guid isPermaLink="false">http://www.datareflections.net/?p=126#comment-6</guid>
		<description>While I understand that the code here is for illustration, best practice says INSERT INTO should always be followed by a column list.  You do not (necessarily) want to have your procedure break because someone went and added a column to your table.  This is the same argument that says SELECT * should never be used.  (I will use SELECT * if and only if I am in complete control of the column list, i.e., I have a derived table/inline view, etc. embedded in the overall SELECT statement, but even then I will usually specify the list.)</description>
		<content:encoded><![CDATA[<p>While I understand that the code here is for illustration, best practice says INSERT INTO should always be followed by a column list.  You do not (necessarily) want to have your procedure break because someone went and added a column to your table.  This is the same argument that says SELECT * should never be used.  (I will use SELECT * if and only if I am in complete control of the column list, i.e., I have a derived table/inline view, etc. embedded in the overall SELECT statement, but even then I will usually specify the list.)</p>
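<p>A small sketch of the failure mode, using Python's built-in sqlite3 module rather than SQL Server (the exact error differs by engine, but the idea carries over). The dim_customer table and its columns are made up for illustration.</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dim_customer (customer_id INTEGER, name TEXT)")

# Someone later adds a column to the table.
conn.execute("ALTER TABLE dim_customer ADD COLUMN region TEXT")

# Without a column list, the two-value insert no longer matches the table.
try:
    conn.execute("INSERT INTO dim_customer VALUES (1, 'Acme')")
    bare_insert_ok = True
except sqlite3.OperationalError:
    bare_insert_ok = False

# With an explicit column list, the same kind of insert keeps working;
# the new column is simply left NULL.
conn.execute("INSERT INTO dim_customer (customer_id, name) VALUES (2, 'Globex')")
rows = conn.execute(
    "SELECT customer_id, name, region FROM dim_customer"
).fetchall()

print(bare_insert_ok, rows)
```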
]]></content:encoded>
	</item>
	<item>
		<title>Comment on SCD via SQL Stored Procedure by Terry</title>
		<link>http://blogs.tallan.com/datareflections/2008/09/08/scd-via-sql-stored-procedure/comment-page-1/#comment-5</link>
		<dc:creator>Terry</dc:creator>
		<pubDate>Tue, 16 Sep 2008 14:03:34 +0000</pubDate>
		<guid isPermaLink="false">http://www.datareflections.net/?p=126#comment-5</guid>
		<description>My preference is to load the updates first followed by the inserts.  My reason for doing this is that the inserted rows will be added correctly, with no need for update.  If these rows are inserted first in the process, the process to update existing rows will have to at a minimum pass these rows and verify that no change has been made.

As to the update process, again I think I would prefer to reverse the order suggested above.  I would prefer not to have two rows end-dated with high values available simultaneously, which would be the case if you first insert the changed row and only later update the row that is changing.  If current fact data happens to be loading during this process (which it should not be, but I never trust that that will be the case), I would rather have a row fail to insert into the fact table and handle it as an exception than have a fact duplicated with no exception raised.

Two additional points... 
If the update process is treated as a single transaction, then the concern over two "current" rows goes away, but that means that a larger transaction log must be available.  
There might be an argument for a cursor that does both the insert and update, or for batching n updates and inserts (TOP (10000), for example) into a transaction block, so that data consistency is maintained but the transaction log footprint is minimized.</description>
		<content:encoded><![CDATA[<p>My preference is to load the updates first followed by the inserts.  My reason for doing this is that the inserted rows will be added correctly, with no need for update.  If these rows are inserted first in the process, the process to update existing rows will have to at a minimum pass these rows and verify that no change has been made.</p>
<p>As to the update process, again I think I would prefer to reverse the order suggested above.  I would prefer not to have two rows end-dated with high values available simultaneously, which would be the case if you first insert the changed row and only later update the row that is changing.  If current fact data happens to be loading during this process (which it should not be, but I never trust that that will be the case), I would rather have a row fail to insert into the fact table and handle it as an exception than have a fact duplicated with no exception raised.</p>
<p>Two additional points&#8230;<br />
If the update process is treated as a single transaction, then the concern over two &#8220;current&#8221; rows goes away, but that means that a larger transaction log must be available.<br />
There might be an argument for a cursor that does both the insert and update, or for batching n updates and inserts (TOP (10000), for example) into a transaction block, so that data consistency is maintained but the transaction log footprint is minimized.</p>
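<p>The ordering I have in mind can be sketched like this, in Python with sqlite3 standing in for SQL Server. Table and column names, the HIGH_DATE value and the apply_change helper are all hypothetical, and the "change date minus one" end date is a simplification that ignores year boundaries in a yyyyddd key.</p>

```python
import sqlite3

HIGH_DATE = 9999365  # hypothetical open-ended end date in yyyyddd form

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dim_customer ("
    " customer_key INTEGER PRIMARY KEY,"
    " customer_id INTEGER, city TEXT,"
    " begin_eff_dt INTEGER, end_eff_dt INTEGER)"
)
conn.execute(
    "INSERT INTO dim_customer (customer_id, city, begin_eff_dt, end_eff_dt)"
    " VALUES (42, 'Boston', 2008001, ?)",
    (HIGH_DATE,),
)
conn.commit()


def apply_change(conn, customer_id, new_city, change_dt):
    """End-date the current row first, then insert the new version, in one transaction."""
    with conn:  # commits on success, rolls back if either statement fails
        conn.execute(
            "UPDATE dim_customer SET end_eff_dt = ?"
            " WHERE customer_id = ? AND end_eff_dt = ?",
            (change_dt - 1, customer_id, HIGH_DATE),
        )
        conn.execute(
            "INSERT INTO dim_customer (customer_id, city, begin_eff_dt, end_eff_dt)"
            " VALUES (?, ?, ?, ?)",
            (customer_id, new_city, change_dt, HIGH_DATE),
        )


apply_change(conn, 42, "Hartford", 2008100)

# Only one row per customer should carry the high end date.
current_rows = conn.execute(
    "SELECT city FROM dim_customer WHERE customer_id = 42 AND end_eff_dt = ?",
    (HIGH_DATE,),
).fetchall()
print(current_rows)
```

<p>Batching would mean calling something like apply_change for a block of rows inside one transaction instead of one row at a time; the update-before-insert order stays the same.</p>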
]]></content:encoded>
	</item>
	<item>
		<title>Comment on SCD via SQL Stored Procedure by Terry</title>
		<link>http://blogs.tallan.com/datareflections/2008/09/08/scd-via-sql-stored-procedure/comment-page-1/#comment-4</link>
		<dc:creator>Terry</dc:creator>
		<pubDate>Tue, 16 Sep 2008 13:46:45 +0000</pubDate>
		<guid isPermaLink="false">http://www.datareflections.net/?p=126#comment-4</guid>
		<description>“Fact table rows can be joined to the dimension row where the fact row transaction date is within the effective date range of the dimension row. The BEGIN_EFF_DT and END_EFF_DT have an integer data type because they will correspond to key values that tie back to the date dimension, whose primary key has a yyyyddd format (ex: 2008001).”

Let's be clear that INCOMING fact table rows can be joined to the dimension with a range scan to determine the _KEY at load time.    The rows in the actual fact table should only be joined to the dimension table on the _KEY.</description>
		<content:encoded><![CDATA[<p>“Fact table rows can be joined to the dimension row where the fact row transaction date is within the effective date range of the dimension row. The BEGIN_EFF_DT and END_EFF_DT have an integer data type because they will correspond to key values that tie back to the date dimension, whose primary key has a yyyyddd format (ex: 2008001).”</p>
<p>Let&#8217;s be clear that INCOMING fact table rows can be joined to the dimension with a range scan to determine the _KEY at load time.    The rows in the actual fact table should only be joined to the dimension table on the _KEY.</p>
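<p>Roughly what I mean, sketched in Python with sqlite3 as a stand-in for the warehouse. The stage_sales, fact_sales and dim_customer tables and their columns are invented for the example; only BEGIN_EFF_DT, END_EFF_DT and the yyyyddd format come from the post.</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_customer (
    customer_key INTEGER PRIMARY KEY,
    customer_id  INTEGER,
    begin_eff_dt INTEGER,
    end_eff_dt   INTEGER
);
CREATE TABLE stage_sales (customer_id INTEGER, txn_dt INTEGER, amount REAL);
CREATE TABLE fact_sales  (customer_key INTEGER, txn_dt INTEGER, amount REAL);

INSERT INTO dim_customer (customer_key, customer_id, begin_eff_dt, end_eff_dt)
VALUES (1, 42, 2008001, 2008099);
INSERT INTO dim_customer (customer_key, customer_id, begin_eff_dt, end_eff_dt)
VALUES (2, 42, 2008100, 9999365);
INSERT INTO stage_sales (customer_id, txn_dt, amount) VALUES (42, 2008050, 10.0);
INSERT INTO stage_sales (customer_id, txn_dt, amount) VALUES (42, 2008150, 20.0);
""")

# Load time: a range scan on the effective dates resolves the surrogate key once.
conn.execute("""
INSERT INTO fact_sales (customer_key, txn_dt, amount)
SELECT d.customer_key, s.txn_dt, s.amount
  FROM stage_sales s
  JOIN dim_customer d
    ON d.customer_id = s.customer_id
   AND s.txn_dt BETWEEN d.begin_eff_dt AND d.end_eff_dt
""")

# Query time: the stored fact rows join to the dimension on the surrogate key only.
rows = conn.execute("""
SELECT f.txn_dt, f.customer_key
  FROM fact_sales f
  JOIN dim_customer d ON d.customer_key = f.customer_key
 ORDER BY f.txn_dt
""").fetchall()
print(rows)
```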
]]></content:encoded>
	</item>
</channel>
</rss>
