Streaming data from one data layer to another
-
I am creating a data layer for a project and I have a scenario where I don't really know which solution is better. The main function of the project/service is to regularly check for changes in several databases and copy the information to a data warehouse (one big database, really). One scenario is when you add a new table to be tracked in the system: a specific task copies the entire table from the site database to the warehouse. It does this by opening a data reader, generating a batch INSERT script until it reaches a limit, and then executing the script against the data warehouse. In pseudocode it looks like this:
using (IDbCommand clientCommand = Connection.CreateCommand())
{
    clientCommand.CommandText = clientSql;
    using (IDataReader clientReader = clientCommand.ExecuteReader())
    {
        while (clientReader.Read())
        {
            AddDataToList(clientReader);
            if (List.Count >= 100)
            {
                using (var com = warehouseConnection.CreateCommand())
                {
                    com.CommandText = GetInsertScript(List);
                    AddedCount += com.ExecuteNonQuery();
                    List.Clear(); // start a new batch
                }
            }
        }
        // (a final flush of any remaining rows follows the loop)
    }
}
It's more complicated than that, with more checks and SQL code. I see two main issues with this:

1: Database logic mixed into business logic, which I want to move into my data layer. This would make the code testable and easier to understand.

2: I don't like that we wait for data from the client and then wait for data to be written to the other database, when this could be done by two tasks that send and receive information asynchronously with each other, via a ConcurrentQueue for instance. That would speed up the process, which is also an issue in some cases.

The problem I am having is that I can't get all the data in one go and then add it to the warehouse. Sometimes the tables are huge, which can lead to memory issues in the service. So I want to process rows as they come in through the client reader and continuously add them to the warehouse, while still keeping the database logic in the data layer. Adding the data is simple, but retrieving it is a bit more complicated. Researching this issue has given me two choices:

Option 1 (pull from the data layer): I could use the yield keyword to return rows as they come through the reader in my data layer:
public IEnumerable<object[]> GetTableContent()
{
    using (var command = new SqlCommand("command", SqlConnection))
    using (var reader = command.ExecuteReader())
    {
        while (reader.Read())
        {
            var items = new object[reader.FieldCount]; // fresh array per yielded row
            reader.GetValues(items);
            yield return items;
        }
    }
}
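The producer/consumer idea from issue 2 could be sketched roughly like this. This is a minimal sketch, not the real service: GetTableContent and WriteBatch here are hypothetical stand-ins for the actual data-layer calls, and I've used a BlockingCollection (which wraps a ConcurrentQueue internally) because its bounded capacity also caps memory use while one task reads and the other writes:

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class TransferSketch
{
    // Stand-in for the data layer's streaming read (hypothetical; in the
    // real service this would be the yield-based GetTableContent above).
    static IEnumerable<object[]> GetTableContent()
    {
        for (int i = 0; i < 250; i++)
            yield return new object[] { i, "row" + i };
    }

    // Stand-in for the warehouse write (hypothetical; the real version
    // would build and execute the batch INSERT script).
    static void WriteBatch(List<object[]> batch) { }

    public static int Run()
    {
        // Bounded capacity: the producer blocks instead of buffering the
        // whole table in memory when the consumer falls behind.
        var rows = new BlockingCollection<object[]>(boundedCapacity: 1000);
        int batchesWritten = 0;

        var producer = Task.Run(() =>
        {
            try
            {
                foreach (var row in GetTableContent())
                    rows.Add(row);         // blocks when the queue is full
            }
            finally
            {
                rows.CompleteAdding();     // signal: no more rows are coming
            }
        });

        var consumer = Task.Run(() =>
        {
            var batch = new List<object[]>(100);
            foreach (var row in rows.GetConsumingEnumerable())
            {
                batch.Add(row);
                if (batch.Count >= 100)
                {
                    WriteBatch(batch);
                    batchesWritten++;
                    batch.Clear();
                }
            }
            if (batch.Count > 0)           // flush the final partial batch
            {
                WriteBatch(batch);
                batchesWritten++;
            }
        });

        Task.WaitAll(producer, consumer);
        return batchesWritten;
    }

    public static void Main() => Console.WriteLine(TransferSketch.Run());
}
```

With 250 stub rows and batches of 100, the consumer writes three batches (100 + 100 + 50). The bounded capacity is what keeps this safe for the huge tables mentioned above: reading and writing overlap, but at most ~1000 rows are in flight at once.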
-
To be honest, neither solution really appeals to me. Have you considered using a database server to do the ETL instead? SQL Server, for instance, provides excellent ETL capabilities.
That is a very interesting suggestion. I can't use it at the moment, because the system is not set up for that, but it's clear this is the way to go. Thank you for that!