Parse 17,000 text files and insert into database
-
Hey Gang...I've got a folder with about 17,000 text files in it. I need to parse those text files and insert my parsed results into a SQL database. I've gotten it to where I'm doing about 4-5 files per second, but I need it to be faster than that. The text files aren't straightforward either. An example of the data in a text file is:

World Clock Location Entries=INXX0102|Pune|India
Weather Location ID=48226|Detroit| MI (48226)
English/Metric Units=0
CLNAME001=
CLNUMBER001=xxx-xxx-xxxx
CLTYPE001=4
CLDATE001=11/09/16
CLTIME001=18:07
CLDURATION001=
CLBRIDGEDFLAG001=0
CLMISSEDCNTR001=1
CLBCALBL001=
CLNAME002=
CLNUMBER002=xxx-xxx-xxxx
CLTYPE002=4
CLDATE002=11/09/16
CLTIME002=17:59
CLDURATION002=
CLBRIDGEDFLAG002=0
CLMISSEDCNTR002=1
CLBCALBL002=
CLNAME003=
CLNUMBER003=xxxxxxxxxxxx
CLTYPE003=3
CLDATE003=11/09/16
CLTIME003=16:57
CLDURATION003= 1:54
CLBRIDGEDFLAG003=0
CLMISSEDCNTR003=0
CLBCALBL003=
etc......

This is a backup text file of an AVAYA 96xx phone. What you see above is 3 calls from the phone's call history. There's more in that text file than just calls though, so to get just call info, I grab all the lines that start with "CL". Here's a blurb of my code:
while ((line = file.ReadLine()) != null)
{
    if (line.Substring(0, 2) == "CL")
    {
        try
        {
            string[] strArray = line.Split("=".ToCharArray());
            string key = strArray[0];
            string str = strArray[1];

One call is made up of 9 elements, so CLNAME, CLNUMBER, CLTYPE, CLDATE, CLTIME, CLDURATION, CLBRIDGEDFLAG, CLMISSEDCNTR, CLBCALBL all make up one call. My question to you is, how would you go about parsing this out and inserting into a database? Am I going about it the right way? Below is my complete code:
public static void letsdoit(SqlConnection con)
{
    string[] files;
    string line;
    int counter = 0;

    using (UNCAccessWithCredentials unc = new UNCAccessWithCredentials())
    {
        if (unc.NetUseWithCredentials(@"\\ql1telutil1\c$\inetpub\wwwroot", "xxxxxxxxxx", "xx", "xxxxxx"))
        {
            files = Directory.GetFiles(@"\\ql1telutil1\c$\inetpub\wwwroot\backup96XX");
            foreach (string f in files)
            {
                sqlString = null;
                int myCounter = 0;
                List<Int32> myCountList = new List<Int32>();
                List<Int32> UniqueC
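For context on the parsing half of the question, here is a minimal sketch of one way the CL lines could be grouped into per-call records, keyed on the 3-digit entry number. This is not the poster's actual code; the names and the dictionary-of-dictionaries shape are illustrative assumptions.

// Sketch: accumulate the 9 CLxxx fields of each call under its entry index.
var calls = new Dictionary<int, Dictionary<string, string>>();
string line;
while ((line = file.ReadLine()) != null)
{
    if (!line.StartsWith("CL")) continue;       // also avoids Substring errors on short lines
    int eq = line.IndexOf('=');
    if (eq < 6) continue;                       // shortest valid key: "CL" + field + "001"
    string key = line.Substring(0, eq);         // e.g. "CLNAME001"
    string value = line.Substring(eq + 1);
    string field = key.Substring(2, key.Length - 5);          // "NAME"
    int index;
    if (!int.TryParse(key.Substring(key.Length - 3), out index)) continue;  // 1
    Dictionary<string, string> call;
    if (!calls.TryGetValue(index, out call))
        calls[index] = call = new Dictionary<string, string>();
    call[field] = value;                        // the 9 fields accumulate per index
}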
So, how many exceptions are you generating? Each exception adds some overhead, so you should be trying to improve the parsing so it generates fewer of them. (For those that do not believe exceptions add overhead: I DON'T CARE WHAT YOU THINK; SAVE IT FOR SOMEONE WHO DOES.) Starting out, you should stick to timing the parsing; at this point, you don't know whether the parsing is the problem, or the database inserts, or something else.
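As a sketch of that split timing, something like the following would separate the two phases (the two method names are placeholders for whatever the real code does):

var sw = System.Diagnostics.Stopwatch.StartNew();
ParseAllFiles();                  // hypothetical: the loop with the INSERTs commented out
Console.WriteLine("parse only:   {0}", sw.Elapsed);

sw.Restart();
ParseAndInsertAllFiles();         // hypothetical: the full loop, inserts included
Console.WriteLine("parse+insert: {0}", sw.Elapsed);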
-
Hey Gerry...thanks for the reply! No exceptions yet. I stopped the code after about 1,600 text files, and it didn't generate any errors. Those 1,600 text files came to around 130,000 records. I commented out the insert statement, and I was flying through the text files: 24 per second. So I'm guessing that means my insert command is the problem. I believe they prohibit bulk inserts here; I'll have to check again and see whether I'm able to do that. I'm also thinking of maybe inserting multiple records with one insert statement? If I could get it to about 10 files per second, I'd be happy with that, I think. Thanks again! Dave
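For the "multiple records with one insert statement" idea, a sketch of what that could look like with parameters (the table and column names are made up; note SQL Server caps a command at 2,100 parameters, so each batch must stay below that):

// using System.Text; using System.Data.SqlClient;
// Build INSERT ... VALUES (@n0,@u0,@d0), (@n1,@u1,@d1), ... for one batch of rows.
var sb = new StringBuilder(
    "INSERT INTO dbo.CallHistory (CallName, CallNumber, CallDate) VALUES ");
using (var cmd = new SqlCommand { Connection = con })
{
    for (int i = 0; i < rows.Count; i++)      // rows: a List<string[]> of parsed values
    {
        if (i > 0) sb.Append(", ");
        sb.AppendFormat("(@n{0}, @u{0}, @d{0})", i);
        cmd.Parameters.AddWithValue("@n" + i, rows[i][0]);
        cmd.Parameters.AddWithValue("@u" + i, rows[i][1]);
        cmd.Parameters.AddWithValue("@d" + i, rows[i][2]);
    }
    cmd.CommandText = sb.ToString();
    cmd.ExecuteNonQuery();                    // many rows, one round trip
}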
-
- Your code is prone to SQL injection (if the source files can be modified).
- You might want to process the files in order.
- Your code does not validate that the data is what you expect.
- There are a bunch of hard-coded constants.

If you're only doing this once, none of that matters much, but then it also wouldn't matter much if it took about an hour to process the files. One thing that would have a significant impact on performance is to keep the connection open for the whole import, if that's not already the case.
Philippe Mori
-
davers wrote:
I believe they prohibit bulk inserts here
Does that mean you can't prepare a DataTable and send the lot via a DataAdapter? That doesn't use the SqlBulkCopy class, but it does mean the insert becomes one message in each direction, instead of message-response-message-response-..., which is significantly slower because each round trip is a fresh command parse and construct at the SQL end. Performing Batch Operations Using DataAdapters[^]
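A sketch of that batched-adapter shape, assuming a hypothetical CallHistory table; the key details are UpdatedRowSource.None on the insert command and a non-default UpdateBatchSize:

// using System.Data; using System.Data.SqlClient;
var table = new DataTable();
table.Columns.Add("CallName", typeof(string));
table.Columns.Add("CallNumber", typeof(string));
// ... one column per parsed field, one table.Rows.Add(...) per call ...

var adapter = new SqlDataAdapter();
adapter.InsertCommand = new SqlCommand(
    "INSERT INTO dbo.CallHistory (CallName, CallNumber) VALUES (@name, @number)", con);
adapter.InsertCommand.Parameters.Add("@name", SqlDbType.VarChar, 50, "CallName");
adapter.InsertCommand.Parameters.Add("@number", SqlDbType.VarChar, 50, "CallNumber");
adapter.InsertCommand.UpdatedRowSource = UpdateRowSource.None;  // required for batching
adapter.UpdateBatchSize = 500;                                  // rows per round trip
adapter.Update(table);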
Bad command or file name. Bad, bad command! Sit! Stay! Staaaay...
-
I would think a RegEx would work a bit faster.
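If you go that route, here is a sketch of a compiled pattern that both filters the CL lines and splits out the pieces in one pass (the group names are my own):

// using System.Text.RegularExpressions;
static readonly Regex ClLine = new Regex(
    @"^CL(?<field>[A-Z]+)(?<index>\d{3})=(?<value>.*)$", RegexOptions.Compiled);

Match m = ClLine.Match(line);
if (m.Success)
{
    string field = m.Groups["field"].Value;            // e.g. "NAME"
    int index = int.Parse(m.Groups["index"].Value);    // e.g. 1
    string value = m.Groups["value"].Value;            // may be empty
}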
"There are three kinds of lies: lies, damned lies and statistics." - Benjamin Disraeli
-
Indeed, parsing is usually faster than inserting each record separately using SQL. I did a quick search just now and this[^] resembles a possible approach using parameterized SQL, but it is also possible without an explicit "BULK" operation; just a parameterized insert query would do (wrapped in a SQL transaction...). Cheers,
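A sketch of that shape, with one transaction wrapped around all the inserts for a file (the table, columns, and the `calls` dictionary from the earlier parsing sketch are assumptions):

using (SqlTransaction tx = con.BeginTransaction())
using (SqlCommand cmd = new SqlCommand(
    "INSERT INTO dbo.CallHistory (CallName, CallNumber, CallDate) " +
    "VALUES (@name, @number, @date)", con, tx))
{
    cmd.Parameters.AddWithValue("@name", "");
    cmd.Parameters.AddWithValue("@number", "");
    cmd.Parameters.AddWithValue("@date", "");
    foreach (var call in calls.Values)            // the records parsed from one file
    {
        cmd.Parameters["@name"].Value = call["NAME"];
        cmd.Parameters["@number"].Value = call["NUMBER"];
        cmd.Parameters["@date"].Value = call["DATE"];
        cmd.ExecuteNonQuery();
    }
    tx.Commit();                                  // one commit per file instead of per row
}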
-
I would do this completely differently:
- Read the text files into memory.
- Insert each record into a data table in memory, with no transformation.
- Bulk copy the raw data into a staging table in SQL Server.
- Use a stored procedure to transform the data.
Bulk copy and stored procedures will reduce the impact of the primary bottlenecks: IO and the transforms.
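A sketch of the bulk-copy half of that plan, assuming a raw staging table and a transform procedure exist (both names are made up):

// using System.Data; using System.Data.SqlClient;
// Stage the raw rows in one shot, then let the server do the reshaping.
using (var bulk = new SqlBulkCopy(con))
{
    bulk.DestinationTableName = "dbo.CallHistoryStaging";
    bulk.BatchSize = 5000;
    bulk.WriteToServer(table);          // the untransformed in-memory DataTable
}
using (var cmd = new SqlCommand("dbo.TransformCallHistory", con))
{
    cmd.CommandType = CommandType.StoredProcedure;
    cmd.ExecuteNonQuery();              // pivot/clean the staged rows server-side
}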
Never underestimate the power of human stupidity RAH
-
Two simple things to consider:
- At 4-5 files per second, those 17,000 files will take about an hour. So what? Run it sometime during the night when nobody else is at work.
- SQL statements can be Prepare'd. That means you have to change it to a parameterized query first, call Prepare, and later only set the fresh parameter values before calling ExecuteNonQuery.
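A sketch of that prepared variant; the main wrinkle is that Prepare requires every parameter to have an explicit type and size up front (table and column names are assumptions):

using (SqlCommand cmd = new SqlCommand(
    "INSERT INTO dbo.CallHistory (CallName, CallNumber) VALUES (@name, @number)", con))
{
    cmd.Parameters.Add("@name", SqlDbType.VarChar, 50);    // sizes are mandatory before Prepare
    cmd.Parameters.Add("@number", SqlDbType.VarChar, 50);
    cmd.Prepare();                     // statement is parsed and compiled once
    foreach (var call in calls.Values)
    {
        cmd.Parameters["@name"].Value = call["NAME"];
        cmd.Parameters["@number"].Value = call["NUMBER"];
        cmd.ExecuteNonQuery();         // only parameter values are sent from here on
    }
}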
-
I recently had a similar problem and gained some speed with a two-step change: parallelize the foreach (string f in files) loop, and ditch the file streaming in favor of File.ReadAllLines(...). Basically the disk was my bottleneck, and it performed better doing bulk reads of data instead of streaming line by line; after each bulk read I had a chunk of data in memory that I could process, bound only by the limitations of my CPU. This does, of course, assume you have some memory to spare, because you're reading in entire files at a time. There was also a limit to how many concurrent file reads I could throw at my disk; beyond a certain number the performance dropped significantly, so I had to throttle back how many parallel file reads my app could do. I did some trial and error to find the best number; YMMV. That would also require you to redo your database access, since you probably don't want to be doing that from multiple threads, but that kind of goes along with some of the other suggestions you've received. Collecting the data in memory and doing a bulk insert later on, instead of individual inserts, would make a big difference too. A sketch of the pattern follows; the degree of parallelism is a placeholder you would have to tune by trial and error, as described:

// using System.Collections.Concurrent; using System.IO; using System.Threading.Tasks;
var parsed = new ConcurrentBag<string[]>();
Parallel.ForEach(
    files,
    new ParallelOptions { MaxDegreeOfParallelism = 4 },   // throttle to what the disk tolerates
    f =>
    {
        foreach (string l in File.ReadAllLines(f))        // one bulk read per file
            if (l.StartsWith("CL"))
                parsed.Add(l.Split('='));
    });
// Hand "parsed" to a single thread for the database work afterwards.
:badger: