Splitting sentences on the basis of conjunctions
-
Here is what I have done so far. The problem is that if a conjunction appears twice in a sentence, the code does not handle the second occurrence of that conjunction. Can anyone help?

private void SplitSentence_Click(object sender, EventArgs e)
{
    richTextBox2.Text = "";
    richTextBox3.Text = "";
    string[] keywords = { " or ", " and ", " hence", "so that", "however", " because" };
    string[] sentences = SentenceTokenizer(richTextBox1.Text);
    string remSentence;
    foreach (string sentence in sentences)
    {
        remSentence = sentence;
        richTextBox3.Text = remSentence;
        for (int i = 0; i < keywords.Length; i++)
        {
            if (remSentence.Contains(keywords[i])) // || (remSentence.IndexOf(keywords[i]) > 0))
            {
                richTextBox2.Text += remSentence.Substring(0, remSentence.IndexOf(keywords[i])) + '\n' + keywords[i] + '\n';
                remSentence = remSentence.Substring(remSentence.IndexOf(keywords[i]) + keywords[i].Length);
            }
        }
        richTextBox2.Text += remSentence;
    }
}

public static string[] SentenceTokenizer(string text)
{
    char[] sentdelimiters = new char[] { '.', '?', '۔', '؟', '\r', ':', '-' };
    // '{ ',' }', '( ', ' )', ' [', ']', '>', '<','-', '_', '= ', '+','|', '\\', ':', ';', ' ', '\'', ',', '.', '/', '?', '~', '!','@', '#', '$', '%', '^', '&', '*', ' ', '\r', '\n', '\t'};
    // text.Remove('\n');
    return text.Split(sentdelimiters, StringSplitOptions.RemoveEmptyEntries);
}
-
I think you need to learn some regex; it can help you out: http://www.regular-expressions.info/ http://regex.learncodethehardway.org/
I'm not questioning your powers of observation; I'm merely remarking upon the paradox of asking a masked man who he is. (V)
-
if ((remSentence.Contains(keywords[i])))
That's where your trouble starts: the check runs only once per keyword, regardless of the number of occurrences. You'd be better off using a function to split the sentence and applying that function recursively to the resulting sub-sentences. And yes: regular expressions are preferred.
Here is what I did using regex. It works well. But by splitting with a regex this way, I lost control over the word "and" for further processing. I have a lexicon of 20 words that normally appear before "and" (اور) in Urdu. As the next step, I want a way to check the word before "and" against the lexicon: if it is found, the sentence is split; otherwise the complete sentence is displayed.

private void button1_Click(object sender, EventArgs e)
{
    richTextBox2.Text = "";
    richTextBox3.Text = "";
    string[] sentences = SentenceTokenizer(richTextBox1.Text);
    string remSentence;
    // These are Urdu conjunctions; I am actually working on the Urdu language.
    Regex r = new Regex("(کہ |اور | تاکہ| مگر | تاہم | کیونکہ | لیکن )");
    foreach (string sentence in sentences)
    {
        remSentence = sentence;
        // Replace every conjunction with a marker, then split on the marker.
        remSentence = r.Replace(remSentence, "|");
        string[] phrases = remSentence.Split('|');
        for (int i = 0; i < phrases.Length; i++)
        {
            richTextBox2.Text += phrases[i] + '\n';
        }
    }
}
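One way to keep control over the conjunction is to put it in a capturing group: Regex.Split then returns the captured conjunctions between the text pieces instead of discarding them, so the word before each one can be checked. The following is only a sketch under assumptions: the class name ConjunctionSplitter, the guardedConjunction parameter, and the English sample words are made up for illustration, and it assumes conjunctions are surrounded by whitespace. For Urdu you would pass the Urdu conjunctions, "اور" as the guarded one, and your own 20-word lexicon.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class ConjunctionSplitter
{
    // Splits at every conjunction, except that guardedConjunction only causes
    // a split when the word right before it is in the lexicon.
    // All names here are hypothetical, not taken from the original post.
    public static List<string> Split(string sentence, string[] conjunctions,
                                     string guardedConjunction, ISet<string> lexicon)
    {
        // The capturing group makes Regex.Split keep each matched conjunction,
        // so the result alternates: text, conjunction, text, ..., text.
        string pattern = @"\s+(" +
            string.Join("|", conjunctions.Select(Regex.Escape)) + @")\s+";
        string[] parts = Regex.Split(sentence, pattern);

        var phrases = new List<string>();
        string current = parts[0];
        for (int i = 1; i + 1 < parts.Length; i += 2)
        {
            string conjunction = parts[i];
            string next = parts[i + 1];

            // Last whitespace-separated word of the text collected so far.
            string wordBefore = current
                .Split((char[])null, StringSplitOptions.RemoveEmptyEntries)
                .LastOrDefault();

            bool split = conjunction != guardedConjunction
                || (wordBefore != null && lexicon.Contains(wordBefore));

            if (split)
            {
                phrases.Add(current.Trim());
                current = next;
            }
            else
            {
                // Not a real break: glue the pieces back together.
                current = current + " " + conjunction + " " + next;
            }
        }
        phrases.Add(current.Trim());
        return phrases;
    }
}
```

With a lexicon containing only "tea", calling ConjunctionSplitter.Split("I like tea and coffee but she sings and dances", new[] { "and", "but" }, "and", lexicon) breaks at the first "and" and at "but", and leaves "she sings and dances" whole because "sings" is not in the lexicon. Like the Replace/Split version above, this drops the conjunctions from the output; add them to the list as well if you need them.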
-
Something like the code below will loop on each keyword until there are no more matches left for that keyword. It won't be perfect, though, because it skips over other keywords while it is looking: for example, in "if you had this and that or the other and something" you'd strip out all the "and"s before stripping out the "or".

What you need to do is:

Repeat
    Find the first occurrence of ANY of the keywords in your sentence.
    Split the sentence there.
Until no occurrences are found.

Using the IndexOf method, you can loop through your keywords, finding the lowest non-negative value of IndexOf (it returns -1 when the keyword is absent, and 0 is a valid match at the start) and storing that word. When the loop finishes, split the sentence using that word.
// Goes inside the existing for loop over keywords, in place of the single if.
bool finished = false;
while (!finished)
{
    int pos = remSentence.IndexOf(keywords[i]);
    if (pos >= 0)
    {
        richTextBox2.Text += remSentence.Substring(0, pos) + '\n' + keywords[i] + '\n';
        remSentence = remSentence.Substring(pos + keywords[i].Length);
    }
    else
    {
        finished = true;
    }
}
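The "find the first occurrence of ANY keyword" loop described above could look roughly like the following. This is a sketch, not a drop-in replacement: the class and method names (KeywordSplitter, SplitOnAny) are made up, and it returns the pieces as a list rather than writing to richTextBox2, so you can decide how to display them.

```csharp
using System;
using System.Collections.Generic;

public static class KeywordSplitter
{
    // Repeatedly finds the earliest occurrence of any keyword, cuts the
    // sentence there, and continues with the remainder.
    // Returns the text pieces and the keywords in the order they occur.
    public static List<string> SplitOnAny(string sentence, string[] keywords)
    {
        var pieces = new List<string>();
        string rest = sentence;

        while (true)
        {
            int bestPos = -1;
            string bestWord = null;

            foreach (string keyword in keywords)
            {
                if (keyword.Length == 0) continue; // would never advance

                // IndexOf returns -1 when absent; 0 is a valid position.
                int pos = rest.IndexOf(keyword, StringComparison.Ordinal);
                if (pos >= 0 && (bestPos < 0 || pos < bestPos))
                {
                    bestPos = pos;
                    bestWord = keyword;
                }
            }

            if (bestWord == null) break; // no keyword left in the remainder

            pieces.Add(rest.Substring(0, bestPos));
            pieces.Add(bestWord);
            rest = rest.Substring(bestPos + bestWord.Length);
        }

        pieces.Add(rest);
        return pieces;
    }
}
```

In the click handler you could then write something like richTextBox2.Text += string.Join("\n", KeywordSplitter.SplitOnAny(sentence, keywords)); for each sentence. Because the earliest match wins on every pass, the order of the keywords array no longer affects where the sentence is cut.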