Splitting a sentence on the basis of conjunctions

KhanKtk

Here is what I have done so far. The problem is that if a conjunction appears twice in a sentence, the code doesn't work for the second appearance of that conjunction. Can any expert please help?

private void SplitSentence_Click(object sender, EventArgs e)
{
    richTextBox2.Text = "";
    richTextBox3.Text = "";
    string[] keywords = { " or ", " and ", " hence", "so that", "however", " because" };
    string[] sentences = SentenceTokenizer(richTextBox1.Text);
    string remSentence;
    foreach (string sentence in sentences)
    {
        remSentence = sentence;
        richTextBox3.Text = remSentence;
        for (int i = 0; i < keywords.Length; i++)
        {
            if ((remSentence.Contains(keywords[i]))) // || (remSentence.IndexOf(keywords[i]) > 0))
            {
                richTextBox2.Text += remSentence.Substring(0, remSentence.IndexOf(keywords[i])) + '\n' + keywords[i] + '\n';
                remSentence = remSentence.Substring(remSentence.IndexOf(keywords[i]) + keywords[i].Length);
            }
        }
        richTextBox2.Text += remSentence;
    }
}

public static string[] SentenceTokenizer(string text)
{
    char[] sentdelimiters = new char[] { '.', '?', '۔', '؟', '\r', ':', '-' };
    // '{ ',' }', '( ', ' )', ' [', ']', '>', '<','-', '_', '= ', '+','|', '\\', ':', ';', ' ', '\'', ',', '.', '/', '?', '~', '!','@', '#', '$', '%', '^', '&', '*', ' ', '\r', '\n', '\t'};
    // text.Remove('\n');
    return text.Split(sentdelimiters, StringSplitOptions.RemoveEmptyEntries);
}

Kornfeld Eliyahu Peter

I think you need to learn some regex; it can help you out:
http://www.regular-expressions.info/
http://regex.learncodethehardway.org/

Bernhard Hiller

if ((remSentence.Contains(keywords[i])))

That is where your trouble starts: the check is executed only once per keyword, regardless of the number of occurrences. You'd better use a function to split the sentence and apply that function recursively to the resulting sub-sentences. And yes: regular expressions are preferred.
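A minimal sketch of the recursive idea, not a definitive implementation: the class name RecursiveSplitter and its Split method are hypothetical names introduced here for illustration. It splits on the first keyword, then applies itself to each resulting piece with the remaining keywords, keeping every keyword as its own entry so the original order is preserved.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of the original code in this thread.
public static class RecursiveSplitter
{
    // Splits 'text' on keywords[index] (every occurrence), then recurses into
    // each piece with the remaining keywords. Keywords are kept as separate
    // entries between the pieces they separated.
    public static List<string> Split(string text, string[] keywords, int index = 0)
    {
        var result = new List<string>();

        // No keywords left: the piece is final (empty pieces are dropped).
        if (index >= keywords.Length)
        {
            if (text.Length > 0) result.Add(text);
            return result;
        }

        // string.Split with a string[] separator finds all occurrences.
        string[] parts = text.Split(new[] { keywords[index] }, StringSplitOptions.None);
        for (int p = 0; p < parts.Length; p++)
        {
            if (p > 0) result.Add(keywords[index]);
            result.AddRange(Split(parts[p], keywords, index + 1));
        }
        return result;
    }
}
```

Each entry of the returned list could then be appended to richTextBox2 followed by '\n', as in the original click handler.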

KhanKtk

Here is what I did using regex. It works well, but by splitting with a regex I lose control over the word "and" for further processing. I have a lexicon of 20 words that normally appear before "and" (اور) in Urdu. As a next step, I want to check the word before "and" against the lexicon: if it is found, the sentence is split there; otherwise the complete sentence is displayed.

private void button1_Click(object sender, EventArgs e)
{
    richTextBox2.Text = "";
    richTextBox3.Text = "";
    string[] sentences = SentenceTokenizer(richTextBox1.Text);
    string remSentence;
    // These are Urdu conjunctions. I am actually working on the Urdu language.
    Regex r = new Regex("(کہ |اور | تاکہ| مگر | تاہم | کیونکہ | لیکن )");
    foreach (string sentence in sentences)
    {
        remSentence = sentence;
        remSentence = r.Replace(remSentence, "|");
        string[] phrases = remSentence.Split('|');
        for (int i = 0; i < phrases.Length; i++)
        {
            richTextBox2.Text += phrases[i] + '\n';
        }
    }
}
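One possible way to keep control over the conjunction is to iterate over the regex matches instead of replacing them, so the word before each match can be inspected. The following is only a sketch under assumptions: ConjunctionSplitter and SplitIfPrecededByLexiconWord are hypothetical names, the lexicon is assumed to be a set of exact word forms, and words are assumed to be separated by whitespace.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original code in this thread.
public static class ConjunctionSplitter
{
    // Splits 'sentence' at 'conjunction' only where the word immediately
    // before the conjunction is contained in 'lexicon'. If no occurrence
    // qualifies, the result is the complete sentence as a single entry.
    public static List<string> SplitIfPrecededByLexiconWord(
        string sentence, string conjunction, ISet<string> lexicon)
    {
        var phrases = new List<string>();

        // Group 1 captures the word before the conjunction; the conjunction
        // itself must be surrounded by whitespace.
        var pattern = new Regex(@"(\S+)\s+" + Regex.Escape(conjunction) + @"\s+");

        int start = 0;
        foreach (Match m in pattern.Matches(sentence))
        {
            Group previousWord = m.Groups[1];
            if (!lexicon.Contains(previousWord.Value))
                continue; // word not in lexicon: do not split here

            // The phrase ends right after the preceding word.
            int end = previousWord.Index + previousWord.Length;
            phrases.Add(sentence.Substring(start, end - start));
            start = m.Index + m.Length; // continue after the conjunction
        }

        phrases.Add(sentence.Substring(start));
        return phrases;
    }
}
```

For example, with the lexicon held in a HashSet<string>, each sentence from SentenceTokenizer could be passed through this method with "اور" as the conjunction before the phrases are written to richTextBox2.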

KhanKtk

Would you please share a code modification?

Bernhard

I am not Bernhard Hiller.

Lost User

Something like the below will loop for each keyword until there are no more matches left for that keyword, but it won't be perfect because it will skip over other keywords while it's looking. For example, with "if you had this and that or the other and something" you'd strip out all the "and"s before stripping out the "or".

What you need to do is:

Repeat
    Find the first occurrence of ANY of the keywords in your sentence.
    Split the sentence.
Until no occurrences found

Using the IndexOf method you can loop through your keywords, finding the lowest non-negative value of IndexOf (it returns -1 when the keyword is not found, and 0 is a valid match at the start of the sentence) and storing that word. When the loop finishes, split the sentence using that word.

// Goes inside the existing for loop over keywords (index i).
bool finished = false;
while (!finished)
{
    // Look up the position once; -1 means the keyword no longer occurs.
    int pos = remSentence.IndexOf(keywords[i], StringComparison.Ordinal);
    if (pos >= 0)
    {
        richTextBox2.Text += remSentence.Substring(0, pos) + '\n' + keywords[i] + '\n';
        remSentence = remSentence.Substring(pos + keywords[i].Length);
    }
    else
    {
        finished = true;
    }
}
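The repeat/until approach of always splitting at the earliest occurrence of any keyword could be sketched roughly as follows. This is an illustration, not tested against the original form: EarliestKeywordSplitter and its Split method are hypothetical names, and ordinal comparison is an assumption chosen to match what string.Contains does.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of the original code in this thread.
public static class EarliestKeywordSplitter
{
    // Repeatedly finds the earliest occurrence of ANY keyword, emits the text
    // before it and the keyword itself, and continues with the remainder
    // until no keyword is found.
    public static List<string> Split(string sentence, string[] keywords)
    {
        var pieces = new List<string>();
        string rest = sentence;

        while (true)
        {
            int bestPos = -1;
            string bestKeyword = null;

            foreach (string keyword in keywords)
            {
                if (string.IsNullOrEmpty(keyword)) continue; // avoid endless loop

                // IndexOf returns -1 when not found; 0 is a valid position.
                int pos = rest.IndexOf(keyword, StringComparison.Ordinal);
                if (pos >= 0 && (bestPos < 0 || pos < bestPos))
                {
                    bestPos = pos;
                    bestKeyword = keyword;
                }
            }

            if (bestKeyword == null) break; // no keyword left in the remainder

            pieces.Add(rest.Substring(0, bestPos));
            pieces.Add(bestKeyword);
            rest = rest.Substring(bestPos + bestKeyword.Length);
        }

        pieces.Add(rest);
        return pieces;
    }
}
```

Because the earliest match wins on every pass, keywords are handled in the order they appear in the sentence rather than in the order they appear in the keywords array.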