Posts Tagged ‘html’

May 6, 2008 0

How to easily parse HTML without RegEx

By J in Uncategorized

I recently discovered HtmlAgilityPack, an HTML parsing library for .NET. It takes away the pain of parsing complicated HTML with regular expressions.
Here’s a very simple example of what you can do with it: I’m extracting the inner HTML from every element in an HTML file that has a CSS class [...]
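
The excerpt above is cut off, so here is a minimal sketch of the idea rather than the post's own code. It assumes the HtmlAgilityPack NuGet package is referenced; the class name `ClassExtractor`, the method name `InnerHtmlByClass`, and the sample class name `note` are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // third-party package, assumed to be installed via NuGet

public static class ClassExtractor
{
    // Hypothetical helper: returns the inner HTML of every element
    // whose class attribute contains the given class token.
    public static List<string> InnerHtmlByClass(string html, string cssClass)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Common XPath 1.0 idiom for matching a single class token.
        // Assumes cssClass contains no quote characters.
        string xpath = "//*[contains(concat(' ', normalize-space(@class), ' '), ' "
                       + cssClass + " ')]";

        var results = new List<string>();
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(xpath);
        if (nodes == null)
        {
            return results; // SelectNodes returns null when nothing matches
        }

        foreach (HtmlNode node in nodes)
        {
            results.Add(node.InnerHtml);
        }
        return results;
    }
}
```

Calling `ClassExtractor.InnerHtmlByClass(html, "note")` would then return one string per matching element, with no regular expressions involved.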


January 5, 2008 3

How to extract URLs (href property) from HTML

By J in Uncategorized

protected ArrayList getURL(string txtIn)
{
    ArrayList outURL = new ArrayList();
    Regex r = new Regex("href\\s*=\\s*(?:(?:\\\"(?<url>[^\\\"]*)\\\")|(?<url>[^\\s]* ))");
    MatchCollection mc1 = r.Matches(txtIn);

    foreach (Match m1 in mc1)
    {
        foreach (Group g in m1.Groups)
        [...]
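
Since the listing above is truncated, the following is a self-contained sketch of the same technique, not a reconstruction of the original method. The names `HrefExtractor` and `GetUrls` are hypothetical, and the pattern is a variant that also accepts single-quoted and unquoted values. It reads the named `url` group directly instead of looping over every group.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class HrefExtractor
{
    // Matches href="...", href='...' or an unquoted href value.
    // .NET allows the same group name in several alternatives.
    private static readonly Regex HrefPattern = new Regex(
        @"href\s*=\s*(?:""(?<url>[^""]*)""|'(?<url>[^']*)'|(?<url>[^\s>""']+))",
        RegexOptions.IgnoreCase);

    // Hypothetical helper: returns every href value found, in document order.
    public static List<string> GetUrls(string html)
    {
        var urls = new List<string>();
        foreach (Match m in HrefPattern.Matches(html))
        {
            urls.Add(m.Groups["url"].Value);
        }
        return urls;
    }
}
```

Like any regex-based approach, this also matches attributes whose names merely end in `href` and text inside comments or scripts, so it is best suited to quick extraction from markup you control.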


November 13, 2007 0

Strip out HTML tags using RegEx

By J in Uncategorized

This code will strip out all the HTML tags and truncate the text to 4 lines.

public static string TruncateText(string txtIn, int newLength)
{
    string txtOut = txtIn;
    string pattern = @"<(.|\n)*?>";

    //Strip out HTML tags
    if (Regex.IsMatch(txtIn, pattern, RegexOptions.None))
    [...]
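
The rest of the method is cut off, so the truncation rule is not visible here. As a rough sketch of the same idea, the version below strips tags with `Regex.Replace` and then, as an assumption, cuts the result to a maximum number of characters. The names `HtmlText` and `StripAndTruncate` are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HtmlText
{
    // Hypothetical helper: removes anything that looks like a tag, then
    // truncates to maxLength characters (assumed to be non-negative).
    public static string StripAndTruncate(string html, int maxLength)
    {
        // Singleline makes '.' match newlines, so tags spanning lines are removed too.
        string text = Regex.Replace(html, "<.*?>", string.Empty, RegexOptions.Singleline);

        if (text.Length <= maxLength)
        {
            return text;
        }
        return text.Substring(0, maxLength);
    }
}
```

`Regex.Replace` returns the input unchanged when nothing matches, so a separate `IsMatch` check is not needed in this variant. HTML entities such as `&amp;` are left as they are.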
