

Re: Parsing bad HTML

I'd say there are two questions you have to decide before you can 
tell the best approach:

1: What do you *want* that to parse to?

2: What other cases do you need to deal with?

For #1, I'd assume that you want "67<eight" to be treated as if it 
were "67&lt;8", and you want the pointies inside the alt attribute to 
really be there, presumably also via &lt; (you might instead want to 
strip the "b" pseudo-tags entirely, or insert an "<eight>" tag for 
some reason, or....).

For #2, is there a recurring pattern/problem you need fixed? If not, 
then there's no point in *programming* a solution of any kind. If so, 
what *is* the pattern? From this example it might be that you want to 
treat a less-than as literal data instead of a tag-start in any one 
of these situations:

a) after a sequence of digits

b) before "eight" (or before "e", or before any spelled-out name of a 
number, or....)

c) whenever the following text can't form a well-formed tag (this 
might be too draconian, though)

d) whenever the following XML Name characters (in this case "eight") 
don't form a known HTML tag name.
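
To make (d) concrete, here is a minimal Python sketch. The tag list `KNOWN_TAGS` and the function name `escape_unknown_tag_starts` are my own placeholders, and the name pattern is narrower than real XML Name characters, so treat it as an illustration of the rule rather than a finished cleaner.

```python
import re

# Hypothetical, deliberately tiny tag list; extend it for real data.
KNOWN_TAGS = {"a", "b", "br", "i", "img", "p", "sub", "sup"}

# A "<", an optional "/", then an optional run of ASCII name characters.
# (Assumption: ASCII names only; real XML Names allow more characters.)
TAG_START = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)?")

def escape_unknown_tag_starts(text):
    """Rule (d): keep "<" only when a known tag name follows it."""
    def fix(match):
        name = match.group(2)
        if name and name.lower() in KNOWN_TAGS:
            return match.group(0)            # looks like a real tag, leave it
        return "&lt;" + match.group(0)[1:]   # treat "<" as literal data
    return TAG_START.sub(fix, text)

sample = '<sub>123</sub>4567<eight<img src="file.gif" alt="<b>hello</b>">'
print(escape_unknown_tag_starts(sample))
```

On the example from the question this escapes only the "<" before "eight". It leaves the pointies inside the alt attribute alone, because "b" is a known tag name; those need a separate rule.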

As is always true, a finite number of examples can be accounted for 
by an infinite number of different models; but to implement something 
you have to take your pick. If there's a lot of data involved, you 
would be well advised to scan around it and determine what kinds of 
errors there are, then solve the most frequent ones first.
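
One rough way to do that scan, sketched in Python: count each suspicious "<" by what surrounds it. The pattern, the three labels, and the names `classify` and `tally_suspects` are assumptions of mine chosen to match cases (a) and (b) above; this is a heuristic, not an HTML parser.

```python
import re
from collections import Counter

# A "<" that is NOT followed by a plausible tag start: optional "/",
# a name, then whitespace, "/" or ">".
SUSPECT_LT = re.compile(r"<(?!/?[A-Za-z][A-Za-z0-9]*[\s/>])")

def classify(text, pos):
    """Label one suspect "<" by its immediate neighbours."""
    if pos > 0 and text[pos - 1].isdigit():
        return "after digit"      # case (a)
    if text[pos + 1:pos + 2].isalpha():
        return "before letter"    # roughly case (b)
    return "other"

def tally_suspects(lines):
    """Count suspect "<" characters per label across all lines."""
    counts = Counter()
    for line in lines:
        for match in SUSPECT_LT.finditer(line):
            counts[classify(line, match.start())] += 1
    return counts
```

Calling `tally_suspects(open(path)).most_common()` on a real file (where `path` is whatever file you are cleaning) would list the labels from most to least frequent, which tells you which rule to write first.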

On the whole, this looks to me like the sort of thing you either fix 
by hand (if there are few enough examples, or if the examples are 
totally unpredictable); or by some simple regexes, such as:

sed 's/<\([a-zA-Z]*\)</\&lt;\1</g'
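
For anyone who would rather not use sed, the same substitution can be sketched in Python, together with a second pass for the pointies inside attribute values. The function names are placeholders of mine, and the attribute pattern assumes double-quoted values with no embedded quotes, which will not hold for all bad HTML.

```python
import re

# Same pattern as the sed command: "<", a run of letters, then another "<".
# The first "<" cannot start a tag, since a tag would have reached ">" first.
STRAY_LT = re.compile(r"<([a-zA-Z]*)<")

# A double-quoted value right after "=" (assumption: no escaped quotes).
ATTR_VALUE = re.compile(r'="([^"]*)"')

def escape_stray_lt(text):
    """Escape a "<" that is followed by letters and then another "<"."""
    return STRAY_LT.sub(r"&lt;\1<", text)

def escape_attr_pointies(text):
    """Escape "<" and ">" that appear inside quoted attribute values."""
    def fix(match):
        value = match.group(1).replace("<", "&lt;").replace(">", "&gt;")
        return '="' + value + '"'
    return ATTR_VALUE.sub(fix, text)

sample = '<sub>123</sub>4567<eight<img src="file.gif" alt="<b>hello</b>">'
print(escape_attr_pointies(escape_stray_lt(sample)))
```

The two passes are independent, so either one can be dropped if it turns out the data never has that kind of error.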


At 12:22 PM -0800 11/13/08, Paul M wrote:
>I use tidy to clean up bad html docs. It does a pretty good job of 
>converting html => strict xhtml
>However, the following is a bit too much
><sub>123</sub>4567<eight<img src="file.gif" alt="<b>hello</b>">
>The problem is with 7<eight. Stray < and > seem to make tidy choke. 
>What is the best method of handling this? I am leaning toward perl 
>and regexp, but am hoping to avoid this. Maybe a Java solution? And 
>tidy solutions?


Steve DeRose -- http://www.derose.net

