Why Vivaldi’s Reader, and others too, sometimes don’t do what you expect

TL;DR

Vivaldi Read Mode was never meant for pages like Twitter, YouTube comments, or Facebook chats, but for articles.

As a general rule of thumb, you can assume that any continuous or adjacent chunk of text longer than 300 characters counts as content. It may be split into multiple adjacent paragraphs, but it must look content-y enough (i.e. consist of sentences), must not contain too many links, videos or images inside the content area (outside is fine), and must not belong to one of the “stop” classes and IDs such as “comment”, “footer”, “ad”-vertisement and many others.

/TL;DR

Still there?

History

The Vivaldi Read View is, like the reader views in Mozilla Firefox and Apple Safari, based on the Readability(TM) code that was released as a labs experiment by arc90 in about 2009 under an open source license (some versions under MIT, some under the Apache license). Later arc90 changed it to a server-supported version that was available at readability.com.

The Intention Behind It

Readability was never meant to be an ad-blocker, but always an on-demand reader view to switch on for *articles*, meaning: longer passages of text (important!).

It was never intended to be used on pages like Facebook, with its gazillions of short text snippets, on YouTube video comments or Twitter feeds, or generally on any page that does not contain a sizable chunk of text in one article.

It was meant to make reading longer texts distraction-free by removing e.g. advertisements, page navigation, comments and videos or images that don’t belong to the main article content, and to re-style the text with readable fonts and colors to make reading more pleasurable.

How?!

Of course the code is not really “intelligent” (it has to be fast and must not use up too many resources), so it has to rely on heuristics to detect where the main content might be. While it generally works quite well, it may fail on some pages, especially “if the HTML is totally hosed” (not my words, that was a comment by one of the original arc90 developers).

A (simplified and not complete) Explanation:

First steps:

  • Remove all scripts.
  • Remove all styles.
  • Completely ignore HTML block-level elements such as paragraphs and divisions with fewer than 25 characters.
  • Remove HTML block-level elements whose “stop” classes, IDs or tags indicate that they are definitely not content but something else, e.g. navigation, footers, comments or advertisements.
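
These first steps might be sketched roughly like this in Python. This is only an illustration under assumptions: the `Block` type, the function name and the shortened stop list are made up for this example and are not the actual Readability or Vivaldi code. Only the 25-character limit and the example stop words come from the description above.

```python
import re
from dataclasses import dataclass

# Hypothetical, shortened stop list; the real one contains many more entries.
# "ad" only matches as a separate word so that e.g. "download" is not caught.
STOP_WORDS = re.compile(r"comment|footer|nav|(^|[\s_-])ad([\s_-]|$)", re.IGNORECASE)
MIN_TEXT_LENGTH = 25  # blocks with fewer characters are ignored


@dataclass
class Block:
    tag: str           # e.g. "p" or "div"
    class_and_id: str  # class and id attributes joined by a space
    text: str          # plain text without markup


def is_candidate(block: Block) -> bool:
    """Return True if the block survives the first filtering steps."""
    if block.tag in ("script", "style"):
        return False
    if STOP_WORDS.search(block.class_and_id):
        return False
    return len(block.text.strip()) >= MIN_TEXT_LENGTH
```

A block with a class such as “comment-list” is dropped no matter how much text it holds, which is one reason why comment threads never show up in the reader.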

After that the reader loops through all paragraphs, and

  • calculates the overall score for text length by the following formula:
    rounded-down((pure-text character count of a page element)/100)
    and adds it to the parent element (you might see it as a container). This means: a paragraph with fewer than 100 characters of text does not get any bonus at all.
  • adds a base score of 1 for each remaining paragraph to the parent element
  • assigns a score to them based on how content-y they look. This score gets added to their parent node.
  • adds additional scores that are determined by things like the number of commas (+), class names and IDs (+/-), and image and link density (More than 25% of the text is links? Too many images per paragraph? Punish it!).
  • punishes lists, headlines, forms, addresses and some other elements with negative scores because they are normally not part of articles. If they are, they usually sit in the same parent container as the paragraphs of a real article, so the combined score of the parent element is still high enough to count.
  • adds half of the resulting score to the grandparent elements.
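
To make the arithmetic more tangible, here is a minimal sketch of the paragraph scoring in Python. It rests on several assumptions of mine: the `Node` type and the function names are invented, each comma is assumed to be worth one point, and a link-heavy paragraph is simply given no score at all as a stand-in for “punish it”. The class/ID bonuses and the negative scores for lists, forms and so on are left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    name: str
    parent: Optional["Node"] = None
    score: float = 0.0


def paragraph_score(text: str, link_text_length: int = 0) -> float:
    """Score one paragraph from its plain text (simplified)."""
    if not text:
        return 0.0
    # Assumption: more than 25% link text means the paragraph earns nothing.
    if link_text_length / len(text) > 0.25:
        return 0.0
    score = 1.0                # base score per paragraph
    score += len(text) // 100  # rounded-down(character count / 100)
    score += text.count(",")   # assumption: one point per comma
    return score


def add_paragraph(paragraph: Node, text: str, link_text_length: int = 0) -> None:
    """Add the score to the parent and half of it to the grandparent."""
    score = paragraph_score(text, link_text_length)
    parent = paragraph.parent
    if parent is None:
        return
    parent.score += score
    if parent.parent is not None:
        parent.parent.score += score / 2
```

In this sketch a 99-character paragraph without commas earns only the base score of 1, which hints at why pages made of many short snippets rarely collect enough points in any one container.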

Once all that is done, any parent or grandparent with a high enough score is treated as content and displayed; everything else is removed.
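
The final decision can be pictured as a small selection step. The sketch below is an assumption on my side: the threshold value, the function name and the idea of keeping only the single best container are hypothetical, since the description only says that the score must be “high enough”.

```python
from typing import Dict, Optional

# Hypothetical threshold; the real value is not given in this article.
MIN_SCORE = 20.0


def pick_content(scores: Dict[str, float], min_score: float = MIN_SCORE) -> Optional[str]:
    """Return the name of the best-scoring container, or None if nothing qualifies."""
    if not scores:
        return None
    best = max(scores, key=scores.get)
    return best if scores[best] >= min_score else None
```

If no container reaches the threshold, as on a feed of short snippets, the function returns None, which corresponds to the reader finding nothing worth showing.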

You can probably imagine by now how many pitfalls content detection can fall into, so if you see it fail, please take a break and think about what might have caused it this time.

Personal side note (strong language warning)

All in all content detection is a bіtch and can definitely fail on some pages, especially if the “Webmasters” (I call them Crapmasters) don’t know what an HTML validator is and have never heard about structured pages and accessibility. I am speaking from experience: Back in 2009 I started with a userscript and later made an extension (cleanPages, see the old my.opera page on archive.org) based on the full original arc90 code and fine-tuned it for Opera Presto (and later ported it to the new Opera thing). It had over 250k installs, and while it was fun to tweak for better results, it was a hell of a lot of work. I wrote more than 200 versions with generic fixes for “just another couple of new pages that fail”, but in the end I gave in and called it a day: there are too many broken pages out there whose webmasters seemingly do not want people to read the content. Their wish is my command 😉

So please be gentle with the Vivaldi developers. Yes, there is still some fine-tuning to be done, but that is really time-consuming. It will probably have to wait because there are some other, more difficult and bigger things in the pipeline (hint, hint 😀 )

Thank You!

Disclaimer: While I am a “Soprano” (aka external tester for internal builds), all the views in this text are my private views and do not necessarily reflect the views or opinions of Vivaldi (the company) or any of its owners or employees.

7 comments

  1. All in all content detection is a bіtch and can definitely fail on some pages, especially if the “Webmasters” (I call them Crapmasters) don’t know what a HTML validator is and have never heard about structured pages and accessibility.

    Since this is a rather “brute force” method of making a page more readable, I’d suggest investing time in creating a standard that helps webmasters (the proper ones) support this kind of reader. You’d only have to fall back on the “brute force” method when certain tags/attributes/meta aren’t found.

    This is actually one of the reasons why I think the more or less forgotten combination of XML + XSLT (or any other combo that separates content and layout) is not so bad after all. Take the same XML, apply a different XSLT template and CSS stylesheet, and done.

  2. there is a standard and it’s called html5.

    just put your content inside proper tags. divide your page with header, footer, sections, navigation, lists, articles. each of these has its own dedicated html tag. then it’s easy af to know your content, for search engines, screen readers and things like reader views

  3. Things like that were possible before HTML5 too. Since about the mid-1990s there were even recommendations from the W3C on how best to do that, but that would have meant that the then often self-taught webmasters had to read and follow them, which they often did not.

    We have to look where the HTML stuff comes from:
    It was more or less made for publishing scientific papers and as such had only very limited need or support for “fancy” formatting. The inventors simply forgot in the first HTML specifications that normal people or companies wanted something more from the web: a visually pleasing design, corporate branding, advertisements, you name it. Additionally they completely forgot that people at that time were used to reading newspapers (grid and column layout) and magazines (column layout with a lot of images, side boxes, etc.), but were now reading on displays that, in contrast to traditional media, did not have a common fixed size or resolution.
    Mind: CSS was not yet invented and it took several years past 2000 before people started using it more widely, simply because of the inertia of the masses and because of “never change a working system”…

    So people had to be creative about how to achieve that with the then quite limited features of the language.

    Some of them built visually stunning designs, completely abusing the HTML elements for things they were never meant to be used for, without giving a second thought to anything else. Others looked into the source to see how they did that and published how to do such things, leading others to copy their bad behavior. I can still remember the old table-based layouts, or the techniques to get rounded corners (border radius was not invented yet, and some of those needed up to 4 nested DIVs). The main problem with published stuff is that the more it gets copied and linked, the higher it climbs in the search engine rankings. The search engines are blind to the quality of articles, and they don’t know whether some article has been superseded by a better one on some still obscure page. Because of that it sometimes takes more than 10 years until the World Wild Web (pun intended) follows better practices. Additionally, pages that are older than, say, two years rarely get updated to newer standards. (Sometimes even 15-year-old pages contain valid and valuable content. I recently stumbled upon one that still had a “Best viewed with Netscape 4” button on it. Horrible HTML code, but the content was great.)

    In the end:
    A new standard does not make the gazillion b0rked pages out there vanish overnight and definitely does not fix older pages (and nobody rewrites older pages). Browsers must still be able to display those, or the knowledge kept on those pages vanishes, so we will need things like the readers for a long time, despite the all-new and better next [buzzword] standard.

  4. The Reader will sometimes fail because it needs specific HTML elements to detect articles. It searches for H2 elements to detect an article’s title and displays it as the main heading.
    Unfortunately, an overview on a blog page may contain only H1 elements if it is semantically well structured.
    On such pages the Reader will fail.

    Some blogs use an empty H1 and show the titles of the articles in an H2. Creating HTML markup this way may be right from the point of view of program logic, but it is not really semantic.
    As H1 is the top heading of a document it would be nonsense to keep it empty.

    What do you think?

    • Of course leaving the H1 empty is nonsense from the semantic point of view, but sadly the people who wrote the code for some major blog platforms and some other CMS did not give a s**t about that.

      If the Vivaldi reader follows the same logic as my variation, it should find non-empty H1 elements too…

      … but in the end I have seen worse things, e.g. setting the whole 1st paragraph as H# to imitate newspapers, where the excerpt is printed in bold font. Sometimes I really wish that the W3C had gone the strict XML way and made schema validation mandatory. That would have kept several of the c**pmasters away.

    • I know those standards but companies don’t care. They push for visuals and that’s it, totally forgetting the main reason for a website:
      People want to read it (unless it is a pure video site, but even then the surrounding content should be easily accessible)

      IMHO the word “accessible” in itself, with the association and implication that go with it (people with “special needs”), is a bad word. Ram it into their heads that it is better for SEO and that it is cheaper to build websites that conform to standards. Let the companies insist on following the guidelines when commissioning a website. Sites can still look beautiful even when following the standards; that’s what CSS was made for.
      Extra benefit: They work on every device.
