Vivaldi Read Mode was never meant for pages like Twitter, YouTube comments, or Facebook chats, but for articles.
As a general rule of thumb, you can assume that any continuous or adjacent chunk of text longer than 300 characters counts as content. It may be split into multiple adjacent paragraphs, but it must look content-y enough (i.e. consist of sentences), must not contain too many links, videos or images inside the content area (outside is fine), and must not belong to one of the “stop” classes and IDs such as “comment”, “footer”, “ad”-vertisement and many others.
The Vivaldi Read View is, like the read view in Mozilla Firefox and Apple Safari, based on the Readability(TM) code that arc90 released as a labs experiment around 2009 under an open source license (some versions under MIT, some under the Apache license). Later arc90 changed it to a server-supported version that was available at readability.com.
The Intention Behind It
Readability was never meant to be an ad blocker. It was always meant as an on-demand reader view to switch on for *articles*, meaning longer passages of text (important!).
It was never intended to be used on pages like Facebook with its gazillions of short text snippets, YouTube video comments, Twitter feeds, or generally any page that does not contain a sizable chunk of text in one article.
It was meant to make reading longer texts distraction-free by removing e.g. advertisements, page navigation, comments, and videos or images that don’t belong to the main article content, and by restyling the text with readable fonts and colors to make reading more pleasurable.
Of course the code is not really “intelligent” (it has to be fast and must not use up too many resources), so it has to rely on heuristics to detect where the main content might be. While it generally works quite well, it may fail on some pages, especially “if the HTML is totally hosed” (not my words; that was a comment by one of the original arc90 developers).
A (simplified and not complete) Explanation:
- Remove all scripts.
- Remove all styles.
- Completely ignore HTML block-level elements such as paragraphs and divisions with fewer than 25 characters.
- Remove HTML block-level elements whose “stop” classes, IDs or tags indicate that they are definitely not content but something else, e.g. navigation, footers, comments or advertisements.
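To make the clean-up steps above a bit more concrete, here is a minimal Python sketch. It is not the actual Readability or Vivaldi code. The `Block` class, the `prefilter` function and the contents of `STOP_TAGS` and `STOP_TOKENS` are assumptions made up for illustration (the real “stop” list is much longer); only the 25-character limit comes from the description above.

```python
import re
from dataclasses import dataclass

MIN_BLOCK_CHARS = 25  # from the text: shorter blocks are ignored completely

# Assumption: a tiny illustrative subset, not the real "stop" lists.
STOP_TAGS = {"script", "style", "nav", "footer"}
STOP_TOKENS = {"comment", "comments", "footer", "ad", "ads",
               "advertisement", "nav", "navigation"}


@dataclass
class Block:
    """Hypothetical, simplified stand-in for an HTML block-level element."""
    tag: str
    text: str
    css_class: str = ""
    element_id: str = ""


def has_stop_name(block):
    # Compare whole tokens so that "ad" does not match "header" or "shadow".
    names = f"{block.css_class} {block.element_id}".lower()
    tokens = re.split(r"[^a-z0-9]+", names)
    return any(token in STOP_TOKENS for token in tokens)


def prefilter(blocks):
    """Drop scripts, styles, tiny blocks and blocks with "stop" names."""
    kept = []
    for block in blocks:
        if block.tag in STOP_TAGS:
            continue
        if len(block.text.strip()) < MIN_BLOCK_CHARS:
            continue
        if has_stop_name(block):
            continue
        kept.append(block)
    return kept
```

Matching whole tokens instead of substrings is a design choice of this sketch; a naive substring search for “ad” would also throw away perfectly fine containers named “header” or “download”.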
After that the reader loops through all paragraphs, and
- calculates a score for text length with the following formula:
rounded-down((pure-text character count of a page element)/100)
and adds it to the parent element (think of it as a container). This means a paragraph with fewer than 100 characters of text does not get any length bonus at all.
- adds a base score of 1 for each remaining paragraph to the parent element
- assigns a score to them based on how content-y they look. This score gets added to their parent node.
- adds additional scores determined by things like the number of commas (+), class names and IDs (+/-), and image and link density (More than 25% of the text is links? Too many images per paragraph? Punish it!) etc.
- punishes lists, headlines, forms, addresses and some other elements with negative scores because they are normally not part of articles. If they are, they usually sit in the same parent container as the paragraphs of a real article, so the combined score of the parent element is still high enough to count.
- adds half of the resulting score to the grandparent elements.
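The scoring steps above can be sketched roughly like this. Again, this is only an illustration and not the real code: the names `Paragraph`, `paragraph_score` and `score_containers` are hypothetical, and the weights (one point per comma, a flat `LINK_PENALTY` of 3) are assumptions. Only the base score of 1, the rounded-down length bonus, the 25% link limit and the “half to the grandparent” rule are taken from the list.

```python
from dataclasses import dataclass

LINK_DENSITY_LIMIT = 0.25  # from the text: more than 25% link text is punished
LINK_PENALTY = 3           # assumption: the real penalty is not given here


@dataclass
class Paragraph:
    """Hypothetical, simplified paragraph: pure text plus link statistics."""
    text: str
    link_chars: int = 0  # how many characters of the text sit inside links


def paragraph_score(p):
    score = 1                     # base score for each remaining paragraph
    score += len(p.text) // 100   # length bonus: rounded-down(chars / 100)
    score += p.text.count(",")    # assumption: one point per comma
    if p.text and p.link_chars / len(p.text) > LINK_DENSITY_LIMIT:
        score -= LINK_PENALTY     # too link-heavy to look like an article
    return score


def score_containers(paragraphs):
    """paragraphs: iterable of (Paragraph, parent_name, grandparent_name)."""
    scores = {}
    for p, parent, grandparent in paragraphs:
        s = paragraph_score(p)
        # The full score goes to the parent, half of it to the grandparent.
        scores[parent] = scores.get(parent, 0) + s
        if grandparent is not None:
            scores[grandparent] = scores.get(grandparent, 0) + s / 2
    return scores
```

With these assumed weights, a 256-character paragraph containing two commas gets 1 (base) + 2 (length) + 2 (commas) = 5 points, while a ten-character snippet without commas gets only its base score of 1.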
When all that is done and the parent or grandparent has a high enough score, it is seen as content and gets displayed; everything else gets removed.
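As a tiny sketch of that last step: given a score per container, keep what reaches a threshold and drop the rest. The function name `pick_content` is hypothetical, and the threshold is a parameter on purpose, because I am not stating here what “high enough” means in the real code.

```python
def pick_content(scores, threshold):
    """Return the names of containers that count as content, best first.

    scores: dict mapping a container name to its accumulated score.
    threshold: assumption, the cut-off for "high enough" is not specified.
    """
    kept = [name for name, score in scores.items() if score >= threshold]
    return sorted(kept, key=lambda name: scores[name], reverse=True)
```

If no container reaches the threshold, the result is an empty list, which corresponds to the cases where the reader view finds no article on the page.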
You can probably imagine now how many pitfalls content detection may fall into, so please take a break if you see it fail and think about what might have caused it this time.
Personal side note (strong language warning)
All in all content detection is a bitch and can definitely fail on some pages, especially if the “Webmasters” (I call them Crapmasters) don’t know what an HTML validator is and have never heard about structured pages and accessibility. I am speaking from experience: Back in 2009 I started with a userscript and later made an extension (cleanPages, see the old my.opera page on archive.org) based on the full original arc90 code and fine-tuned it for Opera Presto (and ported it later for the new Opera thing). It had over 250k installs and while it was fun to tweak for better results, it was a hell of a lot of work. I wrote more than 200 versions with generic fixes for “just another couple of new pages that fail” but in the end I gave in and called it a day: there are too many broken pages out there where the webmasters seemingly do not want people to read the content. Their wish is my command 😉
So please be gentle with the Vivaldi developers. Yes, there is still some fine-tuning to be done, but that is really time-consuming. It will probably have to wait because there are some other, more difficult and bigger things in the pipeline (hint, hint 😀 )
Disclaimer: While I am a “Soprano” (aka external tester for internal builds), all the views in this text are my private views and do not necessarily reflect the views or opinions of Vivaldi (the company) or any of its owners or employees.