This repository was archived by the owner on Dec 3, 2020. It is now read-only.

Upgrade to Fathom 3.0#317

Merged
biancadanforth merged 11 commits into master from fathom3
Aug 21, 2019

Conversation

@biancadanforth
Collaborator

@biancadanforth biancadanforth commented Jun 27, 2019

Training numbers are:

image

bdanforth ~/Projects/price-tracker/src/extraction/fathom/vectors (fathom3) $ fathom-train vectors_training_image.json -a vectors_validation_image.json -l 0.07 -i 3000 -s -c "Image - early stopping"
  [#############################-------]   81%  00:03:59
Stopping early at iteration 2458, just before validation error rose.
Coeffs: [
        ['isAboveTheFoldImage', 16.586257934570312],
        ['isBig', 21.33774757385254],
        ['hasSquareAspectRatio', 1.1075100898742676],
        ['hasBackgroundInID', -10.287492752075195],
    ]
Bias: -10.176139831542969
  Training accuracy per tag:  0.99079    95% CI: (0.98879, 0.99280)  FP: 0.001  FN: 0.008
Validation accuracy per tag:  0.99335    95% CI: (0.98843, 0.99826)  FP: 0.003  FN: 0.004
  Training accuracy per page: 0.87037    95% CI: (0.81865, 0.92210)
Validation accuracy per page: 0.85000    95% CI: (0.69351, 1.00000)

title

bdanforth ~/Projects/price-tracker/src/extraction/fathom/vectors (fathom3) $ fathom-train vectors_training_title.json -a vectors_validation_title.json -l 0.1 -s -c "Title"
  [###############---------------------]   44%  00:00:06
Stopping early at iteration 443, just before validation error rose.
Coeffs: [
        ['isNearImageTopOrBottom', 7.078092575073242],
    ]
Bias: -1.6698582172393799
  Training accuracy per tag:  0.88934    95% CI: (0.84998, 0.92871)  FP: 0.020  FN: 0.090
Validation accuracy per tag:  1.00000    95% CI: (1.00000, 1.00000)  FP: 0.000  FN: 0.000
  Training accuracy per page: 0.98765    95% CI: (0.97065, 1.00000)
Validation accuracy per page: 1.00000    95% CI: (1.00000, 1.00000)

price

bdanforth ~/Projects/price-tracker/src/extraction/fathom/vectors (fathom3) $ fathom-train vectors_training_price.json -a vectors_validation_price.json -l 0.1 -c "Price" -s -v
  [####################################]  100%          
Coeffs: [
        ['hasDollarSign', 1.0177843570709229],
        ['isAboveTheFoldPrice', -5.301823616027832],
        ['hasPriceInID', 5.333859443664551],
        ['hasPriceInParentID', -7.5635271072387695],
        ['hasPriceInClassName', 1.155443787574768],
        ['hasPriceInParentClassName', 3.0024354457855225],
        ['fontIsBig', 11.338400840759277],
        ['isNearImage', 0.7539440989494324],
        ['hasPriceishPattern', 5.222956657409668],
    ]
Bias: -7.09004545211792
  Training accuracy per tag:  0.99350    95% CI: (0.99144, 0.99556)  FP: 0.002  FN: 0.004
Validation accuracy per tag:  0.99268    95% CI: (0.98685, 0.99852)  FP: 0.001  FN: 0.006
  Training accuracy per page: 0.97531    95% CI: (0.95141, 0.99921)
Validation accuracy per page: 0.95000    95% CI: (0.85448, 1.00000)

@biancadanforth biancadanforth changed the title WIP - Upgrade to Fathom 3.0 Upgrade to Fathom 3.0 Jun 27, 2019
biancadanforth and others added 9 commits July 1, 2019 11:45
Training with Fathom 3.0 requires some changes to the structure of the coefficients and the ruleset, and adds a new vectorize step.

Fathom changes include:
* Fathom now handles weighting the rules by their coefficients.
* Rule weighting is no longer exponential.
* Each rule should return a value between 0 and 1, inclusive.
* Coefficients should be passed into Fathom as a [rule_name, coefficient] tuple.
* Fathom's 'rule' function now takes a second argument: an object literal with a single key, 'name', whose value is the name of the rule. This string must match the rule_name passed into Fathom in the tuple mentioned above.
* [Fathom training](http://mozilla.github.io/fathom/training.html?highlight=vectorizer#running-the-trainer) now includes a vectorize step using the Vectorizer in FathomFox
  * The Vectorizer generates a 'vectors.json' file for training and validation for each feature; for Price Tracker, which has three features (the product image, title and price), this means 6 new files.
  * The main purpose of each 'vectors.json' file is to provide a feature vector for each candidate element for a given feature. This feature vector has a floating-point value for each rule: the ith value is the raw score for that element from the ith rule in that feature's list of [rule_name, coefficient] tuples.
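To make the relationship between feature vectors, coefficients, and bias concrete, here is a minimal sketch (not Fathom's actual code; names are illustrative) of how a trained model turns a candidate's per-rule scores into a confidence. Fathom 3 models are essentially logistic regressions: a weighted sum of rule scores plus a bias, squashed through a sigmoid into (0, 1).

```javascript
// Sketch: combine a feature vector with trained coefficients and a bias.
// The ith coefficient weights the ith rule's raw score.
function confidence(featureVector, coeffs, bias) {
  const weightedSum = featureVector.reduce(
    (sum, score, i) => sum + score * coeffs[i],
    bias
  );
  return 1 / (1 + Math.exp(-weightedSum)); // sigmoid squashes to (0, 1)
}
```

With an all-zero feature vector, only the bias matters, so a strongly negative bias like the -10.18 trained for 'image' above keeps the default confidence near 0 until rules fire.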
Sometimes a feature vector can contain a 'null' value. This will [throw an error](mozilla/fathom-fox#35) during training with the new 'fathom-train' CLI.

Possible causes include:
* A name mismatch between the 'name' value passed into a rule function (its second argument) and the name of the rule in the list of [ruleName, coefficient] tuples referenced in the ruleset object.
* A score callback might be failing to return a number.
* A corner case of a DOM or CSSOM specification used by a score callback; e.g. innerText could return null instead of the empty string in Firefox.

In this case, a score callback was failing to return a number if the width or height of the element passed into 'aspectRatio' was 0.
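A pure-function sketch of that kind of fix, assuming the rule scores "squareness" as the ratio of the smaller to the larger dimension (the real rule operates on fnodes; this stand-in only shows the zero-dimension guard):

```javascript
// Stand-in for the fixed score callback: always return a number in [0, 1],
// even when the element has a zero width or height.
function aspectRatioScore(width, height) {
  if (width === 0 || height === 0) {
    return 0; // degenerate element: previously this path produced no number
  }
  // 1 for a perfect square, trending toward 0 as the element elongates.
  return Math.min(width, height) / Math.max(width, height);
}
```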
Previous vectors were based on an incomplete corpus of tagged product pages. These vectors are based on the complete set of Amazon, Ebay, Best Buy, Walmart and Home Depot samples tagged in the [Fathom Commerce Samples](https://drive.google.com/drive/folders/1YKfDHx2niy9nCrdKCSDt7lcU9uWbHzon) folder.

These samples were divided into 3 buckets (training, validation and test) using ['fathom-pick'](https://github.com/mozilla/fathom/blob/master/cli/fathom_web/pick.py), which moves samples at random. The complete corpus was split 80/10/10 across these buckets.
Trained 'image' using the ['fathom-train'](https://github.com/mozilla/fathom/blob/master/cli/fathom_web/train.py) CLI and image training and validation vectors from FathomFox's Vectorizer. Copied the resulting coefficients and bias into trainees.js and its imports.
As mentioned in [this issue](mozilla/fathom-fox#35), it's possible that a feature vector will contain a 'null' value for one or more rules, which will cause 'fathom-train' to throw an error.

In this case, the reason for the 'null' value was a name mismatch.

I opted to change the method name as its naming convention did not match the other rules.
After fixing the issue with a 'null' feature vector value in the price vectors, the vectors were re-generated using FathomFox's Vectorizer. Now there are no longer any 'null' feature vector values and training can proceed.
In Price Tracker, the product 'title' and 'price' features are dependent upon the 'image' feature results (e.g. there is a rule for the 'price' feature called 'isNearImage' which scores a candidate 'price' element based on its proximity to the most likely 'image' element).

As a result, the final weights and bias from training the 'image' feature need to be taken into account before vectorizing the 'title' and 'price' features. This commit updates the vectors for 'title' and 'price' to take this into account.
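As an illustration of such a dependent rule, a proximity score like 'isNearImage' might compare a price candidate's position against the winning image element. This is a hypothetical geometric stand-in (the real rule works on fnodes, and its exact distance metric isn't shown in this PR):

```javascript
// Hypothetical proximity score: 1 when the two boxes' centers coincide,
// falling linearly to 0 at maxDistance pixels apart.
function isNearImageScore(priceRect, imageRect, maxDistance = 1000) {
  const centerX = r => r.x + r.width / 2;
  const centerY = r => r.y + r.height / 2;
  const dist = Math.hypot(
    centerX(priceRect) - centerX(imageRect),
    centerY(priceRect) - centerY(imageRect)
  );
  return Math.max(0, 1 - dist / maxDistance);
}
```

Because this score depends on which element the 'image' model picks, the image coefficients must be final before 'title' and 'price' vectors are generated.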
…rice'

With updated vectors for 'title' and 'price' features based on the optimized coefficients and bias for 'image', the 'title' and 'price' features were trained using 'fathom-train' and their coefficients and biases updated.
Also removes the now unused getCoeffsInOrder function and updates some training-related comments
@biancadanforth
Collaborator Author

biancadanforth commented Jul 2, 2019

TL;DR:

While Price Tracker's accuracy in this PR on the 30-page test set is low at 40%, it beats the accuracy we get on this test set in the current master branch, which is a whopping 0%.

Why is the test accuracy on master 0%?
Well, due to a bug (see this comment), the master branch currently doesn't use Fathom at all and instead relies purely on our fallback extraction method of using CSS selectors. This PR fixes that bug, since Fathom 3.0 now scores elements based on a [0,1] confidence that the element is the target element. As a result, a SCORE_THRESHOLD value in ./src/extraction/fathom/index.js was found that allows Fathom's extraction results to be used. That results in a 40% accuracy improvement!
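The gating logic that bug fix enables can be sketched like this (the names and the threshold value are illustrative, not the actual ones in ./src/extraction/fathom/index.js):

```javascript
// Illustrative sketch of confidence-gated extraction: trust Fathom's pick
// only when its confidence clears the threshold; otherwise fall back to
// the CSS-selector extractor. The 0.5 threshold here is an assumption.
const SCORE_THRESHOLD = 0.5;

function extractFeature(fathomResult, fallbackResult) {
  if (fathomResult && fathomResult.confidence >= SCORE_THRESHOLD) {
    return {method: 'fathom', value: fathomResult.value};
  }
  return {method: 'fallback', value: fallbackResult};
}
```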


Test Set Results -- This PR -- Fathom accuracy: 12/30 (40%)
The test set included 30 sample pages chosen at random that were not part of the training or validation sets, taken from the Fathom Commerce Samples Google Drive folder.

| page | method | correct (T/F) |
| --- | --- | --- |
| 0005 eBay.html | none | F |
| 0007 Best Buy.html | fathom | T |
| 0007 Walmart.html | fathom | T |
| 0009 Best Buy.html | none | F |
| 0011 eBay.html | fathom | T |
| 0013 Best Buy.html | none | F |
| 0014 amz.html | none | F |
| 0015 eBay.html | none | F |
| 00015 H_D.html | none | F |
| 0017 Best Buy.html | none | F |
| amazon-40 (8).html | fathom | T |
| amazon-40 (22).html | none | F |
| amazon-40 (30).html | none | F |
| amazon-40 (37).html | none | F |
| best_buy-40 (5).html | none | F |
| best_buy-40 (11).html | none | F |
| best_buy-40 (24).html | none | F |
| best_buy-40 (26).html | none | F |
| ebay-1 (15).html | none | F |
| ebay-1 (30).html | none | F |
| home_depot-40 (17).html | none | F |
| home_depot-40 (21).html | fathom | T |
| home_depot-40 (25).html | fathom | T |
| home_depot-40 (28).html | fathom | T |
| home_depot-40 (32).html | fathom | T |
| home_depot-40 (38).html | none | F |
| walmart-40 (20).html | fathom | T |
| walmart-40 (25).html | fathom | T |
| walmart-40 (38).html | fathom | T |
| walmart-40 (39).html | fathom | T |

How accuracy was tested

The approach outlined below was used since fathom-train does not currently have a way to measure test accuracy.

In order to assess the accuracy of Price Tracker in this PR versus master, I had to run Fathom on the test set, which consisted of frozen pages saved locally (file:/// urls). By default, Price Tracker does not perform extraction on any URL scheme other than http(s)://, and it only performs extraction on pages in an allow list, so some temporary changes were made:

  1. Set up a local HTTP server in the directory where the frozen pages are (python -m http.server 8000 --bind 127.0.0.1). This changes the scheme for the page URL from file:/// to http://.
  2. Modify web-ext-config.js to set this pref and value to enable extraction on any domain:
diff --git a/web-ext-config.js b/web-ext-config.js
index 9791f8c..8097c55 100644
--- a/web-ext-config.js
+++ b/web-ext-config.js
@@ -1,6 +1,7 @@
 module.exports = {
   run: {
     pref: [
+      'extensions.shopping-testpilot@mozilla.org.extractionAllowlist=*',
       'extensions.shopping-testpilot@mozilla.org.priceCheckInterval=30000',
       'extensions.shopping-testpilot@mozilla.org.priceCheckTimeoutInterval=30000',
       'extensions.shopping-testpilot@mozilla.org.iframeTimeout=10000',

@biancadanforth biancadanforth requested a review from erikrose July 2, 2019 00:05
Contributor

@erikrose erikrose left a comment


That was a nice, fast port! If we make those 2-3 little tweaks, I think we're good to merge. That'll make this a decent real-world example to show people, aside from the corpus being unusually un-diverse.

I am unsettled that the testing accuracy is so far off from the training/validation (and bad). It makes me suspect either the sets are not representative of each other or we've missed a significant bug. However, the goal of this work was to get everybody spun up on Fathom 3, and that's been accomplished. We can come back and chase that mystery if we decide to get serious about this product again. What's more, I don't think doing a new release of Price Tracker is going to make any user's experience worse—right, Bianca?

Again, good job on the quick spin-up!

   */
-  weightedIncludes(haystack, needle, coeff) {
-    return (this.caselessIncludes(haystack, needle) ? ONEISH : ZEROISH) ** coeff;
+  weightedIncludes(haystack, needle) {
Contributor


Probably shouldn't be called "weighted" anymore

   /** Scores fnode with a '$' in its innerText */
   hasDollarSign(fnode) {
-    return (fnode.element.innerText.includes('$') ? ONEISH : ZEROISH) ** this.hasDollarSignCoeff;
+    return (fnode.element.innerText.includes('$') ? ONEISH : ZEROISH);
Contributor


We should get rid of ONEISH and ZEROISH. They don't make anything better anymore, and they make things ever so slightly worse (probably just slightly slower to converge).

* Using coefficients passed into the constructor method, returns a weighted
* ruleset used to score elements in an HTML document.
*
* @param {Array[]} An array of [string, number] tuples where the first element
Contributor


Is it legal to leave out the param name? In any case, we should either document both params or neither.

@biancadanforth
Collaborator Author

TL;DR: Using the new fathom-test CLI on the 20 sample pages in our test set in this PR, the overall per-page accuracy of the current ruleset, weights and biases using Fathom 3.0 is 70%. This is the accuracy for Fathom to correctly identify a "product" on a page (i.e. correctly identifying all three product features: image, title and price).

Longer version
Using the src/extraction/fathom/vectors/vectors_test_* files pushed here and the coefficients and biases resulting from training, I used the new fathom-test CLI to measure the test accuracy (per page) for each feature against 20 sample pages:

  • "image": 75%
  • "title": 100%
  • "price": 95%

This is not an apples-to-apples comparison with the past test accuracy measure, since that was looking at whether a "product" (the sum of its "image", "price" and "title") was found on a page, while fathom-test checks per feature whether Fathom was correct on a page or for a particular element. To compare directly, I'd need to see for which pages all 3 features were a "success" in choosing the correct element. Looking at that, the overall accuracy of Fathom in this PR in identifying a product correctly (all three features on the page) is 70%.

This is much better than the 40% from my previous approach for testing accuracy (granted, that was on a different test set of 30 pages), but it's still nowhere near the upper-80s and 90s accuracy seen in the training and validation runs, despite these samples coming from the exact same corpus as the training and validation samples.

Image testing accuracy

(venv) bdanforth ~/Projects/price-tracker (fathom3) $ fathom-test src/extraction/fathom/vectors/vectors_test_image.json '{"coeffs": [["isAboveTheFoldImage", 16.586257934570312], ["isBig", 21.33774757385254], ["hasSquareAspectRatio", 1.1075100898742676], ["hasBackgroundInID", -10.287492752075195]], "bias": -10.176139831542969}'

Testing accuracy per tag:  0.99000    95% CI: (0.98412, 0.99588)  FP: 0.003  FN: 0.007
Testing accuracy per page: 0.75000    95% CI: (0.56022, 0.93978)

Testing per-page results:
 success  on     amazon-40%20(22).html. Confidence: 0.45426598
 success  on     amazon-40%20(30).html. Confidence: 0.79204732
 failure  on     amazon-40%20(37).html. Confidence: 0.71261239 Highest-scoring element was a wrong choice.
    First target at index 7: 0.18024161
 success  on      amazon-40%20(8).html. Confidence: 0.45773035
 failure  on   best_buy-40%20(11).html. Confidence: 0.06214701 Highest-scoring element was a wrong choice.
    First target at index 2: 0.03152284
 success  on   best_buy-40%20(24).html. Confidence: 0.31307796
 failure  on   best_buy-40%20(26).html. Confidence: 0.06214701 Highest-scoring element was a wrong choice.
    First target at index 2: 0.05037770
 success  on    best_buy-40%20(5).html. Confidence: 0.10311700
 success  on        ebay-1%20(15).html. Confidence: 0.20423122
 success  on        ebay-1%20(30).html. Confidence: 0.81114835
 failure  on home_depot-40%20(17).html. Confidence: 0.74345410 Highest-scoring element was a wrong choice.
    First target at index 1: 0.73110878
 success  on home_depot-40%20(21).html. Confidence: 0.73110878
 success  on home_depot-40%20(25).html. Confidence: 0.97135264
 success  on home_depot-40%20(28).html. Confidence: 0.73110878
 success  on home_depot-40%20(32).html. Confidence: 0.73110878
 failure  on home_depot-40%20(38).html. Confidence: 0.74345410 Highest-scoring element was a wrong choice.
    First target at index 1: 0.73110878
 success  on    walmart-40%20(20).html. Confidence: 0.60515302
 success  on    walmart-40%20(25).html. Confidence: 0.60668099
 success  on    walmart-40%20(38).html. Confidence: 0.58588487
 success  on    walmart-40%20(39).html. Confidence: 0.53866470

Title testing accuracy

(venv) bdanforth ~/Projects/price-tracker (fathom3) $ fathom-test src/extraction/fathom/vectors/vectors_test_title.json '{"coeffs": [["isNearImageTopOrBottom", 7.078092575073242]], "bias": -1.6698582172393799}'

Testing accuracy per tag:  0.83871    95% CI: (0.70923, 0.96818)  FP: 0.000  FN: 0.161
Testing accuracy per page: 1.00000    95% CI: (1.00000, 1.00000)

Testing per-page results:
 success  on     amazon-40%20(22).html. Confidence: 0.99099052
 success  on     amazon-40%20(30).html. Confidence: 0.64029801
 success  on     amazon-40%20(37).html. Confidence: 0.24906397
 success  on      amazon-40%20(8).html. Confidence: 0.97334236
 success  on   best_buy-40%20(11).html. Confidence: 0.24906397
 success  on   best_buy-40%20(24).html. Confidence: 0.69247615
 success  on   best_buy-40%20(26).html. Confidence: 0.24906397
 success  on    best_buy-40%20(5).html. Confidence: 0.76132601
 success  on        ebay-1%20(15).html. Confidence: 0.99085999
 success  on        ebay-1%20(30).html. Confidence: 0.99085999
 success  on home_depot-40%20(17).html. Confidence: 0.24906397
 success  on home_depot-40%20(21).html. Confidence: 0.98728508
 success  on home_depot-40%20(25).html. Confidence: 0.82306099
 success  on home_depot-40%20(28).html. Confidence: 0.93691045
 success  on home_depot-40%20(32).html. Confidence: 0.93691045
 success  on home_depot-40%20(38).html. Confidence: 0.24906397
 success  on    walmart-40%20(20).html. Confidence: 0.98574233
 success  on    walmart-40%20(25).html. Confidence: 0.97179431
 success  on    walmart-40%20(38).html. Confidence: 0.98574233
 success  on    walmart-40%20(39).html. Confidence: 0.98574233

Price testing accuracy

(venv) bdanforth ~/Projects/price-tracker (fathom3) $ fathom-test src/extraction/fathom/vectors/vectors_test_price.json '{"coeffs": [["hasDollarSign", 1.0177843570709229], ["isAboveTheFoldPrice", -5.301823616027832], ["hasPriceInID", 5.333859443664551], ["hasPriceInParentID", -7.5635271072387695], ["hasPriceInClassName", 1.155443787574768], ["hasPriceInParentClassName", 3.0024354457855225], ["fontIsBig", 11.338400840759277], ["isNearImage", 0.7539440989494324], ["hasPriceishPattern", 5.222956657409668]], "bias": -7.09004545211792}'

Testing accuracy per tag:  0.99458    95% CI: (0.98847, 1.00000)  FP: 0.000  FN: 0.005
Testing accuracy per page: 0.95000    95% CI: (0.85448, 1.00000)

Testing per-page results:
 failure  on     amazon-40%20(22).html. Confidence: 0.01885081 Highest-scoring element was a wrong choice.
    First target at index 1: 0.01707094
 success  on     amazon-40%20(30).html. Confidence: 0.01656970 No target nodes. Assumed negative sample.
 success  on     amazon-40%20(37).html. Confidence: 0.01240653 No target nodes. Assumed negative sample.
 success  on      amazon-40%20(8).html. Confidence: 0.82405251
 success  on   best_buy-40%20(11).html. Confidence: no candidate nodes. Assumed negative sample.
 success  on   best_buy-40%20(24).html. Confidence: 0.88195562
 success  on   best_buy-40%20(26).html. Confidence: no candidate nodes. Assumed negative sample.
 success  on    best_buy-40%20(5).html. Confidence: 0.88763416
 success  on        ebay-1%20(15).html. Confidence: 0.15543531 No target nodes. Assumed negative sample.
 success  on        ebay-1%20(30).html. Confidence: 0.14137758
 success  on home_depot-40%20(17).html. Confidence: no candidate nodes. Assumed negative sample.
 success  on home_depot-40%20(21).html. Confidence: 0.99791569
 success  on home_depot-40%20(25).html. Confidence: 0.99840385
 success  on home_depot-40%20(28).html. Confidence: 0.99801970
 success  on home_depot-40%20(32).html. Confidence: 0.99818760
 success  on home_depot-40%20(38).html. Confidence: no candidate nodes. Assumed negative sample.
 success  on    walmart-40%20(20).html. Confidence: 0.82440346
 success  on    walmart-40%20(25).html. Confidence: 0.79404730
 success  on    walmart-40%20(38).html. Confidence: 0.81192786
 success  on    walmart-40%20(39).html. Confidence: 0.83670110

@erikrose
Contributor

This is the accuracy for Fathom to correctly identify a "product" on a page (i.e. correctly identifying all three product features: image, title and price).

How did you measure overall "product" success? Did you do manual math to intersect the per-page successes of the 3 types?

This is not an apples to apples comparison from the past test accuracy measure, since that was looking at whether a "product" (being the sum of its "image", "price" and "title") was found on a page.

I suspect from your linked description that you manually kept track of successes on the hacked-up copy of Price Tracker to determine the past score. Is that true? (I want to make sure you understand the old FathomFox Trainer also tested one type at a time.)

@biancadanforth
Collaborator Author

How did you measure overall "product" success? Did you do manual math to intersect the per-page successes of the 3 types?

Yes. It might be a nice enhancement for the fathom-web CLI to have an option to do this for you in applications where multiple features must be correct on a page to count it as a success.
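For reference, that manual intersection amounts to a few lines (a sketch, not part of the CLI; names are illustrative):

```javascript
// Sketch of the per-page intersection: a page is a "product" success only
// if every feature (image, title, price) chose correctly on that page.
// perFeatureWins is an array of Sets of page names, one Set per feature.
function productSuccesses(perFeatureWins) {
  const [first, ...rest] = perFeatureWins;
  return [...first].filter(page => rest.every(wins => wins.has(page)));
}
```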

I suspect from your linked description that you manually kept track of successes on the hacked-up copy of Price Tracker to determine the past score. Is that true? (I want to make sure you understand the old FathomFox Trainer also tested one type at a time.)

I did manually keep track of successes, but I'm not sure what you mean by "hacked up copy" -- the copy of Price Tracker I used was this PR.

I do understand that Fathom's current and previous testing method runs on a per-feature basis. Sorry, I misspoke a bit when I said:

... but it's still nowhere near the upper 80s and 90s percent accuracy seen in the training and validation runs, despite these samples coming from the exact same corpus as the training and validation samples.

What I meant to say was that, while "title" and "price" testing accuracy (per page) is within a few percentage points of their training and validation accuracy, "image" testing accuracy per page is a full 10% lower at 75%.

@erikrose
Contributor

erikrose commented Jul 30, 2019

"Hacked up": I was referring to the temporary changes you made in #317 (comment). I also had in mind some more extensive changes you'd made, but that must have been in a similar ticket. So it's not that hacked-up after all. :-)

What I meant to say was that, while "title" and "price" testing accuracy (per page) is within a few percentage points of their training and validation accuracy, "image" testing accuracy per page is a full 10% lower at 75%.

Oh, good. Now we're getting into the realm of explicability. Could be legit unluckiness at this point. For a 10% change, we'd have only to do worse on 2 samples out of the 20 used. For login-forms, I used more like 60.

@biancadanforth biancadanforth merged commit 7c0aac6 into master Aug 21, 2019
@biancadanforth biancadanforth deleted the fathom3 branch August 21, 2019 17:49