Conversation
Training with Fathom 3.0 requires some changes to the structure of the coefficients and ruleset used, and adds a new vectorize step. Fathom changes include:

* Fathom now handles weighting the rules by their coefficients.
* Rule weighting is no longer exponential.
* Each rule should return a value between 0 and 1, inclusive.
* Coefficients should be passed into Fathom as [rule_name, coefficient] tuples.
* Fathom's 'rule' function now takes a second argument: an object literal with a single key, 'name', whose value is the name of the rule. This string must match the rule_name passed into Fathom as part of the tuple mentioned above.
* [Fathom training](http://mozilla.github.io/fathom/training.html?highlight=vectorizer#running-the-trainer) now includes a vectorize step using the Vectorizer in FathomFox.
* The Vectorizer generates a 'vectors.json' file for training and for validation for each feature (for Price Tracker, which has three features (the product image, title and price), this means six new files).
* The main purpose of each 'vectors.json' file is to provide a feature vector for each candidate element of a given feature. The feature vector holds a floating-point value per rule: the ith value is the raw score that element received from the ith rule in the feature's list of [rule_name, coefficient] tuples.
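Put together, the pieces above mean a candidate's confidence is computed from its feature vector, the trained coefficients, and a per-feature bias. A minimal sketch of that combination, assuming the linear-model-plus-sigmoid scoring the Fathom 3 trainer fits (illustrative only, not Fathom's actual code):

```javascript
// Sketch: combine a candidate's feature vector of per-rule scores with
// trained coefficients and a bias into a confidence in (0, 1).
function confidence(featureVector, coeffs, bias) {
  // coeffs: array of [ruleName, coefficient] tuples, in the same order
  // as the raw rule scores in featureVector.
  const weightedSum = coeffs.reduce(
    (sum, [, coefficient], i) => sum + coefficient * featureVector[i],
    bias
  );
  return 1 / (1 + Math.exp(-weightedSum)); // sigmoid
}

// Made-up example: two rules fire strongly, overcoming a negative bias.
const example = confidence(
  [1.0, 0.8],
  [['isBig', 2.0], ['hasSquareAspectRatio', 0.5]],
  -1.5
);
```

Because the sigmoid is monotonic, a larger weighted sum always means a higher confidence, which is why each rule only needs to emit a plain score in [0, 1] and leave the weighting to Fathom.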
Sometimes a feature vector can contain a 'null' value, which will [throw an error](mozilla/fathom-fox#35) during training with the new 'fathom-train' CLI. Possible causes include:

* A name mismatch between the 'name' value passed into a rule function (its second argument) and the name of the rule in the list of [ruleName, coefficient] tuples referenced in the ruleset object.
* A score callback failing to return a number.
* A corner case of a DOM or CSSOM specification used by a score callback; e.g. innerText can return null instead of the empty string in Firefox. In one such case, a score callback failed to return a number when the width or height of the element passed into 'aspectRatio' was 0.
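A defensive score callback avoids the last cause above by guarding against zero dimensions so it always returns a number. A hypothetical sketch (the function name and scoring formula are illustrative, not Price Tracker's actual rule):

```javascript
// Hypothetical guard for the aspect-ratio corner case described above:
// always return a number in [0, 1], even for degenerate elements.
function aspectRatioScore(width, height) {
  if (!width || !height) {
    return 0; // zero-area element: score as a non-match rather than NaN/null
  }
  const ratio = Math.max(width, height) / Math.min(width, height);
  // 1.0 for a perfect square, falling toward 0 as the element elongates.
  return 1 / ratio;
}
```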
Previous vectors were based on an incomplete corpus of tagged product pages. These vectors are based on the complete set of Amazon, eBay, Best Buy, Walmart and Home Depot samples tagged in the [Fathom Commerce Samples](https://drive.google.com/drive/folders/1YKfDHx2niy9nCrdKCSDt7lcU9uWbHzon) folder. These samples were divided into three buckets (training, validation and test) using ['fathom-pick'](https://github.com/mozilla/fathom/blob/master/cli/fathom_web/pick.py), which moves samples at random. The complete corpus was split 80/10/10 across these buckets.
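The 80/10/10 split can be pictured as below. This is only an illustrative sketch of the split logic; 'fathom-pick' itself operates by moving sample files between directories:

```javascript
// Illustrative 80/10/10 random split of a sample list into
// training/validation/test buckets.
function splitSamples(samples, trainFrac = 0.8, validationFrac = 0.1) {
  const shuffled = [...samples];
  for (let i = shuffled.length - 1; i > 0; i--) { // Fisher-Yates shuffle
    const j = Math.floor(Math.random() * (i + 1));
    [shuffled[i], shuffled[j]] = [shuffled[j], shuffled[i]];
  }
  const trainEnd = Math.round(shuffled.length * trainFrac);
  const validationEnd = trainEnd + Math.round(shuffled.length * validationFrac);
  return {
    training: shuffled.slice(0, trainEnd),
    validation: shuffled.slice(trainEnd, validationEnd),
    test: shuffled.slice(validationEnd),
  };
}
```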
Trained the 'image' feature using the ['fathom-train'](https://github.com/mozilla/fathom/blob/master/cli/fathom_web/train.py) CLI and the image training and validation vectors from FathomFox's Vectorizer, then copied the resulting coefficients and bias into trainees.js and its imports.
As mentioned in [this issue](mozilla/fathom-fox#35), a feature vector may contain a 'null' value for one or more rules, which causes 'fathom-train' to throw an error. In this case, the reason for the 'null' value was a name mismatch. I opted to change the method name, as its naming convention did not match the other rules.
After fixing the issue with a 'null' feature vector value in the price vectors, the vectors were re-generated using FathomFox's Vectorizer. Now there are no longer any 'null' feature vector values and training can proceed.
In Price Tracker, the product 'title' and 'price' features are dependent upon the 'image' feature results (e.g. there is a rule for the 'price' feature called 'isNearImage' which scores a candidate 'price' element based on its proximity to the most likely 'image' element). As a result, the final weights and bias from training the 'image' feature need to be taken into account before vectorizing the 'title' and 'price' features. This commit updates the vectors for 'title' and 'price' to take this into account.
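This dependency exists because a proximity rule needs the already-scored best 'image' element to measure against. A hypothetical sketch of such a distance-based score (the function name, rectangle shape, and decay constant are all illustrative, not Price Tracker's actual 'isNearImage' implementation):

```javascript
// Hypothetical proximity score: 1.0 when a candidate's box touches the
// best image's box, decaying linearly toward 0 as the pixel gap grows.
function proximityScore(candidateRect, imageRect, decayPx = 500) {
  // Horizontal and vertical gaps between the two boxes (0 if they overlap
  // on that axis).
  const dx = Math.max(
    imageRect.left - candidateRect.right,
    candidateRect.left - imageRect.right,
    0
  );
  const dy = Math.max(
    imageRect.top - candidateRect.bottom,
    candidateRect.top - imageRect.bottom,
    0
  );
  const distance = Math.hypot(dx, dy);
  return Math.max(0, 1 - distance / decayPx);
}
```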
With updated vectors for the 'title' and 'price' features based on the optimized coefficients and bias for 'image', the 'title' and 'price' features were trained using 'fathom-train' and their coefficients and biases updated.
Also removes the now-unused 'getCoeffsInOrder' function and updates some training-related comments.
TL;DR: While Price Tracker's accuracy in this PR on the 30-page test set is low at 40%, it beats out the accuracy we get on this test set with the current version.

Why is the test accuracy so low?

Test Set Results

This PR

Fathom accuracy: 12/30 (40%)

How accuracy was tested

In order to assess the accuracy of Price Tracker in this PR versus the current version, the approach outlined below was used.
```diff
diff --git a/web-ext-config.js b/web-ext-config.js
index 9791f8c..8097c55 100644
--- a/web-ext-config.js
+++ b/web-ext-config.js
@@ -1,6 +1,7 @@
 module.exports = {
   run: {
     pref: [
+      'extensions.shopping-testpilot@mozilla.org.extractionAllowlist=*',
       'extensions.shopping-testpilot@mozilla.org.priceCheckInterval=30000',
       'extensions.shopping-testpilot@mozilla.org.priceCheckTimeoutInterval=30000',
       'extensions.shopping-testpilot@mozilla.org.iframeTimeout=10000',
```
erikrose left a comment:
That was a nice, fast port! If we make those 2-3 little tweaks, I think we're good to merge. That'll make this a decent real-world example to show people, aside from the corpus being unusually un-diverse.
I am unsettled that the testing accuracy is so far off from the training/validation (and bad). It makes me suspect either the sets are not representative of each other or we've missed a significant bug. However, the goal of this work was to get everybody spun up on Fathom 3, and that's been accomplished. We can come back and chase that mystery if we decide to get serious about this product again. What's more, I don't think doing a new release of Price Tracker is going to make any user's experience worse—right, Bianca?
Again, good job on the quick spin-up!
```diff
   */
-  weightedIncludes(haystack, needle, coeff) {
-    return (this.caselessIncludes(haystack, needle) ? ONEISH : ZEROISH) ** coeff;
+  weightedIncludes(haystack, needle) {
```
Probably shouldn't be called "weighted" anymore.
```diff
   /** Scores fnode with a '$' in its innerText */
   hasDollarSign(fnode) {
-    return (fnode.element.innerText.includes('$') ? ONEISH : ZEROISH) ** this.hasDollarSignCoeff;
+    return (fnode.element.innerText.includes('$') ? ONEISH : ZEROISH);
```
We should get rid of ONEISH and ZEROISH. They don't make anything better anymore, and they make things ever so slightly worse (probably just slightly slower to converge).
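For context on this comment: in Fathom 2, a rule emphasized itself by raising a near-one/near-zero score to its own coefficient, whereas in Fathom 3 the trainer applies coefficients linearly, so the fuzzed constants no longer buy anything. A sketch contrasting the two styles, with illustrative ONEISH/ZEROISH values:

```javascript
// Illustrative near-one/near-zero constants in the Fathom 2 style.
const ONEISH = 0.9;
const ZEROISH = 0.08;

// Fathom 2 style: the rule applied its own coefficient exponentially,
// so scores had to stay strictly inside (0, 1).
function fathom2Score(matches, coeff) {
  return (matches ? ONEISH : ZEROISH) ** coeff;
}

// Fathom 3 style: the rule returns a plain [0, 1] score; the trained
// coefficient is applied linearly by Fathom itself, so exact 0 and 1
// are fine.
function fathom3Score(matches) {
  return matches ? 1 : 0;
}
```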
```js
   * Using coefficients passed into the constructor method, returns a weighted
   * ruleset used to score elements in an HTML document.
   *
   * @param {Array[]} An array of [string, number] tuples where the first element
```
Is it legal to leave out the param name? In any case, we should either document both params or neither.
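(For reference: JSDoc parses the first word after the type as the parameter name, so leaving the name out makes "An" the name. One way to document both parameters; the names 'coeffs' and 'bias' are assumed here, not taken from the actual signature:)

```js
 * @param {Array[]} coeffs An array of [string, number] tuples where the
 *   first element is a rule name and the second is its trained coefficient.
 * @param {number} bias The trained bias for this feature.
```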
TL;DR: Using the new 'fathom-test' CLI, per-tag and per-page testing accuracy was measured for each feature; results are below.

Longer version

This is not an apples-to-apples comparison with the past test accuracy measure, since that was looking at whether a "product" (being the sum of its "image", "price" and "title") was found on a page. This is much better than the 40% from my previous approach for testing accuracy (granted, that was on a different test set of 40 pages), but it's still nowhere near the upper-80s and 90s percent accuracy seen in the training and validation runs, despite these samples coming from the exact same corpus as the training and validation samples.

Image testing accuracy

(venv) bdanforth ~/Projects/price-tracker (fathom3) $ fathom-test src/extraction/fathom/vectors/vectors_test_image.json '{"coeffs": [["isAboveTheFoldImage", 16.586257934570312], ["isBig", 21.33774757385254], ["hasSquareAspectRatio", 1.1075100898742676], ["hasBackgroundInID", -10.287492752075195]], "bias": -10.176139831542969}'
Testing accuracy per tag: 0.99000 95% CI: (0.98412, 0.99588) FP: 0.003 FN: 0.007
Testing accuracy per page: 0.75000 95% CI: (0.56022, 0.93978)
Testing per-page results:
success on amazon-40%20(22).html. Confidence: 0.45426598
success on amazon-40%20(30).html. Confidence: 0.79204732
failure on amazon-40%20(37).html. Confidence: 0.71261239 Highest-scoring element was a wrong choice.
First target at index 7: 0.18024161
success on amazon-40%20(8).html. Confidence: 0.45773035
failure on best_buy-40%20(11).html. Confidence: 0.06214701 Highest-scoring element was a wrong choice.
First target at index 2: 0.03152284
success on best_buy-40%20(24).html. Confidence: 0.31307796
failure on best_buy-40%20(26).html. Confidence: 0.06214701 Highest-scoring element was a wrong choice.
First target at index 2: 0.05037770
success on best_buy-40%20(5).html. Confidence: 0.10311700
success on ebay-1%20(15).html. Confidence: 0.20423122
success on ebay-1%20(30).html. Confidence: 0.81114835
failure on home_depot-40%20(17).html. Confidence: 0.74345410 Highest-scoring element was a wrong choice.
First target at index 1: 0.73110878
success on home_depot-40%20(21).html. Confidence: 0.73110878
success on home_depot-40%20(25).html. Confidence: 0.97135264
success on home_depot-40%20(28).html. Confidence: 0.73110878
success on home_depot-40%20(32).html. Confidence: 0.73110878
failure on home_depot-40%20(38).html. Confidence: 0.74345410 Highest-scoring element was a wrong choice.
First target at index 1: 0.73110878
success on walmart-40%20(20).html. Confidence: 0.60515302
success on walmart-40%20(25).html. Confidence: 0.60668099
success on walmart-40%20(38).html. Confidence: 0.58588487
success on walmart-40%20(39).html. Confidence: 0.53866470

Title testing accuracy

(venv) bdanforth ~/Projects/price-tracker (fathom3) $ fathom-test src/extraction/fathom/vectors/vectors_test_title.json '{"coeffs": [["isNearImageTopOrBottom", 7.078092575073242]], "bias": -1.6698582172393799}'
Testing accuracy per tag: 0.83871 95% CI: (0.70923, 0.96818) FP: 0.000 FN: 0.161
Testing accuracy per page: 1.00000 95% CI: (1.00000, 1.00000)
Testing per-page results:
success on amazon-40%20(22).html. Confidence: 0.99099052
success on amazon-40%20(30).html. Confidence: 0.64029801
success on amazon-40%20(37).html. Confidence: 0.24906397
success on amazon-40%20(8).html. Confidence: 0.97334236
success on best_buy-40%20(11).html. Confidence: 0.24906397
success on best_buy-40%20(24).html. Confidence: 0.69247615
success on best_buy-40%20(26).html. Confidence: 0.24906397
success on best_buy-40%20(5).html. Confidence: 0.76132601
success on ebay-1%20(15).html. Confidence: 0.99085999
success on ebay-1%20(30).html. Confidence: 0.99085999
success on home_depot-40%20(17).html. Confidence: 0.24906397
success on home_depot-40%20(21).html. Confidence: 0.98728508
success on home_depot-40%20(25).html. Confidence: 0.82306099
success on home_depot-40%20(28).html. Confidence: 0.93691045
success on home_depot-40%20(32).html. Confidence: 0.93691045
success on home_depot-40%20(38).html. Confidence: 0.24906397
success on walmart-40%20(20).html. Confidence: 0.98574233
success on walmart-40%20(25).html. Confidence: 0.97179431
success on walmart-40%20(38).html. Confidence: 0.98574233
success on walmart-40%20(39).html. Confidence: 0.98574233

Price testing accuracy

(venv) bdanforth ~/Projects/price-tracker (fathom3) $ fathom-test src/extraction/fathom/vectors/vectors_test_price.json '{"coeffs": [["hasDollarSign", 1.0177843570709229], ["isAboveTheFoldPrice", -5.301823616027832], ["hasPriceInID", 5.333859443664551], ["hasPriceInParentID", -7.5635271072387695], ["hasPriceInClassName", 1.155443787574768], ["hasPriceInParentClassName", 3.0024354457855225], ["fontIsBig", 11.338400840759277], ["isNearImage", 0.7539440989494324], ["hasPriceishPattern", 5.222956657409668]], "bias": -7.09004545211792}'
Testing accuracy per tag: 0.99458 95% CI: (0.98847, 1.00000) FP: 0.000 FN: 0.005
Testing accuracy per page: 0.95000 95% CI: (0.85448, 1.00000)
Testing per-page results:
failure on amazon-40%20(22).html. Confidence: 0.01885081 Highest-scoring element was a wrong choice.
First target at index 1: 0.01707094
success on amazon-40%20(30).html. Confidence: 0.01656970 No target nodes. Assumed negative sample.
success on amazon-40%20(37).html. Confidence: 0.01240653 No target nodes. Assumed negative sample.
success on amazon-40%20(8).html. Confidence: 0.82405251
success on best_buy-40%20(11).html. Confidence: no candidate nodes. Assumed negative sample.
success on best_buy-40%20(24).html. Confidence: 0.88195562
success on best_buy-40%20(26).html. Confidence: no candidate nodes. Assumed negative sample.
success on best_buy-40%20(5).html. Confidence: 0.88763416
success on ebay-1%20(15).html. Confidence: 0.15543531 No target nodes. Assumed negative sample.
success on ebay-1%20(30).html. Confidence: 0.14137758
success on home_depot-40%20(17).html. Confidence: no candidate nodes. Assumed negative sample.
success on home_depot-40%20(21).html. Confidence: 0.99791569
success on home_depot-40%20(25).html. Confidence: 0.99840385
success on home_depot-40%20(28).html. Confidence: 0.99801970
success on home_depot-40%20(32).html. Confidence: 0.99818760
success on home_depot-40%20(38).html. Confidence: no candidate nodes. Assumed negative sample.
success on walmart-40%20(20).html. Confidence: 0.82440346
success on walmart-40%20(25).html. Confidence: 0.79404730
success on walmart-40%20(38).html. Confidence: 0.81192786
success on walmart-40%20(39).html. Confidence: 0.83670110
How did you measure overall "product" success? Did you do manual math to intersect the per-page successes of the 3 types?
I suspect from your linked description that you manually kept track of successes on the hacked-up copy of Price Tracker to determine the past score. Is that true? (I want to make sure you understand the old FathomFox Trainer also tested one type at a time.)
Yes. That may be a nice enhancement for the
I did manually keep track of successes, but I'm not sure what you mean by "hacked up copy" -- the copy of Price Tracker I used was this PR. I do understand that Fathom's current and previous testing method runs on a per-feature basis. Sorry, I misspoke a bit when I said:
What I meant to say was that, while "title" and "price" testing accuracy (per page) is within a few percentage points of their training and validation accuracy, "image" testing accuracy per page is a full 10% lower, at 75%.
"Hacked up": I was referring to the temporary changes you made in #317 (comment). I also had in mind some more extensive changes you'd made, but that must have been in a similar ticket. So it's not that hacked-up after all. :-)
Oh, good. Now we're getting into the realm of explicability. Could be legit unluckiness at this point. For a 10% change, we'd have only to do worse on 2 samples out of the 20 used. For login-forms, I used more like 60.
Training numbers are:
image
title
price