Skip to content

Refactor TrackerEvents for event payload creation (close #291)#293

Merged
Miranda Wilson (mscwilson) merged 7 commits intorelease/0.12.0from
issue/291-refactor_trackerevent
Jan 18, 2022
Merged

Refactor TrackerEvents for event payload creation (close #291)#293
Miranda Wilson (mscwilson) merged 7 commits intorelease/0.12.0from
issue/291-refactor_trackerevent

Conversation

@mscwilson
Copy link
Copy Markdown
Contributor

For issue #291, as part of a larger refactoring of Tracker and Emitter to allow event sending retry and batching by byte size.

Tracker.track(event) now directly creates a TrackerPayload, which is passed to the Emitter for storage and sending. The Tracker event processing is currently within the main thread.

TrackerEvent has been removed. Event sending callbacks have been partially removed - the functionality is gone but the builder options remain for now.

@snowplowcla Snowplow CLA bot (snowplowcla) added the cla:no [Auto generated] Snowplow Contributor License Agreement has not been signed. label Jan 17, 2022
@mscwilson Miranda Wilson (mscwilson) removed the cla:no [Auto generated] Snowplow Contributor License Agreement has not been signed. label Jan 17, 2022
Copy link
Copy Markdown
Collaborator

@AlexBenny AlexBenny left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great job! 👍
I think it's all good, I just left some minor comments!

About the SimpleEmitter: we need to check if the behaviour is exactly the one we can achieve with BatchEmitter with bufferOption=1. In that case we can remove it (maybe in a different GH issue).

About the callbacks: we can remove them in the builder too. I'm sure someone use them but we can't provide any particular useful info except the number of events sent and failed, in case. It's something I wouldn't add now btw.

About the event-processing thread: Now, the Emitter has a pool of thread to send the events. Then, we will have the Emitter with a single thread waiting the response of the last request. So, in theory, the Emitter should have a single running thread polling on EventStore and using the back-off-retry feature, meanwhile the Tracker should have the thread-pool and a new thread would be launched each time an event is tracked.

Comment thread src/main/java/com/snowplowanalytics/snowplow/tracker/Tracker.java Outdated
Comment on lines +268 to +273
/**
* Builds the final event context.
*
* @param entities the base event context
* @return the final event context json with many entities inside
*/
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Opinion sharing:
I don't see much value in overzealous javadoc on private methods (I know you just copied/pasted this code).
I feel that in private methods we can remove these when the method signature is already self-explanatory, or just leaving a simple comment like "Builds the final event context." without the other rows.
Not something to change, btw.

Comment thread src/main/java/com/snowplowanalytics/snowplow/tracker/emitter/BatchEmitter.java Outdated
@mscwilson
Copy link
Copy Markdown
Contributor Author

The main difference between SimpleEmitter and BatchEmitter is GET vs POST. So it's an opportunity to be opinionated :)

@mscwilson
Copy link
Copy Markdown
Contributor Author

I've copied and pasted the threadpool (and threadfactory) from Emitter into Tracker. I left the Emitter threads as is for now.

@mscwilson Miranda Wilson (mscwilson) marked this pull request as ready for review January 17, 2022 18:38
.build());

// Then
Thread.sleep(500);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm surprised we have to slow the tests down this much, half a second is quite the delay, especially when we're dealing with a mock emitter. This makes me think of the F in FIRST.

As an aside, this also makes me curious about the performance of this change, does is have similar perfromance properties to previous releases? A new threading paradigm could have some unexpected consequences that we'd be wise to test more deeply with some performance testing.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't try shorter waits - 500ms was already being used for the Emitter tests. I'll see if it can be improved.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried shorter wait times, but they all failed tests occasionally, even 450ms.

Copy link
Copy Markdown
Contributor

@paulboocock Paul Boocock (paulboocock) Jan 18, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this suggest that even in a "real" environment, the difference between created timestamp and sent timestamp is at least 500ms? That seems reasonably high to me. Might be out of scope here but worth investigating the cause if true.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have we got any "real" Java tracker data lying around anywhere which we could compare timestamps?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is out of scope here yeah. We're planning to finish this PR soon without looking into performance. The simple-console is broken currently, so I'll fix that next and then extend it for performance testing.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the test is a good parameter to compare the performance. The Thread.sleep is mostly needed to allow the Thread to start and do the job before the test is finished and it mostly depends by the machine where they are running on. I think, if we have concerns on this we should compare the two implementations in a real test with simple-console as suggested. However, we need to decide what we want to test. I guess we want to test how fast the track method can return back. You mentioned the difference between created_tstamp and sent_tstamp but even if the gap is longer the derived_tstamp would mitigate the discrepancy. We base our data modelling on the derived_tstamp, so even a longer gap between the two shouldn't be disruptive for the customer. IMO if we are really concerned about the performance of the track method we could add a buffer in the input consumed by the tracker thread-pool which would make the performance exactly the same as before.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Some areas like SimpleEmitter look much cleaner, removing those callbacks makes things a lot nicer.

Copy link
Copy Markdown
Collaborator

@AlexBenny AlexBenny left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 LGTM

I think we could follow this plan:

  • proceed merging this PR,
  • fix the simple-console adding a building step in the CI workflow
  • test performance between this and previous version
  • (optional) improve performance if needed
  • add back-off-retry

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

4 participants