Follow @nxtchg

Author Topic: Time Sync  (Read 2175 times)

NxtChg

  • Overlord
  • *****
  • Posts: 1114
  • Respect: +61
    • View Profile
Time Sync
« on: December 27, 2015, 09:51:03 am »
0
BTW Happy Holidays Overlord!

Overlord is working. No holidays for overlord.
Tentacle Overlord, The Deranged Genius of The Abyss

NxtChg

  • Overlord
  • *****
  • Posts: 1114
  • Respect: +61
    • View Profile
Re: Time Sync
« Reply #1 on: December 27, 2015, 12:56:32 pm »
0
Overlord is trying to figure out why one of his servers suddenly stopped running all the tasks (which was nicely detected by that neat network monitoring tool that he wrote)...

Overlord already installed a debugger on the server and attached it to the process.

He now suspects that the mean QueryPerformanceCounter() suddenly jumped in value for no apparent reason...

--
That's my holidays :)


UPDATE:
  Looks like I was right. The difference between current QPC time and what's in memory is about 64 days. Argh! That's not very nice...
  Now I have to worry about ridiculous timings in every place I use it.

  By the way, x64dbg is a very good program, and they accept Bitcoin donations.
  If you are a programmer and will find it useful, don't forget to send them some love.
Tentacle Overlord, The Deranged Genius of The Abyss

NxtChg

  • Overlord
  • *****
  • Posts: 1114
  • Respect: +61
    • View Profile
Re: Time Sync
« Reply #2 on: December 27, 2015, 10:46:18 pm »
0


This is NOT how you time-sync your servers ;D

Well, actually, it's not that bad, you can see that yellow and green are compensating for the abuse of the blue, but why the hell the blue went bananas is a complete mystery...

At first I thought it's because of the remote backup, but it runs at a different time...

And QPC troubles were on the yellow server (you can see it froze at some point), so it's not that.

Need to investigate more...
Tentacle Overlord, The Deranged Genius of The Abyss

NxtChg

  • Overlord
  • *****
  • Posts: 1114
  • Respect: +61
    • View Profile
Re: Time Sync
« Reply #3 on: December 28, 2015, 09:53:23 am »
0
OK, part of the mystery is solved - this is how Windows Time actually works:

Quote
When the time service has determined which time sample is best, based on the above criteria,
it adjusts the local clock rate to allow it to converge toward the correct time. If the time
difference between the local clock and the selected accurate time sample
is too large to correct by adjusting the local clock rate, the time service sets the local clock
to the correct time.

What is not clear is why it seems to be adjusting the time in the opposite direction.

It moved the time backwards more than 2 minutes during one hour, before giving up and finally just setting it to the correct value. Need to add more debug logging and run the servers again for a day or two...

This is, by the way, a new and improved time-sync algorithm, completely rewritten. I solved the small drift problem and increased performance by about 3 times.

You can see that when nobody intervenes it holds the value dead still for hours, which means all nodes are synchronized within 1 second.
Tentacle Overlord, The Deranged Genius of The Abyss

NxtChg

  • Overlord
  • *****
  • Posts: 1114
  • Respect: +61
    • View Profile
Re: Time Sync
« Reply #4 on: December 30, 2015, 12:04:49 pm »
0
So far, so good. All 3 servers are still synchronized within 1 second after 48 hours.

I've made some changes to the algorithm, and seems like they help a bit.

Of course, the real picture will be clear once I download and process all the telemetry, but so far it looks better than the previous run.
Tentacle Overlord, The Deranged Genius of The Abyss

wizzardTim

  • Jr. Minion
  • **
  • Posts: 80
  • Respect: +5
    • View Profile
Re: Time Sync
« Reply #5 on: December 30, 2015, 10:01:36 pm »
0
This is, by the way, a new and improved time-sync algorithm, completely rewritten. I solved the small drift problem and increased performance by about 3 times.

Good news ! This is an important achievement

NxtChg

  • Overlord
  • *****
  • Posts: 1114
  • Respect: +61
    • View Profile
Re: Time Sync
« Reply #6 on: December 30, 2015, 10:10:50 pm »
0
Good news ! This is an important achievement

Yes, time sync becomes very important when your goal is fast confirmation time.

We will see how it goes soon, maybe tomorrow, when I process the telemetry for the last 3 days...

It would also be interesting to turn off Windows Time and see how the algorithm would behave without interventions.

I looked in the log briefly and immediately noticed WM_TIMECHANGE messages, which I now track to detect time adjustments.
Tentacle Overlord, The Deranged Genius of The Abyss

NxtChg

  • Overlord
  • *****
  • Posts: 1114
  • Respect: +61
    • View Profile
Re: Time Sync
« Reply #7 on: January 01, 2016, 06:00:18 pm »
0
Well, the time roller coaster continues:



One thing is clear - VPS sucks. The system time and QPC on it immediately start to diverge in giant steps, compared to dedicated servers. That's probably the reason why Windows Time needs to constantly adjust it.

So when we will deploy the real network we should strive to avoid VPS.

This means we will probably still need some monitoring group or two, which will estimate node quality and either provide independent rating or reward the best N nodes.

This single VPS forces the other two nodes to constantly compensate for its antics. With only 3 servers, this leads to constant time drift; it should be more smooth with more servers.

But the time between all servers stays within 1.2 sec at all times!


(time diff is in centiseconds)

So, I think, the time sync can be considered finished for now.
Tentacle Overlord, The Deranged Genius of The Abyss

NxtChg

  • Overlord
  • *****
  • Posts: 1114
  • Respect: +61
    • View Profile
Attention to Details
« Reply #8 on: April 26, 2017, 11:36:21 am »
+1
I use simple smoothing function in Simcoin's time sync to adjust node's time difference:

Code: [Select]
time_diff = (time_diff * 7 + new_diff) / 8;
It provides nice integration, which smooths any sudden changes.

It also has a dead zone if you calculate it using integers:



This should help with filtering small random variations.

A faster equivalent of the above code is:

Code: [Select]
time_diff = ((time_diff << 3) - time_diff + new_diff) >> 3;
To my surprise, when I switched to it the standard deviation of time between nodes dropped from about 10 ms to less than 1! An order of magnitude improvement!

What the...? How is this possible?!

Let's investigate.

====

The first thing I noticed is that the dead zone is not symmetrical:



Oops. But since both formulas give exactly the same results for positive numbers, this can't be it.

Let's look at negative numbers:



Aha. These formulas stop being completely equivalent for negative numbers:

Code: [Select]
(-1000 * 7 - 999) /  8 =  -999
(-1000 * 7 - 999) >> 3 = -1000

But the difference is 1, how can this create such a dramatic effect? Let's dig deeper.

Plotting both functions side by side shows that when the value drops, the original function (red) has a dead zone, but the new one (yellow) does not:



What if the value grows? Then the picture flips:



Now the original doesn't have the dead zone and the new one does. So switching to the new function flips the asymmetric dead zone for negative numbers. So what?

Turns out this aligns dead zones for positive and negative numbers. This means that nodes above network time will not move in one direction on slight changes, and the nodes below network time will not move in the same direction! This creates a resistance wall, to which the value "sticks", greatly reducing variation.

Indeed, adding a simple explicit dead zone:

Code: [Select]
if(abs(time_diff - new_diff) < 8) return;
Produces the same result.

A simple tweak to make the dead zone symmetrical improves things further:

Code: [Select]
time_diff = (time_diff * 7 + new_diff + 4) >> 3;


Note, that this tweak doesn't work in case of integer division! For that you also need to clear the lowest 3 bits before dividing:

Code: [Select]
time_diff = (time_diff * 7 + new_diff + 4) & ~7) / 8;
Then both will be identical again.

====

The time sync algorithm is full of such surprises, where tiny, seemingly insignificant details have a big impact.
Tentacle Overlord, The Deranged Genius of The Abyss

NxtChg

  • Overlord
  • *****
  • Posts: 1114
  • Respect: +61
    • View Profile
Re: Attention to Details
« Reply #9 on: April 26, 2017, 07:30:41 pm »
0
Jesus, 27 pages to figure out how to correctly compute (a + b) / 2...

https://hal.archives-ouvertes.fr/file/index/docid/576641/filename/computing-midpoint.pdf
Tentacle Overlord, The Deranged Genius of The Abyss

NxtChg

  • Overlord
  • *****
  • Posts: 1114
  • Respect: +61
    • View Profile
Re: Time Sync
« Reply #10 on: April 29, 2017, 04:03:04 pm »
+1


Added clock drift + Windows Time correction into time sync simulator and did some tests.

One of the things it revealed is that using median is a poor choice. It should provide robust result, and most of the time it does, but it has a nasty property: if the central node drifts a little, but not enough to change its position in the median buffer, it can drag the whole network with it!

Starting to think that using a single time value won't cut it due to conflicting requirements: for convergence you need to arrive at a single value, as close between nodes as possible, but to anchor it to real time and prevent drifting you need to spread the votes apart so they pull in different directions.

Will probably have to split the algorithm into two parts: one for convergence and one for anchoring. This will require sending some additional data with each time vote. Will also probably replace median with interquartile mean or something similar...

The reason I spend so much time on time sync is because it's one of the two absolutely crucial parts of the system. The other is tx consensus.

Having <1 sec confirmations is impossible without tight time synchronization.
Tentacle Overlord, The Deranged Genius of The Abyss

NxtChg

  • Overlord
  • *****
  • Posts: 1114
  • Respect: +61
    • View Profile
The Mean Median
« Reply #11 on: April 30, 2017, 08:57:27 am »
+2
Median is not a very good choice for time sync because when the values converge closely, small random variations make it less stable.

The most dramatic case is when the central node drifts a little, but not enough to change its position in the median array: it then can drag the whole network with it since everybody uses its value as the new network time.

What we want is something like a mean; something that integrates over a large number of votes. But the mean has its own problem: outliers (either honest or malicious) can skew the network time and/or make it fluctuate significantly.

It would be nice to have something that behaves like a median when the values are far apart, but like a mean when they converge close together.

Enter the "Mean Median" :) I haven't read about it before, so I will assume I just invented it :)



Here's how it should be calculated:

1. Find the median of the data set.
2. Cap all the values at some distance from it (in the example above it's +-200).
3. Calculate the mean.

You can immediately see that when all the values are outside of the core range it behaves exactly like the median.
But once more and more values start to converge and fall within the core range it starts behaving like the mean.

The core range should be big enough to cover any random variations and measurement errors, but no bigger.

There is a similar technique called "winsorizing". The problem is that it uses arbitrary criteria (like Nth percentile) to cap the values.

I believe that for our purposes the mean median will be a lot better.

Now just need to test it.
Tentacle Overlord, The Deranged Genius of The Abyss

NxtChg

  • Overlord
  • *****
  • Posts: 1114
  • Respect: +61
    • View Profile
Re: Time Sync
« Reply #12 on: May 01, 2017, 05:06:54 pm »
0
Calculating median without sorting the array:

https://simtalk.org:444/index.php?topic=15.msg1457#msg1457
Tentacle Overlord, The Deranged Genius of The Abyss

NxtChg

  • Overlord
  • *****
  • Posts: 1114
  • Respect: +61
    • View Profile
Time Jumps
« Reply #13 on: May 07, 2017, 07:08:53 am »
+1
When the user or Windows Time adjust node's clock the time can jump suddenly and mess up our tight synchronization.

That's why we need to correctly handle time jumps.

On Windows there is the WM_TIMECHANGE message precisely for that, but we can't use it because it would require the core DLL to create a hidden window just to listen to Windows events and I don't think that's something a library should do. It's also less portable.

So we will do it manually: we will measure the time since the last call to our main time function and compare it to the time returned.

Long story short, here's the code:

Code: [Select]
class TimeJump
{
 public: msecs prev_ts; int64 prev_time;

    TimeJump(){ prev_time = 0; }

    int64 detect(int64 t)
    {
        int64 jump = 0; msecs ts = now;

        if(prev_time)
        {
            const int64 dt  = t  - prev_time;
            const float dts = ts - prev_ts;

            if(dt < -100 || (dt > 100 && dt > dts * 2)){ jump = dt - dts; }
        }

        prev_time = t; prev_ts = ts;
   
        return jump;
    }
};

If the time returned is more than 100 ms in the past or grew by more than 100 ms AND this growth is twice as big as what our time measurement returned, we consider it to be a time jump and calculate the correction value.

Once time jump was detected we need to adjust our time_diff variable (the difference between local time and network time), as well as all the time votes we received  because we store them relative to our local time.

In simulation it works perfectly. The rest of the program is completely oblivious that something has changed and the time values remain consistent.
Tentacle Overlord, The Deranged Genius of The Abyss