Posted on May 15, 2022, 8:56 a.m.

So I want this to be a quick one. You could easily write entire essays on this topic, but I would rather not go into that level of detail.

Let's have a quick look at testing methodology. If you've ever requested a team build example from me, you might have noticed that I test on a particular set of songs. I've always wondered whether those songs are the best ones to use as examples for teambuilding, so I decided to do a quick study. Hopefully this also answers some questions and sheds some light on the general teambuilding process when it comes to validating team power.

Why timer neutrality is important

Let's have a look at the following simulation result. What kind of opinion does this result create?

It looks like the first team is significantly weaker than the second one, doesn't it? The second team looks about 8% stronger. Well, how about we have a look at a different song?

Wait, now the first team is better? The first team is 3.7% better in this example. What is going on here?

This is an example of the issues that come up when testing teams on songs that are not very timer neutral. In the first example, Valkyria is one of the most 11s-biased Lv30 alltypes in the entire game. The first team is 7s, which has an estimated coverage of only 61.63% on that song, while the second team is 11s, which has an estimated coverage of 69.13%. That is a significant difference. Interestingly, the coverage gap is 7.5 percentage points, close to the roughly 8% difference we saw in the computed results.

The second example was computed with Hifi days, which is a 7s-favored song. 7s coverage is about 65.85% while 11s coverage is 62.12%, a gap of about 3.7 percentage points, which again lines up with the computed result. This is why timer neutrality is important when testing teams: if each team has a different timer and each song has a different timer bias, it is hard to compare team against team. The bias adds confounding factors that make it hard to judge objectively which team is actually better.
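
To make the idea of "coverage" a bit more concrete, here is a minimal Python sketch. It is not how the numbers in this post were computed. It assumes a simplified model where a skill with a given timer switches on at every multiple of its interval (starting from the first interval, not from zero), stays on for a fixed duration, and always activates. The function name, the parameters and the toy chart are all made up for illustration.

```python
def estimated_coverage(note_times, interval, duration):
    """Fraction of notes that fall inside an active skill window.

    Assumed model (hypothetical, simplified): the skill turns on at
    t = interval, 2 * interval, ... and stays on for `duration` seconds.
    Activation chance and end-of-song cutoffs are ignored.
    """
    if not note_times:
        return 0.0
    covered = 0
    for t in note_times:
        # Start of the timer cycle this note belongs to.
        cycle_start = (t // interval) * interval
        # The very first cycle (before t = interval) has no active skill.
        if cycle_start >= interval and t - cycle_start < duration:
            covered += 1
    return covered / len(note_times)


# Toy chart (made up): one note per second for 28 seconds.
toy_notes = list(range(28))
cov_7s = estimated_coverage(toy_notes, interval=7, duration=3)
cov_11s = estimated_coverage(toy_notes, interval=11, duration=3)
print(f"7s: {cov_7s:.2%}, 11s: {cov_11s:.2%}")
```

Even on this toy chart the two timers end up covering a different share of the notes, which is the effect that skews the comparisons above.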

What Songs to Use

Let's begin by having a look at the two teams again. One of the most popular songs of all time to test on is Samakani M+. It is a very well-known timer-neutral Lv30 MASTER+ and you might have seen me use it a lot. In fact, here are the two example teams again on Samakani M+. Notice how small the difference is now.

Let's think about the most popular timers. The vast majority of cards in this game are 7s, 9s and 11s. There are also less common high timers like 13s, and there's the rare and insanely good 4s. Because we only really need ONE chart to test on, I'll just list the top few MASTER+ songs along with their bias%, which is the maximum coverage difference between any two timers.
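
As a rough sketch of the bias% idea, the Python snippet below takes the coverage of each timer on a song and returns the largest gap between any two of them. The function name and the `exclude` parameter are my own for illustration. The only inputs used here are the 7s and 11s coverage values quoted earlier for Valkyria and Hifi days, so the results are two-timer gaps, not the full bias% figures in the lists that follow (those consider more timers).

```python
def timer_bias(coverage_by_timer, exclude=()):
    """Largest coverage gap (in percentage points) between any two timers.

    `coverage_by_timer` maps a timer label such as "7s" to its coverage in
    percent. Timers listed in `exclude` (for example "13s") are ignored.
    """
    values = [
        coverage
        for timer, coverage in coverage_by_timer.items()
        if timer not in exclude
    ]
    if len(values) < 2:
        raise ValueError("need at least two timers to compare")
    return max(values) - min(values)


# Coverage values quoted earlier in this post (7s and 11s only).
valkyria = {"7s": 61.63, "11s": 69.13}
hifi_days = {"7s": 65.85, "11s": 62.12}

print(round(timer_bias(valkyria), 2))   # 7.5
print(round(timer_bias(hifi_days), 2))  # 3.73
```

Passing `exclude=("13s",)` on a dictionary that also contains a 13s entry would give the "WITHOUT 13s" style of figure.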

All Type WITHOUT 13s

SARABA, Itoshiki Kanashimi tachi yo - 0.7927% Bias
New Bright Stars (LEGACY) - 0.8371% Bias
Life is HaRMONY - 1.2953% Bias
GOIN' - 1.3670% Bias
comic cosmic - 1.7447% Bias

All Type WITH 13s

Inshou - 2.3817% Bias
Kokoro Moyou - 2.3938% Bias
Samakani - 2.5619% Bias
∀NSWER - 2.6648% Bias
Hungry Bambi - 2.7953% Bias

An interesting fact to note is that Alltype songs generally rank poorly in terms of neutrality. Samakani does not even make it into the top 20 most neutral songs when comparing all timers. Including 13s also dramatically increases the maximum bias. At the top of the meta, 13s is not a relevant timer in alltype songs, but when considering any and all possible teams it is better to include 13s as well.

Cute WITHOUT 13s

Onedari Shall We - 0.6843% Bias
shabon song - 0.7061% Bias
Hanikami days - 0.8634% Bias
Priceless Donatcyu - 1.3541% Bias
Himitsu no Toware - 1.4213% Bias

Cute WITH 13s

Onedari Shall We - 1.5552% Bias
Hanikami days - 1.6476% Bias
S(mile)ING! - 1.7276% Bias
Priceless Donatcyu - 1.8372% Bias
Meruhen Debut! - 1.9578% Bias

Cute has a consistent winner: Onedari Shall We takes first place in both categories.

Cool WITHOUT 13s

Saite Jewel - 0.2379% Bias (Best song in the game)
Mikansei no Rekishi - 0.4666% Bias
Last Kiss - 0.4768% Bias
Mori no kuni kara - 1.4687% Bias
One Life - 1.8727% Bias

Cool WITH 13s

Saite Jewel - 0.5142% Bias (Best song in the game)
Last Kiss - 1.3451% Bias
Dangan Survivor - 2.2968% Bias
NATSU KOI - 2.4731% Bias
Mikansei no Rekishi - 2.6036% Bias

Saite Jewel is actually the best chart ever made. It's the most neutral chart in the game and it's also the only chart in the game to have a bias of under 1% when including 13s timers!

Passion WITHOUT 13s

O-Ku-Ri-Mo-No Sunday! - 0.7163% Bias
EVIL LIVE - 0.7638% Bias
Twinkle Tail - 1.0023% Bias
Miracle Telepathy - 1.3134% Bias
THE VILLAIN'S NIGHT - 1.3600% Bias

Passion WITH 13s

Twinkle Tail - 1.0023% Bias
Honoo no Hana - 1.7376% Bias
THE VILLAIN'S NIGHT - 1.7622% Bias
Watashi Iro Gift - 2.1066% Bias
Osanpo Camera - 2.3134% Bias

And finally, for Passion, it seems like there are a lot of good Lv30 choices.

The conclusion

So what songs am I going to use in the future for testing? Judging by the data on hand, I think I will stop using Evil Live M+ (2.8046% bias to 13s), Fascinate M+ (5.1532% bias to 13s) and Babel M+ (5.4633% bias to 13s). Instead I will use Twinkle Tail M+, Saite Jewel M+ and Onedari Shall We M+, which will provide greater accuracy when comparing teams. All in the name of science, to improve testing methodology and provide better analysis across the board.
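
For anyone who wants to redo this selection, the short Python sketch below picks the lowest-bias song per type. The numbers are copied from the "WITH 13s" lists above. The dictionary layout and the function name are only illustrative.

```python
# Bias% values copied from the "WITH 13s" lists in this post.
BIAS_WITH_13S = {
    "Cute": {
        "Onedari Shall We": 1.5552,
        "Hanikami days": 1.6476,
        "S(mile)ING!": 1.7276,
        "Priceless Donatcyu": 1.8372,
        "Meruhen Debut!": 1.9578,
    },
    "Cool": {
        "Saite Jewel": 0.5142,
        "Last Kiss": 1.3451,
        "Dangan Survivor": 2.2968,
        "NATSU KOI": 2.4731,
        "Mikansei no Rekishi": 2.6036,
    },
    "Passion": {
        "Twinkle Tail": 1.0023,
        "Honoo no Hana": 1.7376,
        "THE VILLAIN'S NIGHT": 1.7622,
        "Watashi Iro Gift": 2.1066,
        "Osanpo Camera": 2.3134,
    },
}


def most_neutral(bias_by_song):
    """Return the song title with the smallest bias%."""
    return min(bias_by_song, key=bias_by_song.get)


picks = {song_type: most_neutral(songs) for song_type, songs in BIAS_WITH_13S.items()}
print(picks)
# {'Cute': 'Onedari Shall We', 'Cool': 'Saite Jewel', 'Passion': 'Twinkle Tail'}
```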

PS: These values were computed with the settings set on (]

Bonus: What makes a good test chart?

Is there a reason only MASTER+ was considered here? I think that going the MASTER+ route selects songs that generally have a higher note count, which gives most timers (not just the 4 stated earlier) a better shot at being treated fairly. MASTER+ is also the only difficulty that carries the Slide rhythm icon. Outside of MASTER+ there are actually several charts that beat Saite Jewel MASTER+ in terms of raw bias values.

Saite Jewel MASTER - 0.4695% Bias
Hacking to the Gate DEBUT - 0.4615% Bias
Palette PRO - 0.3058% Bias
Honoo no Hana PRO - 0.3033% Bias

Note density also matters in some cases. For example, skills with a healing effect benefit from a high note count, which in turn affects the viability of skills like Life Sparkle. While it is possible to peg the note count at the average of all songs and look for a fair chart around that note count, I don't think that gives you an idea of the full potential of the team being built.

This is also why I prefer choosing songs with a higher difficulty level if possible. Higher difficulty means a higher multiplier, which makes differences in composition much easier to see.

What else is there to consider? The note type distribution is one such factor. Too much of any particular note type will make the score favor specific act cards. This is a difficult confounder to control, as there is no real way of making sure the chart you're selecting is close to the game-wide average share of slides, flicks and so on. We simply have to accept that this problem can exist, and prefer non-act cards in general, choosing act cards only when targeting specific charts.
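
If you did have per-note type data for a chart, a rough check could look like the Python sketch below. Everything in it is hypothetical: the function names, the note type labels and the toy inputs are made up, and the post does not provide a game-wide reference distribution, so you would have to supply your own.

```python
from collections import Counter


def note_type_shares(note_types):
    """Share of each note type in a chart, as fractions that sum to 1."""
    counts = Counter(note_types)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kind: n / total for kind, n in counts.items()}


def largest_deviation(chart_shares, reference_shares):
    """Biggest absolute gap between a chart's shares and a reference."""
    kinds = set(chart_shares) | set(reference_shares)
    return max(
        (abs(chart_shares.get(k, 0.0) - reference_shares.get(k, 0.0)) for k in kinds),
        default=0.0,
    )


# Toy inputs (made up, not real chart data).
toy_chart = ["tap", "tap", "flick", "slide"]
toy_reference = {"tap": 0.5, "flick": 0.5}

shares = note_type_shares(toy_chart)
print(shares)
print(largest_deviation(shares, toy_reference))
```

A chart with a small deviation from your chosen reference would be less likely to favor one kind of act card, but picking that reference is exactly the part that is hard to pin down.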


Testing Methodology - Timer Neutrality 101