Hypothesis tests such as the Anderson-Darling or Shapiro-Wilk test check the normality of a distribution. However, when the sample size is very large, these tests become so sensitive that they are practically useless: they will reject the null hypothesis even when the distribution is reasonably close to normal. How should I test normality when the sample size is very large, other than visualizing histograms? The motivation is that I want to automate checking normality of large data sets in a software platform, where everything needs to be automated rather than visualized and inspected manually by humans. One idea that came to mind is that, instead of using the Shapiro-Wilk test, I could calculate the skewness and kurtosis of the distribution, and if both are within $\pm 1.0$, assume that my large data set is "reasonably" normally distributed. Is my approach correct, or are there other alternatives?
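To make the idea concrete, here is a minimal base-R sketch of the kind of automated check I have in mind (the function name, the moment-based formulas, and the $\pm 1.0$ tolerance are just placeholder choices, not an established recipe):

# Sketch of an automated "roughly normal" check based on sample moments.
# Skewness and excess kurtosis are both 0 for a normal distribution,
# so the check passes when both are within a chosen tolerance.
is_roughly_normal <- function(x, tol = 1.0) {
  z <- (x - mean(x)) / sd(x)
  skew <- mean(z^3)             # sample skewness
  excess_kurt <- mean(z^4) - 3  # sample excess kurtosis
  abs(skew) <= tol && abs(excess_kurt) <= tol
}

set.seed(1)
is_roughly_normal(rnorm(1e6))   # TRUE for simulated normal data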
Comment (Jun 24, 2019 at 6:41): Don't know about just looking at skewness and kurtosis. Seems the main issue is to be realistic about what closeness to normality is needed for the application at hand.

Comment (Jun 24, 2019 at 7:43): What's your motivation to check normality in the first place? Why is normality important in your application?

Comment (Jun 24, 2019 at 8:44): Shapiro, not Saphiro. Hypothesis tests of assumptions answer the wrong question (e.g. see here); when it comes to assumptions of tests, I suggest avoiding them at any sample size.

Comment (Jun 25, 2019 at 1:35): 1. Please fix the spelling in your question as pointed out earlier. 2. Some of the answers at that other post go rather further. Indeed, I'd say that large sample sizes just make the uselessness obvious; more generally, such checks are not only unhelpful but often counterproductive, leading you to do exactly the wrong thing while also screwing up the properties of your significance levels and p-values.

Comment (Jun 25, 2019 at 1:39): 3. I challenge the premise of the question -- given the problems with choosing an analysis on the basis of what you find in the data, automated checking doesn't strike me as being as useful as building something that's more robust to violations of anything you can't reasonably assume.

Continuation from comment: If you are using simulated normal data from R, then you can be quite confident that what purport to be normal samples really are, so there shouldn't be "quirks" for the Shapiro-Wilk test to detect.
Checking 100,000 standard normal samples of size 1000 with the Shapiro-Wilk test, I got rejections just about 5% of the time, which is what one would expect from a test at the 5% level.
set.seed(2019)
pv = replicate(10^5, shapiro.test(rnorm(1000))$p.val)
mean(pv <= 0.05)  # proportion of rejections at the 5% level
Addendum. By contrast, the distribution $\mathsf{Beta}(20,20)$ "looks" very much like a normal distribution, but isn't exactly normal. If I do the same simulation for this approximate model, Shapiro-Wilk rejects about 7% of the time. Looked at from the perspective of power, that's not great. But it seems Shapiro-Wilk is sometimes able to detect that the data aren't exactly normal.
This is a long way from "always," but I think $\mathsf{Beta}(20,20)$ is closer to normal than a lot of real-life "normal" data are. (And the link says always may be "a bit strongly stated." I suspect the greatest trouble may come with samples a lot bigger than 1000, and for some normal approximations that are quite useful--even if imperfect.) "Not every statistically significant difference is a difference of practical importance." Sometimes, people who should know better seem to forget that when doing goodness-of-fit tests.
set.seed(2019)
pv = replicate(10^5, shapiro.test(rbeta(1000, 20, 20))$p.val)
mean(pv <= 0.05)  # proportion of rejections at the 5% level
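As a rough check on the suspicion about larger samples, the same simulation can be rerun with a bigger sample size to see how the rejection rate against $\mathsf{Beta}(20,20)$ grows. This is only a sketch: the choices of n = 5000 and 10^4 replications are arbitrary, and R's shapiro.test() only accepts sample sizes between 3 and 5000.

# Same Beta(20,20) simulation with a larger sample size, to see how the
# rejection rate (power against this near-normal alternative) changes.
# Note: shapiro.test() requires 3 <= n <= 5000.
set.seed(2019)
pv = replicate(10^4, shapiro.test(rbeta(5000, 20, 20))$p.val)
mean(pv <= 0.05)  # proportion of rejections at the 5% level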