Collectives™ on Stack Overflow
Find centralized, trusted content and collaborate around the technologies you use most.
Learn more about Collectives
Teams
Q&A for work
Connect and share knowledge within a single location that is structured and easy to search.
Learn more about Teams
I've inherited some old Stata code (Stata11) that uses the
xtile
function to categorize observations in a vector by their quantiles (in this case, just the standard 5 quintiles, 20%, 40%, 60%, 80%, 100%).
I'm trying to replicate a piece of the code in Python and I am using the SciPy.stats.mstats function
mquantiles()
for the computation.
As near as I can tell from Stata documentation and searching online, the Stata
xtile
method tries to invert the empirical CDF of the data, and uses the equal-weighted average of all observations for which the CDF is flat to make the cutpoint. This seems like a very poor way to categorize quantiles, but it is what it is and I am sure there are cases where this is the right thing to do.
My question is how to make
mquantiles()
produce the same sort of breaking convention. I noticed that this function has two parameters,
alphap
and
betap
(the documentation calls them
alpha
and
beta
but you need the extra 'p' to get it to work, at least I do... I get an error if I just use 'alpha' and 'beta' with Python 2.7.1 and SciPy 0.10.0). But even in the SciPy docs, I can't see whether there's a combination of these parameters that produces the mean over flat CDF ranges.
I see what looks like the option to compute as the median or mode of this range, but not mean (it's also not clear if these SciPy median/mode options with alpha and beta are computed as the median/mode of the
observations
or of the range that would produce the flat CDF value.)
Any help disambiguating these different options and finding some documentation that helps me recreate the Stata convention in Python would be great. Please refrain from answers that just say "write your own quantile function." Firstly, that doesn't help me understand the conventions of either Stata or SciPy, and secondly, given these numerical libraries, writing my own quantile function should be a last resort. I can certainly do it, but it would be bad all around if I need to.
–
–
–
–
–
The scipy.stats.mquantiles documentation was poor and
wrong
in places, fixed now so that might be helpful...
http://docs.scipy.org/scipy/docs/scipy.stats.mstats_basic.mquantiles/
. That process started when you pointed out the alpha/beta, alphap/betap discrepancy. Thank you.
The implementation of mquantiles follow R.
The biggest difference comes from that R has 9 discrete types, where because scipy.stats.mquantiles calculates 'm' from 'alphap' and 'betap', scipy has a continuous range of "types" (for lack of a better word).
I admit that I do not understand all of the ins and outs of the statistics involved so I settled on a brute force evaluation. I found an xtile example at
http://www.biostat.sdu.dk/~biostat/StataReferenceManual/StataRef.pdf
and was able to match the results with alphap=0.5, and betap=0.5 (piecewise linear). Not definitive nor exhaustive, but all I have right now.
In [1]: import scipy.stats as st
In [9]: st.mstats.mquantiles([23,56,67,123,99,17],prob=[0.5],alphap=0.5,betap=.5)
Out[9]: array([ 61.5])
In [10]: st.mstats.mquantiles([23,56,67,123,99,17],prob=[0.33,0.66],alphap=0.5,betap=.5)
Out[10]: array([ 38.84, 81.72])
In [11]: st.mstats.mquantiles([23,56,67,123,99,17],prob=[0.25,0.5,0.75],alphap=0.5,betap=.5)
Out[11]: array([ 23. , 61.5, 99. ])
The last is a little problematic since two of the division points are exactly on values in the data set. Stata/xtile (at least in the examples that I found) does not give the split points for the quantiles but gives the quantiles themselves. Given the sorted data set [17,23,56,67,99,123], Stata/xtile gave the categorization as [1,1,2,3,3,4] which means that for scipy.stat.mquantiles to match the upper bound of a quantile is greater than or equal to all values in that quantile.
Thanks for contributing an answer to Stack Overflow!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.