Python - speed up finding percentile of set which is greater than threshold

debugcn Published at Dev

kilojoules

I need to find which percentile of a group of numbers is over a threshold value. Is there a way that this can be speed up? My implementation is much too slow for the intended application. In case this changes anything, I am running my program using mpirun -np 100 python program.py. I cannot use numba, as the rest of this program uses try/except statements.

import numpy as np
my_vals = []
threshold_val = 0.065
for i in range(60000):
    my_vals.append(np.random.normal(0.05, 0.02))

for i in np.arange(0,100,0.001):
    if np.percentile(my_vals,i) > threshold_val:
        perc = 1*i
        break
else: perc = 100

Matt Jordan

Since the Gaussian (Normal) distribution produces a bell-curve, you should be able to calculate the percentile with the highest probability of being optimal and then write your code to check there first, and then use a modified binary search to find the optimal lowest threshold.

For example, if you determine that your parameters are most likely to favor e.g. 17.951 (this is an example only, I didn't actually bother computing it), then begin near that point rather than starting at 0. Treat this like a binary search - start your lower limit at 0 and your upper limit at 100.0, and set the point to bisect the list as the optimal percentile for the distribution.

If your current upper limit is over threshold_val, bisect the lower half to find the lowest such value that matches; if it is not over the threshold, bisect the upper half, etc. So, e.g. in the range 0.000 to 100.000, if you start at 17.951 and find that it is not above the threshold, adjust to bounds to 17.952 to 100.000 and try 58.976 (halfway between). As soon as you find a value that is above the threshold, then use that value as the upper bound (since it is a non-optimal answer). Continue this process until the lower and upper bounds are 0.001 apart, which gives you the optimal answer. On average, you should have to run about 17 tests rather than 100,000.

You may also be able to automate the computation of the optimal value in case your normal distribution will change, since the distribution produces a bell-curve, and you will know the statistics of that bell-curve based on the parameters anyway.

Your solution only needs to find the lowest value for which the percentile is above your threshold, so this approach should minimize the number of samples you need to check.

One more hint: np.percentile has to sorts the my_vals 100,000 times in your code; I do not know whether a pre-sorted list would help, but it may be worth checking (you'll probably have to test several possible sort parameters since it doesn't appear to be documented in which direction it sorts).

Collected from the Internet

Please contact [email protected] to delete if infringement.

edited at2021-07-16

Comments

0 comments

From Dev

Related Related

Article