#1tsm2_mean_date xticklabel_histogram tick mark dist_Skewness Kurtosis_moment_P/E_t-Statist_returNAV



The Law of Averages

     We begin at the beginning, with the law of averages, a greatly misunderstood and misquoted principle. In trading, the law of averages is most often referred to when an abnormally long series of losses is expected to be offset by an equal and opposite run of profits. It is equally wrong to expect a market that is currently overvalued or overbought to next become undervalued or oversold. That is not what is meant by the law of averages. Over a large sample, the bulk of events will be scattered close to the average in such a way that the typical values overwhelm the abnormal events and cause them to be insignificant.

     FIGURE 2.1 The Law of Averages. The normal cases overwhelm the unusual ones. It is not necessary for the extreme cases to alternate—one higher, the next lower—to create a balance.

     This principle is illustrated in Figure 2.1, where the number of average items is extremely large, and the addition of a small abnormal grouping to one side of an average group of near-normal data does not affect the balance. It is the same as being the only passenger on a jumbo jet. Your weight is insignificant to the operation of the airplane and not noticed when you move about the cabin. A long run of profits, losses, or an unusually sustained price movement is simply a rare, abnormal event that will be offset over time by the overwhelmingly large number of normal events. Further discussion of this and how it affects trading can be found in “Gambling Techniques—The Theory of Runs,” Chapter 22.

In-Sample and Out-of-Sample Data

     Proper test procedures call for separating data into in-sample and out-of-sample sets. This will be discussed in Chapter 21, System Testing. For now, consider the most important points. All testing is overfitting the data, yet there is no way to find out if an idea or system works without testing it. By setting aside data that you have not seen to use for validation, you have a better chance that your idea will work before putting money on it. 

     There are many ways to select in-sample data. For example, if you have 20 years of price history, you might choose to use the first 10 years for testing and reserve the second 10 years for validation. But then markets change over time; they become more volatile and may be more or less trending. It might be best to use alternating periods of in-sample and out-of-sample data, in 2-year intervals, provided that you never look at the data during the out-of-sample periods. Alternating these periods may create a problem for continuous, long-term trends, but that will be resolved in Chapter 21.
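The alternating 2-year scheme can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `alternate_blocks` and the 1991 to 2010 range are hypothetical, and whole calendar years are used as the unit of splitting.

```python
def alternate_blocks(years, block_size=2):
    """Split a sequence of years into alternating in-sample / out-of-sample blocks.

    The first block of `block_size` years is in-sample, the next block is
    out-of-sample, and so on.
    """
    in_sample, out_of_sample = [], []
    for position, year in enumerate(years):
        block_number = position // block_size
        if block_number % 2 == 0:
            in_sample.append(year)
        else:
            out_of_sample.append(year)
    return in_sample, out_of_sample


years = list(range(1991, 2011))  # hypothetical 20-year price history
in_sample, out_of_sample = alternate_blocks(years)
print("in-sample:    ", in_sample)
print("out-of-sample:", out_of_sample)
```

Each set ends up with 10 of the 20 years, so both cover early and late market conditions, at the cost of breaking long trends at the block boundaries.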

     The most important factor when reserving out-of-sample data is that you get only one chance to use it. Once you have done your best to create the rules for a trading program, you then run that program through the unseen data. If the results are successful then you can trade the system, but if it fails, then you are also done. You cannot look at the reasons why it failed and change the trading method to perform better. You would have introduced feedback, and your out-of-sample data is considered contaminated. The second try will always be better, but it is now overfitted.

How Much Data Is Enough?

     Statisticians will say, “More is better.” The more data you test, the more reliable your results. Technical analysis is fortunate to be based on a perfect set of data. Each price that is recorded by the exchange, whether it’s IBM at the close of trading in New York on May 5, or the price of Eurodollar interest rates at 10:05 in Chicago, is a confirmed, precise value.

     Remember that, when you use in-sample and out-of-sample data for development, you need more data. You will only get half the combinations and patterns when 50% of the data has been withheld.

Economic Data

     Most other statistical data are not as timely, not as precise, and not as reliable as the price and volume of stocks, futures, ETFs, and other exchange-traded products. Economic data, such as the Producer Price Index or Housing Starts, are released as monthly averages, and can be seasonally adjusted. A monthly average represents a broad range of numbers. In the case of the PPI, some producers may have paid less than the average of the prior month and some more, but the average was +0.02. The lack of a range of values, or a standard deviation of the component values, reduces the usefulness of the information. This statistical data is often revised in the following month; sometimes those revisions can be quite large. When working with the Department of Energy (DOE) weekly data releases, you will need to know the history of the exact numbers released as well as the revisions, if you are going to design a trading method that reacts to those reports. You may find that it is much easier to find the revised data, which is not what you really need.

     If you use economic data, you must be aware of when that data is released. The United States is very precise and prompt, but other countries can be months or years late in releasing data. If the input to your program is monthly data and comes from the CRB (Commodity Research Bureau) Yearbook, be sure that you check when that data was actually available.

Sample Error

     When an average is used, it is necessary to have enough data to make that average accurate. Because much statistical data is gathered by sampling, particular care is given to accumulating a sufficient amount of representative data. This holds true with prices as well. Averaging a few prices, or analyzing small market moves, will show more erratic results. It is difficult to draw an accurate picture from a very small sample.

     When using small, incomplete, or representative sets of data, the approximate error, or accuracy, of the sample can be found using the standard deviation. A large standard deviation indicates an extremely scattered set of points, which in turn makes the average less representative of the data. This process is called the testing of significance. Accuracy increases as the number of items becomes larger, and the measurement of sample error becomes proportionately smaller:

\text{Sample error} = \frac{1}{\sqrt{\text{Number of items sampled}}}

     Therefore, using only one item has a sample error of 100%; with 4 items, the error is 50%. The size of the error is important to the reliability of any trading system. If a system has had only 4 trades, whether profits or losses, it is very difficult to draw any reliable conclusions about future performance. There must be sufficient trades to assure a comfortably small error factor. To reduce the error to 5%, there must be 400 trades (\frac{1}{\sqrt{400}} = \frac{1}{20} = 0.05). This presents a dilemma for a very slow trend-following method that may only generate two or three trades each year. To compensate for this, the identical method can be applied across many markets and the number of trades used collectively (more about this in Chapter 21).
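The arithmetic above can be checked with a small sketch. The helper names `sample_error` and `trades_needed` are hypothetical; the only rule assumed is the one stated in the text, that the error is one over the square root of the number of items.

```python
import math


def sample_error(n_items):
    """Approximate sample error for n_items observations: 1 / sqrt(n)."""
    if n_items < 1:
        raise ValueError("n_items must be at least 1")
    return 1.0 / math.sqrt(n_items)


def trades_needed(target_error):
    """Smallest number of trades whose sample error is at or below target_error."""
    if not 0 < target_error <= 1:
        raise ValueError("target_error must be in (0, 1]")
    # round() guards against floating-point noise before taking the ceiling
    return math.ceil(round(1.0 / target_error ** 2, 9))


for n in (1, 4, 400):
    print(n, "items ->", f"{sample_error(n):.0%}", "error")
print("trades needed for a 5% error:", trades_needed(0.05))
```

Because the error shrinks with the square root of the sample size, halving the error requires four times as many trades.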

Representative Data

     The amount of data is a good estimate of its usefulness; however, the data should represent at least one bull market, one bear market, and some sideways periods. More than one of each is even better. If you were to use 10 years of daily S&P Index values from 1990 to 2000, or 25 years of 10-year Treasury notes through 2010, you would only see a bull market. A trading strategy would be profitable whenever it was a buyer, if you held the position long enough. Unless you included a variety of other price patterns, you would not be able to create a strategy that would survive a downturn in the market. Your results would be unrealistic.


     The average can be misleading in other ways. Consider coffee, which rose from $0.40 to $2.00 per pound in one year. The average price of this product may appear to be $1.20 (\frac{0.40 + 2.00}{2} = 1.20); however, this would not account for the time that coffee was sold at various price levels. Table 2.1 divides the coffee price into four equal intervals, then shows that the time spent at these levels was uniformly opposite to the price rise. That is, prices remained at lower levels longer and at higher levels for shorter time periods, which is very normal price behavior.
     When the time spent at each price level is included, it can be seen that the average price should be lower than $1.20. One way to calculate this, knowing the specific number of days in each interval, is by using a weighted average of the price and its respective interval:

W = \frac{a_1 d_1 + a_2 d_2 + a_3 d_3 + a_4 d_4}{d_1 + d_2 + d_3 + d_4}

where a_i is the average price in interval i and d_i is the number of days spent in that interval.

     This result can vary based on the number of time intervals used; however, it gives a better idea of the correct average price. There are two other averages for which time is an important element—the geometric mean and the harmonic mean.
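A rough sketch of the time-weighted average follows. The interval midpoints come from dividing the 40 to 200 cent range into four equal intervals; the day counts are hypothetical values chosen only to show more time at lower prices, and are not the figures of Table 2.1.

```python
def weighted_average(prices, days):
    """Average of prices weighted by the number of days spent at each price."""
    if len(prices) != len(days) or not prices:
        raise ValueError("prices and days must be non-empty and of equal length")
    return sum(p * d for p, d in zip(prices, days)) / sum(days)


# Midpoints (in cents) of the intervals 40-80, 80-120, 120-160, 160-200
interval_midpoints = [60, 100, 140, 180]
# Hypothetical day counts: longer at low prices, shorter at high prices
days_in_interval = [100, 80, 50, 20]

simple_average = (40 + 200) / 2
time_weighted = weighted_average(interval_midpoints, days_in_interval)
print("simple average:       ", simple_average)
print("time-weighted average:", time_weighted)
```

With more days spent in the lower intervals, the weighted result falls below the simple average of 120 cents, which is the point made in the text.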

Geometric Mean

     The geometric mean represents a growth function in which a price change from 50 to 100 is as important as a change from 100 to 200. If there are n prices, a_1, a_2, a_3, . . . , a_n, then the geometric mean is the nth root of the product of the prices:

G = (a_1 \times a_2 \times a_3 \times \dots \times a_n)^{1/n}

     To solve this mathematically, rather than using a spreadsheet, the equation above can be changed to either of two forms: 

\ln(G) = \ln\left( (a_1 \times a_2 \times a_3 \times \dots \times a_n)^{1/n} \right)
\ln(G) = \frac{1}{n} \left( \ln(a_1) + \ln(a_2) + \ln(a_3) + \dots + \ln(a_n) \right)
     The two solutions are equivalent. The term ln is the natural log, or log base e. (Note that in some software the function log is actually ln.) Using the price levels in Table 2.1, disregarding the time intervals, and substituting into the first equation:

\ln(G) = 4.6462, \quad G = e^{4.6462} = 104.188
     While the arithmetic mean, which is time-weighted, gave the value of 105.71, the geometric mean shows the average as 104.19.

     The geometric mean has advantages in application to economics and prices.

  • A classic example compares a tenfold rise in price from 100 to 1000 to a fall to one tenth from 100 to 10.
  • An arithmetic mean of the two values 10 and 1000 is 505, while the geometric mean gives

G = (10 \times 1000)^{1/2} = 100

and shows the relative distribution of prices as a function of comparable growth. Due to this property, the geometric mean is the best choice when averaging ratios that can be either fractions or percentages.
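The log form of the geometric mean is easy to try out. In this sketch the function name is hypothetical, and the five price levels (40, 80, 120, 160, 200 cents) are an assumption: they are the boundaries of four equal intervals between 40 and 200, which appears to be what the coffee example uses.

```python
import math


def geometric_mean(prices):
    """n-th root of the product of the prices, computed as exp(mean of logs)."""
    if not prices or any(p <= 0 for p in prices):
        raise ValueError("prices must be a non-empty sequence of positive numbers")
    log_sum = sum(math.log(p) for p in prices)
    return math.exp(log_sum / len(prices))


price_levels = [40, 80, 120, 160, 200]   # assumed price levels, in cents
g_levels = geometric_mean(price_levels)
g_pair = geometric_mean([10, 1000])      # the tenfold rise / fall example

print("geometric mean of the price levels:", round(g_levels, 2))
print("geometric mean of 10 and 1000:     ", round(g_pair, 2))
print("arithmetic mean of 10 and 1000:    ", (10 + 1000) / 2)
```

Summing logs instead of multiplying the prices avoids overflow on long price series and matches the second of the two forms given above.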

Quadratic Mean

The quadratic mean is most often used for estimation of error. It is calculated as:

Q = \sqrt{\frac{\sum a_i^2}{n}}

     The quadratic mean is the square root of the mean of the square of the items (root-mean-square). It is best known as the basis for the standard deviation (\sigma = \sqrt{variance}).
     This will be discussed later in this chapter in the section “Moments of the Distribution: Variance, Skewness, and Kurtosis.”
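The link between the quadratic mean and the standard deviation can be seen in a short sketch: the population standard deviation is the quadratic mean of the deviations from the arithmetic mean. The function name and the data values are illustrative assumptions, not taken from the text.

```python
import math


def quadratic_mean(values):
    """Root-mean-square: square root of the mean of the squared items."""
    if not values:
        raise ValueError("values must not be empty")
    return math.sqrt(sum(v * v for v in values) / len(values))


data = [2, 4, 4, 4, 5, 5, 7, 9]            # illustrative values
mean = sum(data) / len(data)
deviations = [v - mean for v in data]

rms = quadratic_mean(data)                  # quadratic mean of the raw values
population_std = quadratic_mean(deviations) # quadratic mean of the deviations

print("quadratic mean:               ", round(rms, 3))
print("population standard deviation:", population_std)
```

Note that this is the population form (dividing by n); a sample standard deviation divides by n - 1 instead.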

Harmonic Mean

     The harmonic mean is another time-weighted average, but not biased toward higher or lower values as in the geometric mean. A simple example is to consider the average speed of a car that travels 4 miles at 20 mph, then 4 miles at 30 mph. An arithmetic mean would give 25 mph (\frac{20+30}{2} = 25), without considering that 12 minutes (\frac{4}{20} \times 60 = 12) were spent at 20 mph and 8 minutes (\frac{4}{30} \times 60 = 8) at 30 mph.
The weighted average would give

W = \frac{(12 \times 20) + (8 \times 30)}{12 + 8} = 24

The harmonic mean is

\frac{1}{H} = \frac{1}{n} \left( \frac{1}{a_1} + \frac{1}{a_2} + \dots + \frac{1}{a_n} \right)

which can also be expressed as

H = \frac{n}{\sum_{i=1}^{n} \frac{1}{a_i}}

For two or three values, the simpler form can be used:

H_2 = \frac{2ab}{a + b} \qquad H_3 = \frac{3abc}{ab + ac + bc}

This allows the solution pattern to be seen. For the 20 and 30 mph rates of speed, the solution is

H_2 = \frac{2 \times 20 \times 30}{20 + 30} = \frac{1200}{50} = 24

     which is the same answer as the weighted average. Considering the original set of numbers again, the basic form of harmonic mean can be applied:

     We might apply the harmonic mean to price swings, where the first swing moved 20 points over 12 days and the second swing moved 30 points over 8 days.
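The car example can be reproduced with a minimal sketch that compares the harmonic mean with the time-weighted average. The function names are hypothetical; the numbers are the 20 and 30 mph speeds and the 12 and 8 minutes from the example.

```python
def harmonic_mean(values):
    """n divided by the sum of the reciprocals of the values."""
    if not values or any(v <= 0 for v in values):
        raise ValueError("values must be a non-empty sequence of positive numbers")
    return len(values) / sum(1.0 / v for v in values)


def time_weighted_mean(values, weights):
    """Average of values weighted by the time spent at each value."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


speeds = [20, 30]    # mph over two equal 4-mile legs
minutes = [12, 8]    # time spent at each speed

h = harmonic_mean(speeds)
w = time_weighted_mean(speeds, minutes)
arithmetic = sum(speeds) / len(speeds)

print("arithmetic mean:   ", arithmetic)
print("harmonic mean:     ", round(h, 6))
print("time-weighted mean:", w)
```

The two agree because the legs cover equal distances; the harmonic mean is the right average for rates measured over equal amounts of the quantity in the numerator.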

Price Distribution

     The measurement of distribution is very important because it tells you generally what to expect. We cannot know what tomorrow’s S&P trading range will be, but if the current price is 1200, then

  • we have a high level of confidence that it will fall between 900 and 1500 this year,
  • but less confidence that it will fall between 1100 and 1300.
  • We have much less confidence that it will fall between 1150 and 1250,
  • and we have virtually no chance of picking the exact range.

The following measurements of distribution allow you to put a probability, or confidence level, on the chance of an event occurring.

     In all of the statistics that follow, we will use a limited number of prices or, in some cases, individual trading profits and losses as the sample data. We want to measure the characteristics of our sample, finding the shape of the distribution, deciding how results of a smaller sample compare to a larger one, or how similar two samples are to each other. All of these measures will show that the smaller samples are less reliable, yet they can still be used if you understand the size of the error or the difference in the shape of the distribution compared to the expected distribution of a larger sample.

Frequency Distributions

import pandas as pd

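# Assumption: the first column of the sheet holds the dates and becomes the index
wheat = pd.read_excel('TSM Monthly_W-PPI-DX (C2).xlsx', index_col=0, parse_dates=True)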

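Log returns (log_return): the log returns between two times 0 < s < t are modeled as normally distributed, where P_t is the price at time t:

r_{s,t} = \ln\left( \frac{P_t}{P_s} \right)

when s = t-1:

r_t = \ln\left( \frac{P_t}{P_{t-1}} \right)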

import numpy as np
# Calculates the log returns
wheat['log_rtn'] = np.log( wheat['Wheat']/wheat['Wheat'].shift(1) )

# estimated future price = (log return * 100) + previous price
col = wheat.columns.get_loc('Wheat(X)')  # integer position of the target column
for i in range(1, len(wheat)):
    wheat.iloc[i, col] = wheat['Wheat(X)'].iloc[i-1] + 100*wheat['log_rtn'].iloc[i]

# plot the estimated future price

import datetime as dt

import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import numpy as np

fig, ax = plt.subplots(1, figsize=(8, 6))
ax.plot(wheat.index, wheat['Wheat(X)'].values,
        c='gray', lw=4)

ax.set_ylabel('Wheat Price (cents/bushel)', fontsize=14)

# Enabling grid lines:
ax.grid(True)

# One major tick on December 1 of every year covered by the data:
# datetime.date(1985, 12, 1), ... datetime.date(2010, 12, 1)
first_year = wheat.index.year.min()
n_years = wheat.index.year.nunique()
dates = [dt.date(first_year + i, 12, 1) for i in range(n_years)]

tick_values = mdates.date2num(dates)  # e.g. [ 5813.  6178. ... 14944.]
ax.xaxis.set_major_locator(ticker.FixedLocator(tick_values))

# Label the ticks as month/day/year and remove the leading zero from the day,
# e.g. 12/01/1985 -> 12/1/1985
ax.xaxis.set_major_formatter(
    ticker.FuncFormatter(
        lambda value, pos: mdates.num2date(value).strftime('%m/%d/%Y').replace('/0', '/')
    )
)
ax.tick_params(axis='x', labelrotation=90)

# Assumption: the x-axis starts at the first December tick
ax.set_xlim(left=dt.date(first_year, 12, 1))
ax.set_yticks(np.arange(0, 300, 50))
plt.show()


FIGURE 2.2 Wheat prices, 1985–2010.

     The frequency distribution (also called a histogram) is simple yet can give a good picture of the characteristics of the data. Theoretically, we expect commodity prices to spend more time at low price levels and only brief periods at high prices. That pattern is shown in Figure 2.2 for wheat during the past 25 years. The most frequent occurrences are at the price where the supply and demand are balanced, called equilibrium (the equilibrium price is the price that balances quantity supplied and quantity demanded). When there is a shortage of supply (excess demand), or an unexpected demand, prices rise for a short time until either the demand is satisfied (which could happen if prices are too high), or supply increases to meet demand. (With too many buyers chasing too few goods, sellers can respond to the shortage by raising their prices without losing sales. These price increases cause the quantity demanded to fall and the quantity supplied to rise. These changes represent movements along the supply and demand curves, and they move the market toward the equilibrium.) There is usually a small tail to the left where prices occasionally trade for less than the cost of production, or at a discounted rate during periods of high supply.

     To calculate a frequency distribution with 20 bins, we find the highest and lowest prices to be charted, and divide the difference by 19 to get the size of one bin. Beginning with the lowest price (left edge of first bin), add the bin size to get the second value (right edge of first bin), add the bin size to the second value to get the third value (right edge of second bin), and so on.

  • Here I use bin_width=50 and pass bins=np.arange(0, 1000+bin_width, bin_width) to the np.histogram() function to generate the counts, then draw them as bars:

    import matplotlib.pyplot as plt
    import numpy as np

    fig, ax = plt.subplots(1, figsize=(8, 6))
    bin_width = 50
    hist_values, bin_edges = np.histogram(
        wheat['Wheat'],
        # a monotonically increasing array of bin edges, including the rightmost edge
        bins=np.arange(0, 1000 + bin_width, bin_width),  # 21 bin_edges
        density=False,  # False: count the number of samples in each bin
    )
    # len(hist_values), len(bin_edges) ==> (20 bins, 21 bin_edges)
    # hist_values
    # array([ 0,   0,   0,   0,  31,  58,  76,  54,  36,  18,
    #         7,   6,   4,   1,   2,   1,   1,   3,   1,   0], dtype=int64)
    # bin_edges
    # array([ 0,  50, 100, 150, 200, 250, 300, 350, 400, 450,
    #       500, 550, 600, 650, 700, 750, 800, 850, 900, 950,
    #       1000])
    ax.grid(axis='y')
    ax.hist(x=bin_edges[:-1],     # one entry per bin, placed at its left edge
            bins=bin_edges,
            weights=hist_values,  # weight each entry by the count of its bin
            rwidth=0.5,           # bar width as a fraction of the bin width: 50 * 0.5 = 25
            zorder=2.0,           # draw the bars above the grid lines
           )
    # put each xtick at the middle of its bin = left edge + bin_width / 2
    ax.set_xticks(bin_edges[:-1] + bin_width / 2.)
    ax.set_xlim(left=100 + bin_width / 2.)  # set the start of the x-axis
    ax.tick_params(axis='x', labelrotation=90)  # assumption: rotated tick labels
    ax.set_yticks(np.arange(0, 90, 10))
    ax.set_xlabel('Wheat Price (in cents/bushel)', fontsize=14)
    ax.set_ylabel('Frequency of Price \n Occurring in Bin', fontsize=14)
    plt.show()

     Next, shift each bar so that it is centered on the right edge of its bin, which shows the frequency distribution with its tail to the right:

    import matplotlib.pyplot as plt
    import numpy as np

    fig, ax = plt.subplots(1, figsize=(8, 6))
    bin_width = 50
    shift_hist = bin_width / 2  # assumption: half a bin, so bars are centered on the right edges
    hist_values, bin_edges = np.histogram(
        wheat['Wheat'],
        bins=np.arange(0, 1000 + bin_width, bin_width),  # 21 bin_edges
        density=False,  # False: count the number of samples in each bin
    )
    # len(hist_values), len(bin_edges) ==> (20 bins, 21 bin_edges)
    ax.grid(axis='y')
    ax.hist(x=bin_edges[1:],              # one entry per bin, placed at its right edge
            bins=bin_edges + shift_hist,  # shifted bins, each centered on a right edge
            weights=hist_values,          # weight each entry by the count of its bin
            rwidth=0.5,                   # bar width as a fraction of the bin width: 50 * 0.5 = 25
            zorder=2.0,                   # draw the bars above the grid lines
           )
    # put each xtick on the right edge of its bin
    ax.set_xticks(bin_edges[1:])
    ax.set_xlim(left=100)  # set the start of the x-axis
    ax.tick_params(axis='x', labelrotation=90)  # assumption: rotated tick labels
    ax.set_yticks(np.arange(0, 90, 10))
    ax.set_xlabel('Wheat Price (in cents/bushel)', fontsize=14)
    ax.set_ylabel('Frequency of Price \n Occurring in Bin', fontsize=14)
    ax.tick_params(width=0)  # tick line width in points
    plt.show()
    FIGURE 2.3 Wheat frequency distribution showing a tail to the right.

    import matplotlib.pyplot as plt
    import numpy as np

    fig, ax = plt.subplots(1, figsize=(8, 6))
    bin_width = 50
    shift_hist = bin_width / 2  # assumption: half a bin, so bars are centered on the right edges
    hist_values, bin_edges = np.histogram(
        wheat['Wheat'],
        bins=np.arange(0, 1000 + bin_width, bin_width),  # 21 bin_edges
        density=False,  # False: count the number of samples in each bin
    )
    ax.hist(x=bin_edges[1:],              # one entry per bin, placed at its right edge
            bins=bin_edges + shift_hist,  # shifted bins, each centered on a right edge
            weights=hist_values,          # weight each entry by the count of its bin
            rwidth=0.5,                   # bar width as a fraction of the bin width: 50 * 0.5 = 25
            zorder=2.0,
           )
    ax.set_xticks(bin_edges[1:])  # put each xtick on the right edge of its bin
    ax.set_xlim(left=100)         # set the start of the x-axis
    ax.tick_params(axis='x', labelrotation=90)  # assumption: rotated tick labels
    ax.set_xlabel('Wheat Price (in cents/bushel)', fontsize=14)
    ax.set_ylabel('Frequency of Price \n Occurring in Bin', fontsize=14)
    ax.tick_params(width=0)  # tick line width in points
    # annotate every non-empty bar with its count
    for p in ax.patches:
        x, w, h = p.get_x(), p.get_width(), p.get_height()
        if h > 0:
            # https://matplotlib.org/stable/gallery/text_labels_and_annotations/text_alignment.html
            ax.text(x + w / 2, h,   # anchor point at the top center of the bar
                    f'{h:.0f}',     # assumption: the label is the count in the bin
                    ha='center',    # horizontal center of the text box on the anchor point
                    va='center',    # vertical center of the text box on the anchor point
                   )
    plt.show()



    import matplotlib.pyplot as plt
    import numpy as np

    fig, ax = plt.subplots(1, figsize=(8, 6))
    bin_width = 50
    shift_hist = bin_width / 2  # assumption: half a bin, so bars are centered on the right edges
    hist_values, bin_edges = np.histogram(
        wheat['Wheat'],
        bins=np.arange(0, 1000 + bin_width, bin_width),  # 21 bin_edges
        density=False,  # False: count the number of samples in each bin
    )
    ax.hist(x=bin_edges[1:],              # one entry per bin, placed at its right edge
            bins=bin_edges + shift_hist,  # shifted bins, each centered on a right edge
            weights=hist_values,          # weight each entry by the count of its bin
            rwidth=0.5,                   # bar width as a fraction of the bin width: 50 * 0.5 = 25
            zorder=2.0,
           )
    ax.set_xticks(bin_edges[1:])  # put each xtick on the right edge of its bin
    ax.set_xlim(left=100)         # set the start of the x-axis
    ax.tick_params(axis='x', labelrotation=90)  # assumption: rotated tick labels
    ax.set_xlabel('Wheat Price (in cents/bushel)', fontsize=14)
    ax.set_ylabel('Frequency of Price \n Occurring in Bin', fontsize=14)
    ax.tick_params(width=0)  # tick line width in points
    # annotate every non-empty bar with its share of all observations
    for p in ax.patches:
        x, w, h = p.get_x(), p.get_width(), p.get_height()
        if h > 0:
            pct = h / len(wheat['Wheat'])
            # https://matplotlib.org/stable/gallery/text_labels_and_annotations/text_alignment.html
            ax.text(x + w / 2, h,   # anchor point at the top center of the bar
                    f'{pct:.1%}',   # assumption: the label is the percentage of observations
                    ha='center',    # horizontal center of the text box on the anchor point
                    va='center',    # vertical center of the text box on the anchor point
                   )
    plt.show()

     When completed, you will have 20 bins that begin at the lowest price and end at the highest price. You then can count the number of prices that fall into each bin, a nearly impossible task, or you can use a spreadsheet to do it. In Excel, you go to Data/Data Analysis/Histogram and enter the range of bins (which you need to set up in advance) and the data to be analyzed, then select a blank place on the spreadsheet for the output results (to the right of the bins is good) and click OK. The frequency distribution will be shown instantly. You can then plot the results seen in Figure 2.3. 

     The frequency distribution shows that the most common price fell between $3.50 and $4.00 per bushel (check the normal distribution plot that follows), but the most active range was from $2.50 to $5.00. The tail to the right extends to just under $10/bushel and clearly demonstrates the fat tail in the price distribution. If this were a normal distribution, there would be no entries past $6.00 (based on a 95% confidence interval).
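The $6.00 bound can be checked with the standard library alone. This sketch assumes the fitted mean and standard deviation reported in the norm.fit output later in this section (rounded to 362.75 and 115.83 cents) and the bin counts listed in the histogram output; the variable names are hypothetical.

```python
from statistics import NormalDist

mu, sigma = 362.75, 115.83  # cents/bushel, rounded from the reported fit
fitted = NormalDist(mu, sigma)

# Central 95% interval of the fitted normal distribution
lower = fitted.inv_cdf(0.025)
upper = fitted.inv_cdf(0.975)

# Share of prices the normal curve expects above 600 cents
expected_above_600 = 1.0 - fitted.cdf(600.0)

# Observed share, from the counts of the bins 600-650 ... 950-1000
counts_above_600 = [4, 1, 2, 1, 1, 3, 1, 0]
total_count = 299  # sum of all 20 bin counts
observed_above_600 = sum(counts_above_600) / total_count

print(f"95% interval: {lower:.1f} to {upper:.1f} cents")
print(f"expected share above 600: {expected_above_600:.1%}")
print(f"observed share above 600: {observed_above_600:.1%}")
```

The observed share above 600 cents is larger than the normal curve predicts, which is one way to put a number on the fat right tail.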


density   bool, optional

     If False, the result will contain the number of samples in each bin.

     If True, the result is the value of the Probability Density Function at the bin, normalized such that the integral over the range is 1. Note that the sum of the histogram values will not be equal to 1 unless bins of unity width are chosen; it is not a probability mass function.

import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import norm

fig, ax = plt.subplots(1, figsize=(8, 6))

bin_width = 50
hist_values, bin_edges = np.histogram(
    wheat['Wheat'],
    # a monotonically increasing array of bin edges, including the rightmost edge
    bins=np.arange(0, 1000 + bin_width, bin_width),  # 21 bin_edges
    density=True,  # True: normalize so that the total area of the bars is 1
)
# len(hist_values), len(bin_edges) ==> (20 bins, 21 bin_edges)
ax.hist(x=bin_edges[:-1],     # one entry per bin, placed at its left edge
        bins=bin_edges,
        weights=hist_values,  # weight each entry by the density of its bin
        rwidth=1,             # bars fill the whole bin width (50)
        zorder=2.0,
       )

ax.set_xticks(bin_edges)  # put an xtick on every bin edge
ax.set_xlim(left=0)       # set the start of the x-axis
ax.tick_params(axis='x', labelrotation=90)  # assumption: rotated tick labels

# Fit a normal distribution to the data: mean and standard deviation
mu, std = norm.fit(wheat['Wheat'])
# mu, std : (362.7483277591973, 115.83166808628707)
# norm.fit returns the population standard deviation (ddof=0);
# wheat['Wheat'].std() uses ddof=1 and gives 116.02585375167453

# Plot the PDF.
x = np.linspace(0,     # start
                1000,  # stop
                100)   # number of samples to generate
p = norm.pdf(x, mu, std)
r_min = norm.ppf(0.025, mu, std)      # lower bound of the central 95% area
r_max = norm.ppf(1 - 0.025, mu, std)  # upper bound of the central 95% area
ax.plot(x, p, 'blue', linewidth=2, zorder=3.0)
# shade the two 2.5% tails outside the 95% interval
ax.fill_between(x, p, 0, where=(x < r_min), color='orange', zorder=3.0)
ax.fill_between(x, p, 0, where=(r_max < x), color='orange', zorder=3.0)

ax.set_xlabel('Wheat Price (in cents/bushel)', fontsize=14)
ax.set_ylabel('Probability Density Function', fontsize=14, color='blue')

ax.tick_params(width=0)  # tick line width in points
plt.show()


import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import norm

fig, ax = plt.subplots(1, figsize=(8,6),#gridspec_kw={'height_ratios': [2, 1]} 



hist_values, bin_edges= np.histogram( wheat['Wheat'],
                                      # it defines a monotonically increasing array of bin edges, including the rightmost edge
                                      bins=np.arange(0, 1000+bin_width, bin_width), # 21 bin_edges
                                      #range=( wheat['Wheat'].min(), wheat['Wheat'].max() ),
                                      density=True, # False: count the number of samples in each bin
# len(hist_values), len(bin_edges) ==> (20 bins, 21 bin_edges)
# hist_values
# array([ 0,   0,   0,   0,  31,  58,  76,  54,  36,  18,  
#         7,   6,   4,   1,   2,   1,   1,   3,   1,   0], dtype=int64)
# bin_edges
# array([ 0,  50, 100, 150, 200, 250, 300, 350, 400, 450,
#       500, 550, 600, 650, 700, 750, 800, 850, 900, 950,
#       1000])
#wheat.hist(column='Wheat(X)', bins=20, ax=ax)
#ax.grid(axis='y', )
ax.hist(x=bin_edges[:-1],  # one entry per bin, located at its left edge
        bins=bin_edges,
        weights=hist_values,  # bar heights come from np.histogram
        rwidth=1,  # the relative width of the bars as a fraction of the bin width (here 50)
        zorder=2.0,  # draw the bars above the grid lines
       )  # bar_width = bin_width * rwidth = 50*1 = 50

# put the xticks on the bin edges
ax.set_xticks(bin_edges)
ax.set_xlim(left=100)  # set the start xtick

ax.set_xticklabels(ax.get_xticklabels())

# Fit a normal distribution to
# the data:
# mean and standard deviation
mu, std = norm.fit(wheat['Wheat'])
# mu, std : (362.7483277591973, 115.83166808628707)
# Plot the PDF.
# mu, std = wheat['Wheat'].mean(), wheat['Wheat'].std()
# (362.7483277591973, 116.02585375167453)

x = np.linspace(mu-2*std,  # roughly the 95% confidence interval
                mu+2*std,
                100)       # number of samples to generate
p = norm.pdf(x, mu, std)
ax.plot(x, #since we shift all bars to the right edge
        p, 'blue', linewidth=2, zorder=3.0)  

ax.set_xlabel('Wheat Price (in cents/bushel)', fontsize=14)
ax.set_ylabel('Probability Density Function', fontsize=14, color='blue')

ax.tick_params(width=0) # Tick line width in points


the area under the black line : 95% confidence interval 

FIGURE 2.4 Normal distribution showing the percentage area included within one standard deviation about the arithmetic mean.

import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import norm

fig, ax = plt.subplots(1, figsize=(8,6))  # gridspec_kw={'height_ratios': [2, 1]}

bin_width = 50  # cents per bushel, matching the bin_edges listed below

hist_values, bin_edges = np.histogram(wheat['Wheat'],
                                      # a monotonically increasing array of bin edges, including the rightmost edge
                                      bins=np.arange(0, 1000+bin_width, bin_width),  # 21 bin_edges
                                      #range=( wheat['Wheat'].min(), wheat['Wheat'].max() ),
                                      density=True)  # False: count the number of samples in each bin
# len(hist_values), len(bin_edges) ==> (20 bins, 21 bin_edges)
# hist_values
# array([ 0,   0,   0,   0,  31,  58,  76,  54,  36,  18,  
#         7,   6,   4,   1,   2,   1,   1,   3,   1,   0], dtype=int64)
# bin_edges
# array([ 0,  50, 100, 150, 200, 250, 300, 350, 400, 450,
#       500, 550, 600, 650, 700, 750, 800, 850, 900, 950,
#       1000])
#wheat.hist(column='Wheat(X)', bins=20, ax=ax)
#ax.grid(axis='y', )
ax.hist(x=bin_edges[:-1],  # one entry per bin, located at its left edge
        bins=bin_edges,
        weights=hist_values,  # bar heights come from np.histogram
        rwidth=0.5,  # the relative width of the bars as a fraction of the bin width (here 50)
        zorder=2.0,  # draw the bars above the grid lines
       )  # bar_width = bin_width * rwidth = 50*0.5 = 25

# put each xtick at the middle of its bin = left edge + bin_width/2
ax.set_xticks(bin_edges[:-1]+bin_width/2.)
ax.set_xlim(left=100+bin_width/2.)  # set the start xtick

ax.set_xticklabels(ax.get_xticklabels())

# Fit a normal distribution to
# the data:
# mean and standard deviation
mu, std = norm.fit(wheat['Wheat'])
# mu, std : (362.7483277591973, 115.83166808628707)
# Plot the PDF.
# mu, std = wheat['Wheat'].mean(), wheat['Wheat'].std()
# (362.7483277591973, 116.02585375167453)

x = np.linspace(mu-2*std,  # roughly the 95% confidence interval
                mu+2*std,
                100)       # number of samples to generate
p = norm.pdf(x, mu, std)
ax.plot(x, #since we shift all bars to the right edge
        p, 'k', linewidth=2, zorder=3.0)  

ax.set_xlabel('Wheat Price (in cents/bushel)', fontsize=14)
ax.set_ylabel('Probability Density Function', fontsize=14)

ax.tick_params(width=0) # Tick line width in points


import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import norm

fig, ax = plt.subplots(1, figsize=(8,6))  # gridspec_kw={'height_ratios': [2, 1]}

bin_width = 50  # cents per bushel, matching the bin_edges listed below

hist_values, bin_edges = np.histogram(wheat['Wheat'],
                                      # a monotonically increasing array of bin edges, including the rightmost edge
                                      bins=np.arange(0, 1000+bin_width, bin_width),  # 21 bin_edges
                                      #range=( wheat['Wheat'].min(), wheat['Wheat'].max() ),
                                      density=True)  # False: count the number of samples in each bin
# len(hist_values), len(bin_edges) ==> (20 bins, 21 bin_edges)
# hist_values
# array([ 0,   0,   0,   0,  31,  58,  76,  54,  36,  18,  
#         7,   6,   4,   1,   2,   1,   1,   3,   1,   0], dtype=int64)
# bin_edges
# array([ 0,  50, 100, 150, 200, 250, 300, 350, 400, 450,
#       500, 550, 600, 650, 700, 750, 800, 850, 900, 950,
#       1000])
#wheat.hist(column='Wheat(X)', bins=20, ax=ax)
#ax.grid(axis='y', )
shift_hist = 50/2.  # assumed: half of the 50-cent bin width, so each bar is centered on a right edge
ax.hist(x=bin_edges[1:],  # one entry per bin, located at its right edge
        bins=bin_edges+shift_hist,  # shifted bins, each containing one right edge
        weights=hist_values,  # bar heights come from np.histogram
        rwidth=0.5,  # the relative width of the bars as a fraction of the bin width (here 50)
        zorder=2.0,  # draw the bars above the grid lines
       )  # bar_width = bin_width * rwidth = 50*0.5 = 25

# put each xtick on the right edge of its bin, where the bar is centered
ax.set_xticks(bin_edges[1:])
ax.set_xlim(left=100)  # set the start xtick

ax.set_xticklabels(ax.get_xticklabels())

# Fit a normal distribution to
# the data:
# mean and standard deviation
mu, std = norm.fit(wheat['Wheat'])
# mu, std : (362.7483277591973, 115.83166808628707)
# Plot the PDF.
# mu, std = wheat['Wheat'].mean(), wheat['Wheat'].std()
# (362.7483277591973, 116.02585375167453)

x = np.linspace(mu-2*std,  # roughly the 95% confidence interval
                mu+2*std,
                100)       # number of samples to generate
p = norm.pdf(x, mu, std)
ax.plot(x, #since we shift all bars to the right edge
        p, 'k', linewidth=2, zorder=3.0)  

ax.set_xlabel('Wheat Price (in cents/bushel)', fontsize=14)
ax.set_ylabel('Probability Density Function', fontsize=14)

ax.tick_params(width=0) # Tick line width in points


The absence of price data below $2.50 is due to the cost of production. Below that price farmers would refuse to sell at a loss; however, the U.S. government has a price support program that guarantees a minimum return for farmers.

     The wheat frequency distribution can also be viewed net of inflation or changes in the U.S. dollar. This will be seen at the end of this chapter.

Short-Term Distributions 

     The same frequency distributions occur even when we look at shorter time intervals, although the pattern is more erratic as the time interval gets very small. If we take wheat prices for the calendar year 2007 (Figure 2.4), we see a steady move up during midyear, followed by a wide-ranging sideways pattern at higher levels.

# https://www.macrotrends.net/2534/wheat-prices-historical-chart-data
wheat40 = pd.read_csv('wheat-prices-historical-chart-data.csv',
                      index_col=0, parse_dates=True)  # assumed: the first column holds the dates
# wheat40['2007-01-01':'2008-01-01']
# Calculate the log returns
wheat40['log_rtn'] = np.log( wheat40['Wheat']/wheat40['Wheat'].shift(1) )

# estimated future price = (log return * 100) + previous price
wheat40['Wheat(X)'] = wheat40['Wheat']  # assumed starting values; the original initialization is not shown
for i in range(1,len(wheat40)):
    wheat40.iloc[i,
                 list(wheat40.columns).index( 'Wheat(X)' )
                ]= wheat40['Wheat(X)'].iloc[i-1] + 100*wheat40['log_rtn'].iloc[i]


FIGURE 2.4 Wheat daily prices, 2007.
However, the frequency distribution in Figure 2.5 shows a pattern similar to the long-term distribution, with the most common value at a low price level and a fat tail to the right.

import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import norm

fig, ax = plt.subplots(1, figsize=(8,6))  # gridspec_kw={'height_ratios': [2, 1]}

bin_width = 50  # cents per bushel ($0.50 intervals)

hist_values, bin_edges = np.histogram(wheat40['2007-01-01':'2008-01-01']['Wheat']*100,  # dollars -> cents
                                      # a monotonically increasing array of bin edges, including the rightmost edge
                                      bins=np.arange(0, 1000+bin_width, bin_width),  # 21 bin_edges
                                      #range=( wheat['Wheat'].min(), wheat['Wheat'].max() ),
                                      density=True)  # False: count the number of samples in each bin
# len(hist_values), len(bin_edges) ==> (20 bins, 21 bin_edges)
# hist_values
# array([ 0,   0,   0,   0,  31,  58,  76,  54,  36,  18,  
#         7,   6,   4,   1,   2,   1,   1,   3,   1,   0], dtype=int64)
# bin_edges
# array([ 0,  50, 100, 150, 200, 250, 300, 350, 400, 450,
#       500, 550, 600, 650, 700, 750, 800, 850, 900, 950,
#       1000])
#wheat.hist(column='Wheat(X)', bins=20, ax=ax)
#ax.grid(axis='y', )
shift_hist = 50/2.  # assumed: half of the 50-cent bin width, so each bar is centered on a right edge
ax.hist(x=bin_edges[1:],  # one entry per bin, located at its right edge
        bins=bin_edges+shift_hist,  # shifted bins, each containing one right edge
        weights=hist_values,  # bar heights come from np.histogram
        rwidth=0.5,  # the relative width of the bars as a fraction of the bin width (here 50)
        zorder=2.0,  # draw the bars above the grid lines
       )  # bar_width = bin_width * rwidth = 50*0.5 = 25

# put each xtick on the right edge of its bin, where the bar is centered
ax.set_xticks(bin_edges[1:])
ax.set_xlim(left=400)  # set the start xtick

ax.set_xticklabels(ax.get_xticklabels())

# Fit a normal distribution to
# the data:
# mean and standard deviation
mu, std = norm.fit(wheat40['2007-01-01':'2008-01-01']['Wheat']*100)
# mu, std : (362.7483277591973, 115.83166808628707)
# Plot the PDF.
# mu, std = wheat['Wheat'].mean(), wheat['Wheat'].std()
# (362.7483277591973, 116.02585375167453)

x = np.linspace(mu-2*std,  # roughly the 95% confidence interval
                mu+2*std,
                100)       # number of samples to generate
p = norm.pdf(x, mu, std)
ax.plot(x, #since we shift all bars to the right edge
        p, 'k', linewidth=2, zorder=3.0)  

ax.set_xlabel('Wheat Price (in cents/bushel)', fontsize=14)  # the data was multiplied by 100 above
ax.set_ylabel('Probability Density Function', fontsize=14)

ax.tick_params(width=0) # Tick line width in points


FIGURE 2.5 Frequency distribution of wheat prices, intervals of $0.50, during 2007.

     If we had picked the few months just before prices peaked in September 2007, the chart might have shown the peak price further to the right and the fat tail on the left. For commodities, this represents a period of price instability, and expectations that prices will fall.

  • Common-Stock Ratios :
    • Finally, there are a number of common-stock ratios that convert key bits of information about the company to a per-share basis. Also called market ratios, they tell the investor exactly what portion of total profits, dividends, and equity is allocated to each share of stock. Popular common-stock ratios include earnings per share (EPS), the price-to-earnings (P/E) ratio, dividends per share, dividend yield, payout ratio, and book value per share.
  • Price-to-Earnings Ratio (P/E ratio):
    • This measure, an extension of the earnings per share ratio, is used to determine how the market is pricing the company’s common stock. The price-to-earnings (P/E) ratio relates the company’s earnings per share (EPS) to the market price of its stock. To compute the P/E ratio, it is necessary to first know the stock’s EPS. Using the earnings per share equation, we see that the EPS for UVRS in 2013 was
      EPS = \frac{\text{Net profit after taxes} - \text{Preferred dividends}}{\text{Number of shares of common stock outstanding}} = \$2.26
      In this case, the company’s profits of $139.7 million translate into earnings of $2.26 for each share of outstanding common stock. (Note in this case that preferred dividends are shown as $0 because the company has no preferred stock outstanding.) Given this EPS figure and the stock’s current market price (assume it is currently trading at $41.50), we can use Equation 7.14 to determine the P/E ratio for Universal.
      P/E = \frac{\text{Market price of common stock}}{EPS} = \frac{\$41.50}{\$2.26} = 18.4

      In effect, the stock is currently selling at a multiple of about 18 times its 2013 earnings.

      Price-to-earnings multiples are widely quoted in the financial press and are an essential part of many stock valuation models. Other things being equal, you would like to find stocks with rising P/E ratios because higher P/E multiples usually translate into higher future stock prices and better returns to stockholders. But even though you’d like to see them going up, you also want to watch out for P/E ratios that become too high (relative either to the market or to what the stock has done in the past). When this multiple gets too high, it may be a signal that the stock is becoming overvalued (and may be due for a fall).

      One way to assess the P/E ratio is to compare it to the company’s rate of growth in earnings. The market has developed a measure of this comparison called the PEG ratio. Basically, it looks at the latest P/E relative to the 3- to 5-year rate of growth in earnings. (The earnings growth rate can be all historical, covering the last 3 to 5 years, or perhaps part historical and part forecasted.) The PEG ratio is computed as:
           PEG = \frac{\text{Stock's P/E ratio}}{\text{3- to 5-year growth rate in earnings}}

           As we saw earlier, Universal Office Furnishings had a P/E ratio of 18.4 times earnings in 2013. If corporate earnings for the past 5 years had been growing at an average annual rate of, say, 15%, then its PEG ratio would be:
           PEG = \frac{18.4}{15.0} = 1.23

           A PEG ratio this close to 1.0 is certainly reasonable. It suggests that the company’s P/E is not out of line with the earnings growth of the firm. In fact, the idea is to look for stocks that have PEG ratios that are equal to or less than 1. In contrast, a high PEG means the stock’s P/E has outpaced its growth in earnings and, if anything, the stock is probably “fully valued.” Some investors, in fact, won’t even look at stocks if their PEGs are too high, say, more than 1.5 or 2.0. At the minimum, PEG is probably something you would want to look at because it certainly is not unreasonable to expect some correlation between a stock’s P/E and its rate of growth in earnings.
    • P/E, F P/E: P/E stands for price-earnings ratio (P/E ratio, also called the “earnings multiple”). The P/E ratio is the stock’s price divided by the income, or profit, earned by the firm on a per-share basis over the previous 12 months. In effect, it states the multiple that investors are willing to pay for one dollar of earnings.
      • High P/Es may result as investors are willing to pay more for a dollar of earnings because they believe that earnings will grow dramatically in the future.
      • Low P/Es are generally interpreted as an indication of poor or risky future prospects.
      • F P/E is the forward price-earnings ratio, and uses estimated earnings over next 12 months. If there is no estimate, it is not given.
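
The P/E and PEG arithmetic for Universal can be checked in a few lines. This is only a sketch using the figures quoted in the text (a $41.50 price, $2.26 EPS, and the "say, 15%" growth rate); the function names `pe_ratio` and `peg_ratio` are illustrative and do not come from any library.

```python
def pe_ratio(price: float, eps: float) -> float:
    """Price-to-earnings multiple: market price divided by earnings per share."""
    return price / eps


def peg_ratio(pe: float, growth_pct: float) -> float:
    """PEG: the P/E multiple divided by the earnings growth rate, in percent."""
    return pe / growth_pct


pe = pe_ratio(41.50, 2.26)            # figures quoted in the text
peg = peg_ratio(round(pe, 1), 15.0)   # 15% is the text's illustrative growth rate
print(f"P/E = {pe:.1f}, PEG = {peg:.2f}")
```

Rounding the P/E to one decimal before dividing mirrors the way the text carries 18.4 into the PEG calculation.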

     It should be expected that the distribution of prices for a physical commodity, such as agricultural products, metals, and energy, will be skewed toward the left (more occurrences at lower prices) and have a long tail at higher prices toward the right of the chart. This is because prices remain at relatively higher levels for only short periods of time while there is an imbalance in supply and demand.

  • The reason for this is that physical commodities are typically subject to supply and demand factors that can result in price fluctuations.
    • In many cases, the supply of the commodity is greater than the demand, which can lead to an excess supply and lower prices.
    • Conversely, when demand is greater than supply, prices may rise.
    • This often results in a skewed distribution of prices with more occurrences at lower prices.
  • In addition, physical commodities may also be subject to extreme price movements due to unexpected events such as natural disasters, political unrest, or changes in government policies. These extreme price movements can result in a long tail at higher prices towards the right of the chart.

In the stock market, history has shown that stocks will not sustain exceptionally high price/earnings (P/E) ratios indefinitely; however, the period of adjustment can be drawn out over many years, unlike an agricultural product that begins again each year. When observing shorter price periods, patterns that do not fit the standard distribution may be considered in transition. Readers who would like to pursue this topic should read Chapter 18, especially the sections “Distribution of Prices” and “Steidlmayer’s Market Profile.”

     The measures of central tendency discussed in the previous section are used to describe the shape and extremes of price movement shown in the frequency distribution. The general relationship between the three principal means when the distribution is not perfectly symmetric is:

  • Arithmetic mean > Geometric mean > Harmonic mean
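
The ordering of the three means can be verified numerically. The sketch below uses a handful of made-up positive prices (not the wheat data) together with `scipy.stats.gmean` and `scipy.stats.hmean`; the strict inequality holds whenever the values are positive and not all equal.

```python
import numpy as np
from scipy.stats import gmean, hmean

# Made-up positive prices, for illustration only
prices = np.array([250.0, 300.0, 350.0, 400.0, 700.0])

arithmetic = prices.mean()   # sum of values / n
geometric = gmean(prices)    # n-th root of the product of the values
harmonic = hmean(prices)     # n / sum of reciprocals

print(f"arithmetic={arithmetic:.2f}  geometric={geometric:.2f}  harmonic={harmonic:.2f}")
```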

Median and Mode

  • Median (= Q2, the second quartile): This is the midpoint of the data, found by arranging the N observations in ascending or descending order and taking the middle one.
    Application: binary search, or summarizing students' scores. Suppose the average (mean) score is above 60 and the professor wants at least half of the students to pass. A few very high scores can pull the mean above 60 even though most students scored below it, so the mean does not reflect the “typical” student score; the median is the better measure here.

    Finding the median of a batch of n numbers is easy as long as you remember to order the values first.
    If n is odd, the median is the middle value. Counting in from the ends, we find the value in the (n+1)/2 position.
    If n is even, there are two middle values. In that case, the median is the average of the two values in positions n/2 and n/2+1.

  • Mode: This is the most frequently repeated data point, that is, the most frequent number.
  • (symmetrical distribution VS asymmetrical distribution)
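
The odd/even rule for the median and the definition of the mode can be sketched as follows. The scores are made up to mirror the classroom example: the mean is above 60 although most students scored below it. The helper names `median` and `mode` are illustrative; Python's `statistics` module offers equivalents.

```python
from collections import Counter


def median(values):
    """Middle value of the sorted data (average of the two middle values if n is even)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                        # 1-based position (n+1)/2
    return (ordered[mid - 1] + ordered[mid]) / 2   # 1-based positions n/2 and n/2+1


def mode(values):
    """Most frequent value (the first one encountered if there is a tie)."""
    return Counter(values).most_common(1)[0][0]


scores = [40, 45, 50, 52, 55, 98, 100]   # made-up exam scores
mean_score = sum(scores) / len(scores)
print(f"mean={mean_score:.1f}  median={median(scores)}")
```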

     Two other measurements, the median and the mode, are often used to define distribution. The median, or “middle item,” is helpful for establishing the “center” of the data; when the data is sorted, it is the value in the middle. The median has the advantage of discounting extreme values, which might distort the arithmetic mean. Its disadvantage is that you must sort all of the data in order to locate the middle point. The median is preferred over the mean except when using a very small number of items.

     The mode is the most commonly occurring value. In Figure 2.5, the mode is the highest bar in the frequency distribution, at bin 500.

     In a normally distributed price series, the mode, mean, and median all occur at the same value; however, as the data becomes skewed, these values will move farther apart. The general relationship is:
  • Mode < Median < Mean  (positively skewed)
  • Mean < Median < Mode  (negatively skewed)
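
A small made-up sample shows both orderings. The right-tailed list is illustrative only, and its mirror image produces the negatively skewed case; the ordering is typical of skewed data rather than guaranteed for every data set.

```python
import statistics

right_tail = [300, 300, 300, 350, 350, 400, 450, 600, 900]   # made-up prices, long right tail
left_tail = [1200 - p for p in right_tail]                    # mirror image, long left tail

for name, data in [("positively skewed", right_tail), ("negatively skewed", left_tail)]:
    print(name,
          "mode:", statistics.mode(data),
          "median:", statistics.median(data),
          "mean:", round(statistics.mean(data), 1))
```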

     A normal distribution is commonly called a bell curve, and values fall equally on both sides of the mean. For much of the work done with price and performance data, the distributions tend to be skewed to the right (Mode < Median < Mean: toward higher prices or higher trading profits), and appear to flatten or cut off on the left (lower prices or trading losses). If you were to chart a distribution of trading profits and losses based on a trend system with a fixed stop-loss, you would get profits that could range from zero to very large values, while the losses would be theoretically limited to the size of the stop-loss. Skewed distributions will be important when we measure probabilities later in this chapter. There are no “normal” distributions in a trading environment.

Characteristics of the Principal Averages

     Each averaging method has its unique meaning and usefulness. The following summary points out their principal characteristics:

  • The arithmetic mean is affected by each data element equally, but it has a tendency to emphasize extreme values more than other methods. It is easily calculated and is subject to algebraic manipulation.
  • The geometric mean gives less weight to extreme variations than the arithmetic mean and is most important when using data representing ratios or rates of change. It cannot be used for negative numbers but is also subject to algebraic manipulation.
  • The harmonic mean is most applicable to time changes and, along with the geometric mean, has been used in economics for price analysis. It is more difficult to calculate; therefore, it is less popular than either of the other averages, although it is also capable of algebraic manipulation.
  • The mode is the most common value and is only determined by the frequency distribution. It is the location of greatest concentration and indicates a typical value for a reasonably large sample. With an unsorted set of data, such as prices, the mode is time-consuming to locate and is not capable of algebraic manipulation.
  • The median is the middle value, and is most useful when the center of an incomplete set is needed. It is not affected by extreme variations and is simple to find; however, it requires sorting the data, which causes the calculation to be slow. Although it has some arithmetic properties, it is not readily adaptable to computational methods.

Moments of the Distribution: Variance, Skewness, and Kurtosis

     The moments of the distribution describe the shape of the data points, which is the way they cluster around the mean. There are four moments: mean, variance, skew, and kurtosis, each describing a different aspect of the shape of the distribution. Simply put, the mean is the center or average value, the variance is the distance of the individual points from the mean, the skew is the way the distribution leans to the left or right relative to the mean, and the kurtosis is the peakedness of the clustering. We have already discussed the mean, so we will start with the 2nd moment.

Mean(first moments) and Mean Deviation

     In the following calculations, we will use the bar notation, \bar{P}, to indicate the average of a list of n prices. The capital P refers to all prices and the small p to individual prices.

     The Mean Deviation (MD) is a basic method for measuring distribution and may be calculated about any measure of central location, such as the arithmetic mean.
MD = \frac{\sum_{i=1}^{n} |p_i - \bar{P}|}{n}

     Then MD is the average of the differences between each price and the arithmetic mean of those prices, or some other measure of central location, with all differences treated as positive numbers.

Variance (2nd Moment)

     Variance (Var), which is very similar to mean deviation and is the best estimation of dispersion, will be used as the basis for many other calculations. It is
Var = \frac{\sum_{i=1}^{n} (p_i - \bar{P})^2}{n - 1}

     This is the mean of squared deviations from the mean. In general notation (x_i = data points, µ = mean of the data, N = number of data points), the population variance is \sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2. The dimension of variance is the square of the actual values. A sample uses the denominator N-1 instead of N because of the degrees of freedom: one degree of freedom is lost when the sample mean \bar{x} is substituted for the unknown population mean µ:
s^2 = \frac{1}{N-1} \sum_{i=1}^{N} (x_i - \bar{x})^2

     Notice that the variance is the square of the standard deviation, Var = s^2 = \sigma^2, one of the most commonly used statistics. In Excel, the variance is the function var(list).

     The standard deviation (s), most often shown as σ (sigma), is a special form of measuring average deviation from the mean, which uses the root-mean-square
\sigma = \sqrt{\frac{\sum_{i=1}^{n} (p_i - \bar{P})^2}{n - 1}}

     where the differences between the individual prices and the mean are squared to emphasize the significance of extreme values, and then the total value is scaled back using the square root function. The standard deviation is the square root of variance. By applying the square root to the variance, we measure the dispersion in the units of the original variable rather than in squared units:
\sigma = \sqrt{Var}

     The standard deviation is the most popular way of measuring the dispersion of data.
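
As a quick numerical check of the mean deviation, variance, and standard deviation, here is a minimal sketch with five made-up prices. It assumes the sample convention (divisor n-1), selected with NumPy's `ddof=1`.

```python
import numpy as np

prices = np.array([310.0, 325.0, 340.0, 355.0, 420.0])   # made-up prices

mean = prices.mean()
mean_dev = np.abs(prices - mean).mean()   # average absolute deviation from the mean
variance = prices.var(ddof=1)             # squared deviations divided by n-1
std_dev = prices.std(ddof=1)              # square root of the variance

print(f"mean={mean}  MD={mean_dev}  Var={variance}  std={std_dev:.2f}")
```

The variance is expressed in squared price units, while the mean deviation and standard deviation are in the same units as the prices themselves.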

  • The value of 1 standard deviation about the mean represents a clustering of about 68% of the data,
  • 2 standard deviations from the mean include 95.5% of all data
  • and 3 standard deviations encompass 99.7%, nearly all the data.
  • While it is not possible to guarantee that all data will be included, 3.5 standard deviations include effectively 100% (about 99.95%) of the data in a normal distribution. These values represent the groupings of a perfectly normal set of data, shown in Figure 2.6.
    FIGURE 2.6 Normal distribution showing the percentage area included within one standard deviation about the arithmetic mean.
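
The percentages quoted for 1, 2, 3, and 3.5 standard deviations follow from the normal cumulative distribution function. A minimal sketch using `scipy.stats.norm` (the helper name `coverage` is illustrative):

```python
from scipy.stats import norm


def coverage(k: float) -> float:
    """Probability that a normal variable lies within k standard deviations of its mean."""
    return norm.cdf(k) - norm.cdf(-k)


for k in (1, 2, 3, 3.5):
    print(f"within {k} standard deviations: {coverage(k):.2%}")
```

These figures apply only to a perfectly normal distribution; skewed price data will not match them.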

Skewness (3rd moment)

  • The mean or average is defined as follows:
    \bar{R} = \frac{1}{n} \sum_{i=1}^{n} R_i      (1)
  • The sample standard deviation, that is, σ, is the square root of the variance:
    \sigma = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (R_i - \bar{R})^2}      (2)
  • The skewness, defined by the following formula, indicates whether the distribution is skewed to the left or to the right. For a symmetric distribution, such as the normal distribution, the skewness is zero:
    \gamma_1 = E\left[\left(\frac{R - \mu}{\sigma}\right)^3\right]
    For a sample of n values, two natural estimators of the population skewness are given below.
  • The sample skewness is computed as the Fisher-Pearson coefficient of skewness:
    g_1 = \frac{m_3}{m_2^{3/2}}      (3)
    m_m = \frac{1}{n} \sum_{i=1}^{n} (R_i - \bar{R})^m is the biased sample mth central moment
    where \bar{R} is the sample mean, m_2 is the (biased) sample second central moment, and m_3 is the sample third central moment. g_1 is a method of moments estimator.
  • scipy.stats.skew(a, axis=0, bias=True, nan_policy='propagate', *, keepdims=False): If bias is False, the calculations are corrected for bias and the value computed is the adjusted Fisher-Pearson standardized moment coefficient, i.e.
    G_1 = \frac{\sqrt{n(n-1)}}{n-2} \frac{m_3}{m_2^{3/2}}

    Another common definition of the sample skewness is
    b_1 = \frac{m_3}{s^3} = \left(\frac{n-1}{n}\right)^{3/2} \frac{m_3}{m_2^{3/2}}
    where s is the sample standard deviation computed with the divisor n-1.
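
The two `scipy.stats.skew` conventions can be reproduced by hand from the biased central moments. The data below is made up for illustration; the names `g1` and `G1` follow the usual notation for the Fisher-Pearson coefficient and its bias-adjusted version.

```python
import numpy as np
from scipy.stats import skew

x = np.array([2.0, 3.0, 3.0, 4.0, 4.0, 4.0, 5.0, 9.0, 15.0])   # made-up, right-skewed
n = len(x)

m2 = np.mean((x - x.mean()) ** 2)   # biased second central moment
m3 = np.mean((x - x.mean()) ** 3)   # biased third central moment

g1 = m3 / m2 ** 1.5                            # Fisher-Pearson coefficient (bias=True)
G1 = np.sqrt(n * (n - 1)) / (n - 2) * g1       # adjusted coefficient (bias=False)

print("manual:", g1, G1)
print("scipy: ", skew(x), skew(x, bias=False))
```

For small samples the adjusted value is noticeably larger in magnitude than the biased one; the two converge as n grows.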

     Most price data, however, are not normally distributed. For physical commodities, such as gold, grains, energy, and even interest rates (expressed as yields), prices tend to spend more time at low levels and much less time at extreme highs. While gold peaked at $800 per ounce for one day in January 1980, it remained between $250 and $400 per ounce for most of the next 20 years. If we had taken the average at $325, then it would be impossible for the price distribution to be symmetric. If 1 standard deviation is $140, then a normal distribution would show a high likelihood of prices dropping to $185 (325 - 140 = 185), an unlikely scenario. This asymmetry is most obvious in agricultural markets, where a shortage of soybeans or coffee in one year will drive prices much higher, but a normal crop the following year will return those prices to previous levels.

     The relationship of price versus time, where markets spend more time at lower levels, can be measured as skewness—the amount of distortion from a symmetric distribution, which makes the curve appear to be short on the left and extended to the right (higher prices). The extended side is called the tail, and a longer tail to the right is called positive skewness. Negative skewness has the tail extending toward the left. This can be seen in Figure 2.7. 
FIGURE 2.7 Skewness. Nearly all price distributions are positively skewed, showing a longer tail to the right, at higher prices.

     In a perfectly normal distribution, the mean, median, and mode all coincide. As prices become positively skewed, typical of a period of higher prices, the mean will show the greatest change, the mode will show the least, and the median will fall in between. The difference between the mean and the mode, adjusted for dispersion using the standard deviation of the distribution, gives a good measure of skewness:
S_K = \frac{\text{Mean} - \text{Mode}}{\text{Standard deviation}}

     The distance between the mean and the mode, in a moderately skewed distribution, turns out to be three times the difference between the mean and the median; the relationship can also be written as:
S_K = \frac{3 \times (\text{Mean} - \text{Median})}{\text{Standard deviation}}
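
Both versions of this skewness measure can be computed directly. The sketch below uses a small made-up, right-skewed sample; because the "three times" relationship is only approximate, the two values agree in sign but not exactly in size.

```python
import statistics

prices = [300, 300, 300, 350, 350, 400, 450, 600, 900]   # made-up, right-skewed prices

mean = statistics.mean(prices)
sd = statistics.stdev(prices)   # sample standard deviation (divisor n-1)

sk_mode = (mean - statistics.mode(prices)) / sd            # based on mean minus mode
sk_median = 3 * (mean - statistics.median(prices)) / sd    # based on mean minus median

print(f"mode-based: {sk_mode:.2f}  median-based: {sk_median:.2f}")
```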

     To show the similarity between the 2nd and 3rd moments (variance and skewness), the more common computational formula is the Fisher-Pearson coefficient of skewness:
S_K = \frac{\sum_{i=1}^{n} (p_i - \bar{P})^3}{n \sigma^3} = \frac{m_3}{m_2^{3/2}}

where m_m = \frac{1}{n} \sum_{i=1}^{n} (p_i - \bar{P})^m is the biased sample mth central moment and \sigma = \sqrt{m_2}.



     The skewness of a data series can sometimes be corrected using a transformation. Price data may be skewed in a specific pattern. For example, if there are 1/4 of the occurrences at twice the price, and 1/9 of the occurrences at 3 times the price, the result is a positively skewed distribution. In this case, taking the square root of each data item can transform the data into a normal distribution. The characteristics of price data often show a logarithmic, power, or square-root relationship. For example,

  • It is common to use a logarithmic transformation of price data to account for the effect of compounding.
  • ibm_df['log_rtn'] = np.log( ibm_df['Adj Close']/ibm_df['Adj Close'].shift(1) )

  • Power transformations may also be used to adjust for the fact that price changes are often proportional to the current price level.
  • Square-root transformations may be used to account for the fact that variance in price changes tends to increase with the price level.
  • It is important to note that the specific transformation used will depend on the characteristics of the data and the underlying distribution. It is also important to check the transformed data for normality using statistical tests, such as the Shapiro-Wilk test or the Kolmogorov-Smirnov test, before using standard normal probability calculations. If the transformed data is not normal, other nonparametric methods may be used to estimate probabilities.
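
The effect of these transformations on skewness can be illustrated with synthetic data. The sketch below draws lognormal "prices" (an assumption made purely for the demonstration, with a fixed seed), so the log transform removes the skew almost entirely while the square root removes only part of it. Real price series may respond differently.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)
# Synthetic, positively skewed prices centered near 350 (illustrative only)
prices = rng.lognormal(mean=np.log(350), sigma=0.3, size=5_000)

skew_raw = skew(prices)
skew_sqrt = skew(np.sqrt(prices))   # square-root transformation
skew_log = skew(np.log(prices))     # logarithmic transformation

print(f"raw: {skew_raw:.2f}  sqrt: {skew_sqrt:.2f}  log: {skew_log:.2f}")
```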

         The normality diagnostic is a statistical test based on a null hypothesis that you either reject or fail to reject. Conveniently, the following tests share the same null hypothesis. The null hypothesis H_0 states that the data is normally distributed; you reject H_0 if the p-value is less than 0.05, concluding that the time series is not normally distributed. Let's create a simple function, is_normal(), that returns either Normal or Not Normal based on the p-value:
    from scipy.stats import shapiro, kstest, normaltest
    from statsmodels.stats.diagnostic import kstest_normal, normal_ad
    def is_normal( test, p_level=0.05, name='' ):
        stat, pvalue = test
        print( name + ' test')
        print( 'statistic: ', stat )
        print( 'p-value:', pvalue)
        return 'Normal' if pvalue > p_level else 'Not Normal'
    normal_args = ( np.mean(ibm_df['sim_rtn'].dropna()), np.std(ibm_df['sim_rtn'].dropna()) )
    # The Shapiro-Wilk test tests the null hypothesis that 
    # the data was drawn from a normal distribution.
    print( is_normal( shapiro(ibm_df['sim_rtn'].dropna()), name='##### Shapiro-Wilk' ) )
    # Test whether a sample differs from a normal distribution.
    # statistic:                     z-score = (x-mean)/std
    #           s^2 + k^2, where s is the z-score returned by skewtest 
    #           and k is the z-score returned by kurtosistest.
    print( is_normal( normaltest(ibm_df['sim_rtn'].dropna()), name='##### normaltest' ) )
    # Anderson-Darling test for normal distribution unknown mean and variance.
    print( is_normal( normal_ad(ibm_df['sim_rtn'].dropna()), name='##### Anderson-Darling' ) )
    # Test assumed normal or exponential distribution using Lilliefors’ test.
    # Kolmogorov-Smirnov test statistic with estimated mean and variance.
    print( is_normal( kstest_normal(ibm_df['sim_rtn'].dropna()), name='##### Kolmogorov-Smirnov') )
    # The one-sample test compares the underlying distribution F(x) of
    # a sample against a given distribution G(x).
    # The two-sample test compares the underlying distributions of 
    # two independent samples. 
    # Both tests are valid only for continuous distributions.
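    # Assumption: the sample is compared against a normal distribution
    # parameterized by normal_args (the sample mean and standard deviation).
    print( is_normal( kstest( ibm_df['sim_rtn'].dropna(),
                              'norm', args=normal_args
                            ), name='##### KS' ) )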
    # https://blog.csdn.net/Linli522362242/article/details/128294150

    The output from the tests confirms the data does not come from a normal distribution. You do not need to run that many tests. The shapiro test, for example, is a very common and popular test that you can rely on. 

     To calculate the probability level of a distribution based on the skewed distribution of price, we can convert the normal probability to the exponential probability equivalent.

  • While the normal probability, P, understates the probability of occurrence in a price distribution,
  • the exponential distribution will overstate the probability.
  • Whenever possible, it is better to use the exact calculation;
  • however, when calculating risk, it might be best to err on the side of slightly higher than expected risk.

Skewness in Distributions at Different Relative Price Levels

FIGURE 2.7 Skewness. Nearly all price distributions are positively skewed, showing a longer tail to the right, at higher prices.

     Because the lower price levels of most commodities are determined by production costs, price distributions show a clear tendency to resist moving below these thresholds. This contributes to the positive skewness in those markets. Considering only the short term, when prices are at unusually high levels, they can be volatile and unstable, causing a negative skewness that can be interpreted as being top heavy. Somewhere between the very high and very low price levels, we may find a frequency distribution that looks normal. Figure 2.8 shows the change in the distribution of prices over, for example, 20 days as prices move sharply higher. The mean shows the center of the distributions as they change from positive to negative skewness. This pattern indicates that a normal distribution is not appropriate for all price analysis, and that a log, exponential, or power distribution would apply best only to long-term analysis.

FIGURE 2.8 Changing distribution at different price levels. A, B, and C are increasing mean values of three shorter-term distributions and show the distribution changing from positive to negative skewness.

Kurtosis (4th Moment)

     One last measurement, kurtosis, is needed to describe the shape of a price distribution. Kurtosis is the peakedness or flatness of a distribution as shown in Figure 2.9. This measurement is good for an unbiased assessment of whether prices are trending or moving sideways.

  • If you see prices moving steadily higher (as happens when the market is trending), then the distribution will be flatter and cover a wider range. This is called negative kurtosis.
  • If prices are rangebound (typical of a sideways market), then the frequency will show clustering around the mean and we have positive kurtosis.

Steidlmayer’s Market Profile, discussed in Chapter 18, uses the concept of kurtosis, with the frequency distribution accumulated dynamically using real-time price changes.

FIGURE 2.9 Kurtosis. A positive kurtosis is when the peak of the distribution is greater than normal, typical of a sideways market. A negative kurtosis, shown as a flatter distribution, occurs when the market is trending.

Following the same form as the 3rd moment, skewness, kurtosis can be calculated as

     The kurtosis reflects the impact of extreme values because of its power of four. There are two definitions, with and without subtracting three; refer to the following two equations. The reason for deducting three in equation (4B) is that the kurtosis of a normal distribution, based on equation (4A), is three:

Data standardization is the process of converting data to z-score values based on the mean and standard deviation of the data.

scipy.stats.kurtosis(a, axis=0, fisher=True, bias=True, nan_policy='propagate', *, keepdims=False)
fisher bool, optional

  • If True, Fisher’s definition is used (normal ==> 0.0).
  • If False, Pearson’s definition is used (normal ==> 3.0).

     Some books distinguish these two equations by calling equation (4B) excess kurtosis. However, many functions based on equation (4B) are still named kurtosis. Recall that a standard normal distribution has zero mean, unit standard deviation, zero skewness, and zero kurtosis (based on equation 4B): subtracting 3 sets the kurtosis of the normal distribution to 0, which makes comparisons easier.
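
     As a quick illustration of the two conventions, the sketch below computes both definitions with scipy.stats.kurtosis on simulated data. The simulated sample, the seed, and the variable names are illustrative assumptions, not part of the original text.

```python
import numpy as np
from scipy.stats import kurtosis

# Illustrative data only: 100,000 draws from a standard normal distribution
rng = np.random.default_rng(seed=0)
sample = rng.standard_normal(100_000)

# Pearson's definition (fisher=False): a normal distribution gives about 3
pearson_k = kurtosis(sample, fisher=False)
# Fisher's definition (fisher=True, the default): a normal distribution gives about 0
fisher_k = kurtosis(sample, fisher=True)

print('Pearson kurtosis:', round(pearson_k, 3))
print('Fisher (excess) kurtosis:', round(fisher_k, 3))
# The two definitions differ by the constant 3
print('difference:', round(pearson_k - fisher_k, 3))
```

     Whichever convention a library uses, it is worth checking the fisher flag (or its equivalent) before comparing a kurtosis value against the thresholds quoted in a book.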

An alternative calculation for kurtosis is 

Standard unbiased estimator

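     Given a subset of samples from a population, the sample excess kurtosis g_2 = m_4 / m_2^2 - 3 is a biased estimator of the population excess kurtosis. An alternative estimator of the population excess kurtosis, which is unbiased in random samples of a normal distribution, is defined as follows:

G_2 = \frac{k_4}{k_2^2} = \frac{n-1}{(n-2)(n-3)}\left[(n+1)\frac{m_4}{m_2^2} - 3(n-1)\right] = \frac{n-1}{(n-2)(n-3)}\left[(n+1)g_2 + 6\right]

 where k_4 is the unique symmetric unbiased estimator of the fourth cumulant, k_2 is the unbiased estimate of the second cumulant (identical to the unbiased estimate of the sample variance), m_4 is the fourth sample moment about the mean, m_2 is the second sample moment about the mean, x_i is the i-th value, and \bar{x} is the sample mean. This adjusted Fisher-Pearson standardized moment coefficient G_2 is the version found in Excel and several statistical packages, including Minitab, SAS, and SPSS.

Unfortunately, in nonnormal samples G_2 is itself generally biased.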
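
     The following is a minimal sketch of the adjusted (bias-corrected) excess kurtosis, written from the sample moments about the mean and compared with scipy.stats.kurtosis called with bias=False. The function name unbiased_excess_kurtosis and the return values are hypothetical and serve only as an illustration.

```python
import numpy as np
from scipy.stats import kurtosis

def unbiased_excess_kurtosis(x):
    """G_2 = (n-1)/((n-2)(n-3)) * [(n+1)*g_2 + 6], with g_2 = m_4/m_2**2 - 3."""
    x = np.asarray(x, dtype=float)
    n = x.size
    dev = x - x.mean()
    m2 = np.mean(dev ** 2)   # second sample moment about the mean
    m4 = np.mean(dev ** 4)   # fourth sample moment about the mean
    g2 = m4 / m2 ** 2 - 3.0  # biased sample excess kurtosis
    return (n - 1) / ((n - 2) * (n - 3)) * ((n + 1) * g2 + 6.0)

# Hypothetical daily returns, for illustration only
returns = np.array([0.012, -0.004, 0.007, -0.021, 0.003,
                    0.015, -0.009, 0.001, 0.030, -0.011])

g2_manual = unbiased_excess_kurtosis(returns)
g2_scipy = kurtosis(returns, fisher=True, bias=False)
print(g2_manual, g2_scipy)
```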

     Most often the excess kurtosis is used, which makes it easier to see abnormal distributions. Excess kurtosis is KE = K – 3, because the kurtosis of a normal distribution is 3; after the subtraction, the normal distribution has a kurtosis of 0, which makes comparisons easier.

     Kurtosis is also useful when reviewing system tests. If you find the kurtosis of the daily returns, it should be somewhat better than normal if the system is profitable; however, if the kurtosis is above 7 or 8, then it begins to look as though the trading method is overfitted. A high kurtosis means that there are an overwhelming number of profitable trades of similar size, which is not likely to happen in real trading. Any high value of kurtosis should make you immediately suspicious.

Choosing between Frequency Distribution and Standard Deviation

     Frequency distributions are important because the standard deviation doesn’t work for skewed distributions, which are the norm for most price data. For example,

  • if we look back at the histogram for wheat, the average price over the past 25 years was $3.62 and the standard deviation of those prices was $1.16,
  • then 1 standard deviation to the left of the mean is $2.46(=3.62-1.16), a bin which has no data.
  • On the right side, 3.5 standard deviations, which should contain 100% of the data, is $7.68(=3.62+3.5*1.16), far below the actual high price.

     The standard deviation can therefore fail on both ends of the distribution for highly skewed data, while the frequency distribution (the count in each bin) gives a very clear and useful picture. If we wanted to know the price at the 10% and 90% probability levels based on the frequency distribution, we would sort all the data from low to high. If there were 300 monthly data points, then the 10% level would be in position 30 and the 90% level in position 271. The median price would be at position 151. This is shown in Figure 2.10.
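
     The sorting procedure can be sketched as follows. The lognormal prices are a made-up stand-in for 300 monthly data points, and the position arithmetic simply mirrors the positions quoted in the paragraph above (30, 271, and 151, counted from 1).

```python
import numpy as np

# Hypothetical data: 300 monthly prices from a right-skewed (lognormal) distribution
rng = np.random.default_rng(seed=1)
prices = rng.lognormal(mean=1.2, sigma=0.3, size=300)

sorted_prices = np.sort(prices)   # sort from low to high
n = sorted_prices.size

# 1-based positions as used in the text: 10% of the data from each end
low_pos = n // 10                 # position 30 when n = 300
high_pos = n - low_pos + 1        # position 271 when n = 300
median_pos = n // 2 + 1           # position 151 when n = 300

# Convert 1-based positions to 0-based array indices
price_10 = sorted_prices[low_pos - 1]
price_90 = sorted_prices[high_pos - 1]
price_median = sorted_prices[median_pos - 1]

print(low_pos, high_pos, median_pos)
print(round(price_10, 2), round(price_median, 2), round(price_90, 2))
```

     With skewed data, the distance from the median up to the 90% level will usually be larger than the distance down to the 10% level, which is the asymmetry the standard deviation cannot express.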

FIGURE 2.10 Measuring 10% from each end of the frequency distribution. The dense clustering at low prices will make the lower zone look narrow, while high prices with less frequent data will appear to have a wide zone.

     When there is a long tail to the right, both the frequency distribution and the standard deviation imply that large moves are to be expected. When the distribution is very symmetric, then we are not as concerned. For those markets that have had extreme moves, neither method will tell you the size of the extreme move that could occur. There is no doubt that, given enough time, we will see profits and losses that are larger than we have seen in the past, perhaps much larger. 


     Serial correlation or autocorrelation means that there is persistence in the data; that is, future data can be predicted (to some degree) from past data. Such a quality could indicate the existence of trends. A simple way of finding autocorrelation is to put the data into column A of a spreadsheet, then copy it to column B while shifting the data down by 1 row. Then find the correlation of column A and column B. Additional correlations can be calculated by shifting column B down 2, 3, or 4 rows, which might show the existence of a cycle.
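
     The spreadsheet procedure translates directly into a few lines of NumPy. This is a minimal sketch: lagged_correlation is a hypothetical helper name, and the repeating 4-period series is invented so that the cycle is easy to see.

```python
import numpy as np

def lagged_correlation(series, lag):
    """Correlation between a series and a copy of itself shifted by `lag` rows."""
    x = np.asarray(series, dtype=float)
    # Pair each value with the value `lag` rows earlier (column A vs. shifted column B)
    return np.corrcoef(x[lag:], x[:-lag])[0, 1]

# Hypothetical series with a 4-period cycle, for illustration only
cycle = np.tile([1.0, 2.0, 1.0, 0.0], 25)

for lag in (1, 2, 3, 4):
    print(lag, round(lagged_correlation(cycle, lag), 3))
```

     A correlation that returns close to 1 at a particular lag, as it does here at lag 4, is the kind of pattern that hints at a cycle of that length.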

  • The ACF and PACF plots show significant autocorrelation or partial autocorrelation wherever a value falls outside the confidence interval. The shaded portion represents the confidence interval, which is controlled by the alpha parameter in both the plot_acf and plot_pacf functions. The default value for alpha in statsmodels is 0.05 (a 95% confidence interval). Significance can be in either direction: strongly positive the closer the value is to 1 (above the band) or strongly negative the closer it is to -1 (below the band). https://blog.csdn.net/Linli522362242/article/details/127932130
  • If there is a strong correlation between past observations at lags 1, 2, 3, and 4, this means that the correlation measure at lag 1 is influenced by the correlation with lag 2, lag 2 is influenced by the correlation with lag 3, and so on. ACF(y_t, y_{t-1}, y_{t-2}, ..., y_{t-(p-1)}, y_{t-p})
  • The ACF measure at lag 1 will include these influences of prior lags if they are correlated.
  • In contrast, a PACF at lag 1 will remove these influences to measure the pure relationship at lag 1 with the current observation. PACF(y_t, y_{t-p})

    If we assume an AR(p) model (autoregression model: AR(p): y_t = \alpha + \theta_1 y_{t-1} + \theta_2 y_{t-2} + ... + \theta_p y_{t-p} + \epsilon_t, or equivalently y_t = \beta_0 + \sum_{p=1}^{P} \beta_p y_{t-p} + \epsilon_t), then we may wish to measure only the association between y_t and y_{t-p} and filter out the linear influence of the random variables that lie in between (i.e., y_{t-1}, y_{t-2}, ..., y_{t-(p-1)}), which requires a transformation of the time series. Calculating the correlation of the transformed time series then gives the partial autocorrelation function (PACF). https://blog.csdn.net/Linli522362242/article/details/127558757

What Is the Durbin Watson Statistic? 

     The Durbin Watson (DW) statistic is a test for autocorrelation in the residuals from a statistical model or regression analysis. The Durbin-Watson statistic will always have a value between 0 and 4.

  • A value of 2.0 indicates that no autocorrelation is detected in the sample.
  • Values from 0 to less than 2 point to positive autocorrelation, and
  • values from more than 2 to 4 mean negative autocorrelation.
  • Autocorrelation can be useful in technical analysis, which is most concerned with the trends of security prices using charting techniques in lieu of a company's financial health or management.

     A stock price displaying positive autocorrelation would indicate that the price yesterday has a positive correlation with the price today, so if the stock fell yesterday, it is also likely to fall today. A security that has a negative autocorrelation, on the other hand, has a negative influence on itself over time, so that if it fell yesterday, there is a greater likelihood it will rise today.

A similar assessment can also be carried out with the Breusch–Godfrey test and the Ljung–Box test.

     A formal way of finding autocorrelation is by using the Durbin-Watson test, which gives the d-statistic. This approach measures the change in the errors (e), the difference between N data points and their average value.
The Hypotheses for the Durbin Watson test are:

  • H0 = no first order autocorrelation.
  • H1 = first order correlation exists.

(For a first order correlation, the lag is one time unit).
Assumptions are:

     Autocorrelation of residuals is a measure of the correlation between the error terms (residuals) of a time series model at different lags. It is important to interpret the autocorrelation of residuals because it can indicate whether the model is adequately capturing all the systematic patterns in the data.

     If the autocorrelation of residuals is close to zero for all lags, it suggests that the model is capturing all the systematic patterns in the data, and the residuals are behaving randomly and independently. This is desirable because it indicates that the model is a good fit to the data.

     On the other hand, if the autocorrelation of residuals is high for one or more lags, it suggests that the model is not capturing all the systematic patterns in the data. Specifically, it suggests that there is some pattern or trend in the residuals that is not accounted for by the model, and this can lead to biased parameter estimates, inflated standard errors, and invalid test statistics.

     If e_t is the residual (the difference between the observed value and the predicted value for a particular observation) given by

e_t = \rho e_{t-1} + \nu_t   (residual autocorrelation)

the Durbin-Watson test statistic is

d = \frac{\sum_{t=2}^{T} (e_t - e_{t-1})^2}{\sum_{t=1}^{T} e_t^2}

where T is the number of observations. For large T, d is approximately equal to 2(1 − \hat{\rho}), where \hat{\rho} is the sample autocorrelation of the residuals; d = 2 therefore indicates no autocorrelation. The value of d always lies between 0 and 4. If the Durbin–Watson statistic is substantially less than 2 (d < 2), there is evidence of positive serial correlation. As a rough rule of thumb, if Durbin–Watson is less than 1.0, there may be cause for alarm. Small values of d indicate successive error terms are positively correlated. If d > 2, successive error terms are negatively correlated. In regressions, this can imply an underestimation of the level of statistical significance.

Notes on the residual equation:
  • The residual r_t measures the relationship between y_t and y_{t-1}: r_t = y_t - y_{t-1} = \rho r_{t-1} + \nu_t. Likewise, the residual r_{t-1} measures the relationship between y_{t-1} and y_{t-2}: r_{t-1} = y_{t-1} - y_{t-2} = \rho r_{t-2} + \nu_{t-1}.
  • AR(1) = AR(p=1): y_t = \alpha + \theta_1 y_{t-1} + \epsilon_t. In this regression model, the response variable in the previous time period has become the predictor, and the errors \epsilon_t carry the usual assumptions about errors (white noise) in a simple linear regression model (note that \alpha is a constant). https://blog.csdn.net/Linli522362242/article/details/127558757
  • The general form is AR(p): y_t = \alpha + \theta_1 y_{t-1} + \theta_2 y_{t-2} + ... + \theta_p y_{t-p} + \epsilon_t

     A positive autocorrelation, or serial correlation, means that a positive error factor has a good chance of following another positive error factor. https://en.wikipedia.org/wiki/Durbin%E2%80%93Watson_statistic

 #######(5) 6.3 Tests for Autocorrelation - Baidu Wenku

####### https://www.investopedia.com/terms/d/durbin-watson-statistic.asp#:~:text=The%20Durbin%20Watson%20statistic%20is,above%202.0%20indicates%20negative%20autocorrelation.

Assume the following (x,y) data points:

  • Pair One = (10, 1,100)
  • Pair Two = (20, 1,200)
  • Pair Three = (35, 985)
  • Pair Four = (40, 750)
  • Pair Five = (50, 1,215)
  • Pair Six = (45, 1,000)

     Using the methods of a least squares regression to find the "line of best fit," the equation for the best fit line of this data is:

Y=−2.6268*x +1,129.2

     The first step in calculating the Durbin Watson statistic is to calculate the expected "y" values using the line of best fit equation. For this data set, the expected "y" values are:

  • ExpectedY(1)=(−2.6268*10)+1,129.2=1,102.9
  • ExpectedY(2)=(−2.6268*20)+1,129.2=1,076.7
  • ExpectedY(3)=(−2.6268*35)+1,129.2=1,037.3
  • ExpectedY(4)=(−2.6268*40)+1,129.2=1,024.1
  • ExpectedY(5)=(−2.6268*50)+1,129.2=997.9
  • ExpectedY(6)=(−2.6268*45)+1,129.2=1,011

Next, the differences of the actual "y" values versus the ExpectedY values, the errors, are calculated:

  • Error(1)=(1,100−1,102.9)=−2.9
  • Error(2)=(1,200−1,076.7)=123.3
  • Error(3)=(985−1,037.3)=−52.3
  • Error(4)=(750−1,024.1)=−274.1
  • Error(5)=(1,215−997.9)=217.1
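  • Error(6)=(1,000−1,011)=−11

Next these errors must be squared and summed:

     Sum of Errors Squared = (−2.9)^2 + 123.3^2 + (−52.3)^2 + (−274.1)^2 + 217.1^2 + (−11)^2 = 140,330.81

Next, the value of each error minus the previous error is calculated and squared: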

  • Difference(1)=(123.3 − (−2.9))=126.2
  • Difference(2)=(−52.3 − 123.3)=−175.6
  • Difference(3)=(−274.1 − (−52.3))=−221.9
  • Difference(4)=(217.1 − (−274.1))=491.3
  • Difference(5)=(−11 − 217.1)=−228.1
  • Sum of Differences Square=389,406.71

Finally, the Durbin Watson statistic is the quotient of the squared values:

     Durbin Watson : d=389,406.71/140,330.81=2.77

Note: The tenths place may be off due to rounding errors in the squaring.

Durbin Watson Test - GeeksforGeeks
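
     The whole worked example can be reproduced in a few lines. This is a sketch using NumPy's polyfit for the least squares line; because it keeps the unrounded errors, the two sums differ slightly from the hand-rounded figures above, while d agrees to two decimals.

```python
import numpy as np

# (x, y) pairs from the worked example, in the order listed
x = np.array([10, 20, 35, 40, 50, 45], dtype=float)
y = np.array([1100, 1200, 985, 750, 1215, 1000], dtype=float)

# Least squares line of best fit: y = slope * x + intercept
slope, intercept = np.polyfit(x, y, deg=1)

expected_y = slope * x + intercept
errors = y - expected_y                       # residuals e_t

sum_sq_errors = np.sum(errors ** 2)           # denominator: sum of e_t^2
sum_sq_diffs = np.sum(np.diff(errors) ** 2)   # numerator: sum of (e_t - e_{t-1})^2

durbin_watson = sum_sq_diffs / sum_sq_errors
print('slope:', round(slope, 4), 'intercept:', round(intercept, 1))
print('Durbin Watson d:', round(durbin_watson, 2))
```

     Note that the statistic depends on the order of the observations, so the pairs are kept in the order given (50 before 45) rather than sorted by x.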

Probability of Achieving a Return 

FIGURE 2.6 Normal distribution showing the percentage area included within one standard deviation about the arithmetic mean.

     If we see the normal distribution (Figure 2.6) as the annual returns for the stock market over the past 50 years, then the mean is about 8%, and one standard deviation is 16%. In any one year, we can expect the returns to be 8%; however, there is a 32% chance that it will be either greater than 24% (= 8% + 16%) or less than –8% (= 8% - 16%). If you would like to know the probability of a return of 20% or greater, you must first rescale the values,
Probability \; of \; reaching \; objective = \frac{Objective - Mean}{Standard \; deviation}
If your objective is 20%, we calculate
\frac{20\% - 8\%}{16\%} = 0.75

     Table A1.1, Appendix 1 gives the probability for normal curves. Looking up the standard deviation of 0.75 gives 27.34%, a grouping of 54.68% (= 27.34% × 2) of the data. That leaves one half of the remaining data, or 22.66% (0.2266 = 0.5 - 0.2734, or 0.2266 = 1 - 0.7734), above the target of 20%.
Table A1.1, Appendix 1
When using the values from Table Va, subtract 0.5.

pr(X \geq 0.2) = pr(Z \geq \frac{(0.2-\mu)}{\sigma}) = 1- pr(Z \leq \frac{(0.2-\mu)}{\sigma}) 

pr(X \geq 0.2) = pr(Z \geq \frac{(0.2-0.08)}{0.16}) = 1- pr(Z \leq \frac{(0.2-0.08)}{0.16}) = 1- pr(Z \leq 0.75) = 1-0.7734 = 0.2266

By symmetry, the left tail area equals the right tail area.

# confidence (interval) = 0.95 ==> (1-confidence)/2 = 0.025 (right tail area) ==> 1-0.025 = 0.975
# P(z<Z?) = 0.975 ==> Z = 1.96

cdf: P(z<1.96) ==> 0.975, the probability (area) under the curve that corresponds to a particular z value (the standard deviation).
# left z  <== ppf(q, loc=0, scale=1) = ppf(prob, theta, sigma)     # Pr(Z<=z) = prob
# right z <== ppf(q, loc=0, scale=1) = ppf(1-prob, theta, sigma)   # Pr(Z>=z) = prob

from scipy import stats
# Example values: observed score, mean, and standard deviation
xbar, mu0, s = 67, 52, 16.3
# Calculating z-score
z = (xbar-mu0)/s # z = (67-52)/16.3
# Calculating probability under the curve
p_val = 1 - stats.norm.cdf(z)
print( "Prob. to score more than 67 is", round(p_val*100, 2), "%" )
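
The same calculation can be applied to the stock market example (mean 8%, standard deviation 16%, objective 20%). The sketch below uses scipy.stats.norm in place of the table lookup; the variable names are illustrative only.

```python
from scipy.stats import norm

mean_return = 0.08   # mean annual return from the text
std_return = 0.16    # one standard deviation from the text
objective = 0.20     # target return

# Rescale the objective into standard deviations from the mean
z = (objective - mean_return) / std_return

# Upper tail area: probability of a return at or above the objective
prob_reach = norm.sf(z)            # same as 1 - norm.cdf(z)
# Area inside the band of +/- z standard deviations
prob_inside = 1 - 2 * prob_reach

print('z =', round(z, 2))
print('P(return >= 20%) =', round(prob_reach, 4))
print('P(inside band)   =', round(prob_inside, 4))
```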


Calculating the Probability Automatically

     It is inconvenient to look up the probability values in a table when you are working with a spreadsheet or computer program, yet the probabilities are easier to understand than standard deviation values. You can calculate the area under the curve that corresponds to a particular z value (the standard deviation), using the following approximation. 

Let z′ = |z|, the absolute value of z. Then
Then the probability, P, that the returns will equal or exceed(>=) the expected return is
Using the example where the standard deviation z = 0.75, we perform the calculation
Substituting this value into the equation for P, we get
     Then there is a 22.7% probability that a value will exceed 0.75 standard deviations (that is, fall on one end of the distribution outside the value of 0.75). The chance of a value falling inside the band formed by ±0.75 standard deviations is 1 – (2 × 0.2266) = 0.5468, or 54.68%. That is the same value found in Table A1.1, Appendix 1. 
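
     For readers who want to try such a calculation, here is a minimal sketch of one well-known polynomial approximation to the normal tail area (Abramowitz and Stegun, formula 26.2.16). Treating it as the approximation intended above is an assumption, since the formula itself is not reproduced in these notes; the function name is hypothetical.

```python
import math

def upper_tail_probability(z):
    """Approximate P(Z >= |z|) for a standard normal variable.

    Assumption: uses the Abramowitz and Stegun 26.2.16 polynomial,
    which may differ from the formula the original text had in mind.
    """
    z_abs = abs(z)
    r = 1.0 / (1.0 + 0.33267 * z_abs)
    density = math.exp(-z_abs ** 2 / 2.0) / math.sqrt(2.0 * math.pi)
    poly = 0.4361836 * r - 0.1201676 * r ** 2 + 0.9372980 * r ** 3
    return density * poly

z = 0.75
p_approx = upper_tail_probability(z)
p_exact = 0.5 * math.erfc(z / math.sqrt(2.0))   # exact tail area, for comparison

print(round(p_approx, 4), round(p_exact, 4))
```

     The comparison against math.erfc is there only as a sanity check; in practice a library routine would normally be preferred over a hand-coded approximation.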

     For those using Excel, the answer can be found with the function normdist(p,mean, stdev,cumulative), where

  • p is the current price or value
  • mean is the mean of the series of p’s
  • stdev is the standard deviation of the series of p’s, and
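  • cumulative is “true” if you want the cumulative probability up to p (the area under the curve to the left of its z value), and “false” if you want the height of the density curve at p.
  • Then the result of normdist(35,20,5,true) is 0.99865, or a 99.9% probability, and if cumulative is “false” then the result is 0.000886.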
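
     A Python counterpart to the Excel call, sketched with scipy.stats.norm: cdf plays the role of the cumulative form and pdf the non-cumulative form. The mapping to normdist is an informal analogy rather than a documented equivalence.

```python
from scipy.stats import norm

p, mean, stdev = 35, 20, 5   # the values used in the normdist example

cumulative = norm.cdf(p, loc=mean, scale=stdev)   # like normdist(35,20,5,true)
density = norm.pdf(p, loc=mean, scale=stdev)      # like normdist(35,20,5,false)

print(round(cumulative, 5))
print(round(density, 6))
```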

Standard Error 

     Throughout the development and testing of a trading system, we want to know if the results we are seeing are as expected. The answer will always depend on the size of the data sample and the amount of variance that is typical of the data during this period. 

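     The standard error (SE) is a measure of the variability, or dispersion, of a statistic, such as a sample mean estimated from a sample of data. It is defined as the standard deviation of the sampling distribution of the statistic, that is, the distribution of all possible values of the statistic that could be obtained from samples of the same size drawn from the same population.

     The SE is commonly used to quantify the precision or accuracy of an estimate of a population parameter based on a sample statistic. Specifically, assuming the sampling process is repeated many times, it provides an estimate of the average distance between the sample statistic and the true population parameter.

The SE is calculated with the following formula:
SE = s / sqrt(n),
where s is the sample standard deviation and n is the sample size.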
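
     As a small illustration of SE = s / sqrt(n), the sketch below computes the standard error of the mean for a made-up sample of daily returns (the numbers are hypothetical).

```python
import numpy as np

# Hypothetical sample of daily returns, for illustration only
returns = np.array([0.012, -0.004, 0.007, -0.021, 0.003, 0.015, -0.009, 0.001])

n = returns.size
s = returns.std(ddof=1)            # sample standard deviation (n - 1 in the denominator)
standard_error = s / np.sqrt(n)    # SE = s / sqrt(n)

print('sample mean:', round(returns.mean(), 5))
print('standard error of the mean:', round(standard_error, 5))
```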

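     The SE is commonly used in hypothesis testing and confidence interval estimation. In hypothesis testing (with the sample standard deviation known), the SE is used to compute the test statistic and determine the p-value: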

p_val = stats.t.sf( np.abs(t_sample), n-1 )
print( "Lower tail p-value from t-table: ", p_val )

The p-value is the probability, assuming the null hypothesis is true, of observing a sample statistic at least as extreme as the one calculated from the sample (it is compared against the critical t value with n-1 degrees of freedom from the tables). In confidence interval estimation, the SE is used to construct an interval around the sample statistic that is likely to contain the true population parameter with a given level of confidence.

  • P-value (the area of the rejection region): the probability of obtaining a test statistic at least as extreme as the one actually observed, assuming that the null hypothesis is true. (Usually in modeling, against each independent variable, a p-value < 0.05 is considered significant and > 0.05 is considered insignificant; nonetheless, these values and definitions may change with respect to context.)

     One descriptive measure of error, called the standard error (SE), uses the variance, which gives the estimation of error based on the distribution of the data using multiple data samples. It is a test that determines how the sample means differ from the actual mean of all the data. It addresses the uniformity of the data.

  • It is important to note that the SE measures the precision of the estimate of the population parameter, not the uniformity of the data.
  • The uniformity of the data is usually assessed using other measures of dispersion, such as the range, interquartile range, or coefficient of variation.

SE = \sqrt{\frac{Var}{n}}

  • where Var = the variance of the sample means
  • n = the number of data points in the sample means

     Sample means refers to the data being sampled a number of times, each with n data points, and the means of these samples are used to find the variance. In most cases, we would use a single data series and calculate the variance as shown earlier in this chapter.

Test Statistic for Hypothesis Tests about a Population Mean (\mu_0 known): \sigma Unknown

Unknown: the population variance \sigma^2 or population standard deviation \sigma

Known: the sample standard deviation s
t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}   (9.2)

One-Tailed Test
  • Let us consider an example of a one-tailed test about a population mean for the \sigma unknown case. A business travel magazine wants to classify transatlantic gateway airports according to the mean rating for the population of business travelers. A rating scale with a low score of 0 and a high score of 10 will be used, and airports with a population mean rating greater than 7 will be designated as superior service airports. The magazine staff surveyed a sample of 60 business travelers at each airport to obtain the ratings data. The sample for London’s Heathrow airport provided a sample mean rating of \bar{x} = 7.25 and a sample standard deviation of s = 1.052. Do the data indicate that Heathrow should be designated as a superior service airport?

    We want to develop a hypothesis test for which the decision to reject H_0 will lead to the conclusion that the population mean rating for the Heathrow airport is greater than 7. Thus, an upper tail test with H_a: \mu > 7 is required. The null and alternative hypotheses for this upper tail test are as follows:
    H_0: \mu \leq 7
    H_a: \mu > 7

    We will use \alpha = .05 as the level of significance for the test.
         Using equation (9.2) with \bar{x} = 7.25, \mu_0 = 7, s = 1.052, and n = 60, the value of the test statistic is
    t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} = \frac{7.25 - 7}{1.052/\sqrt{60}} = 1.84

          The sampling distribution of t has n − 1 = 60 − 1 = 59 degrees of freedom. Because the test is an upper tail test, the p-value is P (t ≥ 1.84), that is, the upper tail area corresponding to the value of the test statistic.

         The t distribution table provided in most textbooks will not contain sufficient detail to determine the exact p-value, such as the p-value corresponding to t = 1.84. For instance, using Table 2 in Appendix B, the t distribution with 59 degrees of freedom provides the following information.

    We see that t = 1.84 is between 1.671 and 2.001. Although the table does not provide the exact p-value, the values in the Area in Upper Tail row show that the p-value must be less than .05 and greater than .025. With a level of significance of \alpha = .05, this placement is all we need to know to make the decision to reject the null hypothesis and conclude that Heathrow should be classified as a superior service airport.

    Because it is cumbersome to use a t table to compute p-values, and only approximate values are obtained, we show how to compute the exact p-value using Excel or Minitab. The directions can be found in Appendix F at the end of this text. Using Excel or Minitab with t = 1.84 provides the upper tail p-value of .0354 for the Heathrow airport hypothesis test. With .0354 < .05, we reject the null hypothesis and conclude that Heathrow should be classified as a superior service airport.
    from scipy import stats
    import numpy as np
    xbar = 7.25 # sample mean
    mu0 = 7     # hypothesized population mean
    s=1.052     # sample standard deviation
    n=60        # sample size 
    t_sample = (xbar - mu0) / ( s/np.sqrt(float(n)) )
    print("Test Statistic t: ", round(t_sample,2))
    alpha = 0.05
    # upper tail critical value: the point with area alpha to its right
    t_alpha = stats.t.ppf(1 - alpha, n-1)
    print( "Critical value from t-table: ", round(t_alpha, 3) )
    # upper tail p-value from t-table:
    p_val = stats.t.sf( np.abs(t_sample), n-1 )
    print( "Upper tail p-value from t-table: ", round(p_val,4) )

    The decision whether to reject the null hypothesis H_0 in the \sigma unknown case can also be made using the critical value approach. The critical value corresponding to an area of \alpha = .05 in the upper tail of a t distribution with 59 degrees of freedom is t_{.05} = 1.671. Thus the rejection rule using the critical value approach is to reject H_0 if t ≥ 1.671. Because t = 1.84 > 1.671, H_0 is rejected. Heathrow should be classified as a superior service airport.

TABLE 9.3 Summary of Hypothesis Tests about a Population Mean: \sigma Unknown Case

t Test Formula & Calculation | How to Find t Value with Examples - Video & Lesson Transcript | Study.com

t-Statistic and Degrees of Freedom

     When fewer prices or trades are used in a distribution, we can expect the shape of the curve to be more variable. For example, it may be spread out so that the peak of the distribution will be lower and the tails will be higher. A way of measuring how close the sample distribution of a smaller set is to the normal distribution (of a large sample of data) is to use the t-statistic (also called the student’s t-test, developed by W. S. Gosset). The t-test is calculated according to its degrees of freedom (df), which is n-1, where n is the sample size, the number of prices used in the distribution.

     The more data in the sample, the more reliable the results. We can get a broad view of the shape of the distribution by looking at a few values of t in Table 2.2, which gives the values of t corresponding to the upper tail areas of 0.10, 0.05, 0.025, 0.01, and 0.005. The table shows that as the sample size n increases, the values of t approach those of the standard normal values of the tail areas.

     The values of t that are needed to be significant can be found in Appendix 1, Table A1.2, “t-Distribution.” The column headed “0.10” gives the 90% confidence level, “0.05” is 95%, and “0.005” is 99.5%. For example, if our sample had 20 degrees of freedom (21 prices), and we wanted the probability of the upper tail to be 0.025, then the value of t would need to be 2.086. For smaller samples, the value of t would be larger in order to have the same confidence.
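
     The convergence of t toward the standard normal values can be checked without a printed table. The sketch below uses scipy.stats.t.ppf; the particular degrees of freedom shown are an arbitrary choice for illustration. The tabled value 2.086 corresponds to 20 degrees of freedom at an upper tail area of 0.025.

```python
from scipy.stats import norm, t

upper_tail_areas = [0.10, 0.05, 0.025, 0.01, 0.005]

# Critical t values for a few degrees of freedom (df = n - 1)
for df in (5, 10, 20, 60, 120):
    row = [round(t.ppf(1 - area, df), 3) for area in upper_tail_areas]
    print('df =', df, row)

# Standard normal values that the t values approach as df grows
print('normal ', [round(norm.ppf(1 - area), 3) for area in upper_tail_areas])
```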

     When testing a trading system, degrees of freedom can be the number of trades produced by the strategy. When you have few trades, the results may not represent what you will get using a longer trading history. When testing a strategy, you will find a similar relationship between the number of trades and the number of parameters, or variables, used in the strategy. The more variables used, the more trades are needed to create expectations with an acceptable confidence level.

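2-Sample t-Test

     You may want to compare two periods of data to decide whether the price patterns have changed significantly. Some analysts use this to eliminate inconsistent data, but the characteristics of price and economic data change as part of the evolving process, and systematic trading should be able to adapt to these changes. This test is best applied to trading results in order to decide if a strategy is performing consistently. This is done with a 2-sample t-test:

t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}

where \bar{x}_1 and \bar{x}_2 are the averages of the data in periods 1 and 2, s_1^2 and s_2^2 are their variances, n_1 and n_2 are the number of data items in each period, and the two periods being compared are mutually exclusive. The degrees of freedom, df, needed to find the confidence levels in Table A1.2 can be calculated using Satterthwaite’s approximation, where s is the standard deviation of the data values:

df = \frac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}{\frac{(s_1^2/n_1)^2}{n_1 - 1} + \frac{(s_2^2/n_2)^2}{n_2 - 1}}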

     When using the t-test to find the consistency of profits and losses generated by a trading system, replace the data items by the net returns of each trade, the number of data items by the number of trades, and calculate all other values using returns rather than prices. 
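
     A minimal sketch of the 2-sample t-test with unequal variances and Satterthwaite's degrees of freedom, applied to net returns per trade. The function name two_sample_t and the two lists of returns are hypothetical; the formulas are the standard Welch and Satterthwaite expressions, assumed here to match the ones the text refers to.

```python
import numpy as np

def two_sample_t(sample_1, sample_2):
    """Return the 2-sample t statistic and Satterthwaite degrees of freedom."""
    a = np.asarray(sample_1, dtype=float)
    b = np.asarray(sample_2, dtype=float)
    # Variance of each sample mean: s^2 / n (sample variance uses n - 1)
    va = a.var(ddof=1) / a.size
    vb = b.var(ddof=1) / b.size
    t_stat = (a.mean() - b.mean()) / np.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (a.size - 1) + vb ** 2 / (b.size - 1))
    return t_stat, df

# Hypothetical net returns per trade (in %) for two non-overlapping periods
period_1 = [1.0, 2.0, 3.0, 4.0, 5.0]
period_2 = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]

t_stat, df = two_sample_t(period_1, period_2)
print('t =', round(t_stat, 3), 'df =', round(df, 2))
```

     The resulting df is generally not an integer; when using a printed table it is common to round it down to the next whole number.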

Two samples come from the same population distribution

Pictured are

  • two distributions of data, X1 and X2, with unknown means and standard deviations.
  • The second panel shows the sampling distribution of the newly created random variable \bar{X}_1 - \bar{X}_2. This distribution is the theoretical distribution of many, many sample means from population 1 minus sample means from population 2.
  • The Central Limit Theorem tells us that this theoretical sampling distribution of differences in sample means is normally distributed, regardless of the distribution of the actual population data shown in the top panel.
  • Because the sampling distribution is normally distributed, we can develop a standardizing formula and calculate probabilities from the standard normal distribution in the bottom panel, the Z distribution.

Example 10.3

     一个有趣的研究问题是不同类型的教学形式对学生成绩的影响(如果有的话)。 为了调查这个问题,我们从混合班级中抽取了一个学生成绩样本,并从标准讲座形式的班级中抽取了另一个样本。 两堂课都是针对同一主题的。 35 名混合学生的平均课程成绩百分比为 74,标准差为 16标准讲座班的 40 名学生的平均成绩为 76%,标准差为 9以 5% 进行测试,看看是否存在 标准讲座课程和混合课程之间的总体平均成绩是否存在显着差异

     首先我们注意到我们有两组,来自混合班级的学生和来自标准讲座格式班级的学生。 我们还注意到,我们感兴趣的随机变量是学生的成绩,一个连续随机变量。 我们可以用不同的方式提出研究问题并有一个二元随机变量。 例如,我们可以研究成绩不及格或成绩为 A 的学生的百分比。 这两者都是二元的,因此是比例测试,而不是像这里的情况那样的手段测试。 最后,没有假设哪种格式可能会导致更高的分数,因此假设被表述为双尾检验

      As would virtually always be the case, we do not know the population variances of the two distributions and thus our test statistic is:

t_c = \frac{(\bar{x}_1 - \bar{x}_2) - \delta_0}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}} = \frac{(74 - 76) - 0}{\sqrt{\frac{16^2}{35} + \frac{9^2}{40}}} = -0.65

To determine the critical value of the Student's t we need the degrees of freedom. For this case we use: df = n1 + n2 - 2 = 35 + 40 - 2 = 73. This is large enough to consider it the normal distribution, thus t_{\alpha/2} = 1.96. Because |-0.65| = 0.65 < 1.96, the calculated value does not fall in the rejection region. As always, we determine if the calculated value is in the tail determined by the critical value. In this case we do not even need to look up the critical value: the difference between the two average grades is less than one standard error. Certainly not in the tail.

     Conclusion: Cannot reject the null at α=5%. Therefore, evidence does not exist to prove that the grades in hybrid and standard classes differ.
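The arithmetic of Example 10.3 can be checked directly from the summary statistics. This is a small sketch, not part of the original example; the variable names are mine, and the Satterthwaite degrees of freedom are computed only for comparison with the pooled value of 73.

```python
import math

# Summary statistics from Example 10.3
n1, mean1, sd1 = 35, 74.0, 16.0   # hybrid class
n2, mean2, sd2 = 40, 76.0, 9.0    # standard lecture class

se1, se2 = sd1 ** 2 / n1, sd2 ** 2 / n2
t_calc = (mean1 - mean2) / math.sqrt(se1 + se2)

df_pooled = n1 + n2 - 2
df_welch = (se1 + se2) ** 2 / (se1 ** 2 / (n1 - 1) + se2 ** 2 / (n2 - 1))

t_crit = 1.96                      # normal approximation, two-tailed, alpha = 5%
reject = abs(t_calc) > t_crit

print(f"t = {t_calc:.2f}, pooled df = {df_pooled}, Satterthwaite df = {df_welch:.1f}")
print("reject the null" if reject else "cannot reject the null")
```

The Satterthwaite value comes out near 52 rather than 73 because the two standard deviations are quite different, but the conclusion is unchanged: the calculated t is far from either tail.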

How to Calculate Degrees of Freedom for Any T-Test - Statology

anova - Is there ever a statistical reason NOT to use Satterthwaite's method to account for unequal variances? - Cross Validated

The Welch two-sample t test uses the Satterthwaite DF, which is often smaller than the DF= n1+n2−2 of the pooled 2-sample t test (never larger). This means that the power of the Welch 2-sample t test is somewhat smaller than the power of the pooled test, often not enough smaller to matter for practical purposes. But some statisticians do make an exception to standard practice when sample sizes are very small and sample standard deviations are similar.

Standardizing Risk And Return

     In order to compare one trading method with another, it is necessary to standardize both the tests and the measurements used for evaluation. If one system has total returns of 50% and the other 250%, we cannot decide which is best unless we know the duration of the test and the volatility of the returns, or risk.

  • If the 50% return was over 1 year and the 250% return over 10 years, then the first one is best.
  • Similarly, if the first return had an annualized risk of 10% and the second a risk of 50%, then both would be equivalent.
    50%/10%=5 VS 250%/50%=5

The return relative to the risk is crucial to performance as will be discussed in Chapter 21, System Testing. For now it is only important that returns and risk be annualized or standardized to make comparisons valid.

annualized simple return and annualized log return

     The calculations for both 1-period returns and annualized returns will be an essential part of all performance evaluations. In its simplest form, the 1-period rate of return, R, or the holding period rate of return is often given as
R = \frac{Ending\ value - Starting\ value}{Starting\ value}
For the stock market, which has continuous prices, this can be written
\small R_t = \frac{P_t - P_{t-1}}{P_{t-1}} = \frac{P_t}{P_{t-1}} - 1
(In pandas, the pct_change() method gives this percentage change over the prior period's values.)

where P_{t-1} is the initial price and P_t is the price after one period has elapsed. The securities industry often prefers a different calculation, 

\small r_t = \ln(\frac{P_t}{P_{t-1}}) = \ln(P_t) - \ln(P_{t-1}) 
Note: check which logarithmic base is being used. The default is e, so that r_t = \ln(\frac{P_t}{P_{t-1}}).


     Both methods have advantages and disadvantages. Neither one is the “correct” calculation. Note that in some software, the function log is actually the natural log, and log_{10} is the log base 10. It is best to always check the definitions. In order to distinguish the two calculations, the first method will be called the standard method and the second the ln method.
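The two return calculations can be compared side by side. The following is a minimal sketch with a hypothetical list of closing prices; it shows that ln returns add up across periods while standard returns must be compounded.

```python
import math

prices = [100.0, 102.0, 99.0, 101.5, 104.0]   # hypothetical closing prices

# Standard method: R_t = P_t / P_{t-1} - 1
standard = [p1 / p0 - 1 for p0, p1 in zip(prices, prices[1:])]
# ln method: r_t = ln(P_t / P_{t-1})
ln_rets = [math.log(p1 / p0) for p0, p1 in zip(prices, prices[1:])]

total_standard = prices[-1] / prices[0] - 1    # total return over the whole period
total_ln = sum(ln_rets)                        # ln returns are additive

for R, r in zip(standard, ln_rets):
    print(f"standard {R:+.5f}   ln {r:+.5f}")
print(f"total standard {total_standard:.5f}, total ln {total_ln:.5f}")
```

Each ln return is slightly smaller than the corresponding standard return, and exponentiating the summed ln returns recovers the total standard return plus 1.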

     In the following spreadsheet example, shown in Table 2.3 over 22 days, the standard returns are in column D and the ln returns in column E. The differences seem small, but the averages are 0.00350 and 0.00339. The standard returns are better by 3.3% over only one trading month. At this rate, the standard method would have yielded returns that were nearly 40% (3.3% × 12) higher after one year. The Net Asset Value (NAV), used extensively throughout this book, compounds the periodic returns, and most often has a starting value of 100:
NAV_0 = 100
NAV_t = NAV_{t-1} \times (1+R_t)
For NAV using LN, the ln returns are added rather than compounded:
NAV_0 = 100
NAV_t = NAV_{t-1} + NAV_0 \times \ln(1+R_t) = NAV_{t-1} + NAV_0 \times r_t

Summing from the start gives the cumulative ln return through period T and the corresponding NAV:
r_{cum,T} = \sum_{t=1}^{T} r_t
NAV_T = NAV_0 \times (1 + r_{cum,T}) = NAV_0 \times \left(1 + \sum_{t=1}^{T} r_t\right)
Note: \ln\left((1+R_1)(1+R_2)\cdots(1+R_T)\right) = \sum_{t=1}^{T} \ln(1+R_t) = \sum_{t=1}^{T} r_t

TABLE 2.3 Calculation of Returns and NAVs from Daily Profits and Losses

Magic of Log Returns: Practical – Part 2
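Since the table itself is not reproduced here, the column logic can be sketched in a few lines of Python. The daily profits and losses below are hypothetical, and the function name `nav_series` is my own; the ln version follows the additive form NAV_t = NAV_{t-1} + NAV_0 × ln(1+R_t) used in these notes.

```python
import math

def nav_series(daily_pl, start_equity=100_000.0, nav0=100.0):
    """Build NAV series from daily P&L using standard and ln returns."""
    equity = start_equity
    nav_std, nav_ln = [nav0], [nav0]
    for pl in daily_pl:
        R = pl / equity                 # standard return on yesterday's account value
        r = math.log(1 + R)             # ln return
        equity += pl                    # cumulative account value
        nav_std.append(nav_std[-1] * (1 + R))    # compounded NAV
        nav_ln.append(nav_ln[-1] + nav0 * r)     # additive NAV using ln returns
    return nav_std, nav_ln

daily_pl = [500.0, -250.0, 1200.0, -800.0, 300.0]   # hypothetical daily P&L in dollars
nav_std, nav_ln = nav_series(daily_pl)
print(f"final NAV (standard) = {nav_std[-1]:.4f}")
print(f"final NAV (ln)       = {nav_ln[-1]:.4f}")
```

The compounded NAV tracks the account value exactly (100 × ending equity / starting equity), while the ln version ends slightly lower, which is the difference the table illustrates.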

Compound interest

      Compound interest is interest calculated on the initial principal and also on the accumulated interest of previous periods of a deposit or loan. The effect of compound interest depends on frequency.

     Assume an annual interest rate of 12%. If we start the year with $100 and compound only once, at the end of the year, the principal grows to $112 (P∗(1+r∗t)=100∗(1+0.12∗1)= $112). Interest applied only to the principal is referred to as simple interest.

  If we instead compound each month at 1% (the monthly-compounded yield, R = 12%/12 = 1%), we end up with more than $112 at the end of the year (T = 1 year). That is, 100 × (1 + 0.01)^{12} = $112.68. (It's higher because we compounded more frequently.)

  • The future value (FV) of an investment of present value (PV or P) dollars earning interest at an annual rate of r compounded n times per year for a period of T years is FV = P \times (1 + \frac{r}{n})^{n \times T}
  • P: the principal, the amount invested, or present value (PV)
  • r: the annual rate (in decimal form)
  • n: the number of times it is compounded per year, or compounding frequency
  • \frac{r}{n}: the interest per compounding period
  • T: the number of years
  • t = n∗T: the number of compounding periods

    If we know that the monthly-compounded yield is 1% and the value is $112.68 after one year, we can calculate the present value: PV = \frac{FV}{(1 + \frac{r}{n})^{n \times T}} = \frac{112.68}{(1 + 0.01)^{12}} = $100
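The compound interest relationship, FV = P × (1 + r/n)^(n×T), and its inverse can be written as two small functions. This is only an illustrative sketch; the function names are mine.

```python
def future_value(pv, r, n, years):
    """Future value of pv at annual rate r, compounded n times per year."""
    return pv * (1 + r / n) ** (n * years)

def present_value(fv, r, n, years):
    """Present value that grows to fv at annual rate r, compounded n times per year."""
    return fv / (1 + r / n) ** (n * years)

annual = future_value(100, 0.12, 1, 1)     # compounded once a year
monthly = future_value(100, 0.12, 12, 1)   # compounded monthly at 1%
print(f"annual: {annual:.2f}, monthly: {monthly:.2f}")
print(f"present value of the monthly result: {present_value(monthly, 0.12, 12, 1):.2f}")
```

Compounding once reproduces the $112 of the simple-interest case, and compounding monthly gives the $112.68 quoted in the text.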


Annualizing Returns

     In most cases, it is best to standardize the returns by annualizing. This is particularly helpful when comparing two sets of test results, where each covers a different time period. When annualizing, it is important to know that

  • Government instruments use a 360-day rate (based on 90-day quarters).
  • A 365-day rate is common for most other data that can change daily.
  • Trading returns are best with 252 days, which is the typical number of days in a trading year for the United States (262 days for Europe).

    #########rolling window https://blog.csdn.net/Linli522362242/article/details/122955700 

         There are exactly 252 trading days in 2021. January and February have the fewest (19), and March the most (23), with an average of 21 per month, or 63 per quarter.

         Out of a possible 365 days, 104 days (52 weeks × 2) are weekend days (Saturday and Sunday) when the stock exchanges are closed. Seven of the nine holidays that close the exchanges fall on weekdays, and the other two are observed on weekdays: Independence Day on Monday, July 5, and Christmas on Friday, December 24. There is one shortened trading session on Friday, November 26 (the day after Thanksgiving Day), which still counts as a trading day. That leaves 365 - 104 - 7 - 2 = 252 trading days.

     The following formulas use 252 days, which will be the standard throughout this book except for certain interest rate calculations; however, 365 or 360 may be substituted, or even 260 for trading days in other parts of the world.

     The Annualized Rate Of Return (AROR) on a simple-interest basis for an investment over n days is 

AROR_{simple} = \frac{E_n - E_0}{E_0} \times \frac{252}{n}

     where E_0 is the starting equity or account balance, E_n is the equity at the end of the nth period, and 252/n is the annualization factor (n/252 is the number of years expressed as a decimal).

When the 1-period returns are calculated using the standard method, then the annualized compounded rate of return is 

AROR_{compounded} = \left(\frac{E_n}{E_0}\right)^{252/n}

     Note that AROR or R (capital) refers to the annualized rate of return while r is the daily or 1-period return. Also, the form of the results is different for the two calculations. An increase of 25% for the simple return will show as 0.25 while the same increase using the compounded returns will be 1.25.

     When the 1-period returns use the LN method, then the annualized rate of return is the sum of the returns divided by the number of years

AROR_{ln} = \frac{\sum_{i=1}^{n} r_i}{n/252} = \frac{252}{n}\sum_{i=1}^{n} r_i, where r_i = \ln(1 + R_i) = \ln(\frac{P_i}{P_{i-1}})
Average per period, ln method: \bar{r} = \frac{\sum_{i=1}^{n} r_i}{n}
Annualized rate of return implied by the daily ln returns: e^{\frac{252}{n}\sum_{i=1}^{n}\ln(1+R_i)} - 1

Note: e^{r_t} = 1 + R_t, so the k-period ln return is the sum of the 1-period ln returns:

r_t(k) = \ln\left(\prod_{j=0}^{k-1} (1 + R_{t-j})\right) = \ln\left(\prod_{j=0}^{k-1} e^{r_{t-j}}\right)   (using \log(ab) = \log(a) + \log(b))
r_t(k) = \sum_{j=0}^{k-1} \ln(e^{r_{t-j}})
r_t(k) = \sum_{j=0}^{k-1} r_{t-j}
Over N periods the total is \sum_{i=1}^{N} r_i, and dividing by N gives the average: r_{mean} = \frac{\sum_{i=1}^{N} r_i}{N}

     An example of this can be found in column F, the row labeled AROR, in the previous spreadsheet. Note that the annualized returns using the ln method are much lower than those using division and compounding. The compounded method will be used throughout this book.
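The three annualization methods can be compared on one hypothetical case. The sketch below assumes the simple form (E_n - E_0)/E_0 × 252/n and the compounded form (E_n/E_0)^(252/n), which match the note above that a 25% gain shows as 0.25 in the first and 1.25 in the second; the function names and the account values are mine.

```python
import math

def aror_simple(e0, en, n, days=252):
    """Simple-interest annualized return, e.g. 0.25 for +25%."""
    return (en - e0) / e0 * days / n

def aror_compounded(e0, en, n, days=252):
    """Compounded annualized return as a growth factor, e.g. 1.25 for +25%."""
    return (en / e0) ** (days / n)

def aror_ln(ln_returns, days=252):
    """Sum of ln returns divided by the number of years."""
    return sum(ln_returns) * days / len(ln_returns)

# Hypothetical account: $100,000 grows to $110,000 in 126 trading days (half a year)
e0, en, n = 100_000.0, 110_000.0, 126
ln_returns = [math.log(en / e0) / n] * n    # constant daily ln return, for illustration

print(f"simple     = {aror_simple(e0, en, n):.4f}")
print(f"compounded = {aror_compounded(e0, en, n):.4f}")
print(f"ln         = {aror_ln(ln_returns):.4f}")
```

Here the simple method gives 0.20, the compounded method 1.21, and the ln method about 0.19, which is the pattern described in the text: the ln figure is the lowest of the three.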

Probability of Returns

     The use of the standard deviation and compounded rate of return are combined to find the probability of a return objective. In the following calculation, the arithmetic mean of continuous returns is ln(1+R_g), and it is assumed that the returns are normally distributed.

z = \frac{\ln(\frac{T}{B}) - n \times \ln(1+R_g)}{s \sqrt{n}}

  • z = standardized variable
  • T = target value or rate-of-return objective
  • B = beginning investment value
  • \ln(\frac{T}{B}) = target log return over the n periods; per period it is \ln((\frac{T}{B})^{1/n}) = \frac{\ln(\frac{T}{B})}{n}
  • R_g = geometric average of periodic returns. Compounding k one-period returns gives

    1+R_t(k) = (1+R_{t})(1+R_{t-1})...(1+R_{t-k+1})
    \ln(1+R_t(k)) = \ln((1+R_{t})(1+R_{t-1})...(1+R_{t-k+1})), and dividing by the k periods gives the average: \frac{\ln(1+R_t(k))}{k}
    so over n periods, \ln(1+R_g) = \frac{\ln((1+R_{n})(1+R_{n-1})...(1+R_{1}))}{n}

  • n = number of periods (from B to T)
  • s = standard deviation of the logarithms of the quantities 1 plus the periodic returns 
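As a rough illustration of how these quantities fit together, the sketch below assumes the standardized variable takes the form z = (ln(T/B) - n × ln(1+R_g)) / (s × sqrt(n)) and that the probability of ending at or above the target is the area to the right of z under the normal curve. The input values and function names are hypothetical.

```python
import math

def normal_cdf(x):
    """Cumulative standard normal distribution."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def prob_of_reaching(target, begin, r_g, s, n):
    """Return (z, probability that the value after n periods is at least target)."""
    z = (math.log(target / begin) - n * math.log(1 + r_g)) / (s * math.sqrt(n))
    return z, 1 - normal_cdf(z)

# Hypothetical inputs: $100,000 start, $110,000 objective after 252 days,
# geometric average daily return of 0.05%, daily standard deviation of ln returns of 1%
z, p = prob_of_reaching(110_000, 100_000, 0.0005, 0.01, 252)
print(f"z = {z:.3f}, probability of reaching the objective = {p:.1%}")
```

When the target equals the value implied by compounding R_g for n periods, z is zero and the probability is 50%; a more modest target raises it, a more ambitious one lowers it.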

Risk and Volatility 

     While we would always like to think about returns, it is even more important to be able to assess risk. With that in mind, there are two extreme risks. The first is event risk, which takes the form of an unpredictable price shock. The worst of these is catastrophic risk, which will cause fatal losses or ruin. The second risk is self-induced by overleverage, or gearing up your portfolio, until a sequence of bad trades causes ruin. The risk of price shocks and leverage will both be discussed in detail later in other chapters.

     The standard risk measurement is useful for comparing the performance of two systems. It is commonly applied to the returns of a single stock or an entire portfolio compared to a benchmark, such as the returns of the S&P 500 or a bond fund. The most common estimate of risk is the standard deviation, σ, of returns, r, shown earlier in this chapter. For most discussions of risk, the standard deviation will also be called volatility. When we refer to the target volatility of a portfolio, we mean the percentage of risk represented by 1 standard deviation of the returns, annualized. For example, in the previous spreadsheet, columns D and E show the daily returns. The standard deviations of those returns are shown in the same columns in the row “Std Dev” as 0.01502 and 0.01495. Looking only at column D, 1 standard deviation of 0.01502 means that there is a 68.26% chance of a daily profit or loss less than 1.502%. However, target volatility always refers to annualized risk, and to change a daily volatility to an annualized one we simply multiply by \sqrt{252}. Then the daily standard deviation of returns of 1.502% becomes an annualized volatility of 23.8%, also shown at the bottom of the spreadsheet example. Because we only care about the downside risk, there is a 15.86% chance that we could lose greater than 23.84% in one year. The greater the standard deviation of returns, the greater the risk.

import numpy as np
import scipy.stats as stats

mu = 0.00350     # mean of the daily returns, np.mean(daily_returns)
sigma = 0.01502  # standard deviation of the daily returns, np.std(daily_returns)

# Probability that a daily return falls within 1 standard deviation of the mean
stats.norm.cdf(mu+sigma, mu, sigma) - stats.norm.cdf(mu-sigma, mu, sigma)
# about 0.6827, the roughly 68% chance of a daily move within 1.502% of the mean

Annualized Average Return = (1 + Daily Average Return) ^ 252 - 1
= (1 + 0.00350) ^ 252 - 1
= 1.4120, or about 141.2%
Annualized Volatility = Daily Standard Deviation * sqrt(252) * 100%
= 0.01502 * sqrt(252) * 100%
= 0.2384 or 23.84%

annual_mean = mu * 252               # arithmetic annualization of the daily mean
annual_std = sigma * np.sqrt(252)    # 0.2384, the annualized volatility

# Left-tail area: probability of a yearly result more than
# 1 standard deviation below the annual mean
stats.norm.cdf(annual_mean-annual_std, annual_mean, annual_std)
# about 0.1587


     Beta (β) is commonly used in the securities industry to express the relationship of a single market to a portfolio or index. If beta is zero then there is no relationship; if it is positive then the single series tends to move with the index, both above and below. As beta gets larger the volatility of the single market tends to be increasingly greater than the index. Specifically, 

  • 0 < β < 1, the volatility of the single market is less than the index
  • β = 1, the volatility of the single market is the same as the index
  • β > 1, the volatility of the single market is greater than the index

     A negative beta is similar to a negative correlation, where the moves of the market are generally opposite to the index.

     Beta is found by calculating the linear regression of the single market with the index. It is the slope of the regression line, that is, the change in the single market relative to the change in the index. Alpha, the added value, is the y-intercept of the solution. The values can be found using Excel, and are discussed in detail in Chapter 6. A general formula for beta is

\beta = \frac{cov(A, B)}{var(B)}

where A is the single market and B is a portfolio or index.
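For readers who prefer code to a spreadsheet, here is a minimal sketch of the regression, assuming beta is computed as the covariance of the two return series divided by the variance of the index returns, and alpha as the intercept. The return series are made up so that the expected answer is known in advance.

```python
def beta_alpha(market, index):
    """Least-squares slope (beta) and intercept (alpha) of market returns on index returns."""
    n = len(market)
    mean_a = sum(market) / n
    mean_b = sum(index) / n
    cov_ab = sum((a - mean_a) * (b - mean_b) for a, b in zip(market, index)) / (n - 1)
    var_b = sum((b - mean_b) ** 2 for b in index) / (n - 1)
    beta = cov_ab / var_b
    alpha = mean_a - beta * mean_b
    return beta, alpha

# Hypothetical returns: the single market moves 1.5 times the index plus 0.1% per period
index_returns = [0.010, -0.020, 0.015, 0.005, -0.010]
market_returns = [0.001 + 1.5 * b for b in index_returns]

beta, alpha = beta_alpha(market_returns, index_returns)
print(f"beta = {beta:.3f}, alpha = {alpha:.4f}")
```

With real data the points will not lie exactly on a line, so beta and alpha are estimates and should be read together with the correlation between the two series.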

Adjusting to the Target Volatility 

annual_std =0.12
stats.norm.cdf(annual_mean-annual_std, annual_mean, annual_std)


     If we have a target volatility of 12%, that is, we are willing to accept a 16% chance of losing 12% in one year, but the actual returns show an annualized volatility of 23.8% based on an investment of $100,000, then to correct to a 12% target we simply increase the investment by a factor of 23.8/12.0 = 1.98, to $198,000, while holding the same position size. The dollar profits and losses are unchanged, but as a percentage of the larger account they are 1.98 times smaller, so the volatility of returns drops to 12%. This is essentially deleveraging your positions by trading a smaller percentage of your account value. (Deleveraging usually means reducing the borrowed funds or leverage in a portfolio; here it means adjusting the investment size to reach the desired level of volatility.) Alternatively, we can reduce the position size by dividing by 1.98 and keeping the investment the same, which reduces the exposure to the underlying assets and scales the potential gains and losses down in proportion.

  • A target volatility of 12% means that you are willing to accept that level of risk in your investment portfolio. If the actual returns show an annualized volatility of 23.8%, the portfolio is more volatile than you are comfortable with.
  • Because volatility is measured here on returns as a percentage of the account value, holding the same positions against a larger account lowers it in proportion: the dollar profits and losses are divided by $198,000 rather than $100,000, so 23.8% becomes 23.8%/1.98 = 12%. Reducing every position by a factor of 1.98 while keeping the investment at $100,000 has the same effect on the percentage returns.
  • Scaling is not the only way to lower volatility. Depending on your specific portfolio and investment strategy, you could also shift from higher-volatility to lower-volatility assets, diversify across asset classes and strategies, or use risk management techniques such as hedging or stop-loss orders to limit the downside risk.
  • Overall, the statement that increasing the investment by a factor of 1.98 to $198,000 achieves the 12% target volatility is accurate. The term "deleveraging" is used loosely, since no borrowed funds need be involved.

     All results in this book will be shown at the target volatility of 12% unless otherwise stated. This is considered a modest risk level, which can be as high as 18% for some hedge funds. It will allow you to compare various systems and test results and see them at a level of risk that is most likely to represent targeted trading results.
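The adjustment to a target volatility can be sketched as follows. The daily profits and losses are hypothetical, returns are taken as P&L divided by a fixed investment amount to keep the scaling exact, and the function names are mine.

```python
import math
import statistics

def annualized_vol(daily_returns, days=252):
    """Annualized volatility: daily standard deviation times sqrt(days)."""
    return statistics.stdev(daily_returns) * math.sqrt(days)

def scale_to_target(investment, actual_vol, target_vol=0.12):
    """Return (scaling factor, investment needed to bring actual_vol down to target_vol)."""
    factor = actual_vol / target_vol
    return factor, investment * factor

# Hypothetical daily P&L in dollars, produced by a fixed set of positions
daily_pl = [1500.0, -2100.0, 800.0, -400.0, 1900.0, -1200.0]

investment = 100_000.0
vol_before = annualized_vol([pl / investment for pl in daily_pl])
factor, new_investment = scale_to_target(investment, vol_before)
vol_after = annualized_vol([pl / new_investment for pl in daily_pl])

print(f"volatility before: {vol_before:.1%}, factor: {factor:.2f}")
print(f"new investment: ${new_investment:,.0f}, volatility after: {vol_after:.1%}")

# The numbers from the text: 23.8% actual volatility against a 12% target
print(scale_to_target(100_000, 0.238))
```

The same dollar P&L measured against the larger investment produces exactly the target volatility, which is the sense in which holding the positions unchanged and adding capital lowers risk.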

Annualizing Daily and Monthly Returns

     The previous examples used daily data and an annualization factor of \sqrt{252}. For monthly data, which is most common in published performance tables, we would take the standard deviation of the monthly returns and multiply by \sqrt{12}. In general, annualizing volatility can be done by multiplying by the square root of the number of data items in a year. Then we use 252 for daily data, 12 for monthly, 4 for quarterly, and so on.

  • Annualized Daily Return = (1 + Daily Return) ^ 252 - 1
    where 252 represents the number of trading days in a year
    • The daily expected return formula is:
           Expected Daily Return = (Sum of Daily Returns) / (Number of Observations)

      For example, if a stock has daily returns of 0.2%, -0.1%, 0.3%, -0.4%, and 0.1% over a five-day period, the expected daily return would be:

      (0.2% - 0.1% + 0.3% - 0.4% + 0.1%) / 5 = 0.02%

    • For example, if a stock has a daily return of 0.2%, the annualized daily return would be:

      Annualized Daily Return = (1 + 0.2%) ^ 252 - 1 = 65.45%
      This means that if the stock continued to return 0.2% every trading day for a year, the compounded annual return would be approximately 65.45%.

    • The daily expected return gives you an idea of the average return you can expect to earn on a daily basis, while the annualized daily return gives you an idea of the average return you can expect to earn over a one-year period, after compounding the returns.

  • For weekly data, the annualization factor is typically 52 (which assumes 52 weeks in a year).
    Annualized Weekly Return = (1 + Weekly Return) ^ 52 - 1
  • Annualized Monthly Return = (1 + Monthly Return) ^ 12 - 1
    where 12 represents the number of months in a year
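The two kinds of annualization, compounding for returns and the square-root rule for volatility, can be put in one small helper. This is an illustrative sketch; the dictionary and function names are mine.

```python
import math

# Number of periods in a year for each data frequency
PERIODS_PER_YEAR = {"daily": 252, "weekly": 52, "monthly": 12, "quarterly": 4}

def annualize_return(periodic_return, freq):
    """Compound a per-period return over one year."""
    return (1 + periodic_return) ** PERIODS_PER_YEAR[freq] - 1

def annualize_volatility(periodic_std, freq):
    """Scale a per-period standard deviation by the square root of periods per year."""
    return periodic_std * math.sqrt(PERIODS_PER_YEAR[freq])

print(f"0.2% daily return   -> {annualize_return(0.002, 'daily'):.2%} per year")
print(f"1% monthly return   -> {annualize_return(0.01, 'monthly'):.2%} per year")
print(f"1.502% daily stdev  -> {annualize_volatility(0.01502, 'daily'):.2%} per year")
```

Returns grow with the number of periods through compounding, while volatility grows only with the square root, which is why the two use different formulas.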

Monthly Data Always Appears Less Volatile

     Although monthly performance results are common in financial disclosure documents, this convention works to the advantage of the person publishing the performance. It is unlikely that the highest or lowest daily net asset value (NAV) will occur on the last day of the month, so the extremes will rarely be seen, and the performance statistics will appear smoother than when using daily returns. Those responsible for due diligence before investing in a new product will often require daily return data in order to avoid overlooking a large drawdown that occurred mid-month.

     Using the S&P index as an example, the annualized volatility of the daily returns from 1990 through 2010 was 18.6%, but based on monthly returns it was only 15.3%. The risk based on monthly returns is 17.7% lower, but the annualized rate of return would be the same because it only uses the beginning and ending values.

Downside Risk

     Because the standard deviation is symmetric, a series of jumps in profits will be interpreted as larger risk. Some analysts believe that it is more accurate to measure the risk as limited to only the downside returns or drawdowns. The use of only losses is called lower partial moments, where lower refers to the downside risk, and partial means that only one side of the return distribution is used. The easiest way to see this is semi-variance, which measures the dispersion that falls below the mean, \bar{R}, or some target value:

Semivariance = \frac{1}{n} \sum_{R_i < \bar{R}} (R_i - \bar{R})^2, where n is the number of returns below \bar{R}

  • The semivariance formula can be used to measure a portfolio's downside risk.
  • Semivariance only considers observations that are below the mean of a data set.
  • Suppose you are analyzing the daily returns of a stock over the last month. The daily returns are as follows:

    Day 1: -1%
    Day 2: 2%
    Day 3: -3%
    Day 4: -1%
    Day 5: 0%
    Day 6: 1%
    Day 7: -2%
    Day 8: -1%
    Day 9: 0%
    Day 10: 1%
    Day 11: -2%
    Day 12: -1%
    Day 13: -3%
    Day 14: 0%

    You want to calculate the semivariance of the returns that fall below the mean. To do this, you first need to calculate the average return for the entire period:

    Average Daily Return = (Sum of Daily Returns) / (Number of Observations)
    = (-1% + 2% - 3% - 1% + 0% + 1% - 2% - 1% + 0% + 1% - 2% - 1% - 3% + 0%) / 14
    = -10% / 14 = -0.714%

    import numpy as np

    # Daily returns for days 1 to 14, as decimals
    daily_return = np.array([-0.01, 0.02, -0.03, -0.01,
                              0.00, 0.01, -0.02, -0.01,
                              0.00, 0.01, -0.02, -0.01,
                             -0.03, 0.00])

    Next, you need to find the observations where the returns are below -0.714% and average the squared deviations of those returns from the average return:

    def semi_var(ser):
        average = np.nanmean(ser)        # mean of all returns, ignoring NaNs
        r_below = ser[ser < average]     # keep only the returns below the mean
        return 1/len(r_below) * np.sum( (average - r_below)**2 )

    # semi_var(daily_return) is about 0.000176 (8 of the 14 returns are below the mean)

     However, the most common calculation for system performance is to take the daily drawdowns, that is, the net loss on each day that the total equity is below the peak equity. For example, if the system returns had produced an equity of $25,000 on day t, followed by a daily loss of $500, and another loss of $250, then we would have two values as input, 500/25000 = 0.02 and 750/25000 = 0.03. Only those net returns below the most recent peaks are used in the semivariance calculation. Alternatively, you could just take the standard deviation of these daily drawdowns to find the probable size of the drawdowns.
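The drawdown inputs described above can be generated from an equity series as in the sketch below. The equity values reproduce the $25,000 example, with one earlier and one later day added for illustration; taking the root mean square of the drawdowns as the downside measure is my assumption about how they would enter the calculation.

```python
import math

def daily_drawdowns(equity):
    """Fractional loss from the most recent peak, for each day the equity is below that peak."""
    peak = equity[0]
    drawdowns = []
    for value in equity:
        peak = max(peak, value)          # update the running peak
        if value < peak:
            drawdowns.append((peak - value) / peak)
    return drawdowns

# Peak of $25,000, then losses of $500 and $250, then a new high
equity = [24_000.0, 25_000.0, 24_500.0, 24_250.0, 25_300.0]

dds = daily_drawdowns(equity)
# Assumed downside measure: root mean square of the drawdowns
downside_dev = math.sqrt(sum(d ** 2 for d in dds) / len(dds))

print([round(d, 4) for d in dds])
print(f"downside deviation of drawdowns = {downside_dev:.4f}")
```

Days at a new equity high contribute nothing, so a system that spends most of its time making new highs will have few inputs, which is the data limitation discussed in the next paragraph.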

     One concern about using only the drawdowns to predict other drawdowns is that it limits the number of cases and discards the likelihood that higher-than-normal profits can be related to higher overall risk. In situations where there are limited amounts of test data, using both the gains and losses will give more robust results. When there is a large amount of data, the use of drawdowns can be a very good measurement.

     A full discussion of performance measurements can be found in Chapter 21, System Testing, and also in Chapter 23 under the headings “Measuring Return and Risk” and “Ulcer Index.”

The Index

     The purpose of an average is to transform individuality into classification. In doing that, the data is often smoothed, and useful information is gained. Indexes have attracted enormous popularity in recent years. Where there was only the Value Line and S&P 500 trading as futures markets in the early 1980s, now there are stock index futures contracts representing the markets of every industrialized country. The creation of trusts, such as SPDRs (called “Spyders,” the S&P 500, ^GSPC), Diamonds (DIA, the Dow Jones Industrials, ^DJI), and Qs (QQQ, the NASDAQ 100, ^IXIC) have given stock traders a familiar vehicle to invest in the broad market rather than pick individual shares. Industrial sectors, such as pharmaceuticals, health care, and technology, first appeared as mutual funds, then as ETFs, and now can also be traded as futures. These index markets all have the additional advantage of not being constrained by having to borrow shares in order to sell short, or by the uptick rule (if it is reinstated) requiring all short sales to be initiated on an uptick in price.

     Index markets allow both individual and institutional participants a number of specialized investment strategies. They can buy or sell the broad market, they can switch from one sector to another (sector rotation), or they can sell an overpriced sector while buying the broad market index (statistical arbitrage). Institutions find it very desirable, from the view of both costs and taxes, to temporarily hedge their cash stock portfolio by selling S&P 500 futures rather than liquidating stock positions. They may also hedge using options on the S&P futures or SPYs. An index simplifies the decision-making process for trading. If an index does not exist, one can be constructed to satisfy most purposes.

     The index holds an important role as a benchmark for performance. Most investors believe that a trading program is only attractive if it has a better return-to-risk ratio than a portfolio of 60% stocks (as represented by the S&P 500 index), and 40% bonds (the Lehman Brothers Treasury Index). Beating the index is called creating alpha, proving that you’re smarter than the market.

Constructing an Index

     An index is a standardized way of expressing price movement, normally an accumulation of percentage changes. Most indexes have a starting value of 100 on a specific date. The selection of the base year is often “convenient” but can be chosen as a period of price stability. The base year for U.S. productivity and for unemployment is 1982; for consumer confidence, 1985; and for the composite of leading indicators, 1987. The CRB Yearbook shows the Producer Price Index (PPI) from as far back as 1913. For example, the PPI, which is released monthly, had a value of 186.8 in October 2010 and 185.1 in September 2010, an increase of 0.9184% = (186.8/185.1 - 1) × 100% in one month. An index value less than 100 means that the index has less value than when it started.

Each index value is calculated from the previous value as:
Index_t = Index_{t-1} \times \frac{P_t}{P_{t-1}}
and the 1-period returns are calculated in the same way as shown previously in this chapter.
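A minimal sketch of index construction, assuming each index value is the previous value multiplied by the ratio of the current price to the previous price, with a base of 100. The function name is mine; the two PPI values are the ones quoted above.

```python
def build_index(prices, base=100.0):
    """Convert a price series into an index that starts at base."""
    index = [base]
    for prev, cur in zip(prices, prices[1:]):
        index.append(index[-1] * cur / prev)   # accumulate the percentage change
    return index

ppi = [185.1, 186.8]    # September and October 2010
ppi_index = build_index(ppi)
print(f"index: {ppi_index[0]:.2f} -> {ppi_index[-1]:.4f}")

# A hypothetical series that falls below its starting price ends below 100
falling = build_index([50.0, 48.0, 45.0])
print([round(v, 2) for v in falling])
```

Because the ratios telescope, the final index value always equals the base times the last price divided by the first, so the index carries the same percentage changes as the original series.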

Calculating the Net Asset Value—Indexing Returns 

     The last calculations shown in the spreadsheet, Table 2.3, are the Net Asset Value (NAV), calculated two ways. This is essentially the returns converted to an index, showing the compounded rate of return based on daily profit and losses relative to a starting investment. In the spreadsheet, this is shown in column F using standard returns and G using LN returns.

The process of calculating NAVs can be done with the following steps:

  • 1. Establish the initial investment, in this case $100,000, shown at the top of column C. This can be adjusted later based on the target volatility.
  • 2. Calculate the cumulative account value by adding the daily Profits or Losses (column B) to the previous account value (column C).
  • 3. Calculate the daily returns by either (a) dividing today’s Profit or Loss by yesterday’s account value to get R, or (b) taking the natural log of 1 + R to get r.
  • 4. If using method (a) then each subsequent NAV_t = NAV_{t-1} \times (1+R_t), and if using method (b) then each NAV_t = NAV_{t-1} + NAV_0 \times \ln(1+R_t) = NAV_{t-1} + NAV_0 \times r_t = NAV_0 \times (1 + \sum_{i=1}^{t} r_i).

     The final values of the NAV are in the last dated rows. The U.S. government requires that NAVs be calculated this way, although it doesn’t specify whether returns should be based on the natural log. This process is also identical to indexing, which turns any price series into one that reflects percentage returns.

