Hi Nand, Group,
please take a look at my comments below...
Nand Kishore Singh <nk_singh1@.. .> wrote:
1. In your case no distribution fits (as the table shows), which is the worst situation. Even in a better situation, where many distributions may fit the same data, goodness of fit alone is not sufficient. Distributions have their own characterizing properties, such as independence of the sample mean and variance for the normal distribution, and the memoryless property for the exponential distribution. You should have a reason to assume these characteristics (based on your knowledge of the domain).
I will check these properties alongside the goodness-of-fit test once I settle on the right distribution. Are the normal and exponential/geometric the only ones that have such properties? Or do the lognormal, Weibull, etc. have them too?
2. The role of fitting a known distribution comes in when you try to generalize the sample result to the population. First draw some conclusions on the basis of the sample. Then think about how your sample data was obtained. If it was not collected (or generated) by an established statistical mechanism, there is no benefit in fitting a distribution to the data. You have not yet disclosed the relation between sample and population for your data.
I honestly had not thought about this carefully.
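On the memoryless property specifically: among continuous distributions it holds only for the exponential (and among discrete ones only for the geometric), so the lognormal and a Weibull with shape different from 1 do not have it. A minimal numerical check, with all parameter values made up purely for illustration:

```python
import math

def exponential_survival(t, rate):
    """P(X > t) for an exponential distribution."""
    return math.exp(-rate * t)

def weibull_survival(t, shape, scale):
    """P(X > t) for a Weibull distribution."""
    return math.exp(-((t / scale) ** shape))

def conditional_survival(survival, s, t):
    """P(X > s + t | X > s) for a given survival function."""
    return survival(s + t) / survival(s)

# Hypothetical parameters, chosen only for the demonstration.
expo = lambda x: exponential_survival(x, rate=0.5)
weib = lambda x: weibull_survival(x, shape=2.0, scale=4.0)

s, t = 3.0, 2.0
# Memoryless: the two printed numbers agree for the exponential...
print(conditional_survival(expo, s, t), expo(t))
# ...but not for a Weibull with shape 2.
print(conditional_survival(weib, s, t), weib(t))
```

In terms of my data, memorylessness would mean that the chance of a position staying open another two hours does not depend on how long it has already been open, which is something to judge from domain knowledge rather than from a fit statistic.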
I have defined a population to study which is the financial transactions of a group of investors that meet certain criteria such as a winning percentage above 95%, a minimum number of transactions of 50, among other criteria.
So for each investor that meets the criteria above, I extract from the database every single historical transaction that they have opened and closed, from when they began until the day the data was extracted. So as you can see there's really no sampling involved, just blindly selecting all transactions. I don't add or subtract any, so there's no sampling bias as far as I can tell. This results in:
1. a total of 20 investors that met the criteria
2. a total of 7,346 historical transactions
So every transaction has a bunch of data, but for whichever model I choose for my tests I'm only interested in three variables:
a. the time from the opening of each transaction to the close of each transaction.
b. the profit or loss of each transaction.
c. the drawdown of each transaction. Let me explain this one. Suppose you have opened a position in a financial instrument such as a stock share, a foreign exchange currency lot, a futures option, or whatever... You buy a share for USD$50.00 at 9:00am. Then at 9:30am the stock price is down to USD$40.00. Then at 10:00am the stock price is up to USD$55.00. You decide to close the position at 10:00am. Therefore your profit was $5. The drawdown is the maximum number of units (in this case USD) that the price went below the entry price of the position. Therefore the drawdown for this transaction was $10. This is an important variable to study because it's a measure of the risk.
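The profit and drawdown definitions above can be written down in a few lines. This is only a sketch for a long (buy) position; the function name `profit_and_drawdown` and its inputs are made up for illustration and are not how the database stores the data.

```python
def profit_and_drawdown(entry_price, prices):
    """Return (profit, drawdown) for a long position.

    prices: prices observed after entry, in time order; the last one is
    the closing price. Drawdown is the largest amount the price fell
    below the entry price while the position was open (0 if it never did).
    """
    profit = prices[-1] - entry_price
    drawdown = max(0.0, entry_price - min(prices))
    return profit, drawdown

# The example above: buy at 50.00, dip to 40.00, close at 55.00.
print(profit_and_drawdown(50.0, [40.0, 55.0]))  # (5.0, 10.0)
```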
The first thing I have been trying to do is find correlations between the variables. So, for example, when I run a report for descriptive statistics or a goodness-of-fit test I am taking ALL 7,346 transactions as the sample.
So this is where your comment kicks in. And honestly I'm a little confused and a bit ashamed (this would probably embarrass my statistics teacher hahaha) because this is such basic stuff - the population and the sample - and I had not carefully thought of it.
I figured that if I take all 7,346 transactions, it's a large sample and the power would increase. And I would be able to draw conclusions about the whole population - in this case the whole population would be the 7,346 transactions being studied plus the n transactions that are placed in the future. So in my logic, if I can reach conclusions about this sample of 7,346 transactions, then they would also apply to future transactions done by investors that meet the same criteria.
But as you saw, there's a big problem that the data does not fit any distribution.
So now I will try to generate random samples and treat the 7,346 transactions as my population instead of my sample. I will try to find a good read on sampling. If you have a good reference please send it to me. Hopefully the sample will fit a distribution without getting rid of the sources of variance, which is very important because the outliers in this data are strong and meaningful. But if the whole population of 7,346 items did not fit a distribution, I don't see how an appropriate random sample of this same data will do so. It makes no sense to me.
3. Use exploratory data analysis at the initial stage (see attachment for help). Check the assumptions of the classical regression tools and try to apply them (no assumption of normality is needed to estimate the coefficients). Be satisfied with the new information and knowledge (for the given context) gained at every step. Do not try for the best. Believe in better, with clear information about why a further step for betterment cannot be taken.
I will have to read that chapter on regression again... it seems like it might be a good choice before going nonparametric.
I would love to also do ANOVA to compare the performance of the individual investors. And of course correlation.
4. You are also a user of statistics. Statisticians generally use three types of data: experimental (where data is generated according to some design), survey (where data is collected according to some plan) and census. You belong to the second group (the data collection mechanism may not be statistically sound).
I hadn't thought of myself as a user of statistics, but I guess I have crossed the line from a consumer to a user. I will think of myself as an "amateur novice basic" user. hahaha
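To make the regression point concrete for myself: the least-squares coefficients come from a closed-form calculation on the data and involve no normality assumption; normality only matters later, for tests on those coefficients. A minimal sketch for one predictor, where `fit_line` and the numbers are made up for illustration:

```python
def fit_line(x, y):
    """Least-squares intercept and slope for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Made-up data lying exactly on y = 1 + 2x.
hours = [0.0, 1.0, 2.0, 3.0]
drawdown = [1.0, 3.0, 5.0, 7.0]
print(fit_line(hours, drawdown))  # (1.0, 2.0)
```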
5. I still could not get a feeling for your data. I could not understand the following descriptions given by you (can you clarify the marked terms?):
a. The first variable "hours" is the number of hours it takes to achieve an outcome X, specifically, the number of hours to close a financial position.
b. The last variable is "DD" or the drawdown of the transaction
As explained above...
6. To understand the difference between interval and ratio scale, you should read the material sent by me.
Did so again but still have the same black and white notions about ratio having an absolute zero and interval having no limits...
Thanks for keen interest to use statistics
and thank you so much for your kind help! hope I can return the favor by getting good results on my theories about investing.
Kind regards.
> With regards
> Nand Kishore
> --- On Sun, 8/9/09, bigitop doctormuniz@ ... wrote:
>
>
> From: bigitop doctormuniz@ ...
> Subject: [Statisticians_ group] Re: help to determine the best statistical approach for this data
> To: Statisticians_ group@yahoogroup s.co.in
> Date: Sunday, August 9, 2009, 10:44 PM
>
> Thanks to Madan, Nand and svsharma for your informative replies . Here are my comments.
>
> svsharma: The data I'm using are from real financial transactions; all of it has been supplied in Excel format and there are no missing values or errors. I agree with you about better data preparation. So, for example, I'm converting all drawdown data from negative numbers to positive numbers (it would mean the exact same thing); this way I can perform some tests that require the data to be above zero. I'm also converting the zero values of this variable to 1, which wouldn't affect my inferences. I also standardize the profits variable to a range from 1 to 1000 for testing purposes.
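For the record, the preparation steps just described can be sketched as follows. The helper names `prepare_drawdown` and `rescale` and the sample values are made up; the actual work was done in Excel and MiniTab.

```python
def prepare_drawdown(dd_values):
    """Flip the sign of drawdowns (stored as values <= 0) and replace zeros with 1."""
    flipped = [abs(v) for v in dd_values]
    return [1.0 if v == 0 else v for v in flipped]

def rescale(values, new_min=1.0, new_max=1000.0):
    """Linearly map values onto [new_min, new_max] (min-max scaling)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are equal; cannot rescale")
    span = new_max - new_min
    return [new_min + (v - lo) * span / (hi - lo) for v in values]

drawdowns = [-10.0, 0.0, -2.5]      # made-up values
profits = [-20.0, 5.0, 30.0]        # made-up values
print(prepare_drawdown(drawdowns))  # [10.0, 1.0, 2.5]
print(rescale(profits))             # [1.0, 500.5, 1000.0]
```

Note that the rescaling is linear, so it changes location and scale but not the shape of the profit distribution.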
>
> I ran the "Distribution Identification" function of MiniTab, which performs a goodness-of-fit test of the data against several different distributions. As Madan suggested, the closest distribution is the 3-parameter lognormal (for the DD variable), so I will be looking at the geometric mean and coefficient of variation as measures.
>
> However, even though the 3-parameter lognormal gave the best Anderson-Darling statistic (10.1), the goodness-of-fit test p-value cannot be calculated by MiniTab for this distribution. And none of the p-values for the other tests are high enough for me to say the data fits the distribution. So basically the data doesn't fit any distribution as per the p-values of the GOF tests. Look at the results:
>
> Goodness of Fit Test - Drawdown variable
>
> Distribution                     AD        P    LRT P
> Normal                      901.247   <0.005
> Box-Cox Transformation       11.737   <0.005
> Lognormal                    16.200   <0.005
> 3-Parameter Lognormal        10.113        *    0.000
> Exponential                 452.395   <0.003
> 2-Parameter Exponential     504.829   <0.010    0.000
> Weibull                     104.360   <0.010
> 3-Parameter Weibull          74.893   <0.005    0.000
> Smallest Extreme Value     1467.161   <0.010
> Largest Extreme Value       538.865   <0.010
> Gamma                       168.564   <0.005
> 3-Parameter Gamma           131.464        *    0.000
> Logistic                    617.287   <0.005
> Loglogistic                  25.660   <0.005
> 3-Parameter Loglogistic      16.056        *    0.000
>
> So what would be the best course of action in this case? stick with the 3-parameter lognormal distribution?
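To see what the AD column measures, here is a rough sketch of the Anderson-Darling statistic for normality; applying it to the logs of the data tests the (2-parameter) lognormal. It is not MiniTab's implementation: it gives no p-values, does not handle the 3-parameter versions, and the data below are made up.

```python
import math
from statistics import NormalDist, mean, stdev

def anderson_darling_normal(values):
    """Anderson-Darling statistic A^2 against a normal distribution whose
    mean and standard deviation are estimated from the data."""
    xs = sorted(values)
    n = len(xs)
    dist = NormalDist(mean(xs), stdev(xs))
    eps = 1e-12  # clamp so extreme outliers cannot produce log(0)
    cdf = [min(max(dist.cdf(x), eps), 1.0 - eps) for x in xs]
    total = 0.0
    for i in range(n):
        # (2i + 1) with 0-based i is the usual (2i - 1) with 1-based i.
        total += (2 * i + 1) * (math.log(cdf[i]) + math.log(1.0 - cdf[n - 1 - i]))
    return -n - total / n

dd = [1, 1, 2, 2, 3, 4, 6, 9, 15, 40]  # made-up, right-skewed values
print(anderson_darling_normal(dd))                         # raw data vs normal
print(anderson_darling_normal([math.log(v) for v in dd]))  # logs vs normal
```

A smaller statistic means the fitted curve tracks the data more closely, which is why the 3-parameter lognormal row stands out in the table even without a p-value.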
>
> The other variables are even worse:
>
> Goodness of Fit Test of profit variable
>
> Distribution                     AD        P    LRT P
> Normal                     1010.837   <0.005
> Box-Cox Transformation     1010.077   <0.005
> Lognormal                  1156.915   <0.005
> 3-Parameter Lognormal      1011.975        *    0.000
> Exponential                3053.583   <0.003
> 2-Parameter Exponential    3044.862   <0.010    0.000
> Weibull                    1300.869   <0.010
> 3-Parameter Weibull        1299.359   <0.005    0.001
> Smallest Extreme Value     1415.156   <0.010
> Largest Extreme Value      1675.373   <0.010
> Gamma                      1076.513   <0.005
> 3-Parameter Gamma        208686.950        *    1.000
> Logistic                    580.124   <0.005
> Loglogistic                 574.104   <0.005
> 3-Parameter Loglogistic     574.147        *    0.000
>
> What should I do with this data to make it fit a distribution?
>
>
> Nand, thank you very much for your insightful comments. I would add a third group of people related to statistics, aside from the developers and users: the consumers of statistics. I would be in the third group; as a practicing physician I need to have a basic understanding of statistics, but just enough to understand the conclusions of the basic research literature. I'd say this is the group most professionals fall into, since I'm not really a user of statistics in the sense that I do not need to analyse and design experiments in order to do my work. My current experiment, which is the subject of this post, is just an attempt to better manage my personal finances, so I've been hit in the face by the complexity of the statistician's task, which has increased my respect for this profession substantially.
>
> I was very aware of the problem of data manipulated by scientists, and the article you sent is very illustrative of this phenomenon. I personally limit my reading to strong peer-reviewed journals such as the New England Journal of Medicine. I'll keep this article very close to me the next time I read the medical literature. I completely understand your point about the limitations that pure statisticians face. For example, econometricians have a bunch of methods that are more appropriate in their field, so a pure statistician trying to analyze this kind of data is going to have many problems. On the other hand, a statistician who is very versatile and resourceful will be able to use methods from different models to achieve their goal. I'm kind of taking statistics as sort of a hobby; hopefully I'll become more versatile with the methods of different models. At least the basic ones!
>
> Finally, in reference to your numbered list of concerns: 1) I don't know how to attach from a web post but here are the links to the images: hours, profit, drawdown. 2) As far as I know, the profit would be interval and not ratio because it doesn't have an absolute zero. In theory you could make or lose an infinite amount of money (I hope the first option happens to me :) 3) I think what I need so far is correlation analysis, therefore I'd be better off normalizing the data as Madan suggested. 4) I can't give you a robust reference but I've been using, among other sources, Wikipedia. From what I have read and understand, the Pearson correlation can be computed but it may not be safe to infer any conclusions in the case of nonnormality. See this link.
>
> Thank you all once again.
>
> Warm Regards.
>
>
>
> --- In Statisticians_ group@yahoogroup s.co.in, Nand Kishore Singh nk_singh1@ .> wrote:
> >
> > Dear […]
> > Thanks for raising problems which many of us are facing. I want to offer some general observations (which may be indirectly associated with your problem) for users of statistics and then discuss your problem specifically.
> > People associated with statistics can be divided into two groups: developers and users. The challenges for each are different. Developers of statistical methods work on assumptions, while users work with ground reality. In the initial phase of the development of statistics this gap was narrow, but now it is widening.
> > As users of statistics, our primary aim should be to enrich the domain concerned. One can do it in the following steps:
> > (1) Check which types of abstract ideas and beliefs prevail in the domain concerned.
> > (2) Think about how beliefs and abstract ideas (based on intuition) may be represented through data. Four things are important here: (a) which characteristics (like caste, land, welfare) should be used on which unit (household, community, etc.); (b) how these characteristics should be represented through data; (c) which characteristics are dependent and which are independent; (d) how the independent characteristics are related (additively, interactively, etc.). This is a very crucial step. Here it is pertinent to mention that there may be more than one way to represent an abstract idea (and its characteristics). For example, the welfare (a characteristic) of a household may be represented in many ways through data. Similarly, there may be different theories (sets of independent variables) to explain production. So the basic model comes from expertise in the domain. Statistical tools should be used to estimate (and test) the parameters of the model so that comparisons can be made. A statistician can also help in searching for a better model by including more suitable characteristics or taking a different function of the characteristics.
> >
> > (3) Collect the seemingly relevant data according to statistical methods (as far as possible).
> > (4) Use statistical tools to explore, estimate and test the parameters of the model.
> > (5) Revise the initial model so that it is better supported by the data.
> >
> > It is a ground reality that there may be limitations on the use of various statistical methods. What you have to do is show all your limitations in the report. For example, you are using secondary data and it is not random. In this case you should mention the possible sources of bias. See the attached file “How to lie with statistics.pdf”.
> >
> > The problem for the pure statistician is that socio-economic data is generally not suitable for advanced statistical tools. For the applied statistician, creativity lies in using the tools of one domain in another. For example, the life table method is generally used by demographers, but it may be used to understand dropout in education. Similarly, hazard-based models used in medicine may be used in economics. The real challenge before the pure statistician is to gain sufficient expertise in different domains quickly, explore whatever the data can say, and publish it with its limitations.
> > Generally we want to apply statistical tools in the absolute, without gaining domain expertise (or the collaboration of a domain expert). One of the reasons is that we are taught by experts (tool developers, not users) who never emphasize the role of context. We are taught in terms of random variables. That is why we think statistical tools may be applied in the absolute.
> >
> > The steps mentioned above are closest to causal modeling, which covers a large proportion of human thinking. There may be other types of modeling (like those used to explain queues and networks). The steps used for such models will be different.
> >
> > In conclusion, I want to say: search for methods and data according to the needs of the problem, instead of searching for a problem and methods that suit the data. Do not try to change your body to fit an already made (sometimes second-hand) shirt (data and method). Better to make a shirt that fits your body. I know this philosophy will not suit many applied statisticians who are under pressure to produce more research papers. Applied statistics in the socio-economic area is a long road, which runs from case studies and participatory (qualitative) surveys to the use of large-scale quantitative survey data collected by others.
> > Now I am coming to your questions.
> > 1. I could not see your graphs; my browser is not opening them (maybe for security reasons). Please send them as attachments.
> >
> > 2. I would like to revise my idea of interval and ratio scales. I could not understand why profit is not ratio scale. Is the ratio of two profits not meaningful? Please see
> > http://dogsbody.psych.mun.ca/VassarStats/webtext.html
> > or
> > http://faculty.vassar.edu/lowry/ch1pt1.html (if the above is not working)
> > 3. Without the condition of normality, you can do a lot of things. For classical regression analysis, normality is not required if you want to estimate the parameters of the model (with standard errors). Normality is required if you want to test the parameters. Even the estimated values provide a lot of information to enrich knowledge of the domain. For testing, you can suitably transform your dependent variable (and independent variables if needed), as Mr. Madan said.
> > 4. I would like to see a reference for the requirement of normality for use of Pearson correlation (I am not posing a question).
> >
> > With regards
> > Nand Kishore
> > --- On Sun, 8/9/09, Madan Kundu madan4331@ . wrote:
> >
> >
> > From: Madan Kundu madan4331@ .
> > Subject: Re: [Statisticians_ group] help to determine the best statistical approach for this data
> > To: Statisticians_ group@yahoogroup s.co.in
> > Date: Sunday, August 9, 2009, 8:33 AM
> >
> > Your data seems to follow either an exponential or a lognormal distribution. For these kinds of distributions, the best measures of central tendency and dispersion are the geometric mean and the coefficient of variation.
> >
> > Regarding testing of hypotheses, please log-transform your data to make them normal. Then apply appropriate parametric tests.
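A small sketch of the quantities mentioned here (geometric mean, coefficient of variation, log transform), using Python's standard `statistics` module on made-up, strictly positive values:

```python
import math
from statistics import geometric_mean, mean, stdev

def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean."""
    return stdev(values) / mean(values)

values = [1.0, 2.0, 4.0, 8.0, 16.0]   # made-up, strictly positive data

gm = geometric_mean(values)            # equals exp(mean of the logs)
cv = coefficient_of_variation(values)
logs = [math.log(v) for v in values]   # log-transformed data for parametric tests

print(gm, cv)
print(math.exp(mean(logs)))            # same as gm, up to rounding
```

The log transform only works for values above zero, which is why zero or negative observations need handling first.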
> >
> > Hope this helps.
> >
> > Regards
> > Madan Gopal Kundu
> >
> > --- On Sun, 9/8/09, bigitop <doctormuniz@ gmail.com> wrote:
> >
> >
> > From: bigitop <doctormuniz@ gmail.com>
> > Subject: [Statisticians_ group] help to determine the best statistical approach for this data
> > To: Statisticians_ group@yahoogroup s.co.in
> > Date: Sunday, 9 August, 2009, 9:22 AM
> >
> > Hello All. I have an "Introduction to Statistics" (biostatistics, actually) college course under my belt. All the basics are good to know for general knowledge, but when it comes to actually applying them to specific basic problems, I'm afraid it is just not enough. That's why I need your help.
> >
> > I have data for three variables with strongly skewed distributions.
> >
> > The first variable "hours" is the number of hours it takes to achieve an outcome X, specifically, the number of hours to close a financial position. This variable is of the type ratio scale as the values cannot go below zero.
> >
> > The second variable "PPS" is the actual outcome of the transaction: profit or loss. Positive values represent profit, negative values represent losses. This is interval data.
> >
> > The last variable is "DD" or the drawdown of the transaction. These are all negative numbers with a maximum value of zero, therefore it is ratio scale type data.
> >
> > Below are the graphical summaries generated by MiniTab, and then my questions. Please bear with me!
> >
> > [MiniTab graphical summaries for the three variables; images not preserved]
> >
> > Ok now the questions.
> >
> > Regarding the descriptive stats:
> >
> >
> > 1. As far as I know, in these cases the best measure of central tendency would be the median. Are there other measures that are better?
> >
> > 2. Since the distributions are not normal, the standard deviation is not a good measure in this case. What is the best way to determine the dispersion aside from quartiles? Can I use the absolute deviation from the median or the mean? How reliable is this measure? What do you suggest?
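For reference, the robust measures asked about above (median, interquartile range, median absolute deviation) are easy to compute. A minimal sketch on made-up data with one large outlier; the helper names `iqr` and `mad` are just for illustration:

```python
from statistics import median, quantiles, stdev

def iqr(values):
    """Interquartile range: third quartile minus first quartile."""
    q1, _, q3 = quantiles(values, n=4)
    return q3 - q1

def mad(values):
    """Median absolute deviation from the median."""
    m = median(values)
    return median([abs(v - m) for v in values])

hours = [1, 2, 3, 4, 100]  # made-up values with one extreme outlier
print(median(hours), iqr(hours), mad(hours))
print(stdev(hours))        # for comparison: dominated by the outlier
```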
> >
> > Next I would like to infer certain conclusions based on this data. For example: Are transactions that take longer to close more profitable than the ones that take less time to close? How does the duration of the trade relate to the drawdown? Etc.
> >
> >
> > 1. I'm confused about what I should use: parametric or non-parametric tests? On one hand I have interval and ratio data, which tells me I should use parametric tests. On the other hand I have non-normal distributions, which tells me I'd better go with non-parametric.
> >
> > 2. Since the data is not normal, I can't use Pearson correlation. Are there any other tests that would be OK to use for this data? My main objective is to determine the relationships between the different variables.
> >
> > 3. What steps can I take to determine the best statistical model to use for my data? Or should I even worry about this?? or a better question: when should I start worrying about the appropriateness of the model?
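On the correlation question in item 2: one commonly used rank-based alternative is Spearman's correlation, which is just the Pearson correlation computed on the ranks of the data and so measures monotone rather than linear association. A rough sketch with made-up numbers (the helper names are for illustration only):

```python
import math

def ranks(values):
    """Average ranks (1-based); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def pearson(x, y):
    """Pearson product-moment correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

hours = [1, 2, 3, 4, 5]          # made-up values
dd = [1, 4, 9, 16, 1000]         # increases with hours, but not linearly
print(pearson(hours, dd), spearman(hours, dd))
```

Because the made-up drawdowns rise with hours without exception, the rank correlation is at its maximum even though the outlier pulls the Pearson value below 1.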
> >
> > I can only think of these issues in black or white because I don't have a deep intuitive understanding of the intricacies of statistics, but I've tried to make it as clear as possible. This is my first post to the group. Sorry for the length of the post. I couldn't explain my concerns in a shorter one. Thank you in advance for your replies and suggestions.
> >