
You can draw a random sample here. Select the sample size and the classification to use, then hit "submit". You will receive a list of languages relevant for your research. The sample is genetically and areally stratified. Drawing the sample can take several minutes, so please be patient.
On the left side, you can configure the parameters of your sample. Some of the parameters are obvious, while others are more obscure. Sample size refers to the number of languages you wish to include in your sample. The languages will come from a maximally diverse set of families. In order to determine such a set, we have to rely on a classification. The default "Glottolog 2012" is a sensible classification, but specialists might want to adopt a different one. Document types refers to the types of documents your research will draw on. If you are interested in words, dictionaries will be more useful than grammars, whereas morphosyntacticians might have exactly the opposite preferences. When determining the sample, language families without any documentation of the relevant sort are excluded from the outset.
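The exclusion of undocumented families can be pictured with a small Python sketch. It is only an illustration under assumptions: the function name, the data layout (family name mapped to languages, each with a list of document types) and the family names are hypothetical and do not describe how the site stores its data.

```python
def families_with_documents(families, wanted_types):
    """Keep only families in which at least one language has a document
    of one of the wanted types. The data layout is a hypothetical one."""
    wanted = set(wanted_types)
    return {
        name: languages
        for name, languages in families.items()
        if any(wanted & set(doc_types) for doc_types in languages.values())
    }


# Made-up example data: family -> language -> available document types.
families = {
    "Family A": {"lang_a1": ["grammar"], "lang_a2": ["dictionary", "grammar"]},
    "Family B": {"lang_b1": ["wordlist"]},
    "Family C": {"lang_c1": []},
}

kept = families_with_documents(families, ["dictionary", "grammar"])
print(sorted(kept))
```

In this made-up data set, only "Family A" remains a candidate when dictionaries and grammars are requested.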
There are a number of minor parameters to tweak when drawing the sample. All of these have sensible defaults and can be left untouched unless you have very special requirements. Percentage of isolates is a measure intended to keep your sample from being flooded with one-member families. This is set to 10% by default. Maximum number of members for isolates allows you to redefine isolates so as to include small families as well. The default '1' does not treat small families as isolates, but by adjusting the value to '4' or '5', you can change this. You can force the sampling procedure to fail if there are not enough relevant documents by checking "abort if insufficient number of documents".
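How the two isolate parameters interact can be sketched as follows. This assumes that the percentage acts as an upper limit on the share of isolates in the sample and that the limit is rounded down; neither detail is stated above, and the function and family names are hypothetical.

```python
def isolate_quota(sample_size, isolate_percentage=10):
    """Upper limit on the number of isolates in the sample.
    Rounding down is an assumption, not documented behaviour."""
    return sample_size * isolate_percentage // 100


def split_isolates(family_members, max_isolate_members=1):
    """Separate families that count as isolates from all other families."""
    isolates, others = {}, {}
    for name, members in family_members.items():
        target = isolates if len(members) <= max_isolate_members else others
        target[name] = members
    return isolates, others


# Made-up example data: family -> member languages.
family_members = {
    "Isolate X": ["x1"],
    "Small family Y": ["y1", "y2", "y3"],
    "Large family Z": ["z%d" % i for i in range(40)],
}

strict, _ = split_isolates(family_members)                           # default '1'
relaxed, _ = split_isolates(family_members, max_isolate_members=4)   # value '4'
print(sorted(strict), sorted(relaxed), isolate_quota(50))
```

With the default value, only the one-member family counts as an isolate; with '4', the three-member family is treated as an isolate as well.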
By default, the sampling procedure controls for area. You can deselect this if you prefer a sample unstratified for area. Highest node is useful if you want to draw a sample from a subtree of a language family, e.g. Indo-Iranian. In that case, you have to retrieve the id of the relevant node from the languoid description and enter it into the field. Only subnodes of that node will then be considered. Stratification level refers to a way to ensure that large language families are not underrepresented. This is done by sampling on the basis of genera rather than phyla. This is only useful if you choose the classification "Dryer 2005". See Matthew Dryer's work on sampling for the rationale of using genera. Special algorithms are needed when the sample size exceeds the number of language families. In these cases, some language families will be represented by more than one member. The default option "Random" selects the families with "extra representation" at random. From these families, additional subfamilies will be included. "Size" chooses the largest language families to be providers of additional languages. "Diversity Value" is a different technique, which takes into account the internal constitution of a family. The reader is referred to Bakker et al. (1993, 1998) for further information about the diversity value method.
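The difference between "Random" and "Size" can be illustrated with a minimal Python sketch. It is not the procedure used on this site: the function name, the round-robin handing out of extra slots and the example family sizes are assumptions, and the "Diversity Value" method is left out because it requires the internal tree structure of each family.

```python
import random


def allocate_slots(family_sizes, sample_size, method="random", rng=None):
    """Give every family one slot, then hand out the remaining slots.

    method="random": families receive extra slots in random order.
    method="size":   the largest families receive extra slots first.
    A family never gets more slots than it has members.
    """
    rng = rng or random.Random()
    if sample_size < len(family_sizes):
        raise ValueError("sample size is smaller than the number of families")
    if sample_size > sum(family_sizes.values()):
        raise ValueError("not enough languages to fill the sample")

    order = list(family_sizes)
    if method == "random":
        rng.shuffle(order)
    elif method == "size":
        order.sort(key=lambda name: family_sizes[name], reverse=True)
    else:
        raise ValueError("unknown method: %r" % method)

    slots = {name: 1 for name in family_sizes}
    extra = sample_size - len(slots)
    while extra > 0:
        for name in order:
            if extra == 0:
                break
            if slots[name] < family_sizes[name]:
                slots[name] += 1
                extra -= 1
    return slots


# Made-up family sizes; a sample of 5 needs 2 extra slots.
sizes = {"A": 10, "B": 3, "C": 1}
print(allocate_slots(sizes, 5, method="size"))
print(allocate_slots(sizes, 5, method="random", rng=random.Random(0)))
```

With "size", the two extra slots go to the two largest families; with "random", the families that receive them depend on the random order.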
The sample-drawing algorithm only includes works published since 1800, because older works are normally few and far between and often difficult to interpret. You can adjust the time frame by changing this value to a period that better suits your research.