We provide systematic evidence on the potential for estimating household well-being from mobile phone data. Using data from four countries (Afghanistan, Côte d'Ivoire, Malawi, and Togo), we conduct parallel, standardized machine learning experiments to assess which measures of welfare can be most accurately predicted, which types of phone data are most useful, and how much training data is required. We find that long-term poverty measures such as wealth indices (Pearson's rho = 0.20-0.59) and multidimensional poverty (rho = 0.29-0.57) can be predicted more accurately than consumption (rho = 0.04-0.54); transient vulnerability measures like food security and mental health are very difficult to predict. Models using call and text message behavior are more predictive than those using metadata on mobile internet usage, mobile money transactions, and airtime top-ups. Predictive accuracy improves rapidly through the first 1,000-2,000 training observations, with continued gains beyond 4,500 observations. Model performance depends strongly on sample heterogeneity: nationally representative samples yield 20-70 percent higher accuracy than urban-only or rural-only samples.